Document Search Using Cosine Similarity In Haskell Using Linear¶
Reading OCharles’ excellent series about 24 days of hackage, in particular the post about Linear, I’ve been inspired to revisit some old code I wrote. The code is a document search engine that uses cosine similarity to rank matches. I like the following two articles if you’re not familiar with this technique:
Without further ado, here is the tiny piece of code that demonstrates this technique:
{-# LANGUAGE OverloadedStrings #-}
import Linear
import Data.List (sortBy)
import qualified Data.Map as Map
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Lazy (ByteString)
import Text.Regex.Posix
type Vector = Map.Map ByteString Float
toVector :: ByteString -> Vector
toVector b = foldl (\m s -> Map.insertWith (+) s (1.0 :: Float) m) Map.empty tokenized
where tokenized = map head $ b =~ ("[A-Za-z]+" :: ByteString)
mostSimilar :: Vector -> [Vector] -> [Vector]
mostSimilar v = sortBy (\x y -> compare (cosTheta y) (cosTheta x))
where cosTheta x = dot x v / norm v * norm x
Linear allows me to shave off all the code defining norms/inner products
for Map String Float
and condense a proof of concept of the
technique in very little code (More than half is imports!). The snippet
is obviously not as efficient as it could be (toVector is slow for
example) but we can get the n
most similar documents for “free”
thanks to laziness even if we write code the naive way.
Btw, the snippet above has a (not so subtle) bug, can you spot it? ;)
Hint: Quickcheck!