Document Search Using Cosine Similarity In Haskell Using Linear

Reading OCharles’ excellent series about 24 days of hackage, in particular the post about Linear, I’ve been inspired to revisit some old code I wrote. The code is a document search engine that uses cosine similarity to rank matches. I like the following two articles if you’re not familiar with this technique:

Without further ado, here is the tiny piece of code that demonstrates this technique:

{-# LANGUAGE OverloadedStrings #-}
import Linear
import Data.List (sortBy)
import qualified Data.Map as Map
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Lazy (ByteString)
import Text.Regex.Posix

type Vector = Map.Map ByteString Float

toVector :: ByteString -> Vector
toVector b = foldl (\m s -> Map.insertWith (+) s (1.0 :: Float) m) Map.empty tokenized
  where tokenized = map head $ b =~ ("[A-Za-z]+" :: ByteString)

mostSimilar :: Vector -> [Vector] -> [Vector]
mostSimilar v = sortBy (\x y -> compare (cosTheta y) (cosTheta x))
  where cosTheta x = dot x v / norm v * norm x

Linear allows me to shave off all the code defining norms/inner products for Map String Float and condense a proof of concept of the technique in very little code (More than half is imports!). The snippet is obviously not as efficient as it could be (toVector is slow for example) but we can get the n most similar documents for “free” thanks to laziness even if we write code the naive way.

Btw, the snippet above has a (not so subtle) bug, can you spot it? ;)

Hint: Quickcheck!


comments powered by Disqus