.. post:: 2013-12-26 :tags: Haskell :author: Rudi Grinberg Document Search Using Cosine Similarity In Haskell Using Linear =============================================================== Reading OCharles' excellent series about 24 days of hackage, in particular the post about `Linear `__, I've been inspired to revisit some old code I wrote. The code is a document search engine that uses cosine similarity to rank matches. I like the following two articles if you're not familiar with this technique: - `Building a Vector Space Search Engine in Perl `__ - `Decoding Captchas in Python `__ Without further ado, here is the tiny piece of code that demonstrates this technique: .. code-block:: haskell {-# LANGUAGE OverloadedStrings #-} import Linear import Data.List (sortBy) import qualified Data.Map as Map import qualified Data.ByteString.Lazy as B import Data.ByteString.Lazy (ByteString) import Text.Regex.Posix type Vector = Map.Map ByteString Float toVector :: ByteString -> Vector toVector b = foldl (\m s -> Map.insertWith (+) s (1.0 :: Float) m) Map.empty tokenized where tokenized = map head $ b =~ ("[A-Za-z]+" :: ByteString) mostSimilar :: Vector -> [Vector] -> [Vector] mostSimilar v = sortBy (\x y -> compare (cosTheta y) (cosTheta x)) where cosTheta x = dot x v / norm v * norm x Linear allows me to shave off all the code defining norms/inner products for ``Map String Float`` and condense a proof of concept of the technique in very little code (More than half is imports!). The snippet is obviously not as efficient as it could be (toVector is slow for example) but we can get the ``n`` most similar documents for "free" thanks to laziness even if we write code the naive way. Btw, the snippet above has a (not so subtle) bug, can you spot it? ;) Hint: Quickcheck!