.. post:: 2013-12-26
   :tags: Haskell
   :author: Rudi Grinberg

Document Search Using Cosine Similarity In Haskell Using Linear
===============================================================

Reading OCharles' excellent series about 24 days of hackage, in
particular the post about
`Linear <http://ocharles.org.uk/blog/posts/2013-12-02-24-days-of-hackage-linear.html>`__,
I've been inspired to revisit some old code I wrote. The code is a
document search engine that uses cosine similarity to rank matches. I
like the following two articles if you're not familiar with this
technique:

-  `Building a Vector Space Search Engine in
   Perl <http://www.perl.com/pub/2003/02/19/engine.html>`__
-  `Decoding Captchas in
   Python <http://www.boyter.org/decoding-captchas/>`__

Without further ado, here is the tiny piece of code that demonstrates
this technique:

.. code-block:: haskell

    {-# LANGUAGE OverloadedStrings #-}
    import Linear
    import Data.List (sortBy)
    import qualified Data.Map as Map
    import qualified Data.ByteString.Lazy as B
    import Data.ByteString.Lazy (ByteString)
    import Text.Regex.Posix

    type Vector = Map.Map ByteString Float

    toVector :: ByteString -> Vector
    toVector b = foldl (\m s -> Map.insertWith (+) s (1.0 :: Float) m) Map.empty tokenized
      where tokenized = map head $ b =~ ("[A-Za-z]+" :: ByteString)

    mostSimilar :: Vector -> [Vector] -> [Vector]
    mostSimilar v = sortBy (\x y -> compare (cosTheta y) (cosTheta x))
      where cosTheta x = dot x v / norm v * norm x

Linear allows me to shave off all the code defining norms/inner products
for ``Map String Float`` and condense a proof of concept of the
technique in very little code (More than half is imports!). The snippet
is obviously not as efficient as it could be (toVector is slow for
example) but we can get the ``n`` most similar documents for "free"
thanks to laziness even if we write code the naive way.

Btw, the snippet above has a (not so subtle) bug, can you spot it? ;)

Hint: Quickcheck!