.. post:: 2014-05-23
   :tags: OCaml
   :author: Rudi Grinberg

Introducing Humane-re
=====================

OCaml is my favorite language, but one area where it (its tools rather)
often falls short in practice is common string handling tasks where
regular expressions are often involved. The kind of stuff that Awk and
scripting languages often get praised for. In other words, not getting
in the way and letting you get the job done with minimal boilerplate.

The story for trying to accomplish the same thing in OCaml is not nearly
as short. First, one usually looks at the ``String`` module and after a
quick scroll realizes that there's no solution there. Second, one checks
out the `Str `__ library, only to realize that its interface is neither
user friendly nor thread safe. If you're a beginner, then you also start
wondering what kind of a functional language OCaml really is.

Luckily, now that we have OPAM, it's much easier to look for solutions
beyond the standard distribution. If your string handling needs are
simple enough then `core's `__ or `batteries' `__ ``String`` module may
be all you need. If you're lucky, your search ends there. Even if you're
not, you still have plenty of good options:

- `pcre-ocaml `__ Markus Mottl has written excellent bindings to the
  most popular flavor of regular expressions. In fact, I'd recommend
  these to anyone first if they're writing an application or an internal
  library.

- `re2 `__ Jane Street's bindings to Google's re2. The interface is
  quite nice but there's a gajillion dependencies. Nevertheless, this is
  probably your best option if you're looking for speed. As always,
  profile your code first.

- `ocaml-re `__ This is the most interesting regex library because it's
  written in OCaml. Its coolest feature is that it supports various
  regex syntaxes, including pcre, str, posix, glob, etc. In fact, it
  even has a drop-in replacement for the builtin ``Str``.
  Unfortunately, re's interface is rather prickly, especially for
  beginners. Fixing that problem is going to be the meat of this blog
  post.
  Nevertheless, this is what I'd recommend if you're going to publish a
  library for others to use. It doesn't force any non-OCaml dependencies
  on users.

Humane-re
---------

Realizing that it's too hard for users to do the right thing and use
ocaml-re, I've created a little wrapper called `Humane-re `__ around
ocaml-re that makes it easier to accomplish the common tasks. The goal
is to cover 90% of the use cases with minimal incidental complexity.

For now, Humane-re is still an experiment so the interface isn't stable
yet. I haven't fleshed out the interface for replacement either, and
currently I'm only supporting the ``Str`` flavor of regular expressions.
Even with these limitations it's already been useful for me.

I'll do a few brief examples of how to do some common tasks. To follow
along, install humane-re with:

::

    $ opam install humane-re

and load it in utop:

::

    $ utop
    # #require "humane_re";;
    # open Humane_re;;

Super Naive Email Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ocaml

    let is_valid_email =
      let email_re = Str.regexp ".+@.+" in
      let open Str.Infix in
      fun email -> email =~ email_re

Extract all words
~~~~~~~~~~~~~~~~~

.. code-block:: ocaml

    let extract_words = Str.(find_matches (regexp "\\b\\([A-Za-z]+\\)\\b"))

Parsing HTTP Header Like Value
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ocaml

    let parse_header =
      let re = Str.regexp ":[ \t]*" in
      fun header ->
        match Str.split ~max:2 re header with
        | [name; value] -> Some (name, value)
        | _ -> None

I'll admit, for these simple (but not contrived!) examples there's no
great improvement in readability over ``Str``. At least we're not
relying on any global variables. However, humane-re pulls ahead of
``Str`` in readability when groups are involved. I'll show how to use
groups in the next example.

Extracting Links Matching a Predicate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here the predicate is that the link points to a certain website (e.g.
imgur). Don't ever use regular expressions to parse HTML in practice.
This is only for demonstration purposes:

.. code-block:: ocaml

    let extract_imgur_links page =
      let is_imgur s =
        let open Str.Infix in
        s =~ (Str.regexp ".+\\bimgur\\.com.+") in
      (* two groups: the href attribute and the link text *)
      let re = Str.regexp "<a href=\"\\([^\"]+\\)\">\\([^<>]+\\)</a>" in
      page
      |> Str.fold_left_groups re ~init:[] ~f:(fun acc g ->
        match Str.Group.all g with
        | [href; text] when is_imgur href -> (href, text)::acc
        | _ -> acc)
      |> List.rev

The whole interface is contained in `S.mli `__. I could reproduce it
here but it will just go out of date. The ocamldoc isn't there yet but
the interface should be straightforward enough. Once again, send me
suggestions, questions, critique, etc.

What's next?
------------

At this point I'm trying to collect as much feedback as possible about
the interface because providing a nice interface is the first goal of
this library. In particular, an interface for substitution would be very
welcome.

The second goal is to support the different ways ocaml-re allows you to
construct regular expressions. I'm not very fond of Str's regex syntax,
but it does have the practical purpose of allowing me to port old
``Str`` code.

The third and lofty goal is to implement humane-re's interface with
other backends. There's probably some value in not having to commit
code to any particular regex implementation (aside from benchmarking
purposes ;D).
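
To make the backend idea a bit more concrete, here is a rough sketch of
what "not committing to an implementation" could look like. Everything
in it is hypothetical: the ``RE`` signature is a made-up two-function
stand-in (it is not humane-re's ``S.mli``), and the ``Literal`` backend
is a toy that treats the pattern as a plain substring so that the
example runs without any regex library installed.

.. code-block:: ocaml

    (* Hypothetical, minimal signature; NOT humane-re's real interface. *)
    module type RE = sig
      type t
      val regexp : string -> t
      val matches : t -> string -> bool
    end

    (* Toy backend: the "regex" is a literal substring. It only stands in
       for a real engine such as ocaml-re or pcre. *)
    module Literal : RE = struct
      type t = string
      let regexp s = s
      let matches needle hay =
        let n = String.length needle and h = String.length hay in
        (* try every start offset until the needle no longer fits *)
        let rec go i =
          i + n <= h && (String.sub hay i n = needle || go (i + 1)) in
        go 0
    end

    (* Client code is written once against the signature... *)
    module Email (R : RE) = struct
      let at = R.regexp "@"
      let looks_like_email s = R.matches at s
    end

    (* ...and the backend is picked in a single place. *)
    module E = Email (Literal)

Swapping the engine would then mean changing only the last line, which
is also what would make side-by-side benchmarking cheap.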