OmNomNum

comments

OmNomNum is a library for both C and Ruby that “Gobbles up numbers in strings”, but what does that mean?

Say you have a string containing the phrase

It’s the twenty third of October.

The string contains the number 23, however it’s currently written in plain English. OmNomNum will “normalize” this string, intelligently converting all numbers into actual integers or decimals like so:

It’s the 23rd of October.

Given that I just pushed the first version of OmNomNum to rubygems.org a little while ago, I thought now would be a good time to share the motivation behind creating this library and some of the things that I learned.

Some background

I recently began programming in Ruby, a language with many awesome libraries. One of the libraries available is Chronic, “… a pure Ruby natural language date parser”. This library can convert strings containing datetimes into Ruby Time objects. For example, it can take the string this tuesday 5:00 and convert it into the Time object representing Tue Aug 29 05:00:00 PDT 2006.

This excellent library can handle many formats of dates and times, and has even been ported to C#, Java, and PHP. There are many other NLP datetime libraries like parsedatetime for Python, Natty for Java, and python-natty for Python, to name a few.

Why create a library to parse numbers in strings?

The motivating reason is that Chronic is slow and uses a lot of memory. One of the major contributors of this is an inefficient algorithm to normalize numbers within strings.

We use Chronic where I work and one of our products started showing performance issues after a recent change. After researching the problem, one of my coworkers realized that Chronic was using a large amount of memory and was making up roughly 80% of the total time within the affected code. Luckily the strings we were supplying to Chronic in this particular code path all followed the same datetime format so we were able to modify the logic to rely on strftime instead. However, we use Chronic in many places throughout our codebase, so I decided to look into Chronic’s resource problems and see what could be done.

Another reason is that I’m fairly new to Ruby and I am interested in learning what’s involved in writing extensions. In addition, I wanted to try using some libraries for building a lexer and parser that are new to me as I’ll describe below. Fixing the entire problem in one take by rewriting Chronic is quite a large task for someone new to Ruby, and all of these objectives add their own complexities and time investments. To limit the scope of the project I decided to focus on the number parsing sub component within Chronic. It makes up a large amount of the problem, so solving this issue is still a big win.

Lastly, fixing some of the performance problems within Chronic would benefit all the other Rubyists who are using it. As I write this, Chronic has 2,500 stars on GitHub and 13.5M downloads from rubygems.org, putting it in the top 300 of the 125k gems currently available. So any improvements we can make to Chronic have the potential to help quite a few Rubyists.

What makes the number parsing component of Chronic inefficient?

The algorithm to parse numbers within Chronic is simple, but at the cost of performance. It relies on building many arrays and looping over the elements in these arrays many times calling gsub to make replacements within the string you supply.

The current version of Chronic, version 0.10.2, is 4 years old at the time of this post. Since that release the logic for parsing numbers has been extracted into another library called Numerizer. Sadly no improvements were made and the logic used in this library is the same.

Can we do better?

I’ve always been interested in how programming languages work, and two of the techniques used in the first few stages of building or running code written in a language are lexing and parsing. These two techniques allow you to extract the syntactic and sometimes semantic relation of the instructions within the code you write.

A number written in a string represents some integer or decimal, and one way to determine the value is by parsing the words in the string. Regular expressions are very powerful but a LALR(1) parser is more powerful and can do the processing in a single pass, avoiding the large number of loops used in Numerizer’s algorithm.

The C library I created, OmNomNum, utilizes a lexer and a parser that both generate plain C code. The re2c library is used to do lexical analysis of the string, turning the words into tokens that can be used by a parser. The lemon parser takes in these tokens and extracts meaning from them. OmNomNum uses these two libraries to generate a list of numbers found within a string and their locations. It then creates a new string where the numbers have been normalized.

Benefits of the new approach

One benefit of my approach is that I separated the underlying C logic into an entirely separate library. This means that we can include it not only within our Ruby gem, but it also means that we could write, say, a Python wrapper around it making the logic available within Python’s ecosystem as well.

Using an actual lexer and parser allows us to more intelligently parse numbers. OmNomNum currently handles numbers in more formats than Numerizer, and extending OmNomNum to handle more formats is also fairly simple assuming you’re not intimidated by lexing and parsing.

Depending on your use case, one potentially damaging byproduct of Numerizer is that it will mangle your string by removing hyphens and extra whitespace. Lucky for us, OmNomNum does not share this problem.

OmNomNum also has support for optionally parsing the word second, while Numerizer does not. The reason for not parsing second by default is that it can be used in strings to represent time. For example, the string one second could be parsed as both 1 second and 1 2nd. Because of Numerizer’s origins within Chronic it makes sense to skip this normalization, but the option to perform this transformation should be available and is a current issue for Numerizer.

Lastly, there are great performance gains from taking this approach.

Drawbacks of the new approach

There are two major drawbacks to the new approach.

First, OmNomNum is capable of handling a lot of cases, however the most notable omission is support for parsing fractions. If you need support for fractions you’ll still need to rely on Numerizer, at least for now. Support for them in the current grammar creates some issues. If really need them you can submit an issue, but ideally you’d submit a PR! Anyone wanting to help out could possibly start there.

Second, because we’re using a Ruby extension anyone using an implementation of Ruby other than MRI will possibly need to do some extra work to get it working.

Performance impact of the new approach

So how well does the lexer/parser version work? Let’s take a look at some numbers.

The docs on GitHub report some numbers for processing the string two hundred 2,000 times, but here are the highlights:

Memory

Library Bytes at exit Total # of Allocations Total Frees Total # of Bytes Allocated Savings
Numerizer 973,092 10,793,738 10,786,497 1,104,859,439
OmNomNum 1,860 4,039 4,019 41,011 25,000+×

CPU

Library Iterations per second Speedup
Numerizer 323.271
OmNomNum 2.516M 7,750+×

Ouch! Numerizer makes almost 11M allocations of over 1GB and can only handle about 350 inputs per second! OmNomNum fairs much better, making far fewer allocations and using only 42KB. It’s also much faster!

Of course the performance improvement you see will depend on your hardware, but overall you should see a drastic improvement in both memory and cpu usage. If you replace Chronic’s numerizing code with OmNomNum, you should see a roughly 5x performance improvement. This means there’s still a lot of time and memory being used within Chronic, but the numerizing component made up the majority of the performance problems.

thumbs up

How to use it today

I need to reach out to the maintainers of Chronic, but until then you can monkey-patch Chronic to use OmNomNum with the following code snippet:

require 'chronic'

module Chronic
  class Numerizer
    class << self
      def numerize(value)
        OmNomNum.normalize(value)
      end
    end
  end
end

Toss that in a file and put it under your /config/initializers directory within any rails project using Chronic.

Some technical details

I plan on doing a follow-up post of technical details, including things like how to include a C submodule within a gem and some techniques that were used to improve performance.

What’s next?

I’m currently contemplating applying the same techniques that I used with OmNomNum to rewrite Chronic, but only if there’s considerable interest shown. Rewrites and improvements like these take quite a lot of time, so I’d need significant buy-in.