OmNomNum is a library for both C and Ruby that “Gobbles up numbers in strings”, but what does that mean?
Say you have a string containing the phrase
It’s the twenty third of October.
The string contains the number 23, however it’s currently written in plain English. OmNomNum will “normalize” this string, intelligently converting all numbers into actual integers or decimals like so:
It’s the 23rd of October.
Given that I just pushed the first version of OmNomNum to rubygems.org a little while ago, I thought now would be a good time to share the motivation behind creating this library and some of the things that I learned.
I recently began programming in Ruby, a language with many awesome libraries.
One of the libraries available is Chronic,
“… a pure Ruby natural language date parser”. This library can convert strings
containing datetimes into Ruby
Time objects. For example, it can take the
this tuesday 5:00 and convert it into the
Time object representing
Tue Aug 29 05:00:00 PDT 2006.
This excellent library can handle many formats of dates and times, and has even been ported to C#, Java, and PHP. There are many other NLP datetime libraries like parsedatetime for Python, Natty for Java, and python-natty for Python, to name a few.
Why create a library to parse numbers in strings?
The motivating reason is that Chronic is slow and uses a lot of memory. One of the major contributors of this is an inefficient algorithm to normalize numbers within strings.
We use Chronic where I work and one of our products started showing performance issues
after a recent change. After researching the problem, one of my coworkers realized
that Chronic was using a large amount of memory and was making up roughly 80%
of the total time within the affected code. Luckily the strings we were supplying
to Chronic in this particular code path all followed the same datetime format so
we were able to modify the logic to rely on
However, we use Chronic in many places throughout our codebase, so I decided to
look into Chronic’s resource problems and see what could be done.
Another reason is that I’m fairly new to Ruby and I am interested in learning what’s involved in writing extensions. In addition, I wanted to try using some libraries for building a lexer and parser that are new to me as I’ll describe below. Fixing the entire problem in one take by rewriting Chronic is quite a large task for someone new to Ruby, and all of these objectives add their own complexities and time investments. To limit the scope of the project I decided to focus on the number parsing sub component within Chronic. It makes up a large amount of the problem, so solving this issue is still a big win.
Lastly, fixing some of the performance problems within Chronic would benefit all the other Rubyists who are using it. As I write this, Chronic has 2,500 stars on GitHub and 13.5M downloads from rubygems.org, putting it in the top 300 of the 125k gems currently available. So any improvements we can make to Chronic have the potential to help quite a few Rubyists.
What makes the number parsing component of Chronic inefficient?
The algorithm to parse numbers
within Chronic is simple, but at the cost of performance. It relies on building
many arrays and looping over the elements in these arrays many times calling
gsub to make
replacements within the string you supply.
The current version of Chronic, version 0.10.2, is 4 years old at the time of this post. Since that release the logic for parsing numbers has been extracted into another library called Numerizer. Sadly no improvements were made and the logic used in this library is the same.
Can we do better?
I’ve always been interested in how programming languages work, and two of the techniques used in the first few stages of building or running code written in a language are lexing and parsing. These two techniques allow you to extract the syntactic and sometimes semantic relation of the instructions within the code you write.
A number written in a string represents some integer or decimal, and one way to determine the value is by parsing the words in the string. Regular expressions are very powerful but a LALR(1) parser is more powerful and can do the processing in a single pass, avoiding the large number of loops used in Numerizer’s algorithm.
The C library I created, OmNomNum, utilizes a lexer and a parser that both generate plain C code. The re2c library is used to do lexical analysis of the string, turning the words into tokens that can be used by a parser. The lemon parser takes in these tokens and extracts meaning from them. OmNomNum uses these two libraries to generate a list of numbers found within a string and their locations. It then creates a new string where the numbers have been normalized.
Benefits of the new approach
One benefit of my approach is that I separated the underlying C logic into an entirely separate library. This means that we can include it not only within our Ruby gem, but it also means that we could write, say, a Python wrapper around it making the logic available within Python’s ecosystem as well.
Using an actual lexer and parser allows us to more intelligently parse numbers. OmNomNum currently handles numbers in more formats than Numerizer, and extending OmNomNum to handle more formats is also fairly simple assuming you’re not intimidated by lexing and parsing.
Depending on your use case, one potentially damaging byproduct of Numerizer is that it will mangle your string by removing hyphens and extra whitespace. Lucky for us, OmNomNum does not share this problem.
OmNomNum also has support for optionally parsing the word
second, while Numerizer
does not. The reason for not parsing
second by default is that it can be used
in strings to represent time. For example, the string
one second could be
parsed as both
1 second and
1 2nd. Because of Numerizer’s origins within
Chronic it makes sense to skip this normalization, but the option to perform
this transformation should be available and is a current issue
Lastly, there are great performance gains from taking this approach.
Drawbacks of the new approach
There are two major drawbacks to the new approach.
First, OmNomNum is capable of handling a lot of cases, however the most notable omission is support for parsing fractions. If you need support for fractions you’ll still need to rely on Numerizer, at least for now. Support for them in the current grammar creates some issues. If really need them you can submit an issue, but ideally you’d submit a PR! Anyone wanting to help out could possibly start there.
Second, because we’re using a Ruby extension anyone using an implementation of Ruby other than MRI will possibly need to do some extra work to get it working.
Performance impact of the new approach
So how well does the lexer/parser version work? Let’s take a look at some numbers.
The docs on GitHub report
some numbers for processing the string
two hundred 2,000 times, but here are the
|Library||Bytes at exit||Total # of Allocations||Total Frees||Total # of Bytes Allocated||Savings|
|Library||Iterations per second||Speedup|
Ouch! Numerizer makes almost 11M allocations of over 1GB and can only handle about 350 inputs per second! OmNomNum fairs much better, making far fewer allocations and using only 42KB. It’s also much faster!
Of course the performance improvement you see will depend on your hardware, but overall you should see a drastic improvement in both memory and cpu usage. If you replace Chronic’s numerizing code with OmNomNum, you should see a roughly 5x performance improvement. This means there’s still a lot of time and memory being used within Chronic, but the numerizing component made up the majority of the performance problems.
How to use it today
I need to reach out to the maintainers of Chronic, but until then you can monkey-patch Chronic to use OmNomNum with the following code snippet:
require 'chronic' module Chronic class Numerizer class << self def numerize(value) OmNomNum.normalize(value) end end end end
Toss that in a file and put it under your
/config/initializers directory within
any rails project using Chronic.
Some technical details
I plan on doing a follow-up post of technical details, including things like how to include a C submodule within a gem and some techniques that were used to improve performance.
I’m currently contemplating applying the same techniques that I used with OmNomNum to rewrite Chronic, but only if there’s considerable interest shown. Rewrites and improvements like these take quite a lot of time, so I’d need significant buy-in.