Nearly one year ago, on November 15th, 2015, I released the 1.0 version of nom, the fast parser combinator library I wrote in Rust. A lot happened around that project, and I have been really happy to interact with nom users around the world.
TL;DR: it's new nom day! The 2.0 release is here! Read the changelog. Follow the upgrade documentation if it breaks stuff.
Interesting usage
I wouldn't be able to list all the projects using nom on this page, even the subset present on crates.io, but here are a few examples of what people built with nom:
- semver briefly shipped with nom in February thanks to Steve Klabnik, until he replaced it with a regexp based solution (no hard feelings, I'd have done the same)
- tomllib, a complete TOML implementation written by Joel Self
- a JVM, because why not! Great work coming from a team of students at the University of Pennsylvania
- Tagua VM, a great PHP implementation in Rust by Ivan Enderlin
- syn, the Rust item parser written by David Tolnay that everybody uses with the macros 1.1 feature to generate code from structures or enums, actually ships with its own fork of nom! It was forked to remove the incomplete data handling and reduce compilation times
- shen-rust, a complete implementation of the Shen language in Rust that was presented at Strangeloop 2016 by Aditya Siram
- a series of parsers (DER, NTP, SNMP, IPSec, TLS) were developed for integration into the Suricata network analysis tool. This work was presented at Suricon 2016 by Pierre Chifflier
And a lot of other projects. As a side note, people apparently like to build parsers for FLAC, BitTorrent and Bitcoin stuff, LISP and Scheme tokenizers and, oddly, ASN.1 libraries :D
I have been really humbled by what people achieved with this little library, and I hope it will enable even more awesome projects!
Growth and stabilization
The goal before 1.0 was to get a usable parsing library, and after 1.0, to add features people were missing and explore new ideas. A lot of code was contributed for bitstream and string parsing, adding a lot of useful combinators like "peek!", "separated_list!" or "tuple!".
Unfortunately, a few parts of nom got increasingly painful to maintain and support, so the 2.0 was a good opportunity to clean them up, and add more features while we're at it.
The "chain!" combinator, which everybody uses to parse a sequence of things and accumulate the results in structs or tuple, is now deprecated, and will be replaced by "do_parse!", a simpler alternative. There are also a lot of specific helpers to make your code nicer, like "pair!", "preceded!", "delimited!", "separated_pair!", "separated_list!" and "delimited!". Yes, I went to great lengths to make sure you stop using chain :)
The "length_value!" and other associated combinators were refactored, to have more sensible names and behaviours. "eof", eol" and the basic token parsers like "digit" or "alphanumeric" got the same treatment. Those can be a source of issues in the upgrade to 2.0, but if the new behaviour does not work in your project, replacing them is still easy with the "is_a!" combinator and others.
Finally, I changed the name of the "error!" macro, which was conflicting with the one from the log crate. I hoped that by waiting long enough, the log people would change their macro, but it looks like I lost :p
New combinators
A few new simple combinators are here:
- the previously mentioned "do_parse!" makes nicer code than "chain!":
The "chain!" version uses this weird closure-like syntax (while not actually using a closure) with a comma ending the parser list:
named!(filetype_parser<&[u8],FileType>,
  chain!(
    m: brand_name         ~
    v: take!(4)           ~
    c: many0!(brand_name),
    || {
      FileType {
        major_brand:         m,
        major_brand_version: v,
        compatible_brands:   c
      }
    }
  )
);
The "do_parse!" version only uses ">>" as separating token, and returns a value as a tuple. If the tuple contains only value, (A) is conveniently equivalent to A.
named!(filetype_parser<&[u8],FileType>,
  do_parse!(
    m: brand_name         >>
    v: take!(4)           >>
    c: many0!(brand_name) >>
    (FileType {
      major_brand:         m,
      major_brand_version: v,
      compatible_brands:   c
    })
  )
);
"chain!" had too many features, like a "?" indicating a parser was optional (which you can now do with "opt!"), and you could declare one of the values as mutable. All of those and the awkward syntax made it hard to maintain. Still, it was one of the first useful combinators in nom, and it can now happily retire
- "permutation!" applies its child parser in any order, as long as all of them succeed once
fn permutation() {
  named!(perm<(&[u8], &[u8], &[u8])>,
    permutation!(tag!("abcd"), tag!("efg"), tag!("hi"))
  );

  let expected = (&b"abcd"[..], &b"efg"[..], &b"hi"[..]);

  let a = &b"abcdefghijk"[..];
  assert_eq!(perm(a), Done(&b"jk"[..], expected));
  let b = &b"efgabcdhijk"[..];
  assert_eq!(perm(b), Done(&b"jk"[..], expected));
  let c = &b"hiefgabcdjk"[..];
  assert_eq!(perm(c), Done(&b"jk"[..], expected));
}
This one was very interesting to write :)
- "tag_no_case!" works like "tag!", but compares independently from the case. This works great for ASCII strings, since the comparison requires no allocation, but the UTF-8 case is trickier, and I'm still looking for a correct way to handle it
"
named_attr!" creates functions like "named!"but can add attributes like documentation. This was a big pain point, now nom parsers can have documentation generated by rustdoc
"many_till!" applies repeatedly its first child parser until the second succeeds
Whitespace separated formats
This is one of the biggest new additions, and a feature that people have wanted for a long time. A lot of other Rust parser libraries are designed with programming language parsing in mind, while I started nom mainly to parse binary formats, like video containers. Those libraries usually handle whitespace parsing for you, and you only need to specify the different elements of your grammar: you essentially work on a list of already separated elements.
Previously, with nom, you had to explicitly parse the spaces, tabs and ends of lines, which made the parsers harder to maintain. What we want in the following example is to recognize a "(", an expression, then a ")", and return the expression, but we have to introduce a lot more code:
named!(parens<i64>,
  delimited!(
    delimited!(opt!(multispace), tag!("("), opt!(multispace)),
    expr,
    delimited!(opt!(multispace), tag!(")"), opt!(multispace))
  )
);
This new release introduces "ws!", a combinator that will automatically insert the separators everywhere:
named!(parens<i64>,
  ws!(delimited!(tag!("("), expr, tag!(")")))
);
By default, it removes spaces, tabs, carriage returns and line feeds, but you can easily specify your own separator parser and make your own version of "ws!".
This makes whitespace separated formats very easy to write. See for example the quickly put together, probably not spec compliant JSON parser I added as a test.
If you're working on a language parser, this should help you greatly.
Architecture changes
Error management
The error management system, which accumulates errors and input positions as it backtracks through the parser tree, is great for some projects like language parsers, but others were not using it and still paid a penalty because of vector allocation and deallocation.
In the 2.0 release, this error management system is now activated by the "verbose-errors" feature. Projects that don't use it should build correctly right away, and their parsers could get 30% to 50% faster!
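If your project does want the detailed errors, here is a sketch of the Cargo.toml entry to opt back in (the exact version requirement is up to you):

# opt back into the verbose error accumulation
[dependencies.nom]
version = "^2.0"
features = ["verbose-errors"]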
Input types
One of nom's original assumptions was that it should work on byte slices and strings instead of byte or char iterators, because the CPU likes contiguous data. As always, the reality is a bit more complex than that, but it worked well and made the code very simple: I only passed subslices from one parser to the next.
But I wrongly assumed that, because of that design, nom could only work on contiguous data. Carl Lerche made the interesting point that there are few places where nom actually needs to read a series of bytes or chars, and those could accommodate other data structures like ropes or a list of buffers.
So I got to work on an abstraction for input types that would work for &[u8] and &str, but also for other types. In the process, I was able to factor most of the &str specific combinators with the &[u8] ones. This will make them easier to maintain in the future.
The result of that work is a list of traits that any input type should implement to be usable with nom. I experimented a bit with the BlockBuf type, and this approach looks promising. I expect that people will find cool applications for this, like parsers returning references to not yet loaded data, or blocking a coroutine on a tag comparison until the data is available.
A smooth upgrade process
For the 1.0 release, I chose a few projects using nom and tried to build them, to test the new version and document the upgrade. This was so useful that I did it again for 2.0, so if you're lucky, you maintain one of the 30 crates I tested, and you received a pull request doing that upgrade for you. Otherwise, I wrote an upgrade documentation that you can follow to fix the migration issues. You're still lucky, though, because most crates will build as-is (or only require a one-line fix in Cargo.toml).
I'll write soon about that process and the benefits you can get by applying it to your projects.
The future
I have a lot of ideas for the next version, also a lot of pull requests to merge and issues to fix. Not everything could make it into the 2.0, otherwise I would never have released it.
In short, the plan:
- rewrite the producers and consumers system completely. It is not very usable right now, and could be replaced by an implementation based on futures
- improve the performance. I got a good enough library by choosing the most naive solutions, but there are lots of points I could improve (especially in helping LLVM generate faster code)
- implement a new serialization library. I believe there is some room for a serialization system that does not rely on automatic code generation, and it would go well with nom
- continue my work on writing nom demuxers for VLC media player. I have a good proof of concept, now I need to make it production ready
- add new, interesting examples: indentation based programming languages, tokio integration, integration in high performance networking systems
- I'll release very soon a large networking tool that relies heavily on nom. Expect some big news :)
That's it, now go and upgrade your code, you'll enjoy this new version!