nom 5 is here

nom, the Rust parser combinators library, is now available at version 5. This is the most mature version of nom. This is the one that feels “done”. This is the parser library that I wanted when I started working on nom, five years ago. It’s here at last.

hamster eating broccoli: om nom nom nom!

nom 5 is a complete rewrite of the internal architecture: it uses functions instead of macros, keeps backward compatibility with existing macro-based parsers, and makes the error type completely generic.

As an example, here are some elements of a JSON parser:

fn string(i: &str) -> IResult<&str, &str> {
  delimited(
    char('\"'),
    parse_str,
    char('\"')
  )(i)
}

fn array(i: &str) -> IResult<&str, Vec<JsonValue>> {
  delimited(
    char('['),
    separated_list(
      char(','),
      json_value
    ),
    char(']')
  )(i)
}

fn json_value(i: &str) -> IResult<&str, JsonValue> {
  alt((
    map(hash, JsonValue::Object),
    map(array, JsonValue::Array),
    map(string, |s| JsonValue::Str(String::from(s))),
    map(double, JsonValue::Num),
    map(boolean, JsonValue::Boolean),
  ))(i)
}

Why?

Why move away from macros? And why use macros in the first place if it could be done with plain functions?

The short answer is that it couldn’t be done until recently. The long answer is that when I started nom in 2014, Rust was different, less powerful. At that time I had an intriguing idea: what if we had parser combinators that can return a reference to the input data instead of copying it? This could fix some of the performance problems that were frequent with parser combinators.

It turned out that yes, it was possible, but making it usable required some work. I needed to add lifetimes everywhere; closures were hard to manipulate. Combining functions directly, which was my initial vision based on Haskell’s Parsec, was not possible. Still, declarative macros were already working pretty well, and gave us a powerful meta programming tool.

So nom ended up being a macro-based DSL that would generate code through macros calling other macros, rewriting their arguments, etc.

// named is used to declare a function and its arguments
named!(hex_primary<&str, u8>,
  map_res!(
    take_while_m_n!(2, 2, is_hex_digit),
    from_hex
  )
);

named!(hex_color<&str, Color>,
  // do_parse applies parsers in sequence
  do_parse!(
           // tag recognizes a specific string
           tag!("#")   >>
    red:   hex_primary >>
    green: hex_primary >>
    blue:  hex_primary >>
    (Color { red, green, blue })
  )
);

assert_eq!(
  hex_color("#2F14DF"),
  Ok(("", Color {
    red: 47,
    green: 20,
    blue: 223,
  }))
);

For an insight into how it works, see how the opt! combinator, that wraps a parser result in an Option (using None if there’s an error), is defined:

#[macro_export(local_inner_macros)]
macro_rules! opt(
  ($i:expr, $submac:ident!( $($args:tt)* )) => (
    {
      use $crate::lib::std::result::Result::*;
      use $crate::lib::std::option::Option::*;
      use $crate::Err;

      let i_ = $i.clone();
      match $submac!(i_, $($args)*) {
        Ok((i,o))          => Ok((i, Some(o))),
        Err(Err::Error(_)) => Ok(($i, None)),
        Err(e)             => Err(e),
      }
    }
  );
  ($i:expr, $f:expr) => (
    opt!($i, call!($f));
  );
);

The first argument $i is the input passed by the calling macro, and it is given as first argument to the child parser, so if we had opt!(input, parser), we would generate the following code:

let i_ = input.clone();
match parser(i_) {
  Ok((i,o))          => Ok((i, Some(o))),
  Err(Err::Error(_)) => Ok((input, None)),
  Err(e)             => Err(e),
}

While macros could become complex, the generated code was quite simple, mainly using nested pattern matching. This made most combinators easy to write, and the generated code was fast.

At the same time, macros had issues that plagued nom for a long time. The smallest typo when calling a macro would result in inscrutable error messages. Macro variants that were not tested were never compiled, which made mistakes harder to catch. Macro argument parsing limited the syntax.

If you were willing to learn a bit about nom and accept some of its quirks, along with the array of tools that make building and debugging parsers much easier, writing parsers was a fun, interactive process of figuring out data byte after byte.

But macros were hard to learn for many people, and I had to find another way.

Using functions

Thanks to the development of the “impl Trait” feature in Rust, I had an idea: could I make a function that accepts a function as argument, and returns another function, making everything generic?

As it turns out, yes, I can. See, for example, the pair combinator, which takes two parsers as arguments and returns a new parser that produces a tuple of the results of both child parsers:

pub fn pair<I, O1, O2, E, F, G>(first: F, second: G)
  -> impl Fn(I) -> IResult<I, (O1, O2), E>
  where
    F: Fn(I) -> IResult<I, O1, E>,
    G: Fn(I) -> IResult<I, O2, E>,
    E: ParseError<I>,

{
  move |input: I| {
    let (input, o1) = first(input)?;
    second(input).map(|(i, o2)| (i, (o1, o2)))
  }
}

// you can call the resulting parser directly, like this:
let result = pair(parser1, parser2)(input);

(IResult is the parser result type of nom)

Not only is it possible, but the code is reasonably easy to write: I built an example parser library based on this in a few days.
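To make the pattern concrete without depending on nom, here is a minimal sketch of how the opt! macro shown earlier could look as a function combinator. The Err and IResult types below are simplified stand-ins written for this example, and minus is a hypothetical leaf parser; none of this is nom's actual source.

```rust
// Simplified stand-ins for nom's result types, defined here so the sketch
// compiles on its own. These are assumptions, not nom's real definitions.
#[allow(dead_code)] // `Failure` is only here to show the non-recoverable case
#[derive(Debug, PartialEq)]
pub enum Err<E> {
    Error(E),   // recoverable: another branch may be tried
    Failure(E), // unrecoverable: stop parsing
}

pub type IResult<I, O, E> = Result<(I, O), Err<E>>;

/// Function version of `opt!`: a recoverable error becomes `None` and the
/// input is left untouched; any other error is passed through.
pub fn opt<I: Clone, O, E, F>(f: F) -> impl Fn(I) -> IResult<I, Option<O>, E>
where
    F: Fn(I) -> IResult<I, O, E>,
{
    move |input: I| {
        let saved = input.clone();
        match f(input) {
            Ok((rest, o)) => Ok((rest, Some(o))),
            Err(Err::Error(_)) => Ok((saved, None)),
            Err(e) => Err(e),
        }
    }
}

/// Hypothetical leaf parser: consumes a leading '-' sign.
pub fn minus(input: &str) -> IResult<&str, char, &'static str> {
    match input.strip_prefix('-') {
        Some(rest) => Ok((rest, '-')),
        None => Err(Err::Error("expected '-'")),
    }
}
```

Calling opt(minus)("-12") yields the sign and the remaining "12", while opt(minus)("12") succeeds with None and leaves the input as it was. The match body is the same nested pattern matching the macro generated, only now it is type checked once, where it is defined.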

The previous hexadecimal color parser could be rewritten like this:

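use nom::{
  IResult,
  bytes::complete::{tag, take_while_m_n},
  combinator::map_res,
  sequence::tuple,
};

// from_hex, is_hex_digit and Color are the same helpers as in the macro version

fn hex_primary(input: &str) -> IResult<&str, u8> {
  map_res(
    take_while_m_n(2, 2, is_hex_digit),
    from_hex
  )(input)
}

fn hex_color(input: &str) -> IResult<&str, Color> {
  // tag recognizes the leading "#", then tuple applies the three parsers in sequence
  let (input, _) = tag("#")(input)?;
  let (input, (red, green, blue)) = tuple((hex_primary, hex_primary, hex_primary))(input)?;

  Ok((input, Color { red, green, blue }))
}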

And the existing macros were rewritten to use the function combinators under the hood, so existing parsers will work the same way with nom 5, except for how errors are handled, and some details around streaming parsers you’ll see below.

Error management

nom 4 relied on an error type that would change depending on the verbose-errors cargo feature. With the feature enabled, you got more context on errors at the price of performance; without it, you could only see which combinator hit an error and the input position.

Unfortunately, cargo features are additive, so if any transitive dependency was using nom with that feature, it would be activated everywhere.

Additionally, this error type supported a “custom” error variant that was generic, which resulted in painful type inference errors from time to time.

nom 5 changes this by making the error management completely generic: the way you write your code or call the parsers decides the error type. The trick is in the ParseError<Input> trait you saw earlier in the pair combinator’s code.

By default, the error type in nom is a tuple: (Input, nom::error::ErrorKind). But you can use a more precise error type, like nom::error::VerboseError, that will provide more context. Or you can define your own error type, with exactly the information you need. You can learn more about it in the error management guide.
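As an illustration of how a trait can let the caller pick the error type, here is a small self-contained sketch. The ParseError trait below is a simplified stand-in with a single method, ErrorKind has a single variant, and Message and digit are hypothetical names invented for this example; the real nom trait has more methods, and nom wraps errors in its own Err type, which is skipped here.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ErrorKind {
    Digit,
}

/// Simplified stand-in for nom's `ParseError` trait (an assumption made for
/// this sketch: only the constructor method is modeled).
pub trait ParseError<I>: Sized {
    fn from_error_kind(input: I, kind: ErrorKind) -> Self;
}

/// Cheap error: just the input position and the kind, like the default tuple.
impl<I> ParseError<I> for (I, ErrorKind) {
    fn from_error_kind(input: I, kind: ErrorKind) -> Self {
        (input, kind)
    }
}

/// Hypothetical richer error that renders a message eagerly.
#[derive(Debug, PartialEq)]
pub struct Message(pub String);

impl<'a> ParseError<&'a str> for Message {
    fn from_error_kind(input: &'a str, kind: ErrorKind) -> Self {
        Message(format!("expected {:?} at {:?}", kind, input))
    }
}

/// One parser, generic over the error type: the caller's annotation picks `E`.
pub fn digit<'a, E: ParseError<&'a str>>(input: &'a str) -> Result<(&'a str, char), E> {
    match input.chars().next() {
        // an ASCII digit is one byte long, so slicing at 1 is safe
        Some(c) if c.is_ascii_digit() => Ok((&input[1..], c)),
        _ => Err(E::from_error_kind(input, ErrorKind::Digit)),
    }
}
```

Annotating the result as Result<(&str, char), (&str, ErrorKind)> gives the cheap tuple error, while annotating it with Message gives the formatted one; the parser itself does not change. No cargo feature is involved, so two crates in the same dependency tree can make different choices.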

Streaming VS complete parsers

Another pain point of previous nom versions was in the management of streaming or complete input. In streaming mode, we assume that we do not have the entire data, and might get more by reading again from a file or socket. In complete mode, we know we have the entire data.

Some parsers are built differently depending on how the input data works. If you were writing an integer parser from text, in complete mode you could just read all the digits, even until the end of input, because you would have all of the data. With streaming input, if you reach the end of the data, you do not know if more digits can appear. So you have to wait until you encounter a character that is not a digit.

Because it was designed with network protocols and big file formats in mind, nom has made this distinction explicit from the beginning, through the Incomplete variant of parser results, which indicates that the parser cannot decide and needs more data.
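The integer example can be sketched in a few lines of plain Rust. ParseErr, digits_complete and digits_streaming are hypothetical names used only for this illustration, with a much simpler result type than nom's.

```rust
#[derive(Debug, PartialEq)]
pub enum ParseErr {
    Incomplete, // cannot decide yet: more input could change the result
    Error,      // the input definitely does not match
}

/// Splits `input` at the end of its leading run of ASCII digits.
fn split_digits(input: &str) -> (&str, &str) {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    input.split_at(end)
}

/// Complete mode: the end of input also ends the number.
pub fn digits_complete(input: &str) -> Result<(&str, &str), ParseErr> {
    match split_digits(input) {
        ("", _) => Err(ParseErr::Error),
        (digits, rest) => Ok((rest, digits)),
    }
}

/// Streaming mode: a number is only finished once a non-digit is seen.
pub fn digits_streaming(input: &str) -> Result<(&str, &str), ParseErr> {
    match split_digits(input) {
        // nothing after the digits (or no input at all): more may arrive
        (_, "") => Err(ParseErr::Incomplete),
        ("", _) => Err(ParseErr::Error),
        (digits, rest) => Ok((rest, digits)),
    }
}
```

On the input "123", the complete version returns the three digits, while the streaming version reports Incomplete because a fourth digit could still arrive. Once the input is "123;", both agree.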

Making it usable, though, was challenging, especially when building parsers for smaller formats, like configuration files or programming languages.

The first solution was the complete! combinator, which transforms an Incomplete result into an error, followed by specialized versions of some combinators. But this made parsers hard to write, and figuring out which parser was returning Incomplete was annoying.

nom 4 introduced a new idea: what if the input type decided if we were in streaming or complete mode? CompleteByteSlice and CompleteStr were the complete-input versions of &[u8] and &str, which were themselves treated as streaming input. Unfortunately, this made things worse: those types were littering the code, converting back and forth with the underlying byte slices or strings was tedious, and they inexplicably made parsers slower.

So nom 5 comes with a cleaner approach:

  • CompleteByteSlice and CompleteStr are gone
  • macro-based parsers always work in streaming mode
  • for function combinators that would work differently in streaming or complete mode, there are different versions in submodules, like nom::bytes::streaming::tag and nom::bytes::complete::tag

This way, you can choose explicitly which version you use, and can even use a bit of both depending on the context. As an example, in a binary format, you might use nom::bytes::streaming::take to get a slice of the input, then parse that slice with parsers that work on complete input.
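Here is a rough, dependency-free sketch of that mixed approach. The frame layout (one length byte followed by a decimal payload) and the names take_streaming, number_complete and frame are assumptions made up for this example; with nom, the first function would be replaced by the streaming take combinator.

```rust
#[derive(Debug, PartialEq)]
pub enum ParseErr {
    Incomplete(usize), // how many more bytes are needed
    Error,
}

/// Streaming `take`: asks for more data if fewer than `n` bytes are available.
pub fn take_streaming(input: &[u8], n: usize) -> Result<(&[u8], &[u8]), ParseErr> {
    if input.len() < n {
        Err(ParseErr::Incomplete(n - input.len()))
    } else {
        let (taken, rest) = input.split_at(n);
        Ok((rest, taken))
    }
}

/// Complete-mode parser: the whole slice must parse as a decimal u32.
pub fn number_complete(input: &[u8]) -> Result<u32, ParseErr> {
    std::str::from_utf8(input)
        .ok()
        .and_then(|s| s.parse::<u32>().ok())
        .ok_or(ParseErr::Error)
}

/// Hypothetical frame: 1 length byte, then the payload.
/// The framing is streaming, the payload is parsed as complete input.
pub fn frame(input: &[u8]) -> Result<(&[u8], u32), ParseErr> {
    let (input, len) = take_streaming(input, 1)?;
    let (input, payload) = take_streaming(input, len[0] as usize)?;
    let value = number_complete(payload)?;
    Ok((input, value))
}
```

Only the framing layer can return Incomplete. Once the payload slice has been cut out, its end is known, so the inner parser never has to wonder whether more digits are coming.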

Various other changes

nom now uses the lexical crate for float parsing. Having a good float parser was important, but it is a complex enough topic to leave it to another crate.

This release was also a good opportunity to clean things up. Over the years, nom accumulated a lot of code, ideas and bad function names. So a lot of parsers were renamed or removed entirely, and the code was reorganized. For more information, check out the changelog.

The documentation was entirely rewritten: documentation and examples are available for every type, function and macro. The guides were updated to follow the new code. And we now have cleaner examples:

  • The JSON parsing example explains how to write a parser, from simple parts to more complex parsers, and shows how the new flexible error management can generate great error messages.
  • The S expression example demonstrates how we can parse and interpret programming languages.
  • The iterator example shows new patterns that are available in nom 5. It is now easy to make an iterator out of a parser and input data, parsing and producing values as needed. There is even a new iterator combinator to help with that.
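To give an idea of the iterator pattern, here is a minimal sketch in plain Rust. ParserIter, iterate and number_comma are hypothetical names for this illustration, and the parser signature is simplified to return an Option; nom's own iterator combinator has a different, richer interface, so check its documentation for the real API.

```rust
/// Hypothetical adapter (not nom's actual iterator combinator): applies
/// `parser` repeatedly and yields each value until the parser fails.
pub struct ParserIter<'a, F> {
    input: &'a str,
    parser: F,
}

pub fn iterate<'a, O, F>(input: &'a str, parser: F) -> ParserIter<'a, F>
where
    F: Fn(&'a str) -> Option<(&'a str, O)>,
{
    ParserIter { input, parser }
}

impl<'a, O, F> Iterator for ParserIter<'a, F>
where
    F: Fn(&'a str) -> Option<(&'a str, O)>,
{
    type Item = O;

    fn next(&mut self) -> Option<O> {
        let (rest, value) = (self.parser)(self.input)?;
        self.input = rest; // remember where the next call should resume
        Some(value)
    }
}

/// Hypothetical element parser: one decimal number followed by a comma.
pub fn number_comma(input: &str) -> Option<(&str, u32)> {
    let end = input.find(',')?;
    let value = input[..end].parse::<u32>().ok()?;
    Some((&input[end + 1..], value))
}
```

Because this is a regular Iterator, values are parsed lazily: iterate("1,22,333,tail", number_comma).take(2) only runs the parser twice, and the usual adapters such as map, filter and collect work on the parsed values. This sketch assumes the element parser always consumes input when it succeeds; otherwise the iterator would never end.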

Performance

Switching to functions changes the entire code layout. Now, a lot of the combinator code can be shared, instead of being generated every time it is called.

So, in most cases it will improve performance, but in a few others we might get worse results. Future versions will smooth things out: nom 4 was the peak of what was possible with the macros system, while I’m just getting started with the new design!

To reproduce the experiments, check that:

  • you’re using jemallocator instead of the system allocator. Test with and without it and see how it affects the results
  • you’re using link time optimization (LTO) when compiling, which will reduce code size

Here is what I add to Cargo.toml for benchmarks:

[profile.bench]
lto = true
codegen-units = 1

And now, some benchmarks between nom 4.2.3 and 5.0.0:

  • HTTP parser: 20% faster (non optimized one, runs at 500MB/s)
  • INI file parser: 20% faster
  • JSON file parser: 20% slower (work in progress, allocations affect it a lot)
  • float parser: 98% faster thanks to the lexical crate
  • recognize_float parser: 62% faster (not using the lexical crate)

There is no comparison between parser libraries this time, because I have not yet rewritten all the nom examples in the parser benchmarks repository.

Thanks

The macros to functions refactoring and the documentation rewrite were huge and required an enormous effort. I would not have been able to do this release without all of nom’s contributors who helped me along the way and the developers who started using nom 5 while it was still in beta. I’m also thankful for the support from my employer, Clever Cloud, which is now a large nom user :D

I am happy that after all this time, this project is still going strong, helping people write good parsers to power production software or weekend projects. Happy hacking with nom!