nom 4.0: faster, safer, simpler parsers

I’m delighted to announce that nom, the extremely fast Rust parser combinators library, has reached major version 4.

TL;DR: the new nom version is simpler and faster, has better documentation, and you can find a summary of what changed in the upgrade documentation.

Side note: how fast is nom? It can reach 2GB/s when parsing HTTP requests.


Since nom is now a well-established, serious project, we got a brand new logo, courtesy of Ange Albertini. The nom monster will happily eat your data byte by byte :)

It took nearly 6 months of development, and the library went through nearly 5 entire rewrites. Compare that to previous major releases, which took a month at most. But it was worth it! This new release cleans up a lot of old bugs and unintuitive behaviours, simplifies some common patterns, is faster, uses less memory, and gives better errors, while the way parsers are written stays the same. It’s like an entirely new engine under the same bodywork!

Moving from IResult to Result

This was a long-standing request. nom used a three-legged enum as the return type for parsers:

// example parser signature
fn parser(input: I) -> IResult<I,O> { ... }

pub enum IResult<I,O,E=u32> {
  /// remaining input, result value
  Done(I,O),
  /// indicates the parser encountered an error. E is a custom error type you can redefine
  Error(Err),
  /// Incomplete contains a Needed, an enum that can represent a known quantity of input data, or unknown
  Incomplete(Needed)
}

pub enum Needed {
  /// needs more data, but we do not know how much
  Unknown,
  /// contains the required total data size
  Size(usize)
}

// if the "verbose-errors" feature is not active
pub type Err<E=u32> = ErrorKind;

// if the "verbose-errors" feature is active
pub enum Err<P,E=u32> {
  /// An error code, represented by an ErrorKind, which can contain a custom error code represented by E
  Code(ErrorKind),
  /// An error code, and the next error
  Node(ErrorKind, Vec<Err<P,E>>),
  /// An error code, and the input position
  Position(ErrorKind, P),
  /// An error code, the input position and the next error
  NodePosition(ErrorKind, P, Vec<Err<P,E>>)
}

That old IResult structure did not convert well to the commonly used Result, and people did not want to see the Incomplete case (when the parser indicates it does not have enough data to decide) if they did not need it. The different error types depending on the verbose-errors feature were also confusing, and they caused compilation errors when nom appeared multiple times in a dependency tree.
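To make the friction concrete, here is a minimal sketch of the glue a caller needed when the surrounding code expected a plain Result. It does not use nom itself: OldIResult is a simplified local mirror of the three-legged enum shown above, and ParseError and into_result are hypothetical names invented for this illustration, not part of any nom release.

```rust
// Local mirror of the old three-variant return type, so the sketch
// compiles without nom. The error payload is simplified to a bare `E`.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
pub enum Needed {
    Unknown,
    Size(usize),
}

#[derive(Debug, PartialEq)]
pub enum OldIResult<I, O, E> {
    Done(I, O),
    Error(E),
    Incomplete(Needed),
}

/// Hypothetical caller-side error type: the caller has to decide how
/// "not enough data" fits into its own error handling.
#[derive(Debug, PartialEq)]
pub enum ParseError<E> {
    Failed(E),
    NotEnoughData(Needed),
}

/// Hypothetical conversion helper. Because the old type was not a Result,
/// every variant must be mapped by hand before `?` can be used.
pub fn into_result<I, O, E>(r: OldIResult<I, O, E>) -> Result<(I, O), ParseError<E>> {
    match r {
        OldIResult::Done(rest, value) => Ok((rest, value)),
        OldIResult::Error(e) => Err(ParseError::Failed(e)),
        OldIResult::Incomplete(n) => Err(ParseError::NotEnoughData(n)),
    }
}
```

Every call site that wanted to use `?` or the usual Result combinators had to go through a step like this first.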

So I replaced it with a new, Result-based design:

pub type IResult<I, O, E = u32> = Result<(I, O), Err<I, E>>;

pub enum Err<I, E = u32> {
  /// There was not enough data
  Incomplete(Needed),
  /// The parser had an error (recoverable)
  Error(Context<I, E>),
  /// The parser had an unrecoverable error
  Failure(Context<I, E>),
}

pub enum Needed {
  /// needs more data, but we do not know how much
  Unknown,
  /// contains the required additional data size
  Size(usize)
}

// if the "verbose-errors" feature is inactive
pub enum Context<I, E = u32> {
  Code(I, ErrorKind),
}

// if the "verbose-errors" feature is active
pub enum Context<I, E = u32> {
  Code(I, ErrorKind),
  List(Vec<(I, ErrorKind)>),
}
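As a rough illustration of what the Result-based shape allows, the sketch below chains two hand-written parsers with the `?` operator. It is written against local stand-ins rather than the nom crate: the types mirror the definitions above with the custom error parameter E dropped, ErrorKind is reduced to two variants, and the parsers get, space and method_then_space, along with the helper needs_more_data, are hypothetical names made up for this example.

```rust
// Simplified local copies of the new types, so this compiles without nom.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
pub enum ErrorKind {
    Tag,
    Space,
}

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
pub enum Needed {
    Unknown,
    Size(usize),
}

#[derive(Debug, PartialEq)]
pub enum Context<I> {
    Code(I, ErrorKind),
}

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
pub enum Err<I> {
    Incomplete(Needed),
    Error(Context<I>),
    Failure(Context<I>),
}

pub type IResult<I, O> = Result<(I, O), Err<I>>;

/// Hypothetical parser: recognizes the literal prefix "GET".
fn get(input: &[u8]) -> IResult<&[u8], &[u8]> {
    const TAG: &[u8] = b"GET";
    let n = input.len().min(TAG.len());
    if input[..n] != TAG[..n] {
        // The error carries the input position along with the error code.
        return Err(Err::Error(Context::Code(input, ErrorKind::Tag)));
    }
    if n < TAG.len() {
        // The prefix matches so far; report how many more bytes are needed.
        return Err(Err::Incomplete(Needed::Size(TAG.len() - n)));
    }
    Ok((&input[n..], &input[..n]))
}

/// Hypothetical parser: recognizes a single space byte.
fn space(input: &[u8]) -> IResult<&[u8], u8> {
    match input.split_first() {
        None => Err(Err::Incomplete(Needed::Size(1))),
        Some((&b' ', rest)) => Ok((rest, b' ')),
        Some(_) => Err(Err::Error(Context::Code(input, ErrorKind::Space))),
    }
}

/// Since IResult is a plain Result, `?` propagates Error, Failure and
/// Incomplete alike without any conversion step.
fn method_then_space(input: &[u8]) -> IResult<&[u8], &[u8]> {
    let (rest, method) = get(input)?;
    let (rest, _) = space(rest)?;
    Ok((rest, method))
}

/// Callers that care about streaming can still single out Incomplete.
fn needs_more_data<I, O>(res: &IResult<I, O>) -> bool {
    matches!(res, Err(Err::Incomplete(_)))
}
```

With these stand-ins, method_then_space(b"GET /") yields the remaining input "/" and the matched "GET", while a truncated input such as b"GE" surfaces as Err::Incomplete on the error side of the Result.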

Aside from being more compatible with, like, the whole Rust ecosystem, this new design has lots of interesting points:

  • the Context enum is now extended by the verbose-errors feature, so it is the same type
  • errors always store position information
  • Incomplete has moved to the error case so you can easi