parser combinators with nom 8 are here!
2025-01-26
nom is a parser combinator library written in Rust, providing you with tools to quickly build complex parsers, from binary formats to programming languages, without compromising on speed or memory consumption.
I am pleased to announce that nom is now available at version 8. This marks a significant rewrite of its core architecture for better performance, along with the introduction of the nom-language crate, which focuses on tools for text-oriented parsers, like programming languages and configuration files.
Five years ago, I released nom v5, which finally moved away from the early macro-based parsers, towards combinators returning functions. This was the design I had been looking for, and it has worked extremely well. Parsers are easy to write, and performance is good. Versions 6 and 7 were mostly refinements on top of that model. So maybe I could have stopped there.
At this point, nom is an old library by Rust standards, largely used by other projects. This is an awesome result, and I am deeply thankful for the trust developers have put into that project, and the hundreds of contributors who shaped it over the years.
But it also puts it in a strange place. That visibility brings a lot of contributors with great ideas on how to improve the library, sometimes with huge breaking changes, while the large number of projects relying on it makes it hard to introduce big changes, because that puts a strain on large projects. At the same time, this means that even small improvements can have a huge impact: parsers are often the first step when receiving a new request or reading a new file.
So it took me a while to figure out how nom could evolve within those constraints, and I landed on these steps:
- rewrite the internal architecture to use traits instead of functions, while keeping the breaking changes minimal
- extract the language parsing tools into a separate crate
Rewriting the internal architecture
A while ago, I noticed the work of Joshua Barretto on chumsky, another Rust parser combinator library (check it out, it's pretty cool!), to integrate generic associated types into parsers.
Practically, this relies on a Parser trait in which some methods are generic over a type that modifies the parser's behavior. That type implements the OutputMode trait:
pub trait OutputMode {
    type Output: Mode;
    type Error: Mode;
    type Incomplete: IsStreaming;
}
This type is used to decide if we need to generate the parser output value, the error value, and if it is used in streaming mode or complete mode.
As an example, if you write preceded(tag("hello "), is_not("!")), the preceded combinator will call both parsers, but only keep the output of the second one (is_not). In previous versions of nom, the output of the first parser was entirely generated, then dropped immediately by preceded. This was a bit wasteful, so this is what the OutputMode type changes.
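The OutputMode::Output associated type implements the Mode trait, which defines methods to produce and combine values. In practice, Mode has two implementors:
- Emit, which produces the value
- Check, which only checks that the parser matches, without producing a value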
Thanks to this new architecture, now preceded can call the first parser with Check, which will run the parser without producing any output, and then call the second parser with Emit, which will produce the output. When using Check, the code that produces the output value is not executed, it is not even compiled!
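To make this more concrete, here is a minimal, std-only sketch of the idea. It is not nom's actual code: the Mode, Emit and Check names echo the real ones, but the signatures are simplified, and the tag, is_not and preceded_hello functions are hypothetical stand-ins that take the mode as a type parameter.

```rust
// Toy model of output modes. Names echo nom 8, signatures are simplified
// assumptions made for illustration.
pub trait Mode {
    /// What a parser hands back for a value of type `T` in this mode.
    type Output<T>;

    /// Build the output from a closure. `Check` never calls the closure.
    fn bind<T, F: FnOnce() -> T>(f: F) -> Self::Output<T>;
}

/// Produce the value.
pub struct Emit;
/// Only check that the parser matches.
pub struct Check;

impl Mode for Emit {
    type Output<T> = T;
    fn bind<T, F: FnOnce() -> T>(f: F) -> Self::Output<T> {
        f()
    }
}

impl Mode for Check {
    type Output<T> = ();
    fn bind<T, F: FnOnce() -> T>(_f: F) -> Self::Output<T> {}
}

/// Hypothetical `tag`: matches `prefix` at the start of `input`. The matched
/// text is only copied into a `String` when the mode asks for it.
pub fn tag<'a, M: Mode>(prefix: &str, input: &'a str) -> Option<(&'a str, M::Output<String>)> {
    let rest = input.strip_prefix(prefix)?;
    Some((rest, M::bind(|| prefix.to_string())))
}

/// Hypothetical `is_not`: takes at least one character up to `stop`.
pub fn is_not<'a, M: Mode>(stop: char, input: &'a str) -> Option<(&'a str, M::Output<String>)> {
    let end = input.find(stop).unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    let (matched, rest) = input.split_at(end);
    Some((rest, M::bind(|| matched.to_string())))
}

/// Rough equivalent of `preceded(tag("hello "), is_not("!"))`: the first
/// parser always runs in `Check`, the second one in the caller's mode.
pub fn preceded_hello<'a, M: Mode>(input: &'a str) -> Option<(&'a str, M::Output<String>)> {
    let (rest, ()) = tag::<Check>("hello ", input)?;
    is_not::<M>('!', rest)
}
```

Calling preceded_hello::<Emit>("hello world!") yields the remaining input "!" and the string "world", while preceded_hello::<Check> on the same input yields "!" and the unit value. Since Check's bind ignores its closure, the String for "hello " is never built in either case.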
This is already a performance improvement, but where it really shines is for errors. In the way parser combinators work, they very often try to parse in some way, then on errors do something else, like with the alt combinator. So we can spend a lot of time generating errors then discarding them.
As an example, the opt combinator will execute a parser and return an Option of its output, with None if it returned an error. That error will be discarded (unless it is an unrecoverable error). opt calls the parser with the error mode set to Check, so the error is never generated, it only needs to know that the parser failed.
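The same trick applied to errors can be sketched as follows. Again this is a simplified, std-only model and not nom's implementation: digits and opt_digits are hypothetical functions, the error is a plain String, and the distinction between recoverable and unrecoverable errors is left out.

```rust
// Toy model: the same `Mode` idea, applied to the error side.
pub trait Mode {
    type Output<T>;
    fn bind<T, F: FnOnce() -> T>(f: F) -> Self::Output<T>;
}

pub struct Emit;
pub struct Check;

impl Mode for Emit {
    type Output<T> = T;
    fn bind<T, F: FnOnce() -> T>(f: F) -> Self::Output<T> {
        f()
    }
}

impl Mode for Check {
    type Output<T> = ();
    fn bind<T, F: FnOnce() -> T>(_f: F) -> Self::Output<T> {}
}

/// Hypothetical parser for a leading unsigned number. The error message is
/// only formatted (and allocated) when the error mode `E` is `Emit`.
pub fn digits<'a, E: Mode>(input: &'a str) -> Result<(&'a str, u32), E::Output<String>> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    match input[..end].parse::<u32>() {
        Ok(n) => Ok((&input[end..], n)),
        Err(_) => Err(E::bind(|| format!("expected a number, found {:?}", input))),
    }
}

/// Rough equivalent of `opt(digits)`: it only needs to know whether the
/// inner parser failed, so it runs it with the error mode set to `Check`.
pub fn opt_digits(input: &str) -> (&str, Option<u32>) {
    match digits::<Check>(input) {
        Ok((rest, n)) => (rest, Some(n)),
        Err(()) => (input, None),
    }
}
```

With this sketch, opt_digits("abc") returns the untouched input and None without ever running format!, whereas digits::<Emit>("abc") pays for the full error message.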
The interesting part here is that as long as parsers are implementing the Parser trait with its process method that follows the directions from the OutputMode type, then the mode can be transmitted from one combinator to the next, removing large parts of the code generation. But if a parser or combinator is written in the old way, with closures, then it will still be compatible with the new architecture (though it will ignore the OutputMode directives).
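The difference between the two styles can be illustrated with a small std-only model. Everything here is an assumption made for the example: the Parser trait is reduced to a single process method with one mode parameter, parsers return the number of bytes consumed, and Digits and closure_digits are hypothetical parsers that count how often their output String is actually built.

```rust
use std::cell::Cell;

// Toy model only: the names echo nom 8, the signatures are simplified.
pub trait Mode {
    type Output<T>;
    fn bind<T, F: FnOnce() -> T>(f: F) -> Self::Output<T>;
}
pub struct Emit;
pub struct Check;
impl Mode for Emit {
    type Output<T> = T;
    fn bind<T, F: FnOnce() -> T>(f: F) -> Self::Output<T> {
        f()
    }
}
impl Mode for Check {
    type Output<T> = ();
    fn bind<T, F: FnOnce() -> T>(_f: F) -> Self::Output<T> {}
}

/// A parser returns how many bytes it consumed plus a mode-dependent output.
pub trait Parser {
    type Output;
    fn process<M: Mode>(&mut self, input: &str) -> Option<(usize, M::Output<Self::Output>)>;
}

/// Trait-based parser for leading ASCII digits. `built` counts how many
/// times the output `String` was actually constructed.
pub struct Digits<'c> {
    pub built: &'c Cell<u32>,
}

impl Parser for Digits<'_> {
    type Output = String;
    fn process<M: Mode>(&mut self, input: &str) -> Option<(usize, M::Output<String>)> {
        let n = input.bytes().take_while(u8::is_ascii_digit).count();
        if n == 0 {
            return None;
        }
        let built = self.built;
        // The closure only runs if the mode decides to emit a value.
        Some((n, M::bind(|| {
            built.set(built.get() + 1);
            input[..n].to_string()
        })))
    }
}

/// Old-style parsers are plain closures. They stay usable, but they have
/// already built their output by the time the mode gets a say.
impl<O, F> Parser for F
where
    F: FnMut(&str) -> Option<(usize, O)>,
{
    type Output = O;
    fn process<M: Mode>(&mut self, input: &str) -> Option<(usize, M::Output<O>)> {
        let (n, value) = self(input)?;
        Some((n, M::bind(|| value)))
    }
}

/// The same digit parser written as a closure.
pub fn closure_digits<'c>(
    built: &'c Cell<u32>,
) -> impl FnMut(&str) -> Option<(usize, String)> + 'c {
    move |input: &str| {
        let n = input.bytes().take_while(u8::is_ascii_digit).count();
        if n == 0 {
            return None;
        }
        built.set(built.get() + 1);
        Some((n, input[..n].to_string()))
    }
}
```

Running Digits through process::<Check> leaves the counter at zero, while the closure version increments it even in Check mode: it works, but the String is built and then thrown away.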
The last associated type of OutputMode is Incomplete, which is used to determine if the parser is used in streaming mode or complete mode. From its inception, nom was designed to work in streaming, with data coming from the network that can be incomplete. But there was also the need to work with complete data, like local files. It was done previously by writing 2 versions of a lot of combinators, one for streaming and one for complete mode. Then users had to make sure they employed the same version throughout their parsers, to avoid funny bugs.
OutputMode::Incomplete was introduced to simplify that. Now we can decide at the top call site of a parser whether it is used in streaming or complete mode, and the compiler will enforce that down to the lowest character parser in the format. This is pretty powerful, because now we only need one version of each combinator.
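A rough, std-only sketch of how one combinator can serve both modes is shown below. The IsStreaming, Streaming and Complete names come from nom, but the out_of_input method, the Failure enum and the tag and greeting functions are hypothetical simplifications for this example.

```rust
/// Simplified failure type for this sketch.
#[derive(Debug, PartialEq)]
pub enum Failure {
    /// More data might make the parser succeed (streaming mode only).
    Incomplete,
    /// The input does not match.
    Error,
}

/// Decides what happens when the input ends in the middle of a token.
/// The method name is an assumption for this sketch.
pub trait IsStreaming {
    fn out_of_input() -> Failure;
}

pub struct Streaming;
pub struct Complete;

impl IsStreaming for Streaming {
    fn out_of_input() -> Failure {
        Failure::Incomplete
    }
}

impl IsStreaming for Complete {
    fn out_of_input() -> Failure {
        Failure::Error
    }
}

/// A single `tag` serves both modes: only the out-of-input case differs.
pub fn tag<'a, S: IsStreaming>(prefix: &str, input: &'a str) -> Result<&'a str, Failure> {
    if let Some(rest) = input.strip_prefix(prefix) {
        Ok(rest)
    } else if prefix.starts_with(input) {
        // The input is a strict prefix of the tag: we ran out of data.
        Err(S::out_of_input())
    } else {
        Err(Failure::Error)
    }
}

/// The mode is chosen once by the caller and passed down to every parser.
pub fn greeting<'a, S: IsStreaming>(input: &'a str) -> Result<&'a str, Failure> {
    let rest = tag::<S>("hello", input)?;
    tag::<S>(" world", rest)
}
```

On the truncated input "hello wo", greeting::<Streaming> reports Incomplete while greeting::<Complete> reports Error. The choice is made once at the top call site and the type parameter carries it down to every tag.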
While that new architecture introduces some complexity compared to the closure-based parsers, I believe it will pay off immensely for production parsers, while keeping breaking changes to a minimum. In practice, updating an existing parser from v7 to v8 is painless: it just requires using the Parser trait in some places, where combinator(arg)(input) is now written combinator(arg).parse(input).
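The reason .parse(input) works everywhere can be sketched with a reduced, std-only Parser trait. This is a toy model under stated assumptions, not the nom API: PResult, Tag, tag, word_len and demo are hypothetical names, errors are plain Strings, and the real trait has more methods.

```rust
/// Simplified result type for this sketch: remaining input plus output.
pub type PResult<'a, O> = Result<(&'a str, O), String>;

/// Reduced stand-in for a `Parser` trait with a `parse` method.
pub trait Parser<'a> {
    type Output;
    fn parse(&mut self, input: &'a str) -> PResult<'a, Self::Output>;
}

/// Any function or closure with the right shape is a parser too, so
/// existing function-style parsers keep working through `.parse(...)`.
impl<'a, O, F> Parser<'a> for F
where
    F: FnMut(&'a str) -> PResult<'a, O>,
{
    type Output = O;
    fn parse(&mut self, input: &'a str) -> PResult<'a, O> {
        self(input)
    }
}

/// A combinator that returns a struct instead of a closure. Such a value
/// cannot be called as `tag(arg)(input)`, hence `tag(arg).parse(input)`.
pub struct Tag {
    prefix: &'static str,
}

pub fn tag(prefix: &'static str) -> Tag {
    Tag { prefix }
}

impl<'a> Parser<'a> for Tag {
    type Output = &'a str;
    fn parse(&mut self, input: &'a str) -> PResult<'a, &'a str> {
        match input.strip_prefix(self.prefix) {
            Some(rest) => Ok((rest, &input[..self.prefix.len()])),
            None => Err(format!("expected {:?}", self.prefix)),
        }
    }
}

/// A function-style parser: returns the length of the leading word.
pub fn word_len(input: &str) -> PResult<'_, usize> {
    let end = input
        .find(|c: char| !c.is_alphabetic())
        .unwrap_or(input.len());
    if end == 0 {
        return Err("expected a word".to_string());
    }
    Ok((&input[end..], end))
}

/// Both kinds of parsers are driven the same way.
pub fn demo(input: &str) -> PResult<'_, (&str, usize)> {
    let (rest, hello) = tag("hello ").parse(input)?;
    let (rest, len) = word_len.parse(rest)?;
    Ok((rest, (hello, len)))
}
```

Here demo("hello world!") returns the remaining "!" together with the pair ("hello ", 5). The struct-based tag and the plain function word_len are both called through .parse, which is the only call style that works for both.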
nom-language, a new crate for language parsers
There is a large demand for tools to work with text (often around the end of year, for the Advent of Code :D), and it has historically been hard to navigate for nom, because it was focused at first on binary formats, and primarily on production parsers without user interaction. The goal was to parse quickly and reject invalid data quickly. Returning actionable feedback for the user of a running parser was less of a concern. For programming languages and configuration formats, the needs are different: invalid data is much more common, and the user needs good feedback on what is wrong. Language focused users also tend to ask for a lot of very specific combinators to be included in nom.
That has been hard to balance with keeping the library manageable at its current scale. I'd rather go towards a smaller core library evolving slowly, with a few powerful but generic combinators, leaving users to write and share their own tools above it, like the excellent nom-supreme crate. But I understand the need for more specific tools, evolving on a different timeline.
Hence the creation of the nom-language crate, a new crate focused on language parsers. For now, it only contains the VerboseError type, which was removed from nom in v8, and combinators for precedence parsing. I expect it will be able to grow a lot in the future, and support a wide range of projects.
Thanks
This release has been a long journey, and I am very grateful for the help of nom's many contributors, both in code, and the people who help each other daily in chat. To be honest, the past few years have been a bit rough for me on the open source side. The many demands of day jobs along with the time I want to spend with my family have left me with less time, to the point that dealing with the growing expectations of the project has been difficult. And the whole deal with the winnow fork did not help at all.
But the project pushed through in the end, and is in a much better place now, with lots of people coming to help regularly. I went into Rust development because it was fitting my technical needs, but the community I met is why I stayed and contributed to it so much. There's a bunch of good people there.
So now, the only thing left for me is to leave you to discover the new version of nom and nom-language, take them for a spin, and let me know what you think. Happy hacking with nom!