How to rewrite your project in Rust

In a previous post, I explained why rewriting existing software in Rust could be a good idea. The main point was that you should not rewrite the whole application, but instead replace the weaker parts without disturbing most of the code, strengthening the codebase without disruption.

I also provided pointers to projects where other people and I did it successfully, but without giving too many details. So let's get a real introduction to Rust rewrites now. This article requires a little bit of knowledge about Rust, but you should be able to follow it even as a beginner.

As a reminder, here are the benefits Rust brings to a rewrite:

  • it can easily call C code
  • it can easily be called by C code (it can export C compatible functions and structures)
  • it does not need a garbage collector
  • if you want, it does not even need to handle allocations
  • the Rust compiler can produce static and dynamic libraries, and even object files
  • the Rust compiler avoids most of the memory vulnerabilities you get in C (yes, I had to mention it)
  • Rust is easier to maintain than C (this is debatable, but not the point of this article)

As it turns out, this is more or less the plan to replace C code with Rust:

  • import C structures and functions in Rust
  • import Rust structures and functions from C
  • reuse the host application's memory allocations whenever possible
  • write code (yes, we have to do it at some point)
  • produce artefacts that can be linked with the host application
  • integrate with the build system

We'll see how to apply this with examples from the Rust VLC plugin.

Import C structures and functions in Rust

Rust can easily use C code directly, by writing function and structure definitions. A lot of the techniques you would use for this come from the "unsafe Rust" chapter of "The Rust Programming Language" book. For the following C code:

[code lang=C]
struct vlc_object_t {
    const char *object_type;
    char *header;
    int flags;
    bool force;
    libvlc_int_t *libvlc;
    vlc_object_t *parent;
};
[/code]

You would get the following Rust structure:

[code lang=C]
extern crate libc;
use libc::{c_char, c_int};

#[repr(C)]
pub struct vlc_object_t {
    pub psz_object_type: *const c_char,
    pub psz_header: *mut c_char,
    pub i_flags: c_int,
    pub b_force: bool,
    pub p_libvlc: *mut libvlc_int_t,
    pub p_parent: *mut vlc_object_t,
}
[/code]

The #[repr(C)] attribute tells the compiler that the structure should have the same memory layout as the one a C compiler would generate. We import types like c_char from the libc crate; those types are platform dependent, and their various definitions are already handled by libc. Here, we use a lot of raw pointers (indicated by *), which means that by using this structure directly, we're basically writing C, which is no good! A better approach, as we'll see later, is to write safer wrappers above those C bindings.

Importing C functions is quite straightforward too:

[code lang=C]
ssize_t vlc_stream_Peek(stream_t *, const uint8_t **, size_t);
ssize_t vlc_stream_Read(stream_t *, void *buf, size_t len);
uint64_t vlc_stream_Tell(const stream_t *);
[/code]

These C function declarations would get translated to:

[code lang=C]
#[link(name = "vlccore")]
extern "C" {
    pub fn vlc_stream_Peek(stream: *mut stream_t, buf: *mut *const uint8_t, size: size_t) -> ssize_t;
    pub fn vlc_stream_Read(stream: *mut stream_t, buf: *mut c_void, size: size_t) -> ssize_t;
    pub fn vlc_stream_Tell(stream: *const stream_t) -> uint64_t;
}
[/code]

The #[link(name = "vlccore")] attribute indicates which library we are linking to; it is equivalent to passing a -lvlccore argument to the linker. libvlccore is a library all VLC plugins must link to. Those functions are declared like regular Rust functions but, like the previous structure, they mainly work on raw pointers.
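
Note that calling any of these imported functions is only allowed inside an unsafe block. As a quick illustration, here is a minimal sketch (read_some is a hypothetical helper, and we assume the host application hands us a valid stream_t pointer):

[code lang=C]
use libc::c_void;

// Minimal sketch: read up to 4096 bytes from a stream provided by the host
// application. The caller must guarantee that `stream` is valid and non-null.
unsafe fn read_some(stream: *mut stream_t) -> Vec<u8> {
    let mut buffer = vec![0u8; 4096];
    let count = vlc_stream_Read(stream, buffer.as_mut_ptr() as *mut c_void, buffer.len());
    if count > 0 {
        buffer.truncate(count as usize);
    } else {
        buffer.clear();
    }
    buffer
}
[/code]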

bindgen

You can always write all your bindings manually like this, but when the amount of code to import gets large, it is a good idea to use the awesome bindgen tool, which generates Rust code from C headers.

It can work as a command line tool, but it can also run at compile time from a build script. First, add the dependency to your Cargo.toml file:

[code lang=toml]
[build-dependencies.bindgen]
version = "^0.25"
[/code]

You can then write your build script like this:

[code lang=C]
extern crate bindgen;

use std::fs::File;
use std::io::Write;
use std::path::Path;

fn main() {
    let include_arg = concat!("-I", env!("INCLUDE_DIR"));
    let vlc_common_path = concat!(env!("INCLUDE_DIR"), "/vlc_common.h");

    let _ = bindgen::builder()
        .clang_arg(include_arg)
        .clang_arg("-include")
        .clang_arg(vlc_common_path)
        .header(concat!(env!("INCLUDE_DIR"), "/vlc_block.h"))
        .hide_type("vlc_object_t")
        .whitelist_recursively(true)
        .whitelisted_type("block_t")
        .whitelisted_function("block_Init")
        .raw_line("use ffi::common::vlc_object_t;")
        .use_core()
        .generate().unwrap()
        .write_to_file("src/ffi/block.rs");
}
[/code]

So there's a lot to unpack here, because bindgen is very flexible:

  • we use clang_arg to pass the include folder path and pre-include a header everywhere (vlc_common.h is included pretty much everywhere in VLC)
  • the header method specifies the header from which we will import definitions
  • hide_type prevents redefinition of elements we already defined (like the ones from the common header)
  • whitelisted_type and whitelisted_function specify types and functions for which bindgen will create definitions
  • raw_line writes its argument at the top of the generated file. I use it here to reuse definitions from other files
  • write_to_file writes the whole definition to the specified path

You can apply that process to any C header you need to import. With the build script, it can run every time the library is compiled, but be careful: generating a lot of headers can take some time. It might be a good idea to pregenerate them, commit the generated files, and update them from time to time.
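
The generated file still has to be declared as a module somewhere in your crate. Assuming the layout suggested by the raw_line argument above (an ffi module with one submodule per imported header), a sketch could look like this:

[code lang=C]
// src/ffi/mod.rs (assumed layout, adapt to your own crate)
pub mod common; // hand-written or pregenerated definitions, like vlc_object_t
pub mod block;  // the file generated by the build script above
[/code]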

It is usually a good idea to separate the imported definitions in another crate with the -sys suffix, and write the safe code in the main crate.
As an example, see the crates openssl and openssl-sys.

Writing safe wrappers

Previously, we imported the C function ssize_t vlc_stream_Read(stream_t *, void *buf, size_t len) as the Rust version pub fn vlc_stream_Read(stream: *mut stream_t, buf: *mut c_void, size: size_t) -> ssize_t, but kept an unsafe interface. Since we want to use this function safely, we can now write a better wrapper:

[code lang=C]
use libc::{c_void, ssize_t};
use ffi::{self, stream_t};

pub fn stream_Read(stream: *mut stream_t, buf: &mut [u8]) -> ssize_t {
    unsafe {
        ffi::vlc_stream_Read(stream, buf.as_mut_ptr() as *mut c_void, buf.len())
    }
}
[/code]

Here we replaced the raw pointer to memory and the length with a mutable slice. We still use a raw pointer to the stream_t instance; maybe we can do better:

[code lang=C]
use libc::{c_void, ssize_t};
use ffi::{self, stream_t};

pub struct Stream(*mut stream_t);

pub fn stream_Read(stream: &Stream, buf: &mut [u8]) -> ssize_t {
    unsafe {
        ffi::vlc_stream_Read(stream.0, buf.as_mut_ptr() as *mut c_void, buf.len())
    }
}
[/code]
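
We could go one step further and expose the call as a method on the wrapper type, so callers never touch the raw pointer at all. Here is a possible sketch, with the same imports as above (the read name is mine, not VLC's):

[code lang=C]
impl Stream {
    // The raw pointer stays private to the wrapper type.
    pub fn read(&mut self, buf: &mut [u8]) -> ssize_t {
        unsafe {
            ffi::vlc_stream_Read(self.0, buf.as_mut_ptr() as *mut c_void, buf.len())
        }
    }
}
[/code]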

Be careful if you plan to implement Drop for this type: is the Rust code supposed to free that object? Is there some reference counting involved? Here is an example of a Drop implementation from the openssl crate:

[code lang=C]
pub struct SslContextBuilder(*mut ffi::SSL_CTX);

impl Drop for SslContextBuilder {
    fn drop(&mut self) {
        unsafe { ffi::SSL_CTX_free(self.as_ptr()) }
    }
}
[/code]

Remember that it's likely the host application has a lot of infrastructure to keep track of memory, and as a rule, we should reuse the tools it offers for the code at the interface between Rust and C. See the Rust FFI omnibus for more examples of safe wrappers you can write.

Side note: as of now (2017/07/10), custom allocators are still not stable.

Exporting Rust code to be called from C

Since the host application is written in C, it might need to call your code. This is quite easy in Rust: you need to write C-compatible wrappers (extern functions working on raw pointers) around your Rust API.

As an example, we will use the inverted index library for mobile apps that I wrote for a conference. In this library, we have an Index type that we want to use from Java. Here is its definition:

[code lang=C]
use std::collections::{HashMap, HashSet};

#[repr(C)]
pub struct Index {
    pub index: HashMap<String, HashSet<i32>>,
}
[/code]

This type has a few methods we want to provide:

[code lang=C]
impl Index {
    pub fn new() -> Index {
        Index {
            index: HashMap::new(),
        }
    }

    pub fn insert(&mut self, id: i32, data: &str) {
        [...]
    }

    pub fn search_word(&self, word: &str) -> Option<&HashSet<i32>> {
        self.index.get(word)
    }

    pub fn search(&self, text: &str) -> HashSet<i32> {
        [...]
    }
}
[/code]

First, we need to write the functions to allocate and deallocate our index. Every instance used from C will be wrapped in a Box, so it lives on the heap.

[code lang=C]
#[no_mangle]
pub extern "C" fn index_create() -> *mut Index {
    Box::into_raw(Box::new(Index::new()))
}
[/code]

The Box type represents and owns a heap allocation: when the box is dropped, the underlying data is dropped as well and the memory is freed. The following function rebuilds the Box from the raw pointer, taking back ownership, so the index is dropped and freed at the end.

[code lang=C]
#[no_mangle]
pub extern "C" fn index_free(ptr: *mut Index) {
    let _ = unsafe { Box::from_raw(ptr) };
}
[/code]

Now that allocation is handled, we can work on a real method. The following function takes an index, an id for a text, and the text itself, as a C string (i.e. terminated by a null byte).

Since we're kind of writing C in Rust here, we first have to check whether the pointers are null. Then we can transform the C string into a slice, and check that it is correctly encoded as UTF-8 before inserting it into our index.

[code lang=C]
use std::ffi::CStr;
use std::str;
use libc::c_char;

#[no_mangle]
pub extern "C" fn index_insert(index: *mut Index, id: i32, raw_text: *const c_char) {
    if index.is_null() || raw_text.is_null() { return }
    let slice = unsafe { CStr::from_ptr(raw_text).to_bytes() };
    if let Ok(text) = str::from_utf8(slice) {
        unsafe { (*index).insert(id, text) };
    }
}
[/code]

Most of the code in those kinds of wrappers is just there to transform between C and Rust types, and to check that the arguments coming from C code are valid. Even if we have to trust the host application, we should program defensively at the boundary.
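
To illustrate, here is one more possible wrapper, this time around search_word, following the same defensive pattern. The convention of returning the number of matching documents, or -1 on invalid input, is just an assumption for this sketch:

[code lang=C]
#[no_mangle]
pub extern "C" fn index_search_word(index: *const Index, raw_word: *const c_char) -> i32 {
    if index.is_null() || raw_word.is_null() { return -1 }
    let slice = unsafe { CStr::from_ptr(raw_word).to_bytes() };
    match str::from_utf8(slice) {
        // count the documents containing this word
        Ok(word) => unsafe { (*index).search_word(word).map(|ids| ids.len() as i32).unwrap_or(0) },
        Err(_) => -1,
    }
}
[/code]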

There are other methods we could implement for the index; we'll leave those as an exercise for the reader :)

Now, we need to write the C definitions to import those functions and types:

[code lang=C]
#include <stdint.h>

typedef struct Index Index;

Index* index_create(void);
void index_free(Index* index);
void index_insert(Index* index, int32_t id, char const* raw_text);
[/code]

We defined Index as an opaque type here. Since Rust structures can be compatible with C structures, we could have exported the real definition, but since it only contains a Rust-specific type (HashMap), it is better to hide it completely and write accessors and wrappers.

Generating bindings with rusty-cheddar

Writing imports of C functions in Rust is tedious, which is why we have bindgen. There is also a great tool to go the other way: rusty-cheddar.

In the same way, it can be used from a build script:

[code lang=C]
extern crate cheddar;

fn main() {
    cheddar::Cheddar::new().expect("could not read definitions")
        .run_build("include/main.h");

    cheddar::Cheddar::new().expect("could not read definitions")
        .module("index").expect("malformed module path")
        .insert_code("#include \"main.h\"")
        .run_build("include/index.h");
}
[/code]

Here we run rusty-cheddar a first time without specifying a module: it will default to generating a header for the definitions in src/lib.rs. The second run specifies a different module, and inserts a file inclusion at the top of the generated header.

It can be a good idea to commit the generated headers, since you will see immediately if you changed the interface in a breaking way.

Integrating with the build system

As you might know, we can make dynamic libraries and executables with rustc and cargo. But often, the host application will have its own build system, which might disagree with the way cargo builds its projects. So we have multiple strategies:

  • build Rust code separately, store libraries and headers in Maven or something (don't laugh, I've worked with such a system once, and it was actually great)
  • try to let rustc build dynamic libraries from inside the build system. We tried that for VLC and it was not great at all
  • build a static library, from inside or outside the build system, and include it in the list of libraries at link time. This was done in Rusticata
  • build an object file and let the build system link it. This is what we ended up doing with VLC

Building a static library is as easy as specifying crate-type = ["staticlib"] in your Cargo.toml file. To build an object file, use the command cargo rustc --release -- --emit obj. You can see how we added it to the autotools usage in VLC.
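
For reference, here is what the corresponding Cargo.toml section could look like (the library name is just an example):

[code lang=toml]
[lib]
name = "vlc_module"        # example name, pick your own
crate-type = ["staticlib"]
[/code]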

Unfortunately, for this part we still do not have automated ways to fix the issues. Maybe with some time, people will write scripts for autotools,
CMake and others to handle Rust and Cargo.

Side note on reproducible builds: if you want to pin the set of Rust dependencies used in your project and make them always available, you can use cargo-vendor to store them in a specific folder.

As you might have guessed, this is the most complex part, for which I have no good generic answer. I'd recommend that you spend the most time on this during the project's prototyping phase: import very little C code, export very little Rust code, try to make it build entirely from within the host application's build system. Once this is done, extending the project will get much easier. You really don't want to discover this task at the end of your project and try to retrofit your code in there.

Going further

While this article only scratches the surface of Rust rewrites, I hope it provides a good starting point on the tools and techniques you can apply. Any rewrite will be a large and complex project, but the result is worth the effort: the code you write will be stronger, and Rust's type system will force you to review the assumptions made in the C version. You might even find better ways to write it once you start refactoring your code in a more Rusty way, safely hidden behind your wrappers.