VGEL•ME

mmap(1Tb): A Rust arena allocator (ab)using Linux overcommit

In this post, we'll build an arena allocator in Rust that allocates into a giant, overcommited mmap'd memory block. We'll then do some light benchmarking against a popular Rust arena allocator, typed-arena, that uses a more traditional Vec-of-chunks approach, and see what performance differences exist, if any.

A brief disclaimer: I'll be using typed-arena as a point of comparison in this post, since it's a popular and fast Rust arena allocator, with a similar API as the allocator we'll be writing (one T per arena, with alloc(&'a self, value: T) -> &'a mut T). Nothing in this post is meant as a criticism of typed-arena — in fact, I picked it as a baseline because it's fast and high-quality, and it stacks up well here.

permalink for A_Brief_Introduction_to_Arena_Allocators_and_Rust A Brief Introduction to Arena Allocators and Rust

Because of Rust's borrow checker, the "sea of objects" technique for implementing cyclic data structures like graphs is not so simple. We can't just write this moral-equivalent of a Java GraphNode class:

struct GraphNode {
    value: String,
    neighbors: Vec<&GraphNode>,
}

Why not? Because the borrow checker requires lifetimes on the &GraphNode references in neighbors to ensure they live at least as long as the enclosing GraphNode, to prevent accessors of a node from dereferencing dangling references. This requires something like this declaration instead:

struct GraphNode<'a> {
    value: String,
    neighbors: Vec<&'a GraphNode<'a>>,
}

Which has the effect of constraining the lifetime of all neighbors and the enclosing GraphNode to be compatible with some lifetime 'a. This can be accomplished by simply constructing and using the nodes within a single function; the lifetime of the nodes will be the body of the function where they were defined. But once you start wanting to pass graphs around your program as values, you need a place to allocate them that will ensure the lifetimes are consistent.

One way to accomplish this is to eschew references entirely, hold the list of objects in a Vec, and use usize indices (or more complex generational indices) into that Vec to reference other objects in the structure. The extreme end of this is an Entity Component System.

If you'd like to keep using references, however—they're generally easier to work with, and require less complex library support code to manage—you'll need a way to allocate memory that will hand you back a consistent lifetime for those references. Something like this:

trait Arena<T> {
    fn alloc<'a>(&'a self, value: T) -> &'a mut T;
}

This will allow us to construct GraphNode<'a>s, allocate them into an arena to get &'a GraphNode<'a>s, and pass those around as we please. When the arena is dropped, it will drop all the GraphNodes, and the borrow checker will ensure that can't happen while any references to those GraphNodes still exist.

If you're a bit more familiar with Rust, you may find it strange that we take &self but return &mut T:

trait Arena<T> {
    fn alloc<'a>(&'a self, value: T) -> &'a mut T;
    //           ^^^^^^^^               ?????????
}

This function generates a mut (exclusive) reference from a non-mut (shared) reference, which seems... wrong at first glance. Clippy even warns about this as "trivially unsound", which it usually would be. However, in this case the &mut doesn't come from &self, it comes from the owned T that's passed in. Since an owned T can only be passed to alloc once, by definition, only one &mut T reference to that value can be returned.

The benefit of taking &self instead of &mut self is that it makes the arena more useful. If the arena took &mut self, we could only have one live reference to an item at the time, since the &mut self is an exclusive borrow of the arena. That's not very useful for most arena needs, so we'll be writing our arena to take &self instead.

This does have an important consequence, however: we can't ever move the contents of the arena. Rust containers that allocate within themselves require &mut self because they can be moved: for example, when a Vec is resized, its address may change if the current memory location doesn't have enough room for the container to expand:

fn main() {
    let mut vec = vec![0; 100];
    println!("{:p}", &vec[0]); // 0x55f1146b39d0
    for _ in 0..100_000 {
        vec.push(1);
    }
    println!("{:p}", &vec[0]); // 0x7f6b8e758010
}

This is a common source of bugs in other languages, but not in safe Rust: the borrow checker forbids &Container<T> references from existing while an exclusive &mut Container<T> reference exists, because that reference could resize the container and turn the &Container<T> shared references into dangling ones.

This check is usually quite helpful, but since we won't ever be moving the contents of our arena, it doesn't apply to us. We'll use unsafe to work around it, which means we'll need to be careful to ensure we never move the contents of the arena. Luckily, as programmers, we don't often make mistakes :-)

permalink for typed-arena_and_Vec-based_arenas typed-arena and Vec-based arenas

The way typed-arena, and many other Rust arena allocators handle this is to have something like this (but more optimized):

struct Arena<T> {
    chunks: RefCell<Vec<Vec<T>>>
}

RefCell<T> is for interior mutability, since as we explained above, we must take &self and not &mut self in alloc.

You might be confused that Arena stores data in a Vec, since didn't we explain above that Vec requires &mut self because it may be resized? Won't our unsafe code be unsound due to this?

The answer is the reason why chunks is Vec<Vec<T>>, not Vec<T>. While the outer chunks list may grow to add more chunks of memory, the interior Vec<T>s are created with a fixed capacity that is never exceeded. When an allocation would cause a chunk to grow, a new chunk is created instead:

impl<T> Arena<T> {
    fn alloc(&self, value: T) -> &mut T {
        let mut chunks = self.chunks.borrow_mut();
        let last = chunks.len() - 1;

        let chunk = if chunks[last].len() < chunks[last].capacity() {
            chunks[last].push(value);
            &mut chunks[last]
        } else {
            let mut new_chunk = Vec::with_capacity(chunks[last].len() * 2);
            new_chunk.push(value);
            chunks.push(new_chunk);
            let idx = chunks.len() - 1;
            &mut chunks[idx]
        };

        // here we generate the exclusive reference to the newly-pushed element,
        // but it's sound for the reasons described above
        unsafe { &mut *chunk.as_mut_ptr().add(chunk.len() - 1) }
    }
}

This allows us to use our arena as we'd like!

fn main() {
    // look ma! no mut!
    let arena = Arena { chunks: RefCell::new(vec![Vec::with_capacity(1)]) };

    let mut x = arena.alloc(0xdeadbeee_usize);
    let y = arena.alloc(0xfee1dead_usize);
    let z = arena.alloc(0xdeaddead_usize);

    // the returned references are mutable:
    *x += 1;

    println!("{:#x?}", arena.chunks);
    println!("x + y - z = {:#x}", *x + *y - *z);
}
RefCell {
    value: [
        [
            0xdeadbeef,
        ],
        [
            0xfee1dead,
            0xdeaddead,
        ],
    ],
}
x + y - z = 0xfee1beef

This works great! But it's rather a lot of bookkeeping, isn't it? We need to maintain this chunk list and make sure we don't accidentally over-allocate into a Vec and cause it to resize. These chunks are also spread across memory, which might hurt cache performance for traversals (we'll test that later). Is there another way? A way more cursed way?

permalink for Linux_overcommit_and_you Linux overcommit and you

When is a memory allocation not a memory allocation? On Linux systems, quite often. The linux kernel has an optimization called overcommit, which allows programs to allocate more virtual memory than physical memory. This can sometimes lead to strange behavior, but is quite useful in other cases.

Here's how it works:

This behavior gives us exactly what we want for an arena allocator: a stable, contiguous block of memory we can write into, without having to worry about managing chunks of memory, container resizing, or memory fragmentation.

There are downsides, of course. It relies on Linux overcommit, which is configurable via /proc/sys/vm/overcommit_memory: this value defaults to 0, which uses a heuristic to reject "obvious overcommits", but seems to work with this scheme regardless. However other values can reject all overcommits. This is not to even mention other operating systems, which have their own behavior (or lack thereof) around overcommit.

But, for a fun/cursed Linux-only hack, it's perfect :-)

permalink for Implementing_an_overcommit_arena Implementing an overcommit arena

Here's what our Arena struct looks like:

pub struct Arena<T> {
    map: NonNull<T>,    // NonNull is a wrapper around a *mut T that's guaranteed to be non-null
    size: Cell<usize>,  // for interior mutability
}

impl<T> Arena<T> {
    /// We try to map ~1Tb of virtual memory, which is within the usual address
    /// size of 47 bits (128Tb), while being larger than almost any computer's
    /// working memory so there is almost no risk of running out
    const MAP_SIZE: usize = 1 << 40;

    pub fn new() -> anyhow::Result<Self> {
        let map = unsafe {
            libc::mmap(
                ptr::null_mut(),
                Self::MAP_SIZE,
                libc::PROT_READ | libc::PROT_WRITE,
                // MAP_PRIVATE:   this is not shared memory
                // MAP_ANONYMOUS: this is RAM, not a file-backed mmap
                // MAP_NORESERVE: don't reserve swap
                // MAP_HUGETLB:   use huge pages for better performance
                //                (make sure huge pages are enabled or this will SIGBUS:
                //                 # sysctl -w vm.nr_hugepages=2048)
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE | libc::MAP_HUGETLB,
                -1,
                0,
            )
        };

        let map = if map == libc::MAP_FAILED || map.is_null() {
            return Err(anyhow!("mmap failed: {}", get_errno()));
        } else {
            NonNull::new(map as *mut T).unwrap()
        };

        Ok(Self {
            map,
            size: Cell::new(0),
        })
    }
}

This is very similar to a decomposed VecNonNull<T> is a non-null raw pointer to T, which will point to our mmap'd area, and size is how many elements we've allocated so far. The only thing missing from the internals of Vec in std is a capacity field -- we don't need one because we're allocating enough virtual memory that we can "safely" assume the kernel will kill us before we're able to overrun our allocation.

Inside new we set up the memory map, using the libc crate to interface with glibc and mmap our totally-normal, not at all suspicious 1Tb memory block.

alloc is where the real meat is, and by real meat I mean almost nothing because working with this memory block is so simple:

impl<T> Arena<T> {
    fn alloc(&self, value: T) -> &mut T {
        let size = self.size.get();
        self.size.set(size + 1);

        // lazily assume we'll always have capacity since we allocated such a huge map
        unsafe {
            let end = self.map.as_ptr().add(size);
            ptr::write(end, value);
            &mut *end
        }
    }
}

We bump size (with interior mutability), write the value into the new slot, and return a mutable reference to that slot. Two adds and a memcpy, basically.

permalink for Performance Performance

So how does this stack up against the Vec-arena approach of typed-arena?

I wrote some simple benchmarks with criterion.rs to test this out. I'm no benchmarking expert, and I didn't verify that typed-arena is the fastest Vec-arena-using arena allocator for Rust, so take these benchmarks with a grain of salt. But, I also didn't heavily optimize my implementation using mmap, so lets go ahead regardless.

I ran the benchmarks on a laptop with these specs:

$ cat /etc/lsb-release | grep DESC # cat is more composable, fight me
DISTRIB_DESCRIPTION="Pop!_OS 20.04 LTS"

$ cat /proc/cpuinfo
...
model name	: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
cache size	: 8192 KB
cpu cores	: 4
...

$ cat /proc/meminfo
MemTotal:       48983684 kB
...

So with that out of the way, lets get benchmarking!

permalink for Allocation_speed Allocation speed

The first benchmark is a simple allocation stress-test: it tries to allocate as many possible byte arrays of a certain size. This benchmark was run for arrays of size 8, 16, 32, 64, 128, and 256 bytes.

Due to physical memory pressure, and issues controlling that with Criterion, these benchmarks aren't as stable as I would hope and had a few % variance between runs, so keep that in mind.

First, the absolute timings:

Allocation line plot

The first thing you'll probably notice is the massive spike for typed-area with 8-byte allocations. This was caused almost entirely by a single sample that took 15ms! This is more clear from the violin plot:

Allocation violin plot

I would write this off as an outlier, but it happened consistently on every run of the benchmark. It's possible there was some interaction within the benchmark that was causing this behavior, I'm not sure. Regardless, using an arena allocator for a structure that can only hold a single pointer is a somewhat pathological case, so we can ignore it and look at the trends for larger sizes.

It seems for smaller allocations the mmap approach does slightly better, but falls off at larger sizes. Flamegraphs suggest that memcpy times dominated at the larger sizes, so it's possible that the Vec approach was more amenable to the compiler optimizing or fusing memcpy operations. At smaller sizes where indexing and growing the Vec would happen more frequently, it makes sense that mmap would be slightly faster despite this.

Overall, however, it seems about a wash, since we're talking about single-digit nanoseconds either way, besides the possibly-spurious 8 byte test.

permalink for Traversal_speed Traversal speed

The second benchmark was a depth-first search of a (seeded) randomly-generated DAG. The search was crafted such that the value would rarely be found (1 / 2^64), requiring a traversal of the entire graph. The benchmark generated DAGs of size 100, 1,000, 10,000, 100,000, 1,000,000, and 10,000,000 nodes, and timed how long a search over the graph took.

Traversal line plot

As the size of the graph increases, the mmap-based arena starts to pull ahead. I would test with even larger graphs, but the benchmark runtime was hitting 15 minutes and I was getting impatient. However, I think the trend is clear enough.

I think locality is probably the dominant factor here: with smaller graph sizes, the whole graph fits in a few Vec chunks for typed-arena, so the mmap approach doesn't gain much. As the graph grows, the Vec chunks get larger and more spread out, making locality worse.

This does become significant: at the largest graph size, the nodes in the mmap arena take an average of 1.5060 seconds to traverse, vs 2.1965 seconds for the typed-arena nodes: 31% slower!

permalink for Should_you_use_this? Should you use this?

Mmap arenas? Maybe. I'm sure some people already do, though a cursory Google search for "mmap arena" didn't turn up much: most usages seemed to be using mmap for a typical chunk list. Using mmap with overcommit like this has some pros and cons: the (possible) performance and simplicity is good, the reliance on Linux overcommit behavior and ease of getting OOM killed is a con. My hunch would be your arena allocator is probably not your bottleneck most of the time, and a more robust solution like typed-arena is more than good enough.

This code, specifically? God no.