mmap(1Tb): A Rust arena allocator (ab)using Linux overcommit
In this post, we'll build an arena allocator in Rust that allocates into a giant, overcommitted
mmap'd memory block. We'll then do some light benchmarking against a popular Rust arena allocator,
typed-arena
, that uses a more traditional Vec-of-chunks approach, and see what performance
differences exist, if any.
A brief disclaimer: I'll be using typed-arena as a point of comparison in this post, since it's a popular and fast Rust arena allocator with a similar API to the allocator we'll be writing (one T per arena, with alloc(&'a self, value: T) -> &'a mut T). Nothing in this post is meant as a criticism of typed-arena; in fact, I picked it as a baseline because it's fast and high-quality, and it stacks up well here.
A Brief Introduction to Arena Allocators and Rust
Because of Rust's borrow checker, the "sea of objects" technique for implementing
cyclic data structures like graphs is not so simple. We can't just write the moral equivalent
of a Java GraphNode
class:
struct GraphNode {
    value: String,
    neighbors: Vec<&GraphNode>,
}
Why not? Because the borrow checker requires lifetimes on the &GraphNode
references in neighbors
to ensure they live at least as long as the enclosing GraphNode
, to prevent accessors of a node
from dereferencing dangling references. This requires something like this declaration instead:
struct GraphNode<'a> {
    value: String,
    neighbors: Vec<&'a GraphNode<'a>>,
}
Which has the effect of constraining the lifetime of all neighbors
and the enclosing GraphNode
to be compatible
with some lifetime 'a
. This can be accomplished by simply constructing and using the
nodes within a single function; the lifetime of the nodes will be the body of the function
where they were defined. But once you start wanting to pass graphs around your program as
values, you need a place to allocate them that will ensure the lifetimes are consistent.
One way to accomplish this is to eschew references entirely, hold the list of objects in a
Vec
, and use usize
indices (or more complex generational indices) into that Vec
to
reference other objects in the structure. The extreme end of this is an Entity Component System.
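A minimal sketch of that index-based style might look like this (the types here are just for illustration, not from any particular crate):
struct Graph {
    nodes: Vec<GraphNode>,
}

struct GraphNode {
    value: String,
    // indices into Graph::nodes instead of references, so no lifetimes are needed
    neighbors: Vec<usize>,
}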
If you'd like to keep using references, however—they're generally easier to work with, and require less complex library support code to manage—you'll need a way to allocate memory that will hand you back a consistent lifetime for those references. Something like this:
trait Arena<T> {
    fn alloc<'a>(&'a self, value: T) -> &'a mut T;
}
This will allow us to construct GraphNode<'a>
s, allocate them into an arena to
get &'a GraphNode<'a>
s, and pass those around as we please. When the arena
is dropped, it will drop all the GraphNode
s, and the borrow checker will ensure
that can't happen while any references to those GraphNode
s still exist.
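For example, with some type implementing that trait, building a tiny graph could look like this (a hypothetical sketch; build_graph is not part of any library):
fn build_graph<'a, A: Arena<GraphNode<'a>>>(arena: &'a A) -> &'a GraphNode<'a> {
    // allocate two nodes; the arena hands back references tied to its own lifetime
    let a: &GraphNode = arena.alloc(GraphNode { value: "a".into(), neighbors: vec![] });
    let b = arena.alloc(GraphNode { value: "b".into(), neighbors: vec![a] });
    b
}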
If you're a bit more familiar with Rust, you may find it strange that we take &self
but return &mut T
:
trait Arena<T> {
    fn alloc<'a>(&'a self, value: T) -> &'a mut T;
    //           ^^^^^^^^ ?????????
}
This function generates a mut (exclusive) reference from a non-mut (shared) reference,
which seems... wrong at first glance. Clippy even warns about this as "trivially unsound",
which it usually would be. However, in this case the &mut
doesn't come from &self
,
it comes from the owned T
that's passed in. Since an owned T
can only be passed to
alloc
once, by definition, only one &mut T
reference to that value can be returned.
The benefit of taking &self
instead of &mut self
is that it makes the arena more useful.
If the arena took &mut self
, we could only have one live reference to an item at a time,
since the &mut self
is an exclusive borrow of the arena. That's not very useful for most arena
needs, so we'll be writing our arena to take &self
instead.
This does have an important consequence, however: we can't ever move the contents
of the arena. Rust containers that allocate within themselves require &mut self
because
they can be moved: for example, when a Vec is resized, the address of its contents may change if the current memory location doesn't have enough room for the container to expand:
fn main() {
    let mut vec = vec![0; 100];
    println!("{:p}", &vec[0]); // 0x55f1146b39d0
    for _ in 0..100_000 {
        vec.push(1);
    }
    println!("{:p}", &vec[0]); // 0x7f6b8e758010
}
This is a common source of bugs in other languages,
but not in safe Rust: the borrow checker forbids &Container<T>
references
from existing while an exclusive &mut Container<T>
reference exists, because that
reference could resize the container and turn the &Container<T>
shared references into
dangling ones.
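As a quick illustration, this is rejected by the borrow checker, because the push could reallocate the contents out from under the outstanding reference:
fn main() {
    let mut vec = vec![0; 100];
    let first = &vec[0]; // shared borrow into the contents
    vec.push(1);         // error[E0502]: cannot borrow `vec` as mutable
                         // because it is also borrowed as immutable
    println!("{}", first);
}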
This check is usually quite helpful, but since we won't ever be moving the contents of our arena, it doesn't apply to us. We'll use unsafe to work around it, which means we'll need to be careful to ensure we never move the contents of the arena. Luckily, as programmers, we don't often make mistakes :-)
typed-arena and Vec-based arenas
The way typed-arena
and many other Rust arena allocators handle this is to
have something like this (but more optimized):
struct Arena<T> {
    chunks: RefCell<Vec<Vec<T>>>,
}
RefCell<T>
is for interior mutability,
since as we explained above, we must take &self
and not &mut self
in alloc
.
You might be confused that Arena stores data in Vecs: didn't we just explain that a Vec requires &mut self because it may be resized? Won't our unsafe code be unsound because of this?
The answer is why chunks is a Vec<Vec<T>> and not a plain Vec<T>. While the outer chunks list
may grow to add more chunks of memory, the interior Vec<T>
s are created with a fixed capacity
that is never exceeded. When an allocation would cause a chunk to grow, a new chunk is created instead:
impl<T> Arena<T> {
    fn alloc(&self, value: T) -> &mut T {
        let mut chunks = self.chunks.borrow_mut();
        let last = chunks.len() - 1;
        let chunk = if chunks[last].len() < chunks[last].capacity() {
            chunks[last].push(value);
            &mut chunks[last]
        } else {
            let mut new_chunk = Vec::with_capacity(chunks[last].len() * 2);
            new_chunk.push(value);
            chunks.push(new_chunk);
            let idx = chunks.len() - 1;
            &mut chunks[idx]
        };
        // here we generate the exclusive reference to the newly-pushed element,
        // but it's sound for the reasons described above
        unsafe { &mut *chunk.as_mut_ptr().add(chunk.len() - 1) }
    }
}
This allows us to use our arena as we'd like!
fn main() {
    // look ma! no mut!
    let arena = Arena { chunks: RefCell::new(vec![Vec::with_capacity(1)]) };
    let x = arena.alloc(0xdeadbeee_usize);
    let y = arena.alloc(0xfee1dead_usize);
    let z = arena.alloc(0xdeaddead_usize);
    // the returned references are mutable:
    *x += 1;
    println!("{:#x?}", arena.chunks);
    println!("x + y - z = {:#x}", *x + *y - *z);
}
RefCell {
    value: [
        [
            0xdeadbeef,
        ],
        [
            0xfee1dead,
            0xdeaddead,
        ],
    ],
}
x + y - z = 0xfee1beef
This works great! But it's rather a lot of bookkeeping, isn't it? We need to maintain this chunk list and make sure we don't accidentally over-allocate into a Vec and cause it to resize. These chunks are also spread across memory, which might hurt cache performance for traversals (we'll test that later). Is there another way? A way more cursed way?
Linux overcommit and you
When is a memory allocation not a memory allocation? On Linux systems, quite often. The Linux kernel has an optimization called overcommit, which allows programs to allocate more virtual memory than there is physical memory. This can sometimes lead to strange behavior, but is quite useful in other cases.
Here's how it works:
- We allocate a massive block of memory, like 1Tb. This doesn't actually reserve any physical memory, just sets up a virtual mapping with a fault handler that will alert the kernel if we try to write to any of this memory.
- This block of virtual memory is divided into blocks, called pages. When we write to a page that isn't backed by physical memory, the kernel will go out and find a physical memory page to map this virtual memory page to.
- If there isn't any physical memory available, the kernel OOM killer will kill some process it deems suitable to free up memory. In most cases, that will be us.
This behavior gives us exactly what we want for an arena allocator: a stable, contiguous block of memory we can write into, without having to worry about managing chunks of memory, container resizing, or memory fragmentation.
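To see overcommit in action, here's a small standalone sketch using the libc crate: it maps 1Tb of anonymous memory and touches only the first page. On a typical Linux setup with overcommit enabled, this succeeds even on a machine with far less RAM:
use std::ptr;

fn main() {
    const ONE_TB: usize = 1 << 40;
    let map = unsafe {
        libc::mmap(
            ptr::null_mut(),
            ONE_TB,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
            -1,
            0,
        )
    };
    assert_ne!(map, libc::MAP_FAILED);
    // only now does the kernel find a physical page to back this write
    unsafe { *(map as *mut u8) = 42 };
    println!("mapped {} bytes at {:p}", ONE_TB, map);
}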
There are downsides, of course. It relies on Linux overcommit, which is configurable
via /proc/sys/vm/overcommit_memory
: this value defaults to 0, which uses a heuristic
to reject "obvious overcommits", but seems to work with this scheme regardless. However
other values can reject
all overcommits. This is not to even mention other operating systems, which have their own
behavior (or lack thereof) around overcommit.
But, for a fun/cursed Linux-only hack, it's perfect :-)
Implementing an overcommit arena
Here's what our Arena
struct looks like:
pub struct Arena<T> {
    map: NonNull<T>,   // NonNull is a wrapper around a *mut T that's guaranteed to be non-null
    size: Cell<usize>, // for interior mutability
}

impl<T> Arena<T> {
    /// We try to map ~1Tb of virtual memory, which is within the usual address
    /// size of 47 bits (128Tb), while being larger than almost any computer's
    /// working memory so there is almost no risk of running out
    const MAP_SIZE: usize = 1 << 40;

    pub fn new() -> anyhow::Result<Self> {
        let map = unsafe {
            libc::mmap(
                ptr::null_mut(),
                Self::MAP_SIZE,
                libc::PROT_READ | libc::PROT_WRITE,
                // MAP_PRIVATE: this is not shared memory
                // MAP_ANONYMOUS: this is RAM, not a file-backed mmap
                // MAP_NORESERVE: don't reserve swap
                // MAP_HUGETLB: use huge pages for better performance
                //   (make sure huge pages are enabled or this will SIGBUS:
                //    # sysctl -w vm.nr_hugepages=2048)
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE | libc::MAP_HUGETLB,
                -1,
                0,
            )
        };
        let map = if map == libc::MAP_FAILED || map.is_null() {
            return Err(anyhow!("mmap failed: {}", get_errno()));
        } else {
            NonNull::new(map as *mut T).unwrap()
        };
        Ok(Self {
            map,
            size: Cell::new(0),
        })
    }
}
This is very similar to a decomposed Vec
—
NonNull<T>
is a non-null raw pointer to T
, which will point to our mmap
'd area,
and size
is how many elements we've allocated so far. The only thing missing from the
internals of Vec
in std
is a capacity field: we don't need one because we're allocating enough
virtual memory that we can "safely" assume the kernel will kill us before we're able to overrun our
allocation.
Inside new
we set up the memory map, using the libc
crate to interface with
glibc
and mmap
our totally-normal, not at all suspicious 1Tb memory block.
alloc
is where the real meat is, and by real meat I mean almost nothing because
working with this memory block is so simple:
impl<T> Arena<T> {
    fn alloc(&self, value: T) -> &mut T {
        let size = self.size.get();
        self.size.set(size + 1);
        // lazily assume we'll always have capacity since we allocated such a huge map
        unsafe {
            let end = self.map.as_ptr().add(size);
            ptr::write(end, value);
            &mut *end
        }
    }
}
We bump size
(with interior mutability), write the value into the new slot, and return
a mutable reference to that slot. Two adds and a memcpy
, basically.
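Using it looks just like the Vec-backed version (a quick sketch):
fn main() -> anyhow::Result<()> {
    let arena = Arena::new()?;
    let x = arena.alloc(0xdeadbeee_usize);
    let y = arena.alloc(0xfee1dead_usize);
    // the returned references are still mutable
    *x += 1;
    println!("x + y = {:#x}", *x + *y);
    Ok(())
}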
Performance
So how does this stack up against the Vec-arena approach of typed-arena
?
I wrote some simple benchmarks with criterion.rs
to test this out. I'm no benchmarking
expert, and I didn't verify that typed-arena
is the fastest Vec-arena-using arena
allocator for Rust, so take these benchmarks with a grain of salt. But, I also didn't
heavily optimize my implementation using mmap
, so let's go ahead regardless.
I ran the benchmarks on a laptop with these specs:
$ cat /etc/lsb-release | grep DESC # cat is more composable, fight me
DISTRIB_DESCRIPTION="Pop!_OS 20.04 LTS"
$ cat /proc/cpuinfo
...
model name : Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
cache size : 8192 KB
cpu cores : 4
...
$ cat /proc/meminfo
MemTotal: 48983684 kB
...
So with that out of the way, let's get benchmarking!
Allocation speed
The first benchmark is a simple allocation stress-test: it tries to allocate as many byte arrays of a given size as possible. This benchmark was run for arrays of size 8, 16, 32, 64, 128, and 256 bytes.
Due to physical memory pressure, and issues controlling that with Criterion, these benchmarks aren't as stable as I would have hoped, with a few percent variance between runs, so keep that in mind.
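The actual benchmark code isn't reproduced here, but a criterion benchmark along these lines is roughly what's being measured (a hedged sketch; names and sizes are illustrative):
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_alloc(c: &mut Criterion) {
    // one arena for the whole benchmark; each allocation just bumps a counter
    let arena = Arena::<[u8; 64]>::new().unwrap();
    c.bench_function("mmap_arena_alloc_64", |b| {
        b.iter(|| {
            black_box(arena.alloc([0u8; 64]));
        })
    });
}

criterion_group!(benches, bench_alloc);
criterion_main!(benches);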
First, the absolute timings:
The first thing you'll probably notice is the massive spike for typed-arena
with
8-byte allocations. This was caused almost entirely by a single sample that took
15ms! This is more clear from the violin plot:
I would write this off as an outlier, but it happened consistently on every run of the benchmark. It's possible there was some interaction within the benchmark that was causing this behavior; I'm not sure. Regardless, using an arena allocator for a structure that can only hold a single pointer is a somewhat pathological case, so we can ignore it and look at the trends for larger sizes.
It seems for smaller allocations the mmap approach does slightly better, but falls
off at larger sizes. Flamegraphs suggest that memcpy
times dominated at
the larger sizes, so it's possible that the Vec approach was more amenable to the
compiler optimizing or fusing memcpy
operations. At smaller sizes where indexing
and growing the Vec
would happen more frequently, it makes sense that mmap
would
be slightly faster despite this.
Overall, however, it seems about a wash, since we're talking about single-digit nanoseconds either way, besides the possibly-spurious 8 byte test.
Traversal speed
The second benchmark was a depth-first search of a (seeded) randomly-generated DAG. The search was crafted such that the value would rarely be found (1 / 2^64), requiring a traversal of the entire graph. The benchmark generated DAGs of size 100, 1,000, 10,000, 100,000, 1,000,000, and 10,000,000 nodes, and timed how long a search over the graph took.
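The node type and traversal looked roughly like this (a simplified sketch; the real benchmark's graph construction and bookkeeping aren't shown here):
struct Node<'a> {
    value: u64,
    children: Vec<&'a Node<'a>>,
}

// with a random u64 target, the odds of a hit on any node are ~1/2^64,
// so in practice this walks the entire reachable graph
fn dfs(node: &Node<'_>, target: u64) -> bool {
    if node.value == target {
        return true;
    }
    node.children.iter().any(|&child| dfs(child, target))
}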
As the size of the graph increases, the mmap
-based arena starts to pull ahead. I would
test with even larger graphs, but the benchmark runtime was hitting 15 minutes and I was
getting impatient. However, I think the trend is clear enough.
I think locality is probably the dominant factor here: with smaller graph sizes,
the whole graph fits in a few Vec
chunks for typed-arena
, so the mmap
approach
doesn't gain much. As the graph grows, the Vec
chunks get larger and more spread out,
making locality worse.
This does become significant: at the largest graph size, traversing the nodes in the mmap arena takes an average of 1.5060 seconds, vs 2.1965 seconds for the typed-arena nodes: about 31% less time!
Should you use this?
Mmap arenas? Maybe. I'm sure some people already do, though a cursory Google search
for "mmap arena" didn't turn up much: most usages seemed to be using mmap
for a typical
chunk list. Using mmap
with overcommit like this has some pros and cons: the (possible) performance win and the simplicity are pros; the reliance on Linux overcommit behavior and the ease of getting OOM killed are cons. My hunch would be that your arena allocator is probably not your bottleneck most of the time, and a more robust solution like typed-arena is more than good enough.
This code, specifically? God no.