Skip to main content

iddqd, macrodb, and a tiny keyed set

·8 mins

Today I read a post by Oxide about a data structure they have built called iddqd. It is a map where the key is a projection of the value itself, and it is implemented with a fair amount of unsafe Rust. The post is mostly about correctness: how subtle the soundness reasoning gets when generic unsafe code calls back into user-supplied trait impls, and what layers of validation they apply to make sure the thing actually holds up.

That blog post is awesome, I can recommend reading it. But while I was reading it, it reminded me of two situations where I solved a similar problem (albeit in a very different, much less fancy way). Sometimes watching different people solve the same problem gets you very interesting outcomes. So I want to use this blog post to show my solutions, which live in a similar design space, but from different angles.

Explaining the problem #

You have some values. Each value carries something that uniquely identifies it: an email on a user, a path on an annotation, an id on a record. You want a collection that lets you treat those values as a set keyed on that field. For example:

pub struct User {
    // you want to look up users by this
    pub handle: String,
    pub email: EmailAddress,
    pub full_name: String,
}

The low-effort answer is sticking them in a Vec<User>, hoping that you don’t have duplicate keys, and iterating the entire vec for lookup. But hope is not a viable strategy when building reliable software, and we can do better than O(n) on lookup.

The standard library answer is BTreeMap<String, User> or HashMap<String, User>, where you split the key (String) out and store it next to the value. The downside is that you now need to do one of:

  • Store the same data in two places (once as the key, once inside the value). This is awkward, because it duplicates storage for no good reason (User will live as long as the String handle is alive, but now you’re storing it twice), and it makes mutation awkward (you don’t want to end up in a situation where you changed the handle, but the key is still the old value).
  • Give up on having the key inside the value at all, which makes using this awkward. Now instead of passing User structs around you have to pass around (String, User) tuples, because the handle is no longer inside the user struct.

iddqd solves this by borrowing the key out of the value at every operation. The map only stores the value. A user-supplied key() method hands back a reference whenever the map needs to compare or hash. This avoids the duplication, but as the Oxide post explains in detail, expressing this in Rust without leaking unsoundness is hard work, because every operation has to call back into user-supplied trait impls that the unsafe code must assume might be adversarial.

KeyedSet #

The smaller of the two lives in a project of mine called openvet. The whole file is about 200 lines, and the reason it exists is almost the opposite of why iddqd does.

I was mapping an existing wire format to Rust. Parts of that format are lists of things that the spec requires to be sorted and deduplicated on some projection. What I cared about was that no caller could ever construct an invalid representation. The job was making invalid wire states unrepresentable in the type system.

The whole thing is a BTreeMap<T::Key, T> behind a trait:

pub trait Keyed {
    type Key: Ord + Clone;
    fn key(&self) -> Self::Key;
}

pub struct KeyedSet<T: Keyed> {
    items: BTreeMap<T::Key, T>,
}

A user of the type implements Keyed for their record:

impl Keyed for Annotation {
    type Key = String;
    fn key(&self) -> String {
        self.path.clone()
    }
}

They get a KeyedSet<Annotation> which is, by construction, sorted on path and free of duplicates. There is no unsafe. There are no macros. The keys get cloned freely: Clone is right there in the trait bound. The BTreeMap does all the actual work.

With this design, you can tell that ultimate memory efficiency is not a priority for me: I am happy to duplicate a small string, or maybe make the key an Arc<str> so I can sleep better at night.

The one interesting wrinkle is mutation. If you want to update an entry and the closure returns an item with a different key than what was there before, you have a re-keying problem, and you have to decide what to do if another entry already lives at the new key. KeyedSet::modify handles this explicitly:

pub fn modify<F>(&mut self, key: &T::Key, f: F) -> Result<&T, ModifyError>
where
    F: FnOnce(T) -> T,
    T: Clone,
{
    let old = self.items.get(key).ok_or(ModifyError::NotFound)?.clone();
    let new = f(old);
    let new_key = new.key();
    if new_key != *key && self.items.contains_key(&new_key) {
        return Err(ModifyError::IdentityCollision);
    }
    self.items.remove(key);
    self.items.insert(new_key.clone(), new);
    Ok(&self.items[&new_key])
}

If the closure tries to mutate one entry’s key onto another existing one, you get an IdentityCollision back and the set is untouched. That is it. The whole file is around two screens of code.

What’s interesting about this is that I solved a problem that has a similar shape as iddqd, but for a totally different motivation. iddqd wants the data structure for fast lookup, I do not care about lookups here at all. I built this data structure for an entirely different reason: I had to build it that way to surface invariants of the wire-format. What this gives me is the guarantee that the items are always sorted, and I have no duplicates of that key.

macrodb #

The other thing I wrote, macrodb, goes in the opposite direction. And before I explain it further, I need to note that this is more of a prototype than anything I would seriously use. I built this because I wanted to know if it could be done, not because it is useful.

My thought process was this, roughly: building a data structure that keeps items unique by a key projection is not too hard. But what if you wanted to have two key projections? The more I thought about this, the more it sounded like a relational database. And then it got me thinking: why not commit fully and actually build a shitty in-memory relational database? So I did. And I used everyone’s favourite Rust language feature, which is macros.

This is not a fancy data structure. In fact, it is not a data structure at all: it’s more of a bring-your-own-storage type of situation, where macrodb sits on top and manages the data. You show up with a struct like this:

pub struct Database {
    users: BTreeMap<UserId, User>,
    user_by_name: HashMap<String, UserId>,
    groups: BTreeMap<GroupId, Group>,
    group_by_name: HashMap<String, GroupId>,
    users_by_group: BTreeMap<GroupId, BTreeSet<UserId>>,
}

Then you point a macro at it that describes the schema:

impl Database {
    table!(
        users: User,
        id: UserId,
        missing Error => Error::UserNotFound,
        primary users id => Error::UserIdExists,
        foreign groups group => Error::GroupNotFound,
        index users_by_group group => (),
        unique user_by_name name => Error::UserNameExists
    );
    table!(
        groups: Group,
        id: GroupId,
        missing Error => Error::GroupNotFound,
        primary groups id => Error::GroupIdExists,
        unique group_by_name name => Error::GroupNameExists,
        reverse users_by_group id => Error::GroupHasUsers
    );
}

The macro generates a users_insert, users_update, users_delete (and the same trio for groups) on the Database impl. Each generated method maintains every index that mentions that table, refuses to insert a user whose group does not exist (foreign), refuses to delete a group while users still reference it (reverse), enforces uniqueness on declared columns, and so on. You can also drop in user-defined constraint functions for anything the built-in vocabulary does not cover.

You get to pick the storage types because they are your structs. Want hashbrown::HashMap instead of std::collections::HashMap? Just swap it out. Want im::OrdMap for cheap clones? Same answer. The macro does not care, it generates the same insert, get and remove calls on whatever you put there.

The whole thing is macro_rules!, no proc-macro crate, no unsafe code anywhere. Keys get cloned at every step, just like in KeyedSet. The cost is that the macro internals are kind of grim to read, errors out of macro expansion are sometimes opaque, and rust-analyzer occasionally pretends none of it exists. The benefit is that the resulting code is plain safe Rust that you can step through and reason about without thinking about soundness at all.

In return for paying the duplication cost, you get foreign keys, reverse dependencies, multi-table constraints, and compound keys almost for free. iddqd has sets1 keyed on one projection (IdOrdMap), two projections (BiHashMap), and three projections (TriHashMap). macrodb generalizes that to where you can have N projections, and they can be unique or regular indices.

The only thing I did not build is a query language. The way you look things up is by querying (and joining) the indexes yourself. As my professor in college would say: building query engines is quite well-understood, it is left as an exercise for the reader at home.

What is interesting in the three of them #

All three of these implementations started with the same idea: building a set keyed on some projection of the stored values. But the reasoning is different, and so is the outcome.

  • iddqd wants it for efficient querying and lookup, and probably has the most refined and optimized implementation.
  • KeyedSet wants it because it needs to maintain ordering and uniqueness invariants, and is otherwise optimized for simplicity.
  • macrodb needs it because I got bored and wanted to know if I could build a relational database with macros in Rust, with no practical use for it besides having a cool story to tell at parties.

It leaves me thinking: how have other people solved this?


  1. iddqd calls them maps, I call them keyed sets. In some ways, they are both: set-like behaviour of the keys, but also map-like behaviour to look up values by keys. You get what I mean. Substitute set for whatever terminology makes more sense to you. ↩︎