UPID is as UPID does

# Context

I developed a new type of universally-unique ID, and it was pretty fun. I was inspired by a blog post I read a month or two ago that mentioned Stripe’s pretty prefixed IDs. They look like this:

cus_MJA953cFzEuO1z
└─┘ └────────────┘
 └─ prefix    └─ random

I’ve used Stripe’s API before and thought these IDs were great. Any time I dumped out some JSON I didn’t have to wonder what each bunch of random characters stood for: the ID told me!

The downside is that you have to make a choice when you implement them. Either

  1. Store them as TEXT in the database, which means you’ll have slower lookups and waste MBs, or
  2. Store just the random part as a 128-bit UUID in the database, which means your APIs have to strip/prepend the prefix at some boundary each time.

If your ORM is clever or extensible, you can work around the downside of option 2. For example, TypeID has an Elixir implementation that will handle this back-and-forthing for you. But TypeID is a general spec, and it can’t make this work everywhere. There’s no way to make it work with Prisma, for example.
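To make option 2 concrete, here’s a minimal sketch of that boundary code in Python. The helper names (`to_public_id`, `from_public_id`) and the hex encoding of the random part are assumptions for illustration only; this is not how Stripe or TypeID actually encode their IDs.

```python
import uuid


def to_public_id(prefix: str, raw: uuid.UUID) -> str:
    """Prepend the prefix on the way out of the database (hypothetical helper)."""
    return f"{prefix}_{raw.hex}"


def from_public_id(expected_prefix: str, public_id: str) -> uuid.UUID:
    """Strip and check the prefix on the way back in (hypothetical helper)."""
    prefix, sep, body = public_id.partition("_")
    if sep != "_" or prefix != expected_prefix:
        raise ValueError(f"expected a {expected_prefix}_ ID, got {public_id!r}")
    return uuid.UUID(hex=body)


stored = uuid.uuid4()                  # what actually lives in the UUID column
public = to_public_id("cus", stored)   # what the API hands out
assert from_public_id("cus", public) == stored
```

Every place an ID crosses between the database and the outside world needs a pair of calls like these, which is exactly the chore described above.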

# UPID

But the rest of us are stuck with one of these imperfect solutions. So I came up with UPID. It looks like this in its text encoding:

user_2accvpp5guht4dts56je5a
└──┘ └──────┘└───────────┘└─ version
 └─ prefix └─ time   └─ random

And something like this, in binary:

00000001 10010001 01011100 10110100
01111110 11101101 11100011 00011110
01001111 10100010 00010101 00101011
10101011 11010110 00010101 01110110

That’s right, it’s just 128 bits! That means anywhere you have a 128 bit datatype (like UUID), you can drop in a UPID with no issue. So you can use UPID in your server code and store them as UUIDs in (for example) Postgres. You don’t need to strip/prepend the prefix, because it’s always there.
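As a quick sanity check, here’s a sketch using only Python’s standard `uuid` module and the 16 bytes printed above. It doesn’t use any UPID library; it just shows that a UUID type will happily carry an arbitrary 128-bit value and hand it back unchanged.

```python
import uuid

# The 16 bytes from the binary dump above, written as hex.
raw = bytes.fromhex("01915cb47eede31e4fa2152babd61576")

# uuid.UUID does not validate version/variant bits, so any 16 bytes are accepted.
as_uuid = uuid.UUID(bytes=raw)
print(as_uuid)

# Round trip: the same 128 bits come back out, as bytes or as one big integer.
assert as_uuid.bytes == raw
assert as_uuid.int == int.from_bytes(raw, "big")
```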

It is most similar to a ULID, but with bits taken away from the randomness and time components to make space for the prefix. Like ULID, it uses a 32 character alphabet (making 5 bits per character), but unlike Crockford’s base32, it keeps the whole alpha part, so you can write anything with a-z in the prefix.

Also like ULID, it is “lexicographically sortable”, which basically just means it can be ordered by date. This is useful on its own, but is also valuable for things like ensuring index locality, WAL efficiency and easy sharding. However, UPID has lower time precision (256ms vs 1ms), which is arguably better for two reasons: some people are concerned you can leak information with time-based IDs, and 256ms leaks a lot less; and many system clocks have worse than 1ms precision, so you won’t have a false sense of monotonicity that might not exist.
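Here’s a rough sketch of what 256ms precision means in practice. It assumes the time component is a Unix millisecond timestamp with the low 8 bits dropped (2^8 = 256) and then masked to 40 bits; the function name `time_component` is made up for this example.

```python
import time

TIME_BITS = 40  # width of the time field


def time_component(unix_ms: int) -> int:
    """Drop the low 8 bits (one step = 256 ms) and keep 40 bits."""
    return (unix_ms >> 8) & ((1 << TIME_BITS) - 1)


now_ms = time.time_ns() // 1_000_000
print(time_component(now_ms))

# Timestamps inside the same 256 ms window collapse to the same value...
assert time_component(1_000_000) == time_component(1_000_063)
# ...and the value ticks up by one when the next window starts.
assert time_component(1_000_192) == time_component(1_000_000) + 1
```

Two IDs created within the same window get the same time bits, so their relative order is decided by the random bits rather than by the clock.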

Referring back to the two “problems” from further up:

  1. Stored as a 128 bit value in the database, so no inefficient TEXT values.
  2. Always knows what its prefix is, so you don’t need to remember to add/remove anything at any boundary.

# Data layout

So, how does it work? Here’s that same ID from further up, with the separate components broken up.

    user   _   2accvpp5      guht4dts56je5       a
   └────┘     └────────┘    └─────────────┘   └─────┘
   prefix       time            random        version     total
   4 chars      8 chars         13 chars      1 char      26 chars
       └────────│────────────────│───────────┐  │
                │                │           │  │
                ▼                ▼           ▼  ▼
             40 bits            64 bits      24 bits     128 bits
             5 bytes            8 bytes      3 bytes      16 bytes
             time               random       prefix+version

Notice that in the binary format, the time bits are stored first, so that if you sort a bunch of UPIDs, they will be neatly ordered by time.
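The layout can be sketched in a few lines of Python. The field widths come straight from the diagram; the function names and the choice to treat prefix+version as a single opaque 24-bit "tail" are simplifications for illustration, not the API of any of the UPID libraries.

```python
TIME_BITS, RANDOM_BITS, TAIL_BITS = 40, 64, 24  # 40 + 64 + 24 = 128


def pack(time_units: int, random_bits: int, tail: int) -> int:
    """Combine the three fields into one 128-bit integer, time first."""
    assert 0 <= time_units < (1 << TIME_BITS)
    assert 0 <= random_bits < (1 << RANDOM_BITS)
    assert 0 <= tail < (1 << TAIL_BITS)
    return (time_units << (RANDOM_BITS + TAIL_BITS)) | (random_bits << TAIL_BITS) | tail


def unpack(value: int) -> tuple:
    """Split a 128-bit integer back into (time, random, tail)."""
    tail = value & ((1 << TAIL_BITS) - 1)
    random_bits = (value >> TAIL_BITS) & ((1 << RANDOM_BITS) - 1)
    time_units = value >> (RANDOM_BITS + TAIL_BITS)
    return time_units, random_bits, tail


# An earlier ID with the largest possible random/tail bits still sorts
# before a later ID with the smallest possible ones.
earlier = pack(1000, (1 << 64) - 1, 0xFFFFFF)
later = pack(1001, 0, 0)
assert earlier < later
assert earlier.to_bytes(16, "big") < later.to_bytes(16, "big")
assert unpack(earlier) == (1000, (1 << 64) - 1, 0xFFFFFF)
```

Because the time field occupies the most significant bits, comparing the integers (or the big-endian bytes) compares timestamps first, and the other fields only break ties.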

Any library that implements the UPID spec just needs to be able to shuffle back and forth between the 128 bit binary representation and the text representation. There are already Python, Rust, and Postgres implementations (those all live in the same repo, btw), along with one in TypeScript that powers a little demo website.

# Cool?

This was the rare case where writing it in Rust was actually easier than in Python, because it’s more explicit about what’s a u8 or u64 or u128, whereas in Python it’s all just bytes and good luck figuring out how many or what the precision is.

This matters when you’re bit-shifting like a madman and need to keep track of which bits came from where:

let res = (time_bits << 88)
	| (random << 24)
	| ((prefix[0] as u128) << 16)
	| ((prefix[1] as u128) << 8)
	| prefix[2] as u128;

You can also do stuff like this, and Rust will shout at you if you accidentally make your alphabet too long:

pub const ENCODE: &[u8; 32] = b"234567abcdefghijklmnopqrstuvwxyz";

That’s the base32 alphabet by the way. The numbers come first so that, if for some reason you want to sort the text-encoded strings, they’ll still be sorted in the right order!
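To see why the ordering of the alphabet matters, here’s a small Python sketch. It uses the same 32 characters, but `encode_fixed` is a simplified fixed-width encoder written for this example, not the full UPID text encoding.

```python
ENCODE = "234567abcdefghijklmnopqrstuvwxyz"  # same alphabet as the Rust constant


def encode_fixed(value: int, n_chars: int) -> str:
    """Encode value as exactly n_chars symbols, 5 bits each, most significant first."""
    assert 0 <= value < (1 << (5 * n_chars))
    chars = []
    for i in reversed(range(n_chars)):
        chars.append(ENCODE[(value >> (5 * i)) & 0x1F])
    return "".join(chars)


# The alphabet is already in ASCII order (digits sort before lowercase letters).
assert list(ENCODE) == sorted(ENCODE)

# So sorting the encoded strings gives the same order as sorting the numbers.
values = [0, 5, 6, 31, 32, 1000, 32767]
encoded = [encode_fixed(v, 3) for v in values]
assert encoded == sorted(encoded)
```

If the letters came before the digits in the alphabet, as in standard RFC 4648 base32, the strings would no longer sort in the same order as the underlying numbers.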

# Using it

In Python:

from upid import upid
upid("prod")            # prod_2acptrhhfi7asnb5iessba

Rust:

use upid::Upid;
Upid::new("cust")      // cust_2acptrk7ypgl2hl45g3vca

Or TypeScript:

import { upid } from "upid-ts";
upid("acct")           // acct_2acptrqnipixsl2xugym7a

In all of these cases, the actual data is just a 128 bit binary blob. And if you store it somewhere as a UUID or u128 or something, you can just load it right back into a UPID.

If anyone wants to write an implementation in another language (ChatGPT should be able to do most of the work), let me know and I’ll add it to the list.

Wait a second, what about Postgres!?

# Postgres

Thanks to pgrx, it’s now dead-easy to write extensions for Postgres in Rust. And this is the best part. If you install the extension, then you get the binary-text back-and-forthing directly in Postgres.

So you can do stuff like this:

CREATE EXTENSION upid_pg;

CREATE TABLE members (
    id   upid NOT NULL DEFAULT gen_upid('memb') PRIMARY KEY,
    name text NOT NULL
);

INSERT INTO members (name) VALUES('Bob');

SELECT * FROM members;
--              id              | name
-- -----------------------------+------
--  memb_2acptt2ytgmpf5cmj6iy5a | Bob

That means your library code doesn’t even need to care about UPID, or install any extra libraries. It will be transmitted over the wire as a pretty string, and stored in the DB as an efficient 128 bit blob.

The only downside is that, as easy as it is to write Postgres extensions, it’s not yet very easy to install them. I’ve created .deb packages (get yours from the Releases page) so you can install it quite easily on Debian-based distros. Alternatively, you can try out the Docker image that has Postgres 16 with the UPID extension pre-installed:

docker run -e POSTGRES_HOST_AUTH_METHOD=trust \
    -p 5432:5432 carderne/postgres-upid:16

# Go on…

So that’s UPID! I hope someone finds it useful, or at least interesting. I’ll be using it in some of my personal projects, and trying to talk whoever I can into using it in their critical production projects. There’s not much risk: at the end of the day it’s just a 128 bit ID, and the full implementations are a few hundred lines of code at most. Go on…