Summer's Words

Spec of the Week: RFC 6920: Naming Things with Hashes

In this age of endless data creation, an egalitarian form of addressing has become a must; not because all data is equally good, but because all data is equally bad.

For this, we need a format that not only describes the data, but describes its representation of the description of said data. (And optionally, how to acquire said data too.)

Many options exist in this recursive hellscape:

One of my favorite RFCs as of late, RFC 6920 lets me, you, anyone represent any data not just with a Uniform Resource Identifier, but a Universal one! (Which is no longer uniform!)

What is RFC 6920? (ni:)

In the 1960s, content-addressable storage was a new exciting concept in the field of storing data. In the 2010s, RFC 6920 was made, combining the address with its store. Probably certainly not the first, but it is quite a nice implementation of the idea.

Using the power of URIs, we can package the hash of the data, the content type, and a server that might store and provide said data:

ni://example.com/sha-256-32;f4OxZQ?ct=text/plain

To resolve it, fetch https://example.com/.well-known/ni/sha-256-32/f4OxZQ?ct=text/plain, then hash what you get and make sure the first 32 bits of that hash match the base64url-decoded value from the name!

Simple!
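
A minimal sketch of both halves, assuming a shell with GNU coreutils (sha256sum, base64, od) and the sha-256 family only. The function names ni_to_url, ni_value_to_hex and ni_verify are made up for this post and are not part of the RFC:

```shell
# Map ni://AUTHORITY/ALG;VALUE[?QUERY] to its .well-known HTTPS URL.
ni_to_url () (
    rest=${1#ni://}         # example.com/sha-256-32;f4OxZQ?ct=text/plain
    authority=${rest%%/*}   # example.com
    path=${rest#*/}         # sha-256-32;f4OxZQ?ct=text/plain
    alg=${path%%;*}         # sha-256-32
    val=${path#*;}          # f4OxZQ?ct=text/plain
    printf 'https://%s/.well-known/ni/%s/%s\n' "$authority" "$alg" "$val"
)

# Decode an unpadded base64url value to lowercase hex.
ni_value_to_hex () (
    val=$1
    # Restore the '=' padding that ni: names drop.
    case $(( ${#val} % 4 )) in
        2) val="$val==" ;;
        3) val="$val=" ;;
    esac
    printf '%s' "$val" | tr '_-' '/+' | base64 -d | od -An -tx1 | tr -d ' \n'
)

# usage: ni_verify VALUE < data
# Succeeds when the SHA-256 of stdin starts with the bytes encoded in VALUE.
ni_verify () (
    want=$(ni_value_to_hex "$1")
    [ -n "$want" ] || exit 1
    got=$(sha256sum | cut -c1-"${#want}")
    [ "$got" = "$want" ]
)

ni_to_url 'ni://example.com/sha-256-32;f4OxZQ?ct=text/plain'
# https://example.com/.well-known/ni/sha-256-32/f4OxZQ?ct=text/plain

printf 'Hello World!' | ni_verify f4OxZQ && echo 'hash matches'
# hash matches
```

In real use you would pipe the fetched body (from curl or similar) into ni_verify instead of a literal string.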

I'm not going into the binary and human speak-able versions. I don't want to suffer. FINE.

What is the binary format?

A short pseudostruct:

2bit Reserved 0b00
6bit Suite_ID
*    Hash_Value

So, basically a multihash, but of a less futureproof style. The one use case provided is suite 3, sha-256-120 (120 bits in length): with the header byte it totals 128 bits, so it can be stored in a system that expects 128-bit hashes. Great, but what does that get you? If you were migrating to this, you would probably want to leave room to migrate again in the future. However, no other registered suite is 120 bits in length, so if you need to swap down the line, you will be doing a whole migration again!
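
To make the layout concrete, here is a rough sketch that prints the binary form of a sha-256-120 name as hex, so it stays readable in a terminal. The function name ni_bin_sha256_120 is my own invention, and it assumes sha256sum from coreutils:

```shell
ni_bin_sha256_120 () {
    # Header byte: 2 reserved bits (00) followed by the 6-bit suite ID 3 = 0x03.
    printf '03'
    # First 120 bits of the digest = 15 bytes = 30 hex digits.
    sha256sum | cut -c1-30
}

printf 'Hello World!' | ni_bin_sha256_120
# 037f83b1657ff1fc53b92dc18148a1d6
```

That is 32 hex digits, i.e. exactly 16 bytes, which is the whole point of suite 3.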

Additionally, hashes generally lend themselves to sharding on prefix, e.g. 123def goes into box 12, 456abc goes into box 45. With prefixed hashes, this approach is completely unusable (same goes for multihash!), so a system that buckets by prefix and migrates to RFC 6920 or multihash will probably explode!
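
A small illustration of the problem, under the assumption that buckets are picked from the first byte of the name. Both helper names (bucket_of, prefixed_name) are hypothetical, and the names are shown as hex rather than raw bytes:

```shell
# Naive sharding: the bucket is the first byte (two hex digits) of the name.
bucket_of () {
    printf '%s\n' "$1" | cut -c1-2
}

# Hex form of the RFC 6920 binary name for suite 3 (sha-256-120).
prefixed_name () {
    printf '03%s\n' "$(sha256sum | cut -c1-30)"
}

for data in 'Hello World!' 'abc'; do
    bare=$(printf '%s' "$data" | sha256sum | cut -c1-64)
    prefixed=$(printf '%s' "$data" | prefixed_name)
    printf '%s: bare bucket %s, prefixed bucket %s\n' \
        "$data" "$(bucket_of "$bare")" "$(bucket_of "$prefixed")"
done
# Hello World!: bare bucket 7f, prefixed bucket 03
# abc: bare bucket ba, prefixed bucket 03
```

Every prefixed name lands in bucket 03, because the first byte is the suite header and not part of the digest.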

Remember, kids: If bucketing/sharding, hash the input regardless using something like xxhash, that way you'll be futureproof! (Until a future where people want longer digests again)

What is the human-speakable format? (nih:)

Not simple! We now add a check digit! This check digit algorithm is as defined in ISO/IEC 7812-1:2017, and I'm not shelling out CHF61 on writing my first article.

(Thankfully check digits are optional, so I can keep writing.)

Taking the ni: of Hello World! — ni://example.com/sha-256-32;f4OxZQ?ct=text/plain — we jiggle things around (also known as just using the hex value instead of base64 encoding) and add some dashes and get:

nih:6;7f8-3b1-65
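
As a sketch of the jiggling, assuming sha256sum from coreutils and skipping both the optional check digit and the optional dashes: the value is the first four digest bytes in order (7f 83 b1 65 for Hello World!), written as hex after the suite ID. The function name nih_sha256_32 is made up for illustration:

```shell
nih_sha256_32 () {
    # Suite 6 is sha-256-32: keep the first 32 bits = 8 hex digits.
    printf 'nih:6;%s\n' "$(sha256sum | cut -c1-8)"
}

printf 'Hello World!' | nih_sha256_32
# nih:6;7f83b165
```

If you hexdump the raw digest yourself, use a byte-wise dump such as od -An -tx1. A 16-bit word dump like od -x swaps each byte pair on little-endian machines and gives you the wrong digits.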

Nice! Except... "colon" and "semicolon" are liable to be confused, and its hexadecimal digits, "bee cee dee ee", are not going to be fun, though I guess both sides can learn the NATO phonetic alphabet, it's pretty easy.

A joke from the prior revision of the article that I'm too lazy to segue to:

[...] or human speak-able form for slowly reading to your grandma over the phone so she can finally play that cracked copy of starcraft that you've been nagging her to look at!

Why not a URN?

Uniform Resource Names are neat for naming things, however they are not neat for resolving things. If you have urn:isbn:978-1645679158, what do you do with that?

That being said, maybe a pairing would have been appropriate, a urn:example:ni:sha-256-32:f4OxZQ perhaps. Would a truncated form be okay in that case? Would the fact that multiple files can have the same hash also be okay? It's not a perfect mapping after all, just a statistically comfortable one.

Why not a CID?

CIDs, as used by BitTorrent and IPFS, have a vastly different concern: how best to break a file down and reconstitute it at the other end. IPFS (as far as I know) is relatively consistent in how its breaking/unbreaking process is implemented. BitTorrent, however, is not so consistent, with files able to span multiple chunks, and chunks able to span multiple files. This makes an implementation go from 5 lines of work to an entire easy afternoon of muddling through specifications.

Why not RFC 6920?

However, with prior-mentioned CIDs, you gain the ability to cryptographically address semi-consistent parts of the whole, making replication of such data over a network much easier compared to RFC 6920, which depends on its transport.

And while CDNs do support caching subsets of a whole object, there is no easy way for an RFC 6920 client to check whether its partial content is valid or not, as the name uses a hash over the whole data, and not over a tree of parts.

Implementations in the wild

Hah! None so far! So watch this space. If I find any still in use, I'll update this article with them.

Hello World!

To illustrate how one could generate a ni: for a given piece of data:

rfc6920 () {
    printf 'ni:///sha-256;'
    sha256sum -b |
    tr -cd '[:xdigit:]' |   # keep only the hex digest
    xxd -r -p |             # hex to raw bytes (or bring your own hex2bin)
    base64 |
    tr '/+' '_-' |          # switch to the base64url alphabet
    tr -d =                 # ni: values drop the padding
}

printf 'Hello World!' | rfc6920
# ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk
