Tabulation Hashing Implementation and Analysis
March 30, 2023
There are a lot of hash algorithms for different use cases but tabulation hashing caught my attention years ago for its incredible simplicity and nice independence properties.
Fast and simple.
I will explore a cython implementation and see how fast really is.
What is a tabulation hashing?
The idea is to take keys of \(k\) bits and convert them into hashes of \(h\) bits.
We split the key into chunks of \(c\) bits and we construct a table of random numbers of \(h\) bits each.
This table will have one row per chunk and each row will be \(2^c\) numbers long.
The tabulation hashing is completed defined then by these parameters:
- key of \(k\) bits
- hash of \(h\) bits
- chunk of \(c\) bits
While in theory we can choose any value, the size of a chunk (\(c\)) should be small as the rows grows exponentially (\(2^c\)) and we want small tables to keep them in the cache.
And how a key is actually hashed? The following diagrams should explain it:
1.- Take the key, split it into \(k/c\) chunks.
2.- Use each to index each row of the table obtaining then \(k/c\) random numbers of \(h\) bits.
3.- Finally xor’ them and the result will be the hash of the key.
cython implementation
While we have 3 free parameters, I will restrict these:
- keys of 32 or 64 bits
- hashes of 32 or 64 bits
- chunk of 8 bits
We could code a specific implementation for each key/hash bit size but cython thankfully supports a kind of C++ templates called fused types
So, here are the definitions:
from libc.stdint cimport uint32_t, uint64_t
ctypedef fused key_dtype:
uint32_t
uint64_t
ctypedef fused hash_dtype:
uint32_t
uint64_t
Now let’s write a single generic hash function:
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.infer_types(True)
cdef inline hash_dtype c_hash_x(key_dtype k, hash_dtype[::1] table) nogil:
cdef:
hash_dtype h = 0, h0, h1, h2, h3, h4, h5, h6, h7
key_dtype k0, k1, k2, k3, k4, k5, k6, k7
k0 = k & 0x000000ff
k1 = (k >> 8) & 0x000000ff
k2 = (k >> 16) & 0x000000ff
k3 = (k >> 24) & 0x000000ff
h0 = table[k0]
h1 = table[k1 + 256]
h2 = table[k2 + 256 * 2]
h3 = table[k3 + 256 * 3]
h = h0 ^ h1 ^ h2 ^ h3
if key_dtype is uint64_t:
k4 = (k >> 32) & 0x000000ff
k5 = (k >> 40) & 0x000000ff
k6 = (k >> 48) & 0x000000ff
k7 = (k >> 56) & 0x000000ff
h4 = table[k4 + 256 * 4]
h5 = table[k5 + 256 * 5]
h6 = table[k6 + 256 * 6]
h7 = table[k7 + 256 * 7]
h ^= h4 ^ h5 ^ h6 ^ h7
return h
Few notes of the implementation:
cythonwill replacekey_dtypeandhash_dtypeby the specific types (uint32_tanduint64_t) doing a cross product of types. In this casecythonwill generate 4 functions in total.- the conditional
if key_dtype is uint64_tis resolved bycythonin compile time so it is a handy way to write conditional code without the C preprocessor. hash_dtype[::1] tabletells tocythonthat we expect a memory view, in particular an unidimensional contiguous array (faster access).- with
@cython.boundscheck(False)and@cython.wraparound(False)we will be playing with the table as a plain C array (faster access): no out of bound exceptions or fancy Pythonic indexing. - the entire
c_hash_xfunction does not use any Python object: everything is C data. In such case we can release the GIL during its execution withnogil(concurrent friendly)
c_hash_x is a cdef function that means it can be called only from within C code.
To make it accessible from Python we code:
def hash_x(key_dtype k, hash_dtype[::1] table):
return c_hash_x(k, table)
C functions are static by default so the C compiler should optimize the call. In fact, I marked c_hash_x as inline to hint the compiler.
Hash a vector of keys
The tabulation hashing shines when we hash a vector of keys as much of the table rows will be in the cache.
The cython code is:
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.infer_types(True)
@cython.cdivision(True)
def hash_vec_full(key_dtype[::1] kvec, hash_dtype[::1] table, hash_dtype[::1] out):
cdef:
hash_dtype hi, h
key_dtype k, xi
uint32_t vec_size = kvec.shape[0]
uint32_t row, shift, rebase
uint32_t nrows
with nogil:
for i in range(0, vec_size):
k = kvec[i]
h = c_hash_x[key_dtype, hash_dtype](k, table)
out[i] = h
A few notes:
- as before we require the arrays to be contiguous
- a Python function cannot be marked as
nogilbut we can mark as subpart of it with a context manager - the
c_hash_x[key_dtype, hash_dtype]tellscythonto call the specializedc_hash_xfunction for those types.
Runtime performance
I’ll analyse the 1.0.1 version of the tabulation hashing implementation.
For comparison I will use a simple linear hashing:
h = (a * k + b) (mod N)
Where a and b are random numbers of \(h\) bits and N is \(2^h\)
The beauty of this function is that is really simple and fast because we can use numpy to write a vectorized version to hash several keys in one shot.
The experiment run 1000 times for each setting and the plot shows the minimum elapsed time for each.
The error (dispersion) of the metric is too tiny to be visible but a more detailed analysis of the noise in the measurement is provided bellow.
Here the order means datasets of \(2^{order}\) random numbers to hash.
Linear hashing performs better than tabulation hashing, probably because numpy is well optimized for computing a multiplication and a addition in a vectorized fashion.
In contrast my cython implementation does not take any advantage of such instructions.
Nevertheless tabulation hashing shows a consistent performance across the different datasets.
Note also that the performance of the linear hashing gets worst on larger datasets. The why is an open question.
For reference here is the code for plotting and raw dataset.
Noise measurement evaluation
Here we can see how the elapsed times of linear hashing are left-shifted meaning that it ran faster than tabulation hashing.
But the dispersion density is much interesting!
For tabulation hashing we have very high peaks that means smaller dispersion of the measurements.
Tabulation hashing in mostly affected by cache misses as it is primary a set of memory lookups.
Linear hashing may use the ALU more intensively. Would this explain the dispersion?
I’m not sure: I disabled hyperthreading/SMT and isolated the CPUs for the experiment so nobody should had interfered with the ALU.
Conclusions and further research
Tabulation hashing is slower than linear hashing but close.
The comparison however is not taking into account that a linear hashing has less math properties than the tabulation hashing.
For example the latter is 3-way independence that some applications require and a linear hashing would not be applicable.
My cython implementation does not make any of vectorized memory access nor parallelism so it is not exploiting the full capabilities of a modern CPU.
Something that I may explore in the future.
Related tags: hash, hashing, perf, performance, cython

