Just finished my thesis

Posted on 2015-05-22

Against all procrastination, I finished my bachelor’s thesis in time. Huzzah!

It’s about how to make dictionaries fast in the main memory of a real computer. CPUs are slower than RAMs, so in order to have a fast data structure, you need to access main memory as little as possible, and your program needs to be friendly to the many caches between the CPU and RAM.

If you’d like to have a look, the source code of the thesis is at https://github.com/MichalPokorny/thesis.git. Aside from TeX, it’s a bunch of data structures implemented in ye olde C. If you prefer not building it yourself, you can read it here.

There are basically two kinds of dictionaries: ordered and unordered. The difference is basically the same as between C++ std::map and std::hash_map: while unordered dictionaries allow only reading the value for a key, inserting a pair and deleting a pair, ordered dictionaries also let you query for the previous smaller and next larger stored key of a given key.

When you just need an unordered dictionary, probably the most common choice is one of the many variants of a hash table. I had a look at open addressing with linear probing and cuckoo hashing. Both have interesting properties.

Ordered dictionaries you usually implement via some kind of a tree, frequently a binary one. Trees usually either have a balancing scheme (like AVL or red-black trees, or B-trees), or they are self-adjusting. The most well-known self-adjusting data structure is a splay tree.

Splay trees in particular are interesting, since they can specialize their structure to the access pattern you use. If you access a few elements very often, splay trees will keep them high up in the tree, and the accesses will be pretty fast. In some use cases, this makes splay trees are faster than balanced binary search trees.

I researched a bit and found out that some very smart people created self-adjusting structures that are (in theory) faster than splay trees. One of the reasons why splay trees are slow is that they are binary. When you load a single node, it costs you more or less one cache miss, and it sends you to the left or right child. In B-trees, your nodes are large, and you can load more useful routing information in fewer cache misses: for example, in 2-3-4 trees, a loaded node can send you into 2, 3, or 4 different children (by the way, I measured 2-3-4 trees to have an effective branching factor of something like 2.59). The self-adjusting data structures I implemented are named “K-splay trees” and “K-forests”. Both have some interesting ideas, but unfortunately both turned our pretty slow when benchmarked. Oh well, can’t win everything.

I also had a look at something called a cache-oblivious B-tree, which is basically a B-tree that automagically optimizes to whatever memory hierarchy you have, even if you don’t know that in advance. The data structure is somewhat more complicated than simple B-trees, but surprisingly, it’s very close in speed to hand-tuned B-trees, which is pretty neat.

I have some more results in the thesis, for example there’s a measurement of how do the various dictionaries do on recordings of hash table operation from a Firefox session or in a simulated database.