Lessons from Hash Table Merging
Merging two hash maps seems like an O(N) operation. However, while merging millions of keys, I encountered a massive >10x performance degradation. This post explores why some of the most popular libraries fall into this trap and how to fix it. The source code is available here.
The setup
Merging hash tables may be slow
Efficient hash table merging
Solution I: salted hash function
Solution II: preallocation
Solution III: stride iteration
Other contenders
Conclusion
Appendix
The setup
I generated 3N random numbers with splitmix64, where N ranges from 7 to 31 million. I put the first N numbers in one hash map and the remaining 2N numbers in a second hash map. I then merged the second map into the first and recorded timings on an Apple M4 Pro. I used a decent hash function copied from a blog post by Chris Wellons:
static inline uint64_t hash64(uint64_t x) {
    x = (x ^ (x>>32)) * 0xd6e8feb86659fd93ULL;
    x = (x ^ (x>>32)) * 0xd6e8feb86659fd93ULL;
    return x ^ (x>>32);
}
I evaluated the following hash table libraries, all based on linear probing.
Library | Version | Language | Algorithm
--- | --- | --- | ---
Abseil | 20250814.1 | C++ | Swiss Table
Boost | 1.90.0 | C++ | Swiss Table
unordered_dense | 4.8.1 | C++ | Robin Hood hashing
khashl | r40 | C | Basic linear probing
Rust standard | 1.92.0 | Rust | Swiss Table
hashbrown | 0.16.1 | Rust | Swiss Table
Merging hash tables may be slow
The plot on the left shows the time to insert 2N keys into an empty hash map; the plot on the right shows the time to merge a map with 2N keys into another map with N keys.
Notably, Abseil and Boost are >20x slower at hash table merging than at hash table creation, while unordered_dense and khashl are not affected (I will come back to them later). To understand the cause, suppose we have hash maps h0 and h1 of the same size and we merge h1 into h0 with:
for (auto const &iter : h1)
    h0[iter.first] += iter.second;
With a typical iterator implementation, we traverse the keys in h1 in the order of their hash codes. Because h0 and h1 use the same hash function (also known as a "hasher"), the original bucket position of a key in h1 is identical to its position in h0. With the loop above, we therefore populate the beginning of h0 first. That part of h0 becomes almost fully saturated even though, on average, h0 has not reached the target load factor. Such primary clustering leads to the poor performance. All Swiss Table-based implementations, including the Rust libraries in the table, suffer from this issue when a fixed hash function is used naively.
Efficient hash table merging
Solution I: salted hash function
The figure above uses my own hasher. Abseil's default hasher wouldn't have this problem: internally, Abseil adds a random 16-bit seed to each hash table instance and mixes this seed with the hash codes. As a result, the same key is placed differently in h0 and h1 above, and the problem described above disappears.
When you have to use your own hasher for a specialized data type, you won't benefit from Abseil's built-in hash functions, but you can wrap your hasher in a class like this:
class RandHasher {
    size_t seed;
public:
    RandHasher() {
        std::random_device rd;
        seed = std::uniform_int_distribution<size_t>{}(rd);
    }
    size_t operator()(uint64_t x) const {
        return hash64(x ^ seed); // FIXME: for generic keys, mix hash(x) and seed
    }
};
The plot on the left shows the timing with a salted hasher. Hash table merging is now as fast as table creation. Salted hashers were originally introduced to alleviate hash flooding. I don't know whether their designers had table merging in mind, but they work well here anyway.
Solution II: preallocation
The second strategy is to reserve enough space in h0 to hold both h0 and h1. Although h0 will still not be populated in random order when merging h1 into it, primary clustering cannot build up because empty buckets remain plentiful. This solution is more cache-friendly and in practice faster than a salted hasher (plot on the right). The downside is that when h0 and h1 share many keys, preallocation may waste memory.
Solution III: stride iteration
unordered_dense is efficient at hash table merging because its iterator traverses keys in insertion order, not in hash order. With a random input order from splitmix64, h0 is populated randomly during merging, without primary clustering. Inspired by this observation, I modified the khashl iteration loop to (conceptually):
for (size_t i = 0, k1 = 0; i < h1.n_buckets; ++i) {
    if (h1.bucket[k1].occupied)
        h0.insert(h1.bucket[k1]);
    k1 = (k1 + 7) % h1.n_buckets; // instead of k1 = (k1 + 1) % h1.n_buckets
}
The loop is guaranteed to visit each bucket in h1 exactly once as long as the step size (7 here) and h1.n_buckets share no common factor. As shown in the first plot on the right, this strategy merges hash tables efficiently even with a fixed hasher. Thanks to better data locality, it is also faster than a salted hasher. Other traversal orders, such as quadratic iteration, are possible as well.
Other contenders
In comparison to linear probing, quadratic probing is more robust to non-random insertion order. It improves merging performance, but not as much as the solutions above. I also tried sharding the bigger hash table into smaller subtables. This helps but does not address the root cause, so I don't recommend it either.
Conclusion
Strategy | Merging performance | Complexity | Trade-offs
--- | --- | --- | ---
Salted hasher | Good | Medium | Minor overhead per hash
Preallocation | Best | Low | Higher memory floor
Stride iteration | Better | High | Requires library-level changes
Merging hash tables can be very slow when you use the wrong library or the wrong hash function. Among the three solutions discussed above, preallocation (Solution II) is probably the easiest to implement and can be the fastest when h0 and h1 have little overlap (plot below). A salted hasher (Solution I) is slower for merging, but the per-hash overhead is not a concern for most other operations; it also has other benefits and comes by default with Abseil. Stride iteration (Solution III) is a balanced strategy, but it is hard to implement in user space. There is no clear winner in my opinion.
Appendix
Widely considered the fastest hash table in C++, Boost is more than twice as fast as Abseil (plot above), the library where the Swiss Table originated. Rust with my own fixed hasher is as fast as Boost. Nonetheless, to be more robust to hash flooding, Rust uses SipHash by default, which is several times slower. When performance matters, supply your own hash function.