High-Performance GPU Cuckoo Filter
GPU-Accelerated Cuckoo Filter A high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter". O...
GPU-Accelerated Cuckoo Filter
A high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter".
Overview
This library provides a GPU-accelerated Cuckoo Filter implementation optimized for high-throughput batch operations. Cuckoo Filters are space-efficient probabilistic data structures that support insertion, lookup, and deletion operations with a configurable false positive rate.
Features
CUDA-accelerated batch insert, lookup, and delete operations
Configurable fingerprint size and bucket size
Multiple eviction policies (DFS, BFS)
Sorted insertion mode for improved memory coalescing
Multi-GPU support via gossip
IPC support for cross-process filter sharing
Header-only library design
Performance
Benchmarks at 80% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s). The GPU Cuckoo Filter is compared against:
CPU Cuckoo Filter
Bulk Two-Choice Filter (TCF)
GPU Counting Quotient Filter (GQF)
GPU Blocked Bloom Filter
L2-Resident (4M items, ~8 MiB)
Comparison
Insert
Query
Delete
GPU vs CPU Cuckoo
360× faster
973× faster
N/A
Cuckoo vs TCF
6× faster
42× faster
100× faster
Cuckoo vs GQF
585× faster
6× faster
273× faster
Cuckoo vs Bloom
0.6× (slower)
1.4× faster
N/A
DRAM-Resident (268M items, ~512 MiB)
Comparison
Insert
Query
Delete
GPU vs CPU Cuckoo
583× faster
1504× faster
N/A
Cuckoo vs TCF
1.9× faster
11.3× faster
35.3× faster
Cuckoo vs GQF
9.6× faster
2.6× faster
3.8× faster
Cuckoo vs Bloom
0.7× (slower)
1.0× (equal)
N/A
NoteFor a more comprehensive evaluation including additional systems and analysis, see the accompanying thesis.
Requirements
CUDA Toolkit (>= 12.9)
C++20 compatible compiler
Meson build system (>= 1.3.0)
Building
meson setup build
meson compile -C build
Benchmarks and tests are built by default. To disable them:
meson setup build -DBUILD_BENCHMARKS=false -DBUILD_TESTS=false
Usage
#include
// Configure the filter: key type, fingerprint bits, max evictions, block size, bucket size
using Config = CuckooConfig;
// Create a filter with the desired capacity
CuckooFilter filter(1 << 20); // capacity for ~1M items
// Insert keys (d_keys is a device pointer)
filter.insertMany(d_keys, numKeys);
// Or use sorted insertion
filter.insertManySorted(d_keys, numKeys);
// Check membership
filter.containsMany(d_keys, d_results, numKeys);
// Delete keys
filter.deleteMany(d_keys, d_results, numKeys);
Configuration Options
The CuckooConfig template accepts the following parameters:
Parameter
Description
Default
T
Key type
-
bitsPerTag
Fingerprint size in bits (8, 16, 32)
-
maxEvictions
Maximum eviction attempts before failure
500
blockSize
CUDA block size
256
bucketSize
Slots per bucket (must be power of 2)
16
AltBucketPolicy
Alternate bucket calculation policy
XorAltBucketPolicy
evictionPolicy
Eviction strategy (DFS or BFS)
BFS
WordType
Atomic type (uint32_t or uint64_t)
uint64_t
Multi-GPU Support
For workloads that exceed single GPU capacity:
#include
CuckooFilterMultiGPU filter(numGPUs, totalCapacity);
filter.insertMany(h_keys, numKeys);
filter.containsMany(h_keys, h_results, numKeys);
Project Structure
include/ - Header files
CuckooFilter.cuh - Main filter implementation
CuckooFilterMultiGPU.cuh - Multi-GPU implementation
CuckooFilterIPC.cuh - IPC support
bucket_policies.cuh - Alternative bucket policies
helpers.cuh - Helper functions
src/ - Example applications
benchmark/ - benchmarks
tests/ - Unit tests
scripts/ - Scripts for running/plotting benchmarks