← BackJan 7, 2026

High-Performance GPU Cuckoo Filter

GPU-Accelerated Cuckoo Filter A high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter". O...

GPU-Accelerated Cuckoo Filter A high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter". Overview This library provides a GPU-accelerated Cuckoo Filter implementation optimized for high-throughput batch operations. Cuckoo Filters are space-efficient probabilistic data structures that support insertion, lookup, and deletion operations with a configurable false positive rate. Features CUDA-accelerated batch insert, lookup, and delete operations Configurable fingerprint size and bucket size Multiple eviction policies (DFS, BFS) Sorted insertion mode for improved memory coalescing Multi-GPU support via gossip IPC support for cross-process filter sharing Header-only library design Performance Benchmarks at 80% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s). The GPU Cuckoo Filter is compared against: CPU Cuckoo Filter Bulk Two-Choice Filter (TCF) GPU Counting Quotient Filter (GQF) GPU Blocked Bloom Filter L2-Resident (4M items, ~8 MiB) Comparison Insert Query Delete GPU vs CPU Cuckoo 360× faster 973× faster N/A Cuckoo vs TCF 6× faster 42× faster 100× faster Cuckoo vs GQF 585× faster 6× faster 273× faster Cuckoo vs Bloom 0.6× (slower) 1.4× faster N/A DRAM-Resident (268M items, ~512 MiB) Comparison Insert Query Delete GPU vs CPU Cuckoo 583× faster 1504× faster N/A Cuckoo vs TCF 1.9× faster 11.3× faster 35.3× faster Cuckoo vs GQF 9.6× faster 2.6× faster 3.8× faster Cuckoo vs Bloom 0.7× (slower) 1.0× (equal) N/A NoteFor a more comprehensive evaluation including additional systems and analysis, see the accompanying thesis. Requirements CUDA Toolkit (>= 12.9) C++20 compatible compiler Meson build system (>= 1.3.0) Building meson setup build meson compile -C build Benchmarks and tests are built by default. To disable them: meson setup build -DBUILD_BENCHMARKS=false -DBUILD_TESTS=false Usage #include // Configure the filter: key type, fingerprint bits, max evictions, block size, bucket size using Config = CuckooConfig; // Create a filter with the desired capacity CuckooFilter filter(1 << 20); // capacity for ~1M items // Insert keys (d_keys is a device pointer) filter.insertMany(d_keys, numKeys); // Or use sorted insertion filter.insertManySorted(d_keys, numKeys); // Check membership filter.containsMany(d_keys, d_results, numKeys); // Delete keys filter.deleteMany(d_keys, d_results, numKeys); Configuration Options The CuckooConfig template accepts the following parameters: Parameter Description Default T Key type - bitsPerTag Fingerprint size in bits (8, 16, 32) - maxEvictions Maximum eviction attempts before failure 500 blockSize CUDA block size 256 bucketSize Slots per bucket (must be power of 2) 16 AltBucketPolicy Alternate bucket calculation policy XorAltBucketPolicy evictionPolicy Eviction strategy (DFS or BFS) BFS WordType Atomic type (uint32_t or uint64_t) uint64_t Multi-GPU Support For workloads that exceed single GPU capacity: #include CuckooFilterMultiGPU filter(numGPUs, totalCapacity); filter.insertMany(h_keys, numKeys); filter.containsMany(h_keys, h_results, numKeys); Project Structure include/ - Header files CuckooFilter.cuh - Main filter implementation CuckooFilterMultiGPU.cuh - Multi-GPU implementation CuckooFilterIPC.cuh - IPC support bucket_policies.cuh - Alternative bucket policies helpers.cuh - Helper functions src/ - Example applications benchmark/ - benchmarks tests/ - Unit tests scripts/ - Scripts for running/plotting benchmarks