Jan 6, 2026

# Fast Containers

High-performance header-only container library for C++23 on x86-64.


## What's Included

- **B+Tree** (`kressler::fast_containers::btree`) - Cache-friendly B+tree with SIMD search and hugepage support
- **Dense Map** (`kressler::fast_containers::dense_map`) - Fixed-size sorted array used internally by btree nodes
- **Hugepage Allocators** - Pooling allocators that reduce TLB misses and allocation overhead
  - `HugePageAllocator` - Single-size allocator for uniform allocations
  - `MultiSizeHugePageAllocator` - Multi-size pooling for variable-sized allocations (e.g., Abseil btree)
  - `PolicyBasedHugePageAllocator` - Advanced control with shared pools

## Why This Library?

The B+tree implementation provides significant performance improvements over industry standards for large trees. For some workloads with large trees, we've observed:

- vs. Abseil B+tree: 2-5Ɨ faster across insert/find/erase operations
- vs. `std::map`: 2-5Ɨ faster across insert/find/erase operations

See benchmark results for detailed performance analysis.

**Important qualifications:**

- Performance advantages are most significant for large tree sizes, where TLB misses and allocation costs dominate
- Benchmarks currently focus on 10M-element trees; smaller tree sizes have not been comprehensively tested
- Results are specific to the tested configurations (8-byte keys, 32-byte and 256-byte values)

**Key advantages over Abseil's btree:**

- Hugepage allocator integration: 3-5Ɨ speedup from reduced TLB misses and pooled allocations
- SIMD-accelerated search: 3-10% faster node searches using AVX2 instructions
- Tunable node sizes: optimize cache behavior for your specific key/value sizes

## Notes

### Work in progress

This is a work in progress. I don't have plans for major changes to the B+tree currently, but I am actively cleaning up the implementation.

### Platforms

This library is built and tested only on Linux, on x86-64 CPUs with AVX2 support. In theory it could be built for Windows, though that hasn't been tested. The SIMD implementations are x86-64 specific.
Timing in the custom benchmarks is also x86-64 specific (via use of `rdtscp`).

## History/Motivations

This project started as an exploration of using AI agents for software development. Based on experience tuning systems that use Abseil's B+tree, I was curious whether performance could be improved through SIMD instructions, a customized allocator, and tunable node sizes. Claude proved surprisingly adept at helping implement this quickly, and the resulting B+tree showed compelling performance improvements, so I'm making it available here.

## Quick Start

### Installation

**Prerequisites:**

- C++23 compiler (GCC 14+, Clang 19+)
- CMake 3.30+
- AVX2-capable CPU (Intel Haswell 2013+, AMD Excavator 2015+)

**Include in your project:**

1. Add as a git submodule:

   ```sh
   git submodule add https://github.com/kressler/fast-containers.git third_party/fast-containers
   ```

2. Link in `CMakeLists.txt`:

   ```cmake
   add_subdirectory(third_party/fast-containers)
   target_link_libraries(your_target PRIVATE fast_containers::fast_containers)
   ```

3. Include headers:

   ```cpp
   #include <fast_containers/btree.hpp>
   #include <fast_containers/hugepage_allocator.hpp>
   ```

### Usage

#### Basic B+tree Example

```cpp
#include <fast_containers/btree.hpp>

#include <cstdint>
#include <iostream>

int main() {
  // Create a btree mapping int64_t keys to int32_t values
  // Using defaults: auto-computed node sizes, Linear search
  using Tree = kressler::fast_containers::btree<int64_t, int32_t>;
  Tree tree;

  // Insert key-value pairs
  tree.insert(42, 100);
  tree.insert(17, 200);
  tree.insert(99, 300);

  // Find a value
  auto it = tree.find(42);
  if (it != tree.end()) {
    std::cout << "Found: " << it->second << std::endl;  // Prints: 100
  }

  // Iterate over all elements (sorted by key)
  for (const auto& [key, value] : tree) {
    std::cout << key << " -> " << value << std::endl;
  }

  // Erase an element
  tree.erase(17);

  // Check size
  std::cout << "Size: " << tree.size() << std::endl;  // Prints: 2
}
```

#### B+tree with Hugepage Allocator (Recommended for Performance)

```cpp
#include <fast_containers/btree.hpp>
#include <fast_containers/hugepage_allocator.hpp>

#include <cassert>
#include <cstdint>

int main() {
  // Use the hugepage allocator for 3-5x performance improvement
  // Allocator type must match the btree's value_type (std::pair<const Key, Value>)
  using Allocator = kressler::fast_containers::HugePageAllocator<
      std::pair<const int64_t, int32_t>>;
  using Tree = kressler::fast_containers::btree<
      int64_t,                                      // Key type
      int32_t,                                      // Value type
      96,                                           // Leaf node size
      128,                                          // Internal node size
      std::less<int64_t>,                           // Comparator
      kressler::fast_containers::SearchMode::SIMD,  // SIMD search
      Allocator>;                                   // Hugepage allocator

  // Tree will default-construct the allocator (256MB initial pool, 64MB growth)
  // The btree automatically creates separate pools for leaf and internal nodes
  Tree tree;

  // Insert 10 million elements - hugepages reduce TLB misses
  for (int64_t i = 0; i < 10'000'000; ++i) {
    tree.insert(i, i * 2);
  }

  // Find operations are much faster with hugepages
  auto it = tree.find(5'000'000);
  assert(it != tree.end() && it->second == 10'000'000);
}
```

#### Advanced: Policy-Based Allocator for Shared Pools

For multiple trees or fine-grained control over pool sizes, use `PolicyBasedHugePageAllocator`:

```cpp
#include <fast_containers/btree.hpp>
#include <fast_containers/policy_based_hugepage_allocator.hpp>

#include <cstdint>
#include <memory>

int main() {
  // Create separate pools for leaf and internal nodes with custom sizes
  auto leaf_pool = std::make_shared<kressler::fast_containers::HugePagePool>(
      512 * 1024 * 1024, true);  // 512MB for leaves
  auto internal_pool = std::make_shared<kressler::fast_containers::HugePagePool>(
      256 * 1024 * 1024, true);  // 256MB for internals

  // Create policy that routes types to appropriate pools
  kressler::fast_containers::TwoPoolPolicy policy{leaf_pool, internal_pool};

  // Create allocator with the policy
  using Allocator = kressler::fast_containers::PolicyBasedHugePageAllocator<
      std::pair<const int64_t, int32_t>,
      kressler::fast_containers::TwoPoolPolicy>;
  Allocator alloc(policy);

  using Tree = kressler::fast_containers::btree<
      int64_t, int32_t, 96, 128, std::less<int64_t>,
      kressler::fast_containers::SearchMode::SIMD, Allocator>;

  // Multiple trees can share the same pools
  Tree tree1(alloc);
  Tree tree2(alloc);

  // Both trees share leaf_pool for leaves and internal_pool for internals
  tree1.insert(1, 100);
  tree2.insert(2, 200);
}
```
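The hugepage allocators used in the examples above need hugepages reserved at the OS level, and fall back to regular pages when none are available. On Linux this is a one-line `sysctl`; the page count below is only an illustrative value, so size it to cover your pools:

```shell
# Reserve 2048 2MB hugepages (4GB total); adjust the count to cover your pools
sudo sysctl -w vm.nr_hugepages=2048

# Verify the reservation
grep HugePages /proc/meminfo

# Persist the setting across reboots
echo 'vm.nr_hugepages = 2048' | sudo tee -a /etc/sysctl.conf
```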
#### Advanced: Multi-Size Hugepage Allocator for Variable-Sized Allocations

For containers that allocate variable-sized objects (like `absl::btree_map`), use `MultiSizeHugePageAllocator`:

```cpp
#include <fast_containers/multi_size_hugepage_allocator.hpp>

#include <absl/container/btree_map.h>

#include <array>
#include <cstdint>

int main() {
  // absl::btree_map allocates different-sized nodes (leaf vs internal)
  // MultiSizeHugePageAllocator routes allocations to size-class-specific pools
  using ValueType = std::array<char, 32>;
  using Allocator = kressler::fast_containers::MultiSizeHugePageAllocator<
      std::pair<const int64_t, ValueType>>;

  // Helper function creates allocator with default settings
  // - 64MB initial size per size class
  // - Hugepages enabled
  // - 64MB growth size per size class
  auto alloc = kressler::fast_containers::make_multi_size_hugepage_allocator<
      std::pair<const int64_t, ValueType>>();

  // Create absl::btree_map with hugepage allocator
  absl::btree_map<int64_t, ValueType, std::less<int64_t>, Allocator> tree(alloc);

  // Insert 1 million elements - multiple size classes created automatically
  for (int64_t i = 0; i < 1'000'000; ++i) {
    tree[i] = ValueType{};
  }

  // Find operations benefit from reduced TLB misses
  auto it = tree.find(500'000);
}
```

**How it works:**

- Allocations are routed to size classes based on the requested size
- Size classes: 0-512B (64B alignment), 513-2048B (256B alignment), 2049+B (power-of-2)
- Each size class maintains its own `HugePagePool` with uniform-sized blocks
- Pools are created on demand as different sizes are requested
- Provides a 2-3Ɨ performance improvement for `absl::btree_map` over the standard allocator

**When to use each allocator:**

- `HugePageAllocator`: Simple, automatic separate pools per type (recommended for our btree)
- `MultiSizeHugePageAllocator`: Variable-sized allocations (e.g., `absl::btree_map`, other STL containers with allocator support)
- `PolicyBasedHugePageAllocator`: Fine-grained control, shared pools across trees, custom pool sizes

## API Overview

The `btree` class provides an API similar to `std::map`:

**Insertion:**

- `std::pair<iterator, bool> insert(const Key& key, const Value& value)`
- `std::pair<iterator, bool> emplace(Args&&... args)`
- `Value& operator[](const Key& key)`

**Lookup:**

- `iterator find(const Key& key)` / `const_iterator find(const Key& key) const`
- `iterator lower_bound(const Key& key)`
- `iterator upper_bound(const Key& key)`
- `std::pair<iterator, iterator> equal_range(const Key& key)`

**Removal:**

- `size_type erase(const Key& key)`
- `iterator erase(iterator pos)`

**Iteration:**

- `iterator begin()` / `const_iterator begin() const`
- `iterator end()` / `const_iterator end() const`
- Range-based for loops supported

**Capacity:**

- `size_type size() const`
- `bool empty() const`
- `void clear()`

**Other:**

- `void swap(btree& other) noexcept`
- `key_compare key_comp() const`
- `value_compare value_comp() const`

### Template Parameters

```cpp
template <
    typename Key,
    typename Value,
    std::size_t LeafNodeSize = default_leaf_node_size(),
    std::size_t InternalNodeSize = default_internal_node_size(),
    typename Compare = std::less<Key>,
    SearchMode SearchModeT = SearchMode::Linear,
    typename Allocator = std::allocator<std::pair<const Key, Value>>>
class btree;
```

**Parameters:**

- `Key`, `Value`: The key and value types
- `LeafNodeSize`: Number of key-value pairs per leaf node
  - Default: auto-computed heuristic targeting ~2KB (32 cache lines)
  - Formula: `2048 / (sizeof(Key) + sizeof(Value))`, rounded to a multiple of 8, clamped to [8, 64]
  - Manual tuning: larger values (64-96) for small values, smaller values (8-16) for large values
- `InternalNodeSize`: Number of child pointers per internal node
  - Default: auto-computed heuristic targeting ~1KB (16 cache lines)
  - Formula: `1024 / (sizeof(Key) + sizeof(void*))`, rounded to a multiple of 8, clamped to [16, 64]
  - Generally leave at the default (internal nodes store only 8-byte pointers)
- `Compare`: Comparison function (must satisfy `ComparatorCompatible`)
  - Default: `std::less<Key>`
  - Also supports `std::greater<Key>` for descending order
- `SearchMode`: How to search within a node
  - Default: `SearchMode::Linear` (scalar linear search)
  - `SearchMode::SIMD`: AVX2-accelerated search (3-10% faster; requires an AVX2 CPU and SIMD-compatible keys: `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `float`, `double`)
  - `SearchMode::Binary`: Binary search
- `Allocator`: Memory allocation strategy
  - Default:
    `std::allocator<std::pair<const Key, Value>>`
  - Recommended for performance: `HugePageAllocator<std::pair<const Key, Value>>` for working sets >1GB (3-5Ɨ faster)
    - Automatically creates separate pools for leaf and internal nodes via rebind
    - Default: 256MB initial pool, 64MB growth per pool
    - Requires hugepages configured: `sudo sysctl -w vm.nr_hugepages=<count>`
    - Falls back to regular pages if unavailable
  - For variable-sized allocations: `MultiSizeHugePageAllocator<std::pair<const Key, Value>>`
    - Routes allocations to size-class-specific pools
    - Use with `absl::btree_map` or other containers that allocate different-sized objects
    - Provides a 2-3Ɨ speedup for Abseil btree over the standard allocator
  - Advanced control: `PolicyBasedHugePageAllocator<std::pair<const Key, Value>, TwoPoolPolicy>`
    - Fine-grained control over pool sizes
    - Share pools across multiple trees
    - Separate pools for leaf and internal nodes with custom sizes

## Performance

Benchmarks comparing against Abseil's `btree_map` and `std::map` are available in `results/btree_benchmark_results.md`.

### Performance Highlights (8-byte keys, 32-byte values, 10M elements)

**Our btree with hugepages (`btree_8_32_96_128_simd_hp`):**

- INSERT P99.9: 1,023 ns
- FIND P99.9: 864 ns
- ERASE P99.9: 1,086 ns

**Our btree with standard allocator (`btree_8_32_96_128_simd`):**

- INSERT P99.9: 3,155 ns (3.1Ɨ slower than with hugepages)
- FIND P99.9: 950 ns (1.1Ɨ slower than with hugepages)
- ERASE P99.9: 1,323 ns (1.2Ɨ slower than with hugepages)

**vs. Abseil btree with hugepages (`absl_8_32_hp`, using `MultiSizeHugePageAllocator`):**

- INSERT P99.9: 1,401 ns (27% slower)
- FIND P99.9: 1,190 ns (38% slower)
- ERASE P99.9: 1,299 ns (20% slower)

**vs. Abseil btree with standard allocator (`absl_8_32`):**

- INSERT P99.9: 3,287 ns (3.2Ɨ slower)
- FIND P99.9: 1,342 ns (55% slower)
- ERASE P99.9: 1,679 ns (55% slower)

**vs. `std::map` (`map_8_32`):**

- INSERT P99.9: 3,587 ns (3.5Ɨ slower)
- FIND P99.9: 2,312 ns (2.7Ɨ slower)
- ERASE P99.9: 2,253 ns (2.1Ɨ slower)

### Key Insights

Hugepage allocators provide massive performance improvements:

- Our btree: 2-3Ɨ faster with hugepages vs. standard allocator
- Abseil btree: 2Ɨ faster with `MultiSizeHugePageAllocator` vs.
  standard allocator
- Critical for large working sets (>1M elements)

Our implementation maintains significant advantages even in a fair comparison:

- 20-67% faster than Abseil btree even when both use hugepage allocators
- Advantages come from SIMD search, tunable node sizes, and optimized bulk transfers
- The performance gap widens with larger values (256 bytes: 21-135% faster)

Performance varies by tree size:

- Large trees (10M elements): our btree dominates all metrics
- Small trees (10K elements): competition intensifies; `std::map` becomes viable for some workloads
- See benchmark results for detailed analysis

The hugepage allocator is the single most important optimization, providing benefits by reducing TLB misses (helps find operations) and making allocations extremely cheap through pooling (helps insert/erase operations).

## Building from Source

### Using CMake Presets (Recommended)

```sh
# List available presets
cmake --list-presets

# Configure, build, and test in one workflow
cmake --preset release
cmake --build --preset release
ctest --preset release

# Common presets:
cmake --preset debug            # Debug build
cmake --preset release          # Release with AVX2 (default)
cmake --preset asan             # AddressSanitizer build
cmake --preset release-no-avx2  # Release without AVX2
```

### Manual Build (Alternative)

```sh
# Clone with submodules
git clone --recursive https://github.com/kressler/fast-containers.git
cd fast-containers

# Configure
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build

# Run tests
ctest --test-dir build --output-on-failure
```

### Build Options

| Option | Default | Description |
|---|---|---|
| `ENABLE_AVX2` | ON (Release), OFF (Debug) | Enable AVX2 SIMD optimizations |
| `ENABLE_ASAN` | OFF | Enable AddressSanitizer |
| `ENABLE_ALLOCATOR_STATS` | OFF | Enable allocator statistics |
| `ENABLE_LTO` | ON | Enable Link-Time Optimization |
| `ENABLE_NUMA` | Auto-detected | Enable NUMA support (requires libnuma) |

## Development Setup

Clone with submodules:

```sh
git clone --recursive https://github.com/kressler/fast-containers.git
cd fast-containers
```

One-time development setup:

```sh
./setup-dev.sh
```

This installs pre-commit hooks and configures clang-tidy.

### Code Quality Tools

Automatic formatting and checks (via pre-commit hook):

```sh
git commit  # Automatically formats code and runs clang-tidy
```

The pre-commit hook will:

- Format all staged C++ files with clang-format
- Check production headers with clang-tidy
- Fail the commit if warnings are found
- Auto-create build directories if missing

Manual formatting:

```sh
cmake --build build --target format
```

Manual static analysis:

```sh
cmake --build build --target clang-tidy
# Or manually:
clang-tidy-19 -p cmake-build-clang-tidy include/fast_containers/*.hpp
```

Requirements:

- clang-format (for code formatting)
- clang-tidy-19 (for static analysis)
- cmake (to auto-create build directories)

Bypass hook (when needed):

```sh
git commit --no-verify
```

### Development Workflow

1. Make your changes
2. Build and test: `cmake --build build && ctest --test-dir build`
3. Commit (auto-formatted and checked):

   ```sh
   git add .
   git commit -m "Your changes"  # Pre-commit hook runs automatically
   ```

### Code Conventions

- Follow the Google C++ Style Guide (enforced by clang-format)
- Use C++23 features
- Write tests for new functionality using Catch2
- Production code must be clang-tidy clean (enforced in CI and pre-commit)
- Run `cmake --build build --target format` before submitting PRs

## Project Structure

```
.
ā”œā”€ā”€ include/
│   └── fast_containers/                # Public header files
│       ā”œā”€ā”€ btree.hpp, btree.ipp
│       ā”œā”€ā”€ dense_map.hpp, dense_map.ipp
│       ā”œā”€ā”€ hugepage_allocator.hpp
│       ā”œā”€ā”€ multi_size_hugepage_allocator.hpp
│       ā”œā”€ā”€ multi_size_hugepage_pool.hpp
│       ā”œā”€ā”€ policy_based_hugepage_allocator.hpp
│       └── hugepage_pool.hpp
ā”œā”€ā”€ tests/                              # Unit tests (Catch2)
│   ā”œā”€ā”€ test_btree.cpp
│   ā”œā”€ā”€ test_dense_map.cpp
│   ā”œā”€ā”€ test_hugepage_allocator.cpp
│   └── test_policy_based_allocator.cpp
ā”œā”€ā”€ src/
│   ā”œā”€ā”€ benchmarks/                     # Google Benchmark performance tests
│   │   ā”œā”€ā”€ dense_map_search_benchmark.cpp
│   │   └── hugepage_allocator_benchmark.cpp
│   └── binary/                         # Standalone benchmark executables
│       ā”œā”€ā”€ btree_benchmark.cpp
│       └── btree_stress.cpp
ā”œā”€ā”€ scripts/
│   └── interleaved_btree_benchmark.py  # A/B testing harness
ā”œā”€ā”€ results/
│   └── btree_benchmark_results.md      # Performance analysis
ā”œā”€ā”€ third_party/                        # Git submodules
│   ā”œā”€ā”€ catch2/                         # Unit testing framework
│   ā”œā”€ā”€ benchmark/                      # Google Benchmark
│   ā”œā”€ā”€ histograms/                     # Latency histogram library
│   ā”œā”€ā”€ abseil-cpp/                     # Comparison baseline
│   ā”œā”€ā”€ lyra/                           # Command-line parsing
│   └── unordered_dense/                # Dense hash map
ā”œā”€ā”€ hooks/                              # Git hooks (install with setup-dev.sh)
│   └── pre-commit                      # Auto-format and clang-tidy
└── CMakeLists.txt                      # Build configuration
```

## Tech Stack

- **Language:** C++23
- **Build System:** CMake 3.30+
- **Testing:** Catch2 v3.11.0
- **Code Formatting:** clang-format (Google C++ Style)
- **Static Analysis:** clang-tidy-19