Fast Containers
High-performance header-only container library for C++23 on x86-64.
What's Included
B+Tree (kressler::fast_containers::btree) - Cache-friendly B+tree with SIMD search and hugepage support
Dense Map (kressler::fast_containers::dense_map) - Fixed-size sorted array used internally by btree nodes
Hugepage Allocators - Pooling allocators that reduce TLB misses and allocation overhead
HugePageAllocator - Single-size allocator for uniform allocations
MultiSizeHugePageAllocator - Multi-size pooling for variable-sized allocations (e.g., Abseil btree)
PolicyBasedHugePageAllocator - Advanced control with shared pools
Why This Library?
The B+tree implementation provides significant performance improvements over industry-standard alternatives for large trees. For some workloads with large trees, we've observed:
vs Abseil B+tree: 2-5× faster across insert/find/erase operations
vs std::map: 2-5× faster across insert/find/erase operations
See benchmark results for detailed performance analysis.
Important qualifications:
Performance advantages are most significant for large tree sizes where TLB misses and allocation costs dominate
Benchmarks currently focus on 10M element trees; smaller tree sizes have not been comprehensively tested
Results are specific to the tested configurations (8-byte keys, 32-byte and 256-byte values)
Key advantages over Abseil's btree:
Hugepage allocator integration: 3-5× speedup from reduced TLB misses and pooled allocations
SIMD-accelerated search: 3-10% faster node searches using AVX2 instructions
Tunable node sizes: Optimize cache behavior for your specific key/value sizes
Notes
Work in progress: I don't have plans for major changes to the B+tree currently, but am actively cleaning up the implementation.
Platforms: This library is built and tested only on Linux, on x86-64 CPUs with AVX2 support. In theory it could be built for Windows, though that hasn't been tested. The SIMD implementations are x86-64 specific, and timing in the custom benchmarks is also x86-64 specific (it uses rdtscp).
History/Motivations: This project started as an exploration of using AI agents for software development. Based on experience tuning systems that use Abseil's B+tree, I was curious whether performance could be improved through SIMD instructions, a customized allocator, and tunable node sizes. Claude proved surprisingly adept at helping implement this quickly, and the resulting B+tree showed compelling performance improvements, so I'm making it available here.
Quick Start
Installation
Prerequisites:
C++23 compiler (GCC 14+, Clang 19+)
CMake 3.30+
AVX2-capable CPU (Intel Haswell 2013+, AMD Excavator 2015+)
Include in your project:
Add as a git submodule:
git submodule add https://github.com/kressler/fast-containers.git third_party/fast-containers
Link in CMakeLists.txt:
add_subdirectory(third_party/fast-containers)
target_link_libraries(your_target PRIVATE fast_containers::fast_containers)
Include headers:
#include <fast_containers/btree.hpp>
#include <fast_containers/hugepage_allocator.hpp>
Usage
Basic B+tree Example
#include <fast_containers/btree.hpp>
#include <cstdint>
#include <iostream>
int main() {
// Create a btree mapping int64_t keys to int32_t values
// Using defaults: auto-computed node sizes, Linear search
using Tree = kressler::fast_containers::btree<int64_t, int32_t>;
Tree tree;
// Insert key-value pairs
tree.insert(42, 100);
tree.insert(17, 200);
tree.insert(99, 300);
// Find a value
auto it = tree.find(42);
if (it != tree.end()) {
std::cout << "Found: " << it->second << std::endl; // Prints: 100
}
// Iterate over all elements (sorted by key)
for (const auto& [key, value] : tree) {
std::cout << key << " -> " << value << std::endl;
}
// Erase an element
tree.erase(17);
// Check size
std::cout << "Size: " << tree.size() << std::endl; // Prints: 2
}
B+tree with Hugepage Allocator (Recommended for Performance)
#include <fast_containers/btree.hpp>
#include <fast_containers/hugepage_allocator.hpp>
#include <cassert>
#include <cstdint>
int main() {
// Use the hugepage allocator for a 3-5× performance improvement
// Allocator type must match the btree's value_type (std::pair<const int64_t, int32_t>)
using Allocator = kressler::fast_containers::HugePageAllocator<
std::pair<const int64_t, int32_t>>;
using Tree = kressler::fast_containers::btree<
int64_t, // Key type
int32_t, // Value type
96, // Leaf node size
128, // Internal node size
std::less<int64_t>, // Comparator
kressler::fast_containers::SearchMode::SIMD, // SIMD search
Allocator // Hugepage allocator
>;
// Tree will default-construct the allocator (256MB initial pool, 64MB growth)
// The btree automatically creates separate pools for leaf and internal nodes
Tree tree;
// Insert 10 million elements - hugepages reduce TLB misses
for (int64_t i = 0; i < 10'000'000; ++i) {
tree.insert(i, i * 2);
}
// Find operations are much faster with hugepages
auto it = tree.find(5'000'000);
assert(it != tree.end() && it->second == 10'000'000);
}
Advanced: Policy-Based Allocator for Shared Pools
For multiple trees or fine-grained control over pool sizes, use PolicyBasedHugePageAllocator:
#include <fast_containers/btree.hpp>
#include <fast_containers/policy_based_hugepage_allocator.hpp>
#include <memory>
int main() {
// Create separate pools for leaf and internal nodes with custom sizes
auto leaf_pool = std::make_shared<kressler::fast_containers::HugePagePool>(
512 * 1024 * 1024, true); // 512MB for leaves
auto internal_pool = std::make_shared<kressler::fast_containers::HugePagePool>(
256 * 1024 * 1024, true); // 256MB for internals
// Create policy that routes types to appropriate pools
kressler::fast_containers::TwoPoolPolicy policy{leaf_pool, internal_pool};
// Create allocator with the policy
using Allocator = kressler::fast_containers::PolicyBasedHugePageAllocator<
std::pair<const int64_t, int32_t>,
kressler::fast_containers::TwoPoolPolicy>;
Allocator alloc(policy);
using Tree = kressler::fast_containers::btree<
int64_t, int32_t, 96, 128, std::less<int64_t>,
kressler::fast_containers::SearchMode::SIMD, Allocator>;
// Multiple trees can share the same pools
Tree tree1(alloc);
Tree tree2(alloc);
// Both trees share leaf_pool for leaves and internal_pool for internals
tree1.insert(1, 100);
tree2.insert(2, 200);
}
Advanced: Multi-Size Hugepage Allocator for Variable-Sized Allocations
For containers that allocate variable-sized objects (like absl::btree_map), use MultiSizeHugePageAllocator:
#include <fast_containers/multi_size_hugepage_allocator.hpp>
#include <absl/container/btree_map.h>
#include <array>
#include <cstdint>
int main() {
// absl::btree_map allocates different-sized nodes (leaf vs internal)
// MultiSizeHugePageAllocator routes allocations to size-class-specific pools
using ValueType = std::array<char, 256>; // example: 256-byte values
using Allocator = kressler::fast_containers::MultiSizeHugePageAllocator<
std::pair<const int64_t, ValueType>>;
// Helper function creates allocator with default settings
// - 64MB initial size per size class
// - Hugepages enabled
// - 64MB growth size per size class
auto alloc = kressler::fast_containers::make_multi_size_hugepage_allocator<
std::pair<const int64_t, ValueType>>();
// Create absl::btree_map with hugepage allocator
absl::btree_map<int64_t, ValueType, std::less<int64_t>, Allocator> tree(alloc);
// Insert 1 million elements - multiple size classes created automatically
for (int64_t i = 0; i < 1'000'000; ++i) {
tree[i] = ValueType{};
}
// Find operations benefit from reduced TLB misses
auto it = tree.find(500'000);
}
How it works:
Allocations are routed to size classes based on requested size
Size classes: 0-512B (64B alignment), 513-2048B (256B alignment), 2049+B (power-of-two); see the sketch after this list
Each size class maintains its own HugePagePool with uniform-sized blocks
Pools created on-demand as different sizes are requested
Provides 2-3× performance improvement for absl::btree_map over the standard allocator
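For intuition, the size-class routing listed above can be sketched as a small helper. The function below is a hypothetical illustration of the described policy (the exact rounding behavior is an assumption), not the allocator's actual implementation:
#include <cstddef>
#include <iostream>

// Illustrative sketch of the documented routing: requests up to 512 B round up
// to a 64 B boundary, up to 2048 B to a 256 B boundary, and larger requests to
// the next power of two. Each resulting size class gets its own pool.
std::size_t size_class_for(std::size_t bytes) {
  if (bytes <= 512) {
    return ((bytes + 63) / 64) * 64;      // 64 B granularity
  }
  if (bytes <= 2048) {
    return ((bytes + 255) / 256) * 256;   // 256 B granularity
  }
  std::size_t cls = 4096;                 // smallest power of two above 2048
  while (cls < bytes) {
    cls *= 2;
  }
  return cls;
}

int main() {
  std::cout << size_class_for(200) << '\n';   // 256
  std::cout << size_class_for(1500) << '\n';  // 1536
  std::cout << size_class_for(5000) << '\n';  // 8192
}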
When to use each allocator:
HugePageAllocator: Simple, automatic separate pools per type (recommended for our btree)
MultiSizeHugePageAllocator: Variable-sized allocations (e.g., absl::btree_map, other STL containers with allocator support)
PolicyBasedHugePageAllocator: Fine-grained control, shared pools across trees, custom pool sizes
API Overview
The btree class provides an API similar to std::map:
Insertion:
std::pair<iterator, bool> insert(const Key& key, const Value& value)
std::pair<iterator, bool> emplace(Args&&... args)
Value& operator[](const Key& key)
Lookup:
iterator find(const Key& key)
const_iterator find(const Key& key) const
iterator lower_bound(const Key& key)
iterator upper_bound(const Key& key)
std::pair<iterator, iterator> equal_range(const Key& key)
Removal:
size_type erase(const Key& key)
iterator erase(iterator pos)
Iteration:
iterator begin() / const_iterator begin() const
iterator end() / const_iterator end() const
Range-based for loops supported
Capacity:
size_type size() const
bool empty() const
void clear()
Other:
void swap(btree& other) noexcept
key_compare key_comp() const
value_compare value_comp() const
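The API above maps closely onto std::map. The short sketch below exercises a few of the lookup and removal calls listed here; it assumes the same btree<int64_t, int32_t> and header path as the Quick Start example:
#include <fast_containers/btree.hpp>
#include <cstdint>
#include <iostream>

int main() {
  kressler::fast_containers::btree<int64_t, int32_t> tree;
  for (int64_t i = 0; i < 10; ++i) {
    tree.insert(i, static_cast<int32_t>(i * 10));
  }

  // lower_bound / upper_bound behave as in std::map
  auto lo = tree.lower_bound(3);   // first element with key >= 3
  auto hi = tree.upper_bound(6);   // first element with key > 6
  for (auto it = lo; it != hi; ++it) {
    std::cout << it->first << " -> " << it->second << '\n';
  }

  // equal_range returns the [lower_bound, upper_bound) pair for a key
  auto [first, last] = tree.equal_range(5);
  if (first != last) {
    std::cout << "key 5 present, value " << first->second << '\n';
  }

  // erase by key returns the number of elements removed
  std::cout << "erased: " << tree.erase(5) << '\n';
  std::cout << "size: " << tree.size() << '\n';
}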
Template Parameters
template <
    typename Key,
    typename Value,
    std::size_t LeafNodeSize = default_leaf_node_size(),
    std::size_t InternalNodeSize = default_internal_node_size(),
    typename Compare = std::less<Key>,
    SearchMode SearchModeT = SearchMode::Linear,
    typename Allocator = std::allocator<std::pair<const Key, Value>>>
class btree;
Parameters:
Key, Value: The key and value types
LeafNodeSize: Number of key-value pairs per leaf node
Default: Auto-computed heuristic targeting ~2KB (32 cache lines)
Formula: 2048 / (sizeof(Key) + sizeof(Value)), rounded to multiple of 8, clamped to [8, 64]
Manual tuning: larger node sizes (64-96) suit small value types, smaller node sizes (8-16) suit large value types; a worked sketch of the default heuristics appears after this parameter list
InternalNodeSize: Number of child pointers per internal node
Default: Auto-computed heuristic targeting ~1KB (16 cache lines)
Formula: 1024 / (sizeof(Key) + sizeof(void*)), rounded to multiple of 8, clamped to [16, 64]
Generally leave at default (stores only 8-byte pointers)
Compare: Comparison function (must satisfy ComparatorCompatible)
Default: std::less<Key>
Also supports std::greater<Key> for descending order
SearchMode: How to search within a node
Default: SearchMode::Linear (scalar linear search)
SearchMode::SIMD: AVX2-accelerated search (3-10% faster, requires AVX2 CPU and SIMD-compatible keys: int32_t, uint32_t, int64_t, uint64_t, float, double)
SearchMode::Binary: Binary search
Allocator: Memory allocation strategy
Default: std::allocator<std::pair<const Key, Value>>
Recommended for performance: HugePageAllocator<std::pair<const Key, Value>> for working sets >1GB (3-5× faster)
Automatically creates separate pools for leaf and internal nodes via rebind
Default: 256MB initial pool, 64MB growth per pool
Requires hugepages configured: sudo sysctl -w vm.nr_hugepages=<N>
Falls back to regular pages if unavailable
For variable-sized allocations: MultiSizeHugePageAllocator<std::pair<const Key, Value>>
Routes allocations to size-class-specific pools
Use with absl::btree_map or other containers that allocate different-sized objects
Provides 2-3× speedup for Abseil btree over the standard allocator
Advanced control: PolicyBasedHugePageAllocator<std::pair<const Key, Value>, TwoPoolPolicy>
Fine-grained control over pool sizes
Share pools across multiple trees
Separate pools for leaf and internal nodes with custom sizes
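To make the default node-size heuristics above concrete, the constexpr sketch below reproduces the stated formulas. The function names are hypothetical and the rounding direction (down to a multiple of 8) is an assumption, so treat it as illustration rather than the library's actual default_leaf_node_size()/default_internal_node_size():
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Leaf nodes target ~2KB: 2048 / (sizeof(Key) + sizeof(Value)),
// rounded to a multiple of 8 and clamped to [8, 64].
template <typename Key, typename Value>
constexpr std::size_t sketch_leaf_node_size() {
  std::size_t n = 2048 / (sizeof(Key) + sizeof(Value));
  n = (n / 8) * 8;  // round down to a multiple of 8 (assumed direction)
  return std::clamp<std::size_t>(n, 8, 64);
}

// Internal nodes target ~1KB: 1024 / (sizeof(Key) + sizeof(void*)),
// rounded to a multiple of 8 and clamped to [16, 64].
template <typename Key>
constexpr std::size_t sketch_internal_node_size() {
  std::size_t n = 1024 / (sizeof(Key) + sizeof(void*));
  n = (n / 8) * 8;
  return std::clamp<std::size_t>(n, 16, 64);
}

struct Value32 { char bytes[32]; };  // stand-in for a 32-byte value type

// Example: 8-byte keys, 32-byte values -> 2048 / 40 = 51 -> 48 pairs per leaf;
// 8-byte keys -> 1024 / 16 = 64 -> 64 children per internal node.
static_assert(sketch_leaf_node_size<std::int64_t, Value32>() == 48);
static_assert(sketch_internal_node_size<std::int64_t>() == 64);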
Performance
Benchmarks comparing against Abseil's btree_map and std::map are available in results/btree_benchmark_results.md.
Performance Highlights (8-byte keys, 32-byte values, 10M elements)
Our btree with hugepages (btree_8_32_96_128_simd_hp):
INSERT P99.9: 1,023 ns
FIND P99.9: 864 ns
ERASE P99.9: 1,086 ns
Our btree with standard allocator (btree_8_32_96_128_simd):
INSERT P99.9: 3,155 ns (3.1× slower than with hugepages)
FIND P99.9: 950 ns (1.1× slower than with hugepages)
ERASE P99.9: 1,323 ns (1.2× slower than with hugepages)
vs. Abseil btree with hugepages (absl_8_32_hp using MultiSizeHugePageAllocator):
INSERT P99.9: 1,401 ns (27% slower)
FIND P99.9: 1,190 ns (38% slower)
ERASE P99.9: 1,299 ns (20% slower)
vs. Abseil btree with standard allocator (absl_8_32):
INSERT P99.9: 3,287 ns (3.2× slower)
FIND P99.9: 1,342 ns (55% slower)
ERASE P99.9: 1,679 ns (55% slower)
vs. std::map (map_8_32):
INSERT P99.9: 3,587 ns (3.5× slower)
FIND P99.9: 2,312 ns (2.7× slower)
ERASE P99.9: 2,253 ns (2.1× slower)
Key Insights
Hugepage allocators provide massive performance improvements:
Our btree: 2-3× faster with hugepages vs. standard allocator
Abseil btree: 2× faster with MultiSizeHugePageAllocator vs. standard allocator
Critical for large working sets (>1M elements)
Our implementation maintains significant advantages even with fair comparison:
20-67% faster than Abseil btree even when both use hugepage allocators
Advantages from SIMD search, tunable node sizes, and optimized bulk transfers
Performance gap widens with larger values (256 bytes: 21-135% faster)
Performance varies by tree size:
Large trees (10M elements): Our btree dominates all metrics
Small trees (10K elements): Competition intensifies, std::map becomes viable for some workloads
See benchmark results for detailed analysis
The hugepage allocator is the single most important optimization, providing benefits by reducing TLB misses (helps find operations) and making allocations extremely cheap through pooling (helps insert/erase operations).
Building from Source
Using CMake Presets (Recommended)
# List available presets
cmake --list-presets
# Configure, build, and test in one workflow
cmake --preset release
cmake --build --preset release
ctest --preset release
# Common presets:
cmake --preset debug # Debug build
cmake --preset release # Release with AVX2 (default)
cmake --preset asan # AddressSanitizer build
cmake --preset release-no-avx2 # Release without AVX2
Manual Build (Alternative)
# Clone with submodules
git clone --recursive https://github.com/kressler/fast-containers.git
cd fast-containers
# Configure
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
# Build
cmake --build build
# Run tests
ctest --test-dir build --output-on-failure
Build Options
ENABLE_AVX2 (default: ON for Release, OFF for Debug) - Enable AVX2 SIMD optimizations
ENABLE_ASAN (default: OFF) - Enable AddressSanitizer
ENABLE_ALLOCATOR_STATS (default: OFF) - Enable allocator statistics
ENABLE_LTO (default: ON) - Enable Link-Time Optimization
ENABLE_NUMA (default: auto-detected) - Enable NUMA support (requires libnuma)
Development
Setup
Clone with submodules:
git clone --recursive https://github.com/kressler/fast-containers.git
cd fast-containers
One-time development setup:
./setup-dev.sh
This installs pre-commit hooks and configures clang-tidy.
Code Quality Tools
Automatic formatting and checks (via pre-commit hook):
git commit # Automatically formats code and runs clang-tidy
The pre-commit hook will:
Format all staged C++ files with clang-format
Check production headers with clang-tidy
Fail the commit if warnings are found
Auto-create build directories if missing
Manual formatting:
cmake --build build --target format
Manual static analysis:
cmake --build build --target clang-tidy
# Or manually:
clang-tidy-19 -p cmake-build-clang-tidy include/fast_containers/*.hpp
Requirements:
clang-format (for code formatting)
clang-tidy-19 (for static analysis)
cmake (to auto-create build directories)
Bypass hook (when needed):
git commit --no-verify
Development Workflow
Make your changes
Build and test:
cmake --build build && ctest --test-dir build
Commit (auto-formatted and checked):
git add .
git commit -m "Your changes"
# Pre-commit hook runs automatically
Code Conventions
Follow Google C++ Style Guide (enforced by clang-format)
Use C++23 features
Write tests for new functionality using Catch2
Production code must be clang-tidy clean (enforced in CI and pre-commit)
Run cmake --build build --target format before submitting PRs
Project Structure
.
├── include/
│   └── fast_containers/          # Public header files
│       ├── btree.hpp, btree.ipp
│       ├── dense_map.hpp, dense_map.ipp
│       ├── hugepage_allocator.hpp
│       ├── multi_size_hugepage_allocator.hpp
│       ├── multi_size_hugepage_pool.hpp
│       ├── policy_based_hugepage_allocator.hpp
│       └── hugepage_pool.hpp
├── tests/                        # Unit tests (Catch2)
│   ├── test_btree.cpp
│   ├── test_dense_map.cpp
│   ├── test_hugepage_allocator.cpp
│   └── test_policy_based_allocator.cpp
├── src/
│   ├── benchmarks/               # Google Benchmark performance tests
│   │   ├── dense_map_search_benchmark.cpp
│   │   └── hugepage_allocator_benchmark.cpp
│   └── binary/                   # Standalone benchmark executables
│       ├── btree_benchmark.cpp
│       └── btree_stress.cpp
├── scripts/
│   └── interleaved_btree_benchmark.py  # A/B testing harness
├── results/
│   └── btree_benchmark_results.md      # Performance analysis
├── third_party/                  # Git submodules
│   ├── catch2/                   # Unit testing framework
│   ├── benchmark/                # Google Benchmark
│   ├── histograms/               # Latency histogram library
│   ├── abseil-cpp/               # Comparison baseline
│   ├── lyra/                     # Command-line parsing
│   └── unordered_dense/          # Dense hash map
├── hooks/                        # Git hooks (install with setup-dev.sh)
│   └── pre-commit                # Auto-format and clang-tidy
└── CMakeLists.txt                # Build configuration
Tech Stack
Language: C++23
Build System: CMake 3.30+
Testing: Catch2 v3.11.0
Code Formatting: clang-format (Google C++ Style)
Static Analysis: clang-tidy-19