Jan 5, 2026

Memchunk: Ultra-Fast Text Chunking for Retrieval‑Augmented Generation

Memchunk removes the text‑segmentation bottleneck in RAG pipelines by combining SIMD‑accelerated byte searches, reverse‑direction delimiter scanning, and lookup tables for larger delimiter sets. It reaches hundreds of gigabytes per second on Wikipedia‑scale data, vastly outperforming existing Rust and Python chunking libraries, and ships lightweight Python and WASM bindings for production use.

## Introduction

When building Retrieval‑Augmented Generation (RAG) systems, the first step is to split a vast corpus into manageable chunks that fit into embedding models or model context windows. In practice, the naïve strategy of splitting every N characters creates sentence fragments and degrades retrieval quality. We needed a fast, deterministic way to split on semantic boundaries such as periods, newlines, or question marks. This article traces how a quest for speed led to the creation of **memchunk**, a Rust‑based chunking library that outpaces its competitors by more than two orders of magnitude on Wikipedia‑scale data.

## The Chunking Problem

A chunk is defined by the last delimiter before a target window size. If we target a 4 KiB chunk, we want the index of the last period, newline, or question mark that occurs before byte 4096. The naive algorithm scans forward, marks every delimiter it passes, and stops only after crossing the window boundary: an O(n) scan with many bookkeeping updates. A more efficient pattern is to scan **backwards** from the window boundary and stop at the first delimiter encountered. This removes the bookkeeping, reduces the number of look‑ups, and guarantees that each chunk ends at a natural boundary.

## Underlying Techniques

### 1. SIMD‑Accelerated Byte Search – memchr

memchr, written by Andrew Gallant, is a highly optimized byte‑search library. It implements:

* **SWAR** (SIMD Within A Register) – uses 64‑bit integer tricks to compare eight bytes at a time without explicit SIMD instructions.
* **AVX2 / SSE2** – when available, memchr loads 32 or 16 bytes into vector registers, compares them against a broadcasted needle, and extracts a bit mask of matching positions. The first set bit gives the offset of the delimiter.

For one to three delimiters, memchr offers `memrchr`, `memrchr2`, and `memrchr3`. The design intentionally caps at three because each additional needle adds a broadcast, a comparison, and an `OR`, yielding diminishing returns.

### 2. Lookup Table for Many Delimiters

When more than three delimiters are requested (e.g. `".?;!"`), the library falls back to a classic 256‑entry lookup table:

```rust
fn build_table(delimiters: &[u8]) -> [bool; 256] {
    let mut table = [false; 256];
    for &b in delimiters {
        table[b as usize] = true;
    }
    table
}

fn is_delimiter(b: u8, table: &[bool; 256]) -> bool {
    table[b as usize]
}
```

This achieves branch‑free O(1) look‑ups, which modern CPUs can process at roughly one byte per cycle.

### 3. Backwards Search and Zero‑Copy Views

Both the SIMD and lookup‑table paths search in *reverse* (`memrchr*`). Starting from the window size and moving left ensures that the *first* delimiter encountered is the correct split point. The result is a byte offset that callers can use to slice the original input without allocations: Rust returns a `&[u8]` slice, Python provides a `memoryview`, and JavaScript exposes a `Uint8Array.subarray`.

## memchunk API

```rust
use memchunk::chunk;

let text = b"Hello world. How are you?";
// chunk(input, target_chunk_size, delimiters)
for slice in chunk(text, 4096, ".?") {
    // slice is a &[u8] that ends at the last matching delimiter
}
```

The public API accepts a byte slice, a target chunk size, and an arbitrary set of delimiters. Internally it chooses one of the three search strategies outlined above and returns an iterator of zero‑copy slices.
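To make the strategy selection concrete, here is a minimal sketch of how a split‑point search along these lines could be wired together on top of the memchr crate and the `build_table` listing from earlier. This is an illustration of the idea, not memchunk's actual internals, and the function name `find_split` is invented for this example.

```rust
use memchr::{memrchr, memrchr2, memrchr3};

// Illustrative sketch: find the offset of the last delimiter at or before
// the window boundary, choosing the search strategy by delimiter count.
fn find_split(text: &[u8], window: usize, delimiters: &[u8]) -> Option<usize> {
    // Only the bytes inside the target window are searched.
    let haystack = &text[..window.min(text.len())];
    match delimiters {
        [] => None,
        // One to three delimiters: SIMD-accelerated reverse search.
        [a] => memrchr(*a, haystack),
        [a, b] => memrchr2(*a, *b, haystack),
        [a, b, c] => memrchr3(*a, *b, *c, haystack),
        // Four or more delimiters: fall back to the 256-entry lookup table
        // (a real implementation would build the table once, not per call).
        _ => {
            let table = build_table(delimiters); // from the listing above
            haystack.iter().rposition(|&b| table[b as usize])
        }
    }
}
```

Given the returned offset, the iterator's job is essentially to emit the slice up to and including that delimiter and restart the search right after it.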
## Performance

Benchmarks across several chunking libraries show that memchunk achieves **≈164 GB/s** on a 20 GB English Wikipedia dump, i.e. the full dump is chunked in roughly 0.12 s. The table below compares it to other libraries:

| Library | Throughput | Relative Performance |
|---------|------------|----------------------|
| memchunk | 164 GB/s | – |
| kiru | 4.5 GB/s | 36× slower |
| langchain | 0.35 GB/s | 469× slower |
| semchunk | 0.013 GB/s | 12 615× slower |
| llama‑index | 0.0035 GB/s | 46 857× slower |
| text‑splitter | 0.0017 GB/s | 96 471× slower |

Even in the worst‑case scenario (small chunks, many delimiters) memchunk remains in the hundreds of GB/s range.

## Cross‑Platform Bindings

The library ships with:

* **Python** – a lightweight wrapper exposing a `chunk` generator that yields zero‑copy `memoryview` objects.
* **JavaScript/WebAssembly** – a `chunk` function available via npm; it returns `Uint8Array.subarray` views.
* **Rust** – the core implementation.

These bindings preserve performance because they avoid data copies across the FFI boundary.

## Takeaway

Memchunk is not about inventing new algorithms; it is about selecting the right primitive for the right number of delimiters, searching in reverse to eliminate bookkeeping, and exposing zero‑copy views. The result is a chunking solution that is simple to use, trivial to integrate, and orders of magnitude faster than existing alternatives. The library is available on crates.io, PyPI, and npm. Feel free to experiment and let us know how it scales in your RAG pipelines!