Can a Single-Threaded Word Counter Approach Disk Read Bandwidth? A Low-Level Benchmark Study
Ben Hoyt's claim that input-output is no longer the limiting factor in typical interview problems is put to the test with a 425 MB text file. Despite aggressive GCC/Clang optimizations and hand-written AVX2 vector code, a hand-tuned word counter tops out at roughly 1.45 GB/s, only about 11% of the measured warm-cache read speed. The experiment shows that modern compilers still struggle to auto-vectorize highly branchy scalar loops, and that reaching hardware-bound throughput demands explicit SIMD tricks.
### Introduction
In late 2022, Ben Hoyt published a blog post that challenged the long-standing belief that input/output (I/O) is the bottleneck in common interview-style tasks such as counting word frequencies from a text stream. By comparing sequential read speeds against CPU performance, Hoyt argued that CPU processing is now the limiting factor, while disk subsystems have become far less of a concern.
The premise is simple: if a single thread can process a data stream at the speed the disk can deliver it, then I/O is no longer a constraint. To explore this claim, I ran a series of low-level benchmarks on a 425 MB file consisting of 100 copies of the King James Bible. The target program is a naïve word-frequency counter written in C. The goal was to push the implementation toward the upper bound set by DRAM bandwidth (12.8 GB/s measured with a warm cache) and see how close we could get.
---
### Baseline Disk Throughput
Using a custom read loop that performs 64-byte sequential reads into a cold buffer, I measured the host's read bandwidth:
- **Cold cache**: 1.6 GB/s
- **Warm cache**: 12.8 GB/s (best of five runs)
These figures set the upper limit for any single-threaded, CPU-bound operation that reads the same data stream. The question is whether the algorithm can match or exceed 1.6 GB/s in the cold-cache scenario.
---
### Initial Optimized C Implementation
Today's compilers can generate highly efficient code when the control flow is straightforward. I compiled the `optimized.c` file with GCC 12 using `-O3 -march=native` and executed it against the 100-copy Bible file.
| Metric | Value |
|--------|-------|
| Real | 1.525 s |
| User | 1.477 s |
| System | 0.048 s |
With a 425 MB input this translates to *278* MB/s, merely 17% of the cold-cache bandwidth.
The limiting factor was a hot loop that performed character-case conversion and hash mixing:
```c
for (; i < num_process; i++) {
    char c = buf[i];
    if (c <= ' ')             /* whitespace or control byte ends the word */
        break;
    if (c >= 'A' && c <= 'Z') {
        c += ('a' - 'A');     /* in-place lower-casing */
        buf[i] = c;
    }
    hash *= FNV_PRIME;        /* FNV-style hash mixing */
    hash ^= (uint64_t)c;
}
```
The early `break` and the conditional store in this loop hamper auto-vectorization: the compiler cannot prove the loop free of data-dependent control flow, so it falls back to scalar code.
---
### Improving Branch Behavior
My first tweak moved the lower-case conversion outside the main loop:
```c
for (int i = 0; i < BUF_SIZE; ++i)
buf[i] = (buf[i] >= 'A' && buf[i] <= 'Z') ? (buf[i] - 'A' + 'a') : buf[i];
```
Running this with Clang's vectorizer boosted throughput to *330* MB/s. Surprisingly, adding the same conversion pass *before* the original loop, without removing the in-loop branch, produced comparable results, showing that branch prediction can compensate for the extra work.
Nonetheless, the gains remain far below disk throughput. A clean-room rewrite with AVX2 is required to approach the hardware limits.
---
### A Baseline: `wc -w`
I compared the handwritten counter against the standard `wc -w`, which merely counts words without maintaining a frequency map. On the same file, `wc -w` ran in 1.758 s (245 MB/s). The difference stems from `wc` recognizing a larger set of whitespace characters (newline, tab, and locale-specific separators), which adds conditional logic and reduces vector-friendly regularity.
---
### SIMD Strategy
The core operation in a word counter is identifying whitespace boundaries in a contiguous block of characters. With a 256-bit register holding 32 ASCII bytes, a single `VPCMPEQB` instruction can produce a byte-wise comparison result marking all whitespace positions.
Once the comparison result is in hand, the typical strategy is to apply `VPMOVMSKB`, packing the per-byte result into a 32-bit integer mask. The built-in function `__builtin_ffs` ("find first set") can then iterate over the set bit positions. For each set bit, we check whether it marks the start of a new word by looking at the gap from the previous set bit.
This approach yields a lightweight, purely dataâparallel word counter with minimal scalar overhead.
---
### Implementation Details
I implemented the prototype using the `<immintrin.h>` intrinsics interface, aligning the data to 32-byte boundaries to allow aligned AVX2 loads. Six registers hold broadcasts of the ASCII space character, while the main loop is unrolled four times; each iteration processes 128 bytes.
Key challenges included:
1. Off-by-one errors when the input length is not a multiple of 128.
2. Properly handling the final partial block.
3. Maintaining correct word boundaries across SIMD lane boundaries.
After debugging, the program produced the correct word count for a diverse set of text inputs.
---
### Performance Results
| Run | Time (s) | Throughput (GB/s) |
|-----|----------|-------------------|
| Warm cache | 0.227 | 1.45 |
| Cold cache | 0.395 | 1.07 |
On a warm cache, the hand-tuned AVX2 routine achieves roughly 11% of the measured warm-cache read bandwidth (12.8 GB/s). Even in the cold-cache scenario, throughput stays below the 1.6 GB/s read baseline, suggesting that the CPU still lags behind the I/O subsystem.
---
### Conclusion
The experiment confirms that, in practice, a single-threaded word counter, whether written in pure C or hand-optimized with AVX2, cannot saturate the bandwidth offered by modern SSDs or DRAM. Auto-vectorizers falter on highly branchy code, and manual SIMD is required to eke out performance gains; even then, a single core achieves only a fraction of the available I/O throughput.
Therefore, Ben Hoyt's assertion that disk I/O is no longer the bottleneck holds true: the limiting factor in such problems has shifted dramatically to CPU and memory-subsystem performance. Optimizing algorithms for data locality, reducing branching, and employing SIMD can help, but they cannot eliminate the inherent CPU-bound constraint for a single thread.
The full source code, including a GitHub repository and pull-request template for new bit-trick contributions, is available under the MIT license.