Can a Single-Threaded Word Counter Approach Disk Read Bandwidth? A Low-Level Benchmark Study
Ben Hoyt's claim that input-output is no longer the limiting factor in typical interview problems is put to the test with a 425 MB text file. Despite aggressive GCC/Clang optimizations and hand-written AVX2 vector code, a hand-tuned word counter tops out at roughly 1.45 GB/s, only about 11% of the measured warm-cache read speed. The experiment shows that modern compilers still struggle to auto-vectorize highly branchy scalar loops, and that reaching hardware-bound throughput demands explicit SIMD tricks.
### Introduction
In late 2022, Ben Hoyt published a blog post that challenged the long-standing belief that input/output (I/O) is the bottleneck in common interview-style tasks such as counting word frequencies from a text stream. By comparing sequential read speeds against CPU performance, Hoyt argued that CPU processing is now the limiting factor, while disk subsystems have become far less of a concern.
The premise is simple: if a single thread can process a data stream at the speed the disk can deliver it, then I/O is no longer a constraint. To explore this claim, I ran a series of low-level benchmarks on a 425 MB file consisting of 100 copies of the King James Bible. The target program is a naïve word-frequency counter written in C. The goal was to push the implementation toward the upper bound set by DRAM bandwidth (12.8 GB/s measured with a warm cache) and see how close we could get.
---
### Baseline Disk Throughput
Using a custom read loop that performs 64-byte sequential reads into a cold buffer, I measured the host's read bandwidth:
- **Cold cache**: 1.6 GB/s
- **Warm cache**: 12.8 GB/s (best of five runs)
These figures set the upper limit for any single-threaded, CPU-bound operation that reads the same data stream. The question is whether the algorithm can match or exceed 1.6 GB/s in the cold-cache scenario.
---
### Initial Optimized C Implementation
Today's compilers can generate highly efficient code when the control flow is straightforward. I compiled the `optimized.c` file with GCC 12 using `-O3 -march=native` and executed it against the 100-copy Bible file.
| Metric | Value |
|--------|-------|
| Real | 1.525 s |
| User | 1.477 s |
| System | 0.048 s |
With a 425 MB input this translates to *278* MB/s, merely 17% of the cold-cache bandwidth.
The limiting factor was a hot loop that performed character-case conversion and hash mixing:
```c
for (; i < num_process; i++) {
    char c = buf[i];
    if (c <= ' ')             /* whitespace or control byte ends the word */
        break;
    if (c >= 'A' && c <= 'Z') {
        c += ('a' - 'A');     /* in-place lower-casing */
        buf[i] = c;
    }
    hash *= FNV_PRIME;        /* FNV-style hash mixing */
    hash ^= (uint64_t)c;
}
```
The early `break` and the conditional store in this loop hamper auto-vectorization: the compiler cannot prove the loop free of data-dependent control flow, so it falls back to scalar code.
---
### Improving Branch Behavior
My first tweak moved the lower-case conversion outside the main loop:
```c
for (int i = 0; i < BUF_SIZE; ++i)
buf[i] = (buf[i] >= 'A' && buf[i] <= 'Z') ? (buf[i] - 'A' + 'a') : buf[i];
```
Running this with Clang's vectorizer boosted throughput to *330* MB/s. Surprisingly, adding the same conversion pass *before* the original loop, without removing the in-loop branch, produced comparable results, showing that branch prediction can compensate for the extra work.
Nonetheless, the gains remain far below disk throughput. A clean-room rewrite with AVX2 is required to approach the hardware limits.
---
### A Baseline: `wc -w`
I compared the handwritten counter against the standard `wc -w`, which merely counts words without maintaining a frequency map. On the same file, `wc -w` ran in 1.758 s (245 MB/s). The difference stems from `wc` recognizing a larger set of whitespace characters (newline, tab, and locale-specific separators), which adds conditional logic and reduces vector-friendly regularity.
---
### SIMD Strategy
The core operation in a word counter is identifying whitespace boundaries in a contiguous block of characters. With a 256-bit register holding 32 ASCII bytes, a single `VPCMPEQB` instruction can produce a byte-wise comparison result marking all whitespace positions.
Once the comparison result is in hand, the typical strategy is to apply `VPMOVMSKB`, packing the per-byte result into a 32-bit integer mask. The built-in function `__builtin_ffs` ("find first set") can then iterate over the set bit positions. For each set bit, we check whether it marks the start of a new word by looking at the gap from the previous set bit.
This approach yields a lightweight, purely dataâparallel word counter with minimal scalar overhead.
---
### Implementation Details
I implemented the prototype using the `<immintrin.h>` intrinsics interface, aligning the data to 32-byte boundaries to allow aligned AVX2 loads. Six registers hold broadcasts of the ASCII space character, while the main loop is unrolled four times; each iteration processes 128 bytes.
Key challenges included:
1. Off-by-one errors when the input length is not a multiple of 128.
2. Properly handling the final partial block.
3. Maintaining correct word boundaries across SIMD lane boundaries.
After debugging, the program produced the correct word count for a diverse set of text inputs.
---
### Performance Results
| Run | Time (s) | Throughput (GB/s) |
|-----|----------|-------------------|
| Warm cache | 0.227 | 1.45 |
| Cold cache | 0.395 | 1.07 |
On a warm cache, the hand-tuned AVX2 routine achieves roughly 11% of the measured warm-cache read bandwidth (12.8 GB/s). Even in the cold-cache scenario, throughput stays below the 1.6 GB/s read baseline, suggesting that the CPU still lags behind the I/O subsystem.
---
### Conclusion
The experiment confirms that, in practice, a single-threaded word counter, whether written in pure C or hand-optimized with AVX2, cannot saturate the bandwidth offered by modern SSDs or DRAM. Auto-vectorizers falter on highly branchy code, and manual SIMD is required to eke out performance gains; even then, a single core achieves only a fraction of the available I/O throughput.
Therefore, Ben Hoyt's assertion that disk I/O is no longer the bottleneck holds true: the limiting factor in such problems has shifted dramatically to CPU and memory-subsystem performance. Optimizing algorithms for data locality, reducing branching, and employing SIMD can help, but they cannot eliminate the inherent CPU-bound constraint for a single thread.
The full source code, including a GitHub repository and pull-request template for new bit-trick contributions, is available under the MIT license.