Kernel bugs hide for 2 years on average. Some hide for 20
There are bugs in your kernel right now that won't be found for years. I know because I analyzed 125,183 of them, every bug with a traceable Fixes: tag in the Linux kernel's 20-year git history.
The average kernel bug lives 2.1 years before discovery. But some subsystems are far worse: CAN bus drivers average 4.2 years, SCTP networking 4.0 years. The longest-lived bug in my dataset, a buffer overflow in ethtool, sat in the kernel for 20.7 years. The one I'll dissect in detail is a refcount leak in netfilter that survived for 19 years.
I built a tool that catches 92% of historical bugs in a held-out test set at commit time. Here's what I learned.
Key findings at a glance
| Figure | What it means |
| --- | --- |
| 125,183 | Bug-fix pairs with traceable Fixes: tags |
| 123,696 | Valid records after filtering (0 < lifetime < 27 years) |
| 2.1 years | Average time a bug hides before discovery |
| 20.7 years | Longest-lived bug (ethtool buffer overflow) |
| 0% → 69% | Bugs found within 1 year (2010 vs 2022) |
| 92.2% | Recall of VulnBERT on held-out 2024 test set |
| 1.2% | False positive rate (vs 48% for vanilla CodeBERT) |
The initial discovery
I started by mining the most recent 10,000 commits with Fixes: tags from the Linux kernel. After filtering out invalid references (commits that pointed to hashes outside the repo, malformed tags, or merge commits), I had 9,876 valid vulnerability records. For the lifetime analysis, I excluded 27 same-day fixes (bugs introduced and fixed within hours), leaving 9,849 bugs with meaningful lifetimes.
The results were striking:
| Metric | Value |
| --- | --- |
| Bugs analyzed | 9,876 |
| Average lifetime | 2.8 years |
| Median lifetime | 1.0 year |
| Maximum | 20.7 years |
Almost 20% of bugs had been hiding for 5+ years. The networking subsystem looked particularly bad at 5.2 years average. I found a refcount leak in netfilter that had been in the kernel for 19 years.
Initial findings: Half of bugs found within a year, but 20% hide for 5+ years.
But something nagged at me: my dataset only contained fixes from 2025. Was I seeing the full picture, or just the tip of the iceberg?
Going deeper: Mining the full history
I rewrote my miner to capture every Fixes: tag since Linux moved to git in 2005. Six hours later, I had 125,183 vulnerability records, more than 12 times my initial dataset.
The numbers changed significantly:
| Metric | 2025 Only | Full History (2005-2025) |
| --- | --- | --- |
| Bugs analyzed | 9,876 | 125,183 |
| Average lifetime | 2.8 years | 2.1 years |
| Median lifetime | 1.0 year | 0.7 years |
| 5+ year bugs | 19.4% | 13.5% |
| 10+ year bugs | 6.6% | 4.2% |
Full history: 57% of bugs found within a year. The long tail is smaller than it first appeared.
Why the difference? My initial 2025-only dataset was biased. Fixes in 2025 include:
New bugs introduced recently and caught quickly
Ancient bugs that finally got discovered after years of hiding
The ancient bugs skewed the average upward. When you include the full history with all the bugs that were introduced AND fixed within the same year, the average drops from 2.8 to 2.1 years.
The real story: We're getting faster (but it's complicated)
The most striking finding from the full dataset: bugs introduced in recent years appear to get fixed much faster.
| Year Introduced | Bugs | Avg Lifetime | % Found <1yr |
| --- | --- | --- | --- |
| 2010 | 1,033 | 9.9 years | 0% |
| 2014 | 3,991 | 3.9 years | 31% |
| 2018 | 11,334 | 1.7 years | 54% |
| 2022 | 11,090 | 0.8 years | 69% |
Bugs introduced in 2010 took nearly 10 years to find; bugs introduced in 2024 are found in about 5 months. At first glance that looks like a 20x improvement!
But here's the catch: this data is right-censored. Bugs introduced in 2022 can't have a 10-year lifetime yet since we're only in 2026. We might find more 2022 bugs in 2030 that bring the average up.
The fairer comparison is "% found within 1 year" and that IS improving: from 0% (2010) to 69% (2022). That's real progress, likely driven by:
Syzkaller (released 2015)
KASAN, KMSAN, KCSAN sanitizers
Better static analysis
More contributors reviewing code
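One way to sidestep the censoring is to score each introduction-year cohort only on an outcome it has had time to reach. A minimal sketch in Python, assuming each record is an (intro_year, lifetime_days) pair (the field layout is illustrative, not the miner's actual schema):

from collections import defaultdict

def pct_found_within_one_year(records, last_full_year=2024):
    """% of each cohort's bugs fixed within 365 days, skipping cohorts
    that haven't had a full year of exposure yet."""
    found, total = defaultdict(int), defaultdict(int)
    for intro_year, lifetime_days in records:
        if intro_year > last_full_year:
            continue
        total[intro_year] += 1
        if lifetime_days <= 365:
            found[intro_year] += 1
    return {year: 100.0 * found[year] / total[year] for year in sorted(total)}

Even this metric understates the backlog: bugs that haven't been found yet never get a Fixes: tag, so they aren't in the dataset at all.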
But there's a backlog. When I look at just the bugs fixed in 2024-2025:
60% were introduced in the last 2 years (new bugs, caught quickly)
18% were introduced 5-10 years ago
6.5% were introduced 10+ years ago
We're simultaneously catching new bugs faster AND slowly working through ~5,400 ancient bugs that have been hiding for over 5 years.
The methodology
The kernel has a convention: when a commit fixes a bug, it includes a Fixes: tag pointing to the commit that introduced the bug.
commit de788b2e6227
Author: Florian Westphal
Date:   Fri Aug 1 17:25:08 2025 +0200

    netfilter: ctnetlink: fix refcount leak on table dump

    Fixes: d205dc40798d ("netfilter: ctnetlink: ...")
I wrote a miner that:
Runs git log --grep="Fixes:" to find all fixing commits
Extracts the referenced commit hash from the Fixes: tag
Pulls dates from both commits
Classifies subsystem from file paths (70+ patterns)
Detects bug type from commit message keywords
Calculates the lifetime
import re

fixes_pattern = r'Fixes:\s*([0-9a-f]{12,40})'
match = re.search(fixes_pattern, commit_message)
if match:
    introducing_hash = match.group(1)
    lifetime_days = (fixing_date - introducing_date).days
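Pulling the two dates is plain git plumbing. Since the analysis uses author dates, %at is the relevant format field; a rough sketch (the helper name and repo handling are mine, not the original miner's):

import subprocess
from datetime import datetime, timezone

def author_date(repo: str, commit: str) -> datetime:
    """Author date of a commit, via `git show -s --format=%at <hash>`."""
    out = subprocess.run(
        ["git", "-C", repo, "show", "-s", "--format=%at", commit],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromtimestamp(int(out), tz=timezone.utc)

# lifetime in days, mirroring the snippet above
# lifetime_days = (author_date(repo, fixing_hash) - author_date(repo, introducing_hash)).days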
Dataset details:
| Parameter | Value |
| --- | --- |
| Kernel version | v6.19-rc3 |
| Mining date | January 6, 2026 |
| Fixes mined since | 2005-04-16 (git epoch) |
| Total records | 125,183 |
| Unique fixing commits | 119,449 |
| Unique bug-introducing authors | 9,159 |
| With CVE ID | 158 |
| With Cc: stable | 27,875 (22%) |
Coverage note: The kernel has ~448,000 commits mentioning "fix" in some form, but only ~124,000 (28%) use proper Fixes: tags. My dataset captures the well-documented bugs, that is, the ones where maintainers traced the root cause.
It varies by subsystem
Some subsystems have bugs that persist far longer than others:
| Subsystem | Bug Count | Avg Lifetime |
| --- | --- | --- |
| drivers/can | 446 | 4.2 years |
| networking/sctp | 279 | 4.0 years |
| networking/ipv4 | 1,661 | 3.6 years |
| usb | 2,505 | 3.5 years |
| tty | 1,033 | 3.5 years |
| netfilter | 1,181 | 2.9 years |
| networking | 6,079 | 2.9 years |
| memory | 2,459 | 1.8 years |
| gpu | 5,212 | 1.4 years |
| bpf | 959 | 1.1 years |
CAN bus and SCTP bugs persist longest. BPF and GPU bugs get caught fastest.
CAN bus drivers and SCTP networking have the longest-lived bugs, probably because both are niche areas with less testing and fuzzing coverage. GPU (especially Intel i915) and BPF bugs get caught fastest, probably thanks to dedicated fuzzing infrastructure.
Interesting finding from comparing 2025-only vs full history:
| Subsystem | 2025-only Avg | Full History Avg | Difference |
| --- | --- | --- | --- |
| networking | 5.2 years | 2.9 years | -2.3 years |
| filesystem | 3.8 years | 2.6 years | -1.2 years |
| drivers/net | 3.3 years | 2.2 years | -1.1 years |
| gpu | 1.4 years | 1.4 years | 0 years |
Networking looked terrible in the 2025-only data (5.2 years!) but is actually closer to average in the full history (2.9 years). The 2025 fixes were catching a backlog of ancient networking bugs. GPU looks the same either way, and those bugs get caught consistently fast.
Some bug types hide longer than others
Race conditions are the hardest to find, averaging 5.1 years to discovery:
| Bug Type | Count | Avg Lifetime | Median |
| --- | --- | --- | --- |
| race-condition | 1,188 | 5.1 years | 2.6 years |
| integer-overflow | 298 | 3.9 years | 2.2 years |
| use-after-free | 2,963 | 3.2 years | 1.4 years |
| memory-leak | 2,846 | 3.1 years | 1.4 years |
| buffer-overflow | 399 | 3.1 years | 1.5 years |
| refcount | 2,209 | 2.8 years | 1.3 years |
| null-deref | 4,931 | 2.2 years | 0.7 years |
| deadlock | 1,683 | 2.2 years | 0.8 years |
Why do race conditions hide so long? They're non-deterministic and only trigger under specific timing conditions that might occur once per million executions. Even sanitizers like KCSAN can only flag races they observe.
30% of bugs are self-fixes where the same person who introduced the bug eventually fixed it. I guess code ownership matters.
Why some bugs hide longer
Less fuzzing coverage. Syzkaller excels at syscall fuzzing but struggles with stateful protocols. Fuzzing netfilter effectively requires generating valid packet sequences that traverse specific connection tracking states.
Harder to trigger. Many networking bugs require:
Specific packet sequences
Race conditions between concurrent flows
Memory pressure during table operations
Particular NUMA topologies
Older code with fewer eyes. Core networking infrastructure like nf_conntrack was written in the mid-2000s. It works, so nobody rewrites it. But "stable" means fewer developers actively reviewing.
Case study: 19 years in the kernel
One of the oldest networking bugs in my dataset was introduced in August 2006 and fixed in August 2025:
// ctnetlink_dump_table() - the buggy code path
if (res < 0) {
        nf_conntrack_get(&ct->ct_general);   // increments refcount
        cb->args[1] = (unsigned long)ct;
        break;
}
The irony: Commit d205dc40798d was itself a fix: "[NETFILTER]: ctnetlink: fix deadlock in table dumping". Patrick McHardy was fixing a deadlock by removing a _put() call. In doing so, he introduced a refcount leak that would persist for 19 years.
The bug: the code doesn't check if ct == last. If the current entry is the same as the one we already saved, we've now incremented its refcount twice but will only decrement it once. The object never gets freed.
// What should have been checked:
if (res < 0) {
        if (ct != last)                      // <-- this check was missing for 19 years
                nf_conntrack_get(&ct->ct_general);
        cb->args[1] = (unsigned long)ct;
        break;
}
The consequence: Memory leaks accumulate. Eventually nf_conntrack_cleanup_net_list() waits forever for the refcount to hit zero. The netns teardown hangs. If you're using containers, this blocks container cleanup indefinitely.
Why it took 19 years: You had to run conntrack_resize.sh in a loop for ~20 minutes under memory pressure. The fix commit says: "This can be reproduced by running conntrack_resize.sh selftest in a loop. It takes ~20 minutes for me on a preemptible kernel." Nobody ran that specific test sequence for two decades.
Incomplete fixes are common
Here's a pattern I keep seeing: someone notices undefined behavior, ships a fix, but the fix doesn't fully close the hole.
Case study: netfilter set field validation
| Date | Commit | What happened |
| --- | --- | --- |
| Jan 2020 | f3a2181e16f1 | Stefano Brivio adds support for sets with multiple ranged fields. Introduces NFTA_SET_DESC_CONCAT for specifying field lengths. |
| Jan 2024 | 3ce67e3793f4 | Pablo Neira notices the code doesn't validate that field lengths sum to the key length. Ships a fix. Commit message: "I did not manage to crash nft_set_pipapo with mismatch fields and set key length so far, but this is UB which must be disallowed." |
| Jan 2025 | 1b9335a8000f | Security researcher finds a bypass. The 2024 fix was incomplete: there were still code paths that could mismatch. Real fix shipped. |
The 2024 fix was an acknowledgment that something was wrong, but Pablo couldn't find a crash, so the fix was conservative. A year later, someone found the crash.
This pattern suggests a detection opportunity: commits that say things like "this is undefined behavior" or "I couldn't trigger this but..." are red flags. The author knows something is wrong but hasn't fully characterized the bug. These deserve extra scrutiny.
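A crude way to act on that signal is to scan commit messages for hedging language. A sketch; the phrase list is my guess at useful patterns, not a vetted taxonomy:

import re

HEDGING_PHRASES = [
    r"undefined behavior",
    r"did not manage to (?:crash|reproduce|trigger)",
    r"could not (?:reproduce|trigger)",
    r"in theory",
    r"should not happen",
    r"not sure (?:how|if) this can happen",
]
HEDGING_RE = re.compile("|".join(HEDGING_PHRASES), re.IGNORECASE)

def needs_extra_scrutiny(commit_msg: str) -> bool:
    """Flag fixes whose authors admit the bug isn't fully characterized."""
    return bool(HEDGING_RE.search(commit_msg))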
The anatomy of a long-lived bug
Looking at the bugs that survive 10+ years, I see common patterns:
1. Reference counting errors
kref_get(&obj->ref);
// ... error path returns without kref_put()
These don't crash immediately. They leak memory slowly. In a long-running system, you might not notice until months later when OOM killer starts firing.
2. Missing NULL checks after dereference
struct foo *f = get_foo();
f->bar = 1; // dereference happens first
if (!f) return -EINVAL; // check comes too late
The compiler might optimize away the NULL check since you already dereferenced. These survive because the pointer is rarely NULL in practice.
3. Integer overflow in size calculations
size_t total = n_elements * element_size; // can overflow
buf = kmalloc(total, GFP_KERNEL);
memcpy(buf, src, n_elements * element_size); // copies more than allocated
If n_elements comes from userspace, an attacker can cause allocation of a small buffer followed by a large copy.
4. Race conditions in state machines
spin_lock(&lock);
if (state == READY) {
spin_unlock(&lock);
// window here where another thread can change state
do_operation(); // assumes state is still READY
}
These require precise timing to hit. They might manifest as rare crashes that nobody can reproduce.
Can we catch these bugs automatically?
Every day a bug lives in the kernel is another day millions of devices are vulnerable. Android phones, servers, embedded systems, cloud infrastructure: all running kernel code with bugs that won't be found for years.
I built VulnBERT, a model that predicts whether a commit introduces a vulnerability.
Model evolution:
| Model | Recall | FPR | F1 | Notes |
| --- | --- | --- | --- | --- |
| Random Forest | 76.8% | 15.9% | 0.80 | Hand-crafted features only |
| CodeBERT (fine-tuned) | 89.2% | 48.1% | 0.65 | High recall, unusable FPR |
| VulnBERT | 92.2% | 1.2% | 0.95 | Best of both approaches |
The problem with vanilla CodeBERT: I first tried fine-tuning CodeBERT directly. Results: 89% recall but 48% false positive rate (measured on the same test set). Unusable: it would flag nearly half of all commits.
Why so bad? CodeBERT learns shortcuts: "big diff = dangerous", "lots of pointers = risky". These correlations exist in training data but don't generalize. The model pattern-matches on surface features, not actual bug patterns.
The VulnBERT approach: Combine neural pattern recognition with human domain expertise.
┌───────────────────────────────────────────────────────────────────┐
│                          INPUT: Git Diff                           │
└───────────────────────────────┬───────────────────────────────────┘
                                │
              ┌─────────────────┴──────────────────┐
              ▼                                    ▼
┌───────────────────────────┐     ┌─────────────────────────────────┐
│    Chunked Diff Encoder   │     │  Handcrafted Feature Extractor  │
│   (CodeBERT + Attention)  │     │    (51 engineered features)     │
└─────────────┬─────────────┘     └────────────────┬────────────────┘
              │ [768-dim]                          │ [51-dim]
              └─────────────────┬──────────────────┘
                                ▼
                ┌───────────────────────────────┐
                │    Cross-Attention Fusion     │
                │   "When code looks like X,    │
                │    feature Y matters more"    │
                └───────────────┬───────────────┘
                                ▼
                ┌───────────────────────────────┐
                │        Risk Classifier        │
                └───────────────────────────────┘
Three innovations that drove performance:
1. Chunked encoding for long diffs. CodeBERT's 512-token limit truncates most kernel diffs (often 2000+ tokens). I split into chunks, encode each, then use learned attention to aggregate:
import torch.nn as nn
import torch.nn.functional as F

# Learnable attention over chunks
chunk_attention = nn.Sequential(
    nn.Linear(hidden_size, hidden_size // 4),
    nn.Tanh(),
    nn.Linear(hidden_size // 4, 1),
)
attention_weights = F.softmax(chunk_attention(chunk_embeddings), dim=1)
pooled = (attention_weights * chunk_embeddings).sum(dim=1)
The model learns which chunks matter: the one with a spin_lock and no matching spin_unlock, not the boilerplate.
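The chunking step itself is unremarkable; a sketch using the public CodeBERT tokenizer (my illustration of the idea, not the exact VulnBERT pipeline):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

def chunk_diff(diff_text: str, chunk_len: int = 510) -> list[list[int]]:
    """Split a long diff into pieces that fit CodeBERT's 512-token window
    (510 content tokens, leaving room for [CLS]/[SEP])."""
    ids = tokenizer(diff_text, add_special_tokens=False)["input_ids"]
    return [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)]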
2. Feature fusion via cross-attention. Neural networks miss domain-specific patterns. I extract 51 handcrafted features using regex and AST-like analysis of the diff:
| Category | Features |
| --- | --- |
| Basic (4) | lines_added, lines_removed, files_changed, hunks_count |
| Memory (3) | has_kmalloc, has_kfree, has_alloc_no_free |
| Refcount (5) | has_get, has_put, get_count, put_count, unbalanced_refcount |
| Locking (5) | has_lock, has_unlock, lock_count, unlock_count, unbalanced_lock |
| Pointers (4) | has_deref, deref_count, has_null_check, has_deref_no_null_check |
| Error handling (6) | has_goto, goto_count, has_error_return, has_error_label, error_return_count, has_early_return |
| Semantic (13) | var_after_loop, iterator_modified_in_loop, list_iteration, list_del_in_loop, has_container_of, has_cast, cast_count, sizeof_type, sizeof_ptr, has_arithmetic, has_shift, has_copy, copy_count |
| Structural (11) | if_count, else_count, switch_count, case_count, loop_count, ternary_count, cyclomatic_complexity, max_nesting_depth, function_call_count, unique_functions_called, function_definitions |
The key bug-pattern features:
'unbalanced_refcount': 1,     # kref_get without kref_put → leak
'unbalanced_lock': 1,         # spin_lock without spin_unlock → deadlock
'has_deref_no_null_check': 0, # *ptr without if (!ptr) → null deref
'has_alloc_no_free': 0,       # kmalloc without kfree → memory leak
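For a sense of how cheap these features are to compute, here is a minimal sketch of the refcount pair; the regexes are illustrative, not the tool's exact patterns:

import re

GET_RE = re.compile(r"\b(?:\w+_get|refcount_inc|kref_get)\s*\(")
PUT_RE = re.compile(r"\b(?:\w+_put|refcount_dec|kref_put)\s*\(")

def refcount_features(diff_text: str) -> dict:
    """Count get/put calls on added lines only and flag a mismatch."""
    added = [line[1:] for line in diff_text.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    gets = sum(len(GET_RE.findall(line)) for line in added)
    puts = sum(len(PUT_RE.findall(line)) for line in added)
    return {
        "get_count": gets,
        "put_count": puts,
        "unbalanced_refcount": int(gets > puts),
    }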
Cross-attention learns conditional relationships. When CodeBERT sees locking patterns AND unbalanced_lock=1, that's HIGH risk. Neither signal alone is sufficient; it's the combination that matters.
# Feature fusion via cross-attention
feature_embedding = feature_projection(handcrafted_features)   # 51 → 768
attended, _ = cross_attention(
    query=code_embedding,        # What patterns does the code have?
    key=feature_embedding,       # What do the hand-crafted features say?
    value=feature_embedding,
)
fused = fusion_layer(torch.cat([code_embedding, attended], dim=-1))
3. Focal loss for hard examples. The training data is heavily imbalanced: most commits are safe. Standard cross-entropy wastes gradient updates on easy examples. Focal loss:
Standard loss when p=0.95 (easy): 0.05
Focal loss when p=0.95: 0.000125 (400x smaller)
The model focuses on ambiguous commits: the hard 5% that matter.
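In code, focal loss is a one-line modification of cross-entropy. A sketch with gamma = 2, which reproduces the roughly 400x down-weighting above (the exact gamma used in VulnBERT is an assumption on my part):

import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """(1 - p_t)^gamma * CE: confident, easy examples contribute almost no gradient."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t per example
    p_t = torch.exp(-ce)                                     # model's probability for the true class
    return ((1.0 - p_t) ** gamma * ce).mean()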
Impact of each component (estimated from ablation experiments):
| Component | F1 Score |
| --- | --- |
| CodeBERT baseline | ~76% |
| + Focal loss | ~80% |
| + Feature fusion | ~88% |
| + Contrastive learning | ~91% |
| Full VulnBERT | 95.4% |
Note: Individual component impacts are approximate; interactions between components make precise attribution difficult.
The key insight: neither neural networks nor hand-crafted rules alone achieve the best results. The combination does.
Results on temporal validation (train ≤2023, test 2024):
| Metric | Target | Result |
| --- | --- | --- |
| Recall | 90% | 92.2% ✓ |
| FPR | <10% | 1.2% ✓ |
| Precision | – | 98.7% |
| F1 | – | 95.4% |
| AUC | – | 98.4% |
What these metrics mean:
Recall (92.2%): Of all actual bug-introducing commits, we catch 92.2%. Missing 7.8% of bugs.
False Positive Rate (1.2%): Of all safe commits, we incorrectly flag 1.2%. Low FPR = fewer false alarms.
Precision (98.7%): Of commits we flag as risky, 98.7% actually are. When we raise an alarm, we're almost always right.
F1 (95.4%): Harmonic mean of precision and recall. Single number summarizing overall performance.
AUC (98.4%): Area under the ROC curve. Measures ranking quality: how well the model separates bugs from safe commits across all thresholds.
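For reference, the first four follow directly from the confusion matrix on the test set (AUC additionally needs the ranked scores):

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall, FPR, precision and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)              # bug-introducing commits caught
    fpr = fp / (fp + tn)                 # safe commits wrongly flagged
    precision = tp / (tp + fp)           # flagged commits that are real bugs
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "fpr": fpr, "precision": precision, "f1": f1}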
The model correctly differentiates the same bug at different stages:
| Commit | Description | Risk |
| --- | --- | --- |
| acf44a2361b8 | Fix for UAF in xe_vfio | 12.4% (LOW) |
| 1f5556ec8b9e | Introduced the UAF | 83.8% (HIGH) |
What the model sees: The 19-year bug
When analyzing the bug-introducing commit d205dc40798d:
-       if (ct == last) {
-               nf_conntrack_put(&last->ct_general);    // removed!
-       }
+       if (ct == last) {
+               last = NULL;
                continue;
        }
        if (ctnetlink_fill_info(...) < 0) {
                nf_conntrack_get(&ct->ct_general);      // still here
Extracted features:
| Feature | Value | Signal |
| --- | --- | --- |
| get_count | 1 | nf_conntrack_get() present |
| put_count | 0 | nf_conntrack_put() was removed |
| unbalanced_refcount | 1 | Mismatch detected |
| has_lock | 1 | Uses read_lock_bh() |
| list_iteration | 1 | Uses list_for_each_prev() |
Model prediction: 72% risk (HIGH)
The unbalanced_refcount feature fires because _put() was removed but _get() remains. Classic refcount leak pattern.
Limitations
Dataset limitations:
Only captures bugs with Fixes: tags (~28% of fix commits). Selection bias: well-documented bugs tend to be more serious.
Mainline only, doesn't include stable-branch-only fixes or vendor patches
Subsystem classification is heuristic-based (regex on file paths)
Bug type detection is based on keyword matching in commit messages; many bugs end up typed "unknown"
Lifetime calculation uses author dates, not commit dates; rebasing can skew timestamps
Some "bugs" may be theoretical (comments like "fix possible race" without confirmed trigger)
Model limitations:
92.2% recall is on a held-out 2024 test set, not a guarantee for future bugs
Can't catch semantic bugs (logic errors with no syntactic signal)
Cross-function blind spots (bug spans multiple files)
Training data bias (learns patterns from bugs that were found, novel patterns may be missed)
False positives on intentional patterns (init/cleanup in different commits)
Tested only on Linux kernel code, may not generalize to other codebases
Statistical limitations:
Survivorship bias in year-over-year comparisons (recent bugs can't have long lifetimes yet)
Correlation ≠ causation for subsystem/bug-type lifetime differences
What this means: VulnBERT is a triage tool, not a guarantee. It catches 92% of bugs with recognizable patterns. The remaining 8% and novel bug classes still need human review and fuzzing.
What's next
92.2% recall with 1.2% FPR is production-ready. But there's more to do:
RL-based exploration: Instead of static pattern matching, train an agent to explore code paths and find bugs autonomously. The current model predicts risk; an RL agent could generate triggering inputs.
Syzkaller integration: Use fuzzer coverage as a reward signal. If the model flags a commit and Syzkaller finds a crash in that code path, that's strong positive signal.
Subsystem-specific models: Networking bugs have different patterns than driver bugs. A model fine-tuned on netfilter might outperform the general model on netfilter commits.
The goal isn't to replace human reviewers but to point them at the 10% of commits most likely to be problematic, so they can focus attention where it matters.
Reproducing this
The dataset extraction uses the kernel's Fixes: tag convention. Here's the core logic:
import re
from typing import Optional

def extract_fixes_tag(commit_msg: str) -> Optional[str]:
    """Extract the introducing commit hash from a Fixes: tag"""
    pattern = r'Fixes:\s*([a-f0-9]{12,40})'
    match = re.search(pattern, commit_msg, re.IGNORECASE)
    return match.group(1) if match else None
# Mine all Fixes: tags from git history
git log --since="2005-04-16" --grep="Fixes:" --format="%H"
# For each fixing commit:
# - Extract introducing commit hash
# - Get dates from both commits
# - Calculate lifetime
# - Classify subsystem from file paths
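Subsystem classification is just first-match path rules. A sketch; the pattern list below is a small illustrative subset, not the miner's actual 70+ rules:

import re

SUBSYSTEM_PATTERNS = [
    (re.compile(r"^net/netfilter/"),   "netfilter"),
    (re.compile(r"^net/sctp/"),        "networking/sctp"),
    (re.compile(r"^net/ipv4/"),        "networking/ipv4"),
    (re.compile(r"^net/"),             "networking"),
    (re.compile(r"^drivers/net/can/"), "drivers/can"),
    (re.compile(r"^drivers/gpu/"),     "gpu"),
    (re.compile(r"^kernel/bpf/"),      "bpf"),
    (re.compile(r"^mm/"),              "memory"),
]

def classify_subsystem(changed_paths: list[str]) -> str:
    """First matching rule wins; unmatched commits fall through to 'other'."""
    for path in changed_paths:
        for pattern, name in SUBSYSTEM_PATTERNS:
            if pattern.match(path):
                return name
    return "other"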
Full miner code and dataset: github.com/quguanni/kernel-vuln-data
TL;DR
125,183 bugs analyzed from 20 years of Linux kernel git history (123,696 with valid lifetimes)
Average bug lifetime: 2.1 years (2.8 years in 2025-only data due to survivorship bias in recent fixes)
0% → 69% of bugs found within 1 year (2010 vs 2022), real improvement from better tooling
13.5% of bugs hide for 5+ years (these are the dangerous ones)
Race conditions hide longest (5.1 years average)
VulnBERT catches 92.2% of bugs on held-out 2024 test set with only 1.2% FPR (98.4% AUC)
Dataset: github.com/quguanni/kernel-vuln-data
If you're working on kernel security, vulnerability detection, or ML for code analysis, I'd love to talk: jenny@pebblebed.com