Uncovering a Faulty HTTP Retry in Rust: Lessons from the Bors Merge Bot
During the rigorous integration testing of the Bors GitHub merge bot, a mysterious emptyâbody failure exposed a subtle bug in the octocrab crate's retry mechanism. By dissecting the request cloning logic, the issue was traced to a shallowâclone of the request body, which swallowed the payload on retries. Fixing the clone to perform a true deep copy (or avoiding retries for unbuffered bodies) restored correct behavior and revealed broader insights about handling HTTP bodies, retry design, and debugging Rust dependencies.
### Background
The Bors merge bot is a lightweight, opinionated GitHub bot that automates PR processing as part of the **bors** project. Recent work focused on moving the bot into a productionâready configuration: raceâcondition fixes, a more robust HTTP client, and a dramatically expanded integrationâtest suite. The test harness is intentionally heavyâweightâeach test boots a full Bors web instance, connects to a live PostgreSQL database, and drives the bot using real GitHub webhook emulation via WireMock.
### The Mysterious Failure
After a large refactor, a handful of tests started to panic within the mocked GitHub endpoint: a PATCH request that should carry a JSON body returned empty, causing a deserialization failure.
```rust
let data: SetRefRequest = req.body_json().unwrap(); // panics
```
The bot uses **octocrab** to send the PATCH. Code that builds the request clearly includes a body:
```rust
let res = repo
.client()
._patch(
url.as_str(),
Some(&serde_json::json!({
"sha": sha.as_ref(),
"force": matches!(force, ForcePush::Yes),
})),
).await?;
```
The body was present on the first server hit, but disappeared on subsequent retries, which the bot executed automatically because the server replied with 5xx codes. The problem surfaced only after several retries, a phenomenon that made the bug look like a subtle race in the botâs own code.
### Diving into Dependencies
Because almost all Rust code is safe, the initial hypothesis was that the issue lied inside a dependency, but not necessarily in the bot itself. The debugging flow progressed as follows:
1. **WireMock** was inspected by cloning the crate locally and sprinkling debug logs. No anomalies surfaced.
2. **Hyper** (used by WireMock) never modified the incoming request bodies, ruling out a serverâside bug.
3. **Wireshark** inspection of the local HTTP traffic confirmed the client sent an empty body on retries.
4. Disabling octocrabâs builtâin retry mechanism (`add_retry_config(RetryConfig::None)`) eliminated the emptyâbody problem.
Thus the culprit was octocrabâs retry policy.
### Root Cause: Shallow Body Clone
Octocrabâs request body type, `OctoBody`, was defined as:
```rust
#[derive(Debug)]
pub struct OctoBody(Arc>);
```
The `Clone` implementation merely cloned the `Arc`, producing a *shallow copy* that shared the same `RwLock` instance.
```rust
impl Clone for OctoBody {
fn clone(&self) -> Self {
OctoBody(Arc::clone(&self.0))
}
}
```
The requestâbody is consumed on the first send: Hyper polls the body via `is_end_stream`, returning `true` once consumed. On retry, the same body is reused; Hyper sees the `is_end_stream` flag as `true` and sends an empty body, leading to the observed failures.
This bug hinges on two Rust idioms that often clash: *referenceâcounted* data (*Arc*) and *interior mutability* (*RwLock*). If the body had no interior mutability, a shallow clone would merely yield a second reference to the same buffer, incurring a negligible correctness impact. The combination, however, caused the logical state to be lost on retry.
### Fix Implemented in octocrab 0.49.1
The fix introduces a dedicated, byteâsized, immutable copy of the body used only for retries:
```rust
#[derive(Debug)]
pub struct OctoBody {
body: Arc>, // original body
buffered: Option, // deep copy for retries
}
```
During construction, when the underlying payload is available as a `Vec`, a `Bytes` clone is stored in `buffered`. The `try_clone` helper then performs a *deep* copy of this buffer for use in a new request:
```rust
fn try_clone(&self) -> Option {
self.buffered.as_ref().map(|b| {
Self::create(http_body_util::Full::from(b.clone()), Some(b.clone()))
})
}
```
If a request body cannot be buffered (e.g. a streaming body), the retry mechanism simply aborts the retry, avoiding the emptyâbody mishap.
### Implications for the Bors Project
By disabling octocrabâs retry policy in Bors, the upstream bug was masked, but the bot still suffered from doubleâretry logicâonce from its own retry mechanism and once from octocrab. Removing octocrabâs automatic retries both restored correct behavior and simplified the test logic.
The incident also highlighted the risk of heavy integration tests: the Bors test harness triggers a rare code path (retrying a PATCH with a body) that is uncommon in everyday bot usage but surfaced because the tests explicitly simulate transient failures.
### Lessons Learned
1. **Trust but Verify** â Even wellâreviewed crates can harbor subtle bugs in edge cases; a disciplined faultârecovery strategy in tests helps surface them.
2. **Understand Body Ownership** â HTTP bodies are often streamed; shallow cloning of body objects that use interior mutability can break retries.
3. **Avoid Implicit Retries** â Thirdâparty clients may enable retries behind the scenes. If your business logic already implements retries, consider disabling autoâretries to eliminate duplicate effort.
4. **LLM Assistance** â The author tested a modern language model to locate the bug. While the model quickly flagged the retry bug and the shallow clone issue, it got sidetracked by misinterpretations of dependency features. Human guidance remains essential.
### Conclusion
The Bors merge bot story illustrates how a seemingly innocuous bugâthe absence of a request body after an automatic retryârevealed a deeper flaw in the octocrab crateâs body cloning logic. The fix not only solved two bugs in Bors but also improved the robustness of octocrab for all users. Such lessons reinforce the importance of comprehensive integration testing, careful design of retry strategies, and a nuanced understanding of Rustâs ownership and concurrency primitives.
For those facing similar challenges, consider inspecting how your HTTP client clones request bodies and whether your retry logic overlaps with the clientâs builtâin mechanisms. Sharing findingsâlike the Bors team didâstrengthens the ecosystem and benefits everyone who relies on these crates in production.