Imagine hunting for a missing checkout in a sea of 130,000 log lines a millisecond after midnight.
People stare at timestamps and vague notices and conclude their logs have lied to them. The logs are not malicious; they simply do not tell the whole truth. Hours pass while developers grep through the noise, hunting for why a user cannot check out, why a webhook failed, or why p99 latency spiked at 3 a.m. Nothing useful remains except blanks and silence.
It is not anyone's fault. Logging, as practiced, is broken, and OpenTelemetry, slapped onto code, will not fix it. I will show you what is wrong and how to correct it.
Logs were made for the mono-server days; they were designed for 2005. Today a single request touches fifteen services, three databases, two caches, and a message queue, yet the logs still behave as if nothing has changed.
Typical logging narrates every step a request takes.
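Concretely, a hypothetical checkout handler (all names and messages illustrative) might log like this:

```javascript
// Hypothetical checkout handler, logging the way most services do:
// one console.log per step.
function handleCheckout(userId, cartId) {
  console.log(`Checkout started for user ${userId}`);
  console.log(`Loading profile for user ${userId}`);
  console.log(`Profile loaded`);
  console.log(`Fetching cart ${cartId}`);
  console.log(`Cart has 3 items`);
  console.log(`Validating coupon SAVE20`);
  console.log(`Coupon valid`);
  console.log(`Calculating totals`);
  console.log(`Total: $159.99`);
  console.log(`Charging card via stripe`);
  console.log(`Payment authorized`);
  console.log(`Order created`);
  console.log(`Checkout complete for user ${userId}`);
}

handleCheckout("user-123", "cart-xyz");
```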
Thirteen lines for one success. Multiply by ten thousand concurrent users and you get one hundred thirty thousand lines per second, most conveying nothing useful, and none of them helpful when a problem actually arises.
Context is missing, and string search is broken. When a user says "I can't finish my purchase," you type their email or ID into the search box. But string search treats logs as a jumble of characters; it understands neither structure nor correlation.
Searching for "user-123" yields matches in many formats:
user-123
user_id=user-123
{"userId": "user-123"}
[USER:user-123]
processing user: user-123
But downstream services might only log an order ID, so you search again. And again. And again. You are playing detective with one hand tied behind your back.
Logs are optimized for writing, not for querying. Developers write console.log("Payment failed") because it feels easy in the moment; nobody thinks about how hard it will be for the poor soul searching through the output at 2 a.m. during an outage.
Let's define some terms.
Structured logging emits keyâvalue pairs, usually JSON.
An example: {"event": "payment_failed", "user_id": "123"}.
It replaces plain sentences like "Payment failed for user 123."
Structured logging is necessary, not sufficient.
Cardinality is the number of unique values a field can have: user_id has high cardinality, http_method has low cardinality. High-cardinality fields are what make logs useful for debugging. Dimensionality is the number of fields in a log event: five fields is low dimensionality, fifty is high. More dimensions mean more questions you can ask.
A wide event is a single, rich log event per request, per service: instead of thirteen lines, you emit one with fifty fields. Canonical log line is another term for the same idea, popularised by Stripe: one line per request that serves as the authoritative record.
OpenTelemetry standardises how telemetry data is collected and exported. It does not decide what to log, it does not add business context, and it does not fix your mental model. OpenTelemetry handles the delivery, but you must still instruct it to include the subscription tier or the cart value.
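To make that concrete, here is a minimal sketch using a stand-in span object rather than a real OpenTelemetry SDK (real spans expose a setAttribute(key, value) method of the same shape; the attribute keys below are illustrative):

```javascript
// Stand-in for a tracing span; real OpenTelemetry spans expose
// setAttribute(key, value) with the same shape.
function makeSpan() {
  const span = { attributes: {} };
  span.setAttribute = (key, value) => { span.attributes[key] = value; };
  return span;
}

const span = makeSpan();

// What auto-instrumentation typically records for you:
span.setAttribute("http.method", "POST");
span.setAttribute("http.status_code", 500);

// The business context OpenTelemetry will happily carry -- but only
// if you attach it yourself (keys are illustrative):
span.setAttribute("user.subscription", "premium");
span.setAttribute("cart.total_cents", 15999);
```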
The fix is wide events. Stop logging what your code is doing and start logging what happened to this request. Think of logs as business event records, not debugging diaries: emit one wide event per service hop, and include every piece of context that might be useful.
Example wide event:
{
  "timestamp": "2025-01-15T10:23:45.612Z",
  "request_id": "req_8bf7ec2d",
  "trace_id": "abc123",
  "service": "checkout-service",
  "version": "2.4.1",
  "deployment_id": "deploy_789",
  "region": "us-east-1",
  "method": "POST",
  "path": "/api/checkout",
  "status_code": 500,
  "duration_ms": 1247,
  "user": {
    "id": "user_456",
    "subscription": "premium",
    "account_age_days": 847,
    "lifetime_value_cents": 284700
  },
  "cart": {
    "id": "cart_xyz",
    "item_count": 3,
    "total_cents": 15999,
    "coupon_applied": "SAVE20"
  },
  "payment": {
    "method": "card",
    "provider": "stripe",
    "latency_ms": 1089,
    "attempt": 3
  },
  "error": {
    "type": "PaymentError",
    "code": "card_declined",
    "message": "Card declined by issuer",
    "retriable": false,
    "stripe_decline_code": "insufficient_funds"
  },
  "feature_flags": {
    "new_checkout_flow": true,
    "express_payment": false
  }
}
With that one event, you have everything. If the user complains, you search for their user_id, and the event reveals the subscription tier, the lifetime value and account age, the payment attempt number and the card decline reason, and the feature flags that were active. All of it in a single look: no more grep, no guessing, no second search.
Wide events change the query from text matching to structured data. The data becomes analytics-ready: you can aggregate it for dashboards and debug with precision.
Implementing wide events requires middleware that accumulates context. The middleware records the start time, creates an event with the request details, and exposes it to handlers. Handlers enrich the event as the request flows, adding business context: the user profile, cart details, payment results; if a payment fails, they attach the error details. When processing completes, the middleware captures the status, outcome, and duration, and the event is logged once, as a single JSON line, at the end.
Sampling keeps costs manageable. At fifty fields per request and ten thousand requests per second, storage can explode, so you record only a fraction of traffic, perhaps 10% or 1%. But random sampling can miss a critical outage. Tail sampling instead decides after a request completes, when the outcome is known, with rules like: keep every error, keep slow requests, keep VIP users, keep flagged sessions, and randomly sample the rest.
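A sketch of such a decision function, applying the rules above (thresholds and field names are illustrative and follow the example event):

```javascript
// Hypothetical tail-sampling decision, evaluated after the request
// completes, when the full wide event is available.
function shouldKeep(event, sampleRate = 0.01) {
  if (event.status_code >= 500) return true;               // keep every error
  if (event.duration_ms > 1000) return true;               // keep slow requests
  if (event.user?.subscription === "premium") return true; // keep VIP users
  if (event.feature_flags?.new_checkout_flow) return true; // keep flagged sessions
  return Math.random() < sampleRate;                       // sample the rest
}
```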
Structured logging is not the same as wide events. Structured logging is just JSON; wide events are a philosophy: one comprehensive event with dozens of fields, not five, carrying the user lifecycle, cart, payment, errors, and feature flags.
OpenTelemetry is delivery, not decision-making. Many OpenTelemetry setups capture only the span name, duration, and status, and that is insufficient. Tracing shows the flow between services; wide events show the context within each service. They complement each other.
Logs and metrics are not separate concerns: both can be powered by the same wide events. Query them for debugging, aggregate them for dashboards. And high-cardinality data is not automatically slow; modern columnar storage handles it efficiently.
The payoff is analytics. With wide events you can ask: "Show me all premium checkout failures in the last hour with the new flow enabled, grouped by error code," and get an answer in under a second. Root cause identified in minutes, not hours.
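That query can be sketched as a grouping over wide events (an in-memory stand-in for what a columnar store would execute server-side; the time-window filter is omitted, and field names follow the example event above):

```javascript
// Premium checkout failures with the new flow enabled, grouped by
// error code -- sketched over a plain array of wide events.
function premiumNewFlowFailuresByCode(events) {
  const counts = {};
  for (const e of events) {
    if (!e.error) continue;                            // failures only
    if (e.user?.subscription !== "premium") continue;  // premium users only
    if (!e.feature_flags?.new_checkout_flow) continue; // new flow enabled
    counts[e.error.code] = (counts[e.error.code] || 0) + 1;
  }
  return counts;
}
```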
Your logs no longer lie to you.
They now tell the complete truth.
This article invites you to examine your own stack. Answer a few questions and get a personalized report: what is working, what to log and what to stop logging, which tools bring value, and quick wins for this week. Questions? Share your logging nightmares, and check your inbox for the detailed analysis.