
Writing better postmortems for downstream cascade failures

Leo Brandão · March 4, 2026 · 8 min read

[Figure: incident timeline showing a downstream cascade failure pattern across services]

Downstream cascade incidents have a structural property that most postmortem templates aren't built for: the service that caused the incident is not the service that experienced the pain. If your postmortem process starts with "what service failed," you're already asking the wrong question for this class of incident.

Why standard postmortem templates fail for cascades

The standard postmortem format asks: what service failed, when did it fail, what triggered it, and what do we do to prevent it? This works well for self-contained failures — a memory leak in service A, a bad database migration, a misconfigured connection pool. The blast radius is contained to the service that failed.

It breaks down for cascades because in a cascade, the answer to "what service failed" is genuinely multiple services across multiple teams. The triggering change is in service A. The first observable error rate spike is in service B. The user-visible impact is in service C. The engineer who gets paged at 2am is on the team that owns service D, which has nothing to do with the root cause: they own an endpoint that calls service C synchronously, and C started timing out.

Here's a concrete version of this pattern: a platform team running 55 services across four squads. The infrastructure squad merges a change to user-profile that removes the legacyTierId field from the user object. The field has been deprecated for two quarters, is internal only, and has no documented consumers. notification-svc doesn't declare a dependency on user-profile in any manifest, but it has been constructing notification payloads by fetching the user object and reading legacyTierId for a templating branch added eight months ago. After the deploy, notification-svc starts returning 500s on any notification that hits that code path. order-processor calls notification-svc synchronously on every checkout, so checkout success rate drops by 12%. The on-call for order-processor is paged. It takes 45 minutes to trace the chain back to the user-profile field removal.

Postmortems written without a dependency frame produce action items that are technically correct but structurally incomplete: "add better error handling in notification-svc," "make the checkout notification call async." These reduce the failure mode for that specific incident but don't address the underlying condition: an undocumented dependency that allowed a routine cleanup to cascade into a checkout outage.

The four cascade-specific things you need to document

A postmortem for a downstream cascade needs to capture context that standard templates don't prompt for. Each of these is load-bearing for the action items that follow.

The dependency chain, not just the blast radius

Document the full directed path from the triggering change to each affected service. This is different from documenting which services were impacted — a blast radius list is a set; a dependency chain is an ordered path with edges.

The chain format: "service A's contract change propagated to service B because B consumes A's chargeCompleted event schema on the payments.completed Kafka topic; B's error rate increase caused C to fail because C polls B's status endpoint synchronously with a 500ms timeout; C's timeouts propagated to D because D depends on C's response for user-facing checkout rendering." Each edge has a type: API call, event subscription, SDK dependency. Naming the edge type matters because different edge types have different circuit-breaker options and different tooling for contract validation.

The chain tells you where the cascade could have been interrupted — not just at the triggering service, but at each edge where a circuit breaker, schema validation gate, or consumer contract test could have absorbed the failure before it propagated.
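One way to keep the chain an ordered path rather than a set is to record it as a list of typed edges and check that each edge starts where the previous one ended. The sketch below is illustrative only: the Edge and EdgeType names are hypothetical, and the type assigned to the C to D edge is an assumption, since the chain description does not state it.

```python
from dataclasses import dataclass
from enum import Enum


class EdgeType(Enum):
    API_CALL = "API call"
    EVENT_SUBSCRIPTION = "event subscription"
    SDK_DEPENDENCY = "SDK dependency"


@dataclass(frozen=True)
class Edge:
    producer: str        # service whose contract or behavior changed
    consumer: str        # service that failed as a result
    edge_type: EdgeType
    detail: str          # what the consumer relies on across this edge


def render_chain(edges: list[Edge]) -> str:
    """Render the chain in order; raise if the edges do not form one connected path."""
    for prev, nxt in zip(edges, edges[1:]):
        if prev.consumer != nxt.producer:
            raise ValueError(f"chain breaks between {prev.consumer} and {nxt.producer}")
    return "\n".join(
        f"{i}. {e.producer} -> {e.consumer} [{e.edge_type.value}]: {e.detail}"
        for i, e in enumerate(edges, start=1)
    )


chain = [
    Edge("A", "B", EdgeType.EVENT_SUBSCRIPTION,
         "chargeCompleted schema on the payments.completed topic"),
    Edge("B", "C", EdgeType.API_CALL,
         "status endpoint polled synchronously with a 500ms timeout"),
    # Edge type assumed for illustration; the text only says D depends on C's response.
    Edge("C", "D", EdgeType.API_CALL,
         "response needed for user-facing checkout rendering"),
]

print(render_chain(chain))
```

Because every edge carries a type, the same list can later be scanned for the edges where a circuit breaker or contract test would have applied.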

The contract state at time of incident

For the triggering change: what exactly changed in the contract? Field removed, field renamed, type changed, endpoint removed? What version did the change ship in? Was a deprecation annotation present (deprecated: true in OpenAPI, [deprecated = true] on the field in proto3)? How long was the deprecation window?

For each consuming service in the chain: was the dependency declared anywhere — in an OpenAPI import, a proto dependency declaration, a buildpath.yaml manifest, a Backstage catalog entry? Was the consumer testing against the producer's schema in CI?

If the consuming service's dependency was undocumented — not declared in any manifest, not visible in any catalog — that fact belongs explicitly in the postmortem, formatted as a structural vulnerability finding rather than an operational detail. "Dependency between notification-svc and user-profile was undeclared in all schema registries at time of incident" is a different class of finding than "the deploying engineer didn't check Slack."

Detection lag at each hop

Document the time gap between the triggering change and when each downstream service started failing. Specifically: deploy time of triggering change; first error rate deviation observed in service B (and by whom); first alert firing; first page; time to identify root cause service; time to resolution.

This timeline reveals whether the cascade was actually detectable early and went undetected, or whether it only emerged once a specific traffic pattern hit the new code path. A 45-minute detection lag on an incident that happened at peak traffic suggests the monitoring threshold was too high. A 45-minute lag on an incident that happened at 3am on a low-traffic path suggests the cascade was genuinely dormant until a specific condition was met — which has different implications for prevention.
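The per-hop lag can be computed mechanically once the milestones are written down. The following is a minimal sketch: the function name is hypothetical and the timestamps are invented for illustration, not taken from a real incident.

```python
from datetime import datetime, timedelta


def detection_lags(milestones):
    """Return (label, lag since first milestone, lag since previous milestone) in time order."""
    ordered = sorted(milestones, key=lambda m: m[1])
    start = ordered[0][1]
    previous = start
    rows = []
    for label, ts in ordered:
        rows.append((label, ts - start, ts - previous))
        previous = ts
    return rows


# Invented example timeline; replace with the timestamps from your incident.
timeline = [
    ("triggering change deployed", datetime(2026, 1, 1, 14, 0)),
    ("first error rate deviation in service B", datetime(2026, 1, 1, 14, 3)),
    ("first alert fired", datetime(2026, 1, 1, 14, 18)),
    ("first page", datetime(2026, 1, 1, 14, 20)),
    ("root cause service identified", datetime(2026, 1, 1, 15, 5)),
    ("resolution", datetime(2026, 1, 1, 15, 20)),
]

for label, since_deploy, since_previous in detection_lags(timeline):
    print(f"{label:<42} +{since_deploy} (hop: {since_previous})")
```

Reading the hop column next to the traffic level at the time helps separate "detectable but missed" from "dormant until a specific path was hit".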

Which team was paged first, and whether they had the context to diagnose

In multi-squad organizations, the on-call for service D may have no knowledge of service A or the change that triggered the cascade. Document whether the team that was paged first had sufficient context to diagnose the incident without escalating to the team that made the triggering change. If escalation was required, record how long it took. That's a finding about on-call escalation path design, not just about the technical fix.

The question to ask: if the paged engineer had access to a live dependency graph filtered to services in the cascade chain, would they have been able to trace the root cause without an escalation call? If yes, that's an action item about tooling access. If no, it's an action item about on-call rotation design across squads.
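To make that question concrete, here is a rough sketch of the lookup a paged engineer would need: walk upstream from the paged service and report any path that reaches a recently changed service. The graph shape, the function name, and inventory-svc are assumptions made for illustration. The walk only finds the root cause if the graph contains the real edge, including the one notification-svc never declared.

```python
from collections import deque


def paths_to_recent_changes(depends_on, paged_service, recently_changed):
    """Breadth-first walk upstream from the paged service.

    depends_on maps each service to the services it calls or consumes.
    Returns one shortest path per recently changed service that is reachable.
    """
    paths = []
    seen = {paged_service}
    queue = deque([[paged_service]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in recently_changed and node != paged_service:
            paths.append(path)
        for upstream in depends_on.get(node, []):
            if upstream not in seen:
                seen.add(upstream)
                queue.append(path + [upstream])
    return paths


# Hypothetical graph; inventory-svc is invented to show an unrelated branch.
depends_on = {
    "order-processor": ["notification-svc", "inventory-svc"],
    "notification-svc": ["user-profile"],
}

for path in paths_to_recent_changes(depends_on, "order-processor", {"user-profile"}):
    print(" -> ".join(path))
```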

Template structure: cascade-specific additions

The following additions to your existing postmortem template capture what standard formats miss. These are not replacements for your existing sections — they're inserts:

Incident summary — Lead with the triggering service and triggering change, not the paged service. "User-profile field removal triggered notification-svc errors, causing synchronous checkout failures" is the correct framing. "Order-processor SEV-2" is the wrong one — it names the pain point, not the cause.

Dependency chain section — A directed list from triggering change to each affected consumer, with edge types. If you have a dependency graph tool, include the graph snapshot. If not, reconstruct manually and note that the manual reconstruction is itself a finding.

Contract state finding — For each edge in the chain: was the dependency declared? Was deprecation signaled? Was a consumer contract test present? Three yes/no columns. Each "no" is a structural gap.
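The three yes/no columns can be kept as structured data so that each "no" turns into a listed gap automatically. This is a sketch with hypothetical names. Only the undeclared notification-svc dependency and the existing deprecation of legacyTierId come from the example incident; every other value below is assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeContractState:
    producer: str
    consumer: str
    dependency_declared: bool
    deprecation_signaled: bool
    contract_test_present: bool


def structural_gaps(rows):
    """Return one line per 'no' answer across all edges in the chain."""
    gaps = []
    for r in rows:
        checks = (
            ("dependency declared", r.dependency_declared),
            ("deprecation signaled", r.deprecation_signaled),
            ("consumer contract test present", r.contract_test_present),
        )
        for label, ok in checks:
            if not ok:
                gaps.append(f"{r.consumer} -> {r.producer}: {label} = no")
    return gaps


rows = [
    # Contract test value is assumed; the incident description does not state it.
    EdgeContractState("user-profile", "notification-svc", False, True, False),
    # All three values on this edge are assumed for illustration.
    EdgeContractState("notification-svc", "order-processor", True, True, False),
]

for gap in structural_gaps(rows):
    print(gap)
```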

Pre-deploy detection analysis — If static dependency analysis had been run against the triggering change at PR time, would it have surfaced the impacted consumers? If yes: the action item is adding that analysis to the triggering service's CI pipeline. If no (because the dependency was undeclared): the action item is adding the declaration, then the analysis.
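The two branches of that question can be shown with a small sketch of a PR-time check: diff the contract, then intersect the removed fields with what each consumer has declared it reads. This is not a description of any particular tool. The function names, the flat schema shape, and the id and email fields are assumptions for illustration.

```python
def removed_fields(old_schema: dict, new_schema: dict) -> set:
    """Top-level fields present in the old contract but missing from the new one."""
    return set(old_schema) - set(new_schema)


def impacted_consumers(removed: set, declared_usage: dict) -> dict:
    """Map each consumer to the removed fields it has declared that it reads."""
    return {
        consumer: fields & removed
        for consumer, fields in declared_usage.items()
        if fields & removed
    }


# Hypothetical flat schemas; only legacyTierId comes from the example incident.
old_user = {"id": "string", "email": "string", "legacyTierId": "string"}
new_user = {"id": "string", "email": "string"}
removed = removed_fields(old_user, new_user)

# Branch 1: the dependency is undeclared, so the analysis finds nothing.
print(impacted_consumers(removed, {}))

# Branch 2: once notification-svc declares what it reads, the same check flags it.
declared = {"notification-svc": {"id", "legacyTierId"}}
print(impacted_consumers(removed, declared))
```

The empty result in the first branch is why the declaration has to come before the analysis: the check can only report consumers it has been told about.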

Action items — Named owner, named service, named artifact. "Payment Squad to add dependency declarations to notification-svc OpenAPI consumer manifest before next sprint" is a good action item. "Improve cross-team communication" is a postmortem tradition that prevents exactly nothing.

The action item that most cascade postmortems don't write

There's a class of action item that doesn't get generated from cascade postmortems because most templates don't prompt for it: adding pre-deploy dependency impact analysis to the triggering service's CI pipeline.

We're not saying this is the only action item. Better circuit breakers, async communication patterns, consumer contract tests — all of those are valid. What we're saying is that the action item addressing the structural cause of the cascade is specifically this: the team that shipped the triggering change didn't have tooling that told them which downstream services consumed their contract. That's not a process failure — it's a tooling gap. And a process action item ("coordinate better") doesn't close a tooling gap.

If your postmortems keep turning up cascade incidents with the same root cause pattern (an undocumented downstream consumer, a contract change without consumer awareness), the template is missing the prompt. Add the pre-deploy detection analysis section. It prompts for the right action items, and completing those action items is what removes the pattern.

