Risk scoring for code changes sounds straightforward until you try to define what "risk" actually means in a platform engineering context. A 3,000-line PR that adds extensive tests and only touches one isolated service might carry lower actual risk than a 40-line change to a shared authentication library. Line count is a signal, but it's a weak one. Platform teams that operationalize risk scoring well have moved past surface metrics to something closer to graph-aware impact estimation.
This post is about what factors genuinely predict downstream incidents, which signals are noise, and how to wire a scoring model into your PR workflow without creating so much friction that engineers route around it.
What actually predicts incidents
The most useful way to validate a risk signal is to look backward: given a set of historical incidents, which properties of the triggering change were measurably different from changes that didn't cause incidents?
When platform teams do this analysis, a few signals consistently emerge. Blast radius (the number of services in the transitive dependent set of the changed service) correlates more strongly with incident severity than with incident likelihood: high-blast-radius changes that do cause issues tend to cause bad ones. Changes to API contracts in services with many consumers have a higher incident rate than internal-only changes. Test coverage delta on the changed code is also a meaningful predictor: changes that reduce the percentage of covered lines on critical paths are disproportionately represented in postmortem data.
Signals that appear predictive but aren't: absolute PR size (too noisy), commit count (doesn't reflect change scope meaningfully), and author seniority (seniority predicts review attention, not change risk).
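The backward-looking comparison described above can be sketched in a few lines. This is a minimal illustration, not a statistical method: the `Change` record, its field names, and the sample history are all hypothetical, and a real analysis would need enough incidents per group for the ratio to mean anything.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One merged change from the historical record (hypothetical schema)."""
    reduced_critical_coverage: bool
    caused_incident: bool


def incident_rate(changes: Iterable[Change]) -> float:
    """Fraction of the given changes that triggered an incident."""
    items = list(changes)
    if not items:
        return 0.0
    return sum(c.caused_incident for c in items) / len(items)


def signal_lift(changes: Iterable[Change],
                has_signal: Callable[[Change], bool]) -> Optional[float]:
    """Incident rate with the signal divided by the rate without it.

    Returns None when the comparison group is empty or has no incidents,
    because the ratio is undefined there.
    """
    items = list(changes)
    with_signal = [c for c in items if has_signal(c)]
    without_signal = [c for c in items if not has_signal(c)]
    baseline = incident_rate(without_signal)
    if baseline == 0.0:
        return None
    return incident_rate(with_signal) / baseline


# Made-up history: 4 changes reduced coverage (2 incidents),
# 10 did not (1 incident).
history = (
    [Change(True, True)] * 2 + [Change(True, False)] * 2
    + [Change(False, True)] * 1 + [Change(False, False)] * 9
)
lift = signal_lift(history, lambda c: c.reduced_critical_coverage)
print(lift)
```

A lift well above 1 suggests the signal separates incident-causing changes from the rest; a lift near 1 is what a noisy signal such as absolute PR size would be expected to show.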
A practical four-factor model
A model that works in practice for growing platform teams typically weights four factors:
Blast radius (40% weight). Count the direct and transitive dependents of the changed service. A service whose dependents are three direct callers, one hop away, scores differently than a shared utility called by thirty. This factor requires an accurate dependency graph to compute correctly; without one, you're estimating blast radius from memory or documentation, which is unreliable.
Change volume on critical paths (25% weight). Not total lines changed — specifically lines changed on code paths that handle interface contracts, configuration, or error handling. A change that only touches logging boilerplate is not equivalent in risk to a change of the same size that modifies request validation logic.
Test coverage delta (20% weight). Did this PR increase or decrease the percentage of covered lines on the changed service? A PR that adds a feature without tests, especially on a high-blast-radius service, warrants higher scrutiny. Coverage delta rather than absolute coverage avoids penalizing legacy services with historically low coverage that can't be fully addressed in a single PR.
Dependency depth (15% weight). How many hops is this service from the root of the overall graph? Services near the root of the dependency tree (those that others depend on but that have few dependencies of their own) tend to have outsized impact when they change. A shared auth service at depth 1 with 15 downstream consumers is more sensitive than a leaf analytics service at depth 6.
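A rough sketch of how the four factors above might be combined. The weights come from the model as described; everything else is an assumption made for illustration: the `depends_on` graph shape, the depth convention (1 for a service with no dependencies), the normalization of each factor to a 0..1 range, and the placeholder values for the two factors that need diff and coverage data.

```python
from collections import deque
from typing import Dict, Mapping, Set

# Weights taken from the four-factor model described above.
WEIGHTS: Dict[str, float] = {
    "blast_radius": 0.40,
    "critical_path_volume": 0.25,
    "coverage_delta": 0.20,
    "dependency_depth": 0.15,
}

Graph = Mapping[str, Set[str]]  # service -> services it depends on


def transitive_dependents(depends_on: Graph, service: str) -> Set[str]:
    """All services that depend on `service`, directly or indirectly."""
    # Invert the edges so we can walk from a service to its dependents.
    dependents: Dict[str, Set[str]] = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    seen: Set[str] = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, set()):
            if dependent != service and dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def dependency_depth(depends_on: Graph, service: str) -> int:
    """1 for a service with no dependencies, else 1 + its deepest dependency.

    Assumes the graph is acyclic; a cycle would recurse without end.
    """
    deps = depends_on.get(service, set())
    if not deps:
        return 1
    return 1 + max(dependency_depth(depends_on, dep) for dep in deps)


def risk_score(factors: Mapping[str, float]) -> float:
    """Weighted sum of factors, each pre-normalized to 0..1 (1 = riskiest)."""
    for name in WEIGHTS:
        if not 0.0 <= factors[name] <= 1.0:
            raise ValueError(f"{name} must be within 0..1, got {factors[name]}")
    total = sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
    return round(100 * total, 1)


# Hypothetical five-service graph.
graph: Graph = {
    "auth": set(),
    "api": {"auth"},
    "billing": {"auth"},
    "web": {"api"},
    "analytics": {"web"},
}

radius = len(transitive_dependents(graph, "auth"))
score = risk_score({
    "blast_radius": radius / (len(graph) - 1),                 # assumed normalization
    "critical_path_volume": 0.5,                               # placeholder input
    "coverage_delta": 0.5,                                     # placeholder input
    "dependency_depth": 1 / dependency_depth(graph, "auth"),   # assumed normalization
})
print(radius, score)
```

The normalization step is where most of the team-specific judgment lives. Dividing by the size of the graph, as done here, is only one option, and it behaves very differently on an 8-service graph than on an 80-service one.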
Calibrating thresholds for your team
Score bands — low, medium, high — need calibration against your specific graph topology and incident history. A team with 8 services where the largest service has 4 dependents will need different thresholds than a team with 80 services where the auth layer has 30 dependents.
A reasonable starting calibration: LOW (0–30) requires no additional steps beyond standard review. MEDIUM (31–65) prompts a suggested review from the service's downstream owners. HIGH (66–100) blocks merge pending explicit acknowledgment from a senior engineer who reviews the downstream impact list.
We're not saying a HIGH score means the PR is bad or should be rejected. We're saying it means the risk surface is large enough that the standard review process — designed for average changes — may not provide adequate coverage for this one. The score gates attention, not approval.
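The starting bands above map directly onto a small lookup. A minimal sketch, assuming scores on a 0 to 100 scale; the `Band` enum, the `classify` function, and the action strings are illustrative names, and the thresholds are parameters because they are expected to be recalibrated.

```python
from enum import Enum


class Band(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Actions paraphrased from the starting calibration above.
ACTIONS = {
    Band.LOW: "standard review only",
    Band.MEDIUM: "suggest review from downstream owners",
    Band.HIGH: "block merge until a senior engineer acknowledges the impact list",
}


def classify(score: float, medium_floor: float = 31, high_floor: float = 66) -> Band:
    """Map a 0..100 score to a band using the given lower bounds."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0..100, got {score}")
    if score >= high_floor:
        return Band.HIGH
    if score >= medium_floor:
        return Band.MEDIUM
    return Band.LOW


band = classify(72)
print(band.name, "->", ACTIONS[band])
```

Note that the HIGH action is an acknowledgment step, not a rejection, which matches the point that the score gates attention rather than approval.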
Wiring the score into your PR flow
A risk score that lives in a dashboard nobody reads does nothing. The score needs to appear where engineers already look: in the PR itself, as a required status check.
The practical implementation is a GitHub Actions workflow (or equivalent in your CI system) that calls a dependency analysis API on each push to a PR branch, receives a score in the response, and reports a named status check — for example, buildpath/risk-score — with the score and a brief explanation of the contributing factors. Branch protection rules can require this check to pass before merge is available.
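One way the reporting step might look, sketched as a script a CI job could run after it has obtained a score. It uses GitHub's commit status REST endpoint (`POST /repos/{owner}/{repo}/statuses/{sha}`). The function names, the description format, and the choice to report an unacknowledged HIGH score as `pending` are assumptions for illustration; the call to the dependency analysis API itself is left out because its shape is not specified here.

```python
import json
import urllib.request
from typing import Dict, List

STATUS_CONTEXT = "buildpath/risk-score"  # check name used in the text above


def build_status_payload(score: float, band: str, top_factors: List[str],
                         acknowledged: bool = False) -> Dict[str, str]:
    """Build the JSON body for a commit status.

    Assumption: an unacknowledged HIGH score is reported as "pending" so a
    required check keeps the merge blocked; everything else is "success".
    """
    blocked = band == "HIGH" and not acknowledged
    description = f"{band} ({score:g}/100): " + ", ".join(top_factors)
    return {
        "state": "pending" if blocked else "success",
        "context": STATUS_CONTEXT,
        # Keep the description short; truncated defensively to 140 characters.
        "description": description[:140],
    }


def post_status(repo: str, sha: str, token: str, payload: Dict[str, str]) -> int:
    """POST the payload to GitHub's commit status endpoint; returns the HTTP code.

    `repo` is "owner/name". The token needs permission to write statuses.
    """
    request = urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/statuses/{sha}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.status


payload = build_status_payload(72, "HIGH", ["blast radius 30", "coverage -4%"])
print(payload)
```

In a GitHub Actions job, `repo` would typically come from the `GITHUB_REPOSITORY` environment variable. For `pull_request` events, the status should be attached to the PR's head commit rather than `GITHUB_SHA`, which refers to the temporary merge commit.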
The friction question is real. If every PR lands with a HIGH score, engineers will learn to ignore the signal. Effective calibration means HIGH scores are genuinely infrequent, covering perhaps 5–10% of PRs on a well-instrumented codebase. MEDIUM scores might cover 20–30%. Most PRs should score LOW and require no additional steps.
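A quick check of whether recent scores fall within those rough proportions might look like the following. The upper bounds come from the ranges above; the function names and the sample data are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Upper ends of the rough target ranges mentioned above.
TARGET_MAX_SHARE: Dict[str, float] = {"HIGH": 0.10, "MEDIUM": 0.30}


def band_shares(bands: Iterable[str]) -> Dict[str, float]:
    """Fraction of PRs in each band; empty dict if there are no PRs."""
    counts = Counter(bands)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {band: counts[band] / total for band in ("LOW", "MEDIUM", "HIGH")}


def bands_over_target(bands: Iterable[str]) -> List[str]:
    """Names of bands whose share exceeds the target upper bound."""
    shares = band_shares(bands)
    return sorted(
        band for band, limit in TARGET_MAX_SHARE.items()
        if shares.get(band, 0.0) > limit
    )


# Made-up sample: 100 recent PRs.
recent = ["LOW"] * 60 + ["MEDIUM"] * 25 + ["HIGH"] * 15
print(band_shares(recent), bands_over_target(recent))
```

In this sample HIGH lands on 15% of PRs, above the 10% upper bound, which is a hint that the HIGH threshold is set too low for this codebase.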
Managing false positives and signal decay
Any scoring model drifts over time. As teams add services, change deployment patterns, or refactor their graph topology, the thresholds calibrated six months ago may no longer reflect actual risk distribution.
Two practices keep the model honest. First, review the distribution of scores against actual incidents quarterly. If HIGH-scored PRs are not over-represented in your postmortems relative to their frequency, your HIGH threshold may be too aggressive: set too low, so it flags changes that are not actually riskier than average. If incidents keep coming from PRs that scored LOW, a factor is being under-weighted. Second, instrument the false-positive rate by tracking how often senior engineers override HIGH blocks with a comment like "reviewed, approving", and spot-check a sample of those cases in a retrospective. A high override rate is the clearest signal that the model needs recalibration.
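Both checks reduce to simple ratios. A sketch under stated assumptions: each PR and each incident-triggering PR is tagged with its band, and each HIGH PR carries a hypothetical boolean recording whether the block was overridden without substantive follow-up. The sample numbers are invented.

```python
from collections import Counter
from typing import Iterable, Optional


def over_representation(pr_bands: Iterable[str],
                        incident_bands: Iterable[str],
                        band: str) -> Optional[float]:
    """Share of incidents from `band` divided by share of PRs in `band`.

    Values above 1 mean the band is over-represented in incidents.
    Returns None when there is not enough data to form the ratio.
    """
    prs = Counter(pr_bands)
    incidents = Counter(incident_bands)
    total_prs = sum(prs.values())
    total_incidents = sum(incidents.values())
    if total_prs == 0 or total_incidents == 0 or prs[band] == 0:
        return None
    return (incidents[band] / total_incidents) / (prs[band] / total_prs)


def override_rate(high_pr_overridden: Iterable[bool]) -> Optional[float]:
    """Fraction of HIGH-scored PRs whose block was overridden."""
    flags = list(high_pr_overridden)
    if not flags:
        return None
    return sum(flags) / len(flags)


# Made-up quarter: 100 PRs, 8 of which triggered incidents.
quarter_prs = ["LOW"] * 70 + ["MEDIUM"] * 22 + ["HIGH"] * 8
quarter_incidents = ["LOW"] * 2 + ["MEDIUM"] * 2 + ["HIGH"] * 4

print(over_representation(quarter_prs, quarter_incidents, "HIGH"))
print(over_representation(quarter_prs, quarter_incidents, "LOW"))
print(override_rate([True, False, False, True, False, False, False, False]))
```

A HIGH ratio close to 1 would suggest the band is not discriminating, while a LOW ratio creeping toward 1 would point to an under-weighted factor.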
Risk scoring is not a substitute for good code review. It's a routing and attention mechanism. The goal is to ensure that the changes most likely to cause downstream incidents get the depth of review they warrant, without adding friction to the routine changes that make up the majority of platform work.