Deploy risk scoring for platform teams: a practical framework

Priya Menon · March 21, 2026 · 9 min read

[Figure: CI pipeline risk score visualization dashboard]

Not all breaking changes are equal. A risk scoring system that treats a field rename in a rarely-consumed internal analytics API the same as a removed endpoint on a high-traffic checkout contract will either cry wolf on everything or miss the change that takes down your payment flow. Most platform teams that have implemented CI contract checks have learned this the hard way.

Why binary pass/fail creates its own problems

The simplest version of a CI contract check is binary: the change either breaks a declared consumer or it doesn't. Tools like Buf's breaking change detection and some OpenAPI linters work this way. The problem is that binary scoring conflates two situations that have nothing in common operationally:

A deprecated field being removed from an internal analytics reporting API that one batch job reads once a week.
An endpoint being removed from the payment gateway contract that four services hit on every user transaction.

Both are technically "breaking." The first is a scheduled migration the impacted team already knows about. The second is a SEV-1 waiting to happen. Treating them identically produces a check that either blocks constantly on low-stakes changes — generating alert fatigue until engineers start ignoring the check — or is configured permissively enough to let through the changes that actually matter.

Proportional risk scoring assigns a numerical severity to each change based on the actual characteristics of the dependency relationship: what changed, who consumes it, how critical those consumers are, and whether the change was signaled in advance.

The four variables that determine risk

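A proportional risk score for a contract change should weight four independent variables. They're multiplicative, not additive, so a moderate change type against a P0 consumer can outscore a high-severity change type against a P2 batch job. With the SLA multipliers defined below, a severity-2 change on a P0-consumed contract (2 × 3 = 6) scores higher than a severity-5 removal on a P2-only contract (5 × 1 = 5).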

1. Change type severity

Schema changes exist on a spectrum. For REST/OpenAPI, a practical severity ordering:

Score 0: Adding an optional field (purely additive; well-behaved consumers ignore unknown fields)
Score 1: Marking a previously required field as optional (potentially breaking for strict schema validators)
Score 2: Adding a required field without a default (breaking for any consumer that creates resources via this endpoint)
Score 3: Renaming a field (breaking for any consumer that references the old name by string)
Score 4: Changing a field's type (e.g., integer to string) — breaking for consumers that parse or compare the value
Score 5: Removing a field or endpoint entirely — hard breaking for all direct consumers
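
The scale above can be encoded directly so the rest of a scoring pipeline refers to named tiers rather than bare integers. This is a minimal sketch: the enum member names and the severity_of helper are hypothetical, and treating an unclassified change as worst-case is an assumption, not something the scale itself prescribes.

```python
from enum import IntEnum


class ChangeSeverity(IntEnum):
    """REST/OpenAPI change tiers, mirroring the 0-5 scale in the text."""
    ADD_OPTIONAL_FIELD = 0
    REQUIRED_TO_OPTIONAL = 1
    ADD_REQUIRED_FIELD = 2
    RENAME_FIELD = 3
    CHANGE_FIELD_TYPE = 4
    REMOVE_FIELD_OR_ENDPOINT = 5


def severity_of(change_kind: str) -> ChangeSeverity:
    """Look up a tier by name (case-insensitive)."""
    try:
        return ChangeSeverity[change_kind.upper()]
    except KeyError:
        # Assumption: a change the classifier does not recognize is
        # scored as fully breaking rather than silently ignored.
        return ChangeSeverity.REMOVE_FIELD_OR_ENDPOINT
```

Because IntEnum members behave like integers, severity_of("rename_field") can be multiplied directly by consumer and SLA weights.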

For gRPC/proto3, field number changes and the removal of optional defaults have their own severity tiers. For AsyncAPI event schemas, topic removal and message format changes map to a similar scale. The key principle is that the severity is specific to the change type, not just "breaking vs. non-breaking."

2. Consumer count and traffic weight

A change that affects one downstream consumer carries lower aggregate risk than one affecting seven. But raw consumer count doesn't fully capture the picture — a single synchronous consumer that calls the affected endpoint on every checkout request matters more than three asynchronous consumers running nightly reconciliation jobs.

Where runtime traffic data is available from your observability stack, weight consumer impact by P95 call volume against the affected endpoint. Where it's not available (static-only analysis), consumer count is the proxy. The formula is roughly:

consumer_risk = change_severity × weighted_consumer_count

This composite gives you a score that reflects both the nature of the change and the breadth of its reach.
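
One way to compute the weighted consumer count is sketched below. The article does not specify the weighting function, so everything here is an assumption: each consumer's weight is its P95 call volume relative to a reference rate, clamped to a range, and a consumer with no runtime data counts as 1.0 (the static-only fallback). The constants REFERENCE_CALLS_PER_MIN, MIN_WEIGHT, and MAX_WEIGHT are hypothetical tuning values.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Consumer:
    name: str
    p95_calls_per_min: Optional[float] = None  # None means no runtime data


# Hypothetical tuning constants; calibrate against your own traffic.
REFERENCE_CALLS_PER_MIN = 100.0
MIN_WEIGHT, MAX_WEIGHT = 0.25, 3.0


def consumer_weight(consumer: Consumer) -> float:
    """Traffic-based weight, or 1.0 when only static analysis is available."""
    if consumer.p95_calls_per_min is None:
        return 1.0
    ratio = consumer.p95_calls_per_min / REFERENCE_CALLS_PER_MIN
    return max(MIN_WEIGHT, min(MAX_WEIGHT, ratio))


def consumer_risk(change_severity: int, consumers: Sequence[Consumer]) -> float:
    """consumer_risk = change_severity x weighted_consumer_count."""
    weighted_consumer_count = sum(consumer_weight(c) for c in consumers)
    return change_severity * weighted_consumer_count
```

Under these assumed constants, a single checkout-path consumer at 1,000 calls per minute gets weight 3.0, while three nightly jobs at one call per minute contribute 0.75 in total, which matches the intuition that one hot synchronous consumer can outweigh several batch readers.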

3. SLA tier of affected consumers

A breaking change to a contract consumed exclusively by internal background jobs is recoverable with a scheduled deploy window and an off-peak rollout. A breaking change to a contract consumed by your P0 payment processing service is not — it requires immediate hotfix coordination, a canary, and a postmortem regardless of how it's handled.

Define SLA tier multipliers explicitly. P0 services (payment, auth, checkout): 3×. P1 (core product features, user-facing reads): 2×. P2 (internal tooling, reporting, analytics): 1×. The final risk score for a field rename on a P0-consumed contract should be materially higher than the same rename on a P2 contract, because the cost of a missed break is categorically different.
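
The multipliers can live in a small lookup table. In this sketch the multiplier values come from the paragraph above; using the most critical affected consumer's tier (rather than a sum or average across consumers) is an assumption, and the function names are hypothetical.

```python
from typing import Iterable

# Multipliers as stated in the text: P0 = 3x, P1 = 2x, P2 = 1x.
SLA_MULTIPLIER = {"P0": 3, "P1": 2, "P2": 1}


def sla_multiplier(consumer_tiers: Iterable[str]) -> int:
    """Multiplier of the most critical affected consumer (1 if none)."""
    # Assumption: the worst tier dominates; one P0 consumer is enough
    # to put the whole change in the P0 bracket.
    return max((SLA_MULTIPLIER[tier] for tier in consumer_tiers), default=1)


def tiered_score(change_severity: int, consumer_tiers: Iterable[str]) -> int:
    """Change severity scaled by the SLA tier of the affected consumers."""
    return change_severity * sla_multiplier(consumer_tiers)
```

With these values, a field rename (severity 3) scores 9 on a P0-consumed contract and 3 on a P2-only contract.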

4. Deprecation signal status

A field that was annotated x-deprecated: true in v2.1 and is now being removed in v2.3 carries lower real-world risk than a field being removed without prior deprecation notice. Consumers of a properly deprecated field have had time to migrate; consuming it at time of removal represents a consumer migration failure, not a surprise producer break.

If your schema management tracks deprecation annotations (the deprecated keyword or an x-deprecated vendor extension in OpenAPI, the deprecated option in proto3, or the equivalent in AsyncAPI), that status should discount the base risk score in proportion to the deprecation period. A 90-day deprecation window with no consumer migration should score differently from a zero-notice removal.
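
A proportional discount could look like the sketch below. The linear shape and the cap are assumptions: FULL_NOTICE_DAYS borrows the 90-day figure from the example above, and MAX_DISCOUNT (a fully signaled removal scores half of an unsignaled one) is a hypothetical value to tune, not a recommendation from the article.

```python
from datetime import date
from typing import Optional

FULL_NOTICE_DAYS = 90  # notice period that earns the full discount
MAX_DISCOUNT = 0.5     # hypothetical cap: full notice halves the score


def deprecation_factor(deprecated_on: Optional[date], removed_on: date) -> float:
    """Multiplier in [1 - MAX_DISCOUNT, 1.0]; 1.0 means zero notice."""
    if deprecated_on is None:
        return 1.0
    notice_days = max(0, (removed_on - deprecated_on).days)
    fraction = min(1.0, notice_days / FULL_NOTICE_DAYS)
    return 1.0 - MAX_DISCOUNT * fraction


def discounted_score(base_score: float, deprecated_on: Optional[date],
                     removed_on: date) -> float:
    """Base risk score reduced according to how long the change was signaled."""
    return base_score * deprecation_factor(deprecated_on, removed_on)
```

Capping the discount below 100% keeps a long-deprecated removal from scoring zero, since a consumer that never migrated will still break.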

Mapping scores to CI actions

The output of risk scoring isn't just a number to display — it maps to a specific required action before the PR can merge:

Score 0–3 (advisory): No CI gate. The impact report surfaces as a PR comment. Engineers see the downstream context; nothing blocks. This is the correct behavior for additive changes and internal-only migrations.
Score 4–7 (warning): CI check passes with a non-blocking warning status. The PR comment names affected consumers and their owning squads. Those squads are notified via Slack or PagerDuty routing. Merge is permitted; the change is flagged for canary monitoring during rollout.
Score 8+ (breaking): CI check fails. The PR is blocked until one of three conditions is met: (a) affected consumers have been updated to handle the new contract, (b) a backward-compatible migration path is documented and linked in the PR description, or (c) each impacted consumer team has explicitly acknowledged the change with a PR review approval.
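
The three bands translate into a small classifier. The band boundaries below are the ones listed above; the enum, the function names, and the choice to place fractional scores between 3 and 4 in the advisory band are assumptions for illustration.

```python
from enum import Enum


class Action(Enum):
    ADVISORY = "advisory"  # PR comment only, no gate
    WARNING = "warning"    # check passes with a non-blocking warning
    BREAKING = "breaking"  # check fails, merge blocked


WARNING_THRESHOLD = 4
BREAKING_THRESHOLD = 8


def required_action(score: float) -> Action:
    """Map a risk score to the CI action for its band."""
    if score >= BREAKING_THRESHOLD:
        return Action.BREAKING
    if score >= WARNING_THRESHOLD:
        return Action.WARNING
    return Action.ADVISORY


def ci_exit_code(score: float) -> int:
    """Non-zero only for the breaking band, so warnings never fail the job."""
    return 1 if required_action(score) is Action.BREAKING else 0
```

Keeping the thresholds as named constants makes later calibration a one-line change.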

The acknowledgment path for score 8+ changes is the most important design decision here. A hard block without an escape hatch creates a chokepoint: one upstream change can stall an entire release. The acknowledgment mechanism provides the escape — it documents that the coordination happened, records which teams reviewed the impact, and gives auditors a clear trail. The CI check failure surfaces the need for coordination; acknowledgment records that it occurred.
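
The unblock logic for a score 8+ change might be sketched as follows. The BlockedChange record and its field names are hypothetical; how each flag gets populated (PR review approvals, a linked migration document, consumer updates) depends on your tooling and is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class BlockedChange:
    impacted_teams: Set[str]
    consumers_updated: bool = False     # condition (a)
    migration_doc_linked: bool = False  # condition (b)
    acknowledged_by: Set[str] = field(default_factory=set)  # condition (c)


def missing_acknowledgments(change: BlockedChange) -> Set[str]:
    """Teams that still need to approve before condition (c) is satisfied."""
    return change.impacted_teams - change.acknowledged_by


def can_merge(change: BlockedChange) -> bool:
    """True once any one of the three unblock conditions holds."""
    all_acknowledged = (
        bool(change.impacted_teams) and not missing_acknowledgments(change)
    )
    return (
        change.consumers_updated
        or change.migration_doc_linked
        or all_acknowledged
    )
```

Reporting missing_acknowledgments in the failed check's output tells the author exactly which teams to contact, and the acknowledged_by set doubles as the audit record.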

Three scoring mistakes to avoid

Scoring changes individually, not as a composite diff. A single PR that renames two fields and removes one endpoint should get one aggregate score, not three separate scores summed or averaged. Use the maximum severity across all changes in the diff. Averaging hides a single high-severity change buried among additive ones.
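
A short sketch of the difference, using the 0-5 change type scale from earlier. The function name is hypothetical, and returning 0 for an empty diff is an assumption.

```python
from statistics import mean
from typing import Sequence


def aggregate_severity(severities: Sequence[int]) -> int:
    """Severity of a whole diff: the worst single change wins."""
    return max(severities, default=0)


# Three additive changes (0) plus one endpoint removal (5) in the same PR.
diff_severities = [0, 0, 0, 5]
worst = aggregate_severity(diff_severities)  # the removal stays visible
averaged = mean(diff_severities)             # the removal is diluted
```

Here the maximum keeps the diff at severity 5, while the average drops it to 1.25, low enough to pass as advisory.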

Ignoring transitive consumers. If service B consumes service A's internal API, and service A depends on contract C, then service B is a transitive consumer of C. A breaking change in C can cascade to B even though B has no direct declared dependency on C. A score that only counts direct consumers will under-report risk on deeply layered service meshes, which are exactly where the worst cascades originate.
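
Finding transitive consumers is a graph traversal over declared dependencies. The sketch below assumes the dependency data is available as a plain mapping from each service to the things it depends on; the mapping format and the service names in the usage note are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set


def transitive_consumers(depends_on: Dict[str, Iterable[str]],
                         contract: str) -> Set[str]:
    """All services that reach `contract` through any chain of dependencies."""
    # Invert the edges: for each dependency, who consumes it directly?
    consumed_by: Dict[str, Set[str]] = {}
    for consumer, dependencies in depends_on.items():
        for dependency in dependencies:
            consumed_by.setdefault(dependency, set()).add(consumer)

    # Breadth-first walk outward from the changed contract.
    seen: Set[str] = set()
    queue = deque([contract])
    while queue:
        node = queue.popleft()
        for consumer in consumed_by.get(node, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)

    seen.discard(contract)  # a dependency cycle could lead back to the start
    return seen
```

Given {"service-a": ["contract-c"], "service-b": ["service-a"]}, calling transitive_consumers(graph, "contract-c") returns both service-a and service-b, even though service-b never declares contract-c.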

Scoring against the latest schema version, not the pinned consumer version. If order-processor is pinned to a contract version below v2.3.0 and the breaking change ships in v2.3.0, the risk depends on whether order-processor will receive the update. Semver-aware scoring only flags the change as breaking for consumers whose declared version range includes the new version. This prevents false positives on consumers that are pinned below the breaking version.
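
A minimal sketch of the range check, assuming consumers declare either an exact pin ("2.2.0") or a caret range ("^2.2.0"). The version numbers in the usage note are illustrative only, and the parser is deliberately simplified: it ignores pre-release tags and treats caret ranges as "same major version", which only matches common caret semantics for major versions of 1 and above.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse 'v2.3.0' or '2.3.0' into a comparable (major, minor, patch) tuple."""
    major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
    return (major, minor, patch)


def range_includes(declared: str, new_version: str) -> bool:
    """True if a consumer's declared range would pick up `new_version`."""
    target = parse_version(new_version)
    if declared.startswith("^"):
        low = parse_version(declared[1:])
        high = (low[0] + 1, 0, 0)  # simplification: assumes major >= 1
        return low <= target < high
    return parse_version(declared) == target  # exact pin


def affected_consumers(declared_ranges: Dict[str, str],
                       new_version: str) -> List[str]:
    """Consumers whose declared range includes the breaking version."""
    return sorted(
        name for name, declared in declared_ranges.items()
        if range_includes(declared, new_version)
    )
```

For a breaking change in v2.3.0, a consumer pinned to "2.2.0" is excluded while one declaring "^2.2.0" is flagged, because its range will resolve to the new version.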

Calibrating thresholds to your team's actual risk profile

The threshold values above (3, 7, 8+) are a starting point. The right calibration depends on your organization's release cadence, rollback cost, and incident history. Teams running weekly release trains with expensive rollbacks should set lower thresholds, because each deploy bundles more change and is costlier to undo. Teams running continuous deployment with automated canaries can tolerate higher thresholds because the blast radius of any single deploy is bounded by the canary percentage.

The most useful calibration exercise: audit your last 12 months of production incidents involving cross-service breaking changes. For each one, reconstruct the diff and run it through your scoring model. Did the model produce a score that would have blocked or warned? If not, identify which variable was under-weighted: was the consumer count too low a proxy for traffic? Was the SLA multiplier wrong? Adjust those variables until the model correctly classifies the incidents that actually cost you, without mis-classifying the changes that were safe. That's the empirical floor; everything above it is tuning for false positive rate.
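
The audit can be run as a simple backtest. In this sketch the PastChange record, the simple_score model (severity times SLA multiplier only), and the sample history are all hypothetical; in practice the records would be reconstructed from your own incident reviews and the score function would be your full model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class PastChange:
    label: str
    severity: int          # 0-5 change type tier
    sla_multiplier: int    # 1, 2, or 3
    caused_incident: bool  # taken from the incident review


def backtest(changes: Sequence[PastChange],
             score_fn: Callable[[PastChange], float],
             warn_threshold: float = 4.0) -> Dict[str, List[str]]:
    """Report incidents the model would have missed and safe changes it flags."""
    missed: List[str] = []
    false_alarms: List[str] = []
    for change in changes:
        flagged = score_fn(change) >= warn_threshold
        if change.caused_incident and not flagged:
            missed.append(change.label)
        elif flagged and not change.caused_incident:
            false_alarms.append(change.label)
    return {"missed": missed, "false_alarms": false_alarms}


def simple_score(change: PastChange) -> float:
    # Assumption: a stripped-down model for the example.
    return change.severity * change.sla_multiplier


# Hypothetical history, for illustration only.
history = [
    PastChange("rename on checkout contract", 3, 3, True),
    PastChange("required-to-optional on auth contract", 1, 3, True),
    PastChange("field removal on reporting API", 5, 1, False),
]
report = backtest(history, simple_score)
```

Each entry in "missed" points at an under-weighted variable to adjust; entries in "false_alarms" are the price of those adjustments and feed the false positive tuning described above.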

Buildpathio generates a pre-push risk score for every PR that touches a schema file — zero-impact, warning, or breaking, with named consumers and SLA tiers.
