
Buildpathio vs. Manual Dependency Review: What We Measured

Jan 14, 2026 • 8 min read • Buildpathio Team

We tracked 90 days of PR activity across seven platform engineering teams — some using manual dependency review as part of their code review checklist, others using Buildpathio's automated graph scoring. The results were consistent enough to publish.

How we structured the comparison

Manual review in this context means: a code reviewer is expected to assess downstream impact by reading the diff, consulting documentation or their own knowledge of the service graph, and flagging any concerns. This is the industry default — no special tooling, just experienced engineers doing what experienced engineers do.

Automated scoring means: Buildpathio runs a graph scan on every PR, computes a blast radius score from static analysis, OpenAPI contract detection, and optional service mesh telemetry, and posts the result as a PR check before any human review begins.
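To make "blast radius" concrete, here is a minimal sketch that treats it as the set of services that transitively depend on the changed service. This is an illustrative assumption, not a description of Buildpathio's actual scoring; the service names, the `DEPENDS_ON` map, and the `blast_radius` function are all hypothetical.

```python
from collections import deque

# Hypothetical service graph: each key depends on the services in its list.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["ledger"],
    "cart": ["catalog"],
    "reporting": ["ledger"],
    "ledger": [],
    "catalog": [],
}


def reverse_graph(depends_on):
    """Invert 'A depends on B' edges into 'B is depended on by A'."""
    dependents = {service: set() for service in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)
    return dependents


def blast_radius(changed, depends_on):
    """Return every service that directly or transitively depends on `changed`."""
    dependents = reverse_graph(depends_on)
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {changed}  # the changed service is not its own dependent


affected = blast_radius("ledger", DEPENDS_ON)
print(sorted(affected))  # ['checkout', 'payments', 'reporting']
```

A change to `ledger` reaches `checkout` only through `payments`, which is the kind of second-hop relationship that a diff alone does not show.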

We tracked three outcomes: (1) missed high-risk merges — PRs that scored HIGH in retrospect but were merged without additional review, (2) review time per PR, and (3) P0 incidents in the 30 days following a given merge cohort.
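The first outcome can be expressed as a simple ratio. The sketch below assumes a "high-risk" PR is one whose retrospective blast radius meets a threshold, and that each merged PR records whether it received additional review; the `MergedPR` record, its field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MergedPR:
    number: int
    blast_radius: int   # retrospective score (assumed integer scale)
    extra_review: bool  # True if it received additional review before merge


def missed_high_risk_rate(prs, threshold=5):
    """Share of high-risk PRs that were merged without additional review."""
    high_risk = [pr for pr in prs if pr.blast_radius >= threshold]
    if not high_risk:
        return 0.0
    missed = [pr for pr in high_risk if not pr.extra_review]
    return len(missed) / len(high_risk)


# Made-up sample: three PRs meet the threshold, one of them was missed.
sample = [
    MergedPR(101, 7, False),
    MergedPR(102, 5, True),
    MergedPR(103, 2, False),
    MergedPR(104, 9, True),
]
rate = missed_high_risk_rate(sample)
print(f"{rate:.0%}")  # 33%
```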

Manual review: where it succeeds and where it breaks down

Manual review is genuinely effective for single-service changes with shallow dependency graphs. An experienced engineer who knows the codebase can identify most direct downstream impacts quickly. The limiting factor is not reviewer competence; it is graph depth.

For changes that affect a service with three or more transitive dependents, manual review accuracy drops significantly. Reviewers reliably catch direct dependencies (one hop). They miss second- and third-order dependencies at a much higher rate — particularly when those relationships were established by a different team, or when they are mediated by an API contract rather than a code import.
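The hop distinction can be illustrated by grouping dependents by their distance from the changed service. This is a sketch under assumed data: the `DEPENDENTS` map and service names are hypothetical, and the function is a plain breadth-first traversal rather than any particular tool's algorithm.

```python
def dependents_by_hop(changed, dependents):
    """Group services by their shortest hop distance from `changed`."""
    levels = {}
    seen = {changed}
    frontier = [changed]
    hop = 0
    while frontier:
        hop += 1
        next_frontier = []
        for service in frontier:
            for dependent in sorted(dependents.get(service, ())):
                if dependent not in seen:
                    seen.add(dependent)
                    next_frontier.append(dependent)
        if next_frontier:
            levels[hop] = next_frontier
        frontier = next_frontier
    return levels


# Hypothetical reverse graph: key -> services that depend on it.
DEPENDENTS = {
    "auth": ["gateway", "profile"],
    "gateway": ["web", "mobile-api"],
    "profile": ["web"],
    "web": ["analytics"],
}

levels = dependents_by_hop("auth", DEPENDENTS)
for hop, services in levels.items():
    print(hop, services)
# 1 ['gateway', 'profile']
# 2 ['mobile-api', 'web']
# 3 ['analytics']
```

Hop 1 is what a reviewer reading the diff is likely to know about; hops 2 and 3 are where the miss rate described above climbs.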

In our cohort, the teams using manual review had a missed high-risk merge rate of 34% for PRs with blast radius 5 or higher. This is not a criticism of those engineers — it reflects the fundamental limits of human graph traversal at scale.

Automated scoring: what the graph catches that reviewers miss

The automated graph catches two classes of dependency that manual reviewers consistently miss: API contract consumers and runtime-only call paths visible via service mesh telemetry.

A service may not import another service's code directly — it calls it over HTTP using an endpoint it gets from environment configuration. Static import analysis misses this. OpenAPI contract scanning catches it if both services maintain spec files. Service mesh telemetry catches it regardless. Reviewers rarely catch it at all unless they know both codebases well.
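A small sketch of why combining sources matters: each detection method contributes its own set of caller-to-callee edges, and merging them shows which dependencies import analysis alone would never see. The edge data, source labels, and `merge_edges` helper are hypothetical and only illustrate the idea.

```python
# Each source yields (caller, callee) edges. All data here is hypothetical.
import_edges = {("checkout", "pricing-lib")}    # found by static import analysis
contract_edges = {("checkout", "payments")}     # both services maintain spec files
telemetry_edges = {                             # calls observed at runtime
    ("checkout", "payments"),
    ("notifier", "payments"),
}


def merge_edges(**sources):
    """Map each edge to the set of sources that detected it."""
    merged = {}
    for source_name, edges in sources.items():
        for edge in edges:
            merged.setdefault(edge, set()).add(source_name)
    return merged


merged = merge_edges(
    static=import_edges,
    contract=contract_edges,
    telemetry=telemetry_edges,
)

# Edges that static import analysis did not report.
invisible_to_imports = sorted(
    edge for edge, found_by in merged.items() if "static" not in found_by
)
print(invisible_to_imports)
# [('checkout', 'payments'), ('notifier', 'payments')]
```

In this made-up data, `notifier` has no spec file, so only telemetry reveals that it calls `payments`.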

Across the 90-day period, the automated-scoring teams had a missed high-risk merge rate of 6% for PRs with blast radius 5 or higher. On the outcome metric, that cohort recorded 73% fewer P0 incidents attributed to dependency-blind merges than the manual-review cohort (3 versus 11).

"Manual review is good at catching what you know to look for. Automated graph scoring is good at catching what you did not know existed."

P0 incident delta over 90 days

The outcome we care about most: production incidents caused by a merge that should have received more scrutiny.

Manual review cohort: 11 P0 incidents over 90 days where postmortem root cause identified a missed or undocumented dependency as the contributing factor.

Automated scoring cohort: 3 P0 incidents with the same root cause classification over the same period.
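For readers who want to check the arithmetic, the relative differences follow directly from the figures reported in this post. The sketch compares raw counts between the two cohorts; the post does not give PR volume per cohort, so nothing here is normalized. The 82% figure is derived from the two reported rates and is not a separately measured result.

```python
def relative_reduction(baseline, observed):
    """Fractional drop from `baseline` to `observed`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - observed) / baseline


# P0 incidents with a missed-dependency root cause: manual 11, automated 3.
p0_reduction = relative_reduction(11, 3)

# Missed high-risk merge rate: manual 34%, automated 6%.
missed_rate_reduction = relative_reduction(0.34, 0.06)

print(f"P0 incidents: {p0_reduction:.0%} fewer")             # 73% fewer
print(f"Missed merges: {missed_rate_reduction:.0%} lower")   # 82% lower
```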

Three caveats worth noting: the teams were not perfectly matched in org size or service count; the automated scoring teams were already somewhat more process-mature (which may have contributed); and 90 days is a short window. We plan to publish a longer follow-up analysis when the 12-month data is available. But the directional signal is clear enough that we felt it was worth sharing now.

The most consistent observation from the engineers on the automated scoring teams: the risk score did not replace judgment, it concentrated it. Reviewers spent less time manually tracing blast radius on every PR and more time actually reviewing the code quality of the high-risk ones. That reallocation of attention is probably most of the improvement.

The hidden cost in manual review time

Incident delta is the headline metric, but manual dependency review carries a second cost worth quantifying: the time each reviewer spends per PR investigating blast radius by hand.

In our cohort, engineers on manual review teams spent an average of 6–8 minutes per PR actively thinking about downstream impact: checking the service catalog, asking colleagues, reviewing code comments. Across a team handling 40 PRs per week, that's 240–320 minutes per week, or roughly 4–5 engineering hours (5.3 at the upper bound), spent on manual graph traversal that automated analysis handles in under 90 seconds.
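The weekly figure is straightforward arithmetic on the numbers above. The constant names are illustrative only.

```python
MINUTES_PER_PR = (6, 8)   # reported range of manual blast-radius effort per PR
PRS_PER_WEEK = 40         # team-level PR volume used in the example

# Scale the per-PR range to a week, then convert minutes to hours.
weekly_minutes = tuple(m * PRS_PER_WEEK for m in MINUTES_PER_PR)
weekly_hours = tuple(m / 60 for m in weekly_minutes)

print(weekly_minutes)  # (240, 320)
print(f"{weekly_hours[0]:.1f} to {weekly_hours[1]:.1f} hours")  # 4.0 to 5.3 hours
```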

Those 4–5 hours don't disappear on automated teams; they get reallocated. Reviewers on automated scoring teams reported spending more time on the actual diff quality for high-risk PRs, specifically because they weren't burning cognitive load on the blast radius question. The score answered that question; they could focus on whether the change was correct.

What manual review still does better

We'd be giving an incomplete picture if we only reported where automated scoring outperforms manual review. Manual review has genuine advantages that no graph analysis replaces.

Experienced engineers catch semantic issues — code that is syntactically correct and structurally sound but wrong in its business logic — that no graph analysis can detect. A senior engineer who knows the domain can flag that a PR changes the behavior of a payment retry flow in a way that will cause double charges under specific conditions. That's not a blast radius problem; it's a correctness problem. Graph analysis doesn't help there.

Manual review also catches architecture drift: situations where a PR is technically safe but represents an architectural decision that shouldn't be made at the PR level. An engineer who sees a service adding a new direct HTTP dependency on a third service might flag that this call should go through the API gateway rather than directly. No graph analysis would score this as HIGH risk, but an experienced reviewer would catch it as a design issue.

We're not saying automated scoring replaces code review. We're saying it removes one expensive, error-prone manual step from code review — the blast radius assessment — so reviewers can spend their attention on the things that genuinely require human judgment.