
The Complete Guide to Microservice Dependency Mapping

May 12, 2025 • 9 min read • Buildpathio Team

Most platform engineering teams have a dependency map somewhere. It might be a Confluence page, a draw.io diagram someone exported six months ago, or a Notion doc with a hand-drawn service graph. The map looks plausible. Engineers trust it. Then a PR that touches the payments service causes the notification worker to start throwing 500s, and nobody can explain why — because nobody knew the notification worker called an internal API on the payments service that wasn't on any diagram.

The problem isn't that teams are careless. The problem is that manual dependency maps are documentation, and documentation reflects the system as it was understood at the moment it was written, not the system as it actually runs.

Why static documentation decays

When an engineer adds a new service-to-service call, they rarely update the architecture diagram. When a team migrates from REST to gRPC for one contract, the API spec in the wiki doesn't automatically reflect the new interface. When a shared library introduces an internal dependency on a config service, that dependency is invisible to anyone reading code at the call site.

The decay rate grows with team size and deployment frequency. A team shipping ten times per day across fifteen services accumulates architectural drift faster than a manual process can track it. The map can go from "mostly correct" to "actively misleading" in weeks, not months.

We're not saying documentation is useless — architecture decision records, runbooks, and design docs all serve purposes that automated analysis cannot. We're saying that a dependency map maintained by humans will not stay accurate under the deploy velocity modern platform teams operate at. The map needs to come from the system itself.

The three-source model for accurate graphs

Effective dependency analysis requires combining three distinct data sources, each of which captures a different class of relationship that the others miss.

Static source analysis

Parsing import statements, function calls, package manifests, and module references gives you the foundational layer: dependencies that are declared explicitly in code. This catches direct library dependencies, internal package imports, and build-time relationships. A Go service that imports a shared pkg/auth module has that dependency discoverable purely from source.

Static analysis has a well-known limitation: it only sees what's declared, not what's called at runtime. A plugin architecture where service names are loaded from environment variables is invisible to a pure static scan. Runtime configuration dependencies — where service A reads an endpoint for service B from a config map — won't appear in import graphs.
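As a minimal sketch of this layer and its blind spot, the snippet below uses Python's standard ast module as a stand-in for whatever parser fits your language (a Go service would need a Go parser instead). The sample source, the shared.auth module, and the PAYMENTS_URL variable are hypothetical and exist only for illustration.

```python
import ast


def declared_imports(source: str) -> set[str]:
    """Return the dotted module names that a Python source file imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


# Hypothetical service source: one declared import, plus an endpoint
# that is only resolved from the environment at runtime.
SAMPLE = '''
import os
from shared.auth import verify_token

PAYMENTS_URL = os.environ["PAYMENTS_URL"]
'''

print(sorted(declared_imports(SAMPLE)))  # ['os', 'shared.auth']
```

The scan finds the shared.auth import but has no way to know that the service also talks to whatever PAYMENTS_URL points at, which is the gap the other two sources are meant to cover.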

API contract scanning

OpenAPI specifications, Protocol Buffer definitions, and GraphQL schemas define the interfaces between services. Scanning these specs — and detecting when a service's client code references another service's spec version — reveals contract-level dependencies that don't appear in import graphs.

Consider a scenario: a mid-size SaaS company's checkout-api service has a handwritten OpenAPI 3.0 spec. Their order-service and inventory-service both generate client stubs from that spec. When checkout-api's spec changes a request field from optional to required, both downstream consumers can start failing at runtime, because clients generated from the old spec may omit that field. Static analysis of the source code might not catch this at all, but API contract scanning, by tracking which services consume which spec versions, makes the relationship explicit.
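The checkout-api scenario can be sketched roughly as follows. This is an illustration under stated assumptions, not a real OpenAPI toolchain: the schemas are trimmed-down request schemas, and the field names, version numbers, and the billing-service consumer are all hypothetical.

```python
def newly_required(old_schema: dict, new_schema: dict) -> set[str]:
    """Fields required by new_schema that were not required by old_schema."""
    return set(new_schema.get("required", [])) - set(old_schema.get("required", []))


def consumers_on_version(consumers: dict[str, str], version: str) -> list[str]:
    """Services whose generated client is pinned to the given spec version."""
    return sorted(name for name, pinned in consumers.items() if pinned == version)


# Hypothetical request schemas for two versions of the checkout-api spec.
OLD_SCHEMA = {"type": "object", "required": ["cart_id"],
              "properties": {"cart_id": {}, "coupon": {}}}
NEW_SCHEMA = {"type": "object", "required": ["cart_id", "coupon"],
              "properties": {"cart_id": {}, "coupon": {}}}

# Hypothetical record of which spec version each consumer generated stubs from.
CONSUMERS = {
    "order-service": "1.4.0",
    "inventory-service": "1.4.0",
    "billing-service": "1.5.0",
}

if newly_required(OLD_SCHEMA, NEW_SCHEMA):
    print("at risk:", consumers_on_version(CONSUMERS, "1.4.0"))
```

The consumer-to-version mapping is the part that turns a spec diff into dependency edges: without it, you know the contract changed but not who is affected.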

Runtime trace aggregation

Runtime telemetry, whether from a service mesh such as Istio or Linkerd or from distributed tracing with Jaeger or OpenTelemetry, captures the dependencies that only exist at runtime. A service that dynamically resolves endpoints via service discovery, a cron job that calls three internal APIs on a schedule, a background worker consuming events from a queue and triggering downstream writes: these all appear in trace data and nowhere in static analysis.

Runtime tracing also surfaces temporal dependencies. Service A might only call service B during peak checkout windows, making the dependency invisible in off-hours monitoring. Aggregating trace data across a representative traffic window gives you the dependency graph as it is actually exercised, not just the declared one.
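One way to picture the aggregation step is to collapse parent/child span links into caller-to-callee service pairs. The span records below are a simplified, hypothetical shape (span_id, parent_id, service), not the actual export format of any tracing system.

```python
from collections import Counter


def edges_from_spans(spans: list[dict]) -> Counter:
    """Count caller -> callee service pairs implied by parent/child span links."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    edges = Counter()
    for span in spans:
        caller = service_of.get(span.get("parent_id"))
        # Skip root spans and spans whose parent is in the same service.
        if caller and caller != span["service"]:
            edges[(caller, span["service"])] += 1
    return edges


# Hypothetical spans from two traces collected over a traffic window.
SPANS = [
    {"span_id": "a1", "parent_id": None, "service": "checkout-api"},
    {"span_id": "a2", "parent_id": "a1", "service": "payments"},
    {"span_id": "a3", "parent_id": "a2", "service": "payments"},
    {"span_id": "b1", "parent_id": None, "service": "notification-worker"},
    {"span_id": "b2", "parent_id": "b1", "service": "payments"},
]

for (caller, callee), count in sorted(edges_from_spans(SPANS).items()):
    print(f"{caller} -> {callee} ({count} calls)")
```

Keeping the call counts, rather than a plain set of edges, makes it possible to notice dependencies that only show up in a small fraction of traffic.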

Constructing the directed graph

With data from these three sources, you construct a directed graph where each service is a node and each dependency relationship is a directed edge pointing from dependent to dependency. The direction matters for blast radius calculation.
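A minimal sketch of the merge, assuming each source has already been reduced to (dependent, dependency) pairs. The source labels and the sample edges are hypothetical; recording which sources observed each edge is a design choice of this sketch, not something the three-source model requires.

```python
from collections import defaultdict


def build_graph(sources: dict[str, list[tuple[str, str]]]) -> dict[tuple[str, str], set[str]]:
    """Merge edges from every source, keyed by (dependent, dependency).

    The value for each edge is the set of source names that observed it.
    """
    graph = defaultdict(set)
    for source_name, edges in sources.items():
        for dependent, dependency in edges:
            graph[(dependent, dependency)].add(source_name)
    return dict(graph)


# Hypothetical per-source edge lists, each pointing from dependent to dependency.
SOURCES = {
    "static": [("checkout-api", "auth")],
    "contract": [("order-service", "checkout-api")],
    "runtime": [("checkout-api", "auth"), ("notification-worker", "payments")],
}

GRAPH = build_graph(SOURCES)
for (dependent, dependency), seen_by in sorted(GRAPH.items()):
    print(f"{dependent} -> {dependency} seen by {sorted(seen_by)}")
```

An edge seen only by the runtime source, like notification-worker to payments here, is exactly the kind of relationship that never made it onto a hand-maintained diagram.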

When you're about to change service A, the question you need to answer is: which services will be affected? That's not service A's dependencies (the services A depends on) — it's service A's dependents (the services that depend on A). You find these by traversing the graph in reverse: follow all incoming edges to A, then incoming edges to those services, recursively, until you've collected every service in the transitive dependent set.

The size of that set is your blast radius. A change to a service with zero incoming edges, one that nothing else depends on, has blast radius 0. A change to a shared authentication service that twelve other services call directly has a blast radius of at least 12, and potentially much larger if those twelve services are themselves depended upon.
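The reverse traversal described above can be sketched as a breadth-first search over incoming edges. The edge list and service names are hypothetical; edges point from dependent to dependency, as in the rest of this guide.

```python
from collections import defaultdict, deque


def transitive_dependents(edges: list[tuple[str, str]], target: str) -> set[str]:
    """Collect every service that directly or transitively depends on target."""
    dependents_of = defaultdict(set)
    for dependent, dependency in edges:
        dependents_of[dependency].add(dependent)

    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for dependent in dependents_of.get(current, ()):
            # The seen set also guards against cycles in the graph.
            if dependent != target and dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical edges, each pointing from dependent to dependency.
EDGES = [
    ("checkout-api", "auth"),
    ("order-service", "auth"),
    ("web-gateway", "checkout-api"),
]

print(sorted(transitive_dependents(EDGES, "auth")))  # ['checkout-api', 'order-service', 'web-gateway']
print(len(transitive_dependents(EDGES, "web-gateway")))  # 0
```

Here auth has two direct dependents but a blast radius of 3, because web-gateway reaches it through checkout-api.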

The freshness problem

A graph built from last month's data is worse than no graph in one specific way: it produces confident, wrong answers. Engineers will trust a stale graph's blast radius assessment, skip a review step, and ship a change that breaks a dependency added after the last graph update.

The solution is to treat the graph as a CI artifact, not a documentation artifact. Every PR that touches a service should trigger a graph refresh for that service's subgraph before review begins. This means integrating static analysis, contract scanning, and runtime trace aggregation into the CI pipeline — not as a weekly batch job, but as a per-PR step that takes under two minutes.

The practical implication: graph analysis must be incremental. Rebuilding a full 200-service graph from scratch on every PR is too slow. Instead, the system should identify which graph nodes are potentially affected by the changed code, re-analyze those nodes and their immediate neighbors, and merge the updated subgraph into the persistent graph store.
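A rough sketch of that incremental step, assuming the persistent graph is a set of (dependent, dependency) edges and that re-analysis yields a fresh set of outgoing edges for the selected nodes. The function names, the service names, and the ledger-service edge are hypothetical.

```python
Edge = tuple[str, str]  # (dependent, dependency)


def nodes_to_reanalyze(graph: set[Edge], changed: set[str]) -> set[str]:
    """Changed services plus their immediate neighbors in either direction."""
    neighbors = {dst for src, dst in graph if src in changed}
    neighbors |= {src for src, dst in graph if dst in changed}
    return changed | neighbors


def merge_subgraph(graph: set[Edge], reanalyzed: set[str], fresh: set[Edge]) -> set[Edge]:
    """Replace the outgoing edges of re-analyzed services with fresh ones."""
    kept = {(src, dst) for src, dst in graph if src not in reanalyzed}
    return kept | {(src, dst) for src, dst in fresh if src in reanalyzed}


# Hypothetical persistent graph and a PR that touches the payments service.
GRAPH = {
    ("checkout-api", "payments"),
    ("payments", "auth"),
    ("order-service", "auth"),
}
SCOPE = nodes_to_reanalyze(GRAPH, {"payments"})

# Hypothetical result of re-analyzing only the services in SCOPE.
FRESH = {
    ("checkout-api", "payments"),
    ("payments", "auth"),
    ("payments", "ledger-service"),
}
UPDATED = merge_subgraph(GRAPH, SCOPE, FRESH)
print(sorted(UPDATED))
```

Edges owned by services outside the scope, such as order-service to auth, are carried over untouched, which is what keeps the per-PR cost bounded by the size of the neighborhood rather than the whole graph.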

What a live graph changes about PR review

When the dependency graph is live and accurate, code review stops being a context-switching exercise where engineers manually trace potential impact. The graph tells you, before you open the diff, that this PR touches a service with five direct dependents and twelve transitive ones — and which of those dependents have test coverage gaps that make them higher risk.

A team using an accurate live graph can make three decisions they couldn't make before: they can route high-blast-radius changes to senior reviewers automatically, they can require sign-off from owners of dependent services before merge, and they can prioritize integration tests for the specific downstream services that are actually affected rather than running a full regression suite.
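Those three decisions could be derived from the graph along the following lines. This is a sketch only: the ReviewPlan shape, the ownership mapping, and the senior-review threshold are hypothetical placeholders that a team would set for itself.

```python
from dataclasses import dataclass


@dataclass
class ReviewPlan:
    needs_senior_reviewer: bool
    required_signoffs: list[str]
    integration_tests: list[str]


def plan_review(direct: list[str], transitive: list[str],
                owners: dict[str, str], senior_threshold: int = 10) -> ReviewPlan:
    """Turn a PR's dependent sets into routing, sign-off, and test decisions."""
    affected = sorted(set(direct) | set(transitive))
    return ReviewPlan(
        # Route to a senior reviewer once the blast radius reaches the threshold.
        needs_senior_reviewer=len(affected) >= senior_threshold,
        # Ask the owning team of each direct dependent to sign off.
        required_signoffs=sorted({owners[s] for s in direct if s in owners}),
        # Run integration tests only for services that are actually affected.
        integration_tests=affected,
    )


# Hypothetical inputs for a PR against checkout-api.
OWNERS = {"order-service": "team-orders", "inventory-service": "team-supply"}
PLAN = plan_review(
    direct=["order-service", "inventory-service"],
    transitive=["web-gateway"],
    owners=OWNERS,
    senior_threshold=3,
)
print(PLAN)
```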

None of this requires organizational process change. It requires an accurate graph that's available at PR time — which is an infrastructure problem, not a culture problem. The teams that get this right are the ones that treat dependency analysis as a first-class CI output, not an afterthought in architecture review.