
Onboarding engineers onto an 80-service platform without losing them


When a new engineer joins a platform team running 80 services across six squads, the standard onboarding artifact — a Confluence page with an architecture diagram from 2022 — is worse than useless. It creates false confidence in a mental model that stopped being accurate 18 months ago. The question worth asking is what would actually work instead.

The actual problem with microservice onboarding

New engineers joining monolithic codebases have it comparatively easy: one repo, one deploy unit, one mental model to build. The codebase is large but navigable. Clone it, run it, read the call stack, build a working model in a few weeks.

Joining a mature microservice platform is different. The codebase is distributed across dozens of repos. Services have implicit dependencies that aren't captured in any single document, and often aren't known by any single person. The team that owns notification-svc may not know that order-processor has been calling one of their internal endpoints synchronously on every checkout for eight months. A new engineer asking "what calls what?" gets a fragmentary answer that's accurate for the squad that answered it and wrong for everything outside.

The result is a predictable onboarding pattern: the new engineer spends the first several weeks doing one-on-ones with senior engineers on their squad, accumulating tribal knowledge, and building a partial mental model that's good enough to work within their squad's service boundary but blind to anything outside it. They don't make dangerous cross-service changes — not because they understand the risk topology, but because they don't touch services they don't understand. This isn't safety. It's enforced ignorance.

It works until it doesn't. The moment a new engineer's feature work requires calling a service outside their squad's direct ownership, they're making architectural decisions with incomplete information. The tribal knowledge handoff doesn't scale with the team; the onboarding ramp doesn't compress with tenure.

What a dependency graph provides that documentation cannot

Architecture documentation has a fundamental structural problem: it's a snapshot of a system that changes continuously. The diagram shows what the architect intended, or what was true when the diagram was drawn. It doesn't show the user-profile service that started consuming auth-service's new session endpoint last sprint without a catalog entry update, or the async fan-out added to order-processor during a production incident six months ago that nobody documented because the team was exhausted and the incident was over.

A dependency graph built from live schema files reflects the system as the repos currently describe it, not as it was designed and not as someone remembered it. Every time a service's OpenAPI spec, proto file, or AsyncAPI document is updated in source control, the graph updates. A new engineer looking at the graph sees the current dependency topology as those schemas declare it.
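As a rough illustration of the idea, here is a minimal sketch of assembling such a graph. It assumes a parsing step has already turned each service's schema files into a record of what that service calls and which event producers it subscribes to. The record shape, the field names `calls` and `consumes_events_from`, and the edges themselves are hypothetical.

```python
from collections import defaultdict

# Hypothetical output of a schema-parsing step, one record per service.
# Field names and edges are assumptions made for illustration only.
SCHEMA_RECORDS = [
    {"service": "order-processor",
     "calls": ["payment-api", "notification-svc"],
     "consumes_events_from": []},
    {"service": "user-profile",
     "calls": ["auth-service"],
     "consumes_events_from": []},
    {"service": "notification-svc",
     "calls": [],
     "consumes_events_from": ["payment-api"]},
]


def build_graph(records):
    """Return {consumer: set of providers} from parsed schema records."""
    graph = defaultdict(set)
    for record in records:
        consumer = record["service"]
        graph[consumer]  # register the node even if it has no dependencies
        for provider in record["calls"] + record["consumes_events_from"]:
            graph[consumer].add(provider)
            graph[provider]  # providers are nodes too, even without a record
    return dict(graph)


graph = build_graph(SCHEMA_RECORDS)
for service in sorted(graph):
    print(service, "->", sorted(graph[service]))
```

Rebuilding the graph on every merge to the default branch is one plausible way to keep it in step with the repos. The trigger is a design choice, not something the sketch prescribes.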

More concretely, the graph answers the questions that matter most during the first 60 days:

  • What does my squad's service depend on upstream? What would break our service if it degraded?
  • Which services depend on mine? These are the teams I need to coordinate with before changing any contract.
  • Which nodes in the graph have high in-degree (many consumers)? These are the highest-risk services to modify without broad coordination.
  • If I'm on call at 2am and an alert fires, which upstream services in my dependency chain could be the cause?
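The four questions above map onto simple graph queries. The sketch below assumes the graph is stored as a mapping from each service to the services it depends on. The edges are invented for illustration, and the function names are not taken from any particular tool.

```python
from collections import deque

# Illustrative edges only: consumer -> providers it depends on.
DEPENDS_ON = {
    "order-processor": {"payment-api", "notification-svc"},
    "payment-api": {"auth-service"},
    "notification-svc": {"auth-service"},
    "user-profile": {"auth-service"},
    "auth-service": set(),
}


def upstream(service):
    """Direct dependencies: what could break this service if it degraded."""
    return DEPENDS_ON.get(service, set())


def consumers(service):
    """Direct dependents: who to coordinate with before changing a contract."""
    return {s for s, deps in DEPENDS_ON.items() if service in deps}


def in_degree_ranking():
    """Services ordered by number of consumers, highest first."""
    return sorted(DEPENDS_ON, key=lambda s: (-len(consumers(s)), s))


def transitive_upstream(service):
    """Everything reachable upstream: the 2am candidate list for an alert."""
    seen, queue = set(), deque([service])
    while queue:
        current = queue.popleft()
        for dep in DEPENDS_ON.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


print(sorted(upstream("order-processor")))
print(sorted(consumers("auth-service")))
print(in_degree_ranking()[0])
print(sorted(transitive_upstream("order-processor")))
```

Here in-degree counts consumers because edges point from consumer to provider. A tool that draws edges the other way would rank by out-degree instead.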

Before a dependency graph, these questions required a synchronous conversation with someone who had been on the platform long enough to build the mental model. The graph makes them self-service — and, importantly, reliable rather than dependent on the currency of any individual's memory.

A structured approach to graph-first onboarding

The dependency graph isn't a replacement for mentorship — it's a substrate that makes mentorship more efficient. The following progression works for teams that have tried it:

Week 1: squad-scoped view only

Start with a filtered graph view showing only the services your squad owns. Have the new engineer work through three specific questions before touching any code: (1) What are the upstream dependencies of each squad service, and what SLA tier do those dependencies carry? (2) Which services outside the squad consume our APIs or events? (3) Are any of our downstream consumers running at a higher SLA tier than our service itself?

That third question surfaces a common ownership gap: a P2-tier internal service that has somehow acquired P0 consumers. A new engineer who discovers this in week one knows immediately that changes to that service require unusual care. A new engineer who discovers it by causing an incident in month three has a much worse introduction to cross-service ownership.
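The tier check in question three is mechanical once tiers and consumers are available as data. A minimal sketch, assuming tiers are stored as integers where a lower number is stricter (P0 is 0), with made-up tier assignments and consumer lists:

```python
# Illustrative data. Lower number = stricter tier, so P0 -> 0, P2 -> 2.
SLA_TIER = {
    "notification-svc": 2,
    "order-processor": 0,
    "user-profile": 1,
    "auth-service": 0,
}

# provider -> services that consume it (assumed for this example).
CONSUMERS = {
    "notification-svc": ["order-processor", "user-profile"],
    "auth-service": ["user-profile"],
}


def tier_inversions(sla_tier, consumers):
    """Return (provider, consumer) pairs where the consumer's tier is stricter."""
    inversions = []
    for provider, downstream in consumers.items():
        for consumer in downstream:
            if sla_tier[consumer] < sla_tier[provider]:
                inversions.append((provider, consumer))
    return sorted(inversions)


for provider, consumer in tier_inversions(SLA_TIER, CONSUMERS):
    print(f"{provider} (P{SLA_TIER[provider]}) is consumed by "
          f"{consumer} (P{SLA_TIER[consumer]})")
```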

Weeks 2–3: one hop in each direction

Expand the view to include services one hop upstream and one hop downstream from the squad's direct ownership — the services your services call, and the services that call yours. Now the engineer has the full context for their day-to-day work: the upstream dependencies that impose constraints on what you can assume, and the downstream consumers that impose obligations on what you can change.
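The one-hop view is a small filter over the same kind of mapping. In this sketch the squad owns only order-processor. `checkout-web` is a hypothetical consumer added for the example, and all edges are illustrative.

```python
# Illustrative edges: consumer -> providers. "checkout-web" is hypothetical.
DEPENDS_ON = {
    "checkout-web": {"order-processor"},
    "order-processor": {"payment-api", "notification-svc"},
    "payment-api": {"auth-service"},
    "notification-svc": set(),
    "auth-service": set(),
}

SQUAD_SERVICES = {"order-processor"}


def one_hop_view(owned, depends_on):
    """Return owned services plus their direct upstream and downstream neighbours."""
    upstream, downstream = set(), set()
    for service, deps in depends_on.items():
        if service in owned:
            upstream |= deps          # services our services call
        if deps & owned:
            downstream.add(service)   # services that call ours
    return {
        "owned": set(owned),
        "upstream": upstream - owned,
        "downstream": downstream - owned,
    }


view = one_hop_view(SQUAD_SERVICES, DEPENDS_ON)
print(view)
```

In this example auth-service sits two hops upstream, so it stays out of the view until the full graph is opened up in month two.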

At this stage, a useful exercise: walk the new engineer through a recent PR from the squad and read the dependency impact report together. Even if the change was zero-impact, the exercise builds the cognitive model of how a schema diff maps to consumer risk. After doing this two or three times with real PRs, engineers start running the check themselves before opening PRs — the behavior is instilled through repetition, not instruction.
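To make the "schema diff maps to consumer risk" idea concrete, here is a deliberately simplified sketch. It treats a schema as a flat set of field names and assumes the fields each consumer reads are known. Real tooling would have to derive that, and would also handle type changes, renames, and nested structures. All field names and consumer reads are invented.

```python
# Fields of a hypothetical event schema before and after a PR.
OLD_FIELDS = {"orderId", "amount", "currency", "customerEmail"}
NEW_FIELDS = {"orderId", "amount", "currency"}

# Assumed knowledge of which fields each consumer reads.
CONSUMER_READS = {
    "notification-svc": {"orderId", "customerEmail"},
    "order-processor": {"orderId", "amount"},
}


def impact_report(old_fields, new_fields, consumer_reads):
    """Map each affected consumer to the removed fields it still reads."""
    removed = old_fields - new_fields
    return {
        consumer: sorted(fields & removed)
        for consumer, fields in consumer_reads.items()
        if fields & removed
    }


report = impact_report(OLD_FIELDS, NEW_FIELDS, CONSUMER_READS)
print(report or "zero-impact change")
```

Reading a report like this next to the PR diff is the exercise: the removed field is on one side, the consumer that still reads it on the other.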

Month 2: full graph, ownership navigation

By month two, the engineer should be navigating the full graph using the ownership layer — filtering by squad rather than by service name, reading the on-call schedule and Slack channel for each service before initiating a cross-team contract change. An engineer who can answer "I'm modifying the chargeCompleted event schema in payment-api — which squads own the consumers, what's their SLA tier, and where do I post the coordination notice?" in under 60 seconds is genuinely safe to ship cross-service changes.
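The 60-second answer is essentially a join between the consumer list and an ownership table. A sketch under stated assumptions: the squad names, tiers, and Slack channels below are placeholders, and the consumer list for chargeCompleted is invented.

```python
# Placeholder ownership metadata; squads, tiers and channels are made up.
OWNERSHIP = {
    "notification-svc": {"squad": "messaging", "sla_tier": "P2",
                         "channel": "#squad-messaging"},
    "order-processor": {"squad": "checkout", "sla_tier": "P0",
                        "channel": "#squad-checkout"},
}

# (producer service, event name) -> consuming services (illustrative).
EVENT_CONSUMERS = {
    ("payment-api", "chargeCompleted"): ["notification-svc", "order-processor"],
}


def coordination_targets(service, event):
    """Return (squad, consumer, tier, channel) rows for an event's consumers."""
    rows = []
    for consumer in EVENT_CONSUMERS.get((service, event), []):
        meta = OWNERSHIP[consumer]
        rows.append((meta["squad"], consumer, meta["sla_tier"], meta["channel"]))
    return sorted(rows)


targets = coordination_targets("payment-api", "chargeCompleted")
for squad, consumer, tier, channel in targets:
    print(f"{squad}: {consumer} ({tier}), post in {channel}")
```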

Reaching that level of situational awareness through tribal knowledge alone typically takes three to four months and requires the new engineer to have experienced or observed at least one cross-service incident. With the dependency graph, it's reachable in six weeks because it's queryable rather than recalled.

The self-maintaining onboarding artifact

The hardest thing about maintaining onboarding documentation is decay. Every service added, removed, or restructured makes some documentation stale. The new engineer who follows 18-month-old architecture docs builds a mental model that diverges from reality with every sprint.

A dependency graph built from schema files doesn't decay independently. It's current as a function of schema hygiene: as long as the OpenAPI specs, proto files, and AsyncAPI documents in the repos accurately reflect each service's actual interface (which must already be true for CI schema validation to function), the graph is current. You're not scheduling documentation reviews. You're not adding "update architecture diagram" to PR templates and watching it get skipped. The onboarding artifact is maintained as a side effect of the engineering practice you already need.

The tradeoff is worth naming: this approach has a schema hygiene dependency. Services that haven't published OpenAPI specs, that have undeclared gRPC interfaces, or that publish events without AsyncAPI documents are invisible in the graph. For those services, the onboarding artifact gap is the same as the dependency visibility gap — and both are resolved by the same investment. Teams typically find that the onboarding use case creates internal momentum for schema documentation work that "we should document our APIs" never did.
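The visibility gap can be measured by comparing the service catalog against the set of services that have schema files. A small sketch, with both sets invented for illustration:

```python
# Illustrative sets: every service in the catalog, and those with schema files.
CATALOG = {"order-processor", "payment-api",
           "legacy-pricing-engine", "pricing-service-v2"}
SERVICES_WITH_SCHEMAS = {"order-processor", "payment-api", "pricing-service-v2"}


def invisible_services(catalog, with_schemas):
    """Services that exist but cannot appear in a schema-derived graph."""
    return sorted(catalog - with_schemas)


def schema_coverage(catalog, with_schemas):
    """Fraction of catalogued services that have at least one schema file."""
    return len(catalog & with_schemas) / len(catalog) if catalog else 1.0


missing = invisible_services(CATALOG, SERVICES_WITH_SCHEMAS)
coverage = schema_coverage(CATALOG, SERVICES_WITH_SCHEMAS)
print(f"{coverage:.0%} covered, missing: {missing}")
```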

What the graph doesn't accelerate

The dependency graph compresses the time it takes a new engineer to understand what connects to what and what obligations those connections create. It doesn't substitute for the domain knowledge that explains why the system is structured the way it is — the historical context of architectural decisions, the technical debt that explains why legacy-pricing-engine still exists alongside pricing-service-v2, the institutional knowledge about which parts of the system are fragile under load conditions not visible in the graph.

A new engineer with a thorough understanding of the dependency graph has an accurate map. They still need engineers who know the territory. The right investment is both — the graph compresses the map-learning phase significantly, freeing mentorship time for the territory-learning that genuinely requires human transfer.

Buildpathio's dependency graph is filterable by squad, SLA tier, and ownership — built for teams that onboard engineers into complex service meshes.
