A Maturity Model for Platform Engineering Teams

Platform engineering maturity is often measured in platform capabilities — what golden paths exist, whether self-service is available, how much toil is automated. What gets measured less often is the quality of the pre-deploy risk signal: how accurately does the team understand the blast radius of a change before it ships? This model adds that dimension.

Stage 1: Basic infrastructure — "we have Kubernetes"

The team runs services on a container orchestration platform. Deployments happen. CI runs tests. But there is no shared model of how services depend on each other, no automated pre-deploy risk assessment, and no consistency in how teams coordinate when shared services change.

Checkpoints: containerized workloads, CI pipeline with basic tests, some deployment automation. Dependency awareness: tribal knowledge only.

Stage 2: Standardized delivery — "we have golden paths"

The platform team has established opinionated templates for how new services are created and deployed. Observability is instrumented. On-call rotations are formalized. Teams follow a consistent deployment process, and rollback is understood.

Checkpoints: service templates, centralized observability (metrics, logs, traces), defined incident response process. Dependency awareness: a service catalog exists (Backstage or equivalent), but is manually maintained and often out of date.

Stage 3: Self-service platform — "engineers don't need the platform team to deploy"

Product engineers can create, deploy, and scale services without opening a ticket to the platform team. Internal developer portals (IDPs) expose a self-service interface. Platform teams measure developer productivity, not just infrastructure uptime.

Checkpoints: IDP with self-service deployment, environment provisioning automation, SLO dashboards per service. Dependency awareness: service catalog is still mostly documentation-driven. High-impact PRs still require manual "who owns downstream X?" investigation.

"Stage 3 organizations know what they deployed. Stage 4 organizations know what they broke before they deployed it."

Stage 4: Risk-aware deployments — "the system tells you the blast radius"

Every PR receives a computed blast radius score derived from the live dependency graph — not from documentation. High-risk merges require additional review or explicit senior engineer override. The dependency graph is built continuously from code, API contracts, and service mesh telemetry, and it reflects the actual architecture rather than the documented one.
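As a rough illustration of what "computed blast radius" can mean, the sketch below walks a dependency graph in reverse to find every service that transitively depends on a changed service, then buckets the result into a risk label. The service names, the edge list, and the 20%/50% thresholds are all hypothetical; a real graph would be built from code, API contracts, and service mesh telemetry rather than hard-coded.

```python
from collections import deque

# Hypothetical edge list: (caller, callee) means caller depends on callee.
DEPENDS_ON = [
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "auth"),
    ("inventory", "auth"),
    ("search", "inventory"),
]

def build_reverse_graph(edges):
    """Map each service to the set of services that depend on it directly."""
    dependents = {}
    for caller, callee in edges:
        dependents.setdefault(callee, set()).add(caller)
        dependents.setdefault(caller, set())
    return dependents

def blast_radius(dependents, changed):
    """Return every service that transitively depends on any changed service."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - set(changed)

def risk_level(affected_count, total_services, high=0.5, medium=0.2):
    """Bucket the affected fraction into a label; thresholds are illustrative."""
    fraction = affected_count / total_services if total_services else 0.0
    if fraction >= high:
        return "HIGH"
    if fraction >= medium:
        return "MEDIUM"
    return "LOW"

if __name__ == "__main__":
    graph = build_reverse_graph(DEPENDS_ON)
    affected = blast_radius(graph, ["auth"])
    print(sorted(affected), risk_level(len(affected), len(graph)))
```

In this toy graph a change to `auth` reaches every other service, while a change to `checkout` reaches none. That asymmetry is the signal a reviewer would otherwise have to reconstruct by asking around.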

Checkpoints: automated dependency graph (live, not documented), pre-merge risk scoring in CI, blast radius gates on high-risk PRs, override audit trail. P0 incidents from dependency-blind merges are measurably declining.

The transition from Stage 3 to Stage 4 is where change impact analysis becomes infrastructure rather than a manual step. The dependency graph is a first-class artifact, queried by the CI system and visible to every engineer who opens a PR.
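A gate of this kind can be sketched in a few lines. The example below assumes a risk label has already been computed for the PR and shows one possible policy: high-risk merges pass only with extra approval or with a named override that is written to an audit trail. The class names, field names, and outcome strings are hypothetical and not tied to any particular CI system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class MergeRequest:
    pr_id: str
    risk: str                      # "LOW", "MEDIUM", or "HIGH" (assumed labels)
    extra_approvals: int = 0       # approvals beyond the normal review
    override_by: Optional[str] = None
    override_reason: Optional[str] = None

@dataclass
class BlastRadiusGate:
    required_extra_approvals: int = 1
    audit_trail: list = field(default_factory=list)

    def evaluate(self, mr: MergeRequest) -> str:
        """Return "allow", "allow-with-override", or "block" for a merge request."""
        if mr.risk != "HIGH":
            return "allow"
        if mr.extra_approvals >= self.required_extra_approvals:
            return "allow"
        # An override needs both a named engineer and a stated reason.
        if mr.override_by and mr.override_reason:
            self.audit_trail.append({
                "pr_id": mr.pr_id,
                "override_by": mr.override_by,
                "reason": mr.override_reason,
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return "allow-with-override"
        return "block"

if __name__ == "__main__":
    gate = BlastRadiusGate()
    print(gate.evaluate(MergeRequest("pr-1", "HIGH")))
    print(gate.evaluate(MergeRequest("pr-2", "HIGH", override_by="alice",
                                     override_reason="hotfix for active incident")))
    print(len(gate.audit_trail))
```

Recording who overrode and why, rather than only that an override happened, is what makes the trail worth reviewing later.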

Stage 5: Proactive reliability — "known unknowns become known"

The platform uses the dependency graph not just reactively (risk-scoring PR changes) but proactively, identifying architectural fragility before any change triggers it. Blast radius concentration metrics (which services have the highest transitive dependent counts) drive architectural decisions. Teams track graph accuracy, meaning how often the graph correctly predicted the blast radius of an incident, as a reliability metric alongside SLO compliance.
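One way to picture a blast radius concentration report is a ranking of services by transitive dependent count. The sketch below is a minimal version under assumed inputs: the edge list and service names are made up, and the "share" column (dependents divided by all other services) is just one possible normalization.

```python
def transitive_dependents(edges):
    """Return {service: set of services that transitively depend on it}.

    edges is an iterable of (caller, callee) pairs: caller depends on callee.
    """
    direct = {}
    for caller, callee in edges:
        direct.setdefault(callee, set()).add(caller)
        direct.setdefault(caller, set())
    result = {}
    for service in direct:
        seen, stack = set(), [service]
        while stack:
            for dependent in direct[stack.pop()]:
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        seen.discard(service)  # a cycle should not count a service as its own dependent
        result[service] = seen
    return result

def concentration_report(edges, top_n=3):
    """Rank services by transitive dependent count, highest first."""
    counts = {s: len(d) for s, d in transitive_dependents(edges).items()}
    others = max(len(counts) - 1, 1)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(service, count, count / others) for service, count in ranked[:top_n]]

if __name__ == "__main__":
    # Hypothetical dependency edges for illustration only.
    edges = [
        ("checkout", "payments"),
        ("checkout", "inventory"),
        ("payments", "auth"),
        ("inventory", "auth"),
        ("search", "inventory"),
    ]
    for service, count, share in concentration_report(edges):
        print(f"{service}: {count} transitive dependents ({share:.0%} of other services)")
```

A service near the top of this list is a candidate for architectural review even if nobody has changed it recently.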

Checkpoints: blast radius concentration reporting, architectural review process informed by graph topology, cross-team dependency SLAs, postmortem findings fed back into graph configuration improvements.

Most organizations with 50+ services are somewhere between Stage 2 and Stage 4. Stage 5 is aspirational for most, but the gap between Stage 3 and Stage 4 is where the largest addressable reduction in P0 incidents lives, and closing it is a tooling problem, not a process problem.

Progression indicators and anti-patterns

Each stage transition has common anti-patterns that stall teams without them realizing it:

Stage 2 → 3 stall: The platform team becomes the IDP. They build a portal that requires platform team involvement to use effectively. The portal is technically self-service but practically bottlenecked by platform team availability for edge cases. Resolution: treat the IDP as a product with its own user research; the customer is the product engineer, not the platform team.

Stage 3 → 4 stall: The team knows they need better dependency visibility but implements it as a documentation effort: Backstage with manually maintained service catalog entries, dependency diagrams in Confluence. This recreates the documentation drift problem from Stage 2, where the catalog is manually maintained and often out of date. Resolution: the dependency graph must be derived from code and runtime data, not written by humans.
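To make the drift concrete, a team can diff the dependencies its catalog declares against the edges actually observed in code or telemetry. The sketch below assumes both are available as sets of (caller, callee) pairs; the function name, the sample edges, and the "agreement" ratio are illustrative choices, not a standard metric.

```python
def catalog_drift(documented, observed):
    """Compare documented dependency edges with observed ones.

    Both arguments are iterables of (caller, callee) pairs.
    """
    documented, observed = set(documented), set(observed)
    union = documented | observed
    return {
        # Real dependencies the catalog does not know about.
        "missing_from_catalog": sorted(observed - documented),
        # Catalog entries that no code path or traffic backs up.
        "stale_in_catalog": sorted(documented - observed),
        # Share of all known edges that both sources agree on.
        "agreement": len(documented & observed) / len(union) if union else 1.0,
    }

if __name__ == "__main__":
    # Hypothetical data for illustration only.
    documented = [("checkout", "payments"), ("checkout", "inventory"), ("search", "catalog")]
    observed = [("checkout", "payments"), ("checkout", "inventory"),
                ("payments", "auth"), ("search", "inventory")]
    print(catalog_drift(documented, observed))
```

The edges missing from the catalog are the dangerous ones: they are exactly the dependencies a reviewer relying on documentation will not see.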

Stage 4 → 5 stall: The risk scoring system is deployed but not acted upon. Reviewers see HIGH risk scores and merge anyway without additional review. Override rates exceed 50%. Resolution: this is a cultural adoption problem, not a tooling problem — address it by involving senior engineers in setting override policies and regularly reviewing the override audit trail in retrospectives.
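Tracking the override rate only takes a small query over the audit data. The sketch below assumes each gate decision is recorded as a dict with hypothetical `risk` and `outcome` fields, and defines the rate as overridden high-risk merges divided by all high-risk merges. That definition is an assumption; teams may reasonably count it differently.

```python
def override_rate(decisions):
    """Fraction of merged HIGH-risk PRs that went through via override.

    Each decision is a dict with "risk" and "outcome" keys, where outcome is
    "allow", "allow-with-override", or "block" (hypothetical labels).
    Returns None when no HIGH-risk PR was merged.
    """
    high_risk_merged = [
        d for d in decisions
        if d["risk"] == "HIGH" and d["outcome"] != "block"
    ]
    if not high_risk_merged:
        return None
    overridden = sum(1 for d in high_risk_merged if d["outcome"] == "allow-with-override")
    return overridden / len(high_risk_merged)

if __name__ == "__main__":
    # Hypothetical audit records for illustration only.
    decisions = [
        {"risk": "HIGH", "outcome": "allow-with-override"},
        {"risk": "HIGH", "outcome": "allow-with-override"},
        {"risk": "HIGH", "outcome": "allow-with-override"},
        {"risk": "HIGH", "outcome": "allow"},
        {"risk": "HIGH", "outcome": "block"},
        {"risk": "LOW", "outcome": "allow"},
    ]
    rate = override_rate(decisions)
    print(f"override rate: {rate:.0%}")
    if rate is not None and rate > 0.5:
        print("Overrides are the default path; review the policy in the next retrospective.")
```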

How to assess your current stage

The fastest way to assess your current stage is to answer three questions:

  1. When a senior engineer is about to merge a PR that touches a shared service, do they know, with confidence, how many other services will be affected if it behaves unexpectedly? If the answer is "yes, we check the graph" rather than "yes, we ask around," you're at Stage 4 or above. If the answer depends on tribal knowledge or manual investigation, you're at Stage 3 or below.
  2. When a P0 occurs, does your postmortem process include checking whether the dependency graph correctly predicted the blast radius? If not, you have no feedback loop for improving graph accuracy — which means you're at best Stage 3.5.
  3. Does your platform team have a metric for dependency graph accuracy, alongside SLO compliance and MTTR? If the graph is infrastructure-grade, it should be measured like infrastructure. If it's still treated as "nice to have," it's not yet infrastructure.
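The postmortem check in questions 2 and 3 can be sketched as a comparison between the services the graph predicted would be affected and the services an incident actually affected. The definition of "accuracy" used here (the share of incidents whose actual impact was fully covered by the prediction) is one assumed option among several, and the function names and sample data are hypothetical.

```python
def incident_check(predicted, actual):
    """Compare the predicted blast radius with what an incident actually hit."""
    predicted, actual = set(predicted), set(actual)
    return {
        "missed": sorted(actual - predicted),         # affected but not predicted
        "overpredicted": sorted(predicted - actual),  # predicted but unaffected
        "covered": actual <= predicted,               # prediction contained all impact
    }

def graph_accuracy_rate(incidents):
    """Share of incidents fully covered by the prediction; None if no incidents.

    incidents is an iterable of (predicted_services, actual_services) pairs.
    """
    checks = [incident_check(predicted, actual) for predicted, actual in incidents]
    if not checks:
        return None
    return sum(1 for c in checks if c["covered"]) / len(checks)

if __name__ == "__main__":
    # Hypothetical postmortem data for illustration only.
    incidents = [
        (["checkout", "search"], ["checkout"]),
        (["payments"], ["payments", "billing"]),
        (["auth", "inventory"], ["auth", "inventory"]),
        (["search"], ["search"]),
    ]
    print(incident_check(*incidents[1]))
    print(graph_accuracy_rate(incidents))
```

Each "missed" service is a concrete lead for the feedback loop: it points at an edge the graph should have had and did not.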

We're not saying every team needs to be at Stage 5. The investment required to reach Stage 5 is only justified once Stage 4 is solidly operating and the remaining P0 sources are genuinely architectural fragility rather than tooling gaps. But Stages 1 through 4 have a clear priority ordering, and for teams still at Stage 3, the Stage 3 → 4 transition offers the best reliability improvement per unit of engineering investment.