May 5, 2026

14. Automation as a Product

Six months before the conversation that changed how the team thought about their work, the network platform team had delivered something genuinely impressive. Two years of consistent effort had produced a provisioning platform that handled branch onboarding end to end: a self-service interface for site requests, a closed-loop validation workflow that caught misconfigurations before deployment, and an operational dashboard that tracked service health across three hundred locations. The team had gone from twenty-four-hour change windows to forty-minute automated deployments. They were proud of it, and they should have been.

Then the business development team finalized a retail chain acquisition. One hundred and twenty new stores needed network connectivity within six months. The infrastructure lead sent a single email: “We’re counting on the automation platform for this. How do we get started?”

The network platform team’s first internal response was confident: they could automate this. The workflows existed. The templates existed. The tooling was proven.

Their second response, when they tried to answer the email, was less confident. The infrastructure lead was not asking whether automation was technically possible. They were asking a different set of questions. What is the SLA for store onboarding: how long after the site information is submitted is a store committed to have connectivity? Who is the escalation path when a deployment fails mid-rollout? Is there a status page or dashboard that the infrastructure lead’s team can monitor without asking the network team directly? What is the capacity of the platform: can it handle twenty simultaneous deployments, or will they need to batch requests? And the most uncomfortable question: what happens to stores that were onboarded during a platform incident? Are they automatically remediated, or does someone need to audit each one?

The network team had automation. They did not have answers, at least not in business or product terms.

This gap, between “we built automation that works” and “we provide a service that other teams can depend on and plan around”, is the subject of this chapter.

This is a topic I am passionate about. A session I delivered at Autocon3 covers this from a design-driven automation perspective and is a companion reference to this chapter.

14.1 From Capability to Product

Chapter 13 described how teams transform their people, their roles, and their ways of working to build automation as an organizational capability. That transformation is necessary. It is not sufficient.

A capability is something the team can do. A product is something other teams (or eventually, third-party entities) can depend on. The distinction is not about technical quality: the retail chain onboarding platform was technically excellent. The distinction is about the relationship between the provider and the consumer. A capability exists for its authors. A product exists for its users.

The network team’s output has shifted. Before automation, the output was device configuration: a router provisioned, a VLAN extended, an ACL applied. The consumer of that output was the device itself, and the evidence of success was a passing ping or a working application. With automation, the output changes: the network team’s product is a network service consumed by other teams. Every interaction with the network, provisioning a branch site, expanding a segment for a new application, enforcing a security policy, temporarily granting conference access to a VLAN, adding a peering connection at an internet exchange, becomes a service request rather than a change ticket. The device configuration is the implementation detail. The service is the artifact.

This is the Network Service as Product pattern: the service is the primary artifact, and the underlying network is the implementation. This is not new in software engineering: APIs abstract infrastructure, and the caller does not know or care which servers handle the request. It is, however, a significant shift for network teams that have historically organized their work around devices, vendors, and protocols. The engineer who defined their identity around router configuration skill is now being asked to define their identity around service delivery capability. That reframe connects directly to the Craftsman’s Dilemma from Chapter 13, section 13.1: the expert at the old craft is the person who needs to make the reframe most urgently, and the engineer who makes it completely becomes more valuable, not less.

The technical home of this product is the Presentation block described in Chapter 8. The self-service interface, the API surface, the webhook integration, the role-based access model: these are where the service contract is visible to consumers. This chapter zooms out from the technical interface to the organizational and business model surrounding it. What commitments come with the interface? Who owns it when it breaks? How does the service evolve? Who decides what it does next?

14.2 Defining the Product

Two failure modes appear consistently when teams try to turn network automation into services.

The first is over-exposure: the interface reveals implementation details, and consumers must understand network internals to use it. A branch provisioning service that asks for a VLAN ID, a subnet mask, and an OSPF area number is not a service: it is a CLI with a web form. The consumer, who is a facilities coordinator for the retail chain, does not know what an OSPF area is and should not need to.

The second is over-restriction: the interface is so constrained that it only handles the exact use cases the network team anticipated. Any request that deviates from the template requires an exception process. The facilities coordinator who needs to provision a temporary pop-up store with a different connectivity requirement than a permanent retail location cannot self-serve. They file a ticket. The ticket goes to the network team. The benefit of automation has not reached that consumer.

The Service Contract Pattern resolves both failure modes by making the interface definition explicit, versioned, and deliberately bounded. A service contract has three components:

  • Input surface: what the consumer provides, expressed in business terms, not network terms (sorry to insist, but this is key). A branch site request takes a location name, a physical address, a site tier (standard, small, pop-up), and an activation date. It does not take a VLAN ID. The contract translates business intent into network implementation internally, and that translation is the automation platform’s responsibility. Tier definitions are not permanent: a pop-up tier defined for a temporary retail kiosk will not cover a pop-up event space with different connectivity requirements. The input surface must be reviewed with consuming teams on the same cadence as the roadmap, so that the tiers reflect actual use cases rather than the use cases the network team anticipated when the contract was first written.

  • Output surface: what the consumer observes, including both successful completion and failure. A well-designed output surface does not expose “deployment failed at step 7 of 14: gNMI Set rejected with gRPC status INVALID_ARGUMENT”. It exposes “activation failed: the physical circuit at this address has not yet been provisioned by the carrier. Expected completion date: [date from SoT]. No action required from you; the system will retry automatically when the circuit comes online”. The automation does not just succeed or fail: it emits observable lifecycle events in terms the consumer understands.

  • Inner dependencies: what the service tracks internally that consumers should not see but the team must manage. Circuit state from the carrier. Neighboring services that share infrastructure with this one. The consistency relationship between the new site’s SoT record and the inventory records that drive automated monitoring. When a circuit goes into carrier maintenance, the network team needs to know which services are affected and what SLO exposure that creates. The consumer may need to know about the impact on their service; they do not need to know the implementation detail that caused it.
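The input and output surfaces described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the tier names come from the text, while the `TIER_DESIGN` values, the `CIRCUIT_NOT_READY` code, and the example address are hypothetical placeholders standing in for what a real SoT would supply.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class SiteTier(Enum):
    STANDARD = "standard"
    SMALL = "small"
    POP_UP = "pop-up"


@dataclass(frozen=True)
class BranchSiteRequest:
    """Input surface: business terms only, no VLAN IDs or OSPF areas."""
    location_name: str
    address: str
    tier: SiteTier
    activation_date: date


# Hypothetical tier-to-design mapping; real values would come from the SoT.
TIER_DESIGN = {
    SiteTier.STANDARD: {"access_switches": 2, "uplinks": 2},
    SiteTier.SMALL: {"access_switches": 1, "uplinks": 1},
    SiteTier.POP_UP: {"access_switches": 1, "uplinks": 1},
}

# Hypothetical internal failure codes mapped to consumer-facing language.
CONSUMER_MESSAGES = {
    "CIRCUIT_NOT_READY": (
        "Activation failed: the physical circuit at this address has not yet "
        "been provisioned by the carrier. No action required from you; the "
        "system will retry automatically when the circuit comes online."
    ),
}


def translate(request: BranchSiteRequest) -> dict:
    """Translation layer: expand business intent into implementation parameters."""
    return {"site": request.location_name, **TIER_DESIGN[request.tier]}


def lifecycle_event(internal_code: str) -> str:
    """Output surface: step numbers and protocol errors never reach the consumer."""
    return CONSUMER_MESSAGES.get(
        internal_code,
        "Activation failed: the network team has been notified and will follow up.",
    )


request = BranchSiteRequest(
    "Store 847", "1 Example Street", SiteTier.POP_UP, date(2026, 6, 15)
)
print(translate(request))
print(lifecycle_event("CIRCUIT_NOT_READY"))
```

The design choice worth noticing is that nothing in `BranchSiteRequest` can carry a network parameter, so over-exposure is prevented by the type itself rather than by convention.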

flowchart LR
    classDef consumer fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b
    classDef contract fill:#f5e8d9,stroke:#a57a4a,color:#1a2e3b
    classDef internal fill:#d9f5e8,stroke:#3a8a5a,color:#1a2e3b

    subgraph Consumer["Consumer Interface"]
        IN["Input Surface<br/>Location, tier, date<br/>Business intent"]
        OUT["Output Surface<br/>Status, lifecycle events<br/>Business language"]
    end

    subgraph Contract["Service Contract"]
        TRANS["Translation Layer<br/>Intent to implementation<br/>VLAN, subnet, OSPF area"]
    end

    subgraph Internal["Inner Dependencies"]
        CIRC["Circuit state"]
        SOT["SoT consistency"]
        NEIGHBOR["Neighboring services"]
    end

    IN --> TRANS
    TRANS --> OUT
    TRANS --> CIRC & SOT & NEIGHBOR
    CIRC & SOT & NEIGHBOR -.-> OUT

    class IN,OUT consumer
    class TRANS contract
    class CIRC,SOT,NEIGHBOR internal

With the service contract defined, the next question is what happens to it over time.

The lifecycle question is where many teams underinvest. A service contract is not just about the moment of provisioning. What happens to this service when the underlying infrastructure changes? A branch site running over a circuit that is going into scheduled carrier maintenance has an expected SLO impact. Who knows about that impact, who decides whether to notify the retail chain’s operations team, and who owns the communication if the maintenance window overruns? These questions require services to be first-class entities in the Source of Truth.

The SoT in Chapter 4 describes intent as the authoritative record of what the network should be. Services, in the product model, extend that intent upward: not just what the network elements should look like, but what business functions those elements are delivering and to whom. A SoT that models devices and circuits but not services cannot answer the question “which retail stores are affected by this circuit maintenance?” It cannot feed the dependency maps that make service-aware change management possible. The Orchestration block from Chapter 7 depends on this dependency graph when coordinating remediation: a closed-loop workflow that responds to a circuit failure needs to know which services are affected before it can route notifications and trigger the correct recovery sequence.

This is precisely the abstraction that Chapter 4, section 4.2.2 formalizes as the Design-Driven building block: an operator provides high-level intent (“add a branch”) and the SoT expands it into the fifty-plus technical objects the Executor needs before it touches a device. The service model in this chapter extends that same principle one layer higher, from “what technical objects must exist” to “what business function do those objects deliver, and to whom.” The SoT that was designed to abstract device syntax from operators can equally abstract network internals from service consumers, if services are modeled as first-class objects within it.

The practical consequence: services must be modeled in the SoT with their dependency chains. The branch site service depends on a physical circuit, which depends on a provider, which has a maintenance window history. The network segment service depends on a set of access switches. The peering service at the internet exchange depends on a BGP session, which depends on a physical port, which lives in a specific facility. When any dependency changes state, the service model updates, and the affected consumer can be notified in terms they understand.
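One way to picture the dependency chains is as a small directed graph that is walked upward from whatever changed. The sketch below is an assumption about how such a lookup could work, not a description of any particular SoT product: the object names reuse the examples in this section, and the `kind:name` key format is invented for illustration.

```python
from collections import defaultdict

# Edges read "X depends on Y". The key format "kind:name" is hypothetical.
DEPENDS_ON = {
    "service:branch-site/store-847": ["circuit:CID-44821"],
    "service:app-connectivity/inventory": ["switch:bldg-b-sw-01", "bgp:AS64501"],
    "bgp:AS64501": ["port:rack-14-u32"],
}


def affected_services(changed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Walk upward from a changed object to every service that depends on it."""
    # Invert the edges so we can ask "who depends on this object?"
    dependents = defaultdict(set)
    for parent, children in depends_on.items():
        for child in children:
            dependents[child].add(parent)

    seen: set[str] = set()
    stack = [changed]
    while stack:
        node = stack.pop()
        for parent in dependents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    # Intermediate objects (such as the BGP session) are traversed but not reported.
    return {name for name in seen if name.startswith("service:")}


print(affected_services("circuit:CID-44821", DEPENDS_ON))
print(affected_services("port:rack-14-u32", DEPENDS_ON))
```

A carrier maintenance notice for the circuit resolves to the branch site service, and a port change resolves to the application connectivity service through the BGP session, which is the lookup a notification workflow would need before contacting a consumer.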

flowchart TB
    classDef service fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b
    classDef infra fill:#d9f5e8,stroke:#3a8a5a,color:#1a2e3b
    classDef external fill:#f5e8d9,stroke:#a57a4a,color:#1a2e3b

    S1["Branch Site Service<br/>Store 847"]
    S2["Application Connectivity Service<br/>Inventory System"]

    C1["Circuit<br/>Provider X, CID-44821"]
    SW1["Access Switch<br/>bldg-b-sw-01"]
    BGP1["BGP Session<br/>AS64501"]
    PORT1["Physical Port<br/>rack-14-u32"]

    S1 --> C1
    S2 --> SW1
    S2 --> BGP1
    BGP1 --> PORT1

    MAINT["Carrier Maintenance Window<br/>2026-06-15 02:00 UTC"]
    C1 -.->|"affected by"| MAINT

    class S1,S2 service
    class C1,SW1,BGP1,PORT1 infra
    class MAINT external

Network services the team should be able to model in this way include: new branch site activation, temporary network access for a conference or pop-up event, application connectivity with ACLs enforcing dependency rules, internet peering expansion at an exchange point, and VLAN extension for a new project segment. Each of these has a business consumer, a lifecycle, dependencies, and a meaningful definition of health and failure.

Most teams building this model are not starting from a clean slate. The existing network already has three hundred branch sites, years of accumulated configuration changes, and a SoT that models devices and circuits but not services. Before those existing sites can participate in the service model, their actual state must be discovered and reconciled against what the SoT believes is true. A site whose running configuration has drifted from its SoT record cannot be safely brought under automated lifecycle management until that drift is resolved: the automation will push configurations based on the SoT’s view of intent, and if that view is wrong, the push will make things worse. Discovery and reconciliation, reading device state and comparing it against SoT records to identify and resolve gaps, is the prerequisite step for brownfield environments. This work is unglamorous and time-consuming, but skipping it means the service model is only valid for sites provisioned after the platform was built, which is typically a small fraction of the network.
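At its core, the reconciliation step is a comparison between two views of the same site. The following sketch assumes both views have already been normalized into flat dictionaries; the attribute names and values are hypothetical, and a real brownfield effort would also need parsers, collectors, and a process for deciding which side is correct.

```python
def find_drift(intended: dict, actual: dict) -> dict:
    """Return {attribute: (intended, actual)} for every attribute that disagrees."""
    drift = {}
    for key in intended.keys() | actual.keys():
        want, have = intended.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def safe_to_onboard(intended: dict, actual: dict) -> bool:
    """A site joins automated lifecycle management only when no drift remains."""
    return not find_drift(intended, actual)


# Hypothetical SoT record and discovered device state for one existing site.
sot_record = {"mgmt_vlan": 10, "uplink": "Gi0/1", "ntp_server": "10.0.0.1"}
device_state = {
    "mgmt_vlan": 10,
    "uplink": "Gi0/2",            # changed by hand at some point
    "ntp_server": "10.0.0.1",
    "snmp_community": "legacy",   # present on the device, unknown to the SoT
}

print(find_drift(sot_record, device_state))
print(safe_to_onboard(sot_record, device_state))
```

Reporting both directions of disagreement matters: an attribute the SoT does not know about is as much a risk to an automated push as an attribute whose value has changed.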

Modeling services in the SoT is necessary but not sufficient: the team also needs to observe them at the right level. The Observability block from Chapter 6 closes the loop: service-level observability, tracking whether a service is healthy from the consumer’s perspective, is distinct from device-level observability, tracking whether the underlying infrastructure is healthy. Both are necessary. A service can appear degraded to its consumer while all underlying devices report healthy, if the service model is not instrumented at the right level.

14.3 Business Alignment

The traditional argument for network automation focuses on operational efficiency: fewer tickets, faster provisioning, lower error rates. That argument is true and useful for justifying the initial investment. It is not sufficient for maintaining strategic investment over time.

Operational efficiency is measured against the current baseline. A team that reduces manual provisioning time from four hours to forty minutes has demonstrated a significant throughput improvement. The business unit leader who approved the budget three years ago is pleased but not strategically engaged: the network team is running better, and that is good, but it is not a reason to invest further.

The stronger argument is capability: automation enables business outcomes that would otherwise be impossible or prohibitively expensive. The retail chain expansion is a concrete example. Without a mature automation platform, onboarding one hundred and twenty stores in six months requires a team of network engineers dedicated to that project for six months. Assuming an aggressive manual provisioning rate of roughly one store per engineer per day (a number that varies significantly depending on connectivity type, device count, carrier coordination time, and whether site surveys are complete), that is a team of roughly eight people with no other responsibilities. With a mature automation platform, the same work is handled by the existing team running automated workflows in parallel. The business outcome, expansion completed in six months at acceptable cost, is only achievable because automation exists. That is not an efficiency argument. It is a capability argument. A business argument.

A second example: an organization running large-scale AI model training depends on interconnect provisioning latency and reliability. If bringing a new training cluster online requires two weeks of manual network provisioning, the business’s ability to run training experiments at competitive velocity is constrained by the network team’s throughput. Automation that reduces provisioning from two weeks to forty-eight hours is a direct input to a business capability the business unit considers strategic. The network team that cannot articulate that connection is leaving influence on the table.

A third example: a service provider whose engineers manually configure PE routers, VRF instances, BGP sessions with customer CE devices, and QoS policies for each new enterprise MPLS VPN contract takes four weeks from order acceptance to service live. That timeline is not an operations problem: it is a sales problem. Competitors who have automated the same provisioning sequence, generating consistent configurations from a service request, validating them against the existing topology, and pushing them through a tested workflow, promise two-week turn-ups in RFP responses. The four-week provider cannot match that commitment regardless of how skilled its engineers are, because the constraint is not skill: it is the serial, manual nature of the provisioning process. Automation that compresses turn-up from four weeks to three days changes what the sales team can promise, which changes which contracts the company can win. That is not an efficiency argument. It is a revenue argument, and it belongs in the conversation about whether to invest in the automation platform.

The sequence of reasoning matters. Automation designed from device behavior upward, starting with “what can we automate about router configuration” and working toward “what does the business get from this”, often cannot articulate business value because it was never designed to deliver a business outcome. Automation designed from business capability downward, starting with “what business outcomes require reliable network services” and working backward through service design to the automation that implements those services, can connect its work to business priorities from the first conversation.

The Business-Driven Service Map pattern makes this connection explicit: a mapping of network services to the business capabilities they enable. For each network service, the map answers three questions: which business processes depend on this service, what is the business impact if this service is degraded or unavailable, and what business capability would become possible if this service were faster, more reliable, or self-service. This is the document a Network Automation Product Manager would own, and it is the primary instrument for aligning the automation roadmap with business priorities.
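The map does not need special tooling to get started; a structured record per service is enough. The sketch below is one possible shape, with a field for each of the three questions. The service and capability names follow the examples in this chapter, while the impact and capability descriptions are illustrative wording, not measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceMapEntry:
    network_service: str
    business_processes: tuple[str, ...]   # which processes depend on the service
    impact_if_degraded: str               # business impact of degradation
    capability_if_improved: str           # what a better service would unlock


# Illustrative entries only; a real map is written with the consuming teams.
SERVICE_MAP = [
    ServiceMapEntry(
        "Branch Activation",
        ("Retail Expansion",),
        "New stores cannot open on the planned date",
        "Many stores onboarded in parallel by the existing team",
    ),
    ServiceMapEntry(
        "Interconnect Provisioning",
        ("AI Training Velocity",),
        "New training clusters wait on network provisioning",
        "Clusters online in days rather than weeks",
    ),
    ServiceMapEntry(
        "Temporary Access",
        ("Event Operations",),
        "Conferences and pop-up events start without connectivity",
        "Event staff request access without filing a ticket",
    ),
]


def services_behind(capability: str) -> list[str]:
    """List the network services a given business capability depends on."""
    return [
        entry.network_service
        for entry in SERVICE_MAP
        if capability in entry.business_processes
    ]


print(services_behind("Retail Expansion"))
```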

flowchart TB
    classDef biz fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b
    classDef svc fill:#f5e8d9,stroke:#a57a4a,color:#1a2e3b
    classDef auto fill:#d9f5e8,stroke:#3a8a5a,color:#1a2e3b

    subgraph BIZ["Business Capabilities"]
        B1["Retail Expansion"]
        B2["AI Training Velocity"]
        B3["Event Operations"]
    end

    subgraph SVC["Network Services"]
        S1["Branch Activation"]
        S2["Interconnect Provisioning"]
        S3["Temporary Access"]
    end

    subgraph AUTO["Automation"]
        A1["Site Onboarding Workflow"]
        A2["Circuit and BGP Provisioning"]
        A3["Conference VLAN Lifecycle"]
    end

    B1 --> S1 --> A1
    B2 --> S2 --> A2
    B3 --> S3 --> A3

    class B1,B2,B3 biz
    class S1,S2,S3 svc
    class A1,A2,A3 auto

This reframe is uncomfortable for many network teams, because it requires measuring different things. Operational metrics (tickets closed, change success rate, Mean Time To Resolution (MTTR)) are within the team’s control and easy to instrument. Business outcome metrics (time to open a new retail location, interconnect provisioning latency as an input to training throughput) require collaboration with other business units and an understanding of what they actually measure. The team that makes this shift, from measuring technical excellence to measuring business contribution, is answering a different question in budget conversations: not “how efficiently did we run the network this quarter” but “which business outcomes depended on the network platform this quarter, and what would have failed without it.” That is a harder question to answer, but it is the one that determines whether the platform gets funded for the next phase.

The Impact Measurement over Efficiency Measurement principle follows: prioritize measuring outcomes that matter to the business over operational metrics that matter only to the network team. Efficiency metrics are inputs. Outcomes are what the business cares about.

This reframe also changes what the team asks for in planning cycles. A team that presents “we reduced MTTR by 40%” is reporting on its own performance. A team that presents “the retail expansion timeline is achievable because our onboarding automation can handle forty concurrent activations without manual intervention” is reporting on business capability. Both facts may be true. Only one of them is relevant to the decision about whether to staff the retail expansion project.

14.4 The Internal SLA

An automation platform that other teams depend on without reliability commitments is a trust trap. It works until it does not, and the consuming team has no data to plan around. The retail chain infrastructure lead who schedules twenty simultaneous branch activation requests on a Monday morning needs to know, before Monday morning, what the platform’s behavior will be: how long will each activation take, can twenty run in parallel or will they queue, what happens if one fails mid-deployment, and how will the platform communicate that failure?

These are SLA questions. In the product model, every automation service carries an explicit SLA, not as legal protection but as the operational contract that makes the service plannable.

An automation SLA has four components:

  • Availability: the percentage of time the service interface is reachable and accepting requests. A branch activation service with 99.5% monthly availability has roughly three and a half hours of allowed downtime per month. That number is a commitment: when the service is down, the team owes an explanation and a recovery timeline.

  • Execution latency: how long the service takes to fulfill a request from submission to completion. For branch activation, this might be: acknowledgment within thirty seconds, provisioning started within five minutes, completion within forty minutes for a standard site. These numbers define what “working” looks like, not just whether the service is reachable.

  • Error budget: how often the service can fail without violating the SLA. A service with 99% successful completions per week has a one percent error budget. When more than one percent of activations fail in a week, something is wrong and the team owes a review. The NRE role described in Chapter 13, section 13.2 is the person who owns defining and defending these budgets, and the error budget model from the SRE literature applies directly: when the error budget is consumed, new automation deployments pause until reliability is restored.

  • Escalation path: when the service fails or misses its SLA, who does the consumer call, and what is the expected response time. An escalation path that routes to the general network team inbox is not an escalation path: it is a ticket queue. A product SLA requires a named or role-based escalation contact with a defined response commitment.

SRE (Site Reliability Engineering) is a discipline originating in large-scale software operations that applies engineering principles to reliability: defining service level objectives, measuring error budgets, and using reliability data to govern feature velocity. The NRE (Network Reliability Engineer) role adapts these principles to network automation platforms. Both roles and their application to network teams are covered in detail in Chapter 13, section 13.2.
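The four components fit naturally into a single record, and the two numeric commitments can be checked with simple arithmetic. The sketch below uses the figures from the branch activation example; the class name, the 30-day month, and the `nre-on-call` contact are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutomationSLA:
    availability_pct: float     # share of time the interface accepts requests
    completion_minutes: int     # execution latency target for a standard site
    success_rate_pct: float     # successful completions per week
    escalation_contact: str     # named or role-based, never a shared inbox

    def allowed_downtime_hours(self, days: int = 30) -> float:
        """Downtime the availability commitment permits over a period."""
        return days * 24 * (1 - self.availability_pct / 100)

    def error_budget_exhausted(self, attempts: int, failures: int) -> bool:
        """True when failures exceed the share of attempts allowed to fail."""
        allowed_failures = attempts * (1 - self.success_rate_pct / 100)
        return failures > allowed_failures


# Figures from the text; the escalation contact is a hypothetical role name.
branch_activation = AutomationSLA(
    availability_pct=99.5,
    completion_minutes=40,
    success_rate_pct=99.0,
    escalation_contact="nre-on-call",
)

print(round(branch_activation.allowed_downtime_hours(), 2))
print(branch_activation.error_budget_exhausted(attempts=200, failures=3))
```

For a 30-day month the availability figure works out to about 3.6 hours, which is the “roughly three and a half hours” quoted above. With 200 activations in a week, a third failure exhausts the one percent budget.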

The support model question follows naturally: when automation fails, who owns it? Three failure modes appear in almost every incident, and confusing which one is active leads to incidents being dropped between teams.

| Failure mode | Symptom | Owner |
| --- | --- | --- |
| Automation bug: workflow logic is wrong | Consistent failure on the same specific input; passes for other inputs | Automation Developer |
| Platform failure: execution engine, SoT, or observability infrastructure failed | Broad failure across multiple unrelated services simultaneously | Platform Team |
| Network failure: underlying device or circuit failed | Automation completed without error; network state did not converge | Network Operations |

The overlap between these categories is where incidents get dropped. An automation workflow that fails because the gRPC Network Management Interface (gNMI) push was rejected could be an automation bug (wrong data model), a platform failure (the gNMI collector lost its session), or a network failure (the device restarted during the push). The incident response process must be designed to triage across these categories without requiring the consuming team to understand which one is active. From the retail chain’s perspective, the store did not get connectivity. Who fixes it and when is the provider’s problem, not theirs.

A practical triage sequence for any automation failure follows three steps. First, check the automation logs: did the workflow report an error, and is that error consistent across multiple runs of the same input or specific to one? Consistent failure on a specific input points to an automation bug; random or intermittent failure points elsewhere. Second, check platform health: are other unrelated services failing simultaneously, and are the execution engine, SoT, and observability stack reporting healthy? Broad simultaneous failure across unrelated services is a platform failure signature, regardless of what the workflow logs say. Third, check device state: did the network element receive and apply the intended configuration, and does its current state match what the automation attempted to push? If the workflow completed without error but the network did not converge, the failure is in the network layer. This sequence can be encoded as the first three steps of an automated runbook, so that the on-call engineer arrives at an incident with the triage already done rather than starting from scratch.
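A sketch of how that sequence might look as the opening of an automated runbook follows. The evidence fields correspond to the three checks; how each one is actually collected (log queries, health endpoints, device reads) is left out and would be environment-specific. Evaluating the platform signature first in the classification is a choice made here, based on the statement that broad simultaneous failure indicates a platform failure regardless of what the workflow logs say.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    AUTOMATION_DEVELOPER = "Automation Developer"
    PLATFORM_TEAM = "Platform Team"
    NETWORK_OPERATIONS = "Network Operations"
    UNCLASSIFIED = "Needs human triage"


@dataclass(frozen=True)
class Evidence:
    # Step 1: automation logs
    workflow_reported_error: bool
    fails_consistently_on_same_input: bool
    # Step 2: platform health
    unrelated_services_failing: bool
    platform_components_healthy: bool
    # Step 3: device state
    device_state_matches_intent: bool


def triage(e: Evidence) -> Owner:
    """Map the three checks onto the owning team from the failure-mode table."""
    if e.unrelated_services_failing or not e.platform_components_healthy:
        return Owner.PLATFORM_TEAM
    if e.workflow_reported_error and e.fails_consistently_on_same_input:
        return Owner.AUTOMATION_DEVELOPER
    if not e.workflow_reported_error and not e.device_state_matches_intent:
        return Owner.NETWORK_OPERATIONS
    # Intermittent or ambiguous evidence is handed to a person, not guessed at.
    return Owner.UNCLASSIFIED


incident = Evidence(
    workflow_reported_error=False,
    fails_consistently_on_same_input=False,
    unrelated_services_failing=False,
    platform_components_healthy=True,
    device_state_matches_intent=False,
)
print(triage(incident).value)
```

The explicit `UNCLASSIFIED` outcome is deliberate: the overlap cases described above are exactly where a confident but wrong routing decision would cause an incident to be dropped.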

The platform team reference in the second failure mode connects to Chapter 10: platform reliability is a precondition for service SLAs. A service cannot commit to 99.5% availability if the execution engine it runs on has no reliability target. The platform engineering patterns in Chapter 10, redundancy, health monitoring, automated recovery, are what make automation SLAs credible.

Design the escalation path before the incident happens. The post-incident conversation about who owned what is always harder than the pre-incident conversation that established clear boundaries.

Blast radius is a related design concern that belongs in the same pre-incident conversation. A manual provisioning mistake affects one site because an engineer configures one site at a time. An automation bug can affect every site that matches the input pattern before a human notices anything is wrong. The response to this is not to slow down automation: it is to design concurrency limits and staged rollout into the service contract as deliberate safety choices, not implementation details. A branch activation service that caps concurrent deployments at five active jobs, validates the first deployment to completion before releasing the next batch, and holds on any failed job until a human clears it is not a slow service: it is a service whose blast radius is bounded by design. That bound should appear in the service contract, alongside availability and execution latency, so that consuming teams understand both what the platform can do and what it will refuse to do in order to protect them. The retail chain infrastructure lead who understands that the platform will pause a forty-store rollout after the first failure, rather than completing thirty-nine more deployments on a broken template, will trust the platform more, not less.
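The bounded rollout described above can be sketched as a short loop: one site first, then batches capped at the concurrency limit, with everything still pending held as soon as any job fails. This is an illustration under simplifying assumptions: `deploy` is a hypothetical callable that returns `True` on success, and the jobs inside a batch run one after another here, whereas a real platform would run them concurrently.

```python
from typing import Callable


def staged_rollout(
    sites: list[str],
    deploy: Callable[[str], bool],
    batch_size: int = 5,
) -> dict[str, list[str]]:
    """Deploy a single canary, then capped batches; hold the rest on any failure."""
    # The first batch is one site; later batches hold at most batch_size sites.
    batches = [sites[:1]] + [
        sites[i:i + batch_size] for i in range(1, len(sites), batch_size)
    ]
    completed: list[str] = []
    for index, batch in enumerate(batches):
        failed: list[str] = []
        for site in batch:
            (completed if deploy(site) else failed).append(site)
        if failed:
            # Nothing from later batches starts until a human clears the hold.
            held = [site for later in batches[index + 1:] for site in later]
            return {"completed": completed, "failed": failed, "held": held}
    return {"completed": completed, "failed": [], "held": []}


def broken_template(site: str) -> bool:
    """Stand-in for a deployment that fails for every site."""
    return False


stores = [f"store-{n:03d}" for n in range(1, 41)]
result = staged_rollout(stores, broken_template)
print(len(result["failed"]), len(result["held"]))
```

With a template that fails everywhere, the forty-store rollout stops after one failed site and holds the remaining thirty-nine, which is the behavior the service contract should promise.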

The existence of SLA commitments, blast radius controls, and a triage framework creates the conditions for a different relationship with change governance. In organizations that still operate under traditional change management, every network change, including automated ones, may be routed through a Change Advisory Board for pre-approval. That process was designed for a world where each change is unique, hand-crafted, and unpredictable: the right person reviewing a manual change by a human engineer adds genuine risk reduction because human judgment varies. The same logic applied to an automated workflow that has been designed, tested in a pre-production environment, validated against the SoT, constrained by blast radius limits, and run successfully dozens of times does not add risk reduction: it adds latency to an operation whose risk profile was established before it was ever run in production.

The Pre-Approved Automation Pattern resolves this: change approval is applied once to the workflow design, not repeatedly to each execution of that workflow. When a branch activation workflow has passed its validation stages, been reviewed and approved by the relevant engineering and operations stakeholders, and been deployed to production with its safety constraints active, each subsequent execution of that workflow is not a new change requiring new approval. It is an instance of an already-approved, bounded operation. The appropriate governance question is “is this execution within the approved envelope?” not “should this execution happen at all?” A cloud provider does not require a human to approve each virtual network creation request: the service was designed with appropriate constraints, tested thoroughly, and approved as a service. Individual customer requests within that service boundary are not change events requiring review. They are service invocations. Network automation services, once properly designed and approved, should operate the same way. The work that justifies this trust is exactly what sections 14.2 through 14.4 describe: explicit service contracts, observable outputs, bounded blast radius, and a defined triage path when something goes wrong. That work is the change approval, done once, at the right moment.
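The governance question “is this execution within the approved envelope?” can be made mechanical. The sketch below assumes an envelope is described by an approved workflow version, a set of permitted site tiers, and a concurrency cap; those fields, the version strings, and the workflow name are hypothetical choices for illustration, and a real envelope would likely carry more constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovedEnvelope:
    workflow: str
    approved_version: str
    allowed_tiers: frozenset[str]
    max_concurrent_jobs: int


@dataclass(frozen=True)
class Execution:
    workflow: str
    version: str
    tier: str
    active_jobs: int   # jobs already running when this one is requested


def envelope_violations(envelope: ApprovedEnvelope, execution: Execution) -> list[str]:
    """Return every reason the execution falls outside the approved envelope."""
    violations = []
    if execution.workflow != envelope.workflow:
        violations.append("workflow is not the approved one")
    if execution.version != envelope.approved_version:
        violations.append("workflow version has not been approved")
    if execution.tier not in envelope.allowed_tiers:
        violations.append("site tier is outside the approved set")
    if execution.active_jobs >= envelope.max_concurrent_jobs:
        violations.append("concurrency limit would be exceeded")
    return violations


envelope = ApprovedEnvelope(
    workflow="branch-activation",
    approved_version="2.3.0",
    allowed_tiers=frozenset({"standard", "small", "pop-up"}),
    max_concurrent_jobs=5,
)

routine = Execution("branch-activation", "2.3.0", "standard", active_jobs=2)
untested = Execution("branch-activation", "2.4.0-rc1", "standard", active_jobs=5)

print(envelope_violations(envelope, routine))
print(envelope_violations(envelope, untested))
```

An empty list means the request is a service invocation and proceeds without review. Anything else is a genuine change event, and that is where human approval still adds value.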

14.5 Performance, Cost, and ROI

Automation has costs. Compute infrastructure for the execution engine, the orchestrator, and the observability stack. Storage for SoT records, job history, and telemetry. Engineering time for building, testing, and maintaining automation code (and the corresponding AI coding tokens to support it). Tooling licenses for commercial components of the platform. Support burden when consumers file issues or request new capabilities. These costs are real and recurring.

The ROI question is also real, and network teams that avoid it cede budget decisions to finance and procurement teams who will answer it less accurately. The framework has three components:

  • Cost of automation: the fully loaded cost of building and running the platform. Engineering salaries allocated to platform development and maintenance, infrastructure costs (compute, storage, networking for the automation infrastructure itself), tooling licenses, and operational overhead. This number is knowable and the team should know it.

  • Cost of the manual equivalent: what it would cost to deliver the same services without automation. For branch activation, this is engineering hours per site multiplied by the hourly cost of the engineers who would do it, plus the cost of errors (incidents caused by manual provisioning mistakes, measured in Mean Time To Resolution (MTTR) and affected services). For the retail chain expansion, the manual cost is large enough to make the automation investment obviously justified. For a low-volume service provisioned twice a year, the calculus is different.

  • Value of capabilities unlocked: the business outcomes that would not be possible without automation. This is the hardest component to quantify and the highest-value argument to make. One hundred and twenty stores in six months is not a matter of efficiency: it is a binary capability. Without automation, it does not happen on that timeline regardless of engineering budget. The network team that can state clearly “the retail expansion timeline depends on our automation platform” is participating in a strategic conversation, not a budget negotiation.
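
The first two components reduce to arithmetic, which the following sketch works through. Every figure in it is hypothetical and exists only to show the shape of the calculation; the class and function names are likewise assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutomationCost:
    """Annual fully loaded cost of the platform (all figures hypothetical)."""
    engineering: float
    infrastructure: float
    licenses: float
    operations: float

    def total(self):
        return self.engineering + self.infrastructure + self.licenses + self.operations


def manual_equivalent_cost(sites, hours_per_site, hourly_rate,
                           expected_incidents, cost_per_incident):
    """Cost of delivering the same sites by hand, including the cost of errors."""
    labor = sites * hours_per_site * hourly_rate
    errors = expected_incidents * cost_per_incident
    return labor + errors


def net_saving(automation, manual_cost):
    """Positive when automation is cheaper than the manual equivalent."""
    return manual_cost - automation.total()


# Hypothetical platform: 400,000 per year in total.
platform = AutomationCost(engineering=300_000, infrastructure=40_000,
                          licenses=30_000, operations=30_000)

# High-volume case: 120 sites at 40 hours each, plus 10 provisioning incidents.
high_volume = manual_equivalent_cost(120, 40, 100, 10, 20_000)

# Low-volume case: a service provisioned twice a year with no incidents.
low_volume = manual_equivalent_cost(2, 40, 100, 0, 20_000)
```

With these assumed inputs, `net_saving(platform, high_volume)` is positive and `net_saving(platform, low_volume)` is strongly negative, which is the difference in calculus the second component describes. The third component is deliberately left out of the calculation: a capability that is binary does not reduce to a cost comparison.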

Three axes define the design space of any automation service, and each represents a trade-off that the product model forces into the open:

  • Speed: how fast is the service from request submission to completion?
  • Safety: how reliably does the service avoid making things worse, through validation, dry-run stages, and rollback paths?
  • Utilization: how many concurrent requests can the platform handle without degradation?

These axes are in tension. Maximizing safety through extensive validation and supervised execution stages adds latency: a safer workflow is typically a slower one. Maximizing speed by reducing validation stages increases risk. Maximizing concurrent utilization requires infrastructure investment that may not be justified by actual request volume.

The product framing forces these trade-offs to be explicit in the service contract. When the retail chain infrastructure lead asks whether twenty simultaneous deployments are safe, the answer is not “it works in testing”. The answer is a concrete statement about the platform’s design: concurrent deployment capacity is bounded at twenty-four active jobs, each deployment has an independent rollback path, and the observability system confirms successful state convergence before marking a job complete. Those statements come from a team that has thought about the trade-offs as product design decisions, not as engineering implementation details.
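
A bounded concurrency commitment is straightforward to enforce in code. The sketch below is one possible shape, under assumptions: the class name and job identifiers are hypothetical, the default capacity mirrors the twenty-four job figure in the example above, and the caller is expected to call `complete` only after state convergence has been confirmed.

```python
from collections import deque


class DeploymentAdmission:
    """Admit at most `capacity` concurrent deployments; queue the rest in order."""

    def __init__(self, capacity=24):
        self.capacity = capacity
        self.active = set()
        self.waiting = deque()

    def submit(self, job_id):
        """Start the job if a slot is free, otherwise queue it. Returns the job state."""
        if len(self.active) < self.capacity:
            self.active.add(job_id)
            return "running"
        self.waiting.append(job_id)
        return "queued"

    def complete(self, job_id):
        """Release a slot and promote the oldest queued job, if any.

        Returns the promoted job id, or None when nothing was waiting.
        """
        self.active.discard(job_id)
        if self.waiting and len(self.active) < self.capacity:
            promoted = self.waiting.popleft()
            self.active.add(promoted)
            return promoted
        return None
```

Submitting thirty jobs to this controller runs twenty-four and queues six, so the answer to “can we submit more than the limit?” becomes “yes, and the extra requests wait their turn” rather than an undefined behavior.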

ROI measurement feeds directly into prioritization. Which automation to build next should be informed by which business outcomes are most constrained by network limitations. The team that tracks which manual requests consume the most engineering time, which services have the highest failure rate during manual provisioning, and which business capabilities are blocked by network throughput has the data to make prioritization arguments that the business can evaluate. The team that does not have that data makes prioritization arguments based on what is technically interesting, and those arguments lose budget cycles to teams that have quantified their impact.

14.6 Prioritization and Roadmap#

Two questions face every network automation team consistently and rarely get answered formally: which tasks to automate next, and when to say no to a request. The product model requires formal answers to both. Four criteria inform those answers:

  • Business impact: which service, when automated, enables the highest-value business capability? The Business-Driven Service Map from section 14.3 is the input to this question. The service that, when automated, unblocks a strategic business initiative ranks above the service that, when automated, saves twelve engineering hours per year.

  • Frequency times effort: how often is this task done manually, and how much engineering time does each occurrence take? A task done daily that takes four hours each time consumes roughly forty times the engineering hours of a task done weekly that takes thirty minutes (twenty hours per five-day week against half an hour), and the return on automating it scales accordingly. High-frequency manual tasks with significant effort per occurrence are the clearest cases for automation.

  • Risk reduction: some tasks are worth automating even if they are infrequent, because the cost of human error is catastrophic. Manual BGP peering changes that, when misconfigured, cause route leaks affecting hundreds of customers, are worth automating even if they happen only six times per year. The automation is justified not by throughput but by eliminating an error mode with unacceptable consequences.

  • Consumer demand: what are other teams actively requesting? Consumer demand is an imperfect signal, because teams often request what they know is possible rather than what would be most valuable. But consistent requests from the same teams for the same capability are meaningful data about where the service interface does not fit actual use cases.
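
The four criteria above can be turned into a repeatable ranking. The sketch below is illustrative only: the field names are hypothetical, and the order in which the criteria are applied (strategic impact, then risk, then manual hours, then demand) is an assumption that a real team would agree with its stakeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacklogCandidate:
    """A manual task being considered for automation (hypothetical fields)."""
    name: str
    occurrences_per_year: int
    hours_per_occurrence: float
    unblocks_strategic_initiative: bool = False
    catastrophic_if_wrong: bool = False
    requesting_teams: int = 0

    @property
    def annual_manual_hours(self):
        # "Frequency times effort" from the list above.
        return self.occurrences_per_year * self.hours_per_occurrence


def rank(candidates):
    """Order candidates from highest to lowest priority.

    Tuples compare element by element, so earlier criteria dominate later ones.
    """
    return sorted(
        candidates,
        key=lambda c: (
            c.unblocks_strategic_initiative,
            c.catastrophic_if_wrong,
            c.annual_manual_hours,
            c.requesting_teams,
        ),
        reverse=True,
    )


# Hypothetical backlog entries for illustration.
CANDIDATES = [
    BacklogCandidate("report export", 12, 1.0),
    BacklogCandidate("segment request", 250, 0.5, requesting_teams=4),
    BacklogCandidate("BGP peering change", 6, 2.0, catastrophic_if_wrong=True),
    BacklogCandidate("store onboarding", 120, 40.0, unblocks_strategic_initiative=True),
]
```

A strict ordering like this is blunt, and a weighted score would be a reasonable alternative. The point is that the criteria are written down, so a request that is declined can be declined with a reason.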

The Automation Backlog pattern treats unautomated tasks the same way a product team treats a feature backlog: prioritized, estimated, and with clear acceptance criteria for what “done” means. Done is not “the automation ran successfully in the lab”. Done is “the automation has passed the Confidence Ladder stages described in Chapter 13, section 13.5.2, is documented, and is available for self-service consumption by the relevant consumer teams”. The backlog is visible to stakeholders, so they can see what is coming and plan accordingly.

Roadmap communication matters more than the roadmap itself. A network automation roadmap shared with dependent teams on a quarterly cadence is a trust signal. It makes the automation team’s work legible to the business. It allows consuming teams to plan their own work around what the platform will and will not be able to do in the next quarter. It creates the feedback opportunity for consuming teams to surface requirements that the network team did not know existed.

Feedback loops from consumers/stakeholders are the most valuable input to roadmap decisions. Which teams are filing the most exceptions? Which automation outputs require manual interpretation before the consumer can act on them? Where does the current service interface force consumers to submit requests in network terms rather than business terms? These are product feedback signals, and capturing them systematically is what separates a roadmap that reflects actual consumer needs from one that reflects what the automation team thinks consumers should want.

The stakeholder meeting cadence is worth naming explicitly. A recurring meeting, quarterly for most platforms and monthly for actively developing platforms, where the roadmap is reviewed, consumer feedback is collected, and upcoming changes are communicated, is not a status meeting. It is the mechanism by which an automation platform behaves like a product that listens to its users. Teams that skip this step build automation that solves last year’s problem while consuming this year’s budget.

The three resistance patterns from Chapter 13, section 13.4.1 appear in product conversations as well. The Frozen Expert resists the product framing because it redefines their expertise as an implementation detail. The Invisible ROI pattern manifests as consumers who stop reporting issues because they assume nothing will change. The Black Box pattern appears as a service that completes successfully but provides no visibility into what happened, leaving consumers unable to trust the output without manual verification. The responses are the same: make the expert the designer of the service contract, make success visible with explicit metrics, and build transparency into the service outputs.

14.6.1 The Product Management Function#

Not every team needs a dedicated Product Manager. Every mature automation program needs someone doing the product management function.

At small team sizes, the network architect or senior NRE can absorb the product management function alongside their technical work. They maintain the backlog, run the stakeholder meetings, and own the business alignment conversations. At this scale, a few hours per week is enough to keep the function alive.

As the platform grows, the translation work between external stakeholders and internal engineering becomes substantial. A platform serving ten consuming teams, with thirty services in production and a quarterly roadmap that requires negotiation across competing priorities, cannot be product-managed in a few hours per week. The function becomes a full-time role.

The Network Automation Product Manager role is emerging in organizations with mature automation programs. Its responsibilities are:

  • Own the stakeholder relationship on behalf of the platform: the primary point of contact for consuming teams, responsible for translating their needs into service requirements
  • Maintain the Business-Driven Service Map and the Automation Backlog, with prioritization that reflects both business impact and engineering reality
  • Communicate the roadmap externally and manage expectations when priorities shift
  • Define and track business impact metrics, making the platform’s contribution to business outcomes visible to leadership
  • Represent consumer feedback in team prioritization discussions, ensuring that engineering priorities reflect actual user needs rather than internal technical preferences

This role does not require deep networking expertise. It requires the ability to understand what the network team can deliver, what the business needs, and where those two things do and do not align. The collaboration model between the Network Automation Product Manager and the technical roles from Chapter 13, section 13.2 follows a familiar pattern from software product organizations: the Product Manager owns what gets built and why, the Network Platform Engineer and Automation Developer own how it gets built, and the NRE owns how it stays healthy in production. Conflicts between these groups are features of the model, not defects: the PM pushes for consumer needs, the engineers push for platform sustainability, and the tension between them produces better decisions than either would reach alone.

The Network Automation Product Manager role is controversial in some network organizations because it introduces a non-technical role into a technically-focused team structure. The concern is valid: a PM without sufficient technical grounding will make commitments the engineering team cannot keep, and will struggle to distinguish between requests that are straightforward to implement and requests that require fundamental platform changes. The solution is not to avoid the role but to make the collaboration boundaries concrete. Two guardrails are non-negotiable: the PM cannot commit a delivery date to an external stakeholder without explicit sign-off from the engineering lead on scope and timeline, and the engineering lead holds veto authority over any commitment that requires platform changes not yet designed or estimated. Without these guardrails, the PM role creates a new failure mode: external stakeholders receive promises the engineering team learns about after the fact, and the credibility damage falls on the platform, not the PM.

Two years after his platform transition, Jordi, a network architect who had led his team’s shift from manual operations to an automation platform (introduced in Chapter 13), was asked to join a meeting with the retail chain integration project. The business team lead walked through the onboarding timeline, pointed at a six-week period where forty stores needed to go live simultaneously, and asked: “Is the platform ready for that?” Jordi had the answer: the platform could handle it technically, but the monitoring coverage for newly activated sites had a twelve-hour delay before full telemetry was ingested. Forty simultaneous activations would mean forty sites running for half a day with reduced observability coverage. He said so directly, named the risk, and proposed a modification to the onboarding sequence that would spread activations across a three-day window without extending the overall timeline. The business team accepted it. The conversation happened in fifteen minutes because Jordi understood both the technical constraint and its business consequence. That translation, between what the platform does and what the business experiences, is the product management function regardless of who holds the title.

14.6.2 Managing the Service Lifecycle#

The PM role description in section 14.6.1 names the responsibilities. This section shows how those responsibilities operate across the four stages that every network service moves through: definition, delivery, operations, and evolution.

flowchart LR
    classDef def fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b
    classDef del fill:#f5e8d9,stroke:#a57a4a,color:#1a2e3b
    classDef ops fill:#d9f5e8,stroke:#3a8a5a,color:#1a2e3b
    classDef evo fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b

    DEF["Definition<br/>Service contract<br/>Business justification"]
    DEL["Delivery<br/>Backlog management<br/>Acceptance criteria"]
    OPS["Operations<br/>SLA monitoring<br/>Consumer communication"]
    EVO["Evolution<br/>Versioning<br/>Deprecation"]

    DEF --> DEL --> OPS --> EVO
    EVO -->|"next version"| DEF

    class DEF def
    class DEL del
    class OPS ops
    class EVO evo

Definition

The PM’s first engagement with a new service is before the engineering team writes a line of automation code. A consuming team arrives with a request: “we need to onboard stores faster,” or “our application team cannot provision network segments without filing a ticket.” The PM’s job at this stage is to translate that request into a bounded service contract using the three-component structure from section 14.2: what does the consumer provide (input surface), what do they observe (output surface), and what internal dependencies does the service carry that consumers should not need to understand? That translation is not a requirements document handed to engineering. It is a negotiation between what the consumer needs and what the platform can deliver, with the PM and the engineering lead both in the room.

The PM owns the question “what does done look like from the consumer’s perspective.” The engineering lead owns “what does done look like from the platform’s perspective.” Both questions must be answered before definition is complete, because a service contract that satisfies the consumer but ignores platform constraints will generate rework during delivery, and one that satisfies the platform but not the consumer will go unused after launch.

Before definition closes, the PM places the new service in the Business-Driven Service Map from section 14.3. This step ensures that the service has an explicit business justification and that its priority relative to other Automation Backlog items can be evaluated on consistent criteria. A service that cannot be placed in the map, because no one can articulate which business capability it enables or what the impact of its absence is, is not ready to be defined. That is a signal to go back to the consumer conversation, not to start engineering.

Delivery

Once the service contract is agreed, the PM manages the corresponding Automation Backlog entry through development. Acceptance criteria are written in consumer terms: “a standard branch activation completes within forty minutes of submission, with a lifecycle event emitted at each stage,” not “the gNMI push to the PE router completes without error.” The distinction matters because acceptance criteria written in implementation terms can pass while the consumer experience fails: the push completed, but the consumer received no notification and cannot tell whether their store is active.
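
An acceptance criterion written in consumer terms can be checked from the consumer's side of the interface. The following sketch assumes a hypothetical set of lifecycle stage names and an event format of `(stage, timestamp)` tuples; neither comes from the platform described here. It passes only when every expected event was emitted and the activation finished within forty minutes of submission.

```python
from datetime import datetime, timedelta

# Lifecycle stages the consumer expects to see; the stage names are hypothetical.
REQUIRED_STAGES = ("submitted", "validated", "deployed", "verified", "active")
MAX_DURATION = timedelta(minutes=40)


def acceptance_failures(events):
    """Check a branch activation against consumer-facing acceptance criteria.

    `events` is a list of (stage, timestamp) tuples as emitted to the consumer.
    Returns a list of failure descriptions; an empty list means the criteria are met.
    """
    failures = []
    seen = {stage: ts for stage, ts in events}
    missing = [s for s in REQUIRED_STAGES if s not in seen]
    if missing:
        failures.append("missing lifecycle events: " + ", ".join(missing))
    if "submitted" in seen and "active" in seen:
        elapsed = seen["active"] - seen["submitted"]
        if elapsed > MAX_DURATION:
            failures.append(f"took {elapsed}, limit is {MAX_DURATION}")
    return failures
```

Note what the check does not look at: the device push. An activation that succeeds on the router but emits only the first and last events fails this test, which is exactly the case the implementation-level criterion would miss.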

During delivery, the PM owns all external communication about timeline and scope. The engineering lead owns the internal technical decisions. This division protects the engineering team from the pattern that derails most automation projects: scope additions that arrive after the service contract is agreed. Every new requirement after contract sign-off is a new Automation Backlog item with its own prioritization, not a modification to the current delivery. The PM’s job is to hold that boundary with consuming teams, because the engineering team cannot do it without damaging the consumer relationship. A delivery scope that can be expanded by any stakeholder at any time is a delivery that never completes.

Operations

When the service goes live, the PM’s role shifts from delivery to trust maintenance. The internal SLA from section 14.4 defines what the service commits to: availability, execution latency, error budget, and escalation path. The PM monitors these metrics not to diagnose failures, which is the NRE’s responsibility, but to own the consumer-facing communication when thresholds are approached or breached. A consuming team that discovers their service has been running outside its SLA by reading a dashboard they were not told to watch has not received an SLA: they have received a metric without a relationship. The PM is the relationship.
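
The phrase “approached or breached” implies two thresholds, which a small function can make explicit. This is a sketch under assumptions: the budget is expressed as allowed failures per thousand runs to keep the arithmetic exact, and the 75 percent warning level is an arbitrary illustrative default, not a recommendation.

```python
def error_budget_status(total_runs, failed_runs, allowed_failures_per_1000,
                        warn_fraction=0.75):
    """Classify error-budget consumption for consumer-facing communication.

    Returns one of "no-data", "healthy", "at-risk", or "breached".
    "at-risk" means the warning share of the budget has been consumed;
    "breached" means more failures occurred than the budget allows.
    """
    if total_runs <= 0:
        return "no-data"
    allowed = total_runs * allowed_failures_per_1000 / 1000
    if failed_runs > allowed:
        return "breached"
    if failed_runs > 0 and failed_runs >= warn_fraction * allowed:
        return "at-risk"
    return "healthy"
```

The NRE would use the same numbers to diagnose; the PM uses the returned status to decide when consuming teams hear about it. An “at-risk” result is the moment to communicate, before the consumer finds out from a dashboard.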

The PM also runs the operational feedback collection process. Which consuming teams are filing the most exception requests? Which service outputs require manual interpretation before the consumer can act on them? Where does the current input surface force consumers to submit requests in network terms rather than business terms? These signals are not complaints: they are product data, and the PM’s job is to aggregate them systematically and bring them to the roadmap conversation with enough specificity that the engineering team can act on them. Feedback that arrives as “consumers are unhappy with branch activation” is not actionable. Feedback that arrives as “seven of the last twelve exception requests were for pop-up sites that do not match any defined tier, and all seven required direct network team engagement” is actionable: it names a gap in the input surface and quantifies its operational cost.
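
Getting from “consumers are unhappy” to a statement of that second kind is mostly a matter of counting. The sketch below assumes each exception request is recorded as a dict with two hypothetical keys, `cause` and `needed_network_team`; a real ticketing system would have its own schema.

```python
from collections import Counter


def summarize_exceptions(requests):
    """Group exception requests by cause and report the dominant gap.

    Each request is a dict with hypothetical keys 'cause' (str) and
    'needed_network_team' (bool).
    Returns (cause, count, total, escalated) for the most common cause,
    or None when there are no requests.
    """
    if not requests:
        return None
    by_cause = Counter(r["cause"] for r in requests)
    cause, count = by_cause.most_common(1)[0]
    escalated = sum(
        1 for r in requests if r["cause"] == cause and r["needed_network_team"]
    )
    return cause, count, len(requests), escalated
```

The four returned values map directly onto the actionable sentence above: which gap, how many of how many requests, and how many of those required direct network team engagement.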

Evolution

Services that stop evolving accumulate mismatches between what they were designed to handle and what consumers actually need. The PM owns the decision of when and how a service evolves, informed by the operational feedback collected in the previous stage and the changing entries in the Business-Driven Service Map.

Not all evolution carries the same operational consequence. A change that adds a new optional input field or emits a richer output event is additive: existing consumers are unaffected, and adoption of the new capability is opt-in. A change that renames a required field, removes an output event, or alters an SLA commitment breaks the existing contract for every consumer relying on it. The PM must distinguish between these two categories before any evolution begins, because they require different processes: additive changes can be delivered as minor updates, while breaking changes require a version increment, a migration path, and a deprecation timeline communicated to consuming teams before the change ships.
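
The distinction between additive and breaking changes can be checked mechanically for the parts of the contract that are structured. The sketch below assumes a hypothetical contract shape, a dict holding three sets of names, and treats a newly required input as breaking, which is an assumption beyond the examples above. It does not cover SLA commitments, which need human review.

```python
def classify_change(old, new):
    """Classify a contract change as 'none', 'additive', or 'breaking'.

    A contract is a dict with three sets of names (hypothetical shape):
    'required_inputs', 'optional_inputs', and 'output_events'.
    """
    old_inputs = old["required_inputs"] | old["optional_inputs"]
    new_inputs = new["required_inputs"] | new["optional_inputs"]
    removed_inputs = old_inputs - new_inputs
    # A field existing consumers do not send yet would make their requests fail.
    newly_required = new["required_inputs"] - old["required_inputs"]
    removed_events = old["output_events"] - new["output_events"]
    if removed_inputs or newly_required or removed_events:
        return "breaking"
    if new == old:
        return "none"
    return "additive"


# Hypothetical contract versions for illustration.
V1 = {
    "required_inputs": {"site_id", "tier"},
    "optional_inputs": set(),
    "output_events": {"submitted", "active"},
}
V1_1 = {
    "required_inputs": {"site_id", "tier"},
    "optional_inputs": {"contact_email"},
    "output_events": {"submitted", "validated", "active"},
}
V2 = {
    "required_inputs": {"site_id", "site_tier"},  # 'tier' renamed
    "optional_inputs": {"contact_email"},
    "output_events": {"submitted", "validated", "active"},
}
```

A rename shows up as one removed field plus one newly required field, so it is classified as breaking without any special handling. Running a check like this before evolution begins gives the PM the category early, when the choice between a minor update and a version increment is still cheap to make.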

The deprecation timeline is where many teams fail. Engineering teams naturally want to remove the old version as soon as the new one is stable. Consuming teams naturally want to stay on the version they depend on until they have capacity to migrate. The PM negotiates the window between those two positions and commits to both sides: a specific date after which the old version is unsupported, communicated early enough that consuming teams can plan their migration. Services evolved without this process erode the trust that the service contract was designed to build. The consuming team that discovers a breaking change in production has not experienced a technical failure: they have experienced a relationship failure, and that is harder to recover from.

14.7 Summary#

Four themes anchor this chapter:

Services, not scripts: the product shift is from building automation that runs to providing services that others depend on. The Network Service as Product pattern names this reframe: the service is the primary artifact, the underlying network is the implementation detail. The Service Contract Pattern is the artifact that makes this concrete: an explicit, versioned definition of input surface, output surface, and inner dependencies that bounds the interface to what consumers need to provide and what they can observe, without exposing network implementation details they should not need to understand.

Business alignment is structural: measuring business impact requires designing automation from business outcomes downward, not from device behavior upward. The Business-Driven Service Map is the instrument: an explicit mapping of network services to the business capabilities they enable and the business impact of degradation or unavailability. Teams that can make this mapping fluently are the ones other business units call when they are planning something new, because those teams have already answered the question “what does the network make possible.” Teams that cannot are the ones who hear about new initiatives after the timeline has been set, and are left explaining why the network cannot be ready in time. The Automation Backlog applies the same logic to prioritization: which automation to build next should be determined by which business outcomes are most constrained by network limitations, not by which automation is most technically interesting.

SLAs and support models before they are needed: defining reliability commitments, escalation paths, and ownership of failures before the first major incident is what separates a platform from a collection of scripts. The internal SLA, with its four components of availability, execution latency, error budget, and escalation path, is the instrument that makes trust explicit. The taxonomy of three failure modes (automation bug, platform failure, network failure) is the triage framework that prevents incidents from being dropped between teams. The Pre-Approved Automation Pattern extends this: once a workflow has been designed, tested, and approved with its safety constraints active, individual executions are instances of an approved operation, not new changes requiring re-approval. Both the SLA model and the governance model must be established before the first incident, not after.

The product management function across the service lifecycle: every network service moves through four stages (definition, delivery, operations, and evolution), and the product management function owns the continuity across all four. At definition, the PM translates consumer requests into a bounded service contract before engineering starts. At delivery, the PM holds the scope boundary and owns external communication. At operations, the PM owns consumer-facing SLA communication and aggregates feedback into actionable product data. At evolution, the PM distinguishes additive from breaking changes, owns the versioning decision, and negotiates the deprecation timeline with both engineering and consuming teams. Without this function, services are defined ad hoc, delivered without acceptance criteria, operated without a consumer relationship, and evolved without notice. With it, the platform behaves like a product that listens to its users and earns their trust over time.

The product discipline described in this chapter is a precondition for what Part 5 describes. Closed-loop automation makes real-time remediation decisions. Self-healing networks respond to failure without human intervention. Autonomous networks make routing and provisioning decisions on behalf of their consumers. Each of these represents the platform taking action that a consumer depends on without direct human authorization for that specific action. Without defined service boundaries, observable state, reliability commitments, and clear escalation when thresholds are exceeded, autonomous operation is not a product. It is an unpredictable system acting without a contract. The work in this chapter is what makes autonomous operation something that can be trusted, which is the foundation Part 5 builds on.

References and Further Reading#

  • Continuous Delivery, Jez Humble, Dave Farley (Addison-Wesley, 2010). The foundational text on software delivery lifecycle: how to build deployment pipelines that make release a reliable, low-risk operation. The service lifecycle patterns in this chapter, from development through production operation, draw on these principles applied to network automation.

  • The Art of SLOs, Google SRE Workbook (available at sre.google). The practical guide to defining service level objectives, error budgets, and the relationship between reliability commitments and feature velocity. The internal SLA model in section 14.4 applies these principles to automation platforms serving internal consumers.

  • Empowered, Marty Cagan (Wiley, 2020). A product management framework for technical organizations: how to organize teams around outcomes rather than features, how to define what “done” means for a product team, and how to maintain strategic alignment between engineering work and business priorities. The Network Automation Product Manager role described in section 14.6.1 draws on this framework.

  • Team Topologies, Matthew Skelton, Manuel Pais (IT Revolution Press, 2019). Referenced in Chapter 13, it remains relevant here: the platform team model, where an automation platform serves stream-aligned teams as internal consumers, is the organizational structure that the product model in this chapter is designed to support.

  • Accelerate: The Science of Lean Software and DevOps, Nicole Forsgren, Jez Humble, Gene Kim (IT Revolution Press, 2018). Empirical research on what separates high-performing technology organizations from low-performing ones. Relevant to section 14.4: the data shows that change approval boards are not correlated with better stability outcomes but are strongly correlated with slower delivery. That finding is the evidence base behind the Pre-Approved Automation Pattern and the argument that change governance belongs at workflow design time, not at each execution.
