13. The Cultural Shift
The meeting invite arrived on a Thursday, titled “Network Team Structure Update”. Jordi had been a network engineer for fifteen years. He had his CCIE. He had survived three acquisitions, two NOC consolidations, and one BGP routing incident so severe it became an internal case study. He assumed this was a staffing update.
It was not.
His manager explained that the network team was being reorganized under Platform Engineering. The team’s name would change to Network Automation Platform. The work would evolve: less manual provisioning, more building and operating the automation systems that handled provisioning. The new job description was already written. It was titled “Network Platform Engineer”.
Jordi went home that evening and searched the title. Most results pointed to cloud job postings and container orchestration infrastructure. He read for an hour and understood maybe half of it. He had never written a Python script. He did not know what Git really was. He was not sure whether to be excited or terrified.
By the end of the year, Jordi had shipped two automation workflows to production, reviewed hundreds of lines of code from software engineers who had never configured a router, and written the team’s first architecture decision record. He was not a software developer. He was something harder to categorize and more valuable: an engineer who understood both the network and the systems that operated it.
This chapter is about that transformation. Not just Jordi’s, but the organization’s. The technology covered in Chapters 1 through 12 does not operate itself. It requires people with different skills, different roles, and different ways of collaborating than network engineering traditionally developed. Getting the technology right is hard. Getting the organizational transformation right is harder, and more commonly where automation initiatives fail.
13.1 The Identity Crisis
The hardest part of the cultural shift is not learning Git. It is letting go of the identity built around years of deep Command Line Interface (CLI) mastery.
Network engineering built its professional culture around expertise that was, until recently, difficult to acquire and impossible to fake. Understanding BGP convergence under failure conditions. Debugging spanning tree topology changes at 2 AM (and yes, spanning tree is still very much in production). Reading a packet capture and identifying a subtle MTU mismatch before the application team had finished filing the ticket. These are hard skills, developed over years of practice, and they built strong professional identities.
Automation disrupts the surface of that expertise. When a well-designed automation platform handles VLAN provisioning, configuration hardening, and zero-touch device onboarding, the engineer who spent years mastering those tasks through manual CLI work faces a choice that feels like a contradiction: use automation to replace the thing you are good at, or resist automation to protect the skill that defines you.
This is the Craftsman’s Dilemma: the more expert you are at a craft, the more any abstraction that hides the craft feels like a threat rather than a tool. The network engineer who knows every vendor-specific CLI nuance resists the abstraction layer not from irrationality, but from a completely rational assessment: they are being asked to trade expertise for unfamiliarity.
The reframe that breaks the dilemma is this: automation does not replace deep networking knowledge. It requires it. The difference is where and how that knowledge is applied. The engineer who deeply understands BGP convergence is not less valuable in an automated world. They are the only person qualified to design a reliable closed-loop BGP automation. The skill moves from execution to design, from running commands to defining what commands should be run and under what conditions.
This shift, from “I configure networks” to “I build systems that configure networks”, is the cultural transformation at the center of this chapter. It is not a skill gap. It is a framing gap.
flowchart LR
classDef legacy fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b
classDef emerging fill:#d9f5e8,stroke:#3a8a5a,color:#1a2e3b
A["**Network Engineer** - Executes configuration - Deep device expertise"]
B["**Network Platform Engineer** - Designs automation systems - Expertise embedded in code"]
A -->|"framing shift"| B
class A legacy
class B emerging
The engineer who makes this transition successfully is almost never the one who decided to become a software developer. They are the one who was curious about why the automation kept failing on a specific vendor edge case, and stayed with that question long enough to understand the system beneath the symptom. Curiosity about how things work is the mindset that makes the transition possible: not just how to configure them, but how the configuration travels from intent to device, how a failure propagates, and how the system recovers. It is also the mindset that distinguished the best network engineers before automation: the ones who wanted to understand the protocol, not just apply the recipe.
The payoff of that curiosity is impact at a scale no individual engineer can reach through manual work. A single automation workflow running correctly across eight hundred switches does more work in minutes than a team could complete in a week of manual change windows. The engineer who built that workflow has not been replaced by it. They have become a force multiplier. Their expertise is now embedded in a system that runs while they sleep, handles routine work consistently, and frees engineering capacity for the problems that still require judgment. That is a more powerful position than any amount of manual CLI mastery could produce.
Chapter 1 listed fear of change and wrong skills among the barriers to automation. Those barriers do not disappear because a manager reorganizes the team. They require deliberate attention: how roles are defined, how skills are developed, and how the organization signals that deep network expertise is more valuable in the new model, not less.
The identity crisis often starts exactly there: a new title that the engineer cannot yet define, for a role that the organization is still figuring out how to support.
13.2 New Roles in an Automated World
The most counterproductive question a team can ask when starting automation transformation is “which jobs are going away?” The more useful question is “which roles are being created, and what do they require?”
Most roles evolve rather than disappear. What changes is where the value concentrates. The operator who spent eight hours a day processing manual change tickets is not eliminated. The eight hours of ticket processing are eliminated, and the freed capacity either moves toward higher-value work or, if the team does not actively shape that transition, toward organizational friction.
The following roles emerge consistently in teams that have made the transition successfully.
Network Platform Engineer: This role owns the automation platform as an engineering artifact. The platform includes the Source of Truth (SoT) schema design, the execution pipeline configuration, the CI/CD workflows that validate and deploy automation changes, the network observability platform, and the interface contracts between automation blocks. The Network Platform Engineer is the counterpart to the software platform engineer who runs container clusters or manages internal developer tools: they build and maintain the system that other engineers use to operate the network. This role connects directly to the architectural work in Chapters 4 through 8 and the platform engineering patterns in Chapter 10.
Network Reliability Engineer (NRE): The Site Reliability Engineering model, developed for large-scale software services, applies to network automation with meaningful modifications. The NRE role adapts SRE principles to network operations: defining service level objectives for automation pipelines, building incident response processes for automation failures, and maintaining error budgets that balance feature velocity with operational stability. When a closed-loop automation misfires at 3 AM, the NRE is the person whose job it is to have already designed the runbook for that failure mode.
Network Architect: This role shifts further from devices and closer to intent. The Network Architect defines what the network should be: the intent models in the Source of Truth, the design patterns that automation will enforce, the policies that govern topology and addressing. They spend less time in device CLI and more time in schema design, cross-team architectural review, and evaluating how design decisions constrain or enable automation. Chapter 4 describes the intent layer that this role primarily owns.
Network Data Engineer: Closed-loop automation, self-healing networks, and autonomous operation (covered in Chapters 15 through 17) depend on high-quality operational data. The Network Data Engineer builds and curates the data pipelines that feed Observability systems, defines the schemas that make telemetry actionable, and owns the quality of the data that automation decisions are based on. This role connects the Observability building block from Chapter 6 to the closed-loop patterns in Part 5.
Network Automation Developer: This role writes the automation code itself: the integration logic, workflow orchestration, validation frameworks, and tooling that the platform depends on. The Network Automation Developer may be a software engineer who was embedded in the network team and learned enough networking to be productive, or a network engineer who learned enough software development to be effective. Both paths work. The important distinction from the Network Platform Engineer is scope: the Automation Developer delivers specific automation capabilities; the Platform Engineer owns the system those capabilities run on.
The traditional operator role contracts but does not disappear. What disappears is the repetitive execution: the manual provisioning, the ticket-by-ticket CLI sessions, the human-as-script-runner workflows. The underlying judgment (knowing when something is wrong, catching what automation missed, making the call in an ambiguous incident) remains valuable and moves into quality assurance and escalation ownership rather than routine execution.
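The error budget mentioned in the Network Reliability Engineer role reduces to simple arithmetic. The following is a minimal sketch, assuming the SLO is expressed as a target success rate over a window of workflow runs. The class name, field names, and numbers are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one automation pipeline over one SLO window."""

    slo_target: float  # e.g. 0.995 means 99.5% of runs must succeed
    total_runs: int    # workflow executions observed in the window
    failed_runs: int   # executions that failed or were rolled back

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of runs the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_runs

    @property
    def remaining(self) -> float:
        return self.allowed_failures - self.failed_runs

    @property
    def exhausted(self) -> bool:
        # An exhausted budget is the usual signal to pause new features
        # and spend the time on stability instead.
        return self.remaining < 0


budget = ErrorBudget(slo_target=0.995, total_runs=2000, failed_runs=7)
print(f"allowed={budget.allowed_failures:.1f} remaining={budget.remaining:.1f}")
print("freeze feature work" if budget.exhausted else "budget available")
```

The value of the calculation is less the number itself than the conversation it forces: the team has to agree in advance on what counts as a failed run and what happens when the budget runs out.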
The diagram below is not a promotion chart. It is a map of where the value moves: from repetitive execution toward design, reliability, data, and platform ownership. Most engineers will not follow a prescribed arrow. They will follow the one that matches what they are already curious about.
flowchart LR
classDef legacy fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b
classDef emerging fill:#d9f5e8,stroke:#3a8a5a,color:#1a2e3b
subgraph Legacy["Legacy Role Profiles"]
direction TB
L1[**Network Operator** - Manual provisioning - Device CLI mastery]
L2[**Network Engineer** - Design and troubleshoot - Vendor expertise]
end
subgraph Emerging["Emerging Role Profiles"]
direction TB
E1[**Network Platform Engineer** - Automation platform - CI/CD and SoT ownership]
E2[**Network Reliability Engineer** - SLOs and incident response - Error budgets]
E3[**Network Architect** - Intent models - Design governance]
E4[**Network Data Engineer** - Telemetry pipelines - Observability quality]
E5[**Network Automation Developer** - Workflow and integration code - Validation frameworks]
end
L1 -->|evolves to| E1
L1 -->|evolves to| E5
L2 -->|evolves to| E2
L2 -->|evolves to| E3
L2 -->|evolves to| E4
class L1,L2 legacy
class E1,E2,E3,E4,E5 emerging
13.3 The Skill Transformation Path
Two failure modes appear in every organization attempting this transformation. The first is expecting every network engineer to become a software developer. The second is expecting software engineers embedded in network teams to own network automation without developing deep protocol knowledge. Both fail for the same reason: they assume that one domain’s skills transfer through intention rather than practice.
13.3.1 The T-Shaped Engineer
The concept of the T-shaped engineer, popularized by Tim Brown of IDEO and adopted broadly in platform and DevOps organizations, names the working model that succeeds. Deep vertical expertise in one domain, broad horizontal literacy in the other. Not symmetry. Not a full second specialization. Productive asymmetry.
A network engineer on the path to Network Platform Engineer needs enough software development literacy to read, debug, and modify code (e.g., Python) and structured data files (e.g., YAML), understand Version Control System (VCS) workflows, reason about CI/CD pipeline failures, and collaborate on schema design decisions. They do not need to design data structures from scratch or optimize algorithm complexity. They need operational literacy in software practices, not a second career in software engineering.
A software engineer moving into network automation needs enough networking depth to understand why BGP convergence takes the time it does, how spanning tree topology changes propagate, what “eventually consistent” means in the context of a distributed routing table, and why automation that works correctly in the lab can fail in production due to vendor implementation differences. They do not need a CCIE. They need enough foundation to have informed conversations with network engineers and to recognize when their assumptions are wrong.
The T-shape is ultimately about creating well-defined interfaces between domains: each person understands enough of the other’s language to communicate clearly, debug together, and make shared decisions without requiring a translator for every conversation.
The T-shape is not a fixed target. It evolves as the role does. The important thing is identifying the axis of depth and the axis of breadth for each person, and building learning paths that respect both.
When Jordi submitted his first automation pull request, the software engineer reviewing it left eleven comments: variable naming, missing error handling, a test that only covered the happy path. His first instinct was defensiveness. He had been configuring networks correctly for fifteen years. His second was to close the browser tab. His third was to read the comments again, slowly, as if they were interface errors from a device he had not worked with before. By the third pull request, he was leaving questions in the comments instead of explanations. The reviewer became a collaborator. The shift from expert to beginner and back to expert in a new domain is not comfortable. But it is the shape of the transition.
13.3.2 The Fundamentals Debate
The question comes up in every team undergoing automation transformation: is deep protocol knowledge still relevant? Is CCIE certification (or equivalent) still worth pursuing? Yes, but differently.
The fundamentals are not less important in an automated world. They are differently applied. An engineer who does not understand BGP convergence behavior cannot design a reliable closed-loop BGP automation. An engineer who does not understand spanning tree cannot build a simulation environment that faithfully replicates production topology failure modes. The automation layer sits on top of the network. Its correctness depends on the quality of its model of how the network behaves, and that model is only as good as its builders’ understanding of the underlying protocols.
What changes is the learning path. The traditional certification track (study the theory, pass the exam, apply in production) is not wrong, but it is no longer the only credible path. The problem-first path has become at least as effective: start with a real operational problem, build automation to solve it, and learn the specific protocol behavior or design principle that the problem requires. Certifications validate existing knowledge. They are more valuable taken after practice than before.
The CCIE remains a strong signal of depth. In an automated world, it signals the kind of depth that automation depends on. What it no longer signals on its own is operational currency: knowing how to configure devices by hand is necessary but no longer sufficient.
Automation-specific certifications have emerged alongside the traditional networking tracks, validating the horizontal axis of the T-shape: version control hygiene, API design, CI/CD pipeline fundamentals. They are useful complements to protocol depth, not replacements for it.
13.3.3 The AI Effect on Skill Requirements
Artificial Intelligence (AI) coding assistants have changed the entry point for software development in network automation significantly. An engineer who cannot write a Python class from memory can now prompt their way to working code for routine automation tasks: parsing device output, generating configuration templates, writing basic integration scripts. This lowers the floor for getting started. It does not lower the ceiling for getting it right.
What AI assistance does not replace: the judgment to know when automation should not run, the ability to diagnose an automation failure that manifests in unexpected ways, and the architectural reasoning that connects a specific automation workflow to the broader platform design across Chapters 4 through 12. An AI assistant will generate code that passes the tests you thought to write. It will not tell you which tests you forgot.
The relevant pattern here is the Supervised Colleague: treat AI-generated automation code the way you treat code from a competent but junior engineer. Review it. Test it. Understand it well enough to debug it. Own it. The moment automation enters a production pipeline, it is yours, regardless of whether you typed every line.
The AI tooling landscape is evolving faster than any book can track. Specific capabilities described here may be outdated by the time you read this. The underlying patterns, treating AI-generated code as requiring the same review and ownership as any other contribution, and understanding that AI assistance does not replace engineering judgment, are durable regardless of which tools exist at any given moment.
How much coding knowledge is actually needed: enough to read and understand code written by others, debug common failure modes, modify existing automation for new requirements, and make informed decisions in code review. Operational literacy in code, not professional software developer level. The bar is higher than “can use a CLI tool”. It is lower than “can design a distributed system from scratch”.
13.3.4 Skill Development Paths
Two concrete development paths appear in teams making the transition.
For network engineers moving toward platform and automation roles: the practical starting sequence is Python fundamentals focused on reading and debugging before writing, VCS workflows and hygiene, understanding CI/CD pipelines well enough to diagnose failures, and YAML and JavaScript Object Notation (JSON) schema design. The emphasis should be on reading and debugging existing code before generating new code. The first meaningful milestone is not “wrote a script”. It is “debugged someone else’s automation failure and understood why it happened”. Six months into his transition, Jordi encountered exactly that: a forty-line Python traceback in a production workflow he had not written. His first instinct was to forward it to the software engineer on the team. Instead he started reading from the top, line by line, the way he used to read a routing table. The network-specific assumption that caused the failure was on line 23: a hardcoded expectation about BGP session state that made perfect sense to anyone who had configured BGP by hand and had never occurred to anyone as something that needed a test. He fixed it himself. The traceback had stopped being noise. It had become a map.
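The kind of failure Jordi traced can be sketched in a few lines. This is an illustration only, not the code from his workflow: both function names are hypothetical, and the point is how an assumption that feels obvious at the CLI becomes an untested branch in code. The state names are the six BGP finite-state-machine states from RFC 4271.

```python
# BGP finite-state-machine states as defined in RFC 4271.
BGP_STATES = ("Idle", "Connect", "Active", "OpenSent", "OpenConfirm", "Established")
_CANONICAL = {state.lower(): state for state in BGP_STATES}


def is_session_up_fragile(state: str) -> bool:
    # Hidden assumption: a session is only ever "Established" or "Idle".
    # Any other state raises KeyError instead of returning False.
    return {"Established": True, "Idle": False}[state]


def is_session_up(state: str) -> bool:
    """Return True only for an established session; reject unknown input."""
    canonical = _CANONICAL.get(state.strip().lower())
    if canonical is None:
        raise ValueError(f"unrecognized BGP state: {state!r}")
    return canonical == "Established"


# The happy-path test passes for both versions.
assert is_session_up_fragile("Established") is True
assert is_session_up("Established") is True

# The test nobody thought to write: a session stuck in Active.
for reported in ("Active", "OpenSent", " established "):
    print(f"{reported!r} -> up={is_session_up(reported)}")
```

An engineer who has watched a session flap between Active and Connect knows immediately which inputs are missing from the fragile version. That knowledge is what turns into the missing test cases.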
For software engineers moving into network automation: the practical starting sequence is IP fundamentals, one routing protocol studied deeply enough to reason about failure modes, the operational trust model that makes network engineers cautious about changes (a single wrong command can take down a production network in ways that a bug in a web application usually cannot), and shadow experience alongside network engineers during production incidents. The first meaningful milestone is not “understands networking theory”. It is “knows what a network engineer is afraid of and why”.
flowchart LR
classDef net fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b
classDef sw fill:#d9f5e8,stroke:#3a8a5a,color:#1a2e3b
classDef shared fill:#f5e8d9,stroke:#a57a4a,color:#1a2e3b
subgraph Network["Network Engineering"]
N1["Protocol expertise - Device behavior - Fault diagnosis"]
end
subgraph Overlap["Shared Zone"]
O1["Automation design - SoT schema - Simulation fidelity - Change trust model"]
end
subgraph Software["Software Engineering"]
S1["Code structure - Testing discipline - CI/CD pipelines"]
end
Network --- Overlap --- Software
class N1 net
class O1 shared
class S1 sw
The Rotation Model is a pattern for accelerating this process: structured cross-team rotations that put both groups in situations where they must learn by doing alongside the other. A network engineer spending two weeks embedded with the software team, a software engineer spending two weeks on-call with the network team. Not for cross-training in the formal sense, but for building the shared vocabulary and mutual respect that makes collaboration sustainable.
Two years into his transition, Jordi spent three months embedded with the cloud infrastructure team as part of a planned rotation his manager proposed as a development experiment. He accepted it skeptically. The cloud team did not think about devices or protocols. They thought about what applications assumed the network would provide. Those assumptions were never written down and were frequently wrong. The automation Jordi built after he returned was the first version the team had produced that modeled network intent from the application layer downward rather than from the device layer upward. It was also the most reliable automation they had shipped to date. Neither Jordi nor his manager had anticipated that. The rotation had not cross-trained him in cloud engineering. It had given him a new frame for the thing he already understood deeply, and that reframing is what architectural thinking actually looks like in practice.
Three months after returning, Jordi ran an informal session for his team on what he had observed about how application teams modeled network assumptions. Six engineers attended. Two of them changed the design of their next automation workflows based on what he shared. The rotation had not just transformed Jordi: it had, through him, changed how other engineers thought about the problem. One person’s learning, when it is shared deliberately, multiplies.
For engineering managers and team leads: the skill transformation described in this section does not happen through documentation and good intentions. It requires time allocation (learning cannot happen only in the margins of a full operational load), psychological safety (engineers need to make mistakes in low-stakes environments, which requires lab access and deliberate non-production experimentation time), and visible signals from leadership that deep network knowledge is valued in the new model, not treated as a liability.
The teams that fail at this transformation are almost never the ones that lack tools or plans. They are the ones where engineers feel that learning new skills is something they are supposed to do after they finish their “real work”.
13.3.5 The Practical Toolkit
The tool sequence below is ordered by priority for a network engineer starting the transition. The goal at each stage is not mastery before moving on. It is enough fluency to be useful and to recognize when something is wrong.
flowchart BT
classDef foundation fill:#4a7fa5,stroke:#2d5f80,color:#fff
classDef automation fill:#3a8a5a,stroke:#2d6b45,color:#fff
classDef platform fill:#7a5a8a,stroke:#5d4570,color:#fff
F["**Foundation Layer** - Git · Python · YAML"]
A["**Automation Layer** - Ansible · Jinja2 · Netmiko/NAPALM · Nornir"]
P["**Platform Layer** - SoT · Testing · Docker · Kubernetes · CI/CD"]
F --> A --> P
class F foundation
class A automation
class P platform
Foundation layer (start here):
- Git: The non-negotiable starting point. Commit, branch, pull request, merge conflict resolution. Everything else in the automation platform depends on version control hygiene. Learn it before any automation tool.
- Python: Focus on reading and debugging existing code before writing new code. The first milestone is reading a traceback and understanding what it is telling you, not writing a class from scratch.
- YAML: The configuration language of most network automation tooling. Understanding structure, indentation, and data types is required to work with Ansible, NetBox, and most CI/CD pipelines.
Python is the dominant language in network automation today, but it is not the only valid choice. Go is increasingly used for performance-sensitive tooling and platform components, and the scripting concepts transfer across languages. The investment in learning Python fundamentals is an investment in programming literacy, not Python loyalty. The mental models carry forward regardless of which language a given platform uses.
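Reading a traceback is a skill that can be practiced deliberately. The sketch below uses only the standard library `traceback` module; the parser and its input records are hypothetical. It provokes a failure and then extracts the two things worth reading first: the exception on the last line, and the innermost frame where it was raised.

```python
import traceback


def parse_interface(record: dict) -> str:
    # Hypothetical parser: assumes every record carries an "mtu" key.
    return f"{record['name']} mtu {record['mtu']}"


def build_report(records: list) -> list:
    report = []
    for record in records:
        report.append(parse_interface(record))
    return report


try:
    build_report([{"name": "eth0", "mtu": 1500}, {"name": "eth1"}])
except KeyError as exc:
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1]  # the last frame is where the error was raised
    call_chain = " -> ".join(frame.name for frame in frames)
    print(f"{type(exc).__name__}: missing key {exc}")
    print(f"raised in {innermost.name}() at line {innermost.lineno}")
    print(f"call chain: {call_chain}")
```

Read this way, the traceback answers three questions in order: what went wrong, where it went wrong, and how execution got there. The network-specific question (why a record arrived without an MTU) is the part that still needs an engineer.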
Automation layer:
- Ansible: The most widely deployed execution tool in network environments. Playbooks, inventory, roles, and variable precedence. Covers the majority of provisioning and configuration automation use cases.
- Jinja2: The templating engine used with Ansible and most configuration generation workflows. Understanding how templates render against variable data is essential for configuration management at scale.
- Netmiko or NAPALM: Netmiko handles SSH/CLI automation for legacy devices. NAPALM provides a multi-vendor abstraction layer for devices that support structured APIs. One or both will appear in most existing network automation codebases.
- Nornir: A Python-native automation framework that handles connection management, task execution, and parallelism across large device inventories. Where Ansible abstracts away the Python, Nornir exposes it, making it the better fit for complex workflows where full programmatic control is required.
This list is a recommendation to learn, not a recommendation to use. Many of these tools have limitations that become visible at scale, and a well-designed platform will abstract them behind cleaner interfaces. Understanding how they work, including their failure modes and edge cases, is what allows an engineer to make informed decisions about when to use them, when to replace them, and what to watch for when they misbehave in production.
Platform layer:
- Source of Truth: Working knowledge of at least one common Source of Truth implementation, such as NetBox. Understanding schema design, API interaction, and how automation reads from and writes back to the SoT connects the execution tools to the architectural model from Chapter 4.
- Testing: Writing tests for automation code. The first meaningful use is not writing new tests but understanding what existing tests cover and what they miss.
- Docker basics: Enough to run a local development environment, understand what a container image is, and read a Compose file. Not container orchestration, just enough to stop being blocked by environment setup.
- Kubernetes basics: Automation platform components (the orchestrator, the API gateway, the observability stack) increasingly run as containerized workloads on Kubernetes. An engineer does not need to operate a cluster, but reading a deployment manifest, understanding pod restarts, and knowing how to inspect logs from a running container are skills that appear regularly when debugging platform issues.
- CI/CD pipeline: Reading and understanding pipeline definitions, diagnosing pipeline failures, and knowing where in the pipeline a failure occurred. Writing pipelines from scratch comes later.
The sequence matters more than the specific tools. An engineer who knows Git well and can debug Python is more useful to a network automation team than one who has installed every tool but understands none of them deeply. The depth comes from use, not from installation.
AI coding assistants do not make this toolkit optional. An engineer who cannot read the code they prompted into existence cannot debug it when it fails in production, cannot review it when a colleague submits it, and cannot explain to a stakeholder why the automation did what it did. The foundations above are what make AI assistance safe to use, not unnecessary.
13.4 Promoting Adoption
Getting a team to start using automation is a different problem from getting a team to build and maintain automation as an organizational capability. Both require attention, and they require different things. This section addresses the first: how to get a team off the ground and past the predictable obstacles that stall most automation programs before they reach self-sustaining scale.
The Adoption Ladder pattern names the three stages: pilot, scale, sustain. Each rung has a different success criterion.
Pilot is proving that automation works for at least one use case reliably enough that the team trusts it. The success criterion is not coverage or code volume. It is trust. Does the team actually use this automation instead of the manual process? If the answer is “mostly, but we still do it manually for important changes,” the pilot has not succeeded.
Scale is extending automation to more use cases and more engineers. The success criterion is self-service: can engineers who did not build the automation use it without asking the original authors? A platform that only its builders can operate has not scaled. It has just moved the manual dependency from device CLI to automation tool.
Sustain is automation that outlives its original authors. The success criterion is onboarding: can a new engineer understand, modify, and extend existing automation without requiring the original team to explain it? If the answer is no, the automation is a key-person dependency with better tooling.
13.4.1 Change Resistance Patterns
Three resistance patterns appear with enough consistency to name:
The Frozen Expert pattern: the engineer with the most seniority, built on deep expertise in the manual way of working, becomes the loudest critic of automation. This is not irrational. Their status, their career trajectory, and their professional identity are more threatened by the change than any junior engineer’s. The response that works is making them the author of the automation, not its audience. The Frozen Expert who designs the first automation workflow is invested in its success. The Frozen Expert who is handed a finished platform and told to use it is motivated to find its flaws.
The Invisible ROI pattern: automation that works generates no tickets and no visible activity. Only failures are visible. A team that automated VLAN provisioning successfully can go months without any signal that the automation is delivering value, because the signal would be hundreds of tickets that were never filed. Countering this requires deliberate instrumentation: tracking provisioning time before and after, counting avoided incidents, measuring Mean Time To Resolution (MTTR) improvement, and making those numbers visible to stakeholders who otherwise only see automation when it fails.
The Black Box pattern: automation used only by the engineers who built it, not because others lack access, but because others do not trust what they cannot understand. The automation produces correct results but provides no insight into what it is doing or why. Other engineers bypass it because the manual process, however slow, is at least legible. The response is building transparency into the automation itself: audit logs, step-by-step execution traces, clear error messages, and Dry Run modes that show what would happen before anything does. Trust follows understanding. An automation system that cannot explain its own actions will not earn adoption beyond its authors. Chapter 2, section 2.1 established the foundation for building trust in automation: being understandable, predictable, and usable are not features layered on top of a working system. They are what make a system trustworthy enough to use.
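A Dry Run mode with an audit trail does not need much machinery. The following is a minimal sketch under stated assumptions: the `VlanChange` and `Executor` classes, the device name, and the command strings are all hypothetical, and the actual device transport is deliberately left unimplemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VlanChange:
    """A hypothetical change request: add one VLAN to one switch."""

    device: str
    vlan_id: int
    name: str

    def commands(self) -> list:
        return [f"vlan {self.vlan_id}", f"name {self.name}"]


@dataclass
class Executor:
    dry_run: bool = True  # safe default: show the change, do not apply it
    audit_log: list = field(default_factory=list)

    def apply(self, change: VlanChange) -> list:
        # Validate first, so a dry run reports the same errors a real run would.
        if not 1 <= change.vlan_id <= 4094:
            raise ValueError(f"VLAN {change.vlan_id} is outside the range 1-4094")
        mode = "DRY-RUN" if self.dry_run else "APPLY"
        for line in change.commands():
            self.audit_log.append(f"[{mode}] {change.device}: {line}")
        if not self.dry_run:
            self._push(change)
        return self.audit_log

    def _push(self, change: VlanChange) -> None:
        # Placeholder: a real platform would call its device driver here.
        raise NotImplementedError("device transport is out of scope for this sketch")


executor = Executor(dry_run=True)
for entry in executor.apply(VlanChange("sw-access-01", 120, "guest-wifi")):
    print(entry)
```

Two design choices carry the trust argument: dry run is the default, so the unsafe path has to be requested explicitly, and the same validation and logging run in both modes, so what the engineer previews is what the system would actually do.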
13.4.2 Adoption Metrics#
Adoption cannot be measured by counting scripts or lines of code. The following metrics track whether teams are actually embracing automation:
- Self-service ratio: the percentage of changes executed without the network team’s direct involvement. A high self-service ratio means application teams, security teams, and other consumers can operate the platform independently.
- Manual escape rate: how often engineers bypass automation for direct device access, and why. Some bypasses are legitimate (automation does not cover that case yet). Others signal a trust deficit. Tracking the reasons matters as much as tracking the count.
- Time-to-production for new automation: how long from “this should be automated” to “this is running in production”. Long cycle times indicate process friction that reduces the team’s incentive to automate new things.
- Onboarding time: how long for a new team member to make their first meaningful automation contribution. This measures documentation quality, codebase clarity, and environment setup friction simultaneously.
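The first two metrics above can be computed from a plain change log. The sketch below assumes a hypothetical record format (`ChangeRecord` and its fields are not from any particular tool) and made-up sample data; it is meant to show the shape of the calculation, not a reference implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRecord:
    """One executed change. All field names are hypothetical."""
    requested_by: str                    # team that asked for the change
    executed_via: str                    # "self_service", "network_team", or "manual_cli"
    bypass_reason: Optional[str] = None  # recorded only for manual_cli changes


def self_service_ratio(changes: list[ChangeRecord]) -> float:
    """Share of changes executed without the network team's direct involvement."""
    if not changes:
        return 0.0
    self_served = sum(1 for c in changes if c.executed_via == "self_service")
    return self_served / len(changes)


def manual_escape_rate(changes: list[ChangeRecord]) -> tuple[float, Counter]:
    """Share of changes that bypassed automation, plus a tally of the reasons."""
    if not changes:
        return 0.0, Counter()
    escapes = [c for c in changes if c.executed_via == "manual_cli"]
    reasons = Counter(c.bypass_reason or "unrecorded" for c in escapes)
    return len(escapes) / len(changes), reasons


# Illustrative sample data only.
changes = [
    ChangeRecord("app-team", "self_service"),
    ChangeRecord("app-team", "self_service"),
    ChangeRecord("security", "self_service"),
    ChangeRecord("app-team", "self_service"),
    ChangeRecord("netops", "network_team"),
    ChangeRecord("netops", "manual_cli", "not covered by automation"),
    ChangeRecord("netops", "manual_cli", "not covered by automation"),
    ChangeRecord("netops", "manual_cli", "did not trust dry-run output"),
]

ratio = self_service_ratio(changes)
rate, reasons = manual_escape_rate(changes)
print(f"self-service ratio: {ratio:.0%}, manual escape rate: {rate:.1%}")
print(reasons.most_common())
```

Keeping the reason tally next to the rate reflects the point made in the list: a coverage gap and a trust deficit produce the same count but call for different responses.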
13.5 Sustaining the Platform#
Getting to the Sustain rung of the Adoption Ladder is not the end of the problem. Automation that reaches production and earns trust still needs active organizational support to remain healthy as the team grows, the codebase ages, and the engineers who built it move on. The practices in this section address a different challenge from adoption: not how to start, but how to last.
13.5.1 The DevOps and Agile Inheritance#
The patterns described in this chapter did not originate in network engineering. They were worked out over the preceding two decades in software engineering organizations, first in product development (Agile methodologies), then in web operations (the DevOps movement).
DevOps addressed the structural tension between development teams that wanted to ship fast and operations teams that needed to protect stability. The resolution was not to make operations teams accept more risk, but to integrate development and operations practices so that deployment reliability became a shared responsibility. The same tension exists in network automation: the automation developers who want to ship new workflows and the network operations engineers who need production stability. The DevOps inheritance is directly relevant: shared ownership of the automation pipeline, automated testing before production, and blameless post-mortems that improve the system rather than assign fault.
Agile methodologies addressed a different problem: how to deliver incremental value in complex systems without requiring full up-front specification. The automation equivalent is the Adoption Ladder described above: deliver a working pilot before attempting full coverage, extend coverage iteratively, and treat each increment as something that must work before the next increment begins. This sounds obvious but conflicts with how network projects have traditionally been scoped: full design before any deployment, comprehensive coverage before any production use.
The borrowing from software engineering culture is not about adopting its vocabulary. It is about applying solutions that were earned through a decade of similar organizational challenges. The failure mode to avoid is semantic adoption: teams that rename their change process “CI/CD”, call their quarterly planning “sprints”, and declare themselves DevOps organizations while keeping every behavioral habit that those movements were specifically designed to break. The signal is a team with an automation pipeline that still requires a four-week CAB approval for every change the pipeline would execute. The vocabulary changed. The culture did not.
13.5.2 New Change Management#
Automation does not eliminate change management. It transforms it.
Traditional network change management was built around scheduled windows, manual review, and explicit approval chains, because every change was a manual operation executed by a person who could make mistakes. The process existed to slow down the path from decision to execution, adding checkpoints where humans could catch errors.
Automation changes the risk profile. An automated change was reviewed when the automation was written (not each time it runs), is tested before it reaches production, and produces an audit trail automatically. The argument for a four-week change freeze before a routine provisioning operation becomes weaker when that operation has been executed successfully three hundred times in the last year without a failure.
The transition to automation-enabled change management is earned incrementally. The Confidence Ladder pattern names the progression: automation earns autonomy in stages. New automation runs in dry-run mode first, producing execution previews but making no changes. After sufficient successful dry runs with human review, it earns supervised execution: changes applied with active monitoring and an engineer ready to intervene. After sufficient successful supervised runs with no interventions required, it earns autonomous execution for that class of change, with exception-based human oversight.
Because that autonomy is earned, it cannot be declared. A team that announces it has moved to autonomous execution without completing the dry-run and supervised stages has not reduced risk. It has hidden it. The Confidence Ladder only works if each stage is completed honestly, based on actual execution history, not on a policy decision or a manager’s confidence that the automation is probably fine.
This process mirrors the trust model a good engineer applies to a junior colleague: autonomy is earned through demonstrated reliability, not granted upfront and revoked after failure.
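The Confidence Ladder can be sketched as a small state machine in which promotion depends only on recorded execution history. The thresholds below are hypothetical (the text says only "sufficient" runs), and resetting the streak after a failure or intervention is an assumption of this sketch rather than part of the pattern as stated.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    DRY_RUN = 1
    SUPERVISED = 2
    AUTONOMOUS = 3


# Hypothetical thresholds: consecutive clean runs needed to leave each stage.
REQUIRED_CLEAN_RUNS = {Stage.DRY_RUN: 20, Stage.SUPERVISED: 50}


@dataclass
class ChangeClass:
    """Tracks the earned stage for one class of change, e.g. VLAN provisioning."""
    name: str
    stage: Stage = Stage.DRY_RUN
    clean_runs: int = 0  # consecutive clean runs at the current stage

    def record_run(self, succeeded: bool, human_intervened: bool = False) -> Stage:
        """Record one execution and promote only when the history justifies it."""
        if succeeded and not human_intervened:
            self.clean_runs += 1
        else:
            # Assumption: a failure or intervention restarts the streak
            # but does not demote the change class.
            self.clean_runs = 0
        threshold = REQUIRED_CLEAN_RUNS.get(self.stage)  # None once autonomous
        if threshold is not None and self.clean_runs >= threshold:
            self.stage = Stage(self.stage.value + 1)
            self.clean_runs = 0
        return self.stage


vlan = ChangeClass("vlan-provisioning")
for _ in range(20):
    vlan.record_run(succeeded=True)
print(vlan.stage)  # promoted out of dry-run after 20 clean previews
```

There is deliberately no method that sets the stage directly: in this sketch the only way up the ladder is through `record_run`, which mirrors the requirement that each stage be completed on actual execution history.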
13.5.3 Learning and Documentation Culture#
Automation that is not documented dies with its author. This is not a platitude. It is a pattern that appears every time an engineer who built a critical automation workflow leaves a team.
The documentation challenge in automation is an architecture problem, not a writing problem. Documentation written after the fact, as a separate artifact from the automation itself, is almost always incomplete and almost always becomes stale. The documentation that survives is embedded in the automation: schemas that describe themselves, dry-run outputs that explain what is happening, code structured clearly enough that reading it answers most questions, and architecture decision records (ADRs) that capture the non-obvious choices (why this design was chosen over alternatives, what constraints shaped the decision) rather than describing what the code does.
The team practice that supports this is making documentation a quality criterion in automation review, not a task to complete before closing the ticket. An automation workflow that lacks a clear dry-run mode, legible audit output, and documented failure behavior is incomplete, not because the documentation is missing, but because those capabilities are part of what makes automation trustworthy.
A year into his transition, Jordi was paired with a junior engineer who had no networking background. She asked him why the automation checked for an active BGP session before removing a peer rather than simply issuing the removal command. Jordi had written that check in ten minutes and never documented it. He explained the reason: removing a peer that is still advertising routes causes a traffic black hole that takes the full routing protocol convergence time to clear, often thirty seconds or more in a campus network, and that is not an acceptable interruption for a provisioning workflow. As he explained it, he realized the check had a second implication he had not encoded. He wrote both down. Three weeks later, the junior engineer caught a logic error in the second implication. The act of teaching had made the automation better than his original reasoning. He had not expected that. It happened every time after that too.
13.5.4 Knowledge Capture and Open Source#
The institutional memory problem compounds over time. An organization with three years of automation history and significant turnover has a substantial body of undocumented decisions embedded in code that no current team member fully understands.
The practices that reduce this risk are process practices, not tool practices. Code reviews as mandatory knowledge transfer: the reviewer who does not understand why the code was written this way is not qualified to approve it. Pair work on automation tasks, where knowledge transfer is a primary goal alongside delivery. Architecture decision records for non-obvious design choices, maintained as living documents that accumulate the reasoning behind the platform’s current shape.
The AI-assisted development pace introduces a specific variant of this problem. When engineers use AI tools to generate automation code quickly, the result can be production-grade in behavior but orphaned in understanding: nobody on the team fully knows why the code is structured the way it is, what edge cases were considered, or what assumptions are embedded in it. The code works until it does not, and when it fails, the debugging chain starts from zero. The Supervised Colleague pattern from section 13.3.3 is the mitigation: AI-generated code requires the same review and ownership transfer as code from any contributor, with the added discipline of documenting what the generation process did not make explicit.
Open source network automation tooling contributes a different kind of knowledge capture: shared vocabulary and shared failure modes developed across many organizations rather than one. A team that builds on open source tooling inherits the debugging, design, and operational experience of the broader community. When they encounter a failure mode that the community has already documented, they recognize it. Contributing back, even small fixes or documentation improvements, builds the team’s capacity to engage with that knowledge base effectively. This is an organizational capability, not just a technical choice.
13.5.5 Ethical and Human Implications#
Full automation shifts accountability in ways the industry has not fully resolved.
When a human engineer executes a change that causes an outage, accountability is clear: a person made a decision. When an automation system executes a change that causes an outage, the accountability chain is longer and harder to trace: the engineer who wrote the automation, the engineer who approved it, the engineer who configured the trigger, the manager who approved the autonomous execution policy. The legal and organizational frameworks for this are still developing.
The AI dimension intensifies this question. When an AI-driven orchestration system makes a routing decision autonomously (as described in Chapter 17), the chain between human intent and network action includes a reasoning layer that cannot be fully audited in the way that deterministic code can. An AI system can arrive at a correct action for reasons the engineers who deployed it did not anticipate. It can also arrive at a wrong action for the same reasons. When the automation is wrong, the question “who is responsible?” becomes genuinely difficult.
This does not mean autonomous operation should be avoided. It means that the scope of autonomous action, the conditions that trigger human escalation, and the audit trail that allows post-incident review must be designed with the same rigor as the automation logic itself. Two principles apply regardless of automation maturity: autonomous action should be bounded (the automation knows what it is not authorized to do and stops), and the system should always be able to explain what it did, why, and what it would have done differently.
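The two principles can be illustrated with a minimal sketch: an executor that checks every request against an explicit authorization scope, stops and escalates when the request falls outside it, and records the reason for each decision. The class names, the action strings, and the prefix-based scope are hypothetical simplifications chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    action: str
    target: str
    allowed: bool
    reason: str  # the explanation kept for post-incident review


@dataclass
class BoundedExecutor:
    # Hypothetical scope: (action, target-prefix) pairs approved for autonomy.
    authorized: set[tuple[str, str]]
    audit_log: list[Decision] = field(default_factory=list)

    def request(self, action: str, target: str) -> Decision:
        """Decide whether an action is in scope; never act outside it."""
        in_scope = any(
            action == allowed_action and target.startswith(prefix)
            for allowed_action, prefix in self.authorized
        )
        if in_scope:
            reason = f"'{action}' on '{target}' is within the authorized scope"
        else:
            reason = f"'{action}' on '{target}' is outside the authorized scope; escalating to a human"
        decision = Decision(action, target, in_scope, reason)
        self.audit_log.append(decision)  # every decision is recorded, allowed or not
        return decision


executor = BoundedExecutor(authorized={("add_vlan", "campus-")})
ok = executor.request("add_vlan", "campus-bldg-3")
denied = executor.request("remove_bgp_peer", "core-rtr-1")
for entry in executor.audit_log:
    print(entry.allowed, "-", entry.reason)
```

Refusals are logged with the same detail as approvals, so the audit trail shows both what the automation did and what it declined to do.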
13.6 Cross-Domain Collaboration#
The network team had spent six months building reliable automation for firewall rule provisioning. The security team had done the same for security group policy enforcement. Both platforms worked. Both had passed their pilots. Both were in production.
Then the network team re-segmented a campus zone to isolate a new set of IoT devices. The change was automated, reviewed, and executed correctly against the network’s Source of Truth. Forty minutes later, the security team’s policy engine began generating violations: the new segment did not exist in the security Source of Truth, so traffic from it matched no approved policy and was silently dropped. Neither automation had a bug. The network automation executed the intended network change. The security automation enforced the intended security policy. The failure was the absence of any shared model between them. The outage lasted four hours while engineers from both teams reconstructed, manually, what two automation systems should have known together.
The cultural shift described in this chapter does not happen within a single team. Network automation that delivers meaningful organizational value always touches other domains: security policy enforcement, cloud infrastructure management, service delivery workflows. The organizational friction at those boundaries is where most mature automation programs plateau.
The problem is structural. NetOps, SecOps, and CloudOps typically evolved separate automation capabilities with different Source of Truth schemas, different change management rituals, and different toolchains that overlap but do not integrate. Each system, working correctly within its own domain, has no awareness of what the other changed. The failure in the story above was not an exception. It is the default outcome when cross-domain automation is not deliberately designed.
13.6.1 The Governance and Empowerment Balance#
Every cross-domain automation program confronts the same tension: the platform team wants to standardize, because standardization is what makes shared tooling possible, and domain teams want autonomy, because their requirements do not fit cleanly into a standard.
The resolution that works in practice is the Paved Road pattern, developed in the platform engineering community (notably at Netflix and similar large-scale operations organizations): the platform team builds a well-lit, well-maintained path that is easier to use than to avoid. They do not prohibit alternatives. They do not mandate adoption. They make the good path the easy path, and they accept that some teams will go off-road for legitimate reasons.
A related ownership question that arises consistently: who owns the boundary between the network as a physical and protocol artifact and the network as an automation target? The network engineer owns network design, protocol behavior, and the correctness of the intent model. The network automation engineer owns the platform that implements that intent. In practice these ownership lines blur constantly, and the most effective teams treat them as shared responsibilities with clear escalation paths rather than clean divisions.
Cross-domain automation programs consistently stall when there is no single accountable owner above the domain teams. Without a shared point of responsibility at director or VP level, the governance and empowerment balance remains a negotiation between peers with competing priorities rather than a managed tension with a clear arbiter. The platform team cannot resolve cross-domain conflicts it is not empowered to resolve. The accountability structure must exist before the architecture can work.
13.6.2 Shared Platforms vs Federated Automation#
The architectural question underlying cross-domain collaboration is whether automation should converge on a shared platform or remain federated across domain tools.
The pattern that scales is neither fully shared nor fully federated: a shared data layer with federated execution. A single Source of Truth holds cross-domain intent (network topology, security policy, cloud resource allocation) with a consistent schema and access model. Domain-specific execution tooling (the network automation platform, the security policy engine, the cloud provisioning system) reads from the shared data layer and writes results back to it.
This architecture allows domain tools to evolve independently while maintaining the shared context that cross-domain workflows require. The alternative, a single unified automation platform for all domains, consistently fails under the weight of differing requirements, differing change velocities, and the organizational politics of which team’s priorities drive the platform roadmap.
This architectural choice connects directly to the scaling patterns in Chapter 11. A federated execution model with a shared data layer has specific consistency requirements: the data layer must be consistent enough for a security policy change and its network enforcement to be coordinated, and loosely coupled enough that a network change does not block waiting for cloud infrastructure synchronization.
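A minimal sketch of the shared-data-layer, federated-execution idea, replaying the campus re-segmentation story from earlier in this section: the network platform writes the new segment to a shared store, and the security engine reads from the same store instead of its own copy. Every class, method, and return string here is hypothetical, and an in-memory dictionary stands in for a real Source of Truth.

```python
from typing import Optional


class SharedSourceOfTruth:
    """In-memory stand-in for the shared data layer (hypothetical schema)."""

    def __init__(self) -> None:
        self._segments: dict[str, dict[str, str]] = {}

    def upsert_segment(self, name: str, cidr: str, owner: str) -> None:
        self._segments[name] = {"cidr": cidr, "owner": owner}

    def get_segment(self, name: str) -> Optional[dict[str, str]]:
        return self._segments.get(name)


class NetworkPlatform:
    """Domain execution tool: applies network changes, writes results back."""

    def __init__(self, sot: SharedSourceOfTruth) -> None:
        self.sot = sot

    def create_segment(self, name: str, cidr: str) -> None:
        # A real platform would push device configuration here.
        self.sot.upsert_segment(name, cidr, owner="network")


class SecurityPolicyEngine:
    """Domain execution tool: reads segments from the shared layer."""

    def __init__(self, sot: SharedSourceOfTruth, approved_segments: set[str]) -> None:
        self.sot = sot
        self.approved_segments = approved_segments

    def evaluate(self, segment_name: str) -> str:
        if self.sot.get_segment(segment_name) is None:
            return "drop: unknown segment"
        if segment_name not in self.approved_segments:
            # The segment is known, so its missing policy is visible, not silent.
            return "review: known segment without policy"
        return "allow"


sot = SharedSourceOfTruth()
network = NetworkPlatform(sot)
security = SecurityPolicyEngine(sot, approved_segments={"corp-users"})

before = security.evaluate("iot-sensors")
network.create_segment("iot-sensors", "10.40.0.0/24")
after = security.evaluate("iot-sensors")
print(before, "->", after)
```

The two tools never call each other. They stay independently deployable, and the only coupling is the shared record, which is what turns a silent drop into a visible policy gap.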
13.6.3 Embedded Collaboration#
Committee-based coordination between domain teams does not produce good cross-domain automation. It produces meetings. The pattern that produces working automation is embedded collaboration: engineers from different domains working alongside each other on specific automation problems, not in periodic review sessions.
A practical model is the cross-functional squad: one network engineer, one security engineer, one cloud engineer, assigned to build a specific cross-domain automation capability together for a defined period. The squad owns both the delivery and the ongoing operation of what they build. Working side by side for that period ensures that each domain’s practitioners develop working knowledge of the others’ constraints rather than operating through handoffs and translation layers.
Cross-functional squads only work when their members are genuinely dedicated to the squad’s mission. A squad where each member is still carrying their full domain team workload is not a squad: it is a committee with a different name. Effective cross-domain automation requires management commitment to protect squad members’ time. Without that protection, the squad defaults to the path of least resistance, which is each person doing the work that fits their existing domain role, and the cross-domain integration never gets built.
Cross-domain SLOs formalize this collaboration. When an automation workflow crosses domain boundaries, the reliability expectations for the end-to-end workflow cannot be owned by a single domain team. Defining shared SLOs requires both teams to understand each other’s failure modes and to negotiate shared ownership of the outcomes. Who owns an automation failure that originated in a network change and manifested as a security policy violation? The answer cannot be “whoever the incident ticketing system routes it to first”.
```mermaid
flowchart TD
    classDef shared fill:#d9e8f5,stroke:#4a7fa5,color:#1a2e3b
    classDef domain fill:#d9f5e8,stroke:#3a8a5a,color:#1a2e3b
    SoT["Source of Truth - Cross-domain intent"]
    subgraph Domains["Domain Execution Layer"]
        direction LR
        NP["Network Platform"]
        SP["Security Policy Engine"]
        CP["Cloud Provisioning"]
    end
    Obs["Observability - Cross-domain telemetry"]
    SoT --> NP & SP & CP
    NP & SP & CP --> Obs
    Obs -.->|"feedback"| SoT
    class SoT,Obs shared
    class NP,SP,CP domain
```
13.7 Summary#
Chapter 13 has argued that some of the hardest problems in network automation are not technical. They are organizational: who does the work, how they learn to do it, how the organization sustains it past the first wave of adoption, and how teams that have historically operated in separate silos build shared systems together.
Three themes anchor the chapter:
Roles evolve, they do not disappear. The transition from network ops to platform engineering is a transformation map, not a replacement chart. Deep protocol knowledge does not become less valuable in an automated world. It becomes differently applied: from executing configuration to designing the systems that execute configuration correctly. The five emerging roles described in section 13.2 are the destination that T-shaped skill development paths are pointed toward.
Adoption requires active design. The Adoption Ladder (pilot, scale, sustain) names three stages with distinct success criteria. Getting past the first rung requires trust, not coverage. Getting past the second requires self-service. Getting past the third requires automation that outlives its authors. The resistance patterns (Frozen Expert, Invisible ROI, Black Box) are predictable obstacles with known responses (section 13.4.1). Sustaining that adoption requires a different set of habits: the DevOps and Agile inheritance, a change management model calibrated to automation’s risk profile, documentation embedded in the automation itself, and knowledge capture that survives team turnover (sections 13.5.1 through 13.5.4).
Cross-domain collaboration is architectural, not organizational. The three-kingdoms problem (NetOps, SecOps, CloudOps operating in separate silos) does not resolve through goodwill or governance mandates. It resolves through shared data architecture: a single Source of Truth with consistent schema, federated execution tooling that reads from it, and embedded cross-functional squads that own the seams between domains. The Paved Road pattern is the governance model that makes this work without requiring every team to converge on a single platform.
Chapter 14 extends the organizational dimension in a different direction: treating automation not just as an engineering capability but as a product, with its own lifecycle, stakeholder model, and approach to measuring business impact.
References and Further Reading#
The Phoenix Project, Gene Kim, Kevin Behr, George Spafford (IT Revolution Press, 2013). A novel-format account of DevOps transformation in an IT organization under pressure. The organizational dynamics, the conflict between development velocity and operational stability, and the emergence of shared ownership map directly onto the automation program dynamics described in this chapter.
The DevOps Handbook, Gene Kim, Jez Humble, Patrick Debois, John Willis (IT Revolution Press, 2016). The practical companion to The Phoenix Project: the three ways (flow, feedback, continuous learning), deployment pipelines, and blameless post-mortems. Sections on the DevOps inheritance in this chapter draw on these foundations.
Team Topologies, Matthew Skelton, Manuel Pais (IT Revolution Press, 2019). The definitive framework for designing team structures around fast software delivery. The concepts of stream-aligned teams, platform teams, and enabling teams translate directly into the cross-domain collaboration and embedded squad models discussed in section 13.6.
Accelerate, Nicole Forsgren, Jez Humble, Gene Kim (IT Revolution Press, 2018). Research-backed analysis of what organizational and technical practices predict software delivery performance. The four key metrics (deployment frequency, lead time, change failure rate, time to restore) provide the quantitative foundation for the adoption metrics in section 13.4.2.
Change by Design, Tim Brown (HarperCollins, 2009). Popularizes the T-shaped person model and the design thinking approach behind it. The IDEO framing of deep domain expertise combined with cross-disciplinary literacy is the foundation for the skill transformation paths in section 13.3.