9. The Network#
The VLAN automation had been running for three weeks in the lab. Three switches, one of each vendor, every workflow passing. The team felt confident. On the first production run, 23 of the 800 campus switches failed. All HPE. All running a firmware version that nobody had documented.
The playbook was checking the error response from each device after pushing the VLAN configuration. On modern HPE firmware, an already-existing VLAN returns the error code duplicate-vlan. On this older firmware version, the same condition returned vlan-exists. The playbook had been written to treat duplicate-vlan as an idempotency signal, meaning “this already exists, that is fine.” It had not been written to handle vlan-exists, so it treated that response as a failure. A third of the HPE fleet reported failure. The rollback ran cleanly. The application team’s ticket stayed open for another three hours while the network team manually audited which switches had actually been configured and which had not.
The automation was not wrong. The network had an opinion that nobody had documented.
Six months later, the same team had a containerlab topology mirroring Building B: 24 switches, matching vendor images, with the HPE nodes locked to the production firmware version recorded in the Source of Truth (SoT). On the first test run of the VLAN workflow against that topology, 8 HPE nodes failed with exactly that error code. The team added vlan-exists to the list of idempotent responses in the HPE adapter. Re-run: all 24 nodes passed. Production deployment: 800 switches, zero failures.
The difference was not better code. It was a testing environment that represented reality.
This chapter addresses the block that was always implicit throughout Part 2: the network itself. Every building block built so far was designed by the automation team and behaves according to its documented interfaces. The network was inherited. It has quirks, diverse interfaces, firmware inconsistencies, and capabilities that vary by vendor, platform, and software generation. Chapter 9 addresses two questions: what do we need from the network to make automation reliable, and how do we safely validate automation logic before it touches production?
9.1. Fundamentals#
9.1.1. Context#
Chapter 3 introduced The Network as one of the seven blocks in the NAF Framework: the only block the automation team does not “own,” because it falls within the scope of network engineering. The team configures it, observes it, and models its intent, but it did not build the operating system, the data model, or the API interface. That dependency shapes every design decision in the platform above it.
Chapter 5 covered the Executor’s write path in detail: how automation roles, parameterized tasks, and idempotency checks operate. What Chapter 5 treats as given is that the device on the other end exposes a reliable, consistent interface. Chapter 9 addresses whether that assumption holds, and what to do when it does not.
Chapter 6 covered the Collector’s read path: gRPC Network Management Interface (gNMI) streaming telemetry, SNMP polling, and the data normalization pipeline. Chapter 9 covers the device-side prerequisites for those paths: what must be true about the network device for the collector to read from it consistently.
Chapter 9 closes Part 2. The six previous chapters built the automation platform: a place to store intent, a way to execute it, a way to observe results, an engine to coordinate everything, and surfaces to expose it to consumers. This chapter addresses the thing the platform was always pointing at.
9.1.2. Goals#
Three goals define The Network block’s contribution to the automation platform:
Understand and navigate the full network infrastructure spectrum. Any large-scale automation platform may talk to campus switches, data center fabrics, cloud VPCs, Kubernetes overlays, overlay controllers, and legacy gear simultaneously. Each type exposes different programmable interfaces. The platform must handle all of them without collapsing into a single lowest-common-denominator abstraction.
Validate automation logic and support new network architecture design before touching production. Simulation environments serve two purposes: they are the pre-production gate where logic errors, interface contract violations, and device-specific quirks are caught at the cost of minutes in a lab rather than hours in a production incident, and they are the design environment where new network architectures are explored and validated before any hardware is ordered.
Keep the automation platform stable as the network evolves. New vendors are added. Firmware versions change. New infrastructure types arrive. The platform must be designed to absorb this change through abstraction strategies, not through ad-hoc patches to every workflow whenever the network changes.
9.1.3. Pillars#
Three pillars support these goals, one pillar per goal:
- Network infrastructure spectrum and programmable interfaces: the full range of network types the platform must automate, and the interface each type exposes to the Executor and Collector.
- Simulation and testing environments: the toolchain for pre-production validation. Where different lab environment types fit, how they connect to the Saga pattern from Chapter 7, and how to scale them.
- Abstraction strategies: structural approaches that allow the automation platform to remain stable as the underlying network changes, regardless of vendor, platform generation, or interface protocol.
9.1.4. Scope#
In scope:
- The interfaces through which the Executor and Collector reach network devices. Both NETCONF and gNMI support configuration and telemetry operations; the choice between them per use case depends on operational strengths, not protocol exclusivity. The protocol is often shared between blocks; the operation type differs.
- The testing environments and methodologies that validate automation before production
- Abstraction strategies for managing multi-vendor and multi-platform heterogeneity
- The implications of cloud, Kubernetes, and overlay networking for automation design
Out of scope:
- Configuration generation and template rendering (Source of Truth (SoT), Chapter 4)
- Execution mechanics: how automation tooling executes a task (Executor, Chapter 5)
- The telemetry collection pipeline: how metrics flow into the time-series database (Observability, Chapter 6)
The boundary is consistent: Chapter 9 covers the network’s side of each interface, not the platform’s side.
9.2. Functionalities#
The Network block is the only building block in the NAF framework that the automation platform does not control. It can only interface with the network as the network allows. Every design decision in the previous five chapters (how intent is stored, how execution runs, how telemetry is collected, how workflows are coordinated, how consumers are served) ultimately resolves to a question about what the network device on the other end of the connection supports. This chapter examines that constraint directly.
graph LR
subgraph Goals
direction TB
A1[Navigate the full network infrastructure spectrum]
A2[Validate automation before production]
A3[Keep the platform stable as the network evolves]
end
subgraph Pillars
direction TB
B1[Network infrastructure spectrum and programmable interfaces]
B2[Simulation and testing environments]
B3[Abstraction strategies]
end
subgraph Functionalities
direction TB
C1[Programmable Interfaces]
C2[Simulation and Testing Environments]
C3[Abstraction Strategies]
end
A1 --> B1 --> C1
A2 --> B2 --> C2
A3 --> B3 --> C3
classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
class A1,B1,C1 row1;
class A2,B2,C2 row2;
class A3,B3,C3 row3;
9.2.1. Programmable Interfaces#
The network is heterogeneous by nature. It is not one thing. It is a spectrum of infrastructure types accumulated over years, often built in parallel by different teams (ownership and organizational boundaries are explored in Chapter 13), each with its own interface model, abstraction level, and automation maturity. A modern automation platform can span campus switches, data center fabrics, cloud VPCs, overlay controllers, Kubernetes clusters, service provider WAN infrastructure, and hyperscaler-managed forwarding planes simultaneously. The platform must handle all of them. The infrastructure type determines what interface is available; the automation platform adapts to that reality rather than mandating a uniform interface.
9.2.1.1. The network infrastructure spectrum#
This is a high-level recap of different network infrastructure scenarios you may have to tackle, depending on the nature of your company:
Campus and branch switching is the core scenario I used as the example throughout Part 2: multi-vendor physical switches (Cisco, Arista, HPE, Extreme). Modern campus gear exposes Command Line Interface (CLI), NETCONF, and gRPC Network Management Interface (gNMI) simultaneously. Automation maturity is high for equipment from the past five to seven years; it is patchy for legacy gear still running decade-old firmware.
Data center fabric topology is typically leaf-spine, often from a smaller vendor set: Arista, Cisco Nexus, or automation-native open networking platforms. Interface uniformity is higher than campus; change management is stricter. EVPN/VXLAN overlays add a management plane above the fabric that may have its own API, separate from the individual device interface. SONiC-based platforms (Cisco 8000, Nvidia Spectrum) are increasingly present in hyperscaler-influenced DC deployments; their configuration interface is a structured database rather than CLI or NETCONF, and is covered further in the abstraction strategies section.
Service provider and WAN infrastructure (carrier-grade routers, MPLS networks, segment routing fabrics) has its own automation challenges: scale, protocol complexity, and the dual concern of control-plane configuration and traffic engineering policy. NETCONF and YANG models are well-established in this space; platforms like Cisco IOS-XR and Juniper Junos have mature YANG coverage. The automation platform often targets a controller (SR-PCE, Crosswork, NSO) rather than individual devices.
Cloud networking: AWS VPC, Azure VNet, GCP VPC and others. REST APIs with eventual consistency semantics. There is no concept of “pushing a config” and waiting for a synchronous confirmation. The Executor handles async operations: create, poll, verify. Infrastructure-as-code tooling fits this model naturally. The automation platform must account for the different consistency model, not assume synchronous apply-and-confirm semantics.
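The create, poll, verify sequence can be sketched as a small loop. This is a sketch only: the client methods used here (`create_subnet`, `get_operation`, `describe_subnet`) are hypothetical stand-ins for whatever SDK the cloud provider offers, not a real API.

```python
import time

def apply_async(client, params, timeout=300, interval=5):
    """Create a resource, poll until the operation settles, then verify state.

    `client` is a hypothetical cloud SDK wrapper; only the pattern is real.
    """
    op_id = client.create_subnet(params)             # 1. create: returns immediately
    deadline = time.time() + timeout
    while time.time() < deadline:                    # 2. poll: eventual consistency
        status = client.get_operation(op_id)
        if status == "available":
            break
        if status == "failed":
            raise RuntimeError(f"operation {op_id} failed")
        time.sleep(interval)
    else:
        raise TimeoutError(f"operation {op_id} did not settle in {timeout}s")
    observed = client.describe_subnet(params["cidr"])  # 3. verify: read back state
    return observed["cidr"] == params["cidr"]
```

The verify step is not optional: under eventual consistency, a successful API call only means the request was accepted, not that the change is visible.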
SD-WAN and overlay networks (Cisco SD-WAN, Versa, VMware VeloCloud) are controller-managed. The automation target is the controller API, not the individual device. The physical underlay still exists but is managed entirely through the overlay’s abstraction. This affects both execution and observability: the Executor writes policy to the controller; telemetry about traffic, path selection, and policy enforcement also flows through the controller’s northbound interface, not directly from the physical underlay devices.
Kubernetes networking at the CNI layer inverts the device model entirely. The network is defined through Kubernetes API objects: NetworkPolicy, Services, Ingress, and custom resources from CNI plugins such as Cilium, Calico, or Flannel. The device disappears as an automation target. The Kubernetes API is the interface. Network policies are code, not device configuration. This is the model the other infrastructure types are converging toward: declarative intent, controller-reconciled state, no direct device access.
DPUs and SmartNICs (Nvidia BlueField, Intel IPU, Marvell Octeon) represent a shift in where network processing happens. In modern data centers, DPUs are installed alongside CPUs on every server to offload network functions: VXLAN encapsulation, encryption, firewall policy enforcement, load balancing, and microsegmentation. This offloads these functions from the host CPU and from network appliances to the SmartNIC firmware. The consequence for automation: “the network device” is no longer only a switch or router in the rack. Functions previously managed through dedicated network appliance APIs are now managed through the DPU management plane and its vendor SDK, a new interface category that standard NETCONF and gRPC Network Management Interface (gNMI) tooling does not yet reach cleanly.
Open networking (SONiC, DENT, OPX) runs Network Operating System (NOS) software on commodity hardware. SONiC’s configuration interface is a Redis database with a Yet Another Next Generation (YANG)-structured schema, structurally different from CLI or NETCONF, and programmatic by design. Increasingly present in hyperscaler-influenced data centers and large-scale enterprise DC deployments. SONiC is notable because it was designed for automation from the start: the interface is a structured database, not a CLI adapted for programmatic access.
Virtual network functions co-exist with physical infrastructure in many environments. A software firewall inserting traffic through policy-defined paths, a virtual load balancer managing traffic distribution across application clusters, a software-based BGP route reflector: these are all automation targets that use management interfaces ranging from vendor REST APIs to NETCONF. They are often managed alongside the physical inventory using the same SoT and Executor, but they require separate adapter paths because their interface models differ from physical devices.
Wireless controllers (Cisco DNA, Aruba Central, Juniper Mist) are controller-based; the automation target is the controller API. Relevant whenever VLAN provisioning extends to wireless SSIDs alongside wired switch ports, as it would in the campus scenario.
The point is not to enumerate every infrastructure type exhaustively. It is to establish that a platform automating any non-trivial network interacts with multiple types simultaneously. The Executor and Collector must route each operation to the correct interface type. The Source of Truth (SoT) must model intent at a level above the individual interface. The complexity of the network is the design constraint the platform was built to absorb.
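That routing requirement can be sketched as a dispatch table keyed by infrastructure type and operation. The type and adapter names below are illustrative, not from any reference implementation:

```python
# Illustrative dispatch table: each (infrastructure type, operation) pair
# resolves to one adapter. A real platform would derive this from the SoT.
ADAPTERS = {
    ("campus_switch", "configure"): "netconf",
    ("campus_switch", "collect"):   "gnmi",
    ("cloud_vpc",     "configure"): "rest_async",
    ("sdwan",         "configure"): "controller_api",
    ("kubernetes",    "configure"): "k8s_api",
    ("legacy_switch", "configure"): "cli_ssh",   # last resort
}

def route(infra_type, operation):
    """Return the adapter name for an operation, or fail loudly."""
    try:
        return ADAPTERS[(infra_type, operation)]
    except KeyError:
        raise ValueError(f"no adapter registered for {infra_type}/{operation}")
```

Failing loudly on an unregistered pair is deliberate: silently falling back to a default interface is how the wrong protocol reaches the wrong device.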
9.2.1.2. Interface types#
Each infrastructure type exposes one or more interface types to the automation platform. The same physical switch may expose all three simultaneously. The platform adapts to what is available, with preferences that reflect reliability, structure, and scale. No interface type is a universal mandate; the right choice depends on what the device supports and what the operation requires.
- Command Line Interface (CLI) over Secure Shell (SSH) is universal, legacy, and fragile. Screen-scraping and text parsing breaks when firmware changes output formatting or adds new fields. Error codes are inconsistent across vendors and across firmware versions. CLI is still the only option for older gear. The recommendation is to minimize its use and avoid building workflows that depend on it for anything more than the devices that have no alternative (the last resort). Setting an interface description looks like:
interface GigabitEthernet0/1
description uplink-to-core

- NETCONF is structured, transactional, and correct when it works. It supports atomic operations and rollback, and its data model is machine-parseable. The transport layer is generally reliable; the data model layer is where the gaps are. Vendor YANG model quality varies significantly: a device may claim NETCONF support but have incomplete or proprietary models for features the platform needs. IETF and OpenConfig YANG models standardize the intent layer; vendor-native YANG models fill the gaps. The same interface description via NETCONF:
<config>
  <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
    <interface>
      <name>GigabitEthernet0/1</name>
      <description>uplink-to-core</description>
    </interface>
  </interfaces>
</config>

- RESTCONF is the HTTP-based equivalent of NETCONF, using the same YANG models but exposed over REST semantics. It is useful when teams are more comfortable with HTTP tooling than with NETCONF’s XML/SSH transport. The data model is the same; only the transport differs. Vendor support is less uniform than NETCONF. The same interface description via RESTCONF:
PATCH /restconf/data/ietf-interfaces:interfaces/interface=GigabitEthernet0%2F1
Content-Type: application/yang-data+json
{
  "ietf-interfaces:interface": {
    "name": "GigabitEthernet0/1",
    "description": "uplink-to-core"
  }
}

- gRPC Network Management Interface (gNMI) and gNOI are gRPC Remote Procedure Call (gRPC)-based protocols. gNMI handles telemetry and configuration read/write; gNOI handles operational commands. Modern and scale-friendly. Vendor support is mature on Arista and newer Cisco platforms; it is patchy on HPE and legacy gear. Chapter 6 covered gNMI from the Collector’s perspective. Chapter 9 covers the device-side prerequisites: the device must support gNMI subscriptions at the OS version the platform targets. Nvidia Spectrum switches running SONiC expose gNMI natively alongside the CONFIG_DB interface, making them among the most automation-friendly platforms for both configuration and telemetry. The same interface description via gNMI SetRequest:
path: /interfaces/interface[name=GigabitEthernet0/1]/config/description
value: "uplink-to-core"

- Vendor REST APIs (eAPI, NX-API, Cumulus NVUE, and similar) are machine-readable but not standardized across vendors. Useful as a gap-filler when NETCONF or gNMI is absent or incomplete for a specific operation. Nvidia Cumulus switches expose NVUE (a structured REST API with consistent JSON schema) as their primary programmatic interface; Arista and Cisco Nexus expose eAPI and NX-API respectively as alternatives to NETCONF. Treat these as adapter-layer concerns, not as a foundation for a vendor-neutral platform. The same interface description via Arista eAPI (JSON-RPC over HTTPS):
{
  "jsonrpc": "2.0",
  "method": "runCmds",
  "params": {
    "version": 1,
    "cmds": [
      "interface GigabitEthernet0/1",
      "description uplink-to-core"
    ],
    "format": "json"
  },
  "id": "1"
}

- Cloud and controller APIs follow REST patterns with eventual consistency. The async operation model is a design requirement, not a limitation to work around. For SD-WAN, wireless, and DPU management planes, the controller API is often the only available interface. Adding a description to an AWS VPC illustrates the pattern: a tagged resource update submitted asynchronously, with no synchronous confirmation that the change was applied:
POST https://ec2.eu-west-1.amazonaws.com/ HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Authorization: AWS4-HMAC-SHA256 ...
Action=CreateTags
&ResourceId.1=vpc-0a1b2c3d4e5f67890
&Tag.1.Key=Description
&Tag.1.Value=uplink-to-core
&Version=2016-11-15

- The Kubernetes API is declarative, controller-reconciled, and consistent across vendors. NetworkPolicy, Services, and CNI custom resources are the automation targets. There is no direct device access; the API server is the sole interface. A network isolation policy for the app-payments service:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-payments-isolation
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: app-payments
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: app-payments

- SONiC CONFIG_DB (Redis) is the native interface for SONiC-based platforms. Rather than a protocol layered on top of the OS, it is the OS’s own configuration store: a Redis database with a YANG-structured schema. Automation writes JSON entries directly to CONFIG_DB; SONiC’s internal orchagent daemon reconciles the intent to the hardware forwarding tables. gNMI is available in parallel for telemetry reads. This is architecturally distinct from CLI, NETCONF, or REST: the interface is the data store itself. Section 9.2.3.4 covers this in more depth. The same interface description written to CONFIG_DB via JSON patch (applied with config load):
{
  "PORT": {
    "Ethernet0": {
      "description": "uplink-to-core",
      "admin_status": "up"
    }
  }
}

9.2.1.3. Per-device interface selection#
When a device exposes multiple interfaces, the platform must choose one per operation type and maintain that choice consistently. A campus switch exposing Command Line Interface (CLI), NETCONF, and gRPC Network Management Interface (gNMI) simultaneously requires a decision, not a mix-and-match approach that varies by workflow or engineer preference.
Recommended hierarchy, applied per operation type. Both gRPC Network Management Interface (gNMI) and NETCONF support configuration writes and telemetry reads; the preference below reflects operational strengths, not protocol exclusivity:
- gRPC Network Management Interface (gNMI) preferred for telemetry collection (streaming subscriptions, structured, scale-friendly; gNMI Set is also a valid configuration path)
- NETCONF preferred for configuration (transactional, rollback-capable; NETCONF get and get-config are equally valid for state reads)
- RESTCONF or vendor REST API as fallback when NETCONF is incomplete for a specific feature
- Command Line Interface (CLI) as last resort for legacy gear only
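The hierarchy can be expressed as a capability-driven selection function. This is a sketch that assumes the SoT records each device's supported interfaces as a simple set; the names are illustrative:

```python
# Preference order per operation type, mirroring the hierarchy above.
PREFERENCES = {
    "telemetry":     ["gnmi", "netconf", "snmp"],
    "configuration": ["netconf", "gnmi", "restconf", "vendor_rest", "cli"],
}

def select_interface(operation, capabilities):
    """Pick the most preferred interface the device actually supports."""
    for candidate in PREFERENCES[operation]:
        if candidate in capabilities:
            return candidate
    raise LookupError(f"device exposes no usable {operation} interface")
```

The point of encoding the preference once is that the choice is made per device and operation type, not ad hoc per workflow or per engineer.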
What the Executor needs from the interface: idempotency; reliable error codes distinguishing “already exists” from “failed”; structured error responses; and state verification capability after apply. The HPE vlan-exists vs duplicate-vlan problem was precisely a failure of the second condition: the error code semantics changed between firmware versions.
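A minimal sketch of that error-code classification. The set contents come from the HPE scenario in this chapter; the function and outcome names are illustrative, not from any specific library:

```python
# Firmware-specific error codes that all mean "the VLAN is already there".
# Treating them as idempotent successes is what the original playbook missed.
IDEMPOTENT_CODES = {
    "duplicate-vlan",  # modern HPE AOS-CX firmware
    "vlan-exists",     # older firmware releases
}

def classify(error_code):
    """Map a device error code to an execution outcome."""
    if error_code is None:
        return "changed"      # device accepted the configuration push
    if error_code in IDEMPOTENT_CODES:
        return "unchanged"    # intent already satisfied: not a failure
    return "failed"           # genuine error: surface it and roll back
```

Normalizing at the adapter layer keeps firmware quirks out of the workflow logic: the orchestration above only ever sees changed, unchanged, or failed.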
What the Collector needs: structured reads, streaming telemetry subscriptions, and consistent data models so the Observability layer does not need per-device parsers. gRPC Network Management Interface (gNMI) is the preferred subscription protocol where supported: it is structured, hierarchical, and schema-validated at the device, which eliminates the per-device text parsing that dominated SNMP-era collection. But any subscription mechanism that delivers structured, timely data serves the same function. SNMP polling remains valid for legacy devices where gNMI is unavailable. Syslog feeds structured events for log-based observability. OpenTelemetry (OTel) is an emerging standard worth tracking: originally designed for application observability, it is gaining adoption as a vendor-neutral transport for network telemetry, metrics, and traces. The Collector’s protocol choice is a function of what the device supports; the Observability layer should not need to know which transport was used.
The Source of Truth (SoT) records intended state for every device attribute the platform operates on: intended VLAN configuration, intended BGP neighbor relationships, intended interface descriptions, and intended OS version. OS version deserves special mention here because it affects not just configuration drift detection but the adapter path selection itself: the Executor branches by OS version when the same vendor’s firmware behaves differently between releases. This is not a special case for OS version; it is the same intent-versus-reality pattern applied to every attribute the platform manages. The desired OS version is what the operations team approved; the running version is what Observability observes. When they diverge, that divergence is a signal: the device may be behind on a planned upgrade, or an unplanned change occurred. The platform needs both data points to decide whether to proceed or block.
This distinction matters practically. The SoT says a device should run AOS-CX 10.13. The Collector reports it is running 10.12.1006. The platform has two options: block execution until the OS version is reconciled, or proceed using the 10.12 adapter path. The right answer depends on the team’s change management policy, but the platform needs both data points to make the decision. SoT provides intent; Observability provides reality.
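A sketch of that decision with the change management policy as an explicit parameter; the function and policy names are invented for illustration:

```python
def adapter_path(intended_version, running_version, policy="proceed"):
    """Return the OS version whose adapter branch to use, or None to block."""
    if running_version == intended_version:
        return intended_version            # no drift: use the intended branch
    if policy == "block":
        return None                        # reconcile the OS version first
    return running_version                 # drive the device as it actually is
```

With policy="block", adapter_path("10.13", "10.12.1006") returns None and the workflow stops; with the default policy it returns "10.12.1006" and the Executor selects the older adapter branch.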
9.2.2. Simulation and Testing Environments#
The network is production infrastructure. Unlike an application backend, there is no staging server to test against by default. Building one is the job of this functionality.
Testing network automation has always been harder than testing application code. A network is a distributed system with many components the automation team does not control: neighboring ASes at peering points, upstream transit providers, customer-managed CPE, wireless clients, and cloud infrastructure operated by third parties. A service provider testing a routing policy change cannot spin up a mock BGP peer from a transit provider to validate the full behavior. An enterprise testing a WAN failover workflow cannot control how the MPLS provider responds. The simulation environments described in this section are the best available substitute: they reproduce what the team controls, accept the limitations of what they cannot, and focus validation on the logic layer where bugs actually live.
The testing pyramid from Chapter 2 (unit, integration, end-to-end) applies directly here. Unit tests validate individual automation modules in isolation, typically with mock device responses. Integration tests validate multi-step interactions: the SoT API returns the expected data structure, the Executor translates it correctly, the device response is handled correctly. End-to-end tests validate the full workflow against something that behaves like a real network device. Simulation environments are the end-to-end layer.
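A unit-level sketch of the first layer: adapter logic exercised against a mocked device response, with no lab involved. push_vlan is a stand-in invented here for illustration, not the platform's actual function:

```python
import unittest
from unittest.mock import Mock

def push_vlan(device, vlan_id):
    """Toy adapter logic under test: push a VLAN, report success."""
    reply = device.send_config(f"vlan {vlan_id}")
    return reply.get("error") is None

class TestPushVlan(unittest.TestCase):
    def test_clean_reply_reports_success(self):
        device = Mock()
        device.send_config.return_value = {"error": None}
        self.assertTrue(push_vlan(device, 210))
        device.send_config.assert_called_once_with("vlan 210")

    def test_error_reply_reports_failure(self):
        device = Mock()
        device.send_config.return_value = {"error": "access-denied"}
        self.assertFalse(push_vlan(device, 210))

if __name__ == "__main__":
    unittest.main()
```

Tests like these run in milliseconds on every commit; the simulation environments below are reserved for the behavior that mocks cannot reproduce.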
9.2.2.1. Environment types#
The right environment depends on what the test needs to validate. There is a spectrum from low-fidelity, low-cost environments appropriate for routine CI/CD pipelines, to high-fidelity environments worth investing in for production-confidence testing.
| Environment type | Startup | Control plane | Data plane | CI/CD fit | When to use |
|---|---|---|---|---|---|
| Container-based emulation | Seconds | Yes | No | Native | Automation logic, interface contract validation, workflow testing |
| VM-based emulation | Minutes | Yes | Yes | Limited | Protocol interoperability, design validation, full-NOS behavior testing |
| Physical hardware lab | N/A (always on) | Yes | Yes | Manual | Hardware-specific behavior, performance testing, scenarios impossible to emulate |
| Digital twin | Continuous sync | Yes | Depends on implementation | Custom | Production-fidelity testing; validates automation against actual production topology and state |
Container-based emulation uses lightweight network OS images running as containers, connected by virtual links. Topology startup takes seconds. It is the practical default for routine CI/CD: the automation team runs the same workflow code against a containerized topology on every change, catching logic errors before production. Data-plane behavior is not replicated, but control-plane behavior and management interface behavior are sufficient for testing automation logic.
VM-based emulation runs full NOS images as virtual machines. It provides broader vendor coverage, more realistic NOS behavior including data plane, and is appropriate for protocol design testing and multi-vendor interoperability scenarios. The tradeoff: higher resource cost, slower startup, and limited integration with automated pipelines. Not practical for routine commit-level testing.
Physical hardware labs are maintained by many large organizations: a rack of actual switches and routers, often mirroring the production architecture patterns. This provides the highest fidelity for hardware-specific behaviors, performance testing, and scenarios where emulation does not reproduce device behavior accurately. The cost is significant: capital investment, power and space, and the operational overhead of keeping the lab topology synchronized with production architecture. Labs that drift from production patterns provide false confidence. The value is real; the maintenance discipline is the challenge.
Digital twins are live replicas of the production topology, fed by the Source of Truth (same device models, topology, and current intended configuration) and current state from Observability. A digital twin mirrors what production actually looks like right now, not an approximation. The operational cost is significant: maintaining synchronization between the digital twin and production requires continuous reconciliation. It is a maturity-level investment, appropriate for teams that have already validated their platform at scale and need the highest level of pre-production confidence.
Container-based emulation is the practical starting point for most teams. It starts in seconds, integrates natively with CI/CD pipelines, and covers the modern campus and data center equipment used in the majority of automation use cases. The investment in building this environment pays back in the first incident it prevents.
9.2.2.2. Toolchain for container-based and VM-based emulation#
The container-based ecosystem has several tools with distinct roles that are often confused:
containerlab instantiates and wires container-based network OS images. Created by Roman Dodin and widely adopted across the network automation community, containerlab has become the de-facto standard for container-based network labs. It directly orchestrates Docker containers (Arista cEOS, FRR, SONiC, VyOS, and others) and connects them with virtual links defined in a topology file. containerlab starts the topology and provides a running lab in seconds. A minimal three-node topology file looks like:

name: building-b-sim
topology:
  nodes:
    cisco-1:
      kind: vr-csr
      image: vrnetlab/vr-csr:17.9.4
    arista-1:
      kind: ceos
      image: ceos:4.31.2F
    hpe-1:
      kind: vr-aoscx
      image: vrnetlab/vr-aoscx:10.12.1006
  links:
    - endpoints: ["cisco-1:eth1", "arista-1:eth1"]
    - endpoints: ["arista-1:eth2", "hpe-1:eth1"]

As teams scale, running containerlab on a single machine becomes a bottleneck. clabernetes distributes containerlab topologies across a Kubernetes cluster, allowing multiple simulation runs in parallel and enabling teams to scale their pre-production gate as the platform grows.

netlab abstracts topology definition above containerlab. Created by Ivan Pepelnjak, netlab lets the engineer describe what the topology should accomplish rather than how to wire it: “these three nodes run BGP,” “these nodes are in the same VLAN.” netlab interprets that description and renders it into a containerlab topology file plus initial device configurations per vendor. Think of it as a declarative description of the lab: the engineer defines the service; netlab generates the infrastructure definition. When the goal is to test automation logic against a topology that mirrors a production network model, netlab is the right starting point; containerlab is what instantiates it. A minimal netlab topology for the same three-node scenario:
nodes:
  cisco-1:
    device: csr
  arista-1:
    device: eos
  hpe-1:
    device: arubacx
links:
- cisco-1:
  arista-1:
- arista-1:
  hpe-1:
vlans:
  app-payments:
    id: 210
    links: [ cisco-1, arista-1, hpe-1 ]

vrnetlab bridges container-based and VM-based emulation. Some vendors do not provide native container images. vrnetlab wraps vendor VM images inside containers, making them usable inside a containerlab topology. This is how to test against a Cisco IOS-XR or Junos device in a containerlab environment without switching to a VM-based platform.
EVE-NG and GNS3 are VM-based emulation platforms providing broad vendor coverage, GUI-based topology design, and full-NOS behavior including data-plane forwarding. The tradeoff: higher resource usage, slower startup, and limited CI/CD integration. These are the right choice for protocol and design testing, legacy platforms, and multi-vendor interoperability scenarios.
Cisco Modeling Labs is Cisco’s commercial VM lab platform with a REST API for partial CI/CD integration. The right choice for Cisco-centric environments needing access to IOS-XE, IOS-XR, and NX-OS VMs in a managed, shared lab.
9.2.2.3. Validation frameworks#
Validating that a device correctly supports a given interface protocol or YANG path is part of the work described in Chapter 5 (Execution) and Chapter 6 (Observability): the Execution chapter covers validating configuration interfaces before relying on them in production workflows; the Observability chapter covers validating collection paths and confirming that subscriptions return data in the expected format.
One scenario warrants specific treatment in the simulation context: wave validation. After simulation passes but before committing to the full production scope, some teams run a structured validation pass against the first wave. pyATS provides a test framework for writing structured device-interaction tests with rich parsing and state comparison. Robot Framework is a broader keyword-driven test automation framework with network-specific libraries. Both allow a team to encode the expected post-change state as executable assertions: after VLAN 210 is deployed to Wave 1, confirm VLAN 210 exists on all switches, confirm the interface associations are correct, and confirm the Observability layer sees the expected state. This connects directly to section 9.2.2.4: the structured validation layer that separates a passing simulation run from genuine operational confidence before proceeding to the next wave.
A minimal pyATS test asserting VLAN presence after deployment:
from pyats import aetest

class VlanValidation(aetest.Testcase):

    @aetest.test
    def verify_vlan_exists(self, device):
        output = device.parse('show vlan brief')
        assert '210' in output['vlans'], f"VLAN 210 missing on {device.name}"

    @aetest.test
    def verify_vlan_name(self, device):
        output = device.parse('show vlan brief')
        assert output['vlans']['210']['name'] == 'app-payments'

The same check in Robot Framework using the NAPALM library, with explicit setup and readable keyword names:
*** Settings ***
Library    Collections
Library    napalm    WITH NAME    NAPALM

*** Variables ***
@{DEVICES}    cisco-1    arista-1    hpe-1

*** Test Cases ***
VLAN 210 Is Present On All Switches After Deployment
    FOR    ${hostname}    IN    @{DEVICES}
        Connect And Check VLAN    ${hostname}    210    app-payments
    END

*** Keywords ***
Connect And Check VLAN
    [Arguments]    ${hostname}    ${vlan_id}    ${vlan_name}
    NAPALM.Open    ${hostname}
    ${vlans}=    NAPALM.Get VLANs
    Dictionary Should Contain Key    ${vlans}    ${vlan_id}
    Should Be Equal    ${vlans}[${vlan_id}][name]    ${vlan_name}
    [Teardown]    NAPALM.Close

9.2.2.4. Simulation as the pre-production Saga gate#
Chapter 7 described the Saga pattern: a multi-step workflow where each step has a corresponding compensation action that runs if a later step fails. The Saga handles failures in production. Simulation adds the step before the Saga begins: run the workflow against a simulation environment first. If the simulation run fails, no production change occurs. Only when simulation passes does the workflow proceed to the production target.
flowchart LR
SoT["SoT export (topology + OS versions)"]
Lab["Simulation environment (containerlab topology)"]
Workflow["Workflow execution (same code as production)"]
Pass{Pass?}
Prod["Production execution (Orchestrator: full scope)"]
Fix["Investigate and fix (no production impact)"]
SoT --> Lab --> Workflow --> Pass
Pass -- Yes --> Prod
Pass -- No --> Fix --> Workflow
This is the pre-production gate: simulation as the first check before the production Saga scope begins. A failure in simulation is caught before any production device is touched.
Practical implementation:
- The Source of Truth exports the topology definition for the target scope, including OS versions per device.
- The simulation environment is instantiated with matching vendor images, with OS versions pinned to match the SoT intended state.
- The same workflow that will run in production runs against the simulation target first.
- Any step that fails in simulation triggers investigation before production execution proceeds.
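The gate logic itself is small. A sketch in Python; the names (`Target`, `run_with_simulation_gate`, the stub workflow) are illustrative, not part of the chapter’s platform:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Target:
    name: str
    devices: list  # device hostnames in scope

# A workflow returns True when all steps passed against the given target.
Workflow = Callable[[Target], bool]

def run_with_simulation_gate(workflow: Workflow, sim: Target, prod: Target) -> str:
    """Run the same workflow code against simulation first.

    Production execution happens only if the simulation run passes;
    a simulation failure stops the rollout with no production impact.
    """
    if not workflow(sim):
        return "simulation-failed: investigate and fix, no production change"
    if not workflow(prod):
        return "production-failed: Saga compensation has run"
    return "deployed"

# Stub workflow that fails on HPE nodes, mirroring the chapter's scenario.
def stub_workflow(target: Target) -> bool:
    return not any(d.startswith("hpe") for d in target.devices)

sim = Target("building-b-sim", ["cisco-1", "arista-1", "hpe-1"])
prod = Target("building-b", ["cisco-1", "arista-1", "hpe-1"])
print(run_with_simulation_gate(stub_workflow, sim, prod))
# -> simulation-failed: investigate and fix, no production change
```

The essential property is that the failure path returns before any production device is touched; the workflow code itself is identical for both targets.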
Progressive rollout waves
Passing simulation does not mean deploying to the entire production scope at once. For large-scale rollouts, simulation is the first gate in a series of progressively larger waves, each with its own validation check before the next wave proceeds. This is one of my favorite patterns for building trust in critical deployments: the network equivalent of the canary pattern popular in software development.
A team deploying a new workflow to 100 data centers might structure it as: simulation (virtual topology) → 1 pilot site → 2 sites → 4 → 8 → 16 → remaining sites. Each wave validates that the workflow behaved correctly on the previous wave before expanding. If wave 4 surfaces a failure that simulation did not catch (a hardware-specific behavior, a site-specific state), the rollout halts, the issue is fixed, and the wave sequence resumes from the point of failure.
flowchart LR
Sim["Simulation"] --> W1["Wave 1 (1 site)"] --> W2["Wave 2 (2 sites)"] --> W3["Wave 3 (4 sites)"] --> W4["Wave N (full scope)"]
W1 -- Fail --> Fix["Investigate + fix"]
W2 -- Fail --> Fix
W3 -- Fail --> Fix
Fix --> Sim
The Orchestrator controls wave progression. The SoT scopes each wave by site, building, or device group. Validation gates between waves are explicit workflow steps: the Orchestrator checks the Observability layer for expected state before proceeding. This pattern applies whether the scope is 100 data centers or 800 campus switches: the principle is to limit the blast radius of any unforeseen failure while building confidence with each successful wave.
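The doubling progression described above can be generated mechanically. A sketch, with the doubling cap (`max_doubling`) as an assumed parameter rather than anything prescribed by the pattern:

```python
def plan_waves(total_sites: int, pilot: int = 1, max_doubling: int = 16) -> list[int]:
    """Split a rollout scope into doubling waves: 1, 2, 4, ... sites,
    up to max_doubling; everything left ships as the final wave."""
    waves, size, remaining = [], pilot, total_sites
    while remaining > 0 and size <= max_doubling:
        wave = min(size, remaining)
        waves.append(wave)
        remaining -= wave
        size *= 2
    if remaining:
        waves.append(remaining)  # final wave: remaining sites
    return waves

print(plan_waves(100))  # [1, 2, 4, 8, 16, 69]
```

A failed wave halts progression; after the fix, the sequence resumes from the failed wave rather than restarting from simulation with a larger blast radius already exposed.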
Limitations of simulation: container images do not reproduce all firmware behaviors. A container image typically runs recent NOS code; older firmware-specific quirks may not be reproducible unless the image is pinned to a specific version. Simulation catches logic errors, interface contract violations, YANG model gaps, and topology-level failures. It does not guarantee that every possible device state encountered in production has been tested. The goal is significant risk reduction, not zero risk.
The snowflake problem: simulation is most reliable when the production network follows consistent architecture patterns. A network with hundreds of individually customized configurations, each with unique state and history, is harder to represent accurately in simulation. This is one reason automation architecture principles (standardized design patterns, golden templates, SoT-driven configuration) make testing more effective: a well-designed network is more simulatable. The value of simulation compounds with the quality of the network design it represents. Building this repeatability requires close partnership with network engineers, not just automation engineers: the network engineer who understands which sites are truly identical and which carry hidden exceptions is the one who can make the simulation representative. Simulation quality is a joint output of the network design discipline and the automation platform.
9.2.3. Abstraction Strategies#
The network is heterogeneous by nature, and not just in the sense that vendors differ. Any automation platform operating at scale spans physical switching, cloud infrastructure, overlay controllers, service provider WAN infrastructure, and legacy gear simultaneously. Each speaks a different language. The automation platform must not be rebuilt each time a new infrastructure type is added.
This section is about designing the automation layer to absorb change rather than break under it: not just handling today’s heterogeneity but designing for the infrastructure types that will be added next year.
Modern automation platforms span multiple infrastructure domains simultaneously. The architecture that handles this cleanly applies whether the operator is a large enterprise, a service provider managing WAN and customer edge simultaneously, or a hyperscaler running data center fabric alongside cloud networking overlays. The key point: the Executor writes and the Collector reads through the same device interface (gRPC Network Management Interface (gNMI)/NETCONF for physical gear, REST for cloud and controllers), so the interface protocol is a shared concern for both blocks, not a separate design choice per block.
flowchart LR
SoT["Source of Truth"]
Orch["Orchestrator"]
Obs["Observability"]
subgraph Physical["Physical domain"]
PhysIF["Network interface (NETCONF/gNMI)"]
PhysNet["Campus, DC fabric, WAN gear"]
PhysIF --- PhysNet
end
subgraph Cloud["Cloud domain"]
CloudIF["Network interface (async REST)"]
CloudNet["Cloud VPCs / networking"]
CloudIF --- CloudNet
end
subgraph Overlay["Overlay domain"]
OvIF["Network interface (controller API)"]
OvNet["SD-WAN / MPLS PCE / overlay"]
OvIF --- OvNet
end
SoT --> Orch
Orch -->|Executor write| PhysIF
Orch -->|Executor write| CloudIF
Orch -->|Executor write| OvIF
PhysIF -->|Collector read| Obs
CloudIF -->|Collector read| Obs
OvIF -->|Collector read| Obs
The Source of Truth models the full intent across all topology types as a unified service model. The Orchestrator contains branches by network domain. The Executor routes to the correct adapter based on SoT data. Observability spans all layers, feeding the same data pipeline regardless of the underlying infrastructure type.
The key architectural discipline: the branching happens in the Executor and Orchestrator, not in the SoT. The SoT holds a single intent model for the service. How that intent is realized on different infrastructure types is an Executor concern.
9.2.3.1. Dimensions of heterogeneity#
Before choosing an abstraction strategy, it helps to understand the axes of heterogeneity the strategy must absorb. Not all heterogeneity is the same kind of problem.
| Dimension | What varies | Platform design response |
|---|---|---|
| Multi-vendor physical | CLI syntax, YANG models, error codes differ per vendor | Adapter pattern: one module per vendor, same input schema from SoT |
| Firmware generations (same vendor) | Interface behavior changes between OS versions without changing the vendor name | SoT tracks intended OS version; Executor branches by version where behavior differs |
| Physical vs. cloud | Physical: synchronous apply. Cloud: async REST, eventual consistency | Executor handles operation model per infrastructure type; SoT keeps unified intent |
| Physical vs. overlay | SD-WAN/EVPN controllers abstract the physical underlay; automation target is the controller API | Executor routes operations to controller, not directly to devices; Collector reads from controller telemetry |
| Edge vs. core | Same architecture, different risk tolerance and change velocity | Same platform blocks; different workflow configuration, approval gates, and rollout wave sizes |
Each row is an axis of difference the platform must handle without requiring the SoT to encode it. The intent model stays unified; the Executor and Orchestrator absorb the variation. The strategies in the following sections address how.
9.2.3.2. Adapter pattern in the Executor and Collector#
The most common starting point and the most widely implemented strategy. One automation module per vendor, all accepting the same input data structure from the SoT. The SoT stores vendor-neutral intent; the Executor’s adapter layer translates per vendor at execution time. The same principle applies to the Collector: one collection module per vendor or protocol, all delivering a normalized data structure to the Observability pipeline. A vendor that speaks gNMI uses one adapter; a vendor that requires SNMP polling or proprietary REST uses another. The Observability layer sees the same data schema regardless of the upstream collection method.
flowchart LR
SoT["SoT intent (vendor-neutral): vlan_id=210, vlan_name=app-payments"]
Exec["Executor"]
CiscoA["Cisco adapter: IOS-XE NETCONF"]
AristaA["Arista adapter: eAPI / EOS"]
HPEA["HPE adapter: NETCONF + OS-version error handling"]
CiscoD["Cisco Catalyst"]
AristaD["Arista 7050"]
HPED["HPE Aruba 6300"]
CollGNMI["Collector: gNMI adapter"]
CollSNMP["Collector: SNMP adapter"]
Obs["Observability pipeline (normalized schema)"]
SoT --> Exec
Exec --> CiscoA --> CiscoD
Exec --> AristaA --> AristaD
Exec --> HPEA --> HPED
CiscoD --> CollGNMI --> Obs
AristaD --> CollGNMI
HPED --> CollSNMP --> Obs
The adapter pattern is practical to build and well understood. The maintenance burden grows as the device inventory diversifies: each new vendor or OS version requires a new or updated adapter. The pattern scales well for a defined set of vendors; it becomes burdensome when the platform must support a large and frequently changing device catalog.
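A minimal sketch of the dispatch. The adapter bodies are reduced to returning a rendered snippet; real adapters would speak NETCONF or eAPI as described above, and all names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class VlanIntent:
    """Vendor-neutral intent, as stored in the SoT."""
    vlan_id: int
    vlan_name: str

# Hypothetical per-vendor adapters: same input schema, different rendering.
def cisco_adapter(intent: VlanIntent) -> str:
    return f"vlan {intent.vlan_id}\n name {intent.vlan_name}"

def arista_adapter(intent: VlanIntent) -> str:
    return f"vlan {intent.vlan_id}\n   name {intent.vlan_name}"

def hpe_adapter(intent: VlanIntent) -> str:
    return f"vlan {intent.vlan_id}\n    name {intent.vlan_name}"

ADAPTERS = {"cisco": cisco_adapter, "arista": arista_adapter, "hpe": hpe_adapter}

def render(vendor: str, intent: VlanIntent) -> str:
    """Route the vendor-neutral intent to the vendor-specific adapter."""
    try:
        return ADAPTERS[vendor](intent)
    except KeyError:
        raise ValueError(f"no adapter registered for vendor {vendor!r}")

print(render("hpe", VlanIntent(210, "app-payments")))
```

The SoT never learns vendor syntax; adding a vendor means adding one entry to the registry, not touching the intent model.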
9.2.3.3. Community and industry-driven YANG models#
Two industry bodies publish the vendor-neutral YANG models that reduce per-vendor adapter work. IETF models (published as RFCs and Internet-Drafts) define foundational data structures: ietf-interfaces, ietf-routing, ietf-bgp. OpenConfig models, developed by an operator consortium (Google, AT&T, Deutsche Telekom, and others), cover similar ground with a more operationally focused schema and a faster iteration cycle. Both allow the automation platform to write intent once against a standard model and expect it to work on any compliant device.
A YANG module defining an interface looks like this (simplified from ietf-interfaces):
module ietf-interfaces {
  container interfaces {
    list interface {
      key "name";
      leaf name { type string; }
      leaf description { type string; }
      leaf enabled { type boolean; default true; }
    }
  }
}

The same structure appears in OpenConfig (openconfig-interfaces) and in vendor-native models, but with different paths, different leaf names, and different default semantics. The YANG module defines the schema; the protocol (NETCONF or gNMI) transports the data; the adapter layer maps between the standard and the vendor reality.
The practical reality of OpenConfig: vendor implementations vary in completeness. A device may claim OpenConfig support but only implement a subset of the model: reads work, writes do not; or the interface model works but the BGP model is absent.
Beyond missing paths, the more insidious problem is inconsistent data. A device returns a value for an OpenConfig path but in the wrong unit, with a different type, or with fields that should be null populated with vendor-specific defaults. A gNMI subscription that works in SAMPLE mode may fail silently in ON_CHANGE mode on the same device.
These are not rare edge cases. They are the everyday reality of operating a multi-vendor platform that relies on OpenConfig in production. The standard works on paper; the vendor implementation requires the same per-device investigation the standard was supposed to eliminate. OpenConfig reduces that work significantly, but does not eliminate it. Plan for device-specific testing before relying on a new OpenConfig path in production automation.
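One mitigation is a defensive normalization step in the Collector adapter: validate type and shape before ingesting, and reject anything that violates the expected contract rather than silently storing it. A sketch for a single counter value; the tolerated cases are assumptions, not a standard:

```python
def normalize_counter(value) -> int:
    """Pre-flight check for a counter returned from an OpenConfig path.

    Some gNMI implementations deliver integers as JSON strings; anything
    else is treated as a contract violation rather than silently ingested.
    """
    if isinstance(value, bool):  # bool is an int subclass; reject explicitly
        raise TypeError("boolean returned where a counter was expected")
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.isdigit():
        return int(value)  # tolerate string-encoded integers
    raise TypeError(f"unusable counter value: {value!r}")

print(normalize_counter("1845271"))  # 1845271
```

The point is not the specific checks but where they live: in the per-device adapter, so the Observability pipeline downstream only ever sees the normalized schema.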
A note on YANG model families
Three families of YANG models coexist in production networks, and understanding the distinction matters for choosing which to target:
- IETF models are developed through the IETF standards process and published as RFCs or Internet-Drafts. They are the foundational standards: ietf-interfaces, ietf-routing, ietf-bgp. Adoption is broad but slow; the process is thorough but takes years. Vendor implementations often arrive two to four years after publication.
- OpenConfig models are developed by the OpenConfig consortium, an operator-driven group (Google, AT&T, Deutsche Telekom, and others). OpenConfig iterates faster than IETF and is more operationally focused. It covers many of the same functional areas as IETF models but with different schema design choices. Most production gNMI deployments use OpenConfig paths.
- Vendor-native models are each vendor’s own extensions: cisco-ios-xe-native, junos-conf-root, arista-eos-augments. These expose features the standard models do not cover, and they are often required for anything beyond the common-denominator functions that IETF and OpenConfig address. Nokia is an extreme case: almost all operational data on SR OS is accessible only through Nokia-specific YANG models (nokia-conf, nokia-state). The standard models cover a thin surface; vendor-native models are mandatory for any meaningful automation on that platform.
Abstraction buys uniformity at the cost of feature access. Every vendor has proprietary capabilities with no equivalent in standard models. A team using OpenConfig must choose: ignore the feature, add a vendor extension, or maintain a vendor-specific override path. There is no clean answer. In practice, most automation work concentrates in functions well-covered by standard models (VLANs, BGP, interfaces); proprietary features matter but are rarely the core use case. Standards also arrive with adoption lag: a vendor may implement an IETF module two to four years after publication, and only on newer platforms. Gate feature usage on the OS version recorded in the SoT rather than assuming uniform support.
9.2.3.4. SONiC and open networking#
SONiC’s configuration interface is a Redis database with a YANG-structured schema: uniform, programmatic, and designed for automation from the start. The vendor-specific adapter work that consumes effort in multi-vendor physical environments does not apply here. The same automation logic works across all SONiC-based platforms regardless of the underlying hardware vendor.
A VLAN entry in the SONiC CONFIG_DB looks like:
{
  "VLAN": {
    "Vlan210": {
      "vlanid": "210"
    }
  },
  "VLAN_MEMBER": {
    "Vlan210|Ethernet0": {
      "tagging_mode": "untagged"
    }
  }
}

This JSON is written directly to Redis via sonic-cfggen or the SONiC management REST API. There is no CLI to parse, no XML to construct. The automation platform writes structured data; SONiC’s orchagent reconciles it to the hardware forwarding tables.
This is the design direction open networking advocates for: a network interface designed for automation rather than adapted to it. Teams introducing SONiC-based platforms alongside traditional gear will find the SONiC adapter to be the simplest to write and the most reliable to operate.
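Because the interface is structured data, the adapter reduces to a pure function that builds the CONFIG_DB fragment; the actual write (via sonic-cfggen or the REST API) is a transport detail. A sketch, with the function name as an assumption:

```python
import json

def vlan_config_db(vlan_id: int, members: dict) -> dict:
    """Build the CONFIG_DB fragment for one VLAN.

    members maps port name -> tagging mode ("tagged" / "untagged"),
    matching the VLAN_MEMBER key convention shown above.
    """
    key = f"Vlan{vlan_id}"
    return {
        "VLAN": {key: {"vlanid": str(vlan_id)}},
        "VLAN_MEMBER": {
            f"{key}|{port}": {"tagging_mode": mode}
            for port, mode in members.items()
        },
    }

fragment = vlan_config_db(210, {"Ethernet0": "untagged"})
print(json.dumps(fragment, indent=2))
```

No parsing, no templating, no per-model quirks: the adapter is a data transformation, which is exactly why the SONiC adapter tends to be the simplest one in a mixed fleet.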
Vendor offerings: SONiC has moved well beyond its Microsoft Azure origins. Most major switch silicon vendors now offer SONiC-based platforms: Microsoft Azure SONiC (the upstream reference), Arista with SONiC-compatible management APIs on select platforms, Cisco 8000 series with Broadcom-based SONiC support, Dell OS10 (which uses a SONiC-derived architecture), Nvidia Spectrum platforms running SONiC as a first-class option, and a growing number of ODM vendors (Edgecore, Celestica, UfiSpace) shipping branded SONiC platforms. The ecosystem has matured to the point where a SONiC-based platform is commercially available at every price and performance tier.
Mature use cases: data center leaf-spine fabrics are the most established deployment. Hyperscalers have run SONiC at scale for over a decade; enterprise data centers are now following. EVPN/VXLAN overlay, BGP routing, ECMP load balancing, and 400G/800G interface support are well-validated. The automation story is strong: gNMI, YANG, and the Redis-backed CONFIG_DB are native interfaces. A SONiC fleet can be managed with the same tools that manage any other gNMI-enabled platform.
New frontiers: SONiC is beginning to appear outside the traditional DC fabric. Disaggregated routing platforms (where SONiC runs on high-port-count routers rather than just switches) extend the open NOS model to routing use cases. SONiC now includes segment routing support: SRv6 (Segment Routing over IPv6) is available in upstream SONiC and is being used in production by service providers running SONiC-based platforms for traffic engineering and network slicing. Some service providers are also evaluating SONiC for peering edge and broadband aggregation. Campus SONiC deployments remain rare but are technically feasible; the hardware ecosystem for campus form factors is less mature. For teams building new platforms today, the question is no longer whether SONiC is production-ready in the DC: it is. The open question is whether the vendor’s SONiC fork and release cadence will remain aligned with upstream over the platform’s lifecycle.
9.2.3.5. Managing coexistence during firmware migration#
The adapter pattern assumes a known, stable OS version per device. During a rolling firmware upgrade, that assumption breaks: devices in the same role, managed by the same automation platform, run different OS versions simultaneously for days or weeks. The abstraction layer that worked yesterday on firmware 10.12 may not work on firmware 10.13 until a new adapter path is added.
The SoT should record the intended OS version (the target after migration) and the Collector should surface the current OS version as operational state. Before the Executor applies configuration, it should read the current OS version from the Collector or Observability pipeline and select the appropriate adapter path, not assume the intended version is already deployed.
A concrete mechanism: the Executor queries the Observability block (or a lightweight pre-execution check against the device itself) for the running OS version. The adapter registry maps OS version ranges to adapter implementations. The Executor invokes the correct adapter based on current state, not SoT intent. Once a device is upgraded and the running version matches intent, the adapter selection stabilizes. During the migration window, two adapter paths for the same vendor may be active simultaneously.
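A sketch of that registry, with hypothetical adapter names and the AOS-CX version boundary from the implementation example in section 9.3:

```python
def parse_version(v: str) -> tuple:
    """Dotted-integer version string -> tuple that compares correctly."""
    return tuple(int(part) for part in v.split("."))

# Hypothetical adapter registry: (vendor, min incl., max excl., adapter name).
REGISTRY = [
    ("hpe", "10.11.0", "10.13.0", "hpe_legacy_errors"),   # returns vlan-exists
    ("hpe", "10.13.0", "99.0.0",  "hpe_current_errors"),  # returns duplicate-vlan
]

def select_adapter(vendor: str, running_version: str) -> str:
    """Pick the adapter for the OS version actually running on the device,
    as read from the Collector, not the intended version in the SoT."""
    rv = parse_version(running_version)
    for reg_vendor, lo, hi, adapter in REGISTRY:
        if vendor == reg_vendor and parse_version(lo) <= rv < parse_version(hi):
            return adapter
    raise LookupError(f"no adapter for {vendor} {running_version}")

print(select_adapter("hpe", "10.12.1006"))  # hpe_legacy_errors
```

During a migration window both registry rows are live, and the same workflow run can exercise both adapter paths against the same vendor.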
The implementation example in section 9.3 demonstrates this pattern directly: the HPE adapter bug was triggered because one firmware version returned vlan-exists and another returned duplicate-vlan for the same condition. The fix was per-OS-version error handling in the adapter registry. Any automation platform managing a multi-version fleet will encounter this class of problem. Encoding per-version adapter logic, driven by the current OS version read from the Collector, is the systematic mitigation. Chapter 11 addresses how to maintain the OS version mappings as a platform-level catalog rather than as per-playbook constants.
The organizational implication: someone must own the OS version tracking in the SoT, the adapter version registry, and the upgrade sequencing workflow. These three artifacts form the upgrade automation system. Without explicit ownership, they drift independently and the platform’s reliability degrades with every firmware release.
9.2.4. The Network as a Constraint on Every Block#
The three areas covered in this section (programmable interfaces, simulation environments, and abstraction strategies) are not isolated topics. They describe the network’s influence on every block in the NAF framework.
The network’s interface capabilities constrain what the Executor can do: a device that only supports CLI forces the Executor’s adapter into fragile screen scraping; a device with mature gNMI support enables reliable configuration and streaming telemetry. The network’s collection support constrains what the Collector can ingest: a device that does not implement a needed OpenConfig path requires a vendor-specific workaround or a collection gap. The network’s topology constrains what the Orchestrator can safely parallelize: a batch size that is safe for a flat access layer may be catastrophic for a spine-leaf fabric where simultaneous changes to multiple spine nodes can partition the network.
The SoT data model reflects these constraints. OS version fields gate adapter selection. Interface type fields gate collection method. Topology relationships gate Orchestrator concurrency rules. A SoT that records device inventory without recording automation-relevant attributes (interface capabilities, OS version, topology role) is incomplete in ways that only reveal themselves at execution time.
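What “automation-relevant attributes” means in practice can be sketched as a device record whose fields directly gate platform decisions; the field names and the concurrency numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class SotDevice:
    """A SoT device record carrying the attributes the chapter argues
    must be recorded, not just inventory identity."""
    hostname: str
    vendor: str          # gates Executor adapter selection
    os_version: str      # gates per-version adapter behavior
    mgmt_interface: str  # "gnmi" | "netconf" | "cli" -> gates collection method
    topology_role: str   # "spine" | "leaf" | "access" -> gates concurrency

def max_parallel(device: SotDevice) -> int:
    # Topology relationship gating Orchestrator concurrency: never touch
    # more than one spine at a time; access switches can batch widely.
    return {"spine": 1, "leaf": 4, "access": 50}.get(device.topology_role, 1)

spine = SotDevice("spine-1", "arista", "4.31.2F", "gnmi", "spine")
print(max_parallel(spine))  # 1
```

A SoT missing any of these fields forces the decision to be made somewhere else, usually as a hardcoded constant in a playbook, which is exactly the drift the chapter warns against.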
Each architectural decision in Part 2 has a corresponding network-layer constraint that this chapter surfaces. The network is not just the target of automation. It shapes every block that acts on it, and those constraints must be encoded in the platform’s data model, adapter logic, and workflow configuration. Automation that treats the network as a uniform surface will discover the heterogeneity one failed deployment at a time.
9.3. Implementation Example#
9.3.1. From Simulation to Production#
The campus VLAN workflow is ready. Eight weeks of development, three vendors modeled, idempotency tested against a three-node lab. The team wants to deploy it to production: 800 switches, buildings A through F. Before that happens, they run it against simulation.
This example follows the pre-production gate sequence described in section 9.2.2.4, using the Building B scope as the simulation target: 24 switches, 8 Cisco, 8 Arista, 8 HPE.
Step 1: Export topology from the Source of Truth
The SoT holds the Building B switch inventory with intended OS versions:
- 8 Cisco Catalyst 9300: IOS-XE 17.9.4
- 8 Arista 7050: EOS 4.31.2F
- 8 HPE Aruba 6300: AOS-CX 10.12.1006 (older firmware line, pre-10.13)
The SoT export produces a netlab topology definition: node types, vendor-specific image tags mapped to intended OS versions, and the VLAN state the simulation should start with (existing VLANs 100, 150, and 200 already configured on all switches, matching production state).
Step 2: Instantiate the simulation environment
netlab renders the SoT export into a containerlab topology file. containerlab runs on a shared Linux simulation host in the team’s CI environment (a bare-metal server with 64 GB RAM, sufficient for 24 lightweight cEOS/vrnetlab containers simultaneously). The HPE nodes use a vrnetlab-wrapped AOS-CX image pinned to 10.12.1006, matching the intended OS version from the SoT. containerlab starts the topology. All 24 nodes are up and responding to NETCONF and gNMI within ninety seconds.
Step 3: Run the Chapter 7 VLAN workflow against the simulation target
The Orchestrator triggers the workflow with the simulation inventory as the target scope rather than the production inventory. The first four workflow steps complete without issue:
- ServiceNow webhook received, parsed, validated against SoT schema
- Pre-check: VLAN 210 does not exist in simulation (correct)
- SoT updated with VLAN 210 intent
- Approval gate: auto-approved in simulation mode
The fifth workflow step, fan-out execution across all 24 simulated switches via the Executor, returns failures.
Results: 8 Cisco nodes pass. 8 Arista nodes pass. 8 HPE nodes fail.
Error from HPE nodes: vlan-exists. The idempotency check in the HPE adapter handled duplicate-vlan as a no-op but had no handler for vlan-exists. The Executor returned failure, triggering the Saga compensation: VLAN 210 removed from all nodes that had received it.
No production change has occurred. The failure cost is the container startup time and a thirty-minute investigation.
Step 4: Diagnose and fix
The team checks the AOS-CX 10.12 release notes. The vlan-exists error was introduced in 10.11.1, replacing duplicate-vlan for duplicate VLAN detection. The 10.13 release reverted to duplicate-vlan for consistency with the rest of the Aruba product line.
Fix: add vlan-exists to the idempotency error code list in the HPE adapter. The SoT is updated with a note on the OS version boundary (10.11 through 10.12.x returns vlan-exists; 10.13+ returns duplicate-vlan).
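The fix itself is one line in the adapter’s idempotency handling. Sketched, with illustrative names:

```python
# Error codes that mean "VLAN already exists" and should be treated as a
# successful no-op. The fix adds vlan-exists (AOS-CX 10.11 through 10.12.x)
# alongside duplicate-vlan (10.13+ and other firmware lines).
IDEMPOTENT_VLAN_ERRORS = {"duplicate-vlan", "vlan-exists"}

def classify_vlan_error(error_code: str) -> str:
    """Map a device error response to a workflow outcome."""
    return "noop" if error_code in IDEMPOTENT_VLAN_ERRORS else "failure"

print(classify_vlan_error("vlan-exists"))   # noop
print(classify_vlan_error("invalid-vlan"))  # failure
```

The small size of the fix is the point: the expensive part was not the code change but discovering, in a safe environment, that the change was needed.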
Before re-running, the team encodes the expected post-change state as a pyATS or Robot Framework test: after the workflow completes, assert that VLAN 210 exists on all 24 simulation nodes and that the Saga compensation was not triggered. This assertion is added to the simulation pass criteria: the simulation gate closes only when both the workflow completes and the validation assertions pass.
Step 5: Re-run in simulation
The corrected adapter runs against the same containerlab topology. All 24 nodes pass. The Saga completes without compensation. The SoT shows VLAN 210 present on all Building B nodes.
Step 6: Production deployment
The Orchestrator triggers the same workflow against the full production scope: 800 switches, buildings A through F. Each step runs through the same Saga pattern validated in simulation. Zero failures on HPE nodes. The application team’s ServiceNow ticket closes automatically when the Orchestrator completes. The Observability layer validates VLAN 210 presence on all interfaces within the expected time window.
What this demonstrates
The simulation environment is not a verification step that can be skipped when the team feels confident. It is the pre-production gate in the Saga pattern. A simulation topology that mirrors production topology and OS versions is the minimum viable testing environment for a team deploying automation at campus scale. The investment is containerlab setup, OS-version-pinned images, and a SoT export mechanism. The return is the ability to catch device-specific bugs before they become incidents.
The simulation caught this bug not because of exotic testing, but because the OS version in the simulation matched production and the platform ran the exact same workflow code against it. Fidelity of the simulation environment to production state is what makes the catch possible. A simulation environment running the latest vendor images would have missed it.
Chapter 10 covers how to operationalize simulation environments as part of the platform: maintaining containerlab topologies in version control, syncing them with SoT exports on a schedule, and integrating them into CI/CD pipelines so the simulation gate runs automatically on every workflow change.
9.4. Summary#
Chapter 9 closes Part 2 by addressing the block the automation platform was always pointing at: the network itself.
Three ideas anchor the chapter.
The network is not uniform. Any large-scale automation platform interacts with campus switching, data center fabric, cloud VPCs, Kubernetes overlays, overlay controllers, and legacy gear simultaneously. Each exposes a different interface, operates under different consistency semantics, and has different automation maturity. The platform absorbs this heterogeneity through the interface selection hierarchy (gNMI for telemetry, NETCONF for configuration, CLI as a last resort), OS version tracking in the Source of Truth, and vendor-specific adapters in the Executor.
Simulation environments are the pre-production gate. Container-based emulation provides realistic, fast, CI/CD-integrated environments where automation logic is tested against topology and OS versions that mirror production. Failures in simulation are cheap. Failures in production are expensive. Fidelity of the simulation environment to production state is what makes the gate meaningful.
Abstraction strategies extend platform lifetime. The adapter pattern handles today’s multi-vendor physical environment. OpenConfig and YANG translation push toward vendor-neutral models that reduce long-term adapter maintenance. SONiC eliminates the adapter work entirely for platforms that support it. None of these strategies provides a perfect answer; each involves tradeoffs between uniformity and feature access. The right choice depends on the team’s device catalog, operational capacity, and how much of their automation work falls within the standard feature coverage.
With Chapter 9, Part 2 is complete. Six chapters have covered all seven building blocks of the NAF framework: Source of Truth, Collector and Observability (together in Chapter 6), Execution, Orchestration, Presentation, and The Network. These are not independent tools. The SoT feeds the Executor with the intent it needs and the Collector with the expected state to validate against. The Orchestrator sequences both, handles failure, and surfaces progress to the Presentation layer. The Network sits beneath all of them: the constraint that shaped the design of every other block, and the place where automation ultimately produces a result or fails.
Part 3 turns from building the blocks to operating the platform at scale: engineering for reliability, designing for security and compliance, and treating the automation platform as a product with its own lifecycle and operational requirements.
References and Further Reading#
- Network Programmability and Automation, Matt Oswalt, Christian Adell, Scott Lowe, Jason Edelman (O’Reilly, 2nd ed. 2023). The most comprehensive practical guide to network automation tooling, covering NETCONF, gNMI, YANG models, and automation frameworks in depth.
- Network Programmability with YANG, Benoit Claise, Joe Clarke, Jan Lindblad (Pearson, 2019). The definitive reference on YANG data modeling: how IETF and OpenConfig models are structured, how to read and write YANG modules, and how NETCONF and RESTCONF use them.
- Cisco pyATS: Network Test and Automation Solution, John Capobianco, Dan Wade (Cisco Press). A comprehensive guide to pyATS and Genie for network test automation: test scripts, parsers, structured state validation, and CI/CD pipeline integration.
- Infrastructure as Code, Kief Morris (O’Reilly, 2nd ed. 2021). Not network-specific, but foundational for understanding how the principles behind repeatable, testable, version-controlled infrastructure apply directly to network automation platform design.