5. Execution#
The playbook had been running perfectly for months: ten access switches in the lab, two minutes end to end, clean results every time. When the team decided to roll it out to the full inventory of 800 switches, nobody expected trouble. The first 600 devices updated without issue. Then the job slowed. Then it stalled. The RADIUS server, suddenly receiving 150 simultaneous SSH authentication requests, started rejecting connections. Ansible hung on 150 devices mid-execution. An engineer killed the job.
What came next was the harder problem: nobody knew which devices had been updated and which hadn't. No execution state had been recorded. They re-ran the playbook and hoped idempotency would save them. It did, that time. But the incident revealed something important: the execution layer had been designed for the lab, not for the network. Speed, parallelism, error boundaries, state tracking: none of it had been considered. The automation worked; the execution architecture didn't.
Most people think of network automation as: “take the same data, and without connecting as a human in the CLI, do something on the network.” I would bet most people starting with network automation begin there. And yes, that is what the Execution block does. But as you’ve seen throughout this book, network automation is bigger than this first step, and Execution is just one component within an architecture.
The Execution block directly interacts with the network for the most dangerous operations (e.g., config changes or reboots). So don’t skip this chapter; it’s crucial to do it right.
In this chapter, we’ll cover the goals and pillars this block provides, and the internal capabilities it needs to achieve them.
5.1. Fundamentals#
5.1.1. Context#
Execution defines how to execute actions. The "what" comes from the Intent block, and the "when" comes from Orchestration.
In early network automation projects, a script that bounces an interface for a given device and interface name can be the first win. That journey can grow into a much more sophisticated system.
5.1.2. Goals#
What does an execution system actually need to do? Five things matter:
Get the right data into place. Before executing anything, you need to know what to execute (the intent), where to execute it (which devices), and how to access them (credentials, connection details). This means pulling inventory and fetching the intended state from the Source of Truth (more about it in Chapter 4), and sometimes querying observability data to understand current conditions. Without this integration, your execution engine is just blindly running commands. For your mental health, avoid it.
Start when needed, whether that’s now or later. Sometimes you need immediate execution: an engineer clicks “deploy” and expects it to happen right away. Other times, execution should wait: deploy during the maintenance window, wait for a device to come online, or react to an event from observability. The system needs to support both synchronous and asynchronous triggers.
Track state over time. Network-wide operations aren’t instant. You might deploy to hundreds of devices over hours. Which devices succeeded? Which failed? Which are still pending? State management also enables idempotency (run the same task twice, get the same result), rollback (undo what you just did), and resume (pick up where you left off after a failure). Without state tracking, every execution is a one-shot gamble.
Execute reliably at scale. A script that works for 5 devices often breaks (or degrades) at 500. You need parallel execution, error handling, retries, rate limiting, and dry-run capabilities. The system should gracefully handle partial failures, provide clear feedback on what went wrong, and never leave the network in an undefined state. Different operations need different strategies. Some are fast and parallel, others slow and serial.
Work with any network device or platform. Your network probably has Cisco, Arista, Juniper, cloud Application Programming Interfaces (APIs), Linux boxes, firewalls, load balancers. Each one speaks different protocols and has different operational patterns. Your execution layer needs adapters for all of them, with a common interface so the rest of your automation doesn’t care about the differences.
With these goals in mind, what architectural capabilities do you actually need?
5.1.3. Pillars#
Each goal translates to specific capabilities:
Data integration layer. Your execution engine doesn’t exist in isolation. It needs programmatic access to your Source of Truth (inventory, credentials, intent), your observability systems (current state, health metrics), and potentially other systems (ticketing, change management, approval workflows). This means implementing API clients, handling authentication securely, caching data appropriately, and validating that you have everything needed before starting execution.
Flexible triggering mechanisms. Multiple ways to start execution matter. Synchronous triggers include Representational State Transfer (REST) APIs (direct invocation), webhooks (external system integration), and remote procedure calls. Asynchronous triggers include event listeners (react to observability alerts, device state changes, or external events), schedulers (cron-like periodic execution or one-time scheduled tasks), and message queue consumers. Basic chaining helps too: one execution completes and triggers another.
State management infrastructure. This goes beyond “does the device have this config.” You need execution state (which tasks are running, pending, completed, failed), desired state (what should be configured), and actual state (what is configured). This requires persistent storage, transaction support for atomic operations, locking mechanisms to prevent concurrent conflicts, and a clear state model. The infrastructure must handle distributed scenarios where state is shared across multiple execution workers.
Robust execution engine. This is the heart of the system. It supports both imperative workflows (run command 1, then command 2, then command 3) and declarative approaches (make the device look like this, figure out the steps yourself). The engine handles concurrency (executing against multiple devices in parallel or in batches), implements retry logic with exponential backoff, provides dry-run capabilities (predict what would happen without doing it), captures detailed execution logs, and gracefully handles partial failures. Error handling is critical: when something fails, should you abort everything, continue with other devices, or retry?
Protocol abstraction layer. Network devices are a heterogeneous mess. Some speak Secure Shell (SSH) and expect Command Line Interface (CLI) commands. Others use NETCONF or RESTCONF. Modern devices support gRPC Network Management Interface (gNMI) for streaming telemetry and configuration. Cloud platforms expose REST APIs. Your execution system needs adapters for all of these, presenting a unified interface to higher-level logic. This layer handles connection pooling, session management, authentication, command formatting, and response parsing. Good abstraction means you can add support for a new device type without rewriting your entire automation.
5.1.4. Scope#
The execution block sits between planning and action. It receives instructions from Intent (what should be configured) and Orchestration (when and how to do it), then interacts directly with network devices to make changes happen.
In scope:
- Connecting to network devices via any supported protocol
- Executing configuration changes, operational commands, file transfers, reboots
- Managing execution state and tracking progress
- Handling errors, retries, and rollbacks
- Providing feedback to Orchestration
Out of scope:
- Deciding what to configure (that’s Intent)
- Orchestrating complex multi-step workflows and deciding when to trigger (that’s Orchestration)
- Long-term storage and analysis of execution results (that’s Observability)
Think of execution as the engine in a car: it provides power and motion, but doesn’t decide where to go or when to turn. Those decisions come from the driver (Orchestration) following a map (Intent).
5.2. Functionalities#
Five key functional areas work together to safely and reliably modify network state:
- Data Integration: Retrieving inventory, credentials, intent, and observability data from upstream systems
- Triggering: Initiating execution through synchronous or asynchronous mechanisms
- State Management: Tracking execution progress, enabling idempotency and rollback
- Engine: The core logic that executes tasks with appropriate concurrency and error handling
- Network Adapter: Protocol-specific interfaces for communicating with diverse network devices
These components form a pipeline: data integration provides the inputs, triggering starts the process, the engine executes tasks using network adapters, and state management tracks everything throughout.
```mermaid
graph LR
  subgraph Goals
    G1[Get the right data to the right place]
    G2[Start when needed]
    G3[Track state over time]
    G4[Execute reliably at scale]
    G5[Work with any network device or platform]
  end
  subgraph Pillars
    P1[Data integration layer]
    P2[Flexible triggering mechanisms]
    P3[State management infrastructure]
    P4[Robust execution engine]
    P5[Protocol abstraction layer]
  end
  subgraph Functionalities
    F1[Data Integration]
    F2[Triggering]
    F3[State Management]
    F4[Engine]
    F5[Network Adapter]
  end
  G1 --> P1 --> F1
  G2 --> P2 --> F2
  G3 --> P3 --> F3
  G4 --> P4 --> F4
  G5 --> P5 --> F5
```
The inner architecture looks like this:
```mermaid
graph TD
  A[Data Integration] --> C[Engine]
  B[Triggering] --> C[Engine]
  C --> E[State Management]
  E --> C
  C --> D[Network Adapter]
  classDef component fill:#e1f5ff,stroke:#4a90e2,stroke-width:2px;
  class A,B,C,D,E component;
```
5.2.1. Data Integration#
Before executing anything, you need data. The execution engine pulls information from multiple sources to understand what to do and where to do it.
5.2.1.1. Inventory#
Inventory data defines the target devices. We covered this in Chapter 4, but at minimum:
- Target: IP address or FQDN to reach the device
- Platform/OS: Device type, vendor, OS version (determines which protocol and commands to use)
- Credentials: Username/password, SSH keys, API tokens, certificate paths
- Connection parameters: SSH port, timeout values, API endpoints
- Metadata: Site location, role (spine/leaf/edge), environment (prod/staging)
Inventory data should come from the Source of Truth. Using files only works well in very small environments. Your execution engine queries the Source of Truth API at runtime (or receives the data via Orchestration when triggering). Some systems cache inventory for performance with configurable refresh intervals, or leverage event-driven inventory that combines the trigger with the associated information.
Never store credentials in inventory files or logs. Use a secrets management system (HashiCorp Vault, AWS Secrets Manager, CyberArk) and inject credentials at execution time. The Source of Truth should provide pointers to that sensitive data and retrieve it at runtime. Your execution engine should support multiple credential sources and per-device credential overrides.
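A minimal sketch of that pattern: the inventory carries only a pointer to the secret, and the execution engine resolves it at runtime. Environment variables stand in here for a real secrets backend (Vault, AWS Secrets Manager, CyberArk); the inventory entry and variable names are illustrative.

```python
import os

# Hypothetical inventory entry: the Source of Truth stores a *pointer* to the
# secret (here, the name of an environment variable), never the secret itself.
inventory = {
    "sw-edge-01": {
        "host": "192.0.2.10",
        "platform": "ios",
        "credential_ref": "NET_SSH_PASSWORD",   # pointer, not a password
    }
}

def resolve_credentials(device: dict) -> dict:
    """Resolve the credential pointer at execution time.

    With a real secrets backend the shape is the same: follow the
    reference, fail loudly if the secret is missing.
    """
    ref = device["credential_ref"]
    secret = os.environ.get(ref)
    if secret is None:
        raise RuntimeError(f"secret {ref!r} not available at execution time")
    return {"host": device["host"], "password": secret}

os.environ["NET_SSH_PASSWORD"] = "s3cret"   # injected by the runtime, never stored
conn = resolve_credentials(inventory["sw-edge-01"])
print(conn["host"])   # the secret itself never appears in inventory or logs
```

Failing loudly when the reference cannot be resolved is the important part: a device with unresolved credentials should never reach the execution engine.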
5.2.1.2. Intended Data#
Intended data is what you want to configure or change. This comes from your Intent/Source of Truth block and might include:
- Configuration artifacts: Ready-to-deploy artifacts, either structured data consumed directly through APIs or a set of CLI commands that build the configuration.
- Commands: Specific CLI commands or API calls to execute for device operations.
- Files: Software images for upgrades, configuration files for import.
Should the execution engine fetch intent data itself, or should Orchestration pass it in? Both work. Fetching intent directly couples execution to your Source of Truth but ensures data is always fresh. Receiving intent as parameters makes execution more generic but requires Orchestration to handle data fetching.
My recommendation is to consume configuration artifacts directly. The Intent block should be responsible for generating the configuration artifact with its own logic, rendering configuration templates with structured data representing state (VLAN configurations, routing policies, ACLs).
Configuration artifacts can go stale between the moment they are generated and the moment they are consumed by the Executor. If the SoT data changes after a pre-generated artifact was cached, the Executor may apply an outdated configuration. The staleness window is the time between the last SoT commit and the execution trigger. For automation that fetches artifacts at trigger time (rather than from a cache), this window is minimal. For pipelines that pre-render artifacts on a schedule and store them for later use, the window can grow to hours. Where data freshness matters, the Orchestrator or Execution trigger should always pull a fresh artifact from the SoT rather than from a cached file.
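The freshness decision can be reduced to a simple check on the staleness window. A minimal sketch, assuming a hypothetical five-minute staleness policy and illustrative timestamps:

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(minutes=5)   # assumed policy: tolerate 5 minutes

def should_refetch(rendered_at: datetime, triggered_at: datetime,
                   max_staleness: timedelta = MAX_STALENESS) -> bool:
    """True when the staleness window (trigger time minus last render from
    the SoT) exceeds policy, i.e., the Executor should pull a fresh artifact."""
    return (triggered_at - rendered_at) > max_staleness

rendered = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
print(should_refetch(rendered, rendered + timedelta(minutes=2)))  # False: within the window
print(should_refetch(rendered, rendered + timedelta(hours=3)))    # True: pre-rendered artifact may be stale
```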
5.2.1.3. Observed Data#
Observability validation usually belongs to Orchestration, which can decide whether to proceed. In simple cases where Orchestration does not exist yet, Execution can incorporate this role.
Here are some use cases where observability data is used close to the execution flow:
- Pre-execution validation: Is the device reachable? Is there enough disk space for an upgrade? Are there active sessions that would be disrupted?
- Graceful degradation: Before rebooting a switch, query observability to see if there’s redundant connectivity. If not, delay or abort.
- Conditional logic: Only apply a configuration if certain conditions are met (CPU below threshold, no active alarms, time-of-day within window)
- Drain capacity: Before executing a draining operation, ensure the available capacity is enough to absorb the change without breaking SLOs.
This requires integration with your observability systems. The execution engine might query APIs for current metrics, check device status, or wait for readiness signals. Orchestration can also do this integration, then gate execution based on freshness and risk.
A common pattern: tools like Ansible’s “network health checks” modules or Nornir’s data-gathering tasks run observability queries before and after execution, comparing results to detect unexpected changes. The “before/after snapshot” approach catches regressions that would otherwise go unnoticed.
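A minimal sketch of that before/after comparison, using Python's standard `difflib`. The "show" outputs are invented; in practice they would come from Netmiko or Nornir tasks against the device.

```python
import difflib

# Invented pre/post-change snapshots; real ones come from device sessions.
before = """interface Vlan100
 description Engineering
 ip address 10.0.100.1 255.255.255.0
""".splitlines()

after = """interface Vlan100
 description Engineering
 ip address 10.0.100.1 255.255.255.0
 shutdown
""".splitlines()

def snapshot_diff(before_lines, after_lines):
    """Return only added/removed lines; anything unexpected is a regression candidate."""
    return [line for line in difflib.unified_diff(before_lines, after_lines, lineterm="")
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]

changes = snapshot_diff(before, after)
print(changes)   # ['+ shutdown'] -> we did not intend a shutdown; flag it
```

The same helper works for operational state (routing tables, neighbor counts), not just configuration, which is where it catches the regressions that config diffs miss.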
5.2.2. Triggering#
Execution doesn’t start by itself. Something has to trigger it. Modern execution systems support multiple trigger mechanisms:
Synchronous (immediate, blocking):
- REST API calls: An external system or user invokes an API endpoint, execution runs, and the response includes results. Simple and direct.
- Webhooks: External systems (Git, ticketing, CI/CD) send HTTP requests when events happen. Common pattern: a Git push triggers config deployment.
- CLI commands: Engineers run commands that directly invoke execution (`ansible-playbook`, `terraform apply`, custom scripts). These CLI commands are part of the lightweight presentation layer offered by these tools (more in Chapter 8).
Asynchronous (delayed or event-driven):
- Event listeners: React to events from observability (device down, threshold crossed), message queues, or external systems. Event-Driven Ansible (EDA) built this pattern explicitly: listen to events, match rules, trigger playbooks automatically.
- Message queues: Tasks submitted to a queue (RabbitMQ, Kafka, AWS SQS), workers pull and execute them. Enables buffering, priority queuing, and rate limiting.
The right approach depends on your use case. Immediate changes (fixing a production issue) need synchronous triggers. Reactive automation (responding to alerts) uses event listeners. Scale considerations matter too, as we’ll cover in Chapter 11.
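To make the message-queue trigger concrete, here is a sketch of the worker-pull pattern using Python's in-process `queue` as a stand-in for RabbitMQ, Kafka, or SQS; the task fields are illustrative. With a real broker, the worker loop (pull, execute, acknowledge) keeps the same shape.

```python
import queue
import threading

tasks: "queue.Queue" = queue.Queue()
results, results_lock = [], threading.Lock()

def worker():
    while True:
        task = tasks.get()
        if task is None:            # sentinel: shut this worker down
            tasks.task_done()
            return
        # Hypothetical execution: record what would run against the device.
        with results_lock:
            results.append(f"ran {task['action']} on {task['device']}")
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for device in ("sw1", "sw2", "sw3", "sw4"):
    tasks.put({"device": device, "action": "backup-config"})
for _ in workers:
    tasks.put(None)                 # one sentinel per worker
tasks.join()
for w in workers:
    w.join()
print(sorted(results))
```

The queue is what gives you buffering and rate limiting for free: producers can enqueue faster than workers execute, and the number of workers caps concurrent device connections.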
Triggering usually comes from Orchestration, though if that doesn’t exist, other blocks can trigger execution directly. For example, triggering directly from the Presentation layer as a human-exposed task, or from the Source of Truth block.
One important consideration: what counts as an execution? For me, an execution is more than a simple task. It can include multiple chained tasks that don’t require a complex workflow (that’s Orchestration’s job). The boundary gets blurry (again). For example, this sequence can still be one execution flow: upgrade firmware, wait for a device reboot, verify operation, update inventory. But if it requires human validation or conditional branching based on broader validation, it belongs in the Orchestration block.
5.2.3. State Management#
State management is about tracking where you are in an execution and what the world looks like so you can make intelligent decisions. There are two main categories:
Execution state (tracking the automation itself):
- Which devices have been processed?
- Which tasks succeeded, failed, or are pending?
- If a task failed with a transient error, can we retry?
- If execution was interrupted, can we resume where we left off?
Actual infrastructure state (tracking the target configuration state):
- What’s currently configured?
Two approaches exist:
Stateless/Agentless (e.g., Ansible). Each execution runs independently, with no persistent state between runs. Every execution starts fresh: gather current state, compute the diff, apply changes. Simpler to operate (no state database), but less efficient (you re-discover everything each time), and there is no built-in rollback.
Stateful (e.g., Terraform). The system maintains a persistent state file tracking what was last deployed. On each run, compare desired state (your config) to recorded state (what you deployed last time) to actual state (what’s on the device). This enables precise change planning, efficient execution (only change what’s different), and rollback (revert to previous state). But now you have a state file to protect, lock, and sync across multiple operators.
Transactional execution is the next step when you want stronger safety guarantees than “best effort.” A transaction groups multiple device changes into a single unit: either everything is applied and validated, or the system rolls back to the previous state. This requires three things:
- A clear boundary (what operations are inside the transaction)
- A durable record of pre-change state (for rollback)
- A locking or lease mechanism to prevent concurrent changes from breaking atomicity
In practice, most network automation uses “transaction-like” behavior rather than strict ACID semantics, because not all devices support native commit/rollback. Still, you can approximate transactions by taking snapshots, using candidate configs (when supported), enforcing exclusive locks, and making rollback paths a first-class part of the execution flow.
The challenge? State sync and sharing. If multiple people or systems modify network devices, state gets stale. Terraform addresses this with remote state storage and locking. Ansible avoids the problem by being stateless but sacrifices efficiency. The middle ground: cache current state temporarily, validate before each operation, and build “eventual consistency” patterns where brief mismatches are acceptable.
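As an illustration of the execution-state side, the sketch below stores per-device status in SQLite so an interrupted job can be resumed; the table and job names are invented.

```python
import sqlite3

# Minimal execution-state store: one row per (job, device). A crashed job can
# be resumed by selecting the devices still marked 'pending'.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE execution_state (
    job_id TEXT, device TEXT, status TEXT,
    PRIMARY KEY (job_id, device))""")

def start_job(job_id, devices):
    db.executemany("INSERT INTO execution_state VALUES (?, ?, 'pending')",
                   [(job_id, d) for d in devices])
    db.commit()

def mark(job_id, device, status):
    db.execute("UPDATE execution_state SET status = ? "
               "WHERE job_id = ? AND device = ?", (status, job_id, device))
    db.commit()

def pending(job_id):
    rows = db.execute("SELECT device FROM execution_state "
                      "WHERE job_id = ? AND status = 'pending'", (job_id,))
    return [r[0] for r in rows]

start_job("vlan-rollout", ["sw1", "sw2", "sw3"])
mark("vlan-rollout", "sw1", "success")
mark("vlan-rollout", "sw2", "failed")
print(pending("vlan-rollout"))  # ['sw3'] -> resume here instead of starting over
```

In the opening anecdote, a table like this would have answered "which devices had been updated" immediately. A production version needs shared storage and locking, but the model is the same.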
5.2.3.1. Idempotency#
Idempotency means running the same automation multiple times produces the same result. Apply a VLAN config once: VLAN is created. Apply it again: nothing changes (VLAN already exists). This is critical for reliability: if execution fails halfway through, you can safely re-run it without creating duplicates or breaking things.
How tools achieve idempotency:
- Built-in modules (e.g., Ansible): Most Ansible modules are idempotent by design. The `ios_vlans` module checks if the VLAN exists before creating it. If it’s already there with correct config, Ansible reports “ok” (no change). This requires the module author to implement checking logic.
- State comparison (e.g., Terraform): Terraform compares desired state to current state, computes a diff, and only applies differences. If you run `terraform apply` twice with no changes to your config, the second run does nothing.
- Declarative APIs (e.g., NETCONF/YANG): Some protocols handle idempotency natively. NETCONF’s `<edit-config>` with the `merge` operation is inherently idempotent: it merges your config with the existing config, creating or updating as needed.
- Manual checking (e.g., raw scripts): If you’re writing Python or Go scripts, you implement idempotency yourself: query current state, compare to desired state, and only make changes if there’s a diff.
Idempotency is harder than it looks. What if the VLAN exists but has wrong settings? Should you update it (potentially disrupting traffic) or report a conflict? What about transient failures (device temporarily unreachable)? Retry logic must distinguish between “operation already done” (idempotent success) and “operation failed” (real error).
Perfect idempotency adds overhead. Every operation requires querying current state first. For large-scale deployments, this slows things down. Some teams accept “mostly idempotent” (works 99% of the time) rather than “perfectly idempotent” (works always, but runs slowly).
Idempotency is a requirement for the declarative approach. Someone has to carry the burden of providing idempotence and hiding the complexity for everyone else.
5.2.4. Engine#
The execution engine is the core logic that takes a task (“configure this VLAN on these 50 switches”) and executes it safely and efficiently. This is not Orchestration. The Orchestration block coordinates multiple execution tasks across time and dependencies. The engine just runs one task well (or a simple chain of tasks).
What it does:
- Accepts a task definition (what to do, which devices)
- Breaks it into atomic operations (per-device actions)
- Executes operations with appropriate concurrency (serial, parallel, batched)
- Handles errors, retries, and rollbacks
- Reports progress and results
An execution engine might configure 100 routers in parallel, but it doesn’t decide when to configure them, which configs to apply based on business logic, or what to do next based on results. Those are Orchestration concerns. Execution is the workhorse; Orchestration is the foreman.
That said, execution engines often support simple chaining: “do task A, then task B on the same devices.” That’s basic sequencing, not full Orchestration. When you need complex workflows (wait for external approval, branch based on results, coordinate across multiple systems) you need a real Orchestration layer (more in Chapter 7).
5.2.4.1. Languages#
How do you define the execution logic? Different language styles have different tradeoffs:
| Style | Examples | Strengths | Tradeoffs |
|---|---|---|---|
| Domain-specific languages (DSLs) | Ansible (YAML), Terraform (HCL) | Lower barrier to entry, self-documenting intent, built-in execution semantics | Limited flexibility, harder debugging in complex scenarios, conditional logic can become awkward |
| General-purpose programming | Nornir (Python), custom Python/Go scripts | Full flexibility, strong debugging tooling, easy library reuse | Higher skill requirement, more code to maintain, less standardization across teams |
Many teams use DSLs (Ansible) for common patterns and drop into custom code (Python modules, plugins) for complex edge cases. This balances accessibility with power. The human factor usually decides which approach wins; more on that in Chapter 13.
5.2.4.2. Imperative versus Declarative#
Imperative: You specify how to do something, step by step.
Declarative: You specify what the end state should be, and the tool figures out how.
In short, prefer declarative when it fits.
A good way to see the difference is to look at Ansible, which supports both:
Imperative Ansible:
```yaml
- name: Create VLAN 100
  cisco.ios.ios_command:
    commands:
      - vlan 100
      - name Engineering
```

You’re literally telling Ansible which commands to run. If the VLAN exists, this still runs (though it might be idempotent depending on the module).
Declarative Ansible:
```yaml
- name: Ensure VLAN 100 exists
  cisco.ios.ios_vlans:
    config:
      - vlan_id: 100
        name: Engineering
    state: merged
```

You describe desired state. Ansible figures out what commands to run. If VLAN 100 already exists with the correct name, Ansible does nothing.
| Approach | Strengths | Tradeoffs |
|---|---|---|
| Declarative | Idempotent by nature; intent is easier to read; fewer operator mistakes because the tool handles edge cases | Depends heavily on module/provider quality; less control over execution path; troubleshooting can be harder when internals are abstracted |
| Imperative | Full control over each step; execution flow is explicit; works in almost any environment with CLI access | More code and testing effort; idempotency is on you; long-term maintenance grows quickly as edge cases accumulate |
Most teams use declarative when possible, imperative when necessary. Just remember: declarative means someone else carries the burden, even if it looks simple from the outside.
5.2.4.3. Serial versus Parallel#
Two options exist for running tasks:
Serial execution: Process devices one at a time. Configure device 1, wait for completion, configure device 2, etc. Safe (you see failures immediately and can stop), but slow (100 devices = 100× the time of one device).
Parallel execution: Process multiple devices simultaneously. Configure devices 1-10 at once, 11-20 next, etc.
```mermaid
flowchart TD
  subgraph Serial
    S1[Device 1] --> S2[Device 2] --> S3[Device 3]
  end
  subgraph Parallel
    P0[Start] --> P1[Device 1]
    P0 --> P2[Device 2]
    P0 --> P3[Device 3]
  end
```
Why Nornir exists: Ansible was originally serial. For network automation at scale, this was painfully slow. Ansible added `strategy: free` and later `forks` to enable parallelism, but it’s still fundamentally designed for sequential execution. Nornir was built from the ground up with parallelism: it uses threading by default to execute against multiple devices concurrently. This makes it 10-100× faster for large device counts.
Considerations for parallel execution:
- Control plane impact: Hitting 1000 devices simultaneously can overload management networks, data sources and authentication systems, or device control planes.
- Dependency handling: If devices depend on each other (spine before leaf), you need ordering. Pure parallelism doesn’t work.
- Error blast radius: If your automation has a bug, parallel execution deploys it to many devices before you notice. Serial execution fails on device 1, and you stop before damaging devices 2-100.
My recommendation: use parallel execution with batching. Process devices in groups of 10-50 (I used to call them “waves”), verify success of each batch before continuing. This balances speed with safety. More on deployment strategies in Chapter 10.
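A sketch of that wave pattern using Python's standard `concurrent.futures`, with a simulated per-device task and an assumed 10% per-wave failure threshold; device names and the threshold are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-device task; a real one would push config through an adapter.
def configure(device: str) -> bool:
    return "bad" not in device      # simulate failures on the "bad" devices

def run_in_waves(devices, wave_size=10, max_failure_ratio=0.1):
    """Deploy in waves; halt the rollout if a wave exceeds the failure threshold."""
    completed, failed, halted = [], [], False
    for i in range(0, len(devices), wave_size):
        wave = devices[i:i + wave_size]
        with ThreadPoolExecutor(max_workers=len(wave)) as pool:
            results = list(pool.map(configure, wave))
        for device, ok in zip(wave, results):
            (completed if ok else failed).append(device)
        if sum(1 for ok in results if not ok) / len(wave) > max_failure_ratio:
            halted = True           # stop before the blast radius grows
            break
    return completed, failed, halted

devices = ([f"sw-{n}" for n in range(1, 11)]        # wave 1: all healthy
           + ["sw-bad-1", "sw-bad-2"]               # wave 2 starts here
           + [f"sw-{n}" for n in range(11, 19)]
           + [f"sw-{n}" for n in range(19, 29)])    # wave 3: never reached
done, bad, halted = run_in_waves(devices)
print(len(done), bad, halted)   # 18 ['sw-bad-1', 'sw-bad-2'] True
```

Because the second wave fails two devices out of ten, the rollout halts before touching the third wave, which is the whole point of batching.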
5.2.4.4. Dry-run (Plan Mode)#
Dry-run is the execution engine’s “plan” phase: simulate changes without touching devices, then surface a concrete diff and risks. A good dry-run pulls current state, computes the exact device-level operations that would run, and validates prerequisites (reachability, schema, dependency ordering). It should be fast, deterministic, and reproducible so reviewers can trust it.
Example (Terraform plan for an AWS VPC with reviewable changes):
```
Terraform will perform the following actions:

  # aws_vpc.main will be created
  + resource "aws_vpc" "main" {
      + cidr_block           = "10.10.0.0/16"
      + enable_dns_support   = true
      + enable_dns_hostnames = true
      + tags                 = {
          + "Name" = "core-vpc"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.
```

This is what reviewers sign off on: the exact resources and attributes that would change, without touching the cloud.
In practice, dry-run is what makes approvals meaningful and rollbacks less likely. It gives humans a clear view of what will happen, and it gives the engine a chance to reject unsafe or inconsistent changes early. If the platform supports candidate configs or commit-check, the engine should use those. Otherwise, dry-run is a computed diff plus pre-checks, not a guarantee.
In my experience, offering dry-run capabilities is very convenient in the early days of an automation project to get network engineers’ buy-in. Later, the relevance decreases.
5.2.4.5. Resiliency#
Network execution is inherently unreliable (as most distributed systems are): devices reboot, connectivity hiccups, and control planes get overwhelmed. Other dependencies, such as the Intent block, may also have temporary issues. Resilient execution handles these gracefully:
Retry logic:
- Transient failures (connection timeout, temporary CPU spike) should trigger retries with exponential backoff
- Distinguish retryable errors (timeout) from permanent errors (authentication failure, syntax error)
- Limit retry attempts to avoid infinite loops
Timeout strategies:
- Connection timeouts: how long to wait for device response before giving up
- Task timeouts: how long a complete operation can run (prevents hanging on stuck devices)
- Global timeouts: maximum execution time for entire job
Error handling:
- Fail fast: one device fails, abort everything (safe but inefficient)
- Keep going: log failure, continue with other devices (efficient but might spread damage)
- Threshold-based: if >10% of devices fail, stop (balanced)
Rollback capabilities:
- Take config snapshots before changes
- If execution fails, automatically restore snapshots
- Support dry-run mode: show what would change without actually changing it
Circuit breakers:
- If a device consistently fails, mark it unhealthy and skip it temporarily
- Prevents wasting time repeatedly trying to connect to dead devices
Checkpointing:
- Save progress periodically
- If execution crashes, resume from last checkpoint rather than starting over
Building all this is hard. Most teams start with basic retry logic and add complexity as they hit failures in production.
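That basic retry logic, with exponential backoff and the transient/permanent distinction from above, can be sketched as follows; the error classes are illustrative.

```python
import time

class TransientError(Exception):
    """Retryable: timeouts, temporary CPU spikes."""

class PermanentError(Exception):
    """Not retryable: authentication failures, syntax errors."""

def with_retries(operation, max_attempts=4, base_delay=0.01):
    """Retry transient failures with exponential backoff; surface permanent ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except PermanentError:
            raise                   # never retry a permanent error
        except TransientError:
            if attempt == max_attempts:
                raise               # bounded attempts: no infinite loops
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection timed out")
    return "ok"

print(with_retries(flaky))  # 'ok' after two backoff retries
```

The hard part in practice is classifying errors correctly: a wrapper like this is only as good as the mapping from raw device/library exceptions to transient versus permanent.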
One useful pattern: treat “safety modes” as first-class execution states. Render intended config, parse current config, gather facts, then apply (merged/replaced/overridden) only after validation gates pass. This gives you deterministic checkpoints before touching production.
5.2.5. Network Adapter#
Network devices speak different protocols. The network adapter layer abstracts these differences so upper layers don’t care whether they’re talking to a Cisco switch via SSH or an Arista switch via a REST API.
5.2.5.1. Interfaces#
Different devices support different management interfaces:
| Interface | Description | Example Libraries/Tools | Typical Use |
|---|---|---|---|
| SSH / CLI | Most universal, least structured (text in/text out) | Netmiko (Python), Paramiko (Python), scrapli (Python), scrapligo (Go) | Legacy platforms, vendor-specific operational commands |
| NETCONF / RESTCONF | Structured model-driven management (YANG-based) | ncclient (Python), scrapli-netconf (Python), nemith/netconf (Go) | Declarative configuration and standardized data models |
| gNMI / gNOI | gRPC-based interfaces for config/state and operational RPCs | pygnmi (Python), gnmic (Go CLI), openconfig/gnmi (Go) | Streaming telemetry and modern operations workflows |
| REST APIs | HTTP/JSON or XML APIs, often platform-specific | requests/httpx (Python), net/http + OpenAPI-generated clients (Go) | Controllers and cloud/network platform APIs |
| JSON-RPC / vendor gRPC | Structured RPC patterns used by specific network operating systems | Arista eAPI (JSON-RPC), vendor gRPC SDKs | Fast remote procedure execution with structured payloads |
Some libraries provide an abstraction layer with a common interface across platforms, such as:
- NAPALM (Network Automation and Programmability Abstraction Layer with Multivendor support): Python library that provides a unified API across vendors.
  - Supports Cisco, Arista, Juniper, and others
  - Under the hood, uses the appropriate transport per platform (SSH, eAPI, NETCONF)
  - Use case: multi-vendor environments where you want consistent automation code
- Pybatfish: Not a device transport library, but a strong companion for execution safety. A practical pattern is “plan/apply with NAPALM or Ansible, validate network behavior with pybatfish before and after changes.”
Why use an abstraction layer? Instead of writing different code for Cisco vs. Juniper, NAPALM gives you one API. Call get_facts() and NAPALM figures out how to retrieve device facts whether it’s a Cisco IOS device (via SSH), an Arista EOS device (via eAPI), or a Juniper Junos device (via NETCONF). The tradeoff: abstraction hides vendor-specific features. For common operations (get config, push config, get facts), it’s great. For more advanced operations or exotic vendor features, you drop back to native interfaces, because the abstraction layer is unlikely to implement them.
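The dispatch idea behind such a layer can be illustrated with a toy sketch. This is emphatically not NAPALM's real implementation; the driver classes and registry below are invented purely to show how one call fans out to platform-specific transports (only the entry-point name `get_network_driver` mirrors NAPALM's):

```python
# Toy illustration of an abstraction layer: one call (get_facts) is
# dispatched to a per-platform driver that knows its own transport.
# NOT NAPALM's actual internals -- just the concept.

class EOSDriver:
    def get_facts(self, host):
        return {"host": host, "vendor": "arista", "via": "eAPI"}

class IOSDriver:
    def get_facts(self, host):
        return {"host": host, "vendor": "cisco", "via": "ssh"}

DRIVERS = {"eos": EOSDriver, "ios": IOSDriver}

def get_network_driver(platform):
    """Look up the driver class for a platform name."""
    try:
        return DRIVERS[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform}") from None

# Caller code is identical regardless of vendor or transport:
facts = get_network_driver("eos")().get_facts("spine1")
```

The caller never mentions eAPI or SSH, which is the whole point; it is also why an unsupported platform or feature forces you back to native interfaces.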
5.2.5.2. Operations#
Network execution isn’t just configuration management:
| Operation Type | What It Covers | Notes |
|---|---|---|
| Configuration changes | Push full configs/snippets/declarative state; merge/replace/delete modes | Most common operation type and best supported by tooling |
| Zero-Touch Provisioning (ZTP) | Automated onboarding at first boot | Requires DHCP + bootstrap servers (TFTP/HTTP/HTTPS) and device support |
| File transfers | Upload images, download logs/backups | Common protocols: SCP, SFTP, TFTP, HTTP |
| Device operations | Reboot/reload, operational commands (ping/traceroute), backups | Common for day-2 operations and remediation |
| Rollback | Revert configuration to previous known-good state | Native rollback on some platforms; backup restore on others |
Each operation type has its own error modes and needs specific handling. Software upgrades need pre-checks (enough disk space?), post-checks (did the device boot correctly?), and rollback plans (if upgrade fails, reload old image). The results should be exposed to Orchestration for decisions, while Execution can still handle local guardrails and retries.
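The pre-check/apply/post-check/rollback shape described above can be captured in a small wrapper. A hedged sketch: every callable is supplied by the caller, and the signature and `(ok, reason)` return convention are assumptions for illustration, not a real library's API:

```python
def guarded_operation(precheck, operation, postcheck, rollback):
    """Run an operation only if prechecks pass; roll back if postchecks fail.

    precheck/postcheck return (ok: bool, reason: str) -- e.g. for an upgrade:
    precheck = enough disk space?  postcheck = did the device boot correctly?
    The returned status dict is what Execution exposes to Orchestration.
    """
    ok, reason = precheck()
    if not ok:
        return {"status": "aborted", "reason": reason}
    operation()
    ok, reason = postcheck()
    if not ok:
        rollback()  # local guardrail: reload the old image, restore config
        return {"status": "rolled_back", "reason": reason}
    return {"status": "success", "reason": ""}
```

The wrapper handles local guardrails itself, while the status it returns gives Orchestration the material for workflow decisions (approve, pause, escalate).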
Where validation and rendering belong: The Source of Truth owns intent and (optionally) pre-rendered configs. Execution owns device-level safety checks and operational validation (reachability, pre/post checks). Orchestration decides when to validate and what to do with outcomes (approve, pause, rollback, or escalate). If you want a simple rule: validation that changes workflow belongs to Orchestration; validation that changes device actions belongs to Execution.
5.2.6. Solutions#
Many tools exist for network execution, and comparing them helps understand the tradeoffs:
| Tool | Execution model and strengths | Best fit | Main limitations |
|---|---|---|---|
| Ansible | DSL (YAML), agentless, large module ecosystem, strong for mixed server/network automation | Teams that want quick wins and broad platform support | At large scale, it needs tuning; complex logic can become hard to maintain in YAML |
| Terraform | Declarative IaC (HCL), strong diff/plan engine, stateful workflows, excellent cloud integration | Teams already standardizing on Terraform for infrastructure | Network provider maturity varies; state management adds operational overhead; weaker for day-2 ops |
| Salt | Python-based with agent or agentless options, event-driven architecture, strong scale characteristics | Existing Salt shops or event-heavy operations | Smaller network-automation community and steeper onboarding |
| Nornir | Python-first framework, threaded parallelism, highly flexible, easier debugging and fast | Python-capable teams with custom or performance-sensitive needs | Fewer ready-made components; more engineering ownership required |
| Custom Python/Go | Maximum control and design freedom for domain-specific logic | Edge cases, internal platforms, and highly specialized workflows | You own everything: standards, reliability patterns, testing, and lifecycle support |
| Vendor controllers | Intent + policy controllers with vendor-native workflows (examples: Cisco Catalyst Center, Aruba Central, Juniper Mist) | Teams standardizing around one vendor ecosystem with strong controller APIs | Less portable patterns in multi-vendor environments |
Many teams use multiple tools depending on the use case and network infrastructure.
5.2.7. Zero Touch Provisioning#
Zero Touch Provisioning (ZTP) is a distinct execution pattern where a device boots for the first time and automatically retrieves and applies its full configuration without any human intervention at the device. The data flow is structurally different from day-2 operations and is worth treating as a named architectural pattern.
The ZTP flow
flowchart LR
A[Device boots\nno config] --> B[DHCP assigns\nmanagement IP\nand bootstrap URL]
B --> C[Device fetches\nbootstrap config\nfrom server]
C --> D[Device authenticates\nto SoT and management\nnetwork]
D --> E[Orchestrator detects\nnew device, triggers\nfull provisioning job]
E --> F[Executor applies\nfull intent from SoT]
The key insight is that the bootstrapping step is deliberately minimal: just enough configuration to get the device onto the management network and able to authenticate. Full provisioning runs after that as a standard Executor job, pulling complete intent from the Source of Truth. This means ZTP reuses the same execution pipeline as day-2 operations rather than requiring a separate provisioning system.
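The "deliberately minimal" bootstrap render step might look like the following sketch. The SoT field names and the EOS-flavored CLI lines are illustrative assumptions; a real bootstrap server would template per vendor and pull the record from the Source of Truth by serial number:

```python
def render_bootstrap_config(device):
    """Render a minimal day-zero config from a SoT device record (a dict
    with hypothetical fields: hostname, mgmt_ip, mgmt_prefixlen, gateway).

    Deliberately minimal: management reachability and authentication only.
    Full intent is applied later by the Executor as a standard job.
    """
    return "\n".join([
        f"hostname {device['hostname']}",
        "interface Management1",
        f"  ip address {device['mgmt_ip']}/{device['mgmt_prefixlen']}",
        f"ip route 0.0.0.0/0 {device['gateway']}",
        "aaa authentication login default group radius local",
    ])
```

Everything beyond these few lines (VLANs, routing, policy) stays out of the bootstrap on purpose, so ZTP and day-2 operations share the same execution pipeline.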
Common ZTP patterns
| Pattern | How it works | Trade-off |
|---|---|---|
| DHCP + static file | DHCP points to a static config file per device (matched by MAC or serial) | Simple to implement; does not pull from SoT; breaks at scale |
| DHCP + dynamic generation | Bootstrap server queries SoT and generates device-specific initial config at request time | SoT-driven from day zero; requires the SoT to have the device enrolled before it boots |
| OS image + config | Device downloads both OS image and config during boot (needed for devices requiring an OS provision) | Handles bare-metal or factory-reset devices; increases bootstrap server complexity |
ZTP introduces a sequencing dependency: the device must be enrolled in the Source of Truth before it physically boots, otherwise the bootstrap server has no data to generate a config from. In the campus scenario, this is handled by the ServiceNow-to-Nautobot sync: when a switch is added as an asset in ServiceNow (before it arrives on site), Nautobot automatically creates the device record with location, role, and vendor. When the switch boots, the SoT is already ready.
5.2.8. Hybrid and Cloud Execution#
The Executor increasingly needs to operate across two fundamentally different environments simultaneously: traditional network devices and cloud-native platforms. These environments have different API paradigms, authentication models, and failure modes. Treating them identically leads to fragile automation; treating them as completely separate systems duplicates effort and creates inconsistent operations.
The divergence problem
| Dimension | Network devices | Cloud platforms |
|---|---|---|
| API style | SSH, NETCONF, gNMI (stateful, long-lived connections) | REST/HTTP (stateless, short-lived requests) |
| Auth model | Shared credentials (username/password, SSH keys) | Temporary tokens, service accounts, instance roles (AWS SigV4, Azure AD, GCP service accounts) |
| Operation completion | Usually synchronous (the command either succeeds or fails before the session ends) | Often asynchronous (API returns “accepted,” requires polling for final status) |
| Failure ambiguity | A dropped NETCONF session is a clear failure | A cloud API timeout may or may not have applied the change; you must query to find out |
| Idempotency approach | Imperative by default; declarative requires careful module design | Natively declarative in most platforms (Terraform, CloudFormation, Pulumi) |
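The "accepted, then poll" pattern for asynchronous cloud APIs can be sketched as a small helper. The function name and status strings are assumptions; note that a timeout deliberately does not conclude failure, matching the ambiguity row in the table above:

```python
import time

def wait_for_completion(get_status, timeout=300.0, interval=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll an asynchronous operation until it reaches a final state.

    get_status() returns e.g. 'pending', 'succeeded', or 'failed'.
    A TimeoutError does NOT mean the change failed: the caller must
    re-query actual state before retrying, because the change may
    have been applied even though we never saw it complete.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("still pending; query actual state before retrying")
        sleep(interval)
```

Injecting `clock` and `sleep` keeps the helper testable; in production the defaults are used and `get_status` wraps the provider's describe/get call.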
The recommended architecture
Keep Intent (SoT) and Orchestration unified. The same business request that creates a VLAN on campus switches may also need to create a security group in AWS. The Source of Truth holds both. The Orchestrator coordinates both. Only at the Network Adapter layer (5.2.5) does the execution path diverge: one adapter handles NETCONF/Ansible against campus switches; another handles Terraform or cloud provider APIs against the cloud environment.
This means a cloud API change and a network device change are governed by the same approvals and audit trail, which is the correct architectural outcome even though the protocols are completely different.
The secrets manager strategy must accommodate both long-lived network device credentials and short-lived cloud tokens. For cloud platforms, credentials should be generated at job start and never stored. Most cloud providers offer mechanisms for this (AWS instance profiles, Azure managed identities, GCP workload identity). Network device credentials remain in Vault and are injected at runtime. The Executor should never hold credentials between job runs regardless of the target type.
5.2.9. Security Considerations#
The Executor is the component in your architecture that has write access to the network. It can push configurations, restart services, upgrade firmware, and alter routing policies across hundreds of devices simultaneously. That makes its security posture more consequential than almost any other component, and yet it is frequently the most neglected from an architectural standpoint. A compromised or misconfigured Executor is not just a data breach risk; it is a network outage risk.
Credential management
Credentials for network devices should never appear in playbooks, templates, job definitions, or version control. The pattern is straightforward: store secrets in a dedicated secrets manager (HashiCorp Vault, AWS Secrets Manager, CyberArk) and inject them at runtime into the Executor’s environment. The Executor fetches credentials at job start; it does not hold them permanently.
This also means credential rotation becomes operationally safe. When device passwords are rotated, as they should be on a regular schedule, the Executor picks up new credentials on its next run without any deployment or reconfiguration.
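The fetch-at-job-start rule can be enforced with a job-scoped context manager. A minimal sketch, assuming `fetch` wraps whatever secrets manager you use (a Vault read, for example); the helper itself knows nothing about the backend:

```python
from contextlib import contextmanager

@contextmanager
def job_credentials(fetch):
    """Fetch credentials at job start and drop them when the job ends.

    'fetch' is any callable returning a dict of secrets (hypothetically,
    a thin wrapper around a Vault client). The Executor never holds
    credentials between runs, so rotation is picked up on the next fetch.
    """
    creds = fetch()
    try:
        yield creds
    finally:
        creds.clear()  # best-effort: drop references as soon as the job ends
```

Structuring jobs this way means there is no code path in which credentials persist between runs, rather than relying on discipline to delete them.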
Least privilege per operation
Not every execution needs the same level of access. Read operations (gathering facts, dry-run checks, pre-deployment validation) should use read-only credentials. Write operations should use credentials scoped to the change type: a job deploying VLAN changes should not hold credentials that allow BGP configuration changes. Where the network platform supports role-based access (Arista roles, Cisco privilege levels, NETCONF access control), use it to limit the blast radius of any single compromised session.
Across the campus example, this means a VLAN deployment job on an Arista switch uses a role that allows VLAN and interface configuration but cannot touch routing protocol configuration or management plane settings.
Separation of concerns
Who can trigger execution and who can modify automation logic should be different roles with different access controls. An operator approving and launching a VLAN deployment job does not need, and should not have, access to modify the Ansible roles or playbook templates that implement it. In AWX and AAP, this is enforced through role assignments: “Job Launcher” can start pre-approved job templates; “Project Editor” can change the underlying automation logic.
This separation matters most when automation is used for high-impact operations. The person initiating a firmware upgrade across 800 switches should not also be the person who wrote the upgrade playbook, or at minimum a second reviewer should have approved the playbook before it is available for production use.
Audit trail requirements
Every execution event must be logged with enough detail to reconstruct what happened: who triggered the job, which job template was used, which devices were targeted, what parameters were passed, what the outcome was for each device, and at what timestamp. Logs must be retained for the compliance period applicable to the organization, typically 12 to 24 months for change management in regulated industries.
This is not optional in most enterprise environments. Change management processes require a traceable record linking a change ticket to a specific execution event to a specific set of device changes. Design this audit trail into the Executor from the start rather than retrofitting it later.
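One audit event carrying exactly the fields listed above can be sketched as follows; the schema is an illustrative assumption, not a standard:

```python
import json
from datetime import datetime, timezone

def audit_record(actor, job_template, devices, params, results):
    """Build one execution audit event: who triggered it, which template,
    which devices, what parameters, per-device outcome, and a timestamp.
    Serialized to JSON for the retention store."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "job_template": job_template,
        "devices": devices,
        "parameters": params,
        "results": results,  # e.g. {"sw1": "changed", "sw2": "failed"}
    })
```

Emitting one such record per execution, keyed to the change ticket, is what makes the ticket-to-execution-to-device-change chain reconstructible later.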
Network segmentation of the automation infrastructure
The Executor and its control plane (AWX, Ansible control nodes) should sit in a management network segment. Execution traffic to devices should flow through the out-of-band management network where available, not through the same data plane the Executor might be modifying. Inbound access to trigger jobs should require authentication and should not be reachable from untrusted networks.
A practical failure mode to avoid: an Executor that is reachable from the production user VLAN means that any compromised host in that VLAN could potentially trigger network changes. Management plane access belongs on a management plane network.
5.3. Implementation Example#
5.3.1. Use Case: Automated VLAN Deployment Across a Heterogeneous Campus#
We continue with the campus network from Chapter 4. The VLAN service definition (its ID, subnet, per-vendor configuration templates, and target switch groups) is now stored in the Source of Truth. This chapter focuses on how the Execution block picks up that intent and deploys it reliably across all 800 switches.
Scenario: A business team requests a new VLAN for a new application. The request comes through a ticketing system with details: VLAN ID, name, campus sites, and approval. Network operations deploys this VLAN across Arista, HPE, and Cisco access/distribution switches, verifies connectivity, and reports success.
Requirements:
- Deploy VLAN configuration to multiple switches in parallel
- Verify pre-conditions (VLAN doesn’t already exist, switches are reachable)
- Perform dry-run to show what would change
- Execute actual deployment with rollback capability if failures occur
- Verify post-deployment (VLAN is active, no errors)
5.3.2. Solution Architecture#
Zooming out to the whole architecture before diving into execution details:
- Source of Truth: NetBox (stores inventory, VLAN definitions)
- Orchestration/Triggering: AWX Workflow Template coordination (API launch, webhook launch, scheduled launch)
- Observability: Provides real-time data for decision-making
- Execution:
- Execution Engine: Ansible job templates/tasks (precheck, dry-run, deploy, verify, rollback)
- Data Integration: NetBox inventory plugin in Ansible/AWX
- Network Adapters: Ansible modules for Cisco IOS XE, Arista EOS, and HPE/Aruba
- State Management: AWX job and workflow status per phase + per-host results, with rollback task on failure paths
This solution is for illustration purposes; it is not a universal recommendation.
5.3.3. Implementation Flow#
This runs as an AWX workflow with multiple task nodes:
- A VLAN intent change in NetBox triggers AWX via webhook
- AWX workflow starts and runs inventory sync (NetBox)
- Precheck job validates reachability, existing VLAN state, and site guardrails
- Dry-run job renders vendor-specific tasks without applying changes
- Approval node (optional) gates production execution
- Deploy job applies VLAN intent in parallel batches across vendors
- Verify job confirms operational and intended state
- On failures, rollback job runs only for affected scope
- Observability collects pre/post data to support final validation
- Final validation runs and the execution summary is published
flowchart TD
A[NetBox VLAN Intent Change] --> B[Webhook to AWX]
B --> C[AWX Workflow Start]
C --> E[Inventory Sync from NetBox]
E --> F[Precheck Job Task]
F --> G[Dry-run Job Task]
G --> H[Approval Node]
H --> I[Deploy Job Task]
I --> J[Verify Job Task]
J --> N[Gather Network Data]
N --> L[Final Validation]
L --> M[Update Ticket and Report]
F -->|failed| X[Stop and return validation errors]
I -->|failed hosts| Y[Rollback Job Task for affected devices]
Y --> J
F --> N
%% --- Styles ---
classDef sot fill:#cfe8ff,stroke:#0b5fb3,stroke-width:2px;
classDef orch fill:#ffe0cc,stroke:#c24b00,stroke-width:2px;
classDef exec fill:#d7f4e1,stroke:#0f7a3b,stroke-width:2px;
classDef obs fill:#ffd6d6,stroke:#b22222,stroke-width:2px;
class A sot;
class B,C,E,H,L,M orch;
class F,G,I,J,Y,X exec;
class N obs;
5.3.4. Solution Summary#
This implementation demonstrates all key execution capabilities:
- Data Integration: Pulls inventory and intent from NetBox using dynamic inventory
- Triggering: AWX workflow supports API/webhook/schedule, plus approval-gated execution
- State Management: AWX workflow/job status tracks progress and failure domains; rollback path is explicit
- Engine: Ansible handles parallel execution (configured via `forks`/batching), error handling, and dry-run
- Network Adapter: Vendor-native resource modules map one intent to different platforms (`cisco.ios.ios_vlans`, `arista.eos.eos_vlans`, and HPE/Aruba collection modules)
- Observability: Pre/post data collection supports final validation and reporting
Resiliency features:
- Pre-checks prevent deploying to unreachable devices or creating duplicate VLANs
- Dry-run shows changes before applying
- Safety checkpoints use resource-module states (`rendered`, `parsed`, `gathered`) before apply states (`merged`, `replaced`, `overridden`)
- Idempotent modules make re-running safe
- Post-verification catches silent failures
- Rollback: dedicated AWX rollback node limits rollback to affected device scope
Scaling considerations:
- Parallel execution via `forks: 50` processes 50 switches simultaneously
- For larger deployments (500+ switches), batch into groups and use AWX/AAP workflow management
- Add rate limiting if control plane or authentication systems can’t handle the load
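The batching step above can be sketched as a one-liner; running waves sequentially caps concurrent sessions, which is what would have saved the RADIUS server in the opening incident:

```python
def batches(devices, size):
    """Split a device list into fixed-size batches for serial rollout.

    Running batches sequentially (with forks-level parallelism inside
    each) caps concurrent SSH sessions, so the authentication backend
    never sees all 800 logins at once.
    """
    return [devices[i:i + size] for i in range(0, len(devices), size)]

# 800 switches in batches of 50 -> 16 waves of at most 50 concurrent sessions
waves = batches([f"sw{n}" for n in range(800)], 50)
```

In Ansible terms this maps to the `serial` keyword or to iterating job templates over inventory slices; the sketch just makes the arithmetic explicit.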
5.4. Summary#
The Execution block is where network automation becomes tangible: configurations get pushed, devices get rebooted, changes happen. But reliable execution is more than sending commands over SSH.
It starts with data integration and triggering, then depends on state management (idempotency, rollback, and transaction-like safety), a capable engine (serial/parallel choices, dry-run/plan, resiliency), and a network adapter layer that hides protocol differences across vendors. Observability feeds validation and guardrails, while Orchestration decides when to proceed or stop. The tooling matters less than the execution patterns: integrate data, trigger intentionally, execute deterministically, validate results, and keep a clean rollback path.
Start small and scale deliberately: one workflow, one device class, and clear pre/post checks. Add batching, rate limits, and stronger validation as the blast radius grows. Execution is powerful and risky; test thoroughly, roll out gradually, monitor constantly, and always plan for rollback.
References and Further Reading#
- Network Programmability and Automation: Skills for the Next-Generation Network Engineer, 2nd Edition. O’Reilly Media, 2023. Matt Oswalt, Christian Adell, Scott S. Lowe, and Jason Edelman. Chapter 10 (Automation Tools) and Chapter 12 (Automation Architecture).
- Network Automation Cookbook, 2nd Edition. Packt Publishing, 2024. Christian Adell and Jeff Kala.
- Ansible Network Automation documentation: https://docs.ansible.com/ansible/latest/network/
- Mastering Python Networking, 4th Edition. Packt Publishing, 2023. Eric Chou.
- NAPALM documentation: https://napalm.readthedocs.io/
- Nornir documentation: https://nornir.readthedocs.io/
- Terraform Network Infrastructure Automation: https://www.terraform.io/use-cases/network-infrastructure-automation