Mar 20, 2026 · 6714 words · 32 min read

7. Orchestration#

The network team had done everything right. They had a solid Source of Truth, well-tested playbooks for every operation, and a clear runbook to execute them. On paper, deploying a new VLAN service was fully automated. In practice, it took half a day and one specific engineer.

That engineer knew the sequence. First, validate the SoT data was complete. Then run the pre-check playbook. Then review the output, look for failed devices, and decide whether to continue. Then trigger the deployment playbook. Then wait. Then run the validation playbook. Then update the ServiceNow ticket manually. If any device failed partway through, roll it back before the others noticed. She had it all in a runbook, step by step, in a shared document that nobody else had fully internalized.

When she went on leave, deployments stopped. When a junior engineer tried the sequence and got the order wrong, the network was left in a partial state for six hours. The automation existed. The coordination was still manual.

That is the problem this chapter solves. Orchestration is the building block that turns a collection of automation tools into a system that behaves as one. It coordinates the other blocks, decides when they run, handles failures gracefully, tracks every decision, and does all of this without an engineer standing in the middle managing the flow by hand.

7.1. Fundamentals#

7.1.1. Context#

Every building block we have covered so far does one thing well. The Source of Truth holds intent. The Executor applies it. The Collector retrieves state. Observability makes sense of that state. Each one is a specialist, and specialists need a connector: something that decides when each should act, what to do with the result, and how to recover when something goes wrong.

That connector is the Orchestrator.

In Chapter 3 we placed it at the center of the NAF framework as the block that coordinates all the others. Chapter 5 showed how the Executor applies configuration changes; Chapter 6 showed how Observability validates the results. This chapter shows how the Orchestrator sequences both, and handles everything that can go wrong between them.

Without orchestration, you are not running automation. You are running scripts that someone has to invoke in the right order, at the right time, and interpret the right way. That is automation in name only.

7.1.2. Goals#

The Orchestrator needs to fulfill five goals:

  1. Coordinate multi-block workflows end-to-end. A deployment is not a single action; it is a sequence: validate intent in the SoT, run pre-checks, execute configuration, validate the result, notify stakeholders. The Orchestrator holds the full sequence together.

  2. React to events automatically, with or without human initiation. A ServiceNow approval, an Observability alert, a scheduled compliance scan: all of these should be able to trigger a workflow without a human manually starting it.

  3. Execute resiliently and at scale. A workflow touching 800 switches in parallel must complete reliably. It must survive a restart. It must handle partial failures without losing the work done by the devices that succeeded.

  4. Provide tamper-evident visibility. Operators need to see what is running right now. Auditors need to see what ran last month, who triggered it, which version of the workflow ran, and what every step produced. Both needs must be met from the same system.

  5. Manage workflow definitions as production software. A workflow that deploys to 800 switches cannot be changed casually. The logic itself needs to be persisted, versioned, tested, and promoted to production with the same discipline applied to any code running in production.

7.1.3. Pillars#

Five pillars support these goals:

  1. Workflow engine: define, run, and track multi-step processes
  2. Triggering layer: manual, scheduled, event-driven, webhook
  3. Resilient execution: workflow survives restarts; supports retry, rollback, and concurrent operations across hundreds of devices without per-device bottlenecks
  4. Audit and observability: every step logged, every decision traceable
  5. Pipeline management: persist, version, and promote workflow definitions safely in production

7.1.4. Scope#

The Orchestrator coordinates. It does not act.

In scope:

  • Coordinating other building blocks
  • Defining workflow logic and step dependencies
  • Handling triggering from multiple sources
  • Tracking execution state and producing audit trails
  • Managing rollback decisions when steps fail

Out of scope:

  • Executing device changes (that is the Executor)
  • Storing network operational state (that is Observability)
  • Holding network intent (that is the Source of Truth)

A common mistake is building an Orchestrator that duplicates the execution or storage responsibilities of its neighbors. The interface between blocks must stay clean.

An Orchestrator that also stores credentials, manages device inventory, or runs its own configuration engine has grown into something else. If your orchestration tool starts absorbing responsibilities from other blocks, you have an architectural coupling problem that will be expensive to untangle later.

7.2. Functionalities#

The five goals and pillars are realized through five core functionalities. Each maps directly to one goal and its supporting pillar:

  1. Workflow Engine: how workflows are structured and how steps coordinate
  2. Triggering: how and when workflows start, and what the caller experiences
  3. Resilience and Scale: how workflows complete reliably under failures and at volume
  4. State and Traceability: tracking execution state and producing tamper-evident audit records
  5. Pipeline Management: persisting and managing workflow definitions safely in production

graph LR

    subgraph Goals
        direction TB
        A1[Coordinate multi-block workflows end-to-end]
        A2[React to events automatically]
        A3[Resilient and scalable execution]
        A4[Tamper-evident visibility and audit]
        A5[Safe changes to production pipelines]
    end

    subgraph Pillars
        direction TB
        B1[Workflow engine: define, run, track]
        B2[Triggering layer: manual, scheduled, event-driven]
        B3[Resilient execution: durable, retry, rollback at scale]
        B4[Audit and observability: every step logged]
        B5[Pipeline management: persist, version, promote safely]
    end

    subgraph Functionalities
        direction TB
        C1[Workflow Engine]
        C2[Triggering]
        C3[Resilience and Scale]
        C4[State and Traceability]
        C5[Pipeline Management]
    end

    A1 --> B1 --> C1
    A2 --> B2 --> C2
    A3 --> B3 --> C3
    A4 --> B4 --> C4
    A5 --> B5 --> C5

    classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
    classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
    classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
    classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
    classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;

    class A1,B1,C1 row1;
    class A2,B2,C2 row2;
    class A3,B3,C3 row3;
    class A4,B4,C4 row4;
    class A5,B5,C5 row5;

7.2.1. Workflow Engine#

The workflow engine is the core of the Orchestrator. It defines multi-step processes, executes them, tracks their state, and handles the relationships between steps.

Before going into patterns, it is worth naming the four fundamental approaches to coordination, because this choice shapes everything else.

7.2.1.1. Coordination approaches#

  • Monolith (imperative program): A single Python script calls each step in sequence, waits for each result, and decides what to do next. Simple to write, and many teams start here. The problem is durability: if the script crashes at step 5 of 10, you restart from the beginning. There is no persistent state between runs. You also cannot parallelize steps without writing threading logic yourself. It works for small, fast operations; it breaks under load and unreliable infrastructure. In practice, this is what the Executor block already provides: a script invoked by an operator. Calling it orchestration is generous.

  • Workflow (DAG-based): This is where orchestration genuinely begins. Steps are defined as a directed acyclic graph, where each node is a task and edges express dependencies. The engine tracks state per node: if you restart after a crash, only the failed or incomplete steps re-run. Parallelism is built in: independent branches execute concurrently. This is the dominant approach for production orchestration, and the most proven for network automation. Tools in this category include Prefect, Temporal, and AWX workflow templates.

  • Choreography (event-driven, no central coordinator): There is no Orchestrator. Each component reacts to events published by others: the Executor publishes “deployment complete,” the Observability system consumes it and runs validation, the notification system consumes the validation result. The coupling between components is loose, which makes it easy to add new consumers. The downside: there is no single place to understand the full workflow state. Debugging cross-service failures requires correlating events from multiple systems. This approach works for simple reactive patterns but scales poorly in complexity.

  • Agentic (sense-logic-act loop): A Large Language Model (LLM) or AI system acts as the logic layer. The agent observes current state (from Observability or SoT), reasons about what action is needed, and invokes the Executor. Rather than following a predefined workflow graph, it makes decisions dynamically at runtime. This is the approach that trades determinism for flexibility and forms the architectural foundation for autonomous networks. Section 7.2.7 covers it in depth; Chapter 17 extends it to full autonomy.

Most teams start with the Monolith approach and graduate to Workflow (DAG-based) as their automation matures. Choreography is rarely the right choice for network operations because audit and traceability requirements make a central coordinator valuable. Agentic approaches are real but not yet the default for production network operations.

7.2.1.2. Workflow patterns#

Within the DAG-based approach, four patterns cover most real-world network automation workflows:

  • Sequential: Steps run one after another; each depends on the previous completing successfully. Appropriate when strict ordering is required: validate intent, then pre-check, then execute, then verify.

  • Parallel (fan-out/fan-in): Multiple independent steps execute concurrently. A fan-out step launches many parallel tasks (one per device, or one per building); a fan-in step waits for all of them to complete before proceeding. Essential for operations across hundreds of devices: without fan-out, a 10-second per-device operation across 800 devices takes over two hours sequentially.

  • Conditional branching: The path taken depends on runtime state. Did the pre-check pass? Branch to execution. Did it fail? Branch to abort and notify. The workflow definition includes decision nodes that evaluate results and choose the next step.

  • Saga pattern: For long-running workflows, the Saga pattern adds compensating transactions to each step. If step N fails, the workflow runs compensating actions for steps N-1 through 1 in reverse order, returning the system to a known-good state. This is how rollback works at the orchestration level: not by re-running the whole workflow, but by executing the inverse of each step that succeeded.

flowchart TD
    subgraph Sequential
        S1[Step A] --> S2[Step B] --> S3[Step C]
    end

    subgraph FanOut["Fan-out and Fan-in"]
        F0[Start] --> F1[Device 1]
        F0 --> F2[Device 2]
        F0 --> F3[Device N]
        F1 --> FJ[Fan-in]
        F2 --> FJ
        F3 --> FJ
    end

    subgraph Saga["Saga - rollback on failure"]
        P1[Step 1] --> P2[Step 2] --> P3[Step 3]
        P3 -- fail --> R3[Rollback 3]
        R3 --> R2[Rollback 2]
        R2 --> R1[Rollback 1]
    end

The Saga Pattern is also the structural basis for progressive rollouts: deploying network changes in waves rather than all at once. Wave 1 covers a small test population; if it succeeds, the workflow fans out to Wave 2, then Wave N. If any wave fails, compensation rolls back only that wave. This is the network equivalent of a canary deployment: the same controlled promotion discipline software teams apply to code releases, applied to network changes.
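The compensation logic described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a specific tool's API; the step names (`reserve_vlan`, `push_config`, `update_sot`) are hypothetical.

```python
# Minimal saga sketch: each step pairs a forward action with a compensator.
# On failure at step N, compensators for steps N-1 .. 1 run in reverse order.

def run_saga(steps, fail_at=None):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            if fail_at == name:  # simulate a step failure for illustration
                raise RuntimeError(f"{name} failed")
            action()
            log.append(f"do:{name}")
            completed.append((name, compensate))
        except RuntimeError:
            log.append(f"fail:{name}")
            for done_name, comp in reversed(completed):
                comp()  # undo each successful step, newest first
                log.append(f"undo:{done_name}")
            return log, "rolled_back"
    return log, "succeeded"

noop = lambda: None
steps = [
    ("reserve_vlan", noop, noop),
    ("push_config", noop, noop),
    ("update_sot", noop, noop),
]
trace, outcome = run_saga(steps, fail_at="update_sot")
# trace ends with undo:push_config, undo:reserve_vlan — the system
# returns to its pre-workflow state before reporting failure.
```

A production engine (Temporal, for instance, treats compensation as a first-class construct) replaces the in-memory `log` with durable state, but the reverse-order discipline is the same.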

7.2.1.3. Dependency management#

Workflow steps rarely exist in isolation. The engine needs to express and enforce three kinds of dependencies:

  • Data dependencies: Task B cannot start until Task A completes and produces a result that B consumes. The engine passes outputs between steps as structured data. In practice this means the pre-check step passes the list of reachable devices to the execution step, rather than re-querying from scratch.

  • Multi-input dependencies (fan-in): A step waits for multiple parallel branches before proceeding. The engine holds state for each branch and triggers the join step only when the configured completion policy is met: all completed, or N of M, or first success.

  • External dependencies: Sometimes a step cannot proceed until something outside the system happens. A change window opens. A human approves. A device becomes reachable after a reboot. The engine must support waiting states with configurable timeouts and defined escalation paths for when the wait never resolves.
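The fan-in completion policies mentioned above (all completed, N of M, first success) reduce to a small decision function. A hedged sketch, with a results map standing in for the engine's per-branch state:

```python
# Fan-in join sketch: decide whether the join step may proceed under a
# configurable completion policy. "results" maps branch -> success bool.

def fan_in_ready(results, policy, expected, n=None):
    """Return True when the join step's completion policy is satisfied."""
    done = len(results)
    succeeded = sum(1 for ok in results.values() if ok)
    if policy == "all":
        return done == expected and succeeded == expected
    if policy == "n_of_m":
        return succeeded >= n
    if policy == "first_success":
        return succeeded >= 1
    raise ValueError(f"unknown policy: {policy}")

results = {"device-1": True, "device-2": False, "device-3": True}
fan_in_ready(results, "all", expected=3)          # False: one branch failed
fan_in_ready(results, "n_of_m", expected=3, n=2)  # True: 2 of 3 succeeded
```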

7.2.2. Triggering#

A workflow has to start somehow. Triggering answers two questions: what causes a workflow to begin, and what does the caller experience after it does.

How triggering differs from Chapter 5: In Chapter 5, triggering referred to how an operator or system invokes the execution engine directly: an API call to AWX, a template launch from a Command Line Interface (CLI). That is execution-layer triggering. Here, triggering describes what causes the Orchestrator to initiate a workflow, which may in turn invoke the Executor as one of several steps. The Orchestrator receives the trigger and decides what full workflow to run. The Executor just executes what the Orchestrator tells it to.

7.2.2.1. Triggering modes#

Four modes cover the full spectrum from human to fully automated:

  • API call: The orchestrator exposes an HTTP endpoint. An operator calls it manually, a UI button calls it, or an external system (ServiceNow, Nautobot, a CI/CD pipeline) calls it when an event occurs. These are the same mechanism: what differs is who initiates the call and whether it carries structured event data. Appropriate for any change that needs to start now, whether the initiator is a human or a system.

  • Scheduled: A cron-like schedule starts a workflow at a configured time. Nightly compliance scans, weekly firmware audits, monthly capacity reports. Most orchestrators support this natively; some teams use an external scheduler and call the orchestrator API.

  • Message queue: The Orchestrator consumes messages from a queue (Kafka, NATS, RabbitMQ) and starts a workflow per message. This decouples the event producer from the orchestrator: the producer publishes and moves on; the orchestrator processes at its own pace. Appropriate for high-volume event streams where delivery-at-most-once guarantees from direct API calls are insufficient.

The choice between push (API call) and pull (message queue, scheduled) depends on volume and coupling tolerance. API calls are simpler but require the sender to know the orchestrator’s endpoint. Message queues scale better for high-volume streams but add infrastructure to operate.

7.2.2.2. Response contract#

Once a workflow starts, the caller needs to know what to expect back:

  • Synchronous: The caller waits until the workflow completes and receives the result directly. Appropriate for short operations (under 30 seconds) where the caller needs the result to proceed. Most interactive use cases start here and discover they have outgrown it when they try to run a 10-minute deployment and the API client times out.

  • Asynchronous: The caller receives a workflow run ID immediately and polls for status, or registers a callback URL to receive the result when complete. Required for any workflow that touches more than a handful of devices. This has a direct implication for how the system surfaces status: the Presentation layer (Chapter 8) must expose status endpoints and push notifications, because users triggered workflows from somewhere and need a way to track them without holding an open HTTP connection.

  • Hybrid: The workflow starts asynchronously, but the caller can optionally block on a sync-wait endpoint if needed. A convenience pattern that avoids forcing callers to choose upfront.

Idempotency in event-driven workflows: When Observability fires an alert or the SoT emits a change event, multiple systems may react simultaneously. The same alert can fire twice; a message queue may deliver the same message more than once. Every event-driven workflow must be idempotent: running it twice against the same input must produce the same result as running it once. At the orchestration level, this typically requires a deduplication key: a unique identifier the orchestrator checks before starting a new run. If a run with that key already exists and is still running, reject or queue the duplicate. This sounds simple and is surprisingly easy to get wrong in practice.
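A deduplication key can be as simple as a hash of the event's source and ID, checked against the set of active runs. A minimal sketch; the event shape and `RunRegistry` name are illustrative, not from a specific orchestrator:

```python
# Deduplication sketch: refuse to start a second run for the same event
# while the first is still active.
import hashlib

class RunRegistry:
    def __init__(self):
        self.active = {}  # dedup_key -> run_id

    def dedup_key(self, event):
        # A real system might also hash the payload to catch replays
        # with the same ID but different content.
        raw = f"{event['source']}:{event['id']}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def start(self, event):
        key = self.dedup_key(event)
        if key in self.active:
            return None  # duplicate: a run with this key is in flight
        run_id = f"run-{len(self.active) + 1}"
        self.active[key] = run_id
        return run_id

registry = RunRegistry()
alert = {"source": "observability", "id": "alert-42"}
first = registry.start(alert)   # new run started
second = registry.start(alert)  # None: duplicate suppressed
```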

The Closed-Loop Pattern: Observability detects a drift condition (a device configuration has diverged from intent). It fires an alert to the Orchestrator via the triggering mechanisms above. The Orchestrator starts a remediation workflow: queries the SoT for the expected state, invokes the Executor to correct the device, and then re-runs an Observability check to confirm the fix. If the check passes, the loop closes. If it fails, the Orchestrator escalates. This pattern is the foundation of self-healing automation and is covered in depth in Chapter 15. The loop only works if the Orchestrator, Observability, and Executor are decoupled enough that each can play its role independently.

7.2.3. Resilience and Scale#

This is what separates an orchestrator you prototype with from one you trust at 3am in production. A workflow that runs reliably under ideal conditions tells you nothing about whether it will survive a coordinator restart mid-run, handle 40 of 800 devices failing pre-checks, or degrade gracefully when a dependency never resolves. The next time your orchestrator is tested, it will not be in a demo.

Resilience at the execution layer, covered in Chapter 5, addresses idempotency and retry at the individual device level: if a playbook fails against one device, retry it. What Chapter 5 does not address is workflow-level durability: what happens when the Orchestrator itself restarts mid-run, when 40 of 800 devices fail pre-checks, or when a dependency never resolves and the workflow hangs. Scaling the orchestrator itself, including high availability, horizontal workers, and database load under concurrent runs, is an infrastructure concern addressed in Chapter 11.

7.2.3.1. Durable state#

A workflow engine that stores execution state in memory loses everything on restart. This is acceptable for scripts and unacceptable for production orchestration. Durable state means the workflow’s progress survives the orchestrator’s failure: which steps completed, what they produced, which are still pending. When the orchestrator comes back online, it resumes from where it left off.

Temporal is built entirely around this guarantee: it replays the workflow function from its event history on restart, making in-flight state crash-proof. AWX stores job state in its database. Scripts store nothing.
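The resume-from-checkpoint behavior can be illustrated without any engine at all. In this sketch a dict stands in for the orchestrator's database; a real engine persists the checkpoint after every step so a restart re-executes only what never completed:

```python
# Durable-state sketch: completed step results are checkpointed, so a
# resumed run skips them and re-executes only the unfinished steps.

def run_workflow(steps, checkpoint):
    """Execute steps not yet in the checkpoint; return what actually ran."""
    executed = []
    for name, action in steps:
        if name in checkpoint:
            continue  # completed before the crash: skip on resume
        checkpoint[name] = action()  # persist result before moving on
        executed.append(name)
    return executed

steps = [("precheck", lambda: "ok"),
         ("deploy", lambda: "ok"),
         ("validate", lambda: "ok")]
checkpoint = {"precheck": "ok"}            # state persisted before a restart
resumed = run_workflow(steps, checkpoint)  # only deploy and validate run
```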

7.2.3.2. Retry strategies#

Not all failures are equal. A device that was briefly unreachable during a firmware reboot should be retried. A device that returned a permanent authorization error should not: retrying will not help and may trigger lockout policies.

Retry configuration should distinguish:

  • Transient failures: network blips, API timeouts, temporary resource contention. Retry with exponential backoff.
  • Permanent failures: bad credentials, device in maintenance mode, configuration that does not apply to this device type. Abort and escalate.

The workflow definition should specify per-step retry policy. A pre-check step and a configuration push step have different failure semantics and should not share the same retry settings.
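The transient/permanent distinction maps naturally onto exception types. A hedged sketch with illustrative exception classes (a real orchestrator would classify failures from the Executor's return codes instead):

```python
# Retry sketch: retry transient failures with exponential backoff,
# abort immediately on permanent ones.
import time

class TransientError(Exception): pass   # network blip, timeout, contention
class PermanentError(Exception): pass   # bad credentials, wrong device type

def run_with_retry(step, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except PermanentError:
            raise  # retrying will not help and may trigger lockout policies
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

attempts = []
def flaky_precheck():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("device briefly unreachable")
    return "ok"

# sleep is injected so the sketch runs instantly; production code
# would use the real time.sleep (or the engine's retry policy).
result = run_with_retry(flaky_precheck, sleep=lambda s: None)  # "ok"
```

Per-step retry policy then becomes a matter of passing different `max_attempts` and `base_delay` values to the pre-check step and the configuration-push step.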

7.2.3.3. Rollback and compensation#

When a deployment workflow fails partway through, you have three options: leave the partial state, fix it manually, or roll it back automatically. In most network operations, partial state is dangerous: devices that received the new configuration will behave differently from devices that did not.

The Saga Pattern addresses this by defining a compensating transaction for each step. If the workflow succeeds up to step N and then fails, it runs compensating actions for steps N-1 through 1 in reverse order. In a VLAN deployment: if the configuration push succeeds on 30 devices and fails on the 31st, the saga rolls back the 30 successful pushes before reporting failure.

The Saga Pattern requires that compensating transactions be defined in advance, at workflow design time. This is extra work upfront. The alternative is discovering at 2am that your network is in a state that no runbook describes.

7.2.3.4. Concurrency control#

Fan-out across 800 devices simultaneously will overwhelm most execution layers. Concurrency control at the orchestration level means:

  • Batching: run N devices in parallel, wait for the batch to complete, then run the next N. This controls blast radius: if a bad configuration reveals a problem, you have not touched all 800 devices yet.
  • Rate limiting: the execution layer has limits; the orchestrator must respect them. Do not let the AWX queue fill with 800 simultaneous jobs.
  • Partial success thresholds: define what success means at scale. 798 of 800 devices configured correctly might be good enough to proceed; 600 of 800 should halt the workflow and escalate.
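Batching and partial-success thresholds combine into a simple control loop. A sketch under stated assumptions: `push_config`, the batch size, and the 99% threshold are illustrative, and a real engine would run each batch in parallel rather than this sequential stand-in:

```python
# Batched fan-out sketch with a partial-success threshold: halt between
# batches if the cumulative success rate drops too low.

def deploy_in_batches(devices, push_config, batch_size=50, min_success=0.99):
    succeeded, failed = [], []
    for i in range(0, len(devices), batch_size):
        batch = devices[i:i + batch_size]
        for device in batch:  # a real engine runs the batch concurrently
            (succeeded if push_config(device) else failed).append(device)
        rate = len(succeeded) / (len(succeeded) + len(failed))
        if rate < min_success:
            # Blast radius contained: later batches are never touched.
            return {"status": "halted", "succeeded": succeeded, "failed": failed}
    return {"status": "completed", "succeeded": succeeded, "failed": failed}

devices = [f"sw-{n}" for n in range(200)]
result = deploy_in_batches(devices, push_config=lambda d: d != "sw-10",
                           batch_size=50, min_success=0.99)
# Halts after the first batch: 49/50 = 98% success, below the threshold.
```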

7.2.3.5. Timeout and circuit breaker#

Workflows that wait forever are a reliability problem. Every step that can block needs a configured timeout. When the timeout expires, the workflow needs a defined action: escalate, skip, or abort.

The Circuit Breaker pattern extends this to repeated failures: if a step fails more than N times in a window, stop trying and raise an alert. This prevents a single unreachable device from holding a workflow open for hours.
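A minimal circuit breaker tracks recent failures and trips open once a threshold is crossed; subsequent calls fail fast instead of waiting on the same unreachable target. The class and names here are a sketch, not a library API:

```python
# Circuit-breaker sketch: after N failures within a window, the breaker
# opens and further calls are refused immediately.
import time

class CircuitBreaker:
    def __init__(self, threshold=3, window=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.clock = clock
        self.failures = []  # timestamps of recent failures
        self.open = False

    def call(self, step):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            return step()
        except Exception:
            now = self.clock()
            # Keep only failures inside the sliding window.
            self.failures = [t for t in self.failures if now - t < self.window]
            self.failures.append(now)
            if len(self.failures) >= self.threshold:
                self.open = True  # stop trying; raise an alert instead
            raise

def unreachable_device():
    raise TimeoutError("no response")

breaker = CircuitBreaker(threshold=3)
for _ in range(3):
    try:
        breaker.call(unreachable_device)
    except TimeoutError:
        pass
# breaker.open is now True; the next call fails fast without a timeout.
```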

7.2.3.6. Resilience across block boundaries#

The five patterns above address failures within the Orchestrator’s own execution: durable state, retries, rollback, concurrency, timeouts. But workflows also fail at the boundaries between blocks, and those failures require different handling.

When the Orchestrator calls the SoT API and receives a timeout, the appropriate response is different from when the Executor returns a device failure. A SoT timeout means the Orchestrator cannot verify that the intent it is about to deploy is current. Proceeding with cached data risks applying stale configuration. The correct response is usually to abort and retry rather than to proceed with uncertainty. A Presentation layer event that never arrives (the approval webhook from ServiceNow) may mean the request was rejected, the integration is broken, or the message was lost. These require different recovery paths: a missing approval is not a transient failure that should be retried indefinitely.

Block-boundary failures should be classified explicitly in the workflow definition:

  • Dependency unavailable (SoT, Observability): the workflow cannot proceed safely. Fail cleanly, emit a diagnostic event, and do not retry with stale data.
  • Dependency degraded (partial SoT read, Observability lagging): the workflow may proceed with explicit acknowledgement that it is operating with reduced confidence. Log the degraded state; do not silently proceed.
  • Downstream block failed (Executor returned error, Collector returned no data): apply the retry and rollback patterns above, but classify the failure source accurately so the alert and audit record name the correct block.

A workflow that treats all failures as retryable device errors will retry a broken SoT integration until it exhausts its retry budget and pages an on-call engineer with the wrong diagnosis. Classifying failures at block boundaries is the difference between a workflow that fails fast and one that fails confusingly.
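The classification above can be made explicit in code so the response is chosen by failure source, not guessed. A sketch; the source names and mapping are illustrative:

```python
# Block-boundary failure classification sketch: the workflow's response
# depends on which block failed and how.
from enum import Enum

class FailureClass(Enum):
    DEPENDENCY_UNAVAILABLE = "fail_clean_no_retry"   # e.g. SoT timeout
    DEPENDENCY_DEGRADED = "proceed_with_warning"     # e.g. partial SoT read
    DOWNSTREAM_FAILED = "retry_then_rollback"        # e.g. Executor error

def classify(source, error_kind):
    if source in ("sot", "observability") and error_kind == "timeout":
        return FailureClass.DEPENDENCY_UNAVAILABLE
    if source in ("sot", "observability") and error_kind == "partial":
        return FailureClass.DEPENDENCY_DEGRADED
    if source in ("executor", "collector"):
        return FailureClass.DOWNSTREAM_FAILED
    raise ValueError(f"unclassified failure: {source}/{error_kind}")

classify("sot", "timeout")     # abort cleanly, do not retry with stale data
classify("executor", "error")  # apply the retry and rollback patterns
```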

7.2.4. State and Traceability#

When something goes wrong at 3am, two questions need immediate answers: what is the current state of the workflow, and who triggered it?

When the audit team asks questions three months later, the same record must answer: what ran, which version of the workflow definition ran, what every step received as input, what it produced as output, and what the final outcome was.

These are different questions with different urgency, but they come from the same source: the workflow’s execution state and its audit record.

Traceability is a cross-cutting concern across the automation platform: the Source of Truth records every change to intent; Observability records every change to network state. But neither of those records tells you who initiated a workflow, which steps ran in sequence, and what caused the outcome. Only the Orchestrator’s audit record closes that gap. Because the Orchestrator coordinates all the other blocks, its trace is the authoritative view of everything that happened across the full automation system for a given event.

7.2.4.1. State is not network state#

This distinction is easy to miss. In Chapter 6, “state” meant the network’s operational state: interface counters, BGP sessions, device CPU. In Chapter 5, “state” meant the desired configuration intent held in the SoT. Here, state means the execution state of the workflow itself: which steps have run, which are running, which failed, and what they produced.

The two are related (a workflow step produces network state changes that Observability then validates) but they are stored differently, queried differently, and serve different purposes. Do not conflate them.

7.2.4.2. The workflow state machine#

Every workflow run moves through a defined state machine:

  • Pending: received but not yet started
  • Running: actively executing steps
  • Succeeded: all required steps completed successfully
  • Failed: a step failed and the workflow could not continue
  • Cancelled: stopped by a human or an external signal

Each step within the workflow carries its own state. A partially successful fan-out run must show exactly which devices succeeded, which failed, and which were skipped. “The workflow failed” is not useful without the per-device breakdown.
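The run-level state machine is small enough to encode directly, which also makes illegal transitions (a Succeeded run moving back to Running, for instance) impossible by construction. A minimal sketch:

```python
# Workflow run state machine sketch: only the listed transitions are legal;
# Succeeded, Failed, and Cancelled are terminal.
from enum import Enum

class RunState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"

TRANSITIONS = {
    RunState.PENDING: {RunState.RUNNING, RunState.CANCELLED},
    RunState.RUNNING: {RunState.SUCCEEDED, RunState.FAILED, RunState.CANCELLED},
    RunState.SUCCEEDED: set(),
    RunState.FAILED: set(),
    RunState.CANCELLED: set(),
}

def transition(current, target):
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target

state = transition(RunState.PENDING, RunState.RUNNING)
state = transition(state, RunState.SUCCEEDED)  # terminal: no further moves
```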

7.2.4.3. Audit log requirements#

Every workflow run record must contain, at minimum:

  • Who or what triggered the run: human identity, system name, triggering event ID
  • When it started and when it completed
  • Which version of the workflow definition ran
  • The full input the workflow received
  • The output of every step
  • The final outcome

This record must be tamper-evident: it cannot be edited after the fact. In regulated environments, the Orchestrator’s audit log is the change management record. If this record can be altered, the change management system is not trustworthy.

Most orchestration tools write audit logs to a database they also own. Whether that is sufficient depends on your compliance requirements. Some organizations export orchestration audit logs to an append-only external system (a SIEM, a write-once log store) to prevent tampering by anyone with database access, including anyone who can run queries against the orchestrator’s own database.
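One common way to make an audit log tamper-evident without external infrastructure is hash chaining: each record embeds the hash of its predecessor, so editing any historical entry breaks verification from that point on. A sketch under stated assumptions (the record fields are illustrative; a real deployment would still export the chain to an append-only store):

```python
# Hash-chained audit log sketch: editing any record invalidates the chain.
import hashlib
import json

def append_record(chain, record):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps(record, sort_keys=True)
    entry = {
        "record": record,
        "prev": prev_hash,
        "hash": hashlib.sha256((prev_hash + body).encode()).hexdigest(),
    }
    chain.append(entry)
    return chain

def verify_chain(chain):
    prev = "0" * 64
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        if entry["prev"] != prev:
            return False
        if entry["hash"] != hashlib.sha256((prev + body).encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

chain = []
append_record(chain, {"run": "run-1", "trigger": "alice", "outcome": "succeeded"})
append_record(chain, {"run": "run-2", "trigger": "cron", "outcome": "failed"})
verify_chain(chain)                       # True: chain intact
chain[0]["record"]["outcome"] = "failed"  # tamper with history
verify_chain(chain)                       # False: verification now fails
```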

7.2.5. Pipeline Management#

A workflow that configures 800 production switches is, architecturally, production software. The central challenge is not just versioning: it is how you store, validate, and promote workflow definitions so that the logic itself is as reliable as the infrastructure it manages.

The workflow definition encodes decisions: which pre-checks to run, what to do when a device is unreachable, what constitutes success. Changing it carelessly changes what happens to every device that workflow touches next time it runs.

I have seen teams silently break production workflows by editing the workflow definition directly in the orchestration tool’s UI. Nobody noticed until the next run touched a device type the modified step did not handle correctly. The UI edit had no review, no test, no rollback path.

7.2.5.1. Strategies#

  • Git-backed definitions: Store workflow definitions in version control. The orchestration tool pulls from Git on deployment, not from local UI edits. This gives you a history, a review process, and the ability to roll back to any prior version.

  • Blue/green pipeline versions: Maintain two versions of a workflow in production: the current stable version that handles all active workflows, and a new version that receives new runs after a validation period. Traffic shifts only when the new version is proven stable.

  • Canary rollout: A new workflow version handles a small number of runs first. A firmware upgrade workflow might apply the new version to 10 devices before promoting it to the full fleet. Problems surface at 10 devices, not 800.

7.2.5.2. Testing orchestration changes#

Workflow definitions can be tested before production:

  • Dry-run mode: run the workflow against real inputs but skip the steps that modify the network. Verify that the logic produces the expected sequence of actions.
  • Staging environment: a subset of devices dedicated to testing workflow changes before they run against production.
  • Shadow execution: run the new version in parallel with the current version, but ignore its results. Compare outputs to detect divergence before cutting over.

Two different rollbacks: Rolling back a workflow run means undoing the network changes made by a specific execution (handled by the Saga Pattern in section 7.2.3). Rolling back a workflow definition means reverting the workflow logic to a previous version: switch the Git reference, redeploy the definition, and the next run uses the prior version. These are orthogonal operations. Confusing them is how teams end up rolling back the wrong thing.

7.2.6. Solutions Landscape#

Disclaimer: the tools listed here are examples for explanatory purposes, not recommendations. Each has a different architecture and trade-off profile. Evaluate them against your team’s capabilities, existing tooling, and operational constraints.

The orchestration tool market covers a wide range of models, from network-specific platforms to general-purpose workflow engines. The question is rarely “which is best” and almost always “which fits how we operate.”

| Tool | Execution model | What makes it architecturally different | Network automation fit |
|---|---|---|---|
| AWX / Ansible AAP | YAML workflow templates, UI-first | The orchestrator and executor are the same system: Ansible jobs are first-class citizens, no translation layer. RBAC, credentials, and inventory are unified. | Teams already using Ansible for execution; the straightforward path when Ansible is already the execution layer |
| Itential | Low-code/no-code visual builder, network-specific | Purpose-built for network operations: pre-built adapters for ITSM, IPAM, and multi-vendor devices. Workflow builder accessible to non-developers. | Enterprise network teams who need multi-vendor, multi-system integration without custom code |
| Prefect | Python code-as-DAG, developer-first | Workflows are Python functions with decorators. The pipeline is software: tested, versioned, and observed like application code. Strong native observability of the pipeline itself. | Teams comfortable with Python who want to treat orchestration with software engineering discipline |
| Temporal | Durable execution engine, code-defined | Survives crashes mid-workflow: any step replays from its last checkpoint. Saga and compensation are first-class constructs, not bolt-ons. | Long-running workflows (firmware upgrades, large rollouts) where partial execution and rollback must be rock-solid |
| Windmill | Script-first, lightweight, open-source | Each node in the workflow is an independent script (any language). Low operational overhead; easy to self-host and customize without enterprise platform complexity. | Smaller teams or orgs that want flexible custom logic without platform weight |

One architectural question cuts across all of these: does your orchestration tool also serve as your execution engine, or are they separate? AWX/AAP collapses both into one. Prefect, Temporal, and Windmill are orchestrators that call out to separate execution tools (Ansible, custom scripts, APIs). The collapsed model is simpler to operate; the separated model gives you more flexibility to swap execution engines independently. Chapter 3 introduced this trade-off under minimal coupling.

The AWX/AAP row in the table above deserves a clarifying note for teams considering it as their primary platform. AWX workflow templates are the Orchestrator; individual Ansible job templates are the Executor. The architectural boundary between the two exists even though they run within the same platform. A team that puts all logic into job templates (the Executor) and uses workflow templates only to chain them has effectively built an Orchestrator from execution primitives, which limits visibility, retry configuration, and rollback options. Conversely, a team that puts business logic into workflow templates (conditional branching based on external data, dynamic inventory selection) is using AWX as a genuine orchestrator. The distinction matters because AWX’s audit trail, approval gates, and notification hooks are workflow-level features. If all your logic lives in job templates, you cannot use those features without restructuring. This is not a limitation of AWX; it is a consequence of not distinguishing Orchestration from Execution in the design.

These tools overlap significantly: Python runs in both Prefect and Windmill; both Prefect and Temporal can call Ansible. A rough starting point: if your team is Ansible-first and wants minimal new components, AWX workflow templates are the natural fit; if you want testable, code-first pipelines with software engineering discipline, Prefect and Windmill both work with different operational models; if durability under partial failure is the primary constraint, Temporal’s replay model is purpose-built for it. Treat this as a foothold for evaluation, not a final answer.

7.2.7. The Agentic Orchestrator#

The agentic coordination approach listed in 7.2.1 introduces a different execution model: instead of following a pre-defined DAG, a Large Language Model (LLM) receives a goal and determines the sequence of actions at runtime. The agent observes current state from the Observability and SoT blocks, reasons about what action closes the gap, invokes the Executor, and re-observes to verify the outcome. If the result is not what was expected, it reasons again. The decision logic is not encoded in a workflow definition; it lives in the model’s reasoning.

What enables this across the full automation platform is the Model Context Protocol (MCP). Each building block exposes an MCP server: a defined set of tools the agent can call as part of its reasoning loop. The SoT server exposes device queries and intent lookups; the Observability server exposes state queries and compliance checks; the Executor server exposes job triggering and status polling. The agent calls these tools in whatever sequence the situation requires, without the workflow author having pre-coded every possible combination.
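
The tool-calling contract can be illustrated without the real MCP SDK. The sketch below is a minimal stand-in, not the actual protocol implementation: each block registers named tools with a server object, and the agent invokes them by name and arguments, knowing nothing about the block's internals. The `get_device` lookup and its inventory data are hypothetical.

```python
# Minimal stand-in for the MCP tool-calling contract (NOT the real MCP
# SDK). Each building block registers named tools; the agent invokes
# them by (server, tool name, arguments) without knowing the internals.

class ToolServer:
    def __init__(self, name):
        self.name = name
        self._tools = {}

    def tool(self, fn):
        """Decorator: register a function as a callable tool."""
        self._tools[fn.__name__] = fn
        return fn

    def call(self, tool_name, **kwargs):
        return self._tools[tool_name](**kwargs)

sot = ToolServer("sot")

@sot.tool
def get_device(name):
    # Hypothetical lookup; a real SoT server would query Nautobot.
    inventory = {"sw-b1": {"platform": "cisco_ios", "site": "building-b"}}
    return inventory.get(name)

# From the agent's reasoning loop, this is all that is visible:
device = sot.call("get_device", name="sw-b1")
```

The same interface shape works whether the caller is an LLM agent or a DAG step, which is why the blocks need no modification to serve both.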

The architectural implication is that the blocks do not need to know about each other or about the agent. The MCP interface is the contract. The same SoT that answers REST calls from a DAG-based workflow today will answer MCP tool calls from an AI agent without modification. Clean block boundaries pay off here in a way that tight coupling never would.

DAG-based workflows and agentic orchestration are not mutually exclusive. In practice, DAG-based workflows are the right choice for routine, well-defined operations: VLAN deployments, compliance scans, firmware upgrades. These are high-frequency, need predictable audit trails, and should not depend on LLM reasoning. The agentic layer handles the novel: incident-driven remediation, anomaly investigation, situations where the right sequence of actions cannot be determined until the current state is understood. The two approaches can coexist in the same platform, with the DAG routing to an agentic workflow when a situation does not match any known pattern.

This section introduces the pattern. The full architectural treatment, including how agentic orchestration scales to continuous autonomous operation across the full network, is in Chapter 17.

7.3. Implementation Example#

7.3.1. Orchestrating the Campus VLAN Service Lifecycle#

We have been following the same campus network through Part 2. In Chapter 4, we stored the VLAN service request in Nautobot: the VLAN ID, subnet, target switches, and per-vendor configuration templates. In Chapter 5, we used Ansible via AWX to push that configuration to the campus switches. In Chapter 6, we validated the deployment and began monitoring the service.

What we never addressed is who coordinates all of that. Each chapter had an implicit assumption: an engineer ran each step manually. Here we make that explicit and replace the engineer with a workflow.

The scenario

The application team has submitted a request for a new application segment: VLAN app-payments, subnet 10.22.14.0/24, deployed to all access switches in building-b across the Cisco and Arista stacks. The request was submitted through ServiceNow and has just received final approval.

That approval fires a webhook. The Orchestrator receives it and starts the VLAN service deployment workflow.
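
The webhook handler's translation step can be sketched as a pure function: the approval event's fields become the `extra_vars` of an AWX workflow launch request. The webhook field names, the `extra_vars` keys, and the AWX hostname below are assumptions for illustration; the launch endpoint path follows the AWX REST API convention.

```python
# Sketch of the webhook handler's translation step: a ServiceNow approval
# event becomes an AWX workflow launch request. Field names in the webhook
# body and the extra_vars keys are illustrative assumptions.

AWX_BASE = "https://awx.example.net"  # hypothetical AWX instance

def build_launch_request(webhook_body, workflow_template_id):
    """Map the approved request into a workflow launch call."""
    extra_vars = {
        "vlan_name": webhook_body["vlan_name"],
        "subnet": webhook_body["subnet"],
        "site": webhook_body["site"],
        "snow_ticket": webhook_body["ticket"],  # carried through for updates
    }
    url = f"{AWX_BASE}/api/v2/workflow_job_templates/{workflow_template_id}/launch/"
    return url, {"extra_vars": extra_vars}

url, payload = build_launch_request(
    {"vlan_name": "app-payments", "subnet": "10.22.14.0/24",
     "site": "building-b", "ticket": "CHG0031337"},
    workflow_template_id=42,
)
```

Keeping this mapping a pure function makes it trivially testable, independent of both ServiceNow and AWX.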

The workflow

We implement this as an AWX workflow template, consistent with the execution layer already in place. AWX supports workflow templates that chain job templates with conditional branching and approval gates, which is sufficient for this scenario.

```mermaid
flowchart TD
    A[ServiceNow webhook\nVLAN request approved] --> B[Step 1: Validate SoT\nQuery Nautobot for device records\nand VLAN definition completeness]
    B --> C{SoT data complete?}
    C -- No --> Z1[Abort: notify engineer\nand update ServiceNow ticket]
    C -- Yes --> D[Step 2: Fan-out pre-checks\nReachability and VLAN state\nacross all target switches]
    D --> E{Pre-checks passed?}
    E -- Failures --> Z2[Abort: report per-device\nfailures to ServiceNow]
    E -- Passed --> F[Step 3: Approval gate\nOptional human sign-off\nbefore production execution]
    F --> G[Step 4: Execute deployment\nAnsible playbook via AWX]
    G --> H[Step 5: Fan-out validation\nObservability check per switch]
    H --> I{All devices validated?}
    I -- Full success --> J[Step 6: Notify\nUpdate ServiceNow ticket\npost to Slack]
    I -- Partial failure --> K[Step 6a: Saga compensation\nRollback failed scope only]
    K --> J
```

Step 1: SoT validation

The Orchestrator queries Nautobot to confirm that device records and VLAN definition are complete before anything touches the network. This guard step is the Orchestrator’s responsibility, not the Executor’s: the Executor should receive valid input, not discover bad data mid-push. If any check fails, the workflow aborts and updates the ServiceNow ticket with the specific gap. (How SoT validation is structured is covered in Chapter 4.)
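
A minimal sketch of the guard step, assuming the device and VLAN records have already been fetched (a real implementation would pull them from the Nautobot REST or GraphQL API). The required field names are illustrative; the shape of the check, returning a list of specific gaps rather than a boolean, is what lets the abort path report exactly what is missing.

```python
# Sketch of the SoT completeness guard. Field names are illustrative;
# a real implementation would fetch these records from Nautobot first.

REQUIRED_DEVICE_FIELDS = ("name", "platform", "primary_ip")
REQUIRED_VLAN_FIELDS = ("vid", "name", "prefix")

def validate_sot(devices, vlan):
    """Return a list of human-readable gaps; an empty list means proceed."""
    gaps = []
    for dev in devices:
        for field in REQUIRED_DEVICE_FIELDS:
            if not dev.get(field):
                gaps.append(f"device {dev.get('name', '?')}: missing {field}")
    for field in REQUIRED_VLAN_FIELDS:
        if not vlan.get(field):
            gaps.append(f"vlan: missing {field}")
    return gaps

gaps = validate_sot(
    [{"name": "sw-b1", "platform": "cisco_ios", "primary_ip": "10.0.0.11"},
     {"name": "sw-b2", "platform": "arista_eos", "primary_ip": None}],
    {"vid": 214, "name": "app-payments", "prefix": "10.22.14.0/24"},
)
# sw-b2 has no primary_ip, so the workflow aborts with that specific gap.
```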

Step 2: Pre-checks across target switches (fan-out)

The workflow fans out across all target switches in parallel. Each branch runs a lightweight pre-check: reachability, VLAN table state, available capacity. The fan-in step classifies results and branches; the failure threshold is configurable. What matters orchestration-wise is that this is a fan-out/fan-in pattern with conditional branching, not sequential polling. (Pre-check execution patterns are in Chapter 5.)
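
The coordination logic of this step can be sketched independently of the actual checks. Here `check_device` is a hypothetical per-switch pre-check, stubbed so the fan-out/fan-in structure and the configurable threshold stand on their own:

```python
# Sketch of fan-out/fan-in with a configurable failure threshold.
# check_device is a stub; a real check would test reachability,
# VLAN table state, and available capacity on each switch.

from concurrent.futures import ThreadPoolExecutor

def check_device(name):
    # Stubbed result keyed on the name, for illustration only.
    return {"device": name, "ok": not name.endswith("-down")}

def run_prechecks(devices, failure_threshold=0.0):
    """Fan out checks in parallel, fan in, and decide whether to proceed."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(check_device, devices))
    failed = [r["device"] for r in results if not r["ok"]]
    failure_rate = len(failed) / len(results)
    return {"proceed": failure_rate <= failure_threshold, "failed": failed}

decision = run_prechecks(["sw-b1", "sw-b2", "sw-b3-down"], failure_threshold=0.0)
# With failure_threshold=0.0, any single failure aborts the workflow.
```

Raising `failure_threshold` is how a team tolerates a few unreachable devices in a large rollout without blocking the entire deployment.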

Step 3: Approval gate

An optional human sign-off before the production push. An engineer has 30 minutes to approve or reject; if the window expires, the workflow proceeds automatically and the timeout is logged. The Orchestrator holds the workflow in a waiting state with an external dependency until the approval arrives or the window expires.

The approval gate is a policy decision, not an architectural one. High-maturity teams often remove it and replace human approval with automated confidence thresholds derived from pre-check results: if pre-check coverage exceeds 95% and zero critical failures were found, proceed automatically. Teams still building trust keep the gate. Both choices are valid at different points on the automation maturity spectrum discussed in Chapter 1.
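
The automated alternative to the human gate reduces to a small policy function. A minimal sketch, assuming the pre-check step summarizes its results as counts; the 95% coverage figure is the threshold mentioned above and is policy, not architecture:

```python
# Sketch of an automated confidence gate: skip human approval only when
# pre-check coverage is high enough and no critical failures were found.
# The input shape (checked/total/critical_failures) is an assumption.

def auto_approve(precheck, min_coverage=0.95):
    """Return True to skip the human gate, False to require sign-off."""
    coverage = precheck["checked"] / precheck["total"]
    return coverage >= min_coverage and precheck["critical_failures"] == 0

# 48/50 = 96% coverage, no critical failures: proceed automatically.
ok = auto_approve({"checked": 48, "total": 50, "critical_failures": 0})
# A single critical failure forces the human gate regardless of coverage.
blocked = auto_approve({"checked": 48, "total": 50, "critical_failures": 1})
```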

Step 4: Deployment execution

The Orchestrator triggers the AWX job template from Chapter 5, passes the parameters from the SoT validation step, and waits for the result. The Executor handles the rest. This is the separation of concerns in practice: the Orchestrator decides to act, the Executor acts.

Step 5: Observability validation (fan-out)

After deployment, a second fan-out runs a validation job per target switch: does the VLAN appear in the VLAN table, are interfaces in the expected state? The fan-in classifies results: full success, partial success, or failure. (What the Observability layer validates and how is covered in Chapter 6.)
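
The fan-in classification is simple enough to sketch directly. Each result here is a hypothetical `(device, passed)` pair; the three-way outcome is what drives the routing in Step 6:

```python
# Sketch of the fan-in classification after per-switch validation.
# Each result is a (device, passed) pair; the outcome routes Step 6.

def classify(results):
    """Return 'success', 'partial', or 'failure' plus the failed devices."""
    failed = [dev for dev, passed in results if not passed]
    if not failed:
        outcome = "success"
    elif len(failed) < len(results):
        outcome = "partial"
    else:
        outcome = "failure"
    return outcome, failed

outcome, failed = classify([("sw-b1", True), ("sw-b2", True), ("sw-b3", False)])
# One of three switches failed: 'partial', and only sw-b3 is in scope
# for the Saga compensation that follows.
```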

Step 6: Outcome routing

Full success closes the workflow: the ServiceNow ticket is updated and a summary posted to Slack. Partial failure triggers Saga compensation: the rollback playbook runs only for the devices that failed validation; devices that succeeded retain their configuration. The ServiceNow ticket records the per-device outcome.
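
The scoped compensation step can be sketched as follows. `rollback_vlan` is a hypothetical stand-in for triggering the rollback playbook with an inventory limited to one device; the point is that compensation runs only over the failed scope:

```python
# Sketch of scoped Saga compensation: roll back only the devices that
# failed validation. rollback_vlan stands in for launching the rollback
# playbook limited to a single device.

def compensate(all_devices, failed_devices, rollback_vlan):
    """Roll back only the failed scope; report what was kept vs. reverted."""
    reverted = []
    for device in all_devices:
        if device in failed_devices:
            rollback_vlan(device)  # e.g. launch rollback job for this device
            reverted.append(device)
    kept = [d for d in all_devices if d not in failed_devices]
    return {"reverted": reverted, "kept": kept}

log = []
report = compensate(
    ["sw-b1", "sw-b2", "sw-b3"], ["sw-b3"],
    rollback_vlan=log.append,  # stub: record the call instead of launching
)
```

The `report` dictionary is exactly what the per-device outcome in the ServiceNow ticket needs: which devices kept the new configuration and which were reverted.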

Back to the engineer from the opening

This workflow shows all seven building blocks from the Chapter 3 NAF framework working together: Nautobot (Source of Truth) provided the intent, AWX (Executor) applied it, Prometheus (Observability) validated the result, the AWX workflow template (Orchestrator) coordinated the full sequence, and a ServiceNow webhook (Presentation, covered in Chapter 8) triggered the entire flow from the application team’s request.

The engineer from the opening story is still here. She is the one who reviews the approval gate when it fires. She is the one who investigates the partial failure the Saga pattern flagged but could not fully resolve. She is still the expert. What the orchestrator took from her is not the expertise, but the burden of holding the sequence together in her head, re-running it step by step, and being the only person who could. That is what coordination without automation actually costs.

7.4. Summary#

This chapter established that the Orchestrator is the building block that transforms a set of working automation tools into a system that behaves as one. Without it, coordination remains manual, failures require human intervention, and the scale of the network becomes a limit on how much automation is actually possible.

At its core, orchestration rests on a workflow engine that defines how multi-step processes are structured and executed. The choice between monolith, DAG, choreography, and agentic coordination is an architectural decision, not a tooling preference. The DAG model with its named patterns (sequential, fan-out, conditional, Saga) is the production standard for network automation. The agentic pattern, introduced in Section 7.2.7 with Model Context Protocol (MCP) as the interface layer, represents a different trade-off and receives its full treatment in Chapter 17.

Triggering defines how the external world reaches the orchestrator. The distinction between execution-layer triggering (Chapter 5) and orchestration-layer triggering matters architecturally: one drives a single device change, the other drives a coordinated multi-system workflow. Event-driven automation and the Closed-Loop Pattern inherit the non-negotiable properties of idempotency and deduplication precisely because automated triggers fire without human review.

Resilience and scale separate automation that works in demos from automation that works at 3am. Durable state, retry strategies that distinguish transient from permanent failures, the Saga Pattern for partial-failure compensation, and concurrency controls for large device populations are not optional features to add later. They define whether the orchestrator can be trusted when conditions are not ideal.

State and traceability provide the record of what ran, who triggered it, and what every step produced. This record is distinct from network operational state (Chapter 6) and network intent (Chapter 4). Tamper-evident audit logs are a compliance requirement, not an afterthought. Pipeline management closes the loop: workflow definitions are production software and must be versioned, tested, and promoted with the same discipline as any other code running in the automation platform.

The campus VLAN service scenario brought these five functionalities together: a six-step workflow triggered by a ServiceNow webhook, coordinating SoT validation, pre-checks, execution, Observability validation, and Saga compensation, with no engineer in the coordination loop. The natural next question is who sees this workflow running and how the application team that triggered it tracks its progress. That is the Presentation layer, covered in Chapter 8.

💬 Found something to improve? Send feedback for this chapter