3. Architectural Thinking#
This chapter is the entry point to the foundations of this book. It explains why it is important to adopt this mindset, introduces a reference framework proposed by the Network Automation Forum (NAF), and finally suggests how to leverage it in your projects.
Part 2 of the book is all about diving deep into these topics. Thus, in this chapter, we will just present them to give a high-level picture before getting into the weeds. This is important because once we describe each building block, having the overall picture will help you connect the dots.
But being a strong believer in understanding the WHY before the WHAT, let’s first understand the reasons to leverage an architecture.
3.1. Why a reference architecture matters#
“Only the educated are free.” (Epictetus)
A network automation solution is a combination of multiple pieces, each playing a role. In networking, we are used to simple or monolithic solutions that try to behave as all-in-one boxes to solve all problems. This may be true to an extent, but nothing can solve all the diverse challenges and use cases alone. So, if your use case is not one of the most popular ones, you may need to customize or extend.
Without a clear reference architecture, teams often face a predictable set of problems:
- Duplicate Effort: Multiple teams build similar tools (e.g., data sources, duplicated scripts, redundant monitoring systems) independently, wasting resources and creating inconsistent interfaces.
- Integration Gaps: Tools don’t talk to each other cleanly, forcing expensive custom integration work.
- Unclear Responsibilities: It’s unclear which tool should handle observability, orchestration, or state management, leading to confusion and finger-pointing.
- Hard to Extend: Adding new capabilities or tools requires rethinking the entire system.
- Knowledge Silos: Different teams use different mental models, making it hard to onboard new people or maintain solutions.
A reference architecture solves these problems by providing a single clear mental model for how automation systems should be organized. It defines:
- What each component should do -> its responsibilities
- How components interact -> their interfaces and data flows
- Where you can make choices -> which tools to use
- Where you need consistency -> how components talk to each other
A reference architecture is not new for network engineers. We have all grown up with our beloved OSI model. Using the OSI model provides a step-by-step, layered approach to diagnosing and understanding network problems because each layer has different responsibilities and concerns. Network professionals intuitively understand that this separation of concerns makes complex systems manageable.
In network automation, we need something similar that can guide how to develop and interconnect solutions that work together. It helps teams make consistent decisions, reuse components across projects, and evolve their automation capabilities without reinventing the wheel each time.
3.2. A Network Automation Architecture#
Consequently, we need an architecture. But notice the ‘A’ in the section title: there isn’t just one architecture—there are many. An architecture is simply a reference framework—a way to structure thinking and guide consistent decisions. I have contributed to three initiatives:
- Network to Code reference architecture, a collective effort led by the NTC Architecture team (which I was part of for four years), designed to support efficient and understandable network automation across a wide range of use cases.
- Network Programmability and Automation, 2nd Edition (O’Reilly), an effort to connect the dots across all the book’s content, written with my co-authors Matt Oswalt, Scott S. Lowe, and Jason Edelman.
- NAF Framework, a community project under the umbrella of the NAF, led together with Wim Henderickx, Dinesh Dutt, Claudia de Luna, Ryan Shaw, and Damien Garros, with the goal of helping both newcomers and experienced practitioners build automation solutions in a structured, repeatable way.
Across these iterations, I focused on maintaining consistency and learning from each evolution. All three approaches are useful. However, I propose using the NAF Framework as the current best practice: it is community-driven, vendor-agnostic, and broadly applicable. It consolidates years of collective learning and is actively maintained by an engaged and growing community.
The NAF Framework proposes the following architecture:
block-beta
columns 7
space:1
block:layer1:5
Presentation["Presentation"]
end
space:1
space:7
block:Observability:2
columns 2
%% ObsLabel["Observability"]:2
ObservedState[("Observed State")]:1
ObservedLogic["Observed Logic"]:1
end
space
block:Orchestration:1
columns 1
OrchLabel["Orchestration"]:1
end
space
block:Intent:2
columns 2
%% IntLabel["Intent"]:2
IntendedState[("Intended State")]:1
IntendedLogic["Intended Logic"]:1
end
space:7
space
Collector["Collector"]:2
space
Executor["Executor"]:2
space
space:7
space:1
block:layer4:5
NetworkInfra["Network Infrastructure"]
end
Presentation <--> Observability
Presentation <--> Orchestration
Presentation <--> Intent
Observability <--> Orchestration
Orchestration <--> Intent
Collector --> Observability
Collector <--> Orchestration
NetworkInfra --> Collector
Orchestration --> Executor
Intent --> Executor
Executor --> NetworkInfra
classDef darkStyle fill:#5a4149,stroke:#4a9eff,stroke-width:2px,color:#e8e8e8,font-size:20px,font-weight:bold
class Presentation,NetworkInfra,ObsLabel,IntLabel,Collector,Executor,ObservedState,ObservedLogic,IntendedState,IntendedLogic,OrchLabel darkStyle
This architecture contains the following blocks, using the definitions from the document:
- Intent: Defines the logic to handle and the persistence layer to store the desired state of the network, including both configuration and operational expectations.
- Executor: Encompasses the actual tasks applied to the network to drive changes (e.g., updating configuration) as guided by the intended state.
- Collector: In contrast to Executor, this component focuses on retrieving (i.e., reading) the actual state of the network.
- Observability: It persists the actual network state, and defines the logic to process it.
- Orchestrator: Defines how the automation tasks are coordinated and executed in response to events.
- Presentation: Provides the interfaces through which users interact with the system, including dashboards, graphical user interfaces (e.g., ITSM), and CLI tools.
This architecture is not arbitrary; it’s the natural consequence of applying software engineering principles to network operations.
The NAF framework aims to help network automation architects organize their solutions by referencing the functionalities for each module, and defining MUSTs, SHOULDs, and MAYs (using the RFC style). Then, you can decide how you implement them to solve your own use cases.
Having multiple components doesn’t mean that you have to pick one tool per component. In many cases, a single tool could implement several functionalities, but this comes with compromises that you have to understand (and acknowledge).
Now, we will go over the different blocks to introduce them.
3.2.1. Intent or Source of Truth#
The Intent is the data that defines everything you need to bring your network from nothing to fully operational, encompassing all the lifecycle states of all kinds of network services: from pre-provisioning, to bootstrap, to fully operational, to final decommissioning.
The architecture names it Intent, but I match it with another popular term: Source of Truth (SoT). The SoT term is sometimes misunderstood, and I want to start by clarifying what the SoT or Intent is (I’m going to use them interchangeably throughout the book).
As you can imagine, this could represent very diverse data, including devices, IP addressing, data center infrastructure (e.g., racks, cables), routing protocols, virtualized services, secrets, operational limits, configuration templates, and service or policy abstractions. One key aspect is that this data must be structured so machines can understand it. It also has to support create, read, update, and delete (CRUD) operations and be accessible through a standardized, well-documented API such as REST or GraphQL.
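To make this concrete, here is a minimal, purely illustrative sketch of a Source of Truth exposing CRUD operations over structured device records. The field names (hostname, role, site, mgmt_ip) are hypothetical, not an actual NetBox or Nautobot schema; a real SoT would sit behind a REST or GraphQL API rather than in memory:

```python
from dataclasses import dataclass


@dataclass
class DeviceIntent:
    # Hypothetical fields for illustration; real SoT schemas differ.
    hostname: str
    role: str
    site: str
    mgmt_ip: str


class SourceOfTruth:
    """Toy in-memory SoT supporting create/read/update/delete."""

    def __init__(self):
        self._devices: dict[str, DeviceIntent] = {}

    def create(self, dev: DeviceIntent) -> None:
        self._devices[dev.hostname] = dev

    def read(self, hostname: str) -> DeviceIntent:
        return self._devices[hostname]

    def update(self, hostname: str, **changes) -> None:
        dev = self._devices[hostname]
        for key, value in changes.items():
            setattr(dev, key, value)

    def delete(self, hostname: str) -> None:
        del self._devices[hostname]


sot = SourceOfTruth()
sot.create(DeviceIntent("rtr-01", "edge", "fra1", "10.0.0.1"))
sot.update("rtr-01", role="core")
print(sot.read("rtr-01").role)  # core
```

The point is the shape of the interface, not the storage: consumers only see structured records and CRUD verbs.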
By analogy with manual network management, this mostly matches the role of the network architect who defines the design, the network schema, or the network planner who defines the BOM for a new network deployment.
Ideally, this component should provide a consistent and unified view of the desired state, even when the data is distributed across multiple data sources. These data sources, which I refer to as Systems of Record (SoR), are the actual owners of the data. So, this is one of the first things to clarify: it’s extremely rare to have only one data source, but the data has to be consistent when consolidated. Data management is a complex topic that requires data governance features, including some form of metadata (such as timestamps, data origin, data ownership, and validity periods), to facilitate understanding and managing this data.
Moreover, connected to the features that enable trustworthy network automation, this block should offer transactional and versioned access to the data that enables predictable and reliable network automation processes.
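As a toy illustration of versioned access (not how any particular SoT implements it), every commit could snapshot the full desired state, so downstream consumers can pin to a specific version for predictable runs:

```python
import copy


class VersionedIntent:
    """Toy versioned store: each commit snapshots the full desired state."""

    def __init__(self):
        self._versions: list[dict] = []

    def commit(self, state: dict) -> int:
        # Deep-copy so later mutations of the caller's dict don't leak in.
        self._versions.append(copy.deepcopy(state))
        return len(self._versions) - 1

    def get(self, version: int = -1) -> dict:
        # Default: latest version; copies out so callers can't mutate history.
        return copy.deepcopy(self._versions[version])


store = VersionedIntent()
v0 = store.commit({"vlan": 100})
v1 = store.commit({"vlan": 200})
print(store.get(v0)["vlan"], store.get()["vlan"])  # 100 200
```

Real systems (Git-backed files, Infrahub branches, database snapshots) achieve this far more efficiently, but the consumer-facing contract is the same: ask for state at a version, get an immutable answer.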
If you are trying to understand which actual tools could fit into this category, here are a few examples: CSV/YAML/JSON files, Git, NetBox, Nautobot, Infrahub, Infoblox, or any general-purpose databases.
See Chapter 4 for a deep dive into building and managing your Source of Truth.
3.2.2. Executor#
The executor is in charge of implementing (e.g., writing) the desired state coming from the Intent into the actual network state, interacting with the network via different interfaces such as SSH, NETCONF, gNMI/gNOI, and REST APIs.
Continuing with the analogy to manual network management, the executor would be the network engineer who prepares the final network configuration and connects to the network device via CLI to enter the configuration commands.
However, this is not as simple as copying and pasting data from one place (the Intent) to another (the network). There are many operations to be taken into account, from network changes to rebooting or upgrading systems. Also, even though most of the time the reference data comes from the Intent block, in some cases, the observed data coming from Observability may influence this component, so both have to be combined.
For example, the Executor might validate device reachability before attempting changes, adapt failover logic based on current network state, or skip execution if Observability shows a critical dependency service is down. This feedback from Observability ensures that automation decisions are informed by real-time network conditions, not just Intent alone.
Similar to the Intent, this component should provide features such as dry-run, transactional changes, and idempotency (via declarative or imperative approaches) that help build automation systems that network engineers can rely upon. This is usually the component that network engineers care more about (initially) because it’s the one actually “changing” the network.
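The idempotency idea can be sketched in a few lines: compute only the delta between desired and running state, so re-running against an already compliant device produces an empty plan. The config lines and the dry-run result shape below are invented for illustration; a real Executor would push the plan via Netmiko, NETCONF, etc.:

```python
def plan_change(desired: set[str], running: set[str]) -> tuple[list[str], list[str]]:
    """Return (lines to add, lines to remove). On a compliant device the
    plan is empty -- the essence of idempotency."""
    to_add = sorted(desired - running)
    to_remove = sorted(running - desired)
    return to_add, to_remove


def apply_change(desired: set[str], running: set[str], dry_run: bool = True) -> dict:
    to_add, to_remove = plan_change(desired, running)
    if dry_run:
        # Dry-run: report the plan without touching the device.
        return {"would_add": to_add, "would_remove": to_remove}
    # A real Executor would send commands to the device here.
    return {"added": to_add, "removed": to_remove}


desired = {"ntp server 10.0.0.1", "snmp-server community ro"}
running = {"ntp server 10.0.0.9", "snmp-server community ro"}
print(apply_change(desired, running))
# {'would_add': ['ntp server 10.0.0.1'], 'would_remove': ['ntp server 10.0.0.9']}
```

Declarative tools like Ansible modules or Terraform providers perform this desired-vs-actual diff for you; the sketch just makes the mechanism visible.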
Tools that mostly fit into this category are Ansible, Terraform/OpenTofu, or any kind of scripts that leverage libraries such as Netmiko, Scrapli, or NAPALM, or even a Kubernetes CRD (Custom Resource Definition).
Head to Chapter 5 to explore execution frameworks, idempotency patterns, and implementation strategies.
3.2.3. Collector#
Analogous to the Executor, the Collector is in charge of retrieving the actual operational data from the network via different interfaces and protocols (the same interfaces as the Executor, plus others such as SNMP, syslog, or flow-based telemetry).
The destination of all this data is the Observability block, where the data is used to power automation decisions.
One important topic here, not usually considered, is the need for normalized data to be comparable. So, getting consistent metric names for all vendors (e.g., system_version defining the OS version independently of the platform) and consistent metadata to allow advanced data processing is crucial.
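A minimal sketch of this normalization, with invented vendor field names standing in for real platform output:

```python
# Illustrative vendor-to-common-schema field mappings; the vendor-side
# key names are made up, not real platform output keys.
FIELD_MAP = {
    "vendor_a": {"sw_ver": "system_version", "up_secs": "uptime_seconds"},
    "vendor_b": {"osVersion": "system_version", "uptime": "uptime_seconds"},
}


def normalize(vendor: str, raw: dict) -> dict:
    """Rename vendor-specific keys into the common schema;
    unknown keys pass through unchanged."""
    mapping = FIELD_MAP[vendor]
    return {mapping.get(key, key): value for key, value in raw.items()}


a = normalize("vendor_a", {"sw_ver": "7.4.1", "up_secs": 120})
b = normalize("vendor_b", {"osVersion": "21.3R1", "uptime": 360})
print(a["system_version"], b["system_version"])  # 7.4.1 21.3R1
```

Once both vendors report `system_version`, downstream queries and alerts can treat the fleet uniformly. Tools like Telegraf implement this with processor plugins rather than hand-written maps.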
Examples of tools that fit into this category are Telegraf, Vector, gNMIc, PMACCT, goFlow, Akvorado, or any kind of script that leverages libraries to fetch data from the network.
Because of their affinity, I will cover the Collector together with the next building block, Observability, in Chapter 6.
3.2.4. Observability#
Closely related to the Collector (in many cases seen together), Observability receives the collected data, supports its persistence, and offers programmatic access to support advanced analytics, reporting, and troubleshooting workflows, ideally using capable query languages (such as PromQL).
The data from this block must expose relevant discrepancies between the desired state and the actual state of the network, generating events (or alerts) that could trigger mitigation systems.
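The desired-vs-actual comparison can be sketched as a simple diff that emits one event per discrepancy; the key format and severity label below are illustrative, not a standard:

```python
def detect_drift(desired: dict, observed: dict) -> list[dict]:
    """Emit one event per key whose observed value differs from intent.
    A missing observation surfaces as observed=None."""
    events = []
    for key, want in desired.items():
        got = observed.get(key)
        if got != want:
            events.append(
                {"key": key, "desired": want, "observed": got, "severity": "warning"}
            )
    return events


events = detect_drift(
    {"rtr-01:Gi0/0:mtu": 9000, "rtr-01:Gi0/1:mtu": 9000},
    {"rtr-01:Gi0/0:mtu": 9000, "rtr-01:Gi0/1:mtu": 1500},
)
print(len(events))  # 1
```

In practice this comparison often lives inside a query language (e.g., a PromQL expression joining intended and observed series) rather than imperative code, but the logic is the same.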
Continuing with the manual network management analogy, this fits the role of the network operations center, which checks the actual state of the network and reacts accordingly.
Moreover, the observed data can be enriched with contextual information from the intended state, including other third-party sources (e.g., EoL information, CVEs, maintenance notifications, etc.), enhancing analysis and enabling more accurate data correlation.
In traditional network monitoring, all the functionalities of Observability (and collection) were integrated into one big box (for example, Nagios, LibreNMS, or Spectrum).
Today, more specialized systems that integrate with each other are gaining popularity: time-series databases such as Prometheus and InfluxDB, search engines such as Elasticsearch, and related systems that manage alerts (e.g., Alertmanager) and visualize data (e.g., Grafana, Kibana). This has led to some popular stacks: ELK, TIG, or TPG.
Learn more about monitoring architecture, alerting strategies, and observability best practices in Chapter 6.
3.2.5. Orchestrator#
So far, you have seen many different components, and there is a need to coordinate them: integrating the processes of the different building blocks and creating sophisticated end-to-end automation workflows. Because of this, the orchestrator must support multiple ways to interact, from manual triggering to fully event-driven workflows, both synchronously and asynchronously.
The workflows that the orchestrator implements should provide ways to define multiple steps, and must also support rollback and dry-run features, relying on the other components (for example, using versioned snapshots from the intent block).
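A toy sketch of such a workflow runner, with rollback on failure; real orchestrators like AAP, Prefect, or Windmill provide this far more robustly (retries, persistence, visibility), and the step names below are invented:

```python
def run_workflow(steps) -> str:
    """Run (name, do, undo) steps in order; on failure, undo the
    completed steps in reverse order -- a crude rollback."""
    done = []
    try:
        for name, do, undo in steps:
            do()
            done.append((name, undo))
        return "success"
    except Exception:
        for name, undo in reversed(done):
            undo()
        return "rolled_back"


log = []


def failing_push():
    # Stands in for a device push that fails mid-workflow.
    raise RuntimeError("device down")


steps = [
    ("reserve_ip", lambda: log.append("reserve_ip"), lambda: log.append("release_ip")),
    ("push_config", failing_push, lambda: log.append("revert_config")),
]
result = run_workflow(steps)
print(result, log)  # rolled_back ['reserve_ip', 'release_ip']
```

Note that only the steps that completed are undone, and in reverse order; the failed step’s own undo is never called.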
In manual operations, this is usually loosely defined in a checklist for a process that has to be followed.
This component provides users with a comprehensive visualization of what’s going on in the whole network automation, showing how the different processes come together, via clear logging and traceability.
AI/ML is gaining a lot of traction here because of the assisted decision-making it can bring to determining the course of action.
Examples here are Ansible Automation Platform (AAP) or its upstream project AWX, Windmill, or Prefect.
Dive into Chapter 7 to learn about workflow design, event-driven automation, and orchestration platforms.
3.2.6. Presentation#
Sometimes we forget how important it is to expose the proper interface to the users of network automation. This is usually a blurry area because all the tools we have already introduced have some sort of interface: CLI, API, or web-based. In some cases, you may decide it’s good enough. In other cases, you may create your own user interface to get the best of all worlds and simplify user experience. It depends.
Traditionally, this has been limited to phone calls or email for external users, and the CLI for network teams.
One way or another, this layer must allow flexible authentication and authorization (it’s the entry point of your system), and then adjust to the actual needs. Sometimes, this could be a network-specific platform, or sometimes it could integrate with company-wide systems (e.g., ServiceNow, Slack, etc.).
It could cover write or read operations, depending on the role, but always consider the needs of your different types of users to get the best results.
Explore user experience, interface design, and integration patterns in Chapter 8.
3.2.7. The Network#
Last but not least, we have to understand the capabilities of the network to support automation. The network is no longer just devices connected by cables: in the 2020s, virtualization and cloud-based networking solutions have turned end-to-end connectivity (the network) into a hybrid and diverse environment.
The network itself is both a constraint and an enabler for your entire automation architecture. Unlike the previous six blocks, which you have control over, the network infrastructure (your devices, platforms, services, and connectivity) may have limitations that affect what’s possible.
Network capabilities fall into several categories:
- Interfaces & Protocols: What management interfaces does your network support? SSH CLI? NETCONF? gNMI? REST APIs? SNMP? Some platforms support many, some support few. These choices directly constrain what Collector and Executor can do.
- Data Models: Even if a device supports gNMI, the YANG models it exposes may be incomplete. For example, your device may claim gNMI support but not expose the specific configuration you need to manage. Understanding these gaps is critical during planning.
- Operational Maturity: Newer platforms may have modern APIs but undocumented behaviors. Older platforms may be stable but lack APIs entirely. You need to assess the actual maturity, not just the marketed features.
To effectively support automation, your network infrastructure should ideally provide:
- Development & Testing Environments: Like software development, automation requires safe places to test. This might mean lab networks (often expensive and limited), virtual network environments (Containerlab, GNS3), or vendor-provided simulators (e.g., Cisco DevNet sandboxes, Arista vEOS).
- Consistent Interfaces: If 90% of your devices support NETCONF but 10% only support SSH, you’ll need either to standardize or to build multiple Executors. Every inconsistency increases complexity.
- Adequate Telemetry: If you can’t get the data you need from your devices, Observability becomes meaningless. Ensure your devices can stream telemetry (streaming telemetry, SNMP, syslog) at the granularity and information you need.
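As a quick illustration of assessing interface consistency, one could group a device inventory by supported management interface so gaps like “NETCONF everywhere except the old switches” become visible; the inventory data below is made up:

```python
from collections import defaultdict


def interface_coverage(inventory: dict) -> dict:
    """Map each management interface to the sorted list of devices
    that support it."""
    coverage = defaultdict(list)
    for device, interfaces in inventory.items():
        for iface in interfaces:
            coverage[iface].append(device)
    return {iface: sorted(devs) for iface, devs in coverage.items()}


inventory = {  # hypothetical capability data, e.g. from datasheets or probing
    "rtr-01": {"ssh", "netconf", "gnmi"},
    "rtr-02": {"ssh", "netconf"},
    "sw-99": {"ssh", "snmp"},
}
cov = interface_coverage(inventory)
print(cov["netconf"])  # ['rtr-01', 'rtr-02']
```

Here SSH is the only universal interface, which immediately tells you a NETCONF-only Executor would strand `sw-99`.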
The network is the hardest part of the architecture to change. You can’t easily swap your routers. So understand its constraints early, and design your automation within them. This is why the Network block deserves careful attention during architecture planning.
Understand network capabilities, device APIs, and test environments in Chapter 9.
Next, to summarize this section, let’s go over a very simple use case and how the architecture maps to it.
3.2.8. A Practical Example#
Before we dive into each block, let’s walk through a concrete scenario that shows how the architecture works together. Imagine you need to:
Problem: Roll out a new subnet (10.100.0.0/24) across 50 branch sites, validate that it’s working, and alert if any site falls out of compliance.
How the architecture can solve it:
- Intent (Source of Truth): A network engineer enters the desired configuration in Nautobot: subnet definition, VLAN, routing configuration. This is stored as the desired state.
- Orchestrator: AWX receives a webhook from Nautobot about the change and triggers the Executor.
- Executor: An Ansible playbook reads from the Nautobot API, generates device-specific configurations, and pushes them via NETCONF to all 50 routers. Each deployment is idempotent: running it again won’t cause duplicate configs.
- Collector: After deployment, a gNMIc collector connects to each device and retrieves the actual interface configurations, IP addresses, and routing table entries.
- Observability: The collected data flows into a Prometheus TSDB. A query compares desired subnet configuration (from Intent) against actual running config (from Collector). Any mismatches are exposed as metrics and trigger alerts.
- Orchestrator: When an alert fires (e.g., “10.100.0.0/24 not present on site-15”), an AWX workflow automatically triggers a remediation: it re-runs the Executor on site-15, then re-collects data, then validates the fix.
- Presentation: A dashboard shows all 50 sites, highlighting which ones are compliant (✅) and which ones have drifted (❌). Non-technical stakeholders can click a site to see details in a Grafana dashboard, and IT ops can trigger manual remediations without writing code.
- The Network: The success of this entire flow depends on whether your devices support NETCONF. If half your routers only support SSH CLI, you’d need a different approach (less elegant, more custom scripting), or accept that some capabilities (e.g., transactionality) won’t be achievable.
The Key Insight: No single tool did this. Nautobot (Intent), Ansible (Executor), gNMIc (Collector), Prometheus (Observability), AWX (Orchestrator), and a custom Grafana dashboard (Presentation) all worked together. The architecture provided the mental model for how to integrate them.
This is what architectural thinking enables: systematic, composable automation.
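The compliance comparison at the heart of this example boils down to a simple membership check per site; the site data below is invented, standing in for what the Collector would retrieve:

```python
DESIRED_SUBNET = "10.100.0.0/24"  # the intent for all 50 branch sites


def compliance_report(observed_subnets: dict) -> dict:
    """Map each site to True/False depending on whether the intended
    subnet shows up in its collected routing data."""
    return {site: DESIRED_SUBNET in subnets for site, subnets in observed_subnets.items()}


observed = {  # hypothetical Collector output for three of the 50 sites
    "site-01": {"10.100.0.0/24", "192.168.1.0/24"},
    "site-15": {"192.168.15.0/24"},
    "site-42": {"10.100.0.0/24"},
}
report = compliance_report(observed)
drifted = [site for site, ok in report.items() if not ok]
print(drifted)  # ['site-15']
```

In the full flow, `drifted` is what Observability would expose as metrics, and each entry is what would trigger the Orchestrator’s remediation workflow.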
3.3. How to Use an Architecture#
Now that you understand the different building blocks of the NAF network automation architecture, you might be wondering: “How do I actually get started with this in my projects?”
3.3.1. Sequential Approach for Adoption#
The reference architecture is not a prescription. It’s a framework for thinking about and organizing your automation solutions. Here’s a practical, sequential approach to applying it:
flowchart LR
A[Understand Your Current State]:::phase1 --> B[Plan Your Automation Journey]:::phase2
B --> C[Make Better Tool & Design Decisions]:::phase3
C --> D[Design for Integration]:::phase4
D --> E[Communicate with Stakeholders]:::phase5
E --> F[Evolve Incrementally]:::phase6
classDef phase1 fill:#e0f7fa,stroke:#333,stroke-width:2px;
classDef phase2 fill:#b2ebf2,stroke:#333,stroke-width:2px;
classDef phase3 fill:#80deea,stroke:#333,stroke-width:2px;
classDef phase4 fill:#4dd0e1,stroke:#333,stroke-width:2px;
classDef phase5 fill:#26c6da,stroke:#333,stroke-width:2px;
classDef phase6 fill:#00bcd4,stroke:#333,stroke-width:2px;
Phase 1: Understand Your Current State
Start by mapping your existing tools and processes to the architectural blocks. This assessment exercise helps you identify gaps, overlaps, and potential areas for improvement:
- Is your data scattered across multiple systems?
- Are you collecting network state effectively, or are you flying blind?
- How do you currently execute changes? Manually? Ad-hoc scripts? Organized frameworks?
If you discover excellent execution capabilities but poor observability, you’ve identified a high-impact problem to solve next.
Phase 2: Plan Your Automation Journey
Use the architecture to guide your automation roadmap. You don’t need to implement all blocks at once. Start with components that address your most critical pain points:
- If you’re struggling with configuration drift, focus on Intent/Source of Truth and Execution.
- If you can’t detect problems quickly, prioritize Collector and Observability.
- If your automation is fragmented and unreliable, invest in Orchestration.
- If users are confused about how to request changes, improve Presentation.
Prioritize by business impact, not by architectural completeness.
Phase 3: Make Better Tool & Design Decisions
When evaluating new tools or building custom solutions, ask yourself:
- Which architectural block does this serve?
- Does it integrate well with my other components?
- What data flows do I need between blocks?
This prevents tool sprawl and ensures your automation ecosystem remains cohesive. It also clarifies whether you should build or buy. If a tool cleanly fits one block and has well-defined interfaces, it’s usually better to buy than build.
Phase 4: Design for Integration
Understanding the boundaries between components helps you design better interfaces. Key principle: components should not need to know each other’s internals.
- Your Executor doesn’t need to know how Intent stores data: it just needs a well-defined API to query desired state.
- Your Orchestrator shouldn’t care whether you’re using Ansible or Terraform: it just triggers execution and monitors results.
- Your Observability system shouldn’t need to know about Collector internals: it just needs clear metrics and events.
This decoupling is what allows you to evolve each component independently.
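This decoupling can be sketched with narrow callable interfaces: the coordinating logic below never learns which Intent or Executor backend it is talking to (all backends here are stubs, and the function names are hypothetical):

```python
from typing import Callable


def run_deployment(get_desired_state: Callable[[], dict],
                   execute: Callable[[dict], str]) -> str:
    """Coordination logic that only sees two narrow interfaces: one to
    fetch desired state, one to apply it. Backends are swappable."""
    return execute(get_desired_state())


# Two interchangeable 'Intent' backends (stubs for Git / NetBox):
def git_intent() -> dict:
    return {"vlan": 100}


def netbox_intent() -> dict:
    return {"vlan": 100}


# Two interchangeable 'Executor' backends (stubs for Ansible / Terraform):
def fake_ansible(state: dict) -> str:
    return f"ansible applied vlan {state['vlan']}"


def fake_terraform(state: dict) -> str:
    return f"terraform applied vlan {state['vlan']}"


print(run_deployment(git_intent, fake_ansible))
print(run_deployment(netbox_intent, fake_terraform))
```

Swapping a backend means changing one argument, not rewriting `run_deployment`: exactly the evolution-without-redesign property the architecture is after.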
Phase 5: Communicate with Stakeholders
The architecture provides a common language for discussing automation with different audiences:
- To management: “We’re strengthening our Observability posture to decrease MTTR.”
- To your team: “Let’s standardize how our Collector sends data to Observability.”
- To other departments: “Our Presentation layer will integrate with your ITSM instance.”
Clear architectural language reduces confusion and helps secure buy-in.
Phase 6: Evolve Incrementally
The architecture allows you to swap components as your needs change:
- Today you might use Git as your Source of Truth, but tomorrow you could adopt NetBox or Infrahub without completely redesigning your automation workflows, as long as you maintain clear interfaces.
- You might start with a simple script as your Orchestrator, later replacing it with AAP or Windmill.
- You can migrate from one TSDB to another in Observability without disrupting how the Collector delivers its data.
This evolutionary approach minimizes risk and allows you to learn as you go.
Don’t fall into the trap of over-engineering. The goal is not architectural purity, but practical automation that solves real problems. Sometimes the best solution is a simple script that spans multiple architectural blocks. The architecture is a guide, not a straitjacket.
The key takeaway: use the architecture as a mental model to organize your thinking, identify gaps, and make deliberate design decisions. Not as a rigid template that dictates every implementation detail.
3.3.2. Common Pitfalls to Avoid#
Learning from others’ mistakes can save you significant time and effort. Here are common pitfalls when applying the reference architecture:
Trying to Implement Everything at Once
It’s tempting to think: “If this architecture is good, I should build all seven blocks perfectly.” This leads to massive, multi-year projects that become outdated before completion.
Better approach: Start with one or two blocks that solve your most urgent problem. Build incrementally. Early success builds momentum and buy-in.
Believing “One Tool to Rule Them All”
Some vendors claim their platform handles everything. While this might be true, it often means:
- You’re locked into their specific way of doing things
- It’s hard to swap out components later
- You might be paying for features you don’t need in blocks you don’t care about
Better approach: Choose best-of-breed tools for each block, but ensure they have clear APIs and integration points. Accept that you may need to integrate 3-5 tools, but you’ll have flexibility.
Ignoring Network Constraints
You design a beautiful Executor using gNMI, but 30% of your devices only support SSH CLI. Or you want streaming telemetry, but your older platforms only support periodic SNMP.
Better approach: Understand your network infrastructure’s capabilities first; they constrain your architecture. You can’t ignore physics, and you can’t ignore what your devices can do.
Assuming Interfaces Will Be Simple
You say: “Executor will just call Intent’s API for desired state.” But Intent uses NetBox with custom extensions, and Executor expects flat YAML. Suddenly you’re writing translation layers.
Better approach: Invest upfront in clear, well-documented interfaces. Use standards where possible (REST APIs, gRPC, clear schema definitions). The cost of good interfaces early prevents much larger costs later.
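A tiny example of such a translation layer, flattening a hypothetical nested SoT record (the shape below is invented, not an actual NetBox payload) into the flat keys an Executor template might expect:

```python
def to_executor_format(sot_record: dict) -> dict:
    """Adapt a nested SoT-style record to the flat dict the Executor's
    templates consume. All key names here are illustrative."""
    return {
        "hostname": sot_record["name"],
        # Strip the prefix length: templates want a bare address.
        "mgmt_ip": sot_record["primary_ip"]["address"].split("/")[0],
        "site": sot_record["site"]["slug"],
    }


record = {
    "name": "rtr-01",
    "primary_ip": {"address": "10.0.0.1/32"},
    "site": {"slug": "fra1"},
}
print(to_executor_format(record))
# {'hostname': 'rtr-01', 'mgmt_ip': '10.0.0.1', 'site': 'fra1'}
```

Small adapters like this are cheap when planned, but they multiply painfully when every integration discovers its own mismatch after the fact, which is why agreeing on interfaces up front pays off.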
Building Custom Tools When Good Ones Exist
Your team decides to build a custom Collector because “no tool exactly meets our needs”. Six months later, you have 3,000 lines of custom code and a proprietary telemetry pipeline to maintain.
Better approach: Evaluate existing tools (e.g., Telegraf, Vector, gNMIc). They handle 80% of use cases and are battle-tested. Customize them or build adapters if needed, but don’t build from scratch.
Treating Observability as an Afterthought
Many teams focus on Intent and Executor, then realize too late that they can’t see what’s actually happening in the network. Observability is bolted on last.
Better approach: Plan Observability from day one. What metrics will you collect? How will you detect drift? What alerts matter? Answer these before you build Executor.
Forgetting About Users
Engineers build a powerful Orchestrator, but the only way users can interact with it is via CLI commands. Non-technical users are confused; adoption is poor.
Better approach: Consider your users early. What interfaces do they need? APIs? Web UI? ServiceNow integration? Sometimes the Presentation layer is what makes or breaks adoption.
These pitfalls are not theoretical; they’re patterns from real automation projects. Learning from them now will make your architecture stronger.
3.4. Summary#
Architectural thinking is essential for building successful network automation solutions. Just as the OSI model provides a layered framework for understanding networks, a reference architecture helps you organize and design automation systems that are maintainable, scalable, and reliable.
This chapter introduced the NAF Framework, which defines seven key building blocks. The architecture is not a rigid prescription but a flexible framework. Use it to assess your current state, plan your automation journey, make informed tool decisions, communicate with stakeholders, design for integration, and evolve incrementally. Remember: the goal is practical automation that solves real problems, not architectural purity.
In Part 2 of this book, we’ll dive deep into each of these building blocks, exploring implementation patterns, best practices, and real-world examples that you can apply to your own automation projects.