Nov 30, 2025 · 6537 words · 31 min read

2. Design Principles#

In this chapter, we explore the foundational design principles that make network automation reliable, scalable, and safe to operate. These principles are not abstract theories — they are the difference between automation that gets adopted and positively impacts your business, and automation that gets ignored and eventually fails.

While many of the following design principles apply equally to other software projects, network automation has some unique characteristics because it supports building and operating networks—a fundamental pillar of IT infrastructure. Network engineers have built and maintained critical systems for decades using a model that emphasizes caution, precision, and manual verification. Now we’re asking them to adopt a fundamentally different operational model, and change requires trust.

2.1 Building Trust#

Before diving into the specific design principles, it is essential to understand the foundation of trust. Trust is the cornerstone of successful network automation, and without it, adoption becomes a significant challenge.

Why is trust so important? Because without trust, you won’t gain the required adoption by the network engineers. Adopting a new operational model requires that automation provides, at minimum, the same level of confidence as the manual processes it replaces (plus other benefits).

This becomes even more challenging when the automation systems are often built by engineers who may not have deep networking experience. As a result, automation must compensate by embedding strong, explicit characteristics that make it safe and dependable, and allow easy customization and management by the network engineering teams.

The relevance of trustworthy automation is inspired by Damien Garros’ presentation Building Trustworthy Network Automation at Autocon3.

To simplify, these are the four basic characteristics that we must adhere to:

  • Predictable: It should produce consistent, deterministic outcomes. Engineers must be able to anticipate what will happen before pressing “go,” and always get the same behavior regardless of how many times it is executed.
  • Reliable: It must handle errors gracefully, recover from failures, and ensure operations complete safely (or roll back to a working state), even under unexpected conditions and at the required scale.
  • Usable: Engineers need interfaces that allow them to validate, reason about, and control behavior without excessive complexity and with the necessary guardrails.
  • Understandable: The system cannot be a black box. It must expose intent, steps, results, and decisions in a transparent manner that builds human confidence.
graph BT
  %% ===== Middle Layer =====
  subgraph L2[**Qualities**]
    direction LR
    B1[Predictable]:::layer2
    B2[Reliable]:::layer2
    B3[Usable]:::layer2
    B4[Understandable]:::layer2
  end

  %% ===== Top Layer =====
  subgraph L1[" "]
    direction TB
    A[Trust]:::layer1
  end

  %% ===== Connections: Behavior → Outcome =====
  B1 --> A
  B2 --> A
  B3 --> A
  B4 --> A

  %% ===== Styling =====
  classDef layer1 fill:#ffcccc,stroke:#b8860b,stroke-width:2px,color:#000;
  classDef layer2 fill:#ffe6cc,stroke:#4682b4,stroke-width:1.5px,color:#000;

Predictability, or deterministic behavior, is becoming increasingly relevant as AI/ML technologies enter the network automation domain, because these introduce inherent randomness.

These characteristics should be treated as non-negotiable requirements, not afterthoughts. And they must be acknowledged by the network engineers who will use, and ultimately depend on, the automation.

As someone who has built many network automation solutions, I can attest: there is nothing more frustrating than seeing a well-engineered solution ignored or abandoned because network operators didn’t trust it enough. Using these principles as guideposts will help you avoid this outcome.

So, given these characteristics, let’s explore the principles that support them.

2.2 Design Principles#

With trust established as the foundation, we can now explore the design principles that support the qualities required for trustworthy automation. These principles ensure that automation is reliable, predictable, and scalable.

While the full list could be much longer (and we will expand it later when introducing other principles from software engineering), there are six fundamental principles that every network automation solution should incorporate:

  • Idempotency: Executing the same operation multiple times should yield the same final state. Idempotency reduces side effects, removes ambiguity, and makes automation more understandable and safe to repeat when needed.
  • Transactional: Changes should be grouped so they either complete fully or fail safely. No partial, inconsistent, or half-applied states. Transactionality is essential for predictability and enables network-wide rollbacks when something goes wrong.
  • Intent-Driven: Automation must allow engineers to define the desired state of the system under different conditions. Clear intent models improve predictability, reduce cognitive load, and make the system more usable.
  • Dry Run Friendly: The system should expose what it would do before doing it. Dry runs make automation more usable and trusted by allowing humans to validate actions, visualize outcomes, and detect issues before they impact the network.
  • Versioning: Automation should support versioning of both data and logic, allowing you to move backward or forward in time, recover previous states, and maintain multiple states in parallel. Versioning reinforces predictability by ensuring every snapshot is controlled, and strengthens reliability by providing a safe rollback point.
  • Testable: A proper testing environment is essential before executing changes in production. Testability improves reliability and helps engineers understand how the system behaves.

These principles are not exclusive to networking; they apply to automation across any IT domain. But they become especially critical in networking, where changes carry significant operational risk.

These six principles form the foundation for safe, predictable, and trustworthy network automation. They set the stage for more advanced concepts introduced later in the chapter and become even more important when scaling automation across large, complex infrastructures.

graph BT
  %% ===== Bottom Layer =====
  subgraph L3[**Principles**]
    direction LR
    C1[Idempotent]:::layer3
    C2[Versioned]:::layer3
    C3[Transactional]:::layer3
    C4[Testable]:::layer3
    C5[Dry Run]:::layer3
    C6[Intent-Driven]:::layer3
  end

  %% ===== Middle Layer =====
  subgraph L2[**Qualities**]
    direction LR
    B1[Predictable]:::layer2
    B2[Reliable]:::layer2
    B3[Usable]:::layer2
    B4[Understandable]:::layer2
  end

  %% ===== Top Layer =====
  subgraph L1[" "]
    direction TB
    A[Trust]:::layer1
  end

  %% ===== Connections: Foundation → Behavior =====
  C1 --> B1
  C1 --> B4

  C2 --> B3
  C2 --> B2

  C3 --> B1
  C3 --> B2

  C4 --> B2
  C4 --> B4

  C5 --> B3
  C5 --> B4

  C6 --> B1
  C6 --> B3

  %% ===== Connections: Behavior → Outcome =====
  B1 --> A
  B2 --> A
  B3 --> A
  B4 --> A

  %% ===== Styling =====
  classDef layer1 fill:#ffcccc,stroke:#b8860b,stroke-width:2px,color:#000;
  classDef layer2 fill:#ffe6cc,stroke:#4682b4,stroke-width:1.5px,color:#000;
  classDef layer3 fill:#ccffcc,stroke:#228b22,stroke-width:1.5px,color:#000;

Each of these design principles deserves more context to be fully understood. We’ll explore them in an order that reflects their logical dependencies: starting with defining what we want to achieve, then ensuring it happens safely and correctly, and finally validating that it works.

2.2.1. Intent-Driven#

Why start with intent? Before discussing how to execute automation, we must understand what we’re trying to achieve. Intent-driven design is the foundation upon which all other principles build.

Common starting points for automation are scripts or playbooks that execute simple actions: turning an interface up or down given the FQDN and interface name. But this creates a problem: the next person who works with that interface won’t know its desired state. Should it be up or down?

Obviously, this issue affects both manual and automated operations. But with automation, the problem becomes worse because you can’t confidently infer network intent from its current state, especially when many ad hoc changes have been applied. The actual network configuration is the result of your Intent-Driven logic, not a reliable record of it.

Intent manifests at different levels depending on your variability and customization needs. Simple environments may require less variable data and can rely on templates. Other data may require cross-team review, while some can be updated following conventions and guardrails.

Someone might argue that their intent is simply the actual state of the network—“check the config, that’s my intent.” While this is technically an intent, you cannot prove it matches your original intent, or that it hasn’t been altered along the way, unless you have a separate source to compare against. The network configuration is the output of your intent, not the intent itself.

Therefore, the quality of your intent data directly determines the quality of your automation. This concept is closely related to the Source of Truth (SoT), which we’ll cover in detail in Chapter 4.

Infrastructure as Code (IaC): Intent-driven automation is the foundation of Infrastructure as Code (IaC)—treating network infrastructure definitions like software code. With IaC, you get version control, code review processes, testing, and the ability to treat infrastructure changes the same way you treat software changes. This bridges network automation with software engineering best practices. IaC keeps templates, inventories and intent as files under version control so changes are reviewed, tested, and traced, not applied as one-off CLI edits.
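As a minimal illustration (all names hypothetical, using plain Python rather than a real templating engine), intent can live as structured data under version control, with configuration always derived from it:

```python
# Sketch: intent as structured, versionable data; config is derived from it,
# never the other way around. INTENT and the render_* names are illustrative.

INTENT = {
    "hostname": "leaf01",
    "interfaces": [
        {"name": "GigabitEthernet0/1",
         "description": "Uplink to CoreSwitch2",
         "mtu": 9216},
    ],
}

def render_interface(iface: dict) -> str:
    """Render one interface's desired state as CLI-style configuration."""
    return (
        f"interface {iface['name']}\n"
        f" description {iface['description']}\n"
        f" mtu {iface['mtu']}"
    )

def render_config(intent: dict) -> str:
    """Derive the full device configuration from the intent data."""
    lines = [f"hostname {intent['hostname']}"]
    lines += [render_interface(i) for i in intent["interfaces"]]
    return "\n".join(lines)

print(render_config(INTENT))
```

Because the intent file is the source, anyone reading it knows the desired state of the interface without reverse-engineering the running configuration.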

Building on intent, the next principle ensures that once we know what we want, we can execute it safely and consistently.

2.2.2. Idempotency#

Why idempotency matters: Idempotency ensures that repeated executions produce the same result. This is essential because automation often runs multiple times—either by accident, by retry mechanisms, or by design. Without idempotency, each run could make unpredictable changes.

Idempotency means that running the same operation multiple times results in the same final state, with no unintended side effects.

For example, in automated ACL provisioning, if we define a rule, we expect:

  • If the rule doesn’t exist, it is added in a specific position.
  • If the rule already exists, do not add duplicates.
  • If another rule is added later, the previous one remains untouched and is not reordered.

This may sound like common sense, but the reality is more complex. The system must implement logic to ensure idempotency. For example, it needs to check:

  • What is the current state of the ACL?
  • Is the rule already part of it?
  • If not, what difference needs to be applied?

Without idempotency, configuration may be applied in unpredictable order, impacting reproducibility.
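The ACL example above can be sketched in a few lines. This is a minimal, hypothetical model (not a real ACL engine) showing the check-before-change logic that idempotency requires:

```python
# Sketch of idempotent ACL rule insertion (hypothetical data model).
def apply_rule(acl: list, rule: str, position: int = None) -> list:
    """Return the ACL with the rule present exactly once.

    - If the rule is missing, insert it at the requested position.
    - If it already exists, leave the ACL untouched: no duplicates,
      no reordering of existing rules.
    """
    if rule in acl:
        return acl  # already converged: running again changes nothing
    new_acl = list(acl)
    new_acl.insert(position if position is not None else len(new_acl), rule)
    return new_acl

acl = ["permit tcp any any eq 443"]
once = apply_rule(acl, "deny ip any 10.0.0.0/8", position=0)
twice = apply_rule(once, "deny ip any 10.0.0.0/8", position=0)
assert once == twice  # same final state, no side effects
```

Note the structure: read the current state, decide whether a change is needed, and only then mutate. That read-compare-apply loop is the essence of idempotent automation.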

Idempotency applies to the different dimensions of the network automation. For example, when requesting an IP address via DHCP or from an IPAM system that understands the mapping between the identifier request and the result:

Figure 1: Example of Idempotency (from Damien’s slides)

As you will read later, to achieve idempotency, prefer Declarative models over Imperative ones: define the end state rather than the steps. This doesn’t mean idempotency is impossible with an imperative model, but it is harder to achieve.

While idempotency focuses on consistency within a single execution, the transactional principle ensures that changes are applied atomically across multiple systems, preventing partial or inconsistent states.

2.2.3. Transactional#

Why transactionality matters: Network changes often affect multiple devices simultaneously. Without transactionality, a failure partway through could leave the network in an inconsistent state—some devices updated, others not. This is especially problematic in distributed systems where coordinating across multiple actors is inherently complex.

Transactional is a common concept in databases, where multiple changes are applied atomically to maintain consistency. In networking (a distributed system) transactionality is even more complicated, but it’s not a new problem. In 2006, NETCONF (RFC 4741, later revised as RFC 6241) introduced the concept of a network-wide commit feature that applies changes globally across multiple devices or rolls back entirely if validation fails. This atomic operation prevents partial states where some devices have been updated but others haven’t—a nightmare scenario for troubleshooting. However, not all platforms support NETCONF, and integration complexity varies significantly. So while the problem is solved in theory, practical implementation depends on your infrastructure.

Transactional behavior prevents partial or broken operations, which are hard to troubleshoot and leave systems in unpredictable states. Avoiding them is essential to preserving trust in your automation. A network where some devices have applied a change and others haven’t is worse than a network that rolled back completely.

So, generally, how could we apply transactionality to networking? This is a combination of techniques:

  • NETCONF: Network management protocol with capabilities for coordinating changes across devices, though support is limited and uneven across platforms
  • Database-backed transactions: Store configuration changes in a database with ACID properties, then apply them in coordinated batches with rollback capability
  • Change notification queues: Capture all intended changes, validate them as a group, then execute with a coordinated rollback mechanism if any device fails
  • Two-phase commit patterns: Prepare changes on all devices first (phase 1), then commit them simultaneously (phase 2), with rollback if phase 2 fails on any device
  • Transactional logs: Record every change attempt with sufficient detail to undo partial operations manually or automatically

Notice that all of these mechanisms in a distributed environment depend on a rollback capability: a commit or snapshot that can be activated (internally or externally to the platform) to move back to a previous state. This is not always available. Where true atomic commits aren’t supported, implement a validated “prepare” stage and use an orchestrated, coordinated commit or staged rollout to reduce risk.
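The two-phase commit pattern from the list above can be sketched as follows. All class and function names are hypothetical; real implementations would use candidate datastores or platform APIs rather than in-memory objects:

```python
# Sketch of a two-phase commit across devices (hypothetical classes).
class Device:
    def __init__(self, name, fail_on_prepare=False):
        self.name, self.fail_on_prepare = name, fail_on_prepare
        self.candidate, self.running = None, "old-config"

    def prepare(self, config):          # phase 1: validate and stage
        if self.fail_on_prepare:
            raise RuntimeError(f"{self.name}: validation failed")
        self.candidate = config

    def commit(self):                   # phase 2: activate staged config
        self.running, self.candidate = self.candidate, None

    def rollback(self):                 # discard staged config
        self.candidate = None

def transactional_push(devices, config):
    """Apply config to all devices, or to none of them."""
    prepared = []
    try:
        for dev in devices:
            dev.prepare(config)
            prepared.append(dev)
    except RuntimeError:
        for dev in prepared:            # any failure: undo staged changes
            dev.rollback()
        return False
    for dev in devices:
        dev.commit()
    return True

fleet = [Device("r1"), Device("r2", fail_on_prepare=True)]
assert transactional_push(fleet, "new-config") is False
assert all(d.running == "old-config" for d in fleet)  # nothing half-applied
```

The key property: a failure in phase 1 leaves every device running its original configuration, so there is no partial state to troubleshoot.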

This principle naturally connects to Versioning: when something goes wrong, you need to know exactly what state you’re in so you can roll back to a known-good version.

2.2.4. Versioned#

Why versioning matters: Every change carries risk. Versioning allows you to know exactly what changed, when it changed, who changed it, and provides the ability to revert if needed. This is the foundation of safe operations at scale.

Historically, network engineering managed change through configuration backups—snapshots used for rollback purposes. However, reconciling post-change state after a rollback is extremely complex because backups lack context.

In software engineering, a Version Control System (VCS) like Git is standard practice (I haven’t worked in any environment without some form of VCS). These systems enable:

  • Easy collaboration: Multiple developers contribute code and can move forward on different branches
  • Time travel: Return to previous points in history to understand what changed and when
  • Atomic grouping: Multiple file changes can be bundled together as a single unit

Applying Versioning to network automation provides significant benefits:

  • Auditability and Traceability: Easily track who changed what and when, making the system transparent
  • Cross-team Collaboration: Facilitates review of changes and enables team-based automation development
  • Atomic Changes: Related changes across multiple files can be applied together, preventing partial or incomplete states
  • CI/CD Integration: Automated pipelines can trigger testing and validation whenever code changes are detected

Configuration templates are a prime example of data that benefits from versioning. A single template change impacts all derived configurations across the network, so version control becomes essential. In general, all data should be stored in a versioned way. It is common to have versioned YAML or JSON files modeling data of any sort, or data with more complex relationships.
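As a minimal illustration of versioned data (in practice Git does this for you; all names here are hypothetical), intent snapshots can be committed and recovered later for rollback:

```python
import hashlib
import json

# Sketch: a tiny version history for intent data. Git provides all of this
# (and much more); this only illustrates the commit/checkout idea.
class IntentHistory:
    def __init__(self):
        self._versions = []  # list of (version_id, snapshot)

    def commit(self, intent: dict) -> str:
        """Store a snapshot and return a content-derived version id."""
        blob = json.dumps(intent, sort_keys=True).encode()
        version_id = hashlib.sha256(blob).hexdigest()[:12]
        self._versions.append((version_id, json.loads(blob)))
        return version_id

    def checkout(self, version_id: str) -> dict:
        """Time travel: recover any previous snapshot for rollback."""
        for vid, snapshot in self._versions:
            if vid == version_id:
                return snapshot
        raise KeyError(version_id)

history = IntentHistory()
v1 = history.commit({"mtu": 1500})
v2 = history.commit({"mtu": 9216})
assert history.checkout(v1) == {"mtu": 1500}  # safe rollback point
```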

GitOps: An emerging pattern that takes versioning a step further—the Git repository becomes the single source of truth, and a controller continuously compares the desired state in Git with the actual state in the network, automatically correcting drift. This is gaining adoption in network automation, especially in cloud-native and Kubernetes-integrated environments.
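A GitOps controller’s core reconcile step can be sketched as follows (all names hypothetical); a real controller would run this continuously against the live network:

```python
# Sketch of a GitOps-style control loop: Git holds the desired state; the
# controller detects drift against the actual state and corrects it.
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return the keys whose actual value differs from the desired one."""
    return {k: v for k, v in desired.items() if actual.get(k) != v}

def reconcile_once(desired: dict, device_state: dict) -> dict:
    """One pass of the loop: detect drift, apply the correction."""
    drift = detect_drift(desired, device_state)
    device_state.update(drift)          # "apply" the correction
    return drift

desired = {"mtu": 9216, "description": "Uplink to CoreSwitch2"}
state   = {"mtu": 1500, "description": "Uplink to CoreSwitch2"}
first  = reconcile_once(desired, state)   # corrects the MTU drift
second = reconcile_once(desired, state)   # now in sync: nothing to do
assert first == {"mtu": 9216} and second == {}
```

Note that the loop is idempotent by construction: once the drift is corrected, subsequent passes produce no changes.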

Versioning provides the history and audit trail. The next principle ensures we validate that changes work correctly before deployment.

2.2.5. Testable#

Why testability matters: Automation at scale amplifies errors. A bug in a playbook that affects one device is manageable; the same bug applied to 10,000 devices is a disaster. Testing is your safety net before deploying to production.

Testing is particularly challenging in networking. Often you don’t own the entire network (think of internet BGP peerings, where you don’t own the other side), or the scale is prohibitive (imagine testing a data center fabric with thousands of switches). Reproducing real test scenarios can be near-impossible or prohibitively expensive.

However, this doesn’t mean you’re without options. You can implement testing at multiple levels that validate automation behavior and network state under various circumstances.

The testing pyramid provides a useful framework:

graph BT
  %% Unit tests (base)
  U1["Unit Tests<br/>(Data Quality, Logic)"]:::unit

  %% Integration tests
  I1["Integration Tests<br/>(Simulated Environment)"]:::integration

  %% End-to-end tests (top)
  E1["End-to-End Tests<br/>(Lab, Validation)"]:::enduser

  %% Links
  U1 --> I1
  I1 --> E1

  %% Styling
  classDef unit fill:#a6e3a1,stroke:#3b7a57,stroke-width:2px,color:#000;
  classDef integration fill:#89b4fa,stroke:#1e3a8a,stroke-width:2px,color:#000;
  classDef enduser fill:#f9e2af,stroke:#b8860b,stroke-width:2px,color:#000;

  • Unit Tests validate data quality, basic logic, and specific behaviors in isolation using mocks or fake systems. They run quickly and catch fundamental errors early.
  • Integration Tests introduce third-party systems in a simulated (and simplified) environment. Different automation components interact with each other and with simulated network flavors, allowing you to validate integration points (for example, faking the API or SSH CLI interface of a network device).
  • End-to-End Tests validate behavior in near-real scenarios using virtualized network environments or labs simulating part of your production network (we will go deeper on this topic in Chapter 9).
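As a minimal example of the base of the pyramid, here is a hypothetical unit test validating a rendering function and its data-quality checks, using Python’s standard unittest:

```python
import unittest

# Hypothetical function under test: renders an interface description.
def render_description(iface: dict) -> str:
    if not iface.get("description"):
        raise ValueError("description is required")
    return f"interface {iface['name']}\n description {iface['description']}"

class TestRenderDescription(unittest.TestCase):
    def test_renders_expected_cli(self):
        out = render_description({"name": "Gi0/1", "description": "Uplink"})
        self.assertIn("description Uplink", out)

    def test_rejects_missing_data(self):
        # Data-quality check: bad intent fails fast, before touching devices
        with self.assertRaises(ValueError):
            render_description({"name": "Gi0/1"})

if __name__ == "__main__":
    unittest.main()
```

Tests like these run in milliseconds with no network access, which is exactly why they belong at the base of the pyramid.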

AI code assistants are increasingly useful for test generation. If you can articulate what to test, these tools can help create comprehensive test suites quickly (and catch logic flaws that you might not spot at first glance).

The critical insight: automation requires continuous testing to ensure nothing breaks as systems grow. A regression introduced late in development is far more expensive than one caught during unit testing. As your automation scales, this becomes non-negotiable.

Advanced Testing Strategies:

  • Chaos Engineering: Intentionally inject failures (device outages, network latency) to validate that your automation and monitoring handle them gracefully (Netflix popularized this approach with their Chaos Monkey).
  • Property-Based Testing: Define properties that must always hold true (e.g., “BGP should always converge within 30 seconds”) and let test frameworks generate scenarios to verify them
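As a stdlib-only sketch of property-based testing (real frameworks such as Hypothesis generate far richer inputs; the normalization rule here is hypothetical):

```python
import random

# Property-based testing sketch: instead of hand-picking cases, generate
# many random inputs and assert properties that must ALWAYS hold.
def normalize_mtu(mtu: int) -> int:
    """Clamp an MTU into a valid range (hypothetical business rule)."""
    return max(68, min(mtu, 9216))

random.seed(42)  # reproducible runs
for _ in range(1000):
    mtu = random.randint(-10_000, 20_000)
    once = normalize_mtu(mtu)
    assert normalize_mtu(once) == once   # property: idempotency
    assert 68 <= once <= 9216            # property: output always valid
```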

Finally, before deploying any changes, operators need visibility into what will happen.

2.2.6. Dry-Run Friendly#

Why dry-run capability matters: Even with all the previous safeguards, humans need to understand what automation will do before it executes. Dry-run capability is the bridge between trust and action—it shows operators exactly what will change before any change is made.

In any major decision, understanding the plan of action before execution is critical. When building a house, you want to see how it will look before approving its construction. Similarly, in network engineering, we are accustomed to change management processes that review execution plans before deployment.

With network automation, the frequency and scope of changes can increase dramatically. Providing operators with clear visibility into what will be executed—before it happens—is essential for building confidence and enabling proper review.

There are tools that provide this Dry Run capability, but you can always create your own if needed:

  • Ansible: The --check and --diff flags show what would be changed
  • Terraform: The terraform plan command displays the difference between current and desired state
  • Custom Network APIs: Some may expose a “dry-run” or preview mode

Here’s an example of what this visibility looks like:

Ansible Diff Output

- description: Uplink to CoreSwitch1
+ description: Uplink to CoreSwitch2
  mtu: 9216

Terraform Plan Output

~ description = "Uplink to CoreSwitch1" -> "Uplink to CoreSwitch2"
  mtu          = 9216

Custom Network API Preview Response

{
  "operation": "PATCH",
  "path": "/interfaces/Gi0-1",
  "changes": [
    {
      "field": "description",
      "current_value": "Uplink to CoreSwitch1",
      "proposed_value": "Uplink to CoreSwitch2"
    }
  ],
  "mtu": 9216
}

Dry runs transform automation from a “hope for the best” approach into a deliberate, reviewable process.
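If your tooling lacks this capability, a basic dry run can be built by diffing current and proposed configurations without applying anything. A minimal sketch using Python’s standard difflib:

```python
import difflib

# Sketch of a homegrown dry run: show the diff, change nothing.
def dry_run(current: str, proposed: str) -> str:
    """Return a unified diff between running and proposed configuration."""
    return "".join(
        difflib.unified_diff(
            current.splitlines(keepends=True),
            proposed.splitlines(keepends=True),
            fromfile="running-config",
            tofile="proposed-config",
        )
    )

current  = "description Uplink to CoreSwitch1\nmtu 9216\n"
proposed = "description Uplink to CoreSwitch2\nmtu 9216\n"
print(dry_run(current, proposed))  # operator reviews this before any change
```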

Advanced Dry-Run Concepts:

  • Impact Analysis: Beyond showing what changes, analyze and communicate the business impact (e.g., “This will temporarily affect 10% of traffic”)
  • Staged Rollouts: Implement dry-run at scale by rolling out changes gradually to a subset of devices first, validating impact before full deployment
  • Network Simulation: Combine with network testing tooling to execute dry-runs against a replica of your production network

2.3. Architectural Decision Patterns#

Beyond the foundational principles, there are key concepts and considerations that bridge theory with practical implementation. These patterns help you make strategic decisions about your automation architecture.

2.3.1. Declarative vs. Imperative#

The choice between declarative and imperative approaches is a fundamental decision in automation design. Each has its strengths and trade-offs, depending on the use case. Managing infrastructure configuration can take two approaches:

  • Declarative automation defines the desired end state; the system figures out how to achieve it. Example: “I want interface Gi0/1 to have an MTU of 9000.” The automation engine determines the current state and applies only necessary changes. It focuses on WHAT, the desired outcomes.
  • Imperative automation specifies the exact steps to execute. Example: “Show interface config, parse it, calculate the difference, send these exact CLI commands in this order.” It focuses on HOW, the specific actions.

Declarative approaches naturally support Idempotency, Dry Run capabilities, and Transactional behavior, but they require infrastructure and systems that support them.

Imperative approaches offer precise control but are harder to make idempotent and more prone to unintended side effects. However, they may be the only option when your target system lacks declarative capabilities.

The key trade-off: declarative approaches scale better. When available, the complexity of implementing workflows stays roughly constant, while with imperative approaches complexity grows rapidly as workflows become more intricate.

Most modern automation frameworks (Terraform, Ansible, Kubernetes, or YANG-based tools) trend toward Declarative models while allowing Imperative fallbacks when needed (e.g., Ansible raw modules, direct CLI, Netmiko).

Some tools like Ansible can function both ways depending on the module type. Network-specific modules (like cisco.ios.ios_interfaces) are declarative, while ansible.netcommon.cli_command executes imperative commands. In the example below, you’ll see both approaches:

- name: Imperative vs. Declarative Approaches in Ansible
  hosts: switches
  gather_facts: no
  vars:
    interface_name: GigabitEthernet0/1
    new_description: Uplink to CoreSwitch2
  tasks:
    - name: Imperative - Execute exact commands
      ansible.netcommon.cli_command:
        command: |
          configure terminal
          interface {{ interface_name }}
          description {{ new_description }}
          no shutdown
          end
      # Repeating this task re-sends the commands regardless of current state,
      # breaking idempotency

    - name: Declarative - Define desired state
      cisco.ios.ios_interfaces:
        config:
          - name: "{{ interface_name }}"
            description: "{{ new_description }}"
            enabled: true
        state: merged
      # Repeating this task is safe, it converges to the desired state

Keep in mind that with a declarative approach, you are offloading the logic to an external system, and in some scenarios that may not be an available option. Notice in the previous example how you leverage the Ansible module cisco.ios.ios_interfaces, which implements the logic needed to provide the declarative behavior.

The declarative approach is safer and more predictable, but requires module support for your specific device.
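To see what such a module implements under the hood, here is a minimal sketch (function name hypothetical) of the compare-then-emit logic that makes a declarative task idempotent:

```python
# Sketch of the logic a declarative module (such as cisco.ios.ios_interfaces)
# performs internally: compare desired vs. current state and emit only the
# commands actually needed. Names and the command format are illustrative.
def interface_commands(name: str, desired: dict, current: dict) -> list:
    changes = {k: v for k, v in desired.items() if current.get(k) != v}
    if not changes:
        return []                       # already converged: nothing to send
    cmds = [f"interface {name}"]
    if "description" in changes:
        cmds.append(f" description {changes['description']}")
    if changes.get("enabled") is True:
        cmds.append(" no shutdown")
    return cmds

current = {"description": "Uplink to CoreSwitch1", "enabled": True}
desired = {"description": "Uplink to CoreSwitch2", "enabled": True}
print(interface_commands("GigabitEthernet0/1", desired, current))
# only the description command is emitted; after applying it, a second
# run returns an empty list
```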

2.3.2. Mutable vs. Immutable Infrastructure#

The concept: Do you allow in-place modifications to infrastructure (e.g., the network), or do you replace infrastructure entirely when changes are needed?

  • Mutable Infrastructure: Traditional approach where you SSH into devices and modify configurations directly (or you modify them via NETCONF, gNMI, or any kind of API). Changes are applied in-place.
    • Pro: Less disruptive, lower overhead.
    • Con: Harder to track state, increases configuration drift risk.
  • Immutable Infrastructure: You never modify running infrastructure. Instead, you create new infrastructure with desired changes and switch traffic/connections. Used heavily in cloud (containers, VMs) but less common in networking.
    • Pro: Predictable state, easier to verify, eliminates drift.
    • Con: Requires orchestration, more complex recovery, higher resource overhead.

In network automation, we’re typically in a hybrid state: configurations are mutable (we change them in place), but the principle of immutability should guide your design—version everything, track changes, and be able to rebuild from scratch if needed.

2.3.3. Greenfield vs. Brownfield#

Not a technical term, but it’s very useful to understand two different scenarios when developing network automation solutions:

  • Brownfield environments exist with legacy systems, manual processes, inconsistent configurations, and tribal knowledge. It is harder because you’re automating complexity that was never meant to be automated: inconsistent designs, missing data, legacy technology, human habits, and live traffic constraints. But it’s the most common environment.
  • Greenfield environments are built from scratch with automation-first design. You can implement all principles cleanly: version control from day one, declarative intent, clean data models, comprehensive testing. This is ideal but rare.

Let’s try to dig a bit more into why brownfield environments are so complicated…

  • Inconsistent network designs (or designs that were never fully implemented), leaving little more than a collection of exceptions.
  • Lack of clean and reliable data; the network itself, or perhaps an outdated spreadsheet, is the only reference.
  • Network devices from different vendors and generations that lack modern management interfaces (offering only a CLI), which limits the implementation of a declarative approach.
  • Operational culture and fear of change; do not forget that people come first, and you need to gain trust before adoption.
  • Automation must coexist with live traffic without a clean cutover; the margin for error is tiny.

However, there is still hope in these cases. The following three approaches facilitate getting started with these scenarios:

  1. Partial greenfield approach: Automate new infrastructure while gradually refactoring legacy systems. This shows progress without starting with maximum complexity.
  2. Incremental target selection: Focus on parts of the network that are easier to automate and provide quick wins:
    • Adding configuration drift and remediation for a few management features (AAA, NTP, SNMP)
    • Automating interface provisioning for a single device type
    • Standardizing a single subsystem before expanding
  3. Build momentum: Each small win demonstrates automation’s value and builds trust for funding and expanding the effort

2.3.4. Device Diversity and Service Abstraction#

Not all network devices or services work the same way. Different vendors, models, and even software versions have unique interfaces, capabilities, and limitations. Your automation must address this heterogeneity strategically.

Two primary approaches:

  • Embrace Vendor-Specific Automation: Write automation tailored to each vendor’s unique capabilities without compromising reproducibility. Pros: simpler initially, leverages device strengths. Cons: creates silos, harder to migrate if requirements change.
  • Abstract It Away: Create a vendor-agnostic abstraction layer that standardizes common operations. Pros: portability, unified interface. Cons: added complexity, potential loss of device-specific capabilities.

Best Practice: Layered Approach. Most mature operations use both: vendor-specific drivers at the bottom (one layer per device type), with a common abstraction layer above. This gives you the benefits of both strategies.
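The layered approach can be sketched as a common interface over vendor-specific drivers (all class names hypothetical; real projects often reach for libraries like NAPALM that follow this pattern):

```python
from abc import ABC, abstractmethod

# Sketch of the layered approach: a vendor-agnostic abstraction on top of
# vendor-specific drivers. Class names and command syntax are illustrative.
class InterfaceDriver(ABC):
    @abstractmethod
    def set_description(self, iface: str, text: str) -> list:
        """Return the vendor-specific commands for this change."""

class IosDriver(InterfaceDriver):
    def set_description(self, iface, text):
        return [f"interface {iface}", f" description {text}"]

class JunosDriver(InterfaceDriver):
    def set_description(self, iface, text):
        return [f'set interfaces {iface} description "{text}"']

def describe(driver: InterfaceDriver, iface: str, text: str) -> list:
    # Callers program against the abstraction, never a specific vendor
    return driver.set_description(iface, text)

print(describe(IosDriver(), "Gi0/1", "Uplink"))
print(describe(JunosDriver(), "ge-0/0/1", "Uplink"))
```

Adding a new vendor means writing one new driver; everything above the abstraction layer stays unchanged.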

Real-World Example: Building a DMVPN Alternative. Network vendors offer DMVPN (Dynamic Multipoint VPN) for hub-and-spoke VPN scalability. An alternative is to set up many simple point-to-point VPN tunnels instead. Without automation this is cumbersome (which is the reason the protocol exists). With automation, managing thousands of tunnels is doable (and not extremely complicated), and it provides similar scalability through orchestration rather than protocol sophistication, since virtually all platforms support basic VPN tunnels. You’ve replaced a protocol dependency with an automation abstraction—often a better trade-off.

The key principle: separate implementation details (how specific devices work) from the final goal you want to achieve. This enables vendor independence and simplifies reasoning about your system.

2.3.5. The Fallacy of Tools Over Design#

While tools are essential, they are not a substitute for good design. A common mistake: purchasing a tool and expecting it to solve automation problems. Tools matter, but they amplify existing architecture and design. A well-designed automation strategy with a mediocre tool beats a poorly-designed strategy with the best tool available.

Design and architecture are strategy and should come first. Tools are implementation details. Invest time in understanding your requirements, designing your architecture, and defining your principles before selecting tools. Sometimes a distributed architecture is appropriate; other times a simpler, more capable solution better matches your needs in terms of outcome versus complexity. There is no single formula, but you must recognize and be intentional about your choices.

Also remember that tools are not magic boxes. You always need to bring your own logic into them. Customization and domain-specific configuration are unavoidable.

In Chapter 3, we’ll explore a reference architecture that highlights the main building blocks you need to consider when evaluating tools.

2.3.6. Buy vs. Build#

A common strategic question: should you purchase an off-the-shelf solution, build custom automation, or use a hybrid approach?

The simple rule: buy when you can, build when you must. Counter-intuitively, building is often more expensive than buying. But buying what you need is not always possible.

When evaluating the decision:

  • Buy: Use when a product closely matches your needs, supports your architecture, and the cost/benefit analysis favors it
  • Build: Use when your requirements are unique, you have the expertise, or off-the-shelf solutions don’t fit your principles

This is fundamentally a design decision, not a procurement one. It’s about how much customization and control you truly need.

In practice, most organizations use a hybrid approach: buy strategic components (orchestration platforms, CI/CD systems, data storage) but build domain-specific automation (templates, workflows, validation logic). You must own the domain-specific layer—generic recipes rarely deliver the outcomes you need.

When evaluating open-source solutions, consider extensibility. You can often reuse the framework and build your own custom layers on top—for example, a data source that stores network intent and extends the core tool with custom data models. Also, remember that many open-source projects offer commercial support and enterprise editions that add a safety net when needed.

2.3.7. Why These Principles Matter: Learning from Failures#

Automation amplifies whatever you feed it. Clean, well-designed processes become more efficient and reliable. But poor design, bad data, or flawed logic also scale, creating bigger problems faster.

The Meta Outage Case Study

In October 2021, Meta (formerly Facebook) experienced a global network outage that took their systems offline for hours. What caused it? Automation amplified a misconfiguration. During a routine traffic shift, automated systems made global changes based on incomplete policy validation. The configuration cascaded across their network, causing a single point of failure that propagated globally.

The root cause was a faulty configuration change on backbone routers, which disrupted communication between data centers. This cascading failure halted services globally and also impacted internal tools, complicating diagnosis and resolution. Meta clarified that the issue was not due to malicious activity and that no user data was compromised. However, the outage highlighted what appear to be critical gaps in their automation strategy:

  • No idempotency: Changes applied without checking current state
  • No transactionality: Some devices got the change, others didn’t (partial failure)
  • Insufficient testing: The scenario wasn’t caught in pre-production
  • No dry-run capability: Changes were applied without preview or validation
  • Poor versioning: Unable to quickly identify and revert the problematic change
  • Insufficient observability: Couldn’t detect the failure fast enough
  • No graceful degradation: Changes cascaded globally without blast radius containment
  • Unclear responsibilities: No clear decision-making hierarchy for automation

Mapping to Principles: Each of our six core design principles directly addresses one or more of these failures. This demonstrates that they’re not optional nice-to-haves; they’re essential safeguards.

The Lesson: Don’t automate chaos hoping to fix it later. Instead:

  1. Start with processes you understand and can validate
  2. Automate incrementally, learning from each step
  3. Treat every automation failure as a learning opportunity
  4. Build in safeguards: rate limits, rollback mechanisms, observability, blast radius containment
  5. Separate concerns so failures in one area don’t cascade globally

Notice how this connects with the People, Process, and Technology approach introduced in Chapter 1.

The principles we’ve explored—Intent-Driven design, Idempotency, Transactionality, Versioning, Testability, and Dry-Run capability—directly address these failure modes. They’re not optional features; they’re essential safeguards that must be built in from the start.

2.4. Software Engineering Principles#

In addition to network-specific design principles, broader software engineering principles play a crucial role in building maintainable and scalable automation systems. These principles provide a solid foundation for long-term success.

While some principles apply to both coding and architecture, we’ll organize them into two categories following inspiration from Robert C. Martin’s “Clean Code” and “Clean Architecture”:

  • Clean Code Principles: How to write automation logic that is readable, maintainable, and correct.
  • Clean Architecture Principles: How to structure systems so components remain independent, testable, and evolvable.

For good examples of applying these principles to network automation, see this Network to Code blog series by Ken Celenza.

2.4.1. Clean Code Principles#

In this section, the focus is on how the software components are built.

Write Code for Readers

The code you write will be read many times in the future: by yourself (when debugging) or by others. And they likely won’t have your original context. Therefore, express your intention clearly in the code through meaningful names, comments, and structure (please do not overuse comments; use them carefully). Automation code is not just instructions to machines; it’s a form of documentation for humans.

DRY – Don’t Repeat Yourself

Avoid duplicating logic across your automation codebase. Instead, extract common patterns into reusable templates, functions, or workflows.

Rather than writing device-specific configuration logic in ten different playbooks, create a shared template with variables for differences. This reduces bugs and makes updates easier.

This principle also applies to data: use proper data structures leveraging inheritance and polymorphism to create more scalable and reusable data models.

When you violate DRY, a fix in one place requires fixes in five others, and you’ll inevitably miss one.
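In practice you would typically use a templating engine such as Jinja2; the minimal sketch below uses Python's stdlib `string.Template` to show the same idea, with one shared template and per-interface differences living in data (the interface values are invented for illustration):

```python
from string import Template

# One shared template; per-interface differences live in data, not in copies.
INTERFACE_TEMPLATE = Template("interface $name\n description $description\n mtu $mtu")

INTERFACES = [
    {"name": "Gi0/1", "description": "uplink-a", "mtu": 9000},
    {"name": "Gi0/2", "description": "uplink-b", "mtu": 1500},
]

# A fix to the template now propagates to every rendered interface at once.
configs = [INTERFACE_TEMPLATE.substitute(intf) for intf in INTERFACES]
```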

YAGNI – You Aren’t Gonna Need It

Don’t build features, flexibility, or abstraction layers you don’t immediately need. Future-proofing often creates unnecessary complexity.

Build what you need today. Refactor when new requirements actually arrive. This is especially important in automation where uncertainty is high and requirements evolve.

This is also known as premature optimization:

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” – The Art of Computer Programming by Donald Knuth

Single Responsibility Principle (SRP)

Each module, function, or workflow should have one reason to change. In network automation, this means:

  • A configuration template renderer shouldn’t also handle device discovery
  • A validation workflow shouldn’t also execute changes

When each component has a single responsibility, failures are isolated, testing is simpler, and changes are less risky. Of course, you still need a composition function (or orchestrator) to connect these pieces.
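A minimal sketch of SRP plus a composition function (all names are hypothetical, and the deploy step is simulated): each function has exactly one reason to change, and only the orchestrator knows the overall flow.

```python
def render(intent: dict) -> str:
    """Only renders configuration text from intent."""
    return f"interface {intent['name']}\n description {intent['description']}"


def validate(config: str) -> bool:
    """Only validates; never executes changes."""
    return config.startswith("interface ")


def deploy(config: str) -> str:
    """Only executes; deployment is simulated here."""
    return f"DEPLOYED:\n{config}"


def orchestrate(intent: dict) -> str:
    """Composition function: wires the single-purpose steps together."""
    config = render(intent)
    if not validate(config):
        raise ValueError("validation failed; nothing was deployed")
    return deploy(config)
```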

Fail Fast, Fail Visible

Detect problems as early as possible and expose them clearly. In automation:

  • Validate data immediately upon input (don’t wait until deployment)
  • Make Dry Run output explicit and obvious
  • Log failures with full context, not vague error codes
  • Alert operators immediately when something goes wrong

Catching issues early reduces blast radius and response time.
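For instance, validating intent data at the boundary, before any deployment logic runs, might look like this (the VLAN intent shape is an assumption for illustration; note it reports every problem with full context rather than failing on the first one):

```python
def validate_vlan_intent(intent: dict) -> list[str]:
    """Fail fast: check input data immediately and report every problem with context."""
    errors: list[str] = []
    vlan = intent.get("vlan_id")
    if not isinstance(vlan, int) or not 1 <= vlan <= 4094:
        errors.append(f"vlan_id must be an integer between 1 and 4094, got {vlan!r}")
    if not intent.get("name"):
        errors.append("name is required and must be non-empty")
    return errors
```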

Encapsulation

Hide implementation details behind well-defined interfaces. Users of your automation should only see what they need to see.

This allows you to refactor internal implementation without breaking downstream tools that depend on your automation.

Security

Encryption, authentication, least privilege, and audit trails must be embedded in automation systems from the start, not retrofitted later.

In network automation: Every change should be auditable, credentials should never be hardcoded, and access control should follow the principle of least privilege. A single automation system with full network access is a security disaster waiting to happen.

We will dive deeper into security and compliance considerations in Chapter 12.

Principle of Least Astonishment

Automation should behave the way users expect. Surprising or counterintuitive behavior erodes trust.

For example, if an automation task is named “deploy_interface,” operators expect it to create an interface, not delete one. Unexpected behavior frustrates users and causes mistakes.

Defensive and Robust Programming

Build in retries, Circuit Breaker patterns, Compensation Logic, and fallback mechanisms. Distributed systems fail; design for it rather than against it. If a device is temporarily unreachable, retry with exponential backoff rather than failing immediately. If a change fails halfway through, have a rollback plan.
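A minimal retry-with-exponential-backoff sketch (the `flaky_fetch` device call below is simulated, and `with_retries` is a hypothetical helper; a real implementation would also add jitter, cap the total delay, and widen the retried exception set to match the transport library):

```python
import time


def with_retries(operation, attempts: int = 4, base_delay: float = 0.1, sleep=time.sleep):
    """Retry a flaky operation with exponential backoff; re-raise on final failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # waits 0.1s, 0.2s, 0.4s, ...
```

The injectable `sleep` parameter also makes the backoff behavior easy to test without real delays.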

“Be conservative in what you send, be liberal in what you accept.” – Postel’s Law (RFC 761)

In network automation:

  • Conservative sending: Ensure data you send to APIs or devices adheres to strict schemas and contracts
  • Liberal acceptance: Be ready to handle variations (e.g., attributes as integers or strings with conversion) to maximize interoperability with different system versions

This principle bridges clean code with architecture. It affects both how you write integration logic and how you structure system interfaces.
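Applied to the two bullets above, a hedged sketch (the VLAN payload shape is invented for illustration) could coerce inbound values liberally while keeping the outbound contract strict:

```python
def coerce_vlan_id(value) -> int:
    """Liberal acceptance: tolerate 100 or "100" from peer systems."""
    vlan = int(value)  # converts strings; raises ValueError on garbage
    if not 1 <= vlan <= 4094:
        raise ValueError(f"vlan_id out of range: {vlan}")
    return vlan


def build_vlan_payload(vlan_id) -> dict:
    """Conservative sending: the outbound payload always carries the strict type."""
    return {"vlan_id": coerce_vlan_id(vlan_id)}
```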

2.4.2. Clean Architecture Principles#

Next, we explore principles that govern how components of a network automation solution come together.

KISS – Keep It Simple, Stupid

Simpler is easier to understand, test, and maintain. Avoid over-engineering solutions in design, implementation, and architecture. Simplicity reduces bugs, increases maintainability, improves readability, and makes systems easier to extend or debug.

Simple doesn’t mean simplistic; it means choosing the most straightforward approach that satisfies requirements without over-engineering or adding premature abstractions.

In network automation, this means preferring straightforward Declarative approaches over complex imperative scripts (when possible), prioritizing clarity, small composable components, predictable behavior, and solutions that can be easily understood by others (including your future self).
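To illustrate the declarative style: instead of scripting each command, you state the desired set of VLANs and let a planner compute the difference against actual state (a deliberately simplified sketch; real tools diff far richer state than a set of IDs):

```python
def plan_vlan_changes(desired: set[int], actual: set[int]) -> dict[str, set[int]]:
    """Declarative style: state what should exist; compute the steps to get there."""
    return {"add": desired - actual, "remove": actual - desired}
```

Running the same plan twice against converged state yields empty change sets, which is also what makes the declarative style naturally idempotent.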

Separation of Concerns

Clearly separate data (configuration), logic (workflows), and presentation (APIs/UIs). This prevents tight coupling and enables independent evolution of each layer.

In network automation:

  • Data layer: Network intent stored as structured data
  • Logic layer: Automation engines and validation rules
  • Presentation layer: APIs, CLIs, dashboards for operators

This separation allows you to change how operators interact with automation without affecting the underlying logic. We will explore these concerns further in Chapter 3.

Observability

Automation must be instrumented to measure its own behavior, detect failures, and trigger corrective actions. You cannot optimize what you cannot measure. Track both technical metrics and business-oriented ones (e.g., ROI of automation initiatives).

In Chapter 5, we will cover the different types of observability data we care about in networking, such as metrics, logs, traces, network flows, and alerts (and many more!), and how to leverage them to provide useful information at all levels.

Without observability, you’re flying blind. You won’t know if automation is working correctly or just appearing to work. Remember: automation systems themselves fail and need monitoring. Instrument your automation tools as thoroughly as you instrument the network.
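As a minimal illustration of instrumenting your own automation (the in-memory metric store and the backup task are toy placeholders; in practice you would export to a real metrics backend such as Prometheus):

```python
import time
from collections import defaultdict

# Toy in-memory store; a real system would export to a metrics backend.
METRICS: dict[str, list[float]] = defaultdict(list)


def instrumented(task_name: str):
    """Decorator that records duration and outcome for every run of a task."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                METRICS[f"{task_name}.success"].append(time.perf_counter() - start)
                return result
            except Exception:
                METRICS[f"{task_name}.failure"].append(time.perf_counter() - start)
                raise
        return wrapper
    return decorator


@instrumented("backup_config")
def backup_config(device: str) -> str:
    return f"saved {device}.cfg"  # stand-in for the real backup logic
```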

Extensibility

Design with the future in mind. New vendors, new technologies, and new requirements will arrive. Architecture should allow for these without requiring complete rewrites.

In practice: Use plugin architectures for vendor-specific drivers, avoid hard-coded assumptions about network topology, and keep interfaces stable while implementations evolve.

Minimal Coupling, Maximal Cohesion

Define clear contracts for how systems communicate: schemas, validation rules, and backward-compatibility policies. These contracts enable independent evolution of components.

In network automation, if your orchestration system talks to your device drivers through a well-defined REST API, either layer can evolve independently as long as the API contract is maintained.

Always approach every system with API-First Design: define APIs before implementations. This ensures systems can be developed independently and swapped without breaking other components.
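A tiny sketch of enforcing such a contract at the boundary (the contract fields are hypothetical): as long as payloads honor the agreed schema, either layer can evolve freely.

```python
# The API contract both sides agree on; the fields are hypothetical.
INTERFACE_CONTRACT = {"device": str, "interface": str, "enabled": bool}


def validate_against_contract(payload: dict, contract: dict = INTERFACE_CONTRACT) -> list[str]:
    """Report every way a payload violates the agreed field names and types."""
    errors = [f"missing field: {field}" for field in contract if field not in payload]
    for field, expected in contract.items():
        if field in payload and not isinstance(payload[field], expected):
            errors.append(
                f"{field} must be {expected.__name__}, got {type(payload[field]).__name__}"
            )
    return errors
```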


These advanced principles will be explored in greater depth in later chapters as we discuss scaling automation across organizations large and small. For now, understand that these principles complement the design principles we explored earlier—together, they form the foundation for trustworthy, maintainable network automation at scale.

2.5. Summary#

This chapter established that trust is the foundation of successful network automation. Trust emerges from four core qualities: Predictable, Reliable, Usable, and Understandable.

These qualities are supported by six foundational design principles:

  1. Intent-Driven: Define what you want to achieve before how to achieve it
  2. Idempotent: Repeated executions produce consistent results
  3. Transactional: Changes complete fully or fail safely, never partially
  4. Versioned: Track all changes with full history and audit trails
  5. Testable: Validate behavior before deploying to production
  6. Dry-Run Friendly: Preview changes before execution

Beyond these core principles, we explored architectural decision patterns (declarative vs. imperative, greenfield vs. brownfield, device abstraction) and software engineering principles (clean code and clean architecture) that operationalize these patterns in real systems.

These principles are not abstract theory—they have concrete implementations in tools and frameworks you’ll use. Throughout the rest of this book, we’ll see how architectural thinking (Chapter 3) applies these principles to larger systems, and how the building blocks (Chapters 4–9) operationalize them.

The key takeaways:

  • Start with principles, not tools
  • Design for Predictable outcomes, not complexity
  • Measure and improve continuously

When you do this consistently, trust follows naturally—and with trust comes the ability to scale automation across your entire organization.

You now understand the principles that make automation trustworthy. In Chapter 3 (Architectural Thinking), we’ll see how to structure these principles into scalable systems. You’ll learn how to design automation that grows with your organization without becoming unmanageable—a practical, architectural view of how to systematically apply the principles you’ve learned here.
