Feb 15, 2026 · 12517 words · 59 min read

4. The Source of Truth#

Every automation system starts with one question: What am I actually going to do? When you deploy a firewall rule, add an IP address, or configure a VLAN, you’re doing it based on some representation of intent. That representation is your source of truth: the single, reliable version of what should be deployed.

I use “Source of Truth” and “Intent” interchangeably throughout this chapter; they refer to the same concept.

Without a reliable source of truth, network operations become a mess of tribal knowledge. Engineers disagree on what’s deployed. Spreadsheets contradict what’s actually running. Two different automation systems make conflicting changes to the same device. When something breaks, you’re left doing archaeology: “Why was this configured this way? Who approved it? When did it change?”

This chapter looks at how to build a source of truth that works whether you have dozens of devices or hundreds of thousands, serves everyone who needs data (humans, automation, other systems), keeps the data accurate and trustworthy, and handles the complexity of pulling data from multiple sources. We’ll cover six building blocks, Modeling, Consumption, Enforcement, Versioning, Aggregation, and Design-Driven, that together create a solid foundation for network automation. I’ll also touch on practical issues when bringing existing networks into this system, and show you what solutions are actually available.

4.1. Fundamentals#

4.1.1. Context#

Your source of truth does three things: it defines how you express what you want, where that intent lives, and how you keep it trustworthy over time.

It makes sure your automation workflows actually use real data instead of making stuff up. It gives your execution systems something reliable to work with. And it lets your monitoring systems check whether reality matches intent.

Without a good source of truth, all the other parts of your architecture just become standalone tools with no clear direction. With one, they actually work together.

4.1.2. Goals#

Your source of truth needs to do six things:

  1. Capture everything you need. Store the full picture of your network: configs, assets, topology, services, IP addresses, circuits, device specs, secrets, maintenance windows, compliance stuff, and who owns what. Include not just what’s running today, but what you’ve planned and what you’re retiring. When all this data lives in one place instead of scattered across spreadsheets and people’s heads, your automation systems actually have the context they need to make smart decisions.

  2. Let operators think in business terms, not device syntax. Staff should work at the business level (“add a new branch” or “set up an MPLS service”) not at the Command Line Interface (CLI) level. Behind the scenes, your system figures out the actual device-specific configs. This keeps people focused on what they’re trying to accomplish, not on low-level details.

  3. Let people and machines access the data easily. Your system needs APIs (REST, GraphQL) so automation can get and modify data. It needs a web UI and CLI so humans can browse and edit. Everyone needs solid access controls so they only see what they should. You can have a hundred automation workflows running at the same time, dozens of staff editing data, and external systems syncing in, and it all stays consistent and fast.

  4. Keep the data clean and trustworthy. Validate everything: check that data types are right, relationships make sense, VLANs are in valid ranges, IP addresses don’t duplicate. Track who changed what and when. Let people approve major changes before they stick. If something goes wrong, you can roll back. Your automation needs to trust that the data is accurate, because bad data leads to bad network state and outages.

  5. Let people work in parallel without stepping on each other. Multiple teams should be able to propose changes at the same time. Changes get grouped into atomic bundles: either they all go in or they all fail, no halfway states. You can test changes in a staging area before they go live. For big migrations, you can branch off, do your work, and merge back. You always know which intent is being used now versus what’s being proposed.

  6. Bring data in from other systems. You probably already have asset management for hardware, IP systems for addresses, circuit providers for connectivity, CMDBs for services. Don’t duplicate that. Instead, sync it in. Make clear rules about which system owns which data. Keep everything in sync both ways. This means you get a unified view of everything without reinventing the wheel.

graph TD

    %% --- Subgraphs ---
    subgraph Goals
        direction LR
        A1[Capture everything you need]
        A2[Let operators think in business terms, not device syntax]
        A3[Let people and machines access the data easily]
        A4[Keep the data clean and trustworthy]
        A5[Let people work in parallel without stepping on each other]
        A6[Bring data in from other systems]
    end


    %% --- Row gradient classes ---
    classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
    classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
    classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
    classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
    classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
    classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;

    %% --- Apply classes per row ---
    class A1 row1;
    class A2 row2;
    class A3 row3;
    class A4 row4;
    class A5 row5;
    class A6 row6;

With these goals, the next step is understanding which architectural capabilities the solution must provide.

4.1.3. Pillars#

Each goal needs specific capabilities to work. Here’s what you actually need to build:

  1. Flexible Data Modeling Framework: Look, your data model will make or break everything else. I’ve watched teams spend months building beautiful execution workflows only to hit a wall because their data model can’t express what they actually need. You need schema definition that handles complex object hierarchies, relationships between entities, custom attributes, and ways to add new object types without breaking existing automation. The model needs both normalized relational structures (for referential integrity) and denormalized document structures (for performance) depending on how you’ll access the data. You can adopt industry-standard models (OpenConfig, IETF YANG modules) where they fit, but you’ll need custom extensions for your organization’s specific quirks. The hard part is balancing rigidity (so things stay consistent) with flexibility (so you can evolve without rebuilding from scratch).

  2. Design and templates. High-level business intent needs to become device-specific configs. Templates, inheritance, and logic engines do this translation. You use variables, conditionals, and reusable patterns so you don’t have to manually write every device’s config.

  3. APIs and interfaces. Everything (create, read, update, delete) should be available through APIs: REST, GraphQL, or webhooks, so other systems can react when data changes. These are the entry points for the other architectural components, including the Presentation layer that exposes UIs or CLIs to users. Proper authentication and authorization at the API level provides controlled access.

  4. Data validation. Check that new data is correct before you accept it: right types, unique values where they need to be, valid relationships. Stop obviously bad data from getting in. Explain clearly what’s wrong when someone tries to add something invalid. You can validate some things instantly and run the expensive checks in the background.

  5. Change history. Keep complete records: who changed what, when, and why. Never allow changes to history to be erased or modified. Let people see what something looked like at any point in time. Support merging when people work in parallel.

  6. Data aggregation. You probably have data living in different systems. Connectors pull it together: CMDB, IPAM tools, asset management, wherever. Define clear rules about which system owns which data. Handle conflicts sensibly when the same data comes from multiple places.
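To make the validation pillar concrete: a validation layer can start as a set of checks run before any record is accepted, with instant checks inline and expensive ones deferred. A minimal Python sketch (the field names and rules are illustrative, not taken from any specific tool):

```python
def validate_vlan(record, existing_ids):
    """Return a list of human-readable errors; an empty list means accepted."""
    errors = []
    vid = record.get("vlan_id")
    if not isinstance(vid, int):
        errors.append(f"vlan_id must be an integer, got {type(vid).__name__}")
    elif not 1 <= vid <= 4094:
        errors.append(f"vlan_id {vid} outside valid range 1-4094")
    elif vid in existing_ids:
        errors.append(f"vlan_id {vid} already allocated")
    if not record.get("name"):
        errors.append("name is required")
    return errors

# Instant checks run inline on every write; expensive checks
# (e.g., cross-device consistency) can be queued for a background job.
print(validate_vlan({"vlan_id": 5000, "name": ""}, existing_ids={10, 20}))
```

Note that each error message explains exactly what is wrong, which is what makes validation useful rather than merely restrictive.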

graph LR

    %% --- Subgraphs ---
    subgraph Goals
        direction TB
        A1[Capture everything you need]
        A2[Let operators think in business terms, not device syntax]
        A3[Let people and machines access the data easily]
        A4[Keep the data clean and trustworthy]
        A5[Let people work in parallel without stepping on each other]
        A6[Bring data in from other systems]
    end

    subgraph Pillars
        direction TB
        B1[Flexible Data Modeling Framework]
        B2[Design and templates]
        B3[APIs and interfaces]
        B4[Data validation]
        B5[Change history]
        B6[Data aggregation]
    end


    %% --- Row connections ---
    A1 --> B1
    A2 --> B2
    A3 --> B3
    A4 --> B4
    A5 --> B5
    A6 --> B6

    %% --- Row gradient classes ---
    classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
    classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
    classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
    classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
    classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
    classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;

    %% --- Apply classes per row ---
    class A1,B1 row1;
    class A2,B2 row2;
    class A3,B3 row3;
    class A4,B4 row4;
    class A5,B5 row5;
    class A6,B6 row6;


Finally, before detailing the six Building Blocks that realize these pillars, let’s clarify what falls within the Source of Truth’s scope.

4.1.4. Scope#

Before diving into implementation details, it’s important to understand what the source of truth is responsible for and what it’s not.

What’s in scope:

The source of truth manages all intent data: the complete definition of what your network should look like. This includes production configurations, staging environments, development branches, and test scenarios. Everything that describes “what you want” lives here.

What’s out of scope:

The source of truth doesn’t do several things that might seem related:

  1. Observability data: The source of truth doesn’t store metrics, logs, or runtime state. However, it does define the expectations you’ll compare against, like threshold values for alerts or baseline performance numbers. The actual observability data lives elsewhere (covered in Chapter 6).

  2. Network interaction: The source of truth doesn’t talk to devices or push configurations. It provides the necessary artifacts (device configs, validation rules, deployment manifests) but doesn’t execute them. That’s the Executor’s job (Chapter 5).

  3. Orchestration logic: The source of truth doesn’t define the sequence of steps or workflows for deploying changes. It defines the intended final state. How you get there (which devices first, what validation steps, rollback procedures) belongs to the Orchestrator (Chapter 7).


Think of the source of truth as the north star for your network automation strategy. It’s the single, authoritative answer to “what should the network look like?” Everything else in your automation system (execution, monitoring, orchestration) references this truth to do its job. When reality drifts from intent, the source of truth tells you what reality should become.

4.2. Building Blocks#

Now let’s talk about the six building blocks that actually implement all this.

Each one maps to a goal and pillar:

  1. Modeling: Defines what data you store and how it relates. Device models, interfaces, VLANs, circuits, services. Let it evolve as your needs change.

  2. Design-Driven: Translates high-level intent into actual device configurations through templates and logic.

  3. Consumption: How people and systems actually get and use the data. APIs, web UI, CLI. Everyone gets access control suited to their role.

  4. Enforcement: Makes sure bad data doesn’t sneak in. Validation, uniqueness checks, referential integrity. Clear error messages.

  5. Versioning: Keeps the complete history. Who changed what, when, and why. Roll back when needed.

  6. Aggregation: Pulls data in from other systems (CMDB, IPAM, etc.) and keeps it synced.

graph LR

    %% --- Subgraphs ---
    subgraph Goals
        direction TB
        A1[Capture everything you need]
        A2[Let operators think in business terms, not device syntax]
        A3[Let people and machines access the data easily]
        A4[Keep the data clean and trustworthy]
        A5[Let people work in parallel without stepping on each other]
        A6[Bring data in from other systems]
    end

    subgraph Pillars
        direction TB
        B1[Flexible Data Modeling Framework]
        B2[Design and templates]
        B3[APIs and interfaces]
        B4[Data validation]
        B5[Change history]
        B6[Data aggregation]
    end

    subgraph Building Blocks
        direction TB
        C1[Modeling]
        C2[Design-Driven]
        C3[Consumption]
        C4[Enforcement]
        C5[Versioning]
        C6[Aggregation]
    end


    %% --- Row connections ---
    A1 --> B1 --> C1
    A2 --> B2 --> C2
    A3 --> B3 --> C3
    A4 --> B4 --> C4
    A5 --> B5 --> C5
    A6 --> B6 --> C6

    %% --- Row gradient classes ---
    classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
    classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
    classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
    classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
    classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
    classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;

    %% --- Apply classes per row ---
    class A1,B1,C1 row1;
    class A2,B2,C2 row2;
    class A3,B3,C3 row3;
    class A4,B4,C4 row4;
    class A5,B5,C5 row5;
    class A6,B6,C6 row6;

The following is an architectural perspective of the Intent block:

graph TB

    %% Tier 1
    subgraph T1[External]
        A[Consumption]
    end

    %% Tier 2
    subgraph T2[Data]
        B[Design-Driven]
        D[Modeling]
    end

    %% Tier 3
    subgraph T3[Engine]
        C[Aggregation]
        E[Enforcement]
        F[Versioning]
    end

    %% Tier connections
    A <--> B
    A <--> D

    B <--> D

    D <--> C
    D <--> E
    D <--> F

4.2.1. Modeling#

Your data model is critical: it determines what you can represent and how easily you can work with it. As George Box noted: “All models are wrong, but some are useful.” There’s no perfect model that works for everyone. In my experience, data modeling is more art than science. But certain patterns show up consistently across successful projects.

4.2.1.1. Foundational Principles#

How you organize your data matters. It determines what you can express and how efficiently systems can validate and use it. Different formats have different tradeoffs. YAML is readable but doesn’t validate much. JavaScript Object Notation (JSON) works everywhere but is verbose. Yet Another Next Generation (YANG) adds validation but has a steep learning curve. Pick based on what you actually need.

| Format | Use Case | Strengths | Tradeoffs |
| --- | --- | --- | --- |
| YAML | Configuration, human editing | Readable, concise | Limited schema validation |
| JSON | APIs, document storage | Tooling, ecosystem | Verbose for humans |
| XML | Standards-based exchange | XSLT processing, schemas | Heavy syntax |
| Protocol Buffers | Performance, serialization | Compact, versioning | Binary, requires code generation |
| YANG | Network device modeling | Industry standard (RFC 6020), hierarchical constraints | Steep learning curve |

Your data exists at different levels. The same network piece can be described in multiple ways for different purposes:

  • Service level: Business-friendly (“set up an MPLS L3VPN for branch X”)
  • Technical level: Technical specs (“BGP AS 65001, route targets 65001:100, policies…”)
  • Device level: Actual configs (“interface xe-0/0/0; unit 100;…”)

Good models connect these layers. You can start at a business level, then generate device configs automatically. But you don’t always need all three layers: depends on what you’re building.

4.2.1.2. Data Persistence and Scale#

Choosing how to store model data has profound implications for consistency, performance, and evolution. These implications become critical as your network grows from hundreds to hundreds of thousands of managed objects.

  • Relational Databases (e.g., MySQL, PostgreSQL): This is the safe bet for most teams, and I mean that as a compliment. They enforce schema consistency, provide ACID transactions, and every engineer on your team already knows SQL. They excel at representing normalized hierarchies (VLANs containing interfaces) and preventing data anomalies. The downside: schema changes hurt at scale, and performance tanks with deep joins across millions of rows. But that’s a good problem to have: it means you’ve actually deployed something.

  • Document Databases (e.g., MongoDB, CouchDB): Great if you need schema flexibility and horizontal scalability. Documents naturally model nested structures (a device with all its configs and metadata as one blob). But here’s the catch: you’re now responsible for consistency across documents, and complex queries get expensive fast. Unless you have a specific reason to go document-based, stick with relational.

  • Graph Databases (e.g., Neo4j): These are genuinely better when relationships matter more than objects: “which VLANs does this interface connect to? What devices route between these two locations?” They traverse relationships of arbitrary depth efficiently. But unless you’re doing complex topology queries constantly, you’re choosing the exotic option. Your ops team doesn’t know it, the tooling is less mature, and write performance lags for simple updates. Graph databases solve real problems, but make sure you actually have those problems first.

We’ll revisit data persistence from a different perspective in the Observability chapter.

Database selection is one of the key differentiators among products in this space. For example, NetBox and Nautobot use relational databases, while Infrahub uses a graph database, as you can see in section 4.2.8.

Persistence can also be implemented with file-based storage using formats like YAML, JSON, or CSV, commonly tracked in Git for version control.

Most production Source of Truth systems use polyglot persistence: a relational database for authoritative intent and relationships, document storage or Git for flexibility and caching, and graph capabilities for topology analysis.

How granular should your model be? This matters for performance. If you model every interface on every device as a separate object, you end up with 50,000+ objects for a medium network. Queries get slow. Updates become painful.

The practical approach: use templates for common patterns. Say “interfaces 1-40 use this standard template” and only track exceptions. That’s 2 objects instead of 40, queries stay fast, and rendering still works.
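For example, instead of modeling 40 per-interface objects, the model might store one template reference plus the exceptions. A sketch in YAML (the field and profile names are illustrative):

```yaml
device: access-sw-01
interface_template:
  range: "GigabitEthernet1/0/1-39"
  profile: standard_access     # access VLAN, portfast, storm control
interface_exceptions:
  - name: "GigabitEthernet1/0/40"
    profile: uplink_trunk      # the one port that deviates from the template
```

Rendering expands the range back into per-interface configs, so nothing is lost; only the stored representation shrinks.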

In Chapter 11, we’ll take a closer look at how decisions like this impact performance.

4.2.1.3. Network Data Domains#

Comprehensive Source of Truth implementations typically model these interconnected domains:

  • Inventory and Assets: Physical and logical devices, hardware specifications, serial numbers, procurement date, contract terms, lifecycle stage
  • Data Center Infrastructure: Locations (geographic and hierarchical), buildings, floors, rooms, racks, power distribution, cable routes
  • IP Address Management (IPAM): Address pools, subnets, assignments, DNS resolution, DHCP scopes, utilization tracking
  • Virtualization and Cloud: VPCs, subnets, security groups, compute instances, storage, container orchestration relationships
  • Connectivity: Physical circuits (MPLS, Ethernet), virtual tunnels, peering relationships, bandwidth allocations, QoS policies
  • Routing: BGP communities, autonomous systems, routing policies, prefix lists, route targets for L3VPN services
  • Services: Logical service definitions (L3VPN, L2VPN, firewall traversal), service-to-device mappings, SLAs
  • Security and Compliance: Access control lists, firewall rules, security zones, compliance tags, audit requirements
  • Management: SNMP details, gNMI subscriptions, NTP sources, syslog targets, TACACS+/RADIUS integration

Beyond network-specific domains, many other data types are needed. For example, a secrets backend to store credentials. More generally, comprehensive data management fits within a global IT infrastructure management system.

A common question many teams debate is whether to use a network-specific data source, such as NetBox, or a general-purpose one like ServiceNow. In my experience, even though it’s possible to achieve similar outcomes with either, the fact that ServiceNow is a company-wide system makes it harder (and slower) to evolve, which delays the network team’s ability to leverage it for a complete network representation.

Data Classification and Common Attributes

Across network domains, certain classification patterns appear consistently. Well-designed models include these foundational attributes to keep data manipulation consistent:

  • Role: What purpose does this object serve? (core, distribution, access; primary, secondary)
  • Status: Is it active, planned, decommissioning, or retired?
  • Kind: What type of object is it? (VLAN, device, circuit, service)
  • Ownership: Which team or business unit manages this?
  • Location: Where is this resource situated geographically or organizationally?
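In practice these attributes show up on nearly every object, regardless of its kind. A hypothetical record illustrating the pattern:

```yaml
kind: device
name: dal01-edge-01
role: edge          # purpose within the design
status: planned     # lifecycle stage, not just "exists"
ownership: wan-team
location: dallas-tx
```

Because every object carries the same classification fields, queries like “all planned devices owned by wan-team” work uniformly across domains.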

4.2.1.4. Model Design Patterns: Extensibility, Polymorphism and Migrations#

Some solutions ship with a built-in model based on what the creators think networks should look like. That’s great if your network matches that assumption, and not so good if it doesn’t. Other solutions let you build the model from scratch: great flexibility, but you need discipline. The best approach is usually a hybrid: use the built-in models for the common cases (the 80% that everybody does), but make sure you can extend them for your specific needs.

The Spectrum: From Opinionated to Custom

There’s a spectrum here, and it’s worth understanding where you fall:

  • Highly Opinionated Models (e.g., NetBox out of the box):

    • Pros: Fast to deploy, less decision-making, built-in best practices
    • Cons: Painful if your network doesn’t fit the mold, changing the model is hard
  • Fully Custom Models (build your own from scratch, like Infrahub):

    • Pros: Perfect fit for your specific needs, no wasted fields
    • Cons: Extra time to design, lots of trial and error, nobody else to copy from (but there are references)

Even though Infrahub allows you to build fully custom models, it also comes with opinionated models to get started with.

Where should you actually start? Here’s what I tell everyone: start with tools that ship opinionated models, like NetBox, Nautobot, or Infrahub. Period. I don’t care how special you think your network is. The “we’ll build our own perfect model” teams are still modeling two years later, while the teams that adopted these tools shipped working models months ago.

You’re not Google (or maybe you are?). In most cases, you’re not building a hyperscale cloud. Use what exists, extend it when you hit real limits (not imagined ones), and ship something that works.

Obviously, if you already anticipate that you have a very special network use case, you may consider which tool will support it better, but do not completely reinvent the wheel from day one.

The only decision that matters: can you extend without rewriting? If adding a custom field to track “cost center” requires rebuilding your entire schema, run away. If you can add it as a custom attribute in 20 minutes, you’ve found the right tool.

You want 80% standardization with 20% flexibility (handle your specific reality). Anything claiming to be “infinitely flexible” will take infinite time to configure.

Polymorphism: One Model, Many Flavors

All interfaces aren’t created equal. A physical optical port and a tunnel interface share some properties, but they’re different beasts. You could create completely separate models for each, but that gets messy fast and limits usability.

Better approach: define a shared base model that covers what they have in common, then create specialized variants for the details that differ.

# All interfaces share these basics
interfaces:
  - name: "eth0"
    type: "physical"
    status: "up"
    mtu: 1500
    ipv4_address: "192.0.2.1/24"

# But a physical optical port has extra stuff
  - name: "eth0"
    type: "physical_optical"
    status: "up"
    mtu: 1500
    ipv4_address: "192.0.2.1/24"
    optics_module: "100GBASE-LR4"
    tx_power_dbm: -2.5
    rx_power_dbm: -8.3
    laser_temperature: 48.2

# And a tunnel looks totally different underneath the covers
  - name: "tun-vpn-dallas"
    type: "tunnel_gre"
    status: "up"
    mtu: 1476
    ipv4_address: "10.0.0.1/30"
    tunnel_source: "203.0.113.1"
    tunnel_destination: "198.51.100.5"
    tunnel_encap: "GRE"
    tunnel_key: 100

This way, your scripts can query “all interfaces on this device” without knowing whether they’re talking to optics or tunnels. But when you need optical-specific details (getting TX power readings), you can reach into those specialized fields. Add a new interface type down the road? No problem: just add another variant.

Models Change: Plan for It

Networks live for decades. Your model won’t stay static. You’ll add new fields, change how things relate to each other, deprecate stuff that doesn’t work anymore. But you can’t just flip a switch and break everything running on your old schema.

The challenge is that when your model changes, a lot of downstream stuff can break: validators need new rules, APIs change what they return, database queries assume certain columns exist, your templates need different field paths, reports are written against the old structure. All of that needs to adapt, or at least not catastrophically break. Here’s how to handle changes without blowing things up:

  • Mark things as deprecated before you remove them. If a field is going away, tell everyone 2-3 releases ahead of time. Give them a grace period to migrate.

  • Support field aliases during the transition. Old code asking for device_name? Route it to hostname under the hood. Your API still works, people have time to update their automation.

  • Create migration helpers. When you restructure data (like moving interfaces from a flat list to nested under devices), provide scripts that do the legwork.

  • Test with everything that depends on your data. Before rolling out schema changes:

    • Does the API still return data the old way? (via aliases)
    • Do your templates still render?
    • Do the SQL queries still work?
    • Do integrations with external systems still parse what they get?
  • Expect mixed versions in production for a while. Big organizations often have devices on three different schema versions simultaneously:

    150 devices on schema 1.9 (old)
    300 devices on schema 2.0 (current)
    50 devices on schema 2.1 (beta testing)

Your system needs to handle all three without breaking. That complexity is worth it: networks matter too much to break with careless schema changes.
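Field aliasing, for instance, can be implemented as a thin resolution layer in front of the data store. A sketch (the alias table and field names are hypothetical):

```python
# Deprecated field names mapped to their replacements in the new schema.
FIELD_ALIASES = {"device_name": "hostname"}

def resolve_field(record, field):
    """Serve old field names from the new schema during the migration window."""
    canonical = FIELD_ALIASES.get(field, field)
    return record[canonical]

record = {"hostname": "core-rtr-01", "model": "MX480"}
# Old automation asking for device_name keeps working while it migrates.
print(resolve_field(record, "device_name"))  # prints "core-rtr-01"
print(resolve_field(record, "hostname"))     # prints "core-rtr-01"
```

Once the grace period ends, the alias entry is removed and old callers fail loudly instead of silently reading stale fields.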

4.2.1.5. Configuration Templates#

Templates turn abstract intent (“I want a VLAN”) into actual CLI commands. Here’s the key rule: templates contain logic, not data. The template says “use the VLAN ID from the data model, output it here.” The actual VLAN ID comes from the data, not the template. This keeps data portable and testable. Jinja2 is popular because it’s human-readable, integrated with Python and Ansible, and practical for real networks. However, it is not the only option; there are many valid alternatives.

For example, given a data structure for interfaces

interfaces:
  - name: Ethernet1
    description: Uplink to core
    enabled: true
  - name: Ethernet2
    description: Server port
    enabled: false

And this CLI configuration Jinja2 template

# Interfaces
{% for iface in interfaces %}
interface {{ iface.name }}
  description {{ iface.description }}
  {% if iface.enabled %}
  no shutdown
  {% else %}
  shutdown
  {% endif %}
{% endfor %}

It generates the following CLI output:

# Interfaces
interface Ethernet1
  description Uplink to core
  no shutdown
interface Ethernet2
  description Server port
  shutdown

Notice the clear separation of concerns between the data and the configuration artifact.
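To make that separation explicit, the same transformation can be sketched in plain Python: the rendering logic lives in code, and the interface data stays an untouched, testable structure (a minimal sketch, not a substitute for a real template engine):

```python
def render_interfaces(interfaces):
    """Produce the same CLI artifact as the Jinja2 template:
    logic lives here, data stays in the `interfaces` structure."""
    lines = ["# Interfaces"]
    for iface in interfaces:
        lines.append(f"interface {iface['name']}")
        lines.append(f"  description {iface['description']}")
        lines.append("  no shutdown" if iface["enabled"] else "  shutdown")
    return "\n".join(lines)

interfaces = [
    {"name": "Ethernet1", "description": "Uplink to core", "enabled": True},
    {"name": "Ethernet2", "description": "Server port", "enabled": False},
]
print(render_interfaces(interfaces))
```

One practical note: to get the clean output shown above from the Jinja2 template as written, render it with `jinja2.Environment(trim_blocks=True, lstrip_blocks=True)` so the whitespace around the `{% %}` tags is stripped.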

An interesting alternative to Jinja is the CUE declarative configuration language that unifies multiple data functionalities:

  • Raw data, like YAML
  • Data validation/schema, like JSON Schema (more in the Enforcement section)
  • Dynamic data generation, like Jinja

CUE treats configuration as typed, composable data with enforced invariants, rather than as loosely structured text (as Jinja does).

4.2.2. Design-Driven#

When you have 50 devices and 30 VLANs, you can manage them individually: create each VLAN, configure each interface, allocate each IP by hand. It’s tedious but manageable.

When you have 5,000 devices and hundreds of services, manual management is impossible. Adding a new branch office would mean manually specifying 100+ config items per location. The design-driven building block solves this: operators describe what they want at a high level (“add a branch”), and the system expands it into complete technical specs.

4.2.2.1. From Business Intent to Technical Data#

Example scenario - “Add branch office in Dallas”:

Business Intent (High Level):

{
  "type": "branch_office",
  "location": "dallas-tx",
  "site_code": "DAL-01",
  "circuit_count": 2,
  "employee_count": 50,
  "applications": ["erp", "voip", "video"]
}

Design Processing

| Phase | Tasks |
| --- | --- |
| 1. Apply template | • Creates site object<br>• Allocates subnets from delegation pool (regional IPAM)<br>• Creates VLANs for data, voice, guest<br>• Defines security zones and firewall rules |
| 2. Allocate resources | • Next available /22 subnet for site: 10.15.0.0/22<br>• Next available VLAN range: 2010-2014<br>• Next available BGP community: 65001:2010 |
| 3. Resolve logic | • employee_count < 100? Yes → Small office template<br>• Redundant circuits needed? Yes → Create 2 BGP neighbors<br>• Applications include [erp, voip]? Yes → Add firewall policies |
| 4. Render details | • PE device config (BGP setup, route targets, QoS for voice)<br>• CE device config (LAN interfaces, VLANs, NTP)<br>• Firewall policy (permit ERP traffic, prioritize voice)<br>• DNS entries for office services<br>• Monitoring setup (SNMP, syslog, alerts) |
| 5. Output | 50+ configuration objects ready to deploy |

This transforms low-effort, high-level intent into comprehensive technical detail.
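The expansion step itself is just a function from a small intent document to a much larger set of objects. A heavily simplified sketch (all rules, names, and counts are illustrative):

```python
def expand_branch_office(intent):
    """Expand high-level branch-office intent into technical objects."""
    objects = [{"kind": "site", "code": intent["site_code"]}]
    # Template: one VLAN each for data, voice, and guest.
    for purpose in ("data", "voice", "guest"):
        objects.append({"kind": "vlan", "site": intent["site_code"], "purpose": purpose})
    # Logic: redundant circuits -> one BGP neighbor per circuit.
    for i in range(intent["circuit_count"]):
        objects.append({"kind": "bgp_neighbor", "site": intent["site_code"], "index": i})
    # Logic: one firewall policy per requested application.
    for app in intent["applications"]:
        objects.append({"kind": "fw_policy", "site": intent["site_code"], "app": app})
    return objects

intent = {"site_code": "DAL-01", "circuit_count": 2,
          "applications": ["erp", "voip", "video"]}
print(len(expand_branch_office(intent)))  # prints 9
```

A real design engine adds resource allocation, validation, and per-device rendering on top, but the shape is the same: small input, large deterministic output.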

4.2.2.2. Design Patterns and Reusable Building Blocks#

Effective design systems depend on pattern libraries.

Network Design Patterns Library

| Pattern | Components |
| --- | --- |
| Small Branch Office | • Single edge router (resilient via backup)<br>• 2-3 VLANs (data, voice, guest)<br>• Single MPLS VPN with backup BGP peer<br>• QoS for voice (EF, AF classes)<br>• Firewall rules: restrict access except ERP and VoIP |
| Medium Regional Hub | • Redundant edge routers (active-standby)<br>• 10-15 VLANs (by department + guest + OOB)<br>• Multiple MPLS VPNs (some custom per application)<br>• Sophisticated QoS (6 classes)<br>• Advanced firewall policies (application layer) |
| Data Center Edge | • Quad-redundant Clos topology<br>• 100+ VLANs (auto-generated per customer)<br>• BGP unnumbered, multipath<br>• Automatic VLAN allocation |

Each pattern encodes proven architectural decisions. Deviating from a pattern requires explicit approval, but deploying one should be treated as a routine operation.

4.2.2.3. Making Designs Reproducible#

Here’s a problem: you design a site on Monday and allocate VLAN 100. You design a different site on Tuesday and allocate VLAN 101. But if you re-run the Monday design six months later for validation, the system might allocate VLAN 102 because VLAN 100 is taken.

That’s not reproducible. The solution: whenever you make a design request, record which design got which resources. If the same request comes in again, you get the same VLAN. This requires tracking request-to-resource mappings permanently and always allocating in the same order.

  • Deterministic allocation order: Always iterate through candidates in same order
  • Atomic allocation: Resource reservation is atomic; partial allocation not possible

This is why many design systems use content-addressable storage (hash of design) to ensure consistency.
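A minimal sketch of such deterministic, replayable allocation (class and method names are illustrative, not from any particular tool):

```python
# Sketch of reproducible resource allocation: the same request ID always
# receives the same VLAN, even if later requests consume neighboring values.

class VlanAllocator:
    def __init__(self, start=100, end=199):
        self.pool = range(start, end + 1)  # candidates iterated in a fixed order
        self.assigned = {}                 # request_id -> vlan (permanent record)

    def allocate(self, request_id: str) -> int:
        if request_id in self.assigned:    # replay: return the recorded answer
            return self.assigned[request_id]
        taken = set(self.assigned.values())
        for vlan in self.pool:             # deterministic allocation order
            if vlan not in taken:
                self.assigned[request_id] = vlan  # record before returning
                return vlan
        raise RuntimeError("VLAN pool exhausted")

alloc = VlanAllocator()
monday = alloc.allocate("site-monday")    # first free VLAN: 100
tuesday = alloc.allocate("site-tuesday")  # next free VLAN: 101
replay = alloc.allocate("site-monday")    # months later: still 100
```

The recorded mapping, not the pool state, is what makes the Monday design reproducible in six months.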

4.2.2.4. Render vs. Build Distinction#

Design systems separate two phases:

| Phase | Input | Output | Side Effects | Questions Answered | Use Cases |
|---|---|---|---|---|---|
| Render | High-level design | Technical specifications | None (no resources allocated, no changes pushed) | "What would be created?"; "Are there errors?"; "Can I see the proposed data expansion?" | Review before approval; validation testing; estimating change scope |
| Build | Rendered specifications | Real changes (IPs committed, VLANs created, configs pushed, audit trail) | Atomic, idempotent, monitorable, rollbackable | "What was actually deployed?"; "When did it happen?"; "Can I roll back?" | Production deployment; resource reservation; triggering execution workflows |

Supporting render without build enables:

  • Safe validation before commitment
  • What-if analysis without risk
  • Team review and approval workflows
  • Staging in test environment first

4.2.2.5. Design Versioning and Evolution#

Network designs evolve over time. New patterns emerge. Tools improve. Designs from 2020 may need updating for 2025.

Design versioning challenges:

  • Design Version 1 (2020): Branch office template: single router, 2 VLANs

  • Design Version 2 (2023): Branch office template: dual router (redundancy), 4 VLANs, improved security

What happens to branches deployed with Version 1?

  • Continue running old version? (security debt)
  • Force upgrade all branches? (change risk)
  • Gradual migration? (operational complexity)

Solutions:

Design-as-code with versioning

designs/
  ├─ branch_office_v1.yaml (deprecated 2023-01-01)
  ├─ branch_office_v2.yaml (current)
  └─ branch_office_v2_beta.yaml (testing)

Sites:
  - site: DAL-01
    design: branch_office_v2  # references a specific version
    design_parameters: {...}

This allows:

  • Knowing exactly which design version generated each site
  • Gradual migration (update site-by-site)
  • Rolling back to v1 if v2 has issues
  • Testing v3 in staging before rolling out

Snowflake accommodation

Most sites use the design_v2 template, but site DAL-01 has special requirements:

  • Extra rack space (old building quirk)
  • Unusual IP addressing (legacy allocation)
  • Custom firewall rule (historical business requirement)

Solution:

  • Deploy design_v2 as base
  • Apply site-specific overrides
  • Track overrides separately (document the snowflake)

This prevents:

  • Designing exceptions into the template (pollutes the design)
  • Manual configuration of exceptions (drift over time)
  • Lost context about why exceptions exist

4.2.3. Consumption#

Data locked away in a database is useless. The consumption layer is how people and systems actually get their hands on the data. Make it easy, and you get buy-in. Make it hard, and people will work around it.

4.2.3.1. API Design & Security#

Different API styles serve different consumption patterns. The following are among the most popular today:

API Style Comparison

| Interface | Request Pattern | Use Case | Strengths | Tradeoffs |
|---|---|---|---|---|
| Representational State Transfer (REST) | Hypertext Transfer Protocol (HTTP) verbs (GET, POST, PATCH, DELETE) on resource URLs | General-purpose CRUD | Simple, widely understood, stateless | Can require many requests; versioning challenges |
| GraphQL | Single endpoint, client specifies exact fields needed | Complex multi-resource queries | Flexible, client-driven, reduces over-fetching | More complex server implementation, N+1 query risk |
| gRPC | Protobuf-based RPC over HTTP/2 | High-performance, low-latency | Bidirectional streaming, binary efficiency, often significantly faster than JSON over REST | Learning curve, limited browser support |
| Webhooks | Server pushes changes to registered endpoints | Reactive automation, real-time sync | Asynchronous, decoupled, no polling overhead | Delivery not guaranteed, retry complexity, security challenges |

Effective consumption strategies often combine multiple interfaces:

  • REST for human-friendly, simple operations
  • GraphQL for complex multi-domain queries with authorization filtering
  • gRPC/streaming for high-volume automation
  • Webhooks for reactive downstream systems

Note on MCP (Model Context Protocol): In addition to traditional API styles, emerging interaction models such as the Model Context Protocol (MCP) enable AI agents to interact with systems in a structured, tool-oriented manner. Unlike REST or gRPC, which define transport and request semantics, MCP focuses on safe capability discovery, contextual data retrieval, and controlled action execution for agent-driven workflows. This pattern is particularly relevant in AI-assisted operations and observability use cases, where automated reasoning systems need structured access to telemetry, configuration, and remediation tools. While still evolving, MCP represents a shift from resource-centric APIs toward agent-oriented interaction models.

One critical rule: do authentication and authorization at the API layer, not downstream. This prevents security holes and makes auditing work properly.

More about security implications in Chapter 12.

4.2.3.2. API Operations: Versioning, Performance & Caching#

Once APIs are designed and secured, operational concerns around evolution and performance become critical.

API Versioning & Evolution

Networks are long-lived infrastructure; automation systems must evolve without breaking operational workflows.

  • URL versioning: /api/v1/devices and /api/v2/devices maintain separate implementations

    • Pro: Explicit, debuggable, clients choose when to upgrade
    • Con: You maintain multiple implementations (this is actually good: it forces you to plan migrations)
  • Header versioning: Same endpoint, version in Accept: application/vnd.company+json; version=2

    • Pro: Single endpoint, cleaner URLs
    • Con: Invisible in logs, harder to debug, clients mess it up constantly

Either way, announce breaking changes 6-12 months ahead via deprecation headers:

Deprecation: true
Sunset: Mon, 01 Jan 2026 00:00:00 GMT
Link: <https://docs/migration>; rel="deprecation"

I recommend using URL versioning: /api/v1/devices and /api/v2/devices. Yes, header versioning looks cleaner in the documentation. But when something breaks at 3am and you’re grepping logs, you want to see /v1/ vs /v2/ immediately, not dig through request headers to figure out which API version caused the problem. Your future on-call self will thank you.

Performance & Caching

Consuming data one object at a time works well enough at small scale, but as volume grows, performance limits are eventually reached.

Bulk operations allow updating thousands of objects in single API calls. This is essential for scale but introduces trade-offs:

  • Single-object API: POST /api/devices
    Benefits: validates every change, clear error messages per object.
    Cost: 10,000 device updates = 10,000 API requests and roundtrips = minutes.

  • Bulk API: POST /api/devices/bulk-update
    Benefits: single roundtrip for 10,000 updates, backend processing can be parallelized.
    Cost: validation shortcuts (expensive checks skipped), harder to report per-object errors.

Other alternatives to improve data consumption include limiting scope via:

  • Pagination and filtering prevent clients from accidentally loading millions of objects into memory or timing out: GET /api/devices?location=datacenter-1&status=active&limit=100&offset=0

    Cursor-based pagination offers advantages over offset-based pagination when dealing with large datasets, as it’s more efficient for distributed systems and remains consistent even when data is modified between requests.

  • Caching strategies dramatically improve performance:

    • Server-side caching: Redis cache of “all devices in location X” invalidated when any device in that location changes
    • Client-side caching: HTTP ETags let clients validate cached data without refetching
    • Validation layer bypass: For read-only queries, skip validation checks
  • Rate limiting protects backends from overload. For example:

    • 10 requests/second per user
    • 1,000 requests/minute per API key
    • Backpressure signals (HTTP 429) tell clients to slow down
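The cursor-based pagination mentioned above can be sketched as follows; the cursor simply encodes the last key returned, so a client's position stays stable even if rows are inserted between requests (the endpoint and field names are illustrative):

```python
# Minimal sketch of cursor-based pagination over an id-sorted collection.
# The cursor is the last id seen; the next page starts strictly after it.

DEVICES = [{"id": i, "name": f"sw-{i:02d}"} for i in range(1, 8)]  # sorted by id

def list_devices(cursor=None, limit=3):
    after = cursor or 0
    page = [d for d in DEVICES if d["id"] > after][:limit]
    # Only hand out a cursor when a full page was returned (more may exist).
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return page, next_cursor

page1, c1 = list_devices()           # first page, ids 1..3
page2, c2 = list_devices(cursor=c1)  # next page, ids 4..6
```

Contrast this with offset pagination, where an insert before the offset silently shifts every subsequent page.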

4.2.3.3. Context-Aware Data Modeling#

Different people need different things. Network engineers want to search and find data, edit specific fields with validation feedback, see relationships. Automation workflows need APIs, transactions, and webhooks. Operations teams want task-focused views and audit trails. Business wants reports without technical detail. External systems need to sync data both directions.

Notice that the user interface of the consumption block belongs to the Presentation layer. We will cover more about this in Chapter 8.

The data has to adapt to each consumer context. For example, not all the device details have to be exposed to each persona. Context-aware customization makes data consumption more efficient and effective.

A network engineer needs interfaces and routing details. Finance wants cost and warranty info. Security wants compliance zone. Instead of returning all 500 fields to everyone, give each consumer only what they need. You can do this with API query parameters (?view=network_engineer) or GraphQL by letting them request specific fields.

The Challenge: One Data Model Cannot Fit All Consumers

Consider a single device object in the SoT containing hundreds of attributes:

  • Hardware details (model, serial number, firmware version)
  • Network configuration (IP addresses, VLANs, routing protocols)
  • Operational metadata (location, cost center, warranty expiration)
  • Service relationships (which customers use this device)
  • Security context (compliance zone, access policies)

Different consumers need radically different views of this data:

Example: Device “router-core-01” in Different Contexts

| Consumer | Context | Data Representation | Key Attributes |
|---|---|---|---|
| Network Engineer | Troubleshooting connectivity | Network-centric view | Interfaces, IP addresses, BGP peers, routes, VLANs, uplink status |
| Finance Team | Cost allocation | Business-centric view | Cost center, depreciation schedule, warranty status, purchase date, vendor |
| Security Team | Compliance audit | Security-centric view | Compliance zone, access policies, last security scan, patch level, open vulnerabilities |
| Automation Workflow | Configuration deployment | Execution-centric view | Management IP, credentials reference, device platform, config template name |
| Service Catalog | Customer impact analysis | Service-centric view | Customers served, services hosted, SLA tier, redundancy group |

Implementation Patterns

  1. View-based transformations: API queries specify desired context, server transforms data model accordingly

    GET /api/devices/router-core-01?view=network-engineer
    → Returns: {interfaces: [...], bgp_peers: [...], vlans: [...]}
    
    GET /api/devices/router-core-01?view=finance
    → Returns: {cost_center: "...", warranty_expiry: "...", purchase_cost: ...}
  2. GraphQL field selection: Consumers explicitly request only needed fields

    query NetworkEngineerView {
      device(name: "router-core-01") {
        interfaces { name, ip_address, status }
        bgp_neighbors { peer_ip, state }
      }
    }
    
    query FinanceView {
      device(name: "router-core-01") {
        cost_center
        purchase_date
        warranty_expiration
      }
    }
  3. Projection layers: Backend computes derived views optimized for specific access patterns

    • Network topology graph: Devices as nodes, connections as edges (for topology visualization)
    • Service dependency tree: Hierarchical view of services → devices → components (for impact analysis)
    • Compliance matrix: Devices grouped by zone with policy adherence status (for auditing)
  4. Domain-specific languages (DSLs): Specialized query languages tailored to user mental models

    • Network engineer: “Show me all devices with BGP session down in datacenter-1”
    • Finance: “Calculate monthly depreciation for devices purchased in Q3 2025”
    • Security: “List devices in PCI zone with critical CVEs”

Benefits of Context-Aware Modeling

| Benefit | Impact | Example |
|---|---|---|
| Reduced cognitive load | Users see only relevant data for their task | Finance team never sees VLAN configurations; network engineers don't see purchase orders |
| Performance optimization | API returns minimal data, reducing bandwidth and processing | Mobile app requests a device summary (5 fields) vs. the full device object (200+ fields) |
| Field-level access control | Sensitive fields filtered based on consumer role | Credentials and security zones hidden from unauthorized consumers |
| API stability | Backend schema changes don't break consumers if views remain stable | Adding new device fields doesn't affect existing network-engineer view consumers |
| Multi-language support | Field names, enums, descriptions translated based on consumer locale | Catalan-speaking operator sees "Ubicació" instead of "Location" |

However, you must consider challenges and other considerations:

  • View maintenance overhead: Each new view requires definition, testing, documentation
  • Consistency across views: Same data exposed through multiple views must remain consistent
  • Performance complexity: Computing dynamic views adds latency; requires caching strategies
  • Discovery: Consumers must know which views exist and when to use them

Context-aware modeling is the bridge between data structure optimization (how we store data efficiently) and consumption optimization (how consumers access data naturally). It recognizes that the Source of Truth serves many masters, each with their own perspective on what “truth” means for their work.

4.2.3.4. AI-Assisted Interfaces#

Modern UIs use AI to help users. As you type a device name, the system suggests matching devices you’ve edited before in this location. When creating a service, it suggests reasonable technical parameters based on the service type. Before you bulk-update, it flags if you’re doing something unusual compared to your history. These features require learning from past operations, but they dramatically improve usability at scale.

4.2.4. Enforcement#

Bad data destroys automation. Wrong intent leads to broken networks, security breaches, and confused ops teams. Enforcement is your guardian: it stops obviously wrong data from getting in and explains clearly why.

4.2.4.1. Schema and Constraint Enforcement#

| Validation Type | Purpose | Example Rules | Example Error |
|---|---|---|---|
| Schema Validation | Enforce data types and formats | VLAN id: integer 1-4094; VLAN name: string, max 32 chars; VLAN sites: array of site references; VLAN status: enum(active, planned, deprecated) | `{ "id": 5000, "name": "CUSTOMER-VLAN" }` → Error: VLAN ID 5000 exceeds maximum 4094 |
| Uniqueness Constraints | Prevent duplicate entries | VLAN 100: unique per site; IP 192.0.2.1: unique across IPAM; hostname "pe-01": unique per region | Attempt to create a second "pe-01" in the same region → Error: Hostname already exists |
| Referential Integrity | Ensure relationships remain valid | Circuit references site_a, site_b (must exist in Sites) and vendor (must exist in Vendors) | Delete site_a referenced by a circuit → Options: reject deletion, cascade delete, or orphan with "site unknown" |
| Range Validation | Enforce numeric and pattern constraints | BGP AS: 1-4294967295; IPv4: regex `^(\d{1,3}\.){3}\d{1,3}$`; interface speed: {1G, 10G, 25G, 40G, 100G} | Interface speed "5G" → Error: Invalid speed, must be one of {1G, 10G, 25G, 40G, 100G} |
| Business Rule Validation | Enforce organizational policies | If VLAN status='active' → must have ≥1 site; if device type='firewall' → must have security_zone; if service type='L3VPN' → route_distinguisher must be unique per customer | Active VLAN with no sites → Error: Active VLANs must be assigned to at least one site |

There are many ways to implement schema validation, from generic coding languages to more specific solutions:

  • JSON Schema: JSON document that describes data constraints to compare against actual data.
  • CUE: Provides typed, validated data generation with constraints and validation.
  • YANG: Network-specific modeling language with built-in constraint enforcement.

For example, using JSON Schema, you can validate your JSON (or YAML) data:

  • JSON Data
{
  "hostname": "sw-core-01",
  "mgmt_ip": "192.0.2.10",
  "role": "core",
  "interfaces": [
    {
      "name": "Ethernet1",
      "enabled": true,
      "vlan": 100
    },
    {
      "name": "Ethernet2",
      "enabled": false,
      "vlan": 200
    }
  ]
}
  • JSON Schema
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["hostname", "mgmt_ip", "role", "interfaces"],
  "additionalProperties": false,
  "properties": {
    "hostname": {
      "type": "string",
      "pattern": "^[a-z0-9-]+$"
    },
    "mgmt_ip": {
      "type": "string",
      "format": "ipv4"
    },
    "role": {
      "type": "string",
      "enum": ["core", "distribution", "access"]
    },
    "interfaces": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["name", "enabled", "vlan"],
        "additionalProperties": false,
        "properties": {
          "name": {
            "type": "string",
            "pattern": "^Ethernet[0-9]+$"
          },
          "enabled": {
            "type": "boolean"
          },
          "vlan": {
            "type": "integer",
            "minimum": 1,
            "maximum": 4094
          }
        }
      }
    }
  }
}
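To show the shape of the logic, here is a toy Python version of a few of the same constraints (hostname pattern, role enum, VLAN range). This is only a sketch; in production you would run the JSON Schema above through a real validator library rather than hand-rolling checks:

```python
import re

# Toy validator mirroring a few constraints from the JSON Schema example.
# Illustrative only; use a real JSON Schema library in production.

VALID_ROLES = {"core", "distribution", "access"}

def validate_device(data):
    errors = []
    if not re.fullmatch(r"[a-z0-9-]+", data.get("hostname", "")):
        errors.append("hostname must match ^[a-z0-9-]+$")
    if data.get("role") not in VALID_ROLES:
        errors.append(f"role must be one of {sorted(VALID_ROLES)}")
    for intf in data.get("interfaces", []):
        if not 1 <= intf.get("vlan", 0) <= 4094:
            errors.append(f"{intf.get('name')}: vlan must be 1-4094")
    return errors

ok = validate_device({"hostname": "sw-core-01", "role": "core",
                      "interfaces": [{"name": "Ethernet1", "vlan": 100}]})
bad = validate_device({"hostname": "SW core", "role": "core",
                       "interfaces": [{"name": "Ethernet1", "vlan": 5000}]})
```

Returning a list of errors, rather than failing on the first one, gives the user all the feedback in a single round trip.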

4.2.4.2. Hard vs. Soft Enforcement#

Reject bad data. Hard stop. I’ve seen too many “soft validation” systems turn into garbage dumps because “just this once” becomes standard operating procedure. If it’s important enough to validate, it’s important enough to enforce.

But, as usual, there are exceptions:

  1. Policy guidelines (not rules): “You’re using a /22 subnet when we usually use /24” is a warning, not an error. Warn the user, log it, but let them proceed if they override.

  2. Expensive cross-object checks: If validation takes more than a few seconds, accept the change and validate asynchronously. But still enforce the result: flag the violation, notify the user, and require fixes.

  3. Brownfield network discovery: When you’re importing existing configurations, you’ll find violations everywhere. Flag them, but don’t block import. The whole point is to discover what’s actually there, even if it violates your ideal model.

Everything else? Reject it. Your automation depends on trustworthy data. Bad data means outages.

4.2.4.3. Validation Cost and Performance Trade-offs#

Validation is not free. A hierarchy of validation costs affects design decisions:

graph LR
    SPEED["⚡ SPEED"]
    V1["< 1ms<br/>Type validation"]
    V2["~10ms<br/>Uniqueness"]
    V3["5-50ms<br/>Regex"]
    V4["50-200ms<br/>Referential<br/>integrity"]
    V5["100-500ms<br/>Rule engine"]
    V6["1-5 sec<br/>External API"]
    V7["Hours+<br/>Consistency"]
    THOROUGH["THOROUGHNESS 🎯"]
    
    SPEED --> V1
    V1 --> V2
    V2 --> V3
    V3 --> V4
    V4 --> V5
    V5 --> V6
    V6 --> V7
    V7 --> THOROUGH
    
    style SPEED fill:none,stroke:none,font-size:14px,font-weight:bold
    style THOROUGH fill:none,stroke:none,font-size:14px,font-weight:bold
    style V1 fill:#d4edda,stroke:#28a745,stroke-width:2px
    style V2 fill:#d7ecf1,stroke:#17a2b8,stroke-width:2px
    style V3 fill:#dfe8f1,stroke:#17a2b8,stroke-width:2px
    style V4 fill:#fff3cd,stroke:#ffc107,stroke-width:2px
    style V5 fill:#ffe8cd,stroke:#ffc107,stroke-width:2px
    style V6 fill:#f8d7da,stroke:#dc3545,stroke-width:2px
    style V7 fill:#f5c2c7,stroke:#dc3545,stroke-width:2px

Practical systems choose a performance target and validate only to that level for synchronous write paths:

  • Fast path (< 10ms): Type, regex, local index checks
  • Standard path (< 100ms): Uniqueness, referential integrity
  • Slow path (< 5sec): Rule engine, complex business logic
  • Async path (hours later): Background validation, consistency checks
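The tiered approach above can be sketched as a small dispatcher: the synchronous write path runs only the cheap checks, and anything expensive is queued for a background worker. Function names and thresholds are illustrative:

```python
import re

# Sketch of tiered validation: cheap checks run synchronously on the write
# path; expensive checks are deferred to an async queue. Names are illustrative.

def fast_checks(change):  # < 10ms: type and format validation
    return (isinstance(change.get("vlan"), int)
            and bool(re.fullmatch(r"[a-z0-9-]+", change.get("site", ""))))

def standard_checks(change, existing_vlans):  # < 100ms: uniqueness via index
    return change["vlan"] not in existing_vlans

ASYNC_QUEUE = []  # rule-engine and consistency checks run here, later

def accept_write(change, existing_vlans):
    if not fast_checks(change):
        return "rejected"
    if not standard_checks(change, existing_vlans):
        return "rejected"
    ASYNC_QUEUE.append(change)  # deferred: background validation picks it up
    return "accepted"

result = accept_write({"vlan": 100, "site": "dal-01"}, existing_vlans={200, 300})
```

The key design decision is where to draw the synchronous line; everything past it must have a way to report failures after the fact.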

4.2.4.4. AI-Enhanced Data Quality#

Machine learning improves data quality without slowing throughput. These AI-driven validation techniques complement the user-facing AI features in section 4.2.3.4 to create a comprehensive intelligent data layer.

Anomaly Detection

Match historical patterns against new requests to infer inconsistencies:

  • Historical pattern: When provisioning new branch, operators create VLAN in range 1000-1999, assign to site_a, configure static routing
  • New request: Create VLAN 2500, assign to site_c, enable OSPF
  • Alert: “This VLAN creation pattern is unusual. Are you sure? (98% confidence this should be in 1000-1999 range)”
  • ML technique: Clustering and outlier detection on historical change vectors

Auto-Correction Suggestions

Automatically suggest corrections toward the most likely valid solution:

  • Input: IP address “192.0.2.1/33” (invalid prefix length > 32)
  • System: “Did you mean /24? (inferred from similar subnets in this site)”
  • ML technique: Pattern matching on subnet allocations within the same location

Vector Embeddings for Consistency

Use embeddings to detect significant deviations from peer configurations:

  • Store embeddings of each device configuration
  • New device config submitted
  • Compare embedding to similar devices in same role
  • If significantly different: “This device differs from peers in role: Similar devices have NTP servers X,Y,Z and syslog to server Z”
  • ML technique: Cosine similarity on device configuration embeddings
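The cosine-similarity comparison can be sketched with toy feature vectors. Real systems would use learned embeddings of full configurations; here the vectors are hand-made counts purely for illustration:

```python
import math

# Sketch: compare a device's configuration "embedding" against a peer in the
# same role using cosine similarity. Vectors are toy feature counts
# (e.g. NTP, syslog, SNMP, ACL stanzas), not real embeddings.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

peer = [3, 1, 2, 0]     # reference device in this role
similar = [3, 1, 2, 1]  # close to the peer: no alert
deviant = [0, 9, 0, 4]  # far from the peer: flag for review

similar_score = cosine(peer, similar)
deviant_score = cosine(peer, deviant)
flagged = deviant_score < 0.9  # alert when similarity drops below threshold
```

A threshold like 0.9 is a tuning knob: too high and every minor variation alerts, too low and real drift slips through.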

ML Implementation Considerations

| Aspect | Recommendation | Rationale |
|---|---|---|
| Confidence thresholds | Only surface suggestions with >90% confidence | Prevents alert fatigue from false positives |
| Training data | Use 6-12 months of validated changes | Balance recency with statistical significance |
| Model updates | Retrain weekly or when drift is detected | Adapt to evolving network patterns |
| Explainability | Always show why AI flagged an issue | Build operator trust in recommendations |
| Human override | Allow operators to mark false positives | Improve the model through feedback loops |

Notice the use of “confidence” scores. This is an important concept when using AI because it gives more context about how much you can rely on the recommendation. Always pair AI suggestions with explanations of the underlying pattern that triggered them.

4.2.4.5. Cost of Enforcement: System Design Implications#

High-fidelity validation affects architecture:

| Approach | Advantages | Disadvantages | Best For |
|---|---|---|---|
| Synchronous validation (validate before accepting write) | User gets immediate feedback; data consistency guaranteed | Slow API responses (validation latency); blocks high-throughput automation | Critical data-integrity operations; interactive user workflows |
| Asynchronous validation (accept first, validate later) | Fast API responses; high throughput | Data consistency window exists; complex error reporting ("write succeeded but failed validation 5 minutes later") | Bulk operations; high-volume automation |
| Hybrid approach | Fast type/format validation synchronously; expensive cross-object validation asynchronously; API endpoint to check validation status | More complex implementation; requires status tracking | Most production systems; balanced performance and correctness |

4.2.5. Versioning#

Your network intent changes constantly. Devices are added, configs updated, services moved, policies tweaked. Versioning lets you understand what was true when, how it got that way, and how to get back to something safe if needed.

4.2.5.1. Version Control and Immutable Audit Trails#

All changes must be recorded immutably, creating a complete history. This may take different forms, for example:

  • Change logs

    Change Event:
      timestamp: 2025-02-07T14:32:15Z
      user: engineer@company.com
      action: UPDATE
      resource_type: vlan
      resource_id: vlan-100
      changes:
        description: "Customer VLAN" → "Production Customer VLAN"
        assigned_sites: [site-a, site-b] → [site-a, site-b, site-c]
      reason: "Adding new office location"
      request_id: req-12345
  • Git Commits

    commit 0c0ad51152f9b3be307802badb15eca8d121c576 (HEAD -> new-site, origin/new-site)
    Author: Engineer <engineer@company.com>
    Date:   Sat Feb 7 09:43:59 2026 +0100
    
        Adding a new site: BCN01
    
    diff --git a/sites.yaml b/sites.yaml
    index ae18c87..dabf6c5 100644
    --- a/sites.yaml
    +++ b/sites.yaml
    @@ -219,51 +219,1279 @@ sites:
    
    + BCN01:

This immutable log answers critical questions:

  • What was the network intent on date X? (time travel)
  • Who made this change and why? (accountability)
  • Which deployment configuration corresponds to this intent? (traceability)
  • What changed between versions Y and Z? (diff/comparison)

Immutability is the key: no one, not even administrators, can edit historical records. This prevents cover-ups and ensures audit trails remain trustworthy.

Versioning also involves retention policies, which need to balance compliance with practicality. For example:

  • Keep all changes for 7 years (regulatory requirement)
  • Prune intermediate versions after 2 years (e.g., if VLAN 100 changed 10 times, keep only start and end)
  • Archive to cold storage after 1 year (for cost)

Related to data change events, event sourcing patterns (storing all changes as events for full data recreation) are powerful but they are less common in network infrastructure management. This is because network intent is highly stateful. The current data is more important than the sequence of changes that led to it. Additionally, network changes often involve external state (device configurations, IP allocations in external systems) that cannot be fully recreated from events alone. However, event sourcing can be valuable for specific use cases like audit compliance or forensic analysis.

4.2.5.2. Version Control Patterns: Branching, Merging, and Rollback#

Modern Source of Truth systems borrow heavily from software version control patterns. These capabilities enable safe parallel work, experimental changes, and rapid recovery from mistakes.

Branching for Parallel Work Streams

Any organization with more than one engineer or AI agent working in parallel needs a way to work without blocking others. When one actor (human or AI) is working on provisioning new data, it cannot block the entire Source of Truth.

Learning from software development, branching mechanisms enable parallel work streams. Each team can work on parallel data tracks and finally merge them with the “main” track when ready. This merge is an opportunity to resolve any inconsistencies that may have been introduced.

Example workflow:

main branch (production intent)
  │
  ├─── feature/add-dallas-office (Engineer A)
  │     • Creates site DAL-01
  │     • Allocates VLANs 2100-2110
  │     • Defines 5 new devices
  │
  └─── feature/upgrade-ntp-servers (Engineer B)
        • Changes NTP config for all devices
        • Updates 3,000 device records

When both branches merge back to main:

  • No conflict: Different objects modified → auto-merge
  • Conflict detected: Both modified device-core-01’s NTP → manual resolution required
  • Validation: Merged result must pass all enforcement rules before commit
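The merge logic above amounts to a three-way merge over objects. A minimal sketch, with the Source of Truth modeled as a flat dict of object keys (the structure is illustrative):

```python
# Sketch of three-way merge for a branched Source of Truth: changes touching
# different objects auto-merge; the same object changed differently on both
# sides is a conflict requiring manual resolution.

def merge(base, branch_a, branch_b):
    merged, conflicts = dict(base), []
    changed_a = {k for k in branch_a if branch_a[k] != base.get(k)}
    changed_b = {k for k in branch_b if branch_b[k] != base.get(k)}
    for k in changed_a | changed_b:
        if k in changed_a and k in changed_b and branch_a[k] != branch_b[k]:
            conflicts.append(k)  # both branches changed this object differently
        else:
            merged[k] = branch_a[k] if k in changed_a else branch_b[k]
    return merged, conflicts

base = {"core-01.ntp": "10.0.0.1", "site.DAL-01": None}
a = {**base, "site.DAL-01": {"vlans": [2100, 2101]}}  # adds Dallas office
b = {**base, "core-01.ntp": "10.9.9.9"}               # NTP upgrade project
merged, conflicts = merge(base, a, b)                 # disjoint: auto-merges
```

After a clean merge, the result must still pass the enforcement rules before it is committed to main.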

Implementing this branching mechanism into traditional databases is an important problem in its own right. For example, NetBox and Nautobot, which rely on relational databases, have taken different approaches: NetBox uses database copies for branching, while Nautobot leverages Dolt (a SQL database with built-in Git-like Versioning). Infrahub, designed from the ground up with Git-like branching in mind, implemented this using a graph database with native branching capabilities built-in.

Rollback and Recovery

Mistakes happen, so data rollback must be fast and reliable. Note that this is a rollback of the intent data itself, not directly of the network configuration.

Rolling back data requires the Execution component to enforce the resulting configuration changes (you could roll back configurations on the network infrastructure directly if supported, but eventually both the network and the intent must be synchronized to maintain a stable state).

To support data rollback, there are two main approaches, plus a hybrid of the two:

| Approach | How It Works | Storage Overhead | Recovery Speed | Complexity |
|---|---|---|---|---|
| Snapshots | Periodic full copies of the entire dataset | High (full copy per snapshot) | Fast (direct restore) | Low implementation complexity |
| Event Sourcing | Record all changes as events, replay to reconstruct state | Low (only deltas stored) | Slower (must replay events) | High implementation complexity |
| Hybrid | Snapshots every N hours plus an event log between snapshots | Medium | Fast to snapshot, then replay events | Medium complexity |
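The hybrid approach can be sketched as follows: restore the nearest snapshot at or before the target version, then replay only the events after it. The event shapes here are illustrative:

```python
import copy

# Sketch of hybrid rollback: periodic snapshots plus an event log.
# Recovery = nearest snapshot + replay of subsequent events.

snapshots = {0: {"vlans": {}}}  # version -> full copy of the dataset
events = [                      # (version, delta) recorded after snapshot 0
    (1, ("add", 100, "Customer VLAN")),
    (2, ("add", 200, "Voice VLAN")),
    (3, ("rename", 100, "Production Customer VLAN")),
]

def restore(target_version):
    base = max(v for v in snapshots if v <= target_version)
    state = copy.deepcopy(snapshots[base])
    for version, (op, vlan, name) in events:
        if base < version <= target_version:
            state["vlans"][vlan] = name  # both "add" and "rename" set the name
    return state

before_rename = restore(2)  # roll back to just before event 3
latest = restore(3)         # current state, fully replayed
```

Taking a snapshot every N hours bounds the number of events that ever need replaying, which bounds recovery time.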

A rollback may be triggered in two ways:

  • Manual: Operator recognizes the problem and initiates a rollback
  • Automated:
    • Observability detects increased error rates after deployment
    • Validation post-deployment fails (“500 devices unreachable after change”)
    • Business impact threshold exceeded (“customer SLA violations > 10”)

The combination of branching (parallel work without conflicts) and rollback (rapid recovery from mistakes) provides the temporal flexibility that large-scale network automation demands.

4.2.5.3. Atomic Transactions and Consistency Guarantees#

Transactional guarantees ensure data consistency when multiple related changes must succeed or fail together:

  • Atomicity: Either everything succeeds or everything rolls back
  • Consistency: Data constraints upheld before and after
  • Isolation: Concurrent transactions don’t interfere
  • Durability: Once committed, persists even if system crashes

Implementing this requires careful design:

  • Database transactions (if all data in single DB)
  • Distributed transactions (if data spans systems)
  • Saga pattern (if no native distributed transaction support)

In Chapter 11, we will cover more about how these requirements impact reliable systems.

4.2.5.4. Change Approval Workflows#

Once data changes are proposed, they may require human approval workflows before becoming active. Data typically flows through several stages before being considered final, similar to continuous integration pipelines.

graph LR
    A["📝 Operator Proposes Change<br/>Add 50 new branch devices"] --> C["📦 STAGING<br/>No network effect yet<br/>Preview, test, dry-run"]
    C --> D["✓ Automated Checks Validation<br/>Schema & constraints?"]
    C --> E["✓ Simulation<br/>Routing/MPLS impact?"]
    C --> F["✓ Compliance<br/>Security policies?"]
    D --> G{All Checks<br/>Pass?}
    E --> G
    F --> G
    G -->|Pass| H["👥 Human Review<br/>Change Control"]
    G -->|Fail| I["❌ Rejected<br/>Notify proposer"]
    H --> J{Approved?}
    J -->|Yes| K["✅ APPROVED<br/>Ready for deployment"]
    J -->|No| I

An example data change workflow might look like:

  1. Operator proposes change: “Add 50 new branch devices”

  2. Change enters STAGING: (no effect on network yet, can be previewed, tested, dry-run)

  3. Automated checks run (Continuous Integration):

    • Validation: Does it pass schema and constraints?
    • Simulation: Does it break routing, MPLS, etc.?
    • Compliance: Does it violate security policies?
    • Report: ✓ All passed
  4. Human review:

    • Change control committee gets notification
    • Reviews diff, rationale, blast radius
    • Approves or requests changes
  5. Approval decision, change to APPROVED state.

  6. Deployment triggering: Once approved, it triggers the workflow component to eventually deploy the related changes to the network (a common trigger for automated execution workflows)

However, not all changes require human approval, and this is where most organizations get it wrong. Change control boards are where automation goes to die. I’m not saying skip approvals; I’m saying automate them (i.e., change approval at scale).

If your CAB meets weekly to approve adding VLANs, you’ve already lost. Build guardrails that auto-approve 95% of changes, save human review for the actual risky stuff. For example, when provisioning an AWS VPC, operators don’t wait for humans to approve the network change: they’re following proven templates within defined boundaries.

The rule is: changes that fit within clear guardrails proceed automatically; only the guardrail logic itself requires human review and approval. Define minimum guardrail coverage and maximum impact tolerance. Then get out of the way and let the system work.

Otherwise, your “automation” is just a prettier way to wait for meetings.
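A guardrail policy can be as simple as a function evaluated against every proposed change. The change types, thresholds, and field names below are hypothetical; only the guardrail logic itself would go through human review:

```python
# Hypothetical guardrails: low-risk change types within defined boundaries
# auto-approve; everything else routes to human review.
AUTO_APPROVED_TYPES = {"vlan_add", "ntp_update", "dns_update"}
MAX_DEVICES_AFFECTED = 10

def review_decision(change: dict) -> str:
    """Return 'auto_approved' or 'human_review' for a proposed change."""
    if change["type"] not in AUTO_APPROVED_TYPES:
        return "human_review"  # not covered by any guardrail
    if change.get("devices_affected", 0) > MAX_DEVICES_AFFECTED:
        return "human_review"  # exceeds impact tolerance
    if change.get("touches_production_core", False):
        return "human_review"  # blast radius too risky to automate
    return "auto_approved"

print(review_decision({"type": "vlan_add", "devices_affected": 3}))   # auto_approved
print(review_decision({"type": "bgp_policy", "devices_affected": 1})) # human_review
```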

4.2.6. Aggregation#

Your network data doesn’t live in a vacuum. HR systems track who works where. Asset management knows device details and warranty dates. IPAM systems own IP address allocations. Circuit providers have connectivity info. Services teams have SLA details.

Don’t build another data silo. Your organization already has too many. If your source of truth can’t pull from ServiceNow, Infoblox, and your CMDB, you’re just building spreadsheet 2.0 with an API. Nobody will maintain it. You’ll spend your life begging people to “please update the SoT” while they ignore you and keep using their existing tools.

Aggregation is not optional: pull in data from existing systems, keep it synced both ways, and present a unified view. The source of truth should be a federation layer, not a replacement database.

4.2.6.1. You’re Not Starting From Zero#

You almost certainly have data scattered across multiple systems. Each system is authoritative for its domain: HR owns organization, asset systems own hardware details, IPAM owns IP allocations. The source of truth doesn’t replace these systems; it pulls their data together into one consistent interface so clients don’t have to query five different systems.

  • Organizational Context:

    • HR System: Departments, teams, roles, responsibility matrix
  • Infrastructure:

    • Asset Management: Hardware details, serial numbers, procurement, warranty
    • Cloud Platform: VPCs, subnets, security groups, instance details
    • Circuit Provider (external): Connectivity status, utilization, faults
    • IPAM System: IP allocations, DHCP scopes, DNS entries
    • Configuration Management: Templates, baselines
  • Services:

    • Service Catalog: Service definitions, SLAs, customers
    • Billing System: Chargebacks, capacity planning

Each system is authoritative within its domain (a domain here being a data type, or a clearly delimited subset of one). The Source of Truth doesn’t replace them; it orchestrates their data into a coherent whole behind a consistent interface, so client applications don’t have to correlate data from multiple sources themselves.

4.2.6.2. Conflict Resolution in Federated Systems#

When data comes from multiple sources, conflicts arise:

  • Centralized IPAM system says: 192.0.2.0/24 is allocated to customer X
  • The CMDB says: 192.0.2.0/24 is allocated to customer Y

Who is correct? It depends on authority designation through governance.

For example, a resource domain ownership strategy might specify:

  • Centralized IPAM, source of truth for:

    • IP address allocations
    • Subnet sizes
    • DHCP scopes
  • CMDB, source of truth for:

    • VLAN-to-subnet mappings
    • Interface IP assignments
    • Network interface operational status

Conflict resolution strategies include:

  • Source Authority (most common): A governance decision designates one system as authoritative (e.g., Asset System is authoritative for device credentials)

  • Timestamp-based: Use last-modified timestamp; most recent change wins. Risk: Doesn’t account for corrective changes vs. mistakes

  • Logical resolution: Evaluate context:

    • SoT value is currently in use on devices (current, proven working)
    • Asset System value is claimed to be the Desired State (should-be)
    • Option 1: Trust current (what’s working)
    • Option 2: Trust should-be (governance model)
    • Option 3: Detect: configuration on device doesn’t match either value, investigate further
  • Manual escalation: When confidence is medium (values differ but not contradictory), escalate to human review
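The source-authority strategy boils down to a governance lookup table. A minimal sketch, with invented attribute and system names, that falls back to escalation when no authority is defined:

```python
# Governance table: which system is allowed to win for each attribute.
AUTHORITY = {
    "ip_allocation": "ipam",
    "dhcp_scope": "ipam",
    "vlan_mapping": "cmdb",
    "interface_ip": "cmdb",
}

def resolve(attribute: str, values: dict) -> tuple:
    """Return (winning_value, source), or raise to force manual escalation."""
    owner = AUTHORITY.get(attribute)
    if owner is None or owner not in values:
        raise LookupError(f"no authoritative source for {attribute}; escalate to human review")
    return values[owner], owner

# The conflict from the example above: IPAM and CMDB disagree on a /24.
conflict = {"ipam": "customer-X", "cmdb": "customer-Y"}
print(resolve("ip_allocation", conflict))  # IPAM wins: ('customer-X', 'ipam')
```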

4.2.6.3. Aggregation Architecture#

There are two fundamental approaches to data aggregation:

| Approach | Architecture | How It Works | Sync Patterns | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Centralized Aggregation (Pull Model) | Central SoT fetches from external systems | • Aggregator polls/subscribes to sources<br/>• Transforms to unified schema<br/>• Detects conflicts<br/>• Enriches local data<br/>• Serves unified view | • Bidirectional (SoT ↔ IPAM)<br/>• Unidirectional (SoT ← HR) | ✓ Centralized control<br/>✓ Aggressive validation/enrichment<br/>✓ Single consumer query point | ✗ Network latency to sources<br/>✗ Depends on external availability<br/>✗ Eventual consistency only<br/>✗ Scale challenges |
| Distributed Federation (Push Model) | Each system maintains own data; SoT coordinates | • Event-driven (message buses)<br/>• Webhooks for real-time notifications<br/>• Cache layers for reference data<br/>• Scheduled refresh for slow-changing data | • Event-driven<br/>• Webhook-based<br/>• Cached reference data | ✓ Domain-owned data quality<br/>✓ No duplication<br/>✓ Scales without bottleneck<br/>✓ Strong per-domain consistency | ✗ Consumers query multiple systems<br/>✗ Cross-system joins expensive<br/>✗ Complex message coordination |

4.2.6.4. Synchronization Mechanisms#

Keeping data consistent across systems is the core challenge of aggregation. The choice of architecture determines which synchronization approach to use:

| Mechanism | How It Works | Example Flow | Advantages | Disadvantages | Typical Latency |
|---|---|---|---|---|---|
| Periodic Sync (schedule-based) | Data pulled on a schedule | Every 6 hours:<br/>1. SoT reads from IPAM<br/>2. Compares to local cache<br/>3. Applies changes to views<br/>4. Publishes SoT changes back | ✓ Simple<br/>✓ Predictable<br/>✓ Low overhead | ✗ Hours-long inconsistency windows<br/>✗ Batch conflicts | Minutes to hours |
| Event-Driven (message bus) | React to changes in real time | IPAM changes subnet 10.0.0.0/24<br/>→ Publishes “Subnet:Changed” to bus<br/>→ SoT consumes message<br/>→ Updates local cache | ✓ Real-time<br/>✓ Responsive | ✗ Complex choreography<br/>✗ Harder to debug | Seconds |
| Streaming/Webhooks (direct push) | Source pushes to registered webhook | SoT registers webhook<br/>→ IPAM allocates 10.0.2.5/32<br/>→ HTTP POST to webhook<br/>→ SoT validates, updates | ✓ Direct communication<br/>✓ No message bus needed | ✗ Requires webhook support<br/>✗ Network reliability issues | Sub-second to seconds |

No synchronization mechanism is truly real-time. Data becomes eventually consistent, and how long that takes, and whether the delay is acceptable, has to be evaluated for each use case.
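The webhook row in the table above can be sketched as a payload handler that validates the pushed event and updates a local cache. This is framework-free for brevity, and the payload shape is an assumption, not any specific IPAM’s format:

```python
import json
import time

# Illustrative local cache keyed by prefix; in a real SoT this would be the
# database layer behind an HTTP endpoint.
cache: dict = {}

def handle_ipam_webhook(raw_body: str) -> bool:
    """Validate a pushed change and apply it to the local cache."""
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        return False  # reject malformed payloads
    if event.get("event") != "ip_allocated" or "address" not in event:
        return False  # unknown event type or missing required fields
    cache[event["address"]] = {
        "device": event.get("device"),
        "synced_at": time.time(),  # recorded for later staleness checks
    }
    return True

ok = handle_ipam_webhook('{"event": "ip_allocated", "address": "10.0.2.5/32", "device": "edge-01"}')
print(ok, "10.0.2.5/32" in cache)  # True True
```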

4.2.6.5. Data Governance and Authority Frameworks#

Effective federation requires clear governance with explicit definitions per data domain. For example:

| Domain | Authoritative Source | Sync Direction | Frequency | Conflict Resolution |
|---|---|---|---|---|
| Device Inventory | Asset Management System | Asset → SoT (read-only in SoT) | Daily | Asset always wins |
| Network Topology | Central SoT | SoT → reporting systems | Real-time | N/A (SoT is source) |
| IP Allocation | IPAM System | Bidirectional | Hourly inbound, real-time outbound | IPAM wins for free pool; SoT wins for static assignments |

For each data element, authority metadata should track:

  • Source system: Where did it come from?
  • Last sync time: When was it verified?
  • Update method: Poll, push, or webhook?
  • Authority level: How trustworthy is this data?

With this metadata, the system can make informed governance decisions about data handling.
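One way to carry these four questions alongside every data element is a small metadata record. The field names and the staleness check below are illustrative:

```python
from dataclasses import dataclass
import time

@dataclass
class AuthorityMetadata:
    """Illustrative per-element authority metadata."""
    source_system: str    # where did it come from?
    last_sync: float      # when was it verified? (epoch seconds)
    update_method: str    # "poll", "push", or "webhook"
    authority_level: int  # e.g., 1 = advisory … 3 = authoritative

    def is_stale(self, max_age_seconds: float) -> bool:
        return time.time() - self.last_sync > max_age_seconds

# An IP record last verified two hours ago via webhook from IPAM.
meta = AuthorityMetadata("infoblox", time.time() - 7200, "webhook", 3)
print(meta.is_stale(max_age_seconds=3600))  # True: older than the 1-hour budget
```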

And when working with multiple systems, be prepared for failures. External systems will inevitably fail. When a federated source is unavailable (e.g., IPAM down for 30 minutes), several strategies exist (pick your poison):

  1. Block operations: “Can’t allocate IP addresses without IPAM”
  2. Use cached data + mark as stale: “Using cached IPAM data from 2 hours ago; this may be inconsistent”
  3. Graceful degradation: “Can still provision devices, skip IP allocation, and queue the IP allocation for when IPAM recovers”
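Strategy 3 (graceful degradation) can be sketched as a deferred-work queue that is drained when the external system recovers. All names here are invented, and the allocator is a stand-in for a real IPAM call:

```python
from collections import deque

pending_allocations: deque = deque()
ipam_available = False  # simulate an IPAM outage

_next_host = 10  # stand-in for real IPAM allocator state

def allocate_from_ipam(device: str) -> str:
    global _next_host
    _next_host += 1
    return f"192.0.2.{_next_host}/32"

def request_ip(device: str):
    """Defer the IPAM-dependent step instead of failing the whole workflow."""
    if not ipam_available:
        pending_allocations.append(device)
        return None
    return allocate_from_ipam(device)

def on_ipam_recovered() -> list:
    """Drain the deferred queue once IPAM is reachable again."""
    results = []
    while pending_allocations:
        results.append(allocate_from_ipam(pending_allocations.popleft()))
    return results

deferred = request_ip("branch-sw-01")  # None: queued during the outage
ipam_available = True
print(on_ipam_recovered())             # ['192.0.2.11/32']
```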

4.2.7. Brownfield Onboarding#

When you deploy a source of truth in an existing network, you face a tricky problem: how do you populate it with current state without just embedding all your legacy cruft as permanent truth?

The wrong approach: Poll your devices and assume “whatever is running = what should be”. You’ll load all workarounds, hacks, and technical debt straight into the source of truth.

The right approach: Use the current network state as a starting point, not the truth. Snapshot it, load it in, then deliberately redesign it to be clean. Once you’ve got the design you actually want, stop syncing from the network. The source of truth becomes the boss; the network follows its lead. Here’s what that looks like in practice:

  1. Bootstrap: Take a snapshot of your current network (devices, IPs, VLANs, circuits) and load it into the source of truth as starting data.

  2. Redesign: Over weeks and months, have people review that data and decide what the intended state should actually be. You’ll probably want to consolidate redundant VLANs, clean up legacy stuff, standardize naming, and write the design templates for future deployments.

  3. Flip the direction: Once you’ve got the design you want, stop syncing from the network back to the source of truth. Now the source of truth is the boss. Execution (Chapter 5) pushes changes outward to the network.

  4. Detect drift: Observability (Chapter 6) watches for devices that drift from what the source of truth says they should be. When drift happens, operators decide: fix the device or fix the source of truth?

Tools such as Slurpit.io, along with extensions for NetBox and Nautobot, focus on this problem.

This approach requires initial effort but avoids the trap of treating deployed state as permanent truth. Your SoT evolves toward clean design intent, even if the network takes time to fully align. You can also keep running discovery in dry-run mode to double-check that the network has not drifted.

4.2.8. Solutions Landscape#

There are a lot of source of truth products out there. Some are open source, some are commercial. Some are general purpose, some are specialized for specific domains like IP management. This is a snapshot as of early 2026: the landscape changes constantly, so always check current capabilities and momentum before choosing.

4.2.8.1. Open Source Solutions#

| Solution | Data Domains Covered | Key Technical Characteristics |
|---|---|---|
| NetBox | IPAM, DCIM, Circuits, Devices, Virtualization, Inventory | • Extensive plugin ecosystem (150+ plugins)<br/>• Mature (10+ years) with large community<br/>• Frequent breaking changes between versions |
| Nautobot | IPAM, DCIM, Circuits, Devices, Inventory, Extensible Apps | • Fork of NetBox with enhanced extensibility<br/>• Job framework for automation workflows<br/>• Git data source integration<br/>• Stable API with professional support |
| Infrahub | Network topology, Devices, IPAM, Schemas, Relationships | • Graph database for complex relationship modeling<br/>• Git-like branching built into core architecture<br/>• Schema-driven with dynamic data models<br/>• Proposed-state workflows for change review |

4.2.8.2. Commercial Solutions#

| Solution | Data Domains Covered | Key Technical Characteristics |
|---|---|---|
| ServiceNow CMDB | Configuration items, Services, Assets, Changes | • Enterprise ITSM integration<br/>• ITIL-aligned workflows<br/>• Multi-vendor federation capabilities<br/>• AI/ML-driven insights and recommendations |
| Device42 | Data Center assets, Dependencies, Application mapping, IPAM | • Comprehensive auto-discovery for rapid onboarding<br/>• Application dependency mapping<br/>• Agentless discovery across multiple platforms |

All the open-source solutions listed before also have a commercial offering.

4.2.8.3. Specialized Platforms#

| Solution | Data Domains Covered | Key Technical Characteristics |
|---|---|---|
| Infoblox | DNS, DHCP, IPAM (DDI) | • Enterprise-grade DDI authority<br/>• DNS security and threat intelligence<br/>• Multi-site replication |
| SolarWinds IPAM | IP address management, DHCP, DNS | • Native integration with SolarWinds monitoring ecosystem<br/>• Automated IP conflict detection<br/>• Active Directory integration |
| phpIPAM | IP address management, Subnets, VLANs | • Lightweight and cost-effective<br/>• Simple deployment (LAMP stack)<br/>• Straightforward IPAM without DCIM complexity |

4.2.8.4. Selection Considerations#

When evaluating Source of Truth solutions, consider these alignment factors:

  • Scale requirements: Number of devices (hundreds vs. hundreds of thousands), rate of change, query volume
  • Data model needs: Relational structure (NetBox, Nautobot) vs. graph relationships (Infrahub) vs. flexible document store
  • Integration ecosystem: Existing tools (monitoring, ITSM, cloud platforms) that must federate data
  • Team expertise: Python/Django familiarity, preference for low-code platforms, comfort with Git workflows
  • Operational model: Self-hosted vs. SaaS, change approval processes, RBAC requirements
  • Evolution trajectory: Brownfield migration path, schema extensibility, API stability

4.3. Implementation Example#

4.3.1. Use Case: A Federated Network SSoT#

Scenario: A mid-sized enterprise with distributed infrastructure needs a unified source of truth that consolidates network data from multiple authoritative systems. Today they operate with:

  • Infoblox: Authoritative IPAM system managing address allocation and DNS
  • ServiceNow: Asset inventory and CMDB tracking server locations, hardware models, warranty data
  • Network devices: Individual documentation, spreadsheets, tribal knowledge about what’s deployed where

Challenge: Network engineers manually cross-reference Infoblox, ServiceNow, and device Secure Shell (SSH) sessions to understand the network. Adding a new device requires updates in 3-4 systems. When some system is down for maintenance, decisions stall. Configuration Drift occurs because there’s no single agreement on what should be deployed.

Solution: Deploy an aggregation hub (e.g., Nautobot) as a federated Network Source of Truth that integrates with both Infoblox and ServiceNow and also incorporates the actual configuration state from the brownfield network, offering operators a consolidated view while respecting each system’s authority.

This example is illustrative, not prescriptive. There are many valid SoT products and architectures that could solve this scenario. The point is demonstrating how the six building blocks (Modeling, Consumption, Enforcement, Versioning, Aggregation, and Design-Driven) work together in a real scenario. Your organization’s SoT will likely look different based on existing tools, team expertise, and constraints.

4.3.2. Solution Architecture#

graph TB
    IPAM["Infoblox<br/>Authoritative IPAM<br/>IP ranges, DNS"]
    CMDB["ServiceNow<br/>Authoritative CMDB<br/>Assets, locations, metadata"]
    DEVICES["🖧 Network Devices<br/>Routers, Switches<br/>Existing configurations"]
    
    SLURPIT["Slurpit<br/>Brownfield Discovery<br/>Config extraction"]
    
    subgraph NAUTO ["🔗 Nautobot (Aggregation Hub)"]
        direction TB
        AGG["Aggregation Layer<br/>Authority rules<br/>Conflict resolution"]
        SOT["Consolidated SoT and data expansion<br/>Devices, IPs, sites<br/>relationships"]
        API["Consumption APIs<br/>REST, GraphQL<br/>Webhooks"]
    end
    
    EXEC["🚀 Execution System<br/>Config generation<br/>Device deployment"]
    OBS["👁️ Observability<br/>State validation<br/>Drift detection"]
    UI["🖥️ Presentation<br/>Web UI, CLI<br/>Dashboards"]
    
    IPAM -->|Webhooks/Polling<br/>IP data| AGG
    CMDB -->|Webhooks/Polling<br/>Asset data| AGG
    DEVICES -->|SSH/NETCONF<br/>Device state| SLURPIT
    SLURPIT -->|Transformed data<br/>Initial bootstrap| AGG
    
    AGG --> SOT
    SOT --> API
    
    API -->|Intent queries| EXEC
    API -->|Config templates| EXEC
    API -->|Data access| UI
    
    OBS -->|State Drift| API
    API -->|Inventory| OBS
    
    EXEC -.->|Deploy configs| DEVICES
    OBS -.->|Monitor state| DEVICES

The key insight: Nautobot is not replacing Infoblox or ServiceNow. It’s aggregating their data, resolving conflicts, and serving as the single API that downstream systems (Execution, Observability, Presentation) consume. This separation of concerns allows each system to be best-of-breed.

How the 6 Components Work Together#

Modeling: Each data source has its own data model, with enough overlap to connect records via shared identifiers. Responsibility is distributed: each source specializes in different data. The structure must allow partial data until all information is consolidated (e.g., a device can exist in Nautobot before it has an IP assignment).

Consumption: Nautobot offers a REST API and GraphQL interface so other systems can query data consistently from a central point, rather than having to integrate with every data source individually.

Enforcement: Nautobot enforces global validation for consistency. If IPAM says 10.1.1.5 is allocated to device-dal-01, but the prefix belongs to a region where that device doesn’t sit, it must be flagged. It also prevents orphaned data: an IP from Infoblox can’t be assigned if the device doesn’t exist in ServiceNow (referential integrity). Finally, soft validation warns without blocking: “Device was last updated in ServiceNow 30 days ago; may be stale”, giving operators a chance to clean the data.

Versioning: All changes are tracked as changelog records. When an object changes and IPAM auto-reallocates an IP, Nautobot records “IPAM webhook: IP 10.1.1.5 assigned to device-dal-02, interface eth1.1 on 2024-02-08T14:32:00Z”, which allows tracing why a device has its current IP back through history. However, rollback capability is limited because there is no mechanism to roll back to a previous moment in time (disclaimer: the Nautobot VCS app allows more sophisticated capabilities, but it is not covered here for simplicity).

Aggregation: This is a key aspect of this solution, as there are multiple data sources to interconnect. Nautobot prioritizes Infoblox for IP data (it’s the authoritative IPAM). For asset metadata (warranty, cost center), ServiceNow is authoritative, and Nautobot uses this authority to resolve discrepancies. The sync strategy could look like:

  • ServiceNow → Nautobot: Periodic sync every 4 hours (can tolerate slight delay in asset metadata)
  • Infoblox → Nautobot: Real-time webhooks for IP changes (IPAM changes are urgent, can’t wait for polling)
  • Network devices → Nautobot: Using the Nautobot Onboarding App, data from the network is onboarded into Nautobot data models (eventual consistency acceptable)

Moreover, if there are failures, Nautobot can offer resiliency mechanisms such as:

  • If Infoblox is down for 2 hours, Nautobot continues operating using cached IP data, marked as “stale”
  • Operators see a warning: “IP data from 2 hours ago; IPAM currently unavailable; new allocations deferred”
  • Once Infoblox recovers, pending allocations are processed atomically

Design-Driven: Using the Nautobot Design Builder App, Nautobot offers a high-level interface for onboarding a new site. The operator provides high-level intent: { "type": "branch", "location": "denver", "employees": 30, "circuits": 2 }. A design template then expands it: create a site record in Nautobot, request an IP range from the Infoblox integration, auto-generate interface configs, and query ServiceNow for location metadata, resulting in comprehensive technical data ready for deployment without manual object creation.
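The expansion step can be illustrated with a toy function. This is not the Design Builder API, just the idea behind it: a few intent fields turn into concrete, deployable records. The sizing rule, supernet, and naming scheme are invented:

```python
import ipaddress

def expand_branch_design(intent: dict) -> dict:
    """Expand { type, location, employees, circuits } into concrete records."""
    # size the user subnet from headcount (2 addresses/person, /26 minimum)
    hosts_needed = max(intent["employees"] * 2, 50)
    prefix_len = 32 - (hosts_needed - 1).bit_length()
    site = f"{intent['location']}-{intent['type']}"
    # carve the first free subnet from an assumed branch supernet
    subnet = next(ipaddress.ip_network("10.50.0.0/16").subnets(new_prefix=prefix_len))
    return {
        "site": site,
        "user_subnet": str(subnet),
        "devices": [f"{site}-rtr-{i + 1}" for i in range(intent["circuits"])],
        "circuits": intent["circuits"],
    }

design = expand_branch_design(
    {"type": "branch", "location": "denver", "employees": 30, "circuits": 2}
)
print(design["site"], design["user_subnet"], design["devices"])
# denver-branch 10.50.0.0/26 ['denver-branch-rtr-1', 'denver-branch-rtr-2']
```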

4.3.3. Implementation Flow#

Initial data load

  1. Infoblox import: Nautobot connects to Infoblox API → pulls all IP ranges, reservations, DNS records
  2. ServiceNow import: Nautobot connects to ServiceNow CMDB → pulls all IT assets, locations, relationships
  3. Brownfield network discovery with Slurpit: Slurpit.io connects to existing network devices to extract current configuration state:
    • Device inventory (models, serial numbers, software versions)
    • Interface configurations and operational state
    • IP addressing and VLAN assignments
    • Routing protocol configurations
    • Transforms device configs into Nautobot-compatible data models
  4. Divergence detection and resolution: Nautobot audit process correlates data from all three sources (Infoblox, ServiceNow, Slurpit):
    • Conflict example: Device interface shows 10.1.1.5/24, but Infoblox shows this IP allocated to different device
    • Resolution workflow: Nautobot flags 47 conflicts for human review
      • Network engineer evaluates each: “Trust device state” or “Trust IPAM; device needs correction”
      • Governance rules applied: “Trust device for access ports, trust IPAM for infrastructure IPs”
    • Batch resolution: Similar conflicts resolved with consistent policy
  5. Unified schema population: Nautobot merges all three sources: devices[*].{ name, location_id, ipv4_address, serial_number, cost_center } with validated, conflict-resolved data
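Step 5 can be sketched as a merge keyed by device name, with update order encoding authority (last writer wins per source). Real systems resolve authority per attribute rather than per source; the sample records are invented:

```python
# Per-source records for one device, keyed by device name.
servicenow = {"dal-rtr-01": {"location_id": "DAL1", "serial_number": "FX123", "cost_center": "NET-OPS"}}
infoblox   = {"dal-rtr-01": {"ipv4_address": "10.1.1.5/24"}}
slurpit    = {"dal-rtr-01": {"ipv4_address": "10.1.1.9/24", "os_version": "17.3"}}

def merge_device(name: str) -> dict:
    """Merge sources in ascending authority order (later update wins)."""
    record = {"name": name}
    record.update(slurpit.get(name, {}))     # discovered state: lowest authority
    record.update(servicenow.get(name, {}))  # CMDB metadata overrides discovery
    record.update(infoblox.get(name, {}))    # IPAM wins for IP fields
    return record

merged = merge_device("dal-rtr-01")
print(merged["ipv4_address"], merged["cost_center"])  # 10.1.1.5/24 NET-OPS
```

Note that the device-reported 10.1.1.9/24 loses to the IPAM allocation here; in the workflow above, that discrepancy would also be flagged as a conflict for review.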

Live operations

  1. Add a new device: Operator creates an entry in ServiceNow (new asset). Nautobot catches the webhook → creates the device object, queries Infoblox for the next available IP in the site subnet, and assigns it automatically
  2. IP reservation needed: Network engineer uses Nautobot UI to reserve an IP → Nautobot calls Infoblox REST API → IPAM confirms allocation → Returns to Nautobot, displayed to engineer
  3. Infoblox maintenance window: IPAM goes offline 2 hours. Nautobot shows cached data “IP data valid as of 14:30 UTC; refresh pending”. Operators can still query devices, see last-known IPs, but allocation is deferred. When Infoblox returns, pending allocations queue up and process.

Handling inconsistency and periodic drift detection

After initial onboarding, Nautobot continuously monitors for divergence across all three data sources:

  1. Ongoing sync: Besides event-driven updates triggered when changes happen, periodic sync runs every 4-6 hours:

    • Infoblox sync: Webhook-driven for IP changes + periodic reconciliation
    • ServiceNow sync: Periodic polling for asset metadata updates
    • Slurpit discovery: Periodic device polling to capture actual network state
  2. Nautobot audit and correlation: Nautobot’s audit process compares data from all sources to detect inconsistencies:

    • Data source conflicts: Device interface IP differs from Infoblox allocation
    • Configuration Drift: Device state differs from Nautobot intent (NTP server changed, VLAN added to trunk)
    • Stale metadata: ServiceNow asset last updated 90 days ago (potential staleness)
  3. Divergence classification and remediation:

    • Type 1 - Configuration Drift: Device differs from Nautobot intent → Trigger Execution to correct device
      • Example: NTP server changed on device → Execution regenerates config and pushes correction
    • Type 2 - Intent staleness: Intentional change not yet reflected in Nautobot → Alert operator to update SoT
      • Example: Operator manually added VLAN during incident → Update Nautobot to reflect current intent
    • Type 3 - External authority mismatch: Conflict between authoritative sources (Infoblox vs device reality)
      • Example: IP allocation mismatch → Human decision required based on governance rules
  4. Automated vs. manual remediation:

    • Auto-remediation: Pre-approved changes (NTP, DNS, syslog servers) corrected automatically via Execution
    • Manual approval: Critical changes (BGP config, routing protocols, security policies) require human review before correction
    • Escalation: Unresolvable conflicts or repeated drift patterns escalated to network team
  5. Audit trail: All detected divergences, resolutions, and remediation actions logged for compliance and troubleshooting
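The classification and remediation logic above can be sketched as a small routing function. The divergence kinds, feature names, and remediation targets are illustrative:

```python
# Features pre-approved for automatic correction (Type 1 drift only).
AUTO_REMEDIATE = {"ntp", "dns", "syslog"}

def classify(divergence: dict) -> str:
    """Route a detected divergence to its remediation path."""
    if divergence["kind"] == "external_authority_mismatch":
        return "escalate_to_human"              # Type 3: governance decision needed
    if divergence["kind"] == "intent_stale":
        return "alert_operator_update_sot"      # Type 2: SoT must catch up to reality
    # Type 1: device drifted from Nautobot intent
    if divergence.get("feature") in AUTO_REMEDIATE:
        return "trigger_execution_auto_fix"
    return "queue_for_manual_approval"          # critical features need review first

print(classify({"kind": "config_drift", "feature": "ntp"}))  # trigger_execution_auto_fix
print(classify({"kind": "config_drift", "feature": "bgp"}))  # queue_for_manual_approval
```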

4.3.4. Approach Trade-Offs#

Advantages of This Approach#

| Advantage | Description | Benefits |
|---|---|---|
| Consolidation without replacement | Infoblox and ServiceNow stay as authoritative sources | Nautobot orchestrates rather than replaces existing investments |
| Multi-source truth for most use cases | 5-30 second sync delay acceptable for most operations | Device provisioning (4-hour sync), IP allocation (30-sec delay), config generation (nightly) all work well |
| Respects domain expertise | Each team owns their domain | Infoblox team: IP strategy; ServiceNow team: asset lifecycle; Network team: design/deployment |
| Rich data model | Models relationships across systems | Enables cross-domain queries: “Devices in high-security locations with expired warranties?” |
| Operational resilience | Cached data available during outages | If Infoblox down → cached IP data; if ServiceNow down → last-known metadata |
| Audit and compliance | All changes tracked with full lineage | Regulatory queries: “Who approved IP change from 10.1.1.5 to 10.1.2.5, when, why?” |

Limitations and Constraints#

| Limitation | Impact | Mitigation Strategy |
|---|---|---|
| Sync delays | 5-30 sec (webhooks) to 4 hours (polling) latency | For critical allocations, bypass Nautobot and call Infoblox directly; sync asynchronously |
| Conflict complexity | Overlapping attributes need explicit resolution logic | Use an authority matrix to make conflicts explicit (e.g., ServiceNow owns MAC address) |
| Operational overhead | Each webhook/API/sync job is a potential failure point | Monitor integration health (webhook failures, timeouts, lag); maintain runbooks per failure mode |
| Data quality dependency | Garbage in, garbage out from source systems | Soft validation (warnings, not blocks); anomaly detection for suspicious data |
| Stale data window | During outages, operators work with hours-old cached data | Document acceptable staleness windows; train operators on “use cached if delay > 2h” criteria |
| Integration maintenance | API version upgrades require Nautobot updates | Use an abstraction layer (integration adapters); quarterly integration testing |

Other approaches and solutions may fit your use case better. Make your own decisions based on your needs and requirements; there are always trade-offs to account for.

4.4. Summary#

The source of truth is the foundation of network automation. It’s the single place where you say “this is what the network should be”. This chapter covered six building blocks that make it work:

  1. Modeling: How you organize and store network data
  2. Design-Driven: Turning high-level intent (“add a branch”) into actual technical configs
  3. Consumption: How people and systems get access to that data
  4. Enforcement: Making sure the data is actually valid
  5. Versioning: Keeping track of changes so you can see what happened and roll back if necessary
  6. Aggregation: Pulling data from other systems so you don’t duplicate it

We also looked at practical stuff: how to populate a source of truth with existing network data without embedding all your legacy hacks, and what actual products exist to do this.

The key point is this: a source of truth isn’t one magic tool. It’s a collection of practices, systems, and processes working together so you actually know what your network is supposed to be.

Next, we explore the Execution component, which converts this intent from documentation into actual network changes.

References and Further Reading#

Data Modeling and Schema Standards

APIs and Data Consumption

Data Quality and Validation

  • Data Quality Fundamentals (DAMA DMBOK): Enterprise data quality frameworks
  • JSON Schema: Declarative schema validation standard
  • YANG Constraints (RFC 6020, section 8.2.4): Network-specific validation patterns

Versioning, Audit, and Change Management

  • Pro Git (Scott Chacon & Ben Straub): Version control concepts and design patterns
  • The Phoenix Project (Gene Kim, Kevin Behr, George Spafford): Change management and operational discipline
  • Semantic Versioning: Versioning strategy for APIs and data models

Data Integration and Aggregation

  • Enterprise Integration Patterns (Gregor Hohpe & Bobby Woolf): Data integration architecture patterns
  • Master Data Management (David Loshin): Federated governance frameworks

Network Programmability and Automation

Distributed Systems and Scalability

  • Designing Data-Intensive Applications (Martin Kleppmann): Scalability, consistency, and API design patterns
  • Distributed Systems (Andrew S. Tanenbaum & Maarten van Steen): Fundamental concepts for federated systems
