6. Observability#
Traditional network monitoring watches for broken things: is the interface up? Is the CPU over 90%? Is this service responding? That’s useful, but it’s reactive: you’re watching for failure.
Observability is different. It’s about understanding why things break or are about to break. It’s not just “that device is down” but “why is it down, what breaks because it’s down, what’s the business impact.” With network automation, observability becomes the feedback loop: you observe something, your system detects it, decides how to respond, takes action, and verifies the fix worked. That’s closed-loop automation.
This chapter covers everything you need to see what’s happening in your network: what data to collect, how to collect it, how to store it, how to alert on it, and how to show it to people in a way that actually helps them make decisions.
We cover two building blocks here: Collector and Observability, because they’re tightly connected.
Whether you use a traditional all-in-one platform (SolarWinds, LibreNMS), a cloud service (Datadog, New Relic), or build your own stack from open-source pieces (Prometheus, Grafana, etc.), the underlying architecture is the same. Understanding those patterns helps you choose the right approach for your scale and your team.
This section is heavily influenced by the book Modern Network Observability (Packt), which I coauthored with David Flores and Josh VanDeraa. If you want to go hands-on and learn an implementation with the Telegraf-Prometheus-Grafana (TPG) stack and other tools, I definitely recommend it.
6.1. Fundamentals#
Before going into the lower-level details, this section establishes the foundations of network observability within the network automation strategy, defining its goals, supporting pillars, and scope.
6.1.1. Context#
You probably already monitor your network. You use Simple Network Management Protocol (SNMP), you look at System Logging Protocol (Syslog), you have dashboards showing link utilization. That’s monitoring: it tells you if things are working.
But automation needs more. When your network changes (because automation changed it), you need to know immediately if it breaks something. When you get an alert, you need context: which customers does this affect? Which services failed? What’s the blast radius? Traditional monitoring gives you alarms. Automation needs intelligence.
Here’s the key difference:
- Monitoring: Is the interface up? Is the CPU high? Simple yes/no questions.
- Observability: Why did the interface go down? What’s it impacting? How do we fix it automatically? What happened historically that led to this?
Observability feeds automation. Your system observes the network, detects problems, decides what to do, takes action, and verifies the fix worked. That cycle repeating is called “closed-loop automation.”
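As a toy illustration of that cycle, here is the detect → act → verify loop for a single metric. The `remediate` hook is a hypothetical placeholder, not any real API; it stands in for whatever corrective action your automation takes.

```python
# Hypothetical sketch of one step of a closed-loop automation cycle.
# "remediate" is an illustrative placeholder for a real remediation action.

def closed_loop_step(metric_value, threshold, remediate):
    """Observe -> detect -> act -> verify, for a single metric."""
    if metric_value <= threshold:
        return "healthy"                 # nothing to do
    new_value = remediate()              # take corrective action
    # Verify: did the fix bring the metric back under the threshold?
    return "remediated" if new_value <= threshold else "escalate_to_human"

# Example: CPU at 95% with a remediation that drops it to 40%.
print(closed_loop_step(95, 90, remediate=lambda: 40))   # remediated
print(closed_loop_step(95, 90, remediate=lambda: 93))   # escalate_to_human
```

Note the final branch: only when the automated fix fails to verify does a human get involved, which is exactly the division of labor closed-loop automation aims for.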
Traditional monitoring tools (the big monolithic ones like SolarWinds) try to do everything in one product. You can make that work, but you’re often paying for features you don’t need and constrained by features you do. The alternative is building observability from pieces: pick the collector that works for your devices, the storage that scales with your data, the alerting that fits your automation workflows. This is harder to assemble but much more flexible.
This chapter walks through both approaches and the patterns that work regardless of which you choose.
One initial choice is about how the platform runs:
Monolithic (SolarWinds, LibreNMS): One product does everything. You install it, configure it, and go. Good if your network is straightforward and you don’t have DevOps experience. Bad if you want flexibility or your network is unusual: you’re stuck with their model.
Cloud SaaS (Datadog, New Relic, Kentik): They run everything for you. Fast to deploy, no infrastructure headaches, beautiful dashboards out of the box. But you’re paying monthly based on volume, your data lives on their servers (matters for some compliance regimes), and when you hit their limits, you’re stuck. I’ve seen teams spend $50K/month on observability SaaS wondering why their CFO is unhappy.
Build-it-yourself (Prometheus + Grafana, or Telegraf-Prometheus-Grafana (TPG) stack): Total flexibility, no vendor lock-in, better economics at scale. But you’re now running databases, message queues, and collector infrastructure. If you don’t have people who can operate this stuff, you’ll spend more time fixing observability than fixing your network.
The real question: Do you have the team to run it? If yes, build it. If no, buy it. Don’t fool yourself about which category you’re in.
After that choice, two more questions come up:
- Where does it run? On your premises (you control everything, but you run it), in the cloud (they run it, but your data leaves your network), or hybrid (some places local, some places cloud)?
- What’s the cost model? Per device? Per metric ingested? Flat subscription? These decisions add up fast when you’re collecting millions of data points per minute.
6.1.2. Goals#
Your observability system needs to do seven things:
Observe everything automatically. New device connects? It should start reporting data without someone manually registering it. New service comes online? It’s already being watched. This requires integration with your source of truth so observability knows what exists.
Handle heterogeneous environments with good data. Your network probably has Cisco, Arista, Juniper, cloud providers, Linux servers, containers. Each one has different ways to expose data. And forget 5-minute intervals: you need near-real-time data when automation is making changes.
Correlate data across layers. A server is slow. Is the network congested, or is it a database issue? You need data from network devices, servers, applications, all speaking the same language so you can draw lines between them.
Scale without melting. Networks grow. When you’re collecting a million metrics per second, traditional architectures crumble. You need systems designed for scale from day one.
Let people analyze the data intelligently. Give analysts access to query your data, not just pre-built dashboards, but powerful queries so they can answer their own questions. And they need both real-time data and history (trends, anomaly detection).
Detect problems and fix them automatically. Most of the time, your automation should respond to issues without waiting for a human. Only when automation can’t figure it out should someone get paged. And that alert better explain what’s wrong, not just show raw numbers.
Show people what they need to see. A dashboard is worthless if it shows too much or the wrong stuff. Give the network ops team their view, the business their view, the engineers their view. Each person gets what helps them do their job.
graph TD
%% --- Subgraphs ---
subgraph Goals
direction LR
A1[Observe all the network with minimal human effort]
A2[Support heterogeneous network environments with enough data and accuracy]
A3[Observe data from different IT layers with context]
A4[Handle massive-scale network scenarios]
A5[Offer access to observability data for sophisticated analysis in near real-time]
A6[Be proactive to detect network issues and reduce time to recover]
A7[Create tailored user-oriented visualizations]
end
%% --- Row gradient classes ---
classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;
classDef row7 fill:#66b2ff,stroke:#4a90e2,stroke-width:1px;
%% --- Apply classes per row ---
class A1 row1;
class A2 row2;
class A3 row3;
class A4 row4;
class A5 row5;
class A6 row6;
class A7 row7;
With these goals, the next step is understanding which requirements the solution has to offer.
6.1.3. Pillars#
Each goal needs specific building blocks to work. Here’s what you need:
Know what to observe. Your source of truth has all the devices, services, credentials. Observability should pull that data automatically so collectors know what to monitor. When you add a device to the SoT, monitoring comes online automatically.
Collect data efficiently. You need multiple collection methods: Simple Network Management Protocol (SNMP) for older equipment, gRPC Network Management Interface (gNMI) streaming for modern devices, System Logging Protocol (Syslog) for events, flows (NetFlow, IP Flow Information Export (IPFIX)) for traffic, maybe packet captures for deep troubleshooting. Different tools, different speeds, different data richness. The good news: you don’t have to pick just one.
Normalize everything. Your Arista metrics look different from Cisco metrics, which look different from cloud provider metrics. Your logging is unstructured text, flows are binary. You need a layer that translates all this into a common language and adds context (which device, which customer, which service).
Move data reliably at scale. Traditional monitoring pipelines are sequential: collector → processor → storage. At scale, this is a bottleneck. You need message buses and streaming platforms that decouple each stage so they can scale independently.
Store it smartly. Time-series data (metrics) needs databases optimized for that. Logs need something different. You need to query across millions of data points in milliseconds. Not all databases are equal here.
Turn data into actions. Raw metrics don’t trigger automation. You need rules: “if CPU > 90%, check if it’s expected maintenance, if not, take these steps.” And those rules feed into your automation orchestrator or alerting system.
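A minimal sketch of such a rule, with a hard-coded maintenance set standing in for a real change-management lookup (the device names and action labels are illustrative assumptions):

```python
# Illustrative "data to action" rule: suppress the alert during a known
# maintenance window, otherwise emit an event for the orchestrator.

MAINTENANCE = {"leaf01"}   # devices under planned maintenance (assumption)

def evaluate(device, cpu_pct, threshold=90):
    if cpu_pct <= threshold:
        return None                      # within normal range
    if device in MAINTENANCE:
        return None                      # expected: don't page anyone
    return {"device": device, "action": "investigate_high_cpu"}

print(evaluate("leaf02", 95))   # {'device': 'leaf02', 'action': 'investigate_high_cpu'}
print(evaluate("leaf01", 95))   # None (maintenance window)
```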
Show it visually. Data is useless if nobody looks at it. You need dashboards, but smart ones: different views for different people, able to drill down, able to show trends and comparisons.
graph LR
%% --- Subgraphs ---
subgraph Goals
direction TB
A1[Observe all the network with minimal human effort]
A2[Support heterogeneous network environments with enough data and accuracy]
A3[Observe data from different IT layers with context]
A4[Handle massive-scale network scenarios]
A5[Offer access to observability data for sophisticated analysis in near real-time]
A6[Be proactive to detect network issues and reduce time to recover]
A7[Create tailored user-oriented visualizations]
end
subgraph Pillars
direction TB
B1[Close integration with SoT to understand what needs to be monitored]
B2[Ability to collect data via different protocols supporting very frequent/on-demand updates]
B3[Normalization of heterogeneous data with contextual metadata for richer analysis]
B4[Scalable data distribution systems to support scale-out architectures]
B5[Persistence layer supporting time-series data and powerful query languages]
B6[Flexible rule definitions and routing scenarios with external system integration]
B7[Custom visualizations and integration of multiple data stores]
end
%% --- Row connections ---
A1 --> B1
A2 --> B2
A3 --> B3
A4 --> B4
A5 --> B5
A6 --> B6
A7 --> B7
%% --- Row gradient classes ---
classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;
classDef row7 fill:#66b2ff,stroke:#4a90e2,stroke-width:1px;
%% --- Apply classes per row ---
class A1,B1 row1;
class A2,B2 row2;
class A3,B3 row3;
class A4,B4 row4;
class A5,B5 row5;
class A6,B6 row6;
class A7,B7 row7;
Finally, before detailing the seven functionalities that realize these pillars, let’s clarify what falls within Observability’s scope.
6.1.4. Scope#
To extend the goals introduced above, there are other points that also belong within Observability’s responsibilities:
- Different levels of observation adapted to users’ perspectives (technical, operational, business)
- Integration with CI/CD pipelines, providing feedback for automated testing and validation
- Observability of the automation system itself (meta-monitoring of Collectors, processors, and alerting systems)
However, on the other side, there are functions that belong to other components of the architecture:
- Defining network intent: What the network should look like (Intent/SoT responsibility)
- Executing network changes: Actually implementing remediation (Executor responsibility)
- Orchestrating complex workflows: Coordinating multi-step remediation across multiple systems (Orchestrator responsibility)
This clear boundary ensures Observability focuses on detection and insights, while other building blocks handle intent definition, execution, and orchestration.
Transitioning to modern observability requires careful planning. Don't treat it as simply replacing one monitoring tool with another: it requires transforming your mindset. With more power comes more responsibility, and you will have more to choose and tune.
After this initial introduction that exposes the key concepts related to the Observability block, let’s go into each functionality next.
6.2. Functionalities#
The seven goals and pillars are realized through seven core functionalities. Each functionality maps to a goal and its supporting pillar, creating a direct chain from business requirements to technical implementation:
- Inventory: Consumes intent from the SoT and provides the metadata, device lists, and collection targets to all downstream components.
- Collector: Retrieves observed data from the network using multiple protocols and collection methods, both pull-based (polling) and push-based (streaming).
- Processor: Normalizes heterogeneous data into a common schema and enriches it with contextual metadata (tags, relationships, business context), besides doing other data operations.
- Distribution: Decouples data producers from consumers using distributed, asynchronous patterns. Moves data and events reliably from Collectors through processors to persistence and alerting systems.
- Persistence: Stores normalized data in databases optimized for efficient ingestion, retention, and querying at scale.
- Alerting: Analyzes persisted data using flexible rules and thresholds to detect conditions of interest, generating events that trigger external systems (automation or human notifications).
- Visualization: Renders observed data and triggered events into dashboards, reports, and other visual interfaces tailored to different user audiences and use cases.
graph LR
%% --- Subgraphs ---
subgraph Goals
direction TB
A1[Observe all the network with minimal human effort]
A2[Support heterogeneous network environments with enough data and accuracy]
A3[Observe data from different IT layers with context]
A4[Handle massive-scale network scenarios]
A5[Offer access to observability data for sophisticated analysis in near real-time]
A6[Be proactive to detect network issues and reduce time to recover]
A7[Create tailored user-oriented visualizations]
end
subgraph Pillars
direction TB
B1[Close integration with SoT to understand what needs to be monitored]
B2[Ability to collect data via different protocols supporting very frequent/on-demand updates]
B3[Normalization of heterogeneous data with contextual metadata for richer analysis]
B4[Scalable data distribution systems to support scale-out architectures]
B5[Persistence layer supporting time-series data and powerful query languages]
B6[Flexible rule definitions and routing scenarios with external system integration]
B7[Custom visualizations and integration of multiple data stores]
end
subgraph Functionalities
direction TB
C1[Inventory]
C2[Collector]
C3[Processor]
C4[Distribution]
C5[Persistence]
C6[Alerting]
C7[Visualization]
end
%% --- Row connections ---
A1 --> B1 --> C1
A2 --> B2 --> C2
A3 --> B3 --> C3
A4 --> B4 --> C4
A5 --> B5 --> C5
A6 --> B6 --> C6
A7 --> B7 --> C7
%% --- Row gradient classes ---
classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;
classDef row7 fill:#66b2ff,stroke:#4a90e2,stroke-width:1px;
%% --- Apply classes per row ---
class A1,B1,C1 row1;
class A2,B2,C2 row2;
class A3,B3,C3 row3;
class A4,B4,C4 row4;
class A5,B5,C5 row5;
class A6,B6,C6 row6;
class A7,B7,C7 row7;
These components can be seen as a data pipeline or ETL (Extract, Transform, and Load) process, illustrated in the following diagram:
flowchart TB
A[Network] --> B[Collector]
subgraph Observability
direction LR
B --> C[Distribution] --> D[Persistence]
B -.-> P[Processing]
C -.-> P
D -.-> P
D --> E[Alerting]
E -.-> P
D --> G[Visualization]
X[Inventory] -.-> B
X -.-> G
X -.-> P
end
E -.-> F[Orchestration]
G -.-> H[Presentation]
Y[SoT] -.-> X
Figure 1: Observability Pipeline.
6.2.1. Inventory#
The inventory component answers a simple question: what should I be monitoring?
You already have this data somewhere: it’s in your source of truth. You’ve got device names, IP addresses, what they are, whether they’re active, credentials. Don’t duplicate that by hand. Pull it in automatically.
What you need from your SoT:
- A unique name or ID for each device or service (so you know it’s the device you think it is)
- How to reach it: IP address, hostname, and credentials (if you need to actively pull data from it)
- What it is: the type and vendor (so you know which Collectors work with it)
- Whether it’s active: if it’s planned, active, or being decommissioned (so you don’t alert on expected downtime)
Beyond the basics, you might also care about:
- OS or device specifics: some devices use different protocols, some are old and need special handling
- Context: who owns it, where’s it located, which customers depend on it (useful for filtering alerts and dashboards)
Why automate this instead of building a list by hand? Because people forget to update lists. Devices get added, nobody tells monitoring, and suddenly you’re missing visibility. But if observability reads from your SoT, when you add a device there, it’s automatically being monitored.
Push vs. pull?
- Pull: Observability checks the SoT periodically for updates. Simple, but if something changes mid-interval, you miss it.
- Push: When the SoT changes, it signals observability automatically (webhooks, message bus). Faster and more reliable.
If you have good data in your SoT and real integration (not manual copy-paste), inventory becomes automatic and you never miss observing a new device.
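A minimal sketch of the idea, with hard-coded records standing in for a real source-of-truth API. The field names, sample data, and platform-to-method mapping are illustrative assumptions, not any specific SoT's schema:

```python
# Sketch: derive collector targets from source-of-truth records.
# Fields and data are illustrative stand-ins for a real SoT API.

SOT_DEVICES = [
    {"name": "leaf01",  "ip": "192.0.2.10", "platform": "eos", "status": "active"},
    {"name": "rtr-old", "ip": "192.0.2.20", "platform": "ios", "status": "active"},
    {"name": "leaf99",  "ip": "192.0.2.99", "platform": "eos", "status": "planned"},
]

# Which collection method each platform supports (assumption for the example).
PLATFORM_METHOD = {"eos": "gnmi", "ios": "snmp"}

def build_targets(devices):
    """Only active devices become monitoring targets; planned or
    decommissioned devices are skipped, so expected downtime never alerts."""
    return [
        {"host": d["ip"], "method": PLATFORM_METHOD.get(d["platform"], "snmp")}
        for d in devices
        if d["status"] == "active"
    ]

print(build_targets(SOT_DEVICES))
# [{'host': '192.0.2.10', 'method': 'gnmi'}, {'host': '192.0.2.20', 'method': 'snmp'}]
```

In a push setup, a webhook from the SoT would re-run `build_targets` and reload the collectors; in a pull setup, you would run it on a schedule.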
6.2.2. Collectors#
Your data has to come from somewhere. Collectors are the bridges between your network and your Observability platform. They’re responsible for pulling (or receiving) data from your devices and feeding it into the pipeline. Without effective Collectors, everything downstream is garbage.
There are two fundamental approaches:
- Passive: Your devices send data to the collector. Think syslog servers receiving log messages from routers, or IPFIX collectors listening for flow records. The device decides what to send and when.
- Active: The Collector asks devices for data. It connects to each device and pulls information using Simple Network Management Protocol (SNMP), gRPC Network Management Interface (gNMI), or Representational State Transfer (REST) Application Programming Interfaces (APIs), or subscribes to streaming data. The Collector is in control: it decides what to ask for and when.
You can also categorize collectors by deployment:
- Agentless: No software to install on devices. A central collector server (usually running somewhere else) connects to each device individually. Simple to start with, but can become a bottleneck as you scale.
- Agent-based: Install a small agent on each device or service. Agents push data to a central location, or pull directly from local sources. More distributed, easier to scale, but more moving parts to manage.
Independently of the collection method, it’s important to highlight the different types of data that may be of interest in the network automation environment. I classify it in four broad categories:
- Management plane: Exposes the state of the device: configuration, logging, and network statistics. Protocols in this group include Simple Network Management Protocol (SNMP), System Logging Protocol (Syslog), gRPC Network Management Interface (gNMI), NETCONF, and RESTCONF.
- Control plane: Where the distributed protocols that determine packet forwarding run, building the layer 2 and layer 3 forwarding tables. A few examples of control plane protocols are OSPF, IS-IS, and BGP. This plane can be observed via techniques such as ping or traceroute, or via telemetry protocols such as the BGP Monitoring Protocol (BMP).
- Forwarding plane: Where packets are actually moved (e.g., network interfaces); it is the most demanding in terms of data volume and velocity. Naturally, when observing it, it's crucial not to impact its primary job, which is forwarding packets. In this group we have tools such as tcpdump, IPFIX, sFlow, NetFlow, Cisco IP SLA, PSAMP, and eBPF.
- External data: This category includes everything that is not network device-specific. For instance, circuit provider information and the contact for a given interface, coming from an external asset manager system or physical Internet of Things (IoT) sensors, could fit into this broad field.
flowchart TB
subgraph Network Device/Service
direction TB
A[Management Plane]
B[Control Plane]
C[Forwarding Plane]
A --> B
B --> C
end
D[External data]
Figure 2: Scope of Collector.
Stop screen scraping Command Line Interface (CLI) output. I know you’re doing it. We all did it. But it’s 2026 and every major vendor supports proper telemetry now.
CLI scraping is fragile (vendors change output format), slow (screen scraping and parsing text is expensive), unreliable (random command timeouts), and scales terribly. If your device is so old it only has CLI, either replace it or accept that you’ll have limited observability. Don’t build your entire monitoring stack around the lowest common denominator.
Really, it comes down to two core questions about what you’re collecting:
- What to collect: Metrics? Logs? Flow records? Each has different data models. SNMP has MIBs, modern gear speaks gNMI, applications use OpenTelemetry. The dream is a universal standard that lets you correlate data across everything, but that doesn’t exist yet. So you might need to build your own translation layer that turns all these different formats into something consistent (which is what the processing layer does next).
- How to get it: Which protocol? SNMP is old but stable, gNMI is modern and pushes data at you continuously, IPFIX captures what’s actually flowing. It varies by what you’re trying to observe.
This table summarizes what you might collect and the tools available:
| Data Type | Protocols / Collection Methods | Notes / Examples |
|---|---|---|
| Metrics | Simple Network Management Protocol (SNMP), Hypertext Transfer Protocol (HTTP) scraping, Command Line Interface (CLI) polling, OpenTelemetry (OpenTelemetry Protocol (OTLP)), Streaming telemetry (gRPC Network Management Interface (gNMI)) | Device metrics, host metrics, application metrics |
| Logs | OpenTelemetry (OTLP), file tailing, syslog | Application logs, system logs, structured logs |
| Traces | OpenTelemetry (OTLP) | Distributed tracing across services |
| Network Flows | NetFlow, IPFIX | Traffic flows, source/destination analysis |
| Protocol-specific | BMP, BGP, ARP, OSPF | BGP monitoring (BMP), ARP tables, BGP tables, OSPF tables |
| Packet Captures | PCAP (libpcap), SPAN / TAP | Full packet inspection, deep troubleshooting |
Table 1: Data and Protocols to collect.
To get a deeper understanding of the different options, refer to the Modern Network Observability book.
Network data collection applies to data centers, ISP backbones, cloud network services, Linux kernel interfaces, or raw packets. It depends on your environment.
Here’s the key: before you decide what to collect, start with the problem you’re trying to solve. That decision drives everything else. Are you detecting interface saturation? Monitoring BGP convergence? Tracking DDoS traffic? Your problem determines what data you need and how often you need it.
SNMP, syslog, and NetFlow are still around and work fine. But the classic collection model is interval-based: collectors poll devices (or devices export on a schedule), and you miss everything in between. Modern approaches are different: streaming telemetry lets devices push data to you continuously, or whenever state changes.
Streaming telemetry
Devices push data continuously using YANG models. You set up subscriptions and data flows in real-time. Two flavors:
- Dial-In: Collector asks device to start streaming (collector initiates).
- Dial-Out: Device is pre-configured to stream to you (device initiates).
Much lower latency than polling, and the device stays in control of the flow.
flowchart TB
A[Collector]
B[Device]
A -.->|Dial-In| B
B -->|Streaming| A
B -.->|Dial-Out| A
Figure 3: Streaming Telemetry.
HTTP-exposed metrics

Hypertext Transfer Protocol (HTTP) scraping (pull-based metrics) is simple and scales well. SNMP is also a pull model, but increasingly network OSes expose metrics directly over HTTP in Prometheus format. That's easier for Collectors to consume: no SNMP Management Information Base (MIB) parsing needed. SONiC, Cumulus, Arista EOS, and others all expose metrics this way.
| Vendor / OS | Metric Type | Example Metric |
|---|---|---|
| SONiC | Interface traffic | sonic_interface_rx_bytes_total{interface="Ethernet32"} 1.234e+12 |
| NVIDIA Cumulus | Interface traffic | node_network_receive_bytes_total{device="swp1"} 9.21e+10 |
| Arista EOS | Interface traffic | arista_interface_in_octets_total{interface="Ethernet1"} 8.3e+11 |
Table 2: HTTP-exposed metrics.
The scraping approach provides low latency and near-real-time metrics, rich labels, and pull-based collection (central control of rate/timeout), connecting well with cloud-scale observability.
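To make the format concrete, here is a small sketch that parses the Prometheus text exposition format (the same shape as the examples in Table 2) into name/labels/value tuples. The sample payload is canned for illustration; in practice the text would come from an HTTP GET against the device's metrics endpoint.

```python
# Sketch of what a scraper does with an HTTP-exposed metrics page: parse
# Prometheus text exposition lines into (name, labels, value) tuples.
import re

SAMPLE = """\
sonic_interface_rx_bytes_total{interface="Ethernet32"} 1.234e+12
node_network_receive_bytes_total{device="swp1"} 9.21e+10
"""

LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def parse_metrics(text):
    out = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue                      # skip comments/blank lines
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
        out.append((name, labels, float(value)))
    return out

for metric in parse_metrics(SAMPLE):
    print(metric)
```

Because the format is plain text with labels, correlating across devices is just a matter of matching label values, which is a big part of why this model scales so well.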
OpenTelemetry
OpenTelemetry is a vendor-neutral standard and toolkit for collecting, processing, and exporting telemetry data. Think about it as a common telemetry language and pipeline that unifies metrics, logs, and traces across networks, systems, and applications.
It does not replace network protocols like SNMP, NetFlow, gNMI, or BMP. Instead, it standardizes how telemetry is represented and transported after collection.
In traditional network monitoring, data models are diverse, with vendor-specific schemas and naming, which makes it hard to correlate across layers (network ↔ system ↔ application). By contrast, OpenTelemetry helps by providing:
- A common data model for metrics, logs, and traces
- A standard transport protocol, the OpenTelemetry Protocol (OTLP), over gRPC or Hypertext Transfer Protocol (HTTP)
- A single processing pipeline for multiple signal types
Grafana Alloy and Telegraf are examples of Collectors implementing the OpenTelemetry Protocol (OTLP). They collect data from different sources and export it to different backends: metrics (Prometheus-compatible Time Series Databases (TSDBs)), logs (Loki, Elasticsearch, ClickHouse), and traces (Tempo, Jaeger).
This brings us to a final consideration: the common structure of modern pluggable collectors, with Input, Processor, and Output stages. For example, in Telegraf, OTLP is available as an output plugin.
Collector Architecture
Simplistically, every collector can be broken down into three parts (sometimes pluggable, sometimes hardcoded).
flowchart LR
A[INPUT] --> B[PROCESSOR] --> C[OUTPUT]
Figure 4: Collector's Architecture.
- Input: Defines what has to be observed and under which parameters.
- Processor: Optional, but very convenient for ensuring data structure consistency as soon as data enters the pipeline. Processing can become very complex and may impact performance at scale, so not all of it has to happen at this level.
- Output: Defines how the collector moves data into the pipeline. It may send it directly to other blocks like processing or persistence, or use the distribution component to scale.
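The three stages can be sketched as plain functions. This is a toy illustration of the pattern only; real collectors such as Telegraf implement each stage as a configurable plugin.

```python
# Toy sketch of the three-stage collector pattern (Input -> Processor -> Output).
# The sample record and field names are illustrative assumptions.

def input_stage():
    """Input: what to observe. Here, a canned sample instead of a device."""
    return [{"device": "leaf01", "ifname": "swp1", "in_octets": 123456}]

def processor_stage(records):
    """Processor (optional): light normalization as data enters the pipeline."""
    for r in records:
        r["in_bits"] = r.pop("in_octets") * 8   # unit conversion: octets -> bits
    return records

def output_stage(records):
    """Output: hand records to the next block (persistence or a message bus)."""
    return [f'{r["device"]}/{r["ifname"]} in_bits={r["in_bits"]}' for r in records]

print(output_stage(processor_stage(input_stage())))
# ['leaf01/swp1 in_bits=987648']
```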
There are many collectors, each with different capabilities (Telegraf, Grafana Alloy, gNMIc, pmacct, goflow, etc.), but they share a similar architecture. So when choosing one (sometimes you may need several), consider:
- Device capabilities: which protocols do your devices support?
- Data volume: high volume needs streaming; low volume can use polling.
- Latency requirements: near real-time vs. traditional intervals.
- Team skills and ecosystem fit with your backend.
All the “complete” network observability solutions, like SuzieQ, Kentik, and others, also implement built-in collectors for the observability data covered here.
After the data is collected, the next step in the pipeline is the Processor, which manipulates the data at different stages.
6.2.3. Processor#
Raw data from collectors is messy. Different devices export metrics differently, logs are unstructured text, values might be in different units. The processor layer cleans this up and makes it useful.
Goal: Once data is received, it must adhere to shared standards so that signals from multiple sources can be converged, correlated, and enriched with additional context. Without this processing step, observability pipelines quickly become fragmented, difficult to query, and expensive to operate. Processing can happen at multiple points in the pipeline, depending on scale, complexity, and operational requirements.
The following are common processing actions that apply to observability pipelines.
6.2.3.1. Normalization/Transformation#
Data comes in different formats. Arista sends metrics one way, Cisco another, syslog is text, NetFlow is binary. Normalization translates all of this into a common format so your backend doesn’t have to understand 50 different dialects.
Structuring
Devices emit data in ways that make sense for them. Your job is to translate that into a format that works for analysis:
**Log-based**: from a raw text line

```
Mar 18 14:22:11 leaf01 IFACE-5-STATE: swp1 oper-state changed from UP to DOWN
```

to a structured record:

```json
{
  "timestamp": "2025-03-18T14:22:11Z",
  "level": "INFO",
  "device": "leaf01",
  "component": "interface",
  "event": "oper_state_change",
  "interface": "swp1",
  "previous_state": "UP",
  "current_state": "DOWN"
}
```

**Metric-based**: following the `<metric name>{<labels>} <value>` pattern:

```
interface_admin_state{hostname="leaf01", ifname="swp1"} 1
interface_oper_state{hostname="leaf01", ifname="swp1"} 0
interface_speed_bps{hostname="leaf01", ifname="swp1"} 100000000000
interface_in_errors_total{hostname="leaf01", ifname="swp1"} 0
interface_out_errors_total{hostname="leaf01", ifname="swp1"} 12
```

**Table-based**: some tools (e.g., Suzieq) organize data into tabular, time-indexed state views:

| hostname | ifname | adminState | operState | speed | inErrors | outErrors | timestamp |
|---|---|---|---|---|---|---|---|
| leaf01 | swp1 | up | up | 100G | 0 | 0 | t1 |
| leaf01 | swp1 | up | down | 100G | 0 | 12 | t2 |
Renaming and alignment
Different telemetry sources describe the same concept using different names, paths, and label conventions. For example:
- OpenConfig: path `/interfaces/interface/state/oper-status`, value `UP`, tags `source=192.0.2.1` and `interface_name=eth1`
- SNMP: `ifOperStatus{ifName="GigabitEthernet0/1", device="router01"} 1`
- Native Prometheus: `interface_oper_state{interface="swp1", host="leaf01"} 1`

Normalization aligns them into a consistent model, including the object name, label renaming, and the value (using the same unit conversion):

```
intf_oper_state{name="eth1", device="192.0.2.1"} 1
intf_oper_state{name="GigabitEthernet0/1", device="router01"} 1
intf_oper_state{name="swp1", device="leaf01"} 1
```

6.2.3.2. Enrichment#
Enrichment adds extra content to the observability data beyond what is actually observed. These extra dimensions added to the data allow more sophisticated data consumption. For example, you may be able to understand that the metrics belong to a device that plays a specific role in the network and act accordingly.
There are two main approaches to enrichment:
Extending data: adding extra metadata or labels to the observed data to complement it. This data could be static (e.g., `org=my-company`) to mark all your data, dynamic based on collection context (e.g., `collector_id=1234`), or dynamic based on the observed data itself (e.g., given `hostname=rtr-1`, create a label `location=BCN-01` by correlating with the SoT):

```
intf_oper_state{
  name="swp1",
  device="leaf01",
  role="leaf",
  location="BCN0001"
} 1
```

Creating new data: following the Prometheus ecosystem's "info metrics" pattern, we can generate metrics that do not represent actual state but intended state. These metrics are useful in later observability pipeline stages to add more dimensions to analysis, as you will discover in the Alerting section:

```
device_info{
  name="leaf1",
  role="leaf",
  vendor="arista",
  model="7050SX3",
  platform="eos",
  os_version="4.29.2F",
  location="BCN0001",
  rack="AB1",
  rack_unit="U32",
  environment="prod"
} 1
```

Info metrics are a curious type of data: the relevant information lives not in the value (the `1` in the previous metric) but in the labels. This trick allows reusing a Time Series Database (TSDB) that does not support certain value types (like strings).
Both approaches add labels and context. That’s powerful for alerts and analysis. But there’s a cost to think about:
- Cardinality: Every new label multiplies your time series. Device × interfaces × metrics is already high-cardinality. Add labels carelessly and storage explodes, queries slow down. Be thoughtful.
- Update frequency: Device racks and management IPs don’t change every second. Don’t poll enrichment data like you poll volatile metrics. Event-driven updates work better, fewer queries to your source of truth.
- Resilience: If your source of truth goes offline, enrichment stops. Cache it so you keep operating, even degraded. Your automation relies on this data, so make it rock-solid.
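The "extending data" approach with a cached Source of Truth can be sketched minimally. The inventory dict and label names here are hypothetical stand-ins for whatever your SoT exposes:

```python
# SOT_CACHE stands in for a locally cached Source of Truth lookup; in a real
# pipeline it would be refreshed by events (e.g., webhooks), not re-queried
# for every sample, and it keeps enrichment working if the SoT goes offline.
SOT_CACHE = {
    "leaf01": {"role": "leaf", "location": "BCN0001"},
}

def enrich(sample: dict, cache: dict = SOT_CACHE) -> dict:
    """Return a copy of the metric sample with SoT-derived labels merged in."""
    labels = dict(sample["labels"])
    # Unknown devices (or a stale cache) degrade gracefully: no extra labels.
    labels.update(cache.get(labels.get("device"), {}))
    return {**sample, "labels": labels}
```

A sample like `{"name": "intf_oper_state", "labels": {"device": "leaf01", "name": "swp1"}, "value": 1}` comes out carrying `role` and `location`, ready for role-aware alerting and dashboards.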
6.2.3.3. Transformation / Derivation / Aggregation#
Take raw metrics and derive new ones from them. Example: merge all interface input traffic from leaf and spine switches into a single “fabric bandwidth used” metric for trending. You’re combining existing data to answer bigger questions or feed dashboards. Prometheus calls this “recording rules.”
```yaml
- record: fabric:interface:in_bps
  expr: |
    (
      sum by (fabric, role, hostname, name) (
        rate(interface_in_octets_total{role=~"leaf|spine"}[5m])
      ) * 8
    )
    * on (hostname, name) group_left (fabric, role)
    sot_interface_info{role=~"leaf|spine"}
```
Another processing function that reduces data volume is aggregation: reducing dimensionality to match how the data will ultimately be used, for example summarizing interface information per device, or device information per site. Related derivations, such as computing rates from counter metrics or histogram bucketing, are useful for many analyses.
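The two derivations just mentioned, rates from counters and aggregation across interfaces, can be sketched in a few lines (the metric shapes are hypothetical):

```python
def rate_bps(samples):
    """Derive bits per second from two (timestamp, octet_counter) points,
    the way a TSDB derives a rate from a monotonic counter."""
    (t0, c0), (t1, c1) = samples
    return (c1 - c0) * 8 / (t1 - t0)

def fabric_in_bps(per_interface_bps):
    """Aggregate per-interface rates into one fabric-wide number,
    dropping the per-interface dimension entirely."""
    return sum(per_interface_bps.values())
```

For instance, 1250 octets over 10 seconds derives to 1000 b/s; summing those per-interface rates gives the single "fabric bandwidth used" trend line from the example above.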
6.2.3.4. Filtering#
Drop the junk early. Not every log line is worth storing. Not every interface metric matters (maybe you don’t care about loopbacks). The earlier you filter, the less you waste on storage and processing. Allowlists (only keep this) are safer than denylists (drop that).
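An allowlist filter can be a one-liner; the interface patterns here are illustrative:

```python
import fnmatch

# Allowlist: keep only the physical ports we explicitly care about.
# Loopbacks, VLAN SVIs, and anything unlisted are dropped early.
ALLOW = ["swp*", "eth*"]

def keep(ifname: str, allow=ALLOW) -> bool:
    """Return True only for interfaces matching an allowlist pattern."""
    return any(fnmatch.fnmatch(ifname, pattern) for pattern in allow)
```

Note the allowlist semantics: a new, unexpected interface type is dropped by default, which is exactly why allowlists are safer than denylists.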
6.2.3.5. Sampling / Throttling#
Even after filtering, volume might still be too high. Throttle it. Sample probabilistically (“keep 10% of these requests”), focus on top-K metrics (“only store the 100 busiest interfaces”), or rate-limit per source (“max 1000 metrics per device”). As data ages in your database, roll it up (5-second granularity becomes 5-minute averages) to save space.
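Two of the throttling strategies above, probabilistic sampling and top-K, can be sketched as follows (the fraction and limit values are illustrative):

```python
import heapq
import random

def sample(items, fraction=0.10, rng=None):
    """Probabilistic sampling: keep roughly `fraction` of the items."""
    rng = rng or random.Random()
    return [item for item in items if rng.random() < fraction]

def top_k(util_by_ifname, k=100):
    """Top-K: keep only the k busiest interfaces by utilization."""
    return dict(heapq.nlargest(k, util_by_ifname.items(), key=lambda kv: kv[1]))
```

Rate limiting per source and age-based rollups follow the same spirit: decide up front how much resolution each consumer actually needs, and drop the rest.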
Finally, all these processors may be run at different stages of the observability pipeline, depending on the use case:
- Collector: Best for lightweight, early normalization and filtering
- Dedicated processor: Required at scale for dynamic enrichment and complex transformations
- Persistence layer: Suitable for recording rules and long-term rollups (Normalization should always happen before this)
- Alerting layer: Derives events from stored data and applies business logic
In practice, effective observability pipelines distribute processing across layers, depending on tooling, scale, and operational constraints.
6.2.4. Distribution#
Simple linear pipelines (collector → processor → database) don’t scale. If the database gets slow, the collector backs up. If you need to upgrade the processor, you stop collection. Everything is tightly coupled and fragile.
This is where message brokers come in.
Message brokers like Apache Kafka or NATS sit in the middle. Producers (collectors, devices) publish to topics. Consumers (processors, databases, alerting) pull at their own pace. Fully decoupled.
Benefits:
- Scaling: Each component scales independently.
- Resilience: If a consumer is slow, data queues up instead of being dropped.
- Flexibility: Same data feeds multiple backends without duplication at the source. Upgrade or restart one component without affecting others.
See Chapter 11 for more on scaling and reliability patterns.
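The decoupling a broker provides can be mimicked in miniature with an in-memory queue. This is only a sketch of the producer/consumer pattern, not a Kafka or NATS client:

```python
import queue
import threading

# The queue stands in for a broker topic: the producer never waits on the
# consumer, and a slow consumer just means messages accumulate in the topic.
topic = queue.Queue()

def producer(n):
    for i in range(n):
        topic.put({"metric": "interface_in_octets_total", "value": i})
    topic.put(None)  # sentinel: end of stream

def consumer(out):
    while True:
        msg = topic.get()
        if msg is None:
            break
        out.append(msg)  # a real consumer would write to a database here

received = []
t = threading.Thread(target=consumer, args=(received,))
t.start()
producer(5)
t.join()
```

A real broker adds the parts that matter at scale: durable storage, replay from an offset, and multiple independent consumer groups reading the same topic.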
6.2.5. Persistence#
Once data is processed, it needs to live somewhere. The database layer stores all your observability data. It needs to handle huge volumes, support fast queries, and keep costs reasonable.
Good databases for observability share common traits:
- Time-aware: Data is inherently timestamped. Databases optimize for range queries and time-windowed calculations.
- High write throughput: Constant metric ingest. Databases handle it without slowing down.
- Multi-dimensional: Metrics carry labels (device, interface, location). Query and aggregate them efficiently.
- Flexible queries: Need expressive languages (PromQL, LogQL) to explore data without predefined schemas.
- Lifecycle management: Storage grows fast. Support retention, downsampling, deletion to control costs.
- Schema flexibility: New metrics appear constantly. Databases handle evolution without costly migrations.
What database types work?
No single database handles all observability workloads perfectly, and anyone telling you otherwise is selling something. Here’s what actually works:
Time-series databases (TSDBs): This is where you start. Prometheus won the metrics war. Its data model (metrics with labels) became the de-facto standard, and PromQL is the query language everyone knows. Use Prometheus if you’re under 100 million active series. Beyond that, look at VictoriaMetrics (compatible with Prometheus, scales better, uses less memory). InfluxDB is fine but their licensing keeps changing. Avoid vendor-specific solutions unless you’re already locked into their ecosystem.
Columnar databases: ClickHouse is the king here. It’s absurdly fast for log aggregation and flow analysis. If you need to query billions of rows for reporting or historical analysis, this is your tool. InfluxDB v3 is trying to compete but ClickHouse has years of hardening. Parquet files work for analytics workloads where you don’t need real-time writes (like Suzieq does).
Text-search databases: Elasticsearch if you must, but honestly, modern alternatives like Loki (from Grafana) are simpler and cheaper to run. Splunk is great if someone else is paying for it. The dirty secret: most teams over-invest in log search and under-invest in structured logging that would make search unnecessary.
My recommendation: Start with Prometheus (or similar) for metrics, Loki (or similar) for logs. Add ClickHouse (or similar) when you need serious historical analysis. That stack will get you to massive scale before you need something fancier.
This classification is not absolute; most tools have a primary classification but then implement some characteristics of others (especially time-series).
Two important concepts when designing storage: cardinality (how many unique values can a label take) and dimensionality (how many labels does one metric have). High-cardinality labels multiplied by many dimensions = explosion in stored data and slow queries. This is one of the biggest challenges in observability. See Chapter 11 for deep scaling considerations.
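The multiplication is easy to underestimate, so here is a back-of-the-envelope sketch. The counts are illustrative, and the worst case assumes every new label value occurs on every existing series:

```python
# Series count = product of the cardinality of each dimension.
devices = 50
interfaces_per_device = 64
metrics_per_interface = 10

base_series = devices * interfaces_per_device * metrics_per_interface
print(base_series)  # 32000 active time series

# Carelessly adding one label that takes 20 distinct values per series:
with_extra_label = base_series * 20
print(with_extra_label)  # 640000: a 20x storage and query-cost hit
```

This is why enrichment labels should be low-cardinality attributes (role, site) rather than anything unbounded (session IDs, timestamps, ephemeral IPs).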
Each database has its own characteristics that must be mapped to the use cases it needs to solve. For example, Suzieq uses a columnar solution (Apache Parquet files) because the questions it tries to answer are relational rather than time-series based, for instance: “Which routes exist on spines but not on all leaves?”
- Requirements:
- Filter across many attributes
- Compare rows across devices
- Join tables (interfaces, neighbors, routes)
- Look at state at a point in time (not historical evolution)
- Solution: This is what a columnar analytic solution is designed for. A Time Series Database (TSDB) could help with checking the number of routes, but identifying missing routes would require many labels, which is not its primary strength.
After all the data management, there are two final steps:
- Create events for other automation to use or humans to intervene: Alerting
- Visualize the data to provide information for decision-making: Visualization
6.2.6. Alerting#
Alerts turn data into action. Your goal: feed them to automation. Notify humans when automation can’t fix it.
Alerts flow through stages:
- Detection: Find something wrong in data.
- Processing: Enrich with context. Is it critical? Minor? False alarm? Correlate related alerts to reduce noise.
- Routing: Send to orchestration (run workflows), teams (Slack), or incident management (PagerDuty).
- Escalation: If automation fails, humans take over.
```mermaid
flowchart LR
    A[Detection] --> B[Processing] --> C[Routing] --> D[Escalation]
```
Figure 5: Alerting Stages.
The hard part isn’t setting up alerts. It’s preventing alert fatigue. I’ve seen NOCs with 10,000 active alerts where nobody looks at any of them anymore. At that point, you don’t have monitoring, you have expensive noise.
Here’s how you actually fix it:
Route 95% of alerts to automation, not humans. If a human sees an alert, it should be because automation tried and failed. Interface flapping? Automation checks if it’s maintenance, reboots the optics, opens a ticket with the vendor. Human only gets paged if automation can’t resolve it.
Kill static thresholds. “CPU > 80%” alerts are useless. 80% might be normal for that device. Use dynamic baselines: alert when something deviates from its historical pattern, not from some arbitrary number.
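A dynamic baseline can be as simple as comparing a value to that device's own history. A minimal sketch, where the deviation multiplier `k` is an assumption you would tune:

```python
import statistics

def is_anomalous(history, current, k=3.0):
    """Dynamic baseline: flag a value that deviates more than k standard
    deviations from this device's own history, not from a fixed number."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return current != mean
    return abs(current - mean) > k * std
```

With history hovering around 80, a reading of 81 is normal here and stays quiet, while a static "CPU > 80" rule would already be paging someone.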
Group related alerts. When a core switch dies, you’ll get 500 alerts from downstream devices. Show one: “Core switch down, 500 devices affected.” Not 500 individual alerts.
Require a runbook for every alert. If you can’t write down what someone should do when they get the alert, delete the alert. Seriously. If the action is “investigate,” that’s not an action, that’s a waste of time.
Measure alert quality. Track false positive rates. Any alert with >10% false positives gets fixed or deleted. Track time-to-acknowledge. If alerts sit unacknowledged for hours, they’re not important enough to exist.
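Tracking alert quality is plain bookkeeping. A minimal sketch of the 10% false-positive rule, using the threshold suggested above:

```python
def false_positive_rate(fired: int, false_positives: int) -> float:
    """Fraction of fired alerts that turned out to be noise."""
    return false_positives / fired if fired else 0.0

def should_keep(fired: int, false_positives: int, max_fp_rate: float = 0.10) -> bool:
    """An alert exceeding the false-positive budget gets fixed or deleted."""
    return false_positive_rate(fired, false_positives) <= max_fp_rate
```

Run this over a month of alert history and the alerts worth deleting identify themselves.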
The goal is not “comprehensive monitoring.” The goal is “only page humans for things humans need to fix.”
6.2.6.1. The role of AI and AIOps in observability#
Let’s cut through the hype: most “AI-powered observability” is just anomaly detection with a marketing budget. That said, there’s real value here if you deploy it correctly.
What actually works:
Anomaly detection: ML is genuinely better than static thresholds at learning “normal” for each device. CPU at 85% might be fine for device A, disaster for device B. ML figures this out automatically. This is table stakes now, not magic.
Alert correlation: When 50 things break simultaneously, ML can group them and suggest “the core switch is probably the root cause.” Saves hours of troubleshooting. But you still need humans to verify, because ML gets it wrong 20% of the time.
Capacity forecasting: ML is decent at “based on trends, this link will saturate in 6 weeks.” Better than humans eyeballing graphs. Still needs human judgment on whether to care.
What’s oversold:
Automatic root cause analysis: Every vendor promises this. None deliver reliably. You’ll get suggestions, sometimes good ones, but “AI diagnosed and fixed the problem” is 95% marketing and 5% cherry-picked examples.
Self-healing networks: Automation can fix known problems with known solutions. That’s not AI, that’s good engineering. True “self-healing” for novel problems doesn’t exist yet. When vendors demo it, ask them to show you the failure cases.
“AIOps replaces your NOC”: No. AIOps helps your NOC be more effective. The human judgment, business context, and ability to handle edge cases? That’s still humans.
Bottom line: Use ML for anomaly detection and alert ranking. Be skeptical of everything else until you’ve tested it on your actual network, not a vendor demo.
6.2.7. Visualization#
Goal: all the observed data should provide value to decision-makers, so visualizations must be crafted around what users actually need to decide.
The final block is the dashboard/report layer. Let me be blunt: most dashboards are terrible. They’re either vanity metrics that make executives feel good (“99.99% uptime!”) or they’re data vomit (50 graphs per screen, nobody knows what any of them mean).
Here’s how to build dashboards that actually help:
Build for decisions, not decoration. Every widget should answer a specific question or trigger a specific action. If someone looks at a graph and doesn’t know what to do with it, delete the graph.
Show problems, not just data. Green/yellow/red signals beat raw numbers. “Interface utilization: 45%” is useless. “Interface utilization: normal” or “Interface utilization: WARNING, trending toward saturation” is actionable.
Hierarchical drill-down beats one big dashboard. Start with a global health summary (“3 sites have issues”). Click a site to see device health. Click a device to see interfaces. Five focused dashboards beat one cluttered disaster.
Match the audience. NOC staff need real-time operational status and drill-down. Managers need trend summaries and business impact. Engineers need raw data access and query interfaces. One dashboard trying to serve everyone serves no one.
Make it interactive or don’t bother. Static dashboards age badly. Let people filter, zoom, adjust time ranges. Support investigation, don’t just show pretty pictures.
And here’s the controversial take: most teams have too many dashboards. I’ve seen organizations with 200+ dashboards where each person looks at 2 of them. Delete the dashboards nobody uses. If no one looks at it for 3 months, it doesn’t matter.
Strictly speaking, this component belongs to the Presentation layer of the architecture, but I want to address it here before moving on.
The basic rules:
- Clarity: Easy to understand. Every element has purpose.
- Relevance: Show only data that supports decisions. Noise kills insight.
- User-focused: Build for your audience. NOC staff and managers need different views.
- Interactive: Let people drill, zoom, adjust time. Support investigation.
- Hierarchical: Global overview first, then drill down. Many focused dashboards beat one cluttered dashboard.
```mermaid
flowchart TD
    A[Global Overview] --> B[Site Summary] --> C[Device Summary] --> D[Device Detail] --> E[Interface Detail]
```
Figure 6: Hierarchical Drilldown.
In the Modern Network Observability book, chapter 11 (“Application of Your Observability Data”) you can find a lot more details for architecting dashboards.
Keep in mind that this block is closely tied to user perception, so do not forget to interview your users and involve them in the process.
Next, I want to provide an example of what a network observability solution could look like.
6.3. Implementation Example#
This section illustrates how the observability functionalities come together through a practical use case using the Telegraf, Prometheus, and Grafana stack and other tools.
This is not a tool recommendation at all; but because it’s built entirely from open-source components, you can easily give it a try yourself.
6.3.1. Use Case: Proactive Interface Saturation Detection#
Scenario: A data center fabric with 50 leaf and spine switches needs to detect interface saturation before it impacts applications and automatically trigger traffic engineering workflows.
Requirements:
- Alert when interface utilization exceeds 80% for 5 minutes
- Detect sudden traffic spikes (>50% change)
- Maintain 30 days of high-resolution data (30s intervals)
- Integrate with Nautobot (Source of Truth) for inventory
- Trigger orchestration workflows for remediation
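The first requirement maps to a "for duration" style evaluation. Here is a minimal sketch of that logic, similar in spirit to a Prometheus `for:` clause, using the sample interval and window from the requirements above:

```python
def sustained_breach(utilization_samples, threshold=80.0, duration_s=300, interval_s=30):
    """Fire only if every sample in the trailing window exceeds the threshold.
    With 30-second samples, 10 consecutive breaches cover the 5-minute window,
    so a single transient spike does not page anyone."""
    needed = duration_s // interval_s  # 300 / 30 = 10 samples
    if len(utilization_samples) < needed:
        return False
    return all(value > threshold for value in utilization_samples[-needed:])
```

In the actual stack this condition lives in a Prometheus alert rule; the sketch just makes the semantics explicit.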
6.3.2. Solution Architecture#
Before starting the solution analysis, it’s important to estimate the scale of the scenario. In this case, 50 devices × 64 ports × 10 metrics gives roughly 32K active time series, so it’s a pretty small scenario that doesn’t require advanced tooling.
Component selection rationale:
- Telegraf was chosen as the collector for its multi-protocol support (SNMP for legacy devices, gNMI for modern devices), extensive plugin ecosystem, and built-in processors for data normalization. It handles the 50-device scale easily on a single instance with 30-second polling intervals.
- Prometheus serves as the persistence layer, optimized for time-series data with its powerful PromQL query language for complex alerting conditions. At 32K series, it operates well within its comfort zone while providing native integration with Alertmanager.
- Grafana provides visualization with multi-datasource support, querying both Prometheus metrics and Nautobot metadata simultaneously to create context-rich dashboards tailored for different audiences (NOC, capacity planning, management).
Architecture
At a high-level, the next figure depicts the main components and roles:
```mermaid
flowchart TB
    subgraph Sources["Data Sources"]
        NB[Nautobot<br/>Source of Truth]
        SW[Network Devices<br/>SNMP/gNMI]
    end
    subgraph Collection["Collection Layer"]
        T[Telegraf<br/>Collectors]
        SD[Consul<br/>Service Discovery]
    end
    subgraph Storage["Storage"]
        P[Prometheus<br/>TSDB]
    end
    subgraph Alerting["Alerting"]
        AM[Alertmanager<br/>Routing]
    end
    subgraph Presentation["Visualization"]
        G[Grafana<br/>Dashboards]
    end
    subgraph Integration["External Systems"]
        ORCH[Orchestrator<br/>Automation]
        SLACK[Slack<br/>Notifications]
    end
    NB -->|Device Inventory| SD
    SD -->|Dynamic Targets| T
    SW -->|Metrics| T
    T -->|Expose HTTP| P
    P -->|Alert Rules| AM
    P <-->|Queries| G
    NB -->|Metadata| G
    AM -->|Webhook| ORCH
    AM -->|Alerts| SLACK
```
Figure 7: Observability Solution Example.
This simplified solution architecture is covered extensively in the Modern Network Observability book. If you want a hands-on approach with a lab scenario to test, give it a try.
6.3.3. Implementation Flow#
Inventory Integration: Nautobot serves as the single source of truth, defining which devices to monitor with monitoring profiles and SNMP credentials. A lightweight sync service (e.g., Python script using webhooks) continuously updates Consul’s service registry with device information, enabling dynamic discovery from the collector.
Data Collection: Telegraf uses Consul for service discovery, automatically polling SNMP from devices as they appear in Nautobot. Telegraf processors normalize and enrich data (converting status codes to labels, renaming fields to standard names, and adding contextual information from Nautobot) and expose metrics in Prometheus format on an HTTP endpoint.
Persistence and Analysis: Prometheus scrapes Telegraf endpoints using Consul service discovery, storing metrics in its time-series database. Recording rules pre-calculate interface utilization percentages and bandwidth rates to optimize query performance.
Alerting Logic: Alert rules in Prometheus define conditions (e.g., interface utilization >80% for 5 minutes, traffic spikes >50% increase). When conditions match, Alertmanager handles routing: critical alerts carrying an `automation: enabled` label go to the orchestrator webhook, while the rest route to Slack or PagerDuty based on severity.
Visualization: Grafana dashboards provide multiple views: fabric-wide bandwidth trends, top saturated interfaces, per-device drill-downs with interface heatmaps. Template variables enable filtering by site, role, or device. Dashboards query both Prometheus (metrics) and Nautobot (device metadata) for contextual enrichment.
Closed-Loop Automation: When critical saturation alerts fire, Alertmanager sends webhooks to the orchestration platform, which triggers automated traffic engineering workflows to redistribute load across available paths. We cover this component in Chapter 7.
6.3.4. Solution summary#
Operational Benefits:
- Reduced manual monitoring effort through SoT integration
- Proactive issue detection with sub-minute latency
- Closed-loop remediation reducing MTTR from hours to minutes
- Rich context combining metrics with inventory data
Scalability Considerations:
- Current architecture handles 50 devices; can scale to ~500 devices before needing distributed collectors
- Prometheus capacity sufficient up to ~1M active series; beyond that, you may need to consider other solutions or architectures
This brief solution exercise closes this chapter that defines the basic goals and functionalities of Observability within a network automation architecture.
6.4. Summary#
Observability in network automation extends far beyond traditional monitoring, providing the architectural foundation for understanding network behavior, detecting issues proactively, and enabling automated remediation at scale. Built on seven core goals, from automatic discovery with minimal human effort to sophisticated real-time analysis and user-centric visualizations, observability transforms how organizations respond to network events by shifting from reactive troubleshooting to proactive, data-driven operations.
The realization of these goals requires seven interconnected architectural pillars and functionalities: SoT integration for automatic inventory discovery, multi-protocol collectors supporting near-real-time data ingestion through streaming telemetry and modern protocols, processors that normalize and enrich heterogeneous data with contextual metadata, distributed systems for scalable data movement at high volumes, purpose-built databases (e.g., time-series, columnar, and text-search) optimized for different observability workloads, intelligent alerting with AI/ML-enhanced detection and routing to orchestration or human responders, and tailored visualizations that present information at appropriate abstraction levels for different audiences.
Implementing observability requires architectural decisions early in the design process. Organizations must choose between traditional on-premises platforms, cloud-native SaaS solutions offering rapid deployment and AI-powered analytics, or composable open-source stacks providing maximum flexibility. Each approach involves tradeoffs in cost, control, operational overhead, and capability. Success also depends on understanding data processing requirements, from normalization and enrichment to filtering and aggregation, and how emerging AIOps capabilities can reduce alert fatigue and enable predictive operations.
Observability is not a single tool but a coherent architectural pattern. Success depends on treating it as a system, where inventory drives collection, collection enables processing, processing feeds distribution, distribution connects to persistence, persistence informs alerting, and alerting ultimately drives visualization and automated response. By carefully designing each component and how they interact, organizations can build observability systems that scale with their networks, integrate with modern protocols like OpenTelemetry and gNMI, and transform operational visibility into actionable intelligence that powers closed-loop automation.
References and Further Reading#
- Modern Network Observability (David Flores, Christian Adell, Josh Vanderaa): A hands-on approach using open source tools such as Telegraf, Prometheus, and Grafana