
6. Observability#

In modern network automation, observability extends beyond traditional monitoring. It provides the visibility and intelligence needed to understand system behavior, detect issues proactively, and enable automated remediation at scale. Where traditional monitoring simply measures availability and performance, modern observability ensures that network services align with user expectations and business outcomes, creating the foundation for closed-loop automation.

This section covers two building blocks, Collector and Observability, because they are closely related.

This chapter explores the architectural components and functionalities that make comprehensive network observability possible. Starting with the goals, it breaks down the key functionalities and outlines the considerations that have to be taken into account when designing an observability solution. Whether you choose traditional on-premises platforms, cloud-native SaaS solutions, or composable open-source stacks, understanding the underlying architectural patterns and tradeoffs is essential for building systems that scale with your network and automation strategy.

Disclaimer: In this section, I mention different vendors and solutions as examples. These are not recommendations; they are included purely for explanatory purposes.

This section is heavily influenced by the book Modern Network Observability (Packt), which I coauthored with David Flores and Josh Vanderaa. If you want to go hands-on and learn an implementation with the TPG (Telegraf-Prometheus-Grafana) stack and other tools, I definitely recommend it.

6.1. Fundamentals#

Before going into the lower-level details, here we establish the foundations of network observability within the network automation strategy, defining its goals, supporting pillars, and scope.

6.1.1. Context#

Observability is not entirely new for network engineers. What was traditionally known as network monitoring forms a part of it, but with the requirements of automated solutions, it must advance significantly, improving in multiple dimensions. This may seem abstract if you are accustomed to monitoring networks with single applications that attempt to do everything (i.e., the traditional approach). These products, while a great starting point, impose constraints that can limit the potential of network observability.

The traditional network monitoring approach has several limitations that have become evident with the rise of new network environments and infrastructure management practices (e.g., DevOps). These limitations have spurred the development of better ways to monitor networks, transforming observability from a passive, reactive function into a key enabler of network automation solutions. This is often called “closed-loop” because it creates a circular feedback mechanism: observe → detect → respond → verify → observe again.

In this chapter, you will explore the inner workings of network monitoring applications and learn how these can be disaggregated to integrate solutions that perform specific tasks more effectively, applying an architectural approach.

As a simple comparison of both terms:

  • Network Monitoring: Measuring the performance and availability of network infrastructure.
  • Network Observability: Ensuring the user’s experience (on top of network services) aligns with expectations.

Going this step further raises the bar significantly, requiring more complete (and complex) data management, from collection through processing to analysis, to deliver actionable insights that reflect the actual user or business perspective.

One of the fundamental decisions that you will face early in the process is Monolithic vs. Composable architecture. It is not a simple one because it comes with many tradeoffs, but this should give you a first direction:

  • Traditional Monolithic (e.g., SolarWinds, LibreNMS): On-premises integrated single-vendor solution with pre-built features. Best for smaller networks with limited team expertise and existing datacenter infrastructure, but creates vendor lock-in and limits customization.
  • Cloud-Native SaaS (e.g., Datadog, New Relic): Fully managed observability platforms with integrated metrics, logs, traces, and AI-powered analytics. Offer rapid deployment, automatic scaling, and minimal operational overhead but come with ongoing subscription costs tied to data volume and may have data sovereignty constraints. Strong network monitoring capabilities through agents and integrations with major vendors.
  • Composable/Open-Source (e.g., TPG stack, Grafana Stack, ELK stack): Best-of-breed components you integrate yourself or rely on managed services by vendors. Offers maximum flexibility, cost control for large-scale deployments, and no vendor lock-in, but requires significant expertise and operational overhead. Best for organizations with DevOps practices and engineering resources.

This simple decision leads to many consequences, so choose carefully, understanding your own use case and actual environment constraints.

After that, there are two other important questions that apply here, and you will notice them throughout the analysis of all the components:

  • Deployment Models: Choose between on-premises (full control, data sovereignty), SaaS/Cloud (reduced operations, elastic scaling), or hybrid approaches based on your security requirements, operational capabilities, and cost model.
  • Key Cost Factors: Consider licensing (per device/metric), infrastructure (compute, storage, bandwidth), operational overhead (personnel, training), and data retention costs that grow with time and cardinality.

These are just a few of the key points to keep in mind as you start a design process for observability. More capable solutions usually imply more complexity to handle, so you have to clearly understand the purpose of this transformation.

6.1.2. Goals#

To provide you with a structured understanding of the goals of the network observability block, this list introduces the seven goals that modern automated network operations expect from it:

  1. Observe the entire network with minimal human effort. All network devices and services should be observed automatically upon connection to the network, without requiring manual registration or configuration. A newly connected network device should be properly monitored immediately, with all necessary adjustments applied automatically.
  2. Support heterogeneous network environments with sufficient data and accuracy. Modern networks are more diverse than ever, encompassing various vendors, platforms, technologies, and environments. Observability must encompass all these elements with appropriate data collection. Moreover, collecting data every 300 seconds (the traditional SNMP interval) is insufficient for modern demands. Today, near-real-time data collection with minimal latency is essential.
  3. Observe data from different IT layers with context. With IT infrastructure converging under DevOps practices, network observability can no longer exist in isolation. Data from heterogeneous sources must be comparable to enable correlation across layers. Additionally, adding contextual metadata enriches analysis and supports better decision-making.
  4. Handle massive-scale network scenarios. Networks continue to grow, driven by application demands. With the current AI boom, this growth is accelerating. Managing observation and analysis of massive numbers of components requires systems designed for scale from the ground up.
  5. Offer access to observability data for sophisticated analysis in near real-time, including historical data. Modern observability systems must extract meaningful insights from data through complex queries and analytics. This capability must span both real-time streams and historical datasets to support comprehensive analysis and trend detection.
  6. Be proactive in detecting network issues and reducing time to recovery. Unlike traditional monitoring, observability within network automation is designed to enable closed-loop systems that automatically detect and respond to deviations. The goal is autonomous remediation; human intervention is necessary only when automation cannot resolve the issue. In such cases, observability must provide comprehensive troubleshooting information (not just raw data), but intelligent context to guide resolution.
  7. Create tailored user-oriented visualizations to support decision-making. Data is only valuable when it drives decisions. Observability must present information across multiple perspectives, from high-level summaries to detailed technical metrics, all accessible, highlighted, and formatted to enable rapid comprehension and confident decision-making at all organizational levels.
graph TD

    %% --- Subgraphs ---
    subgraph Goals
        direction LR
        A1[Observe all the network with minimal human effort]
        A2[Support heterogeneous network environments with enough data and accuracy]
        A3[Observe data from different IT layers with context]
        A4[Handle massive-scale network scenarios]
        A5[Offer access to observability data for sophisticated analysis in near real-time]
        A6[Be proactive to detect network issues and reduce time to recover]
        A7[Create tailored user-oriented visualizations]
    end


    %% --- Row gradient classes ---
    classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
    classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
    classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
    classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
    classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
    classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;
    classDef row7 fill:#66b2ff,stroke:#4a90e2,stroke-width:1px;

    %% --- Apply classes per row ---
    class A1 row1;
    class A2 row2;
    class A3 row3;
    class A4 row4;
    class A5 row5;
    class A6 row6;
    class A7 row7;

With these goals, the next step is understanding which requirements the solution has to offer.

6.1.3. Pillars#

Each goal requires specific architectural capabilities. The following pillars describe the foundational design patterns and technical capabilities that observability solutions must provide:

  1. Close integration with the SoT to understand what needs to be monitored. Effective observability begins with knowing what to observe. This requires integration with the Source of Truth (SoT) to automatically discover devices, services, their configurations, credentials, and communication protocols. The observability system should bring monitoring capabilities online automatically as new infrastructure is registered in the SoT, eliminating manual configuration overhead.
  2. Ability to collect data via different protocols supporting very frequent/on-demand updates. Traditional network monitoring protocols impose significant limitations on collection frequency and data richness. Observability requires support for multiple collection methods and protocols, from streaming telemetry to on-demand pulls, each optimized for specific platforms and use cases.
  3. Normalization of heterogeneous data with contextual metadata for richer analysis. Network environments generate diverse data in many different formats, sometimes representing the same concept in different ways. To enable comparison and correlation across sources, we need a common schema. In addition, to multiply the analytical value of raw metrics, we need contextual enrichment (adding device roles, business context, relationships).
  4. Scalable data distribution systems to support scale-out architectures. Traditional sequential data pipelines become bottlenecks at scale. Observability systems must use distributed, asynchronous architectures (e.g., message queues, streaming platforms) to decouple data collection from processing, allowing independent scaling of each stage. This is essential for high-volume environments but may not apply to all deployments.
  5. A persistence layer that supports efficient time-travelling and powerful query languages. Observability data is fundamentally time-series in nature. Storage systems must be optimized for efficient time-series ingestion and queries, with rich query languages enabling analytics, aggregations, and complex pattern matching. This allows alerts to trigger intelligently and dashboards to render efficiently even across large historical windows.
  6. Flexible rule definitions and routing scenarios with strong integration with external systems. Raw data must be transformed into actionable events. This requires flexible, expressive rule engines to define thresholds, anomalies, and conditions. Triggered events should flow to external systems: automation orchestrators, incident management platforms, or escalation workflows, enabling end-to-end closed-loop automation or human-in-the-loop response.
  7. Custom visualizations and integration of multiple data stores. Not all observability needs trigger events; much of the data is valuable for understanding status and trends. Visualization components must support flexible dashboard design, connecting to multiple data stores, rendering diverse metric types, and presenting information at appropriate abstraction levels for different audiences.
graph LR

    %% --- Subgraphs ---
    subgraph Goals
        direction TB
        A1[Observe all the network with minimal human effort]
        A2[Support heterogeneous network environments with enough data and accuracy]
        A3[Observe data from different IT layers with context]
        A4[Handle massive-scale network scenarios]
        A5[Offer access to observability data for sophisticated analysis in near real-time]
        A6[Be proactive to detect network issues and reduce time to recover]
        A7[Create tailored user-oriented visualizations]
    end

    subgraph Pillars
        direction TB
        B1[Close integration with SoT to understand what needs to be monitored]
        B2[Ability to collect data via different protocols supporting very frequent/on-demand updates]
        B3[Normalization of heterogeneous data with contextual metadata for richer analysis]
        B4[Scalable data distribution systems to support scale-out architectures]
        B5[Persistence layer supporting time-series data and powerful query languages]
        B6[Flexible rule definitions and routing scenarios with external system integration]
        B7[Custom visualizations and integration of multiple data stores]
    end


    %% --- Row connections ---
    A1 --> B1
    A2 --> B2
    A3 --> B3
    A4 --> B4
    A5 --> B5
    A6 --> B6
    A7 --> B7

    %% --- Row gradient classes ---
    classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
    classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
    classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
    classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
    classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
    classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;
    classDef row7 fill:#66b2ff,stroke:#4a90e2,stroke-width:1px;

    %% --- Apply classes per row ---
    class A1,B1 row1;
    class A2,B2 row2;
    class A3,B3 row3;
    class A4,B4 row4;
    class A5,B5 row5;
    class A6,B6 row6;
    class A7,B7 row7;


Finally, before detailing the seven functionalities that realize these pillars, let’s clarify what falls within Observability’s scope.

6.1.4. Scope#

To extend the goals introduced above, there are other points that also belong within Observability’s responsibilities:

  • Different levels of observation adapted to users’ perspectives (technical, operational, business)
  • Integration with CI/CD pipelines, providing feedback for automated testing and validation
  • Observability of the automation system itself (meta-monitoring of collectors, processors, and alerting systems)

However, on the other side, there are functions that belong to other components of the architecture:

  • Defining network intent: What the network should look like (Intent/SoT responsibility)
  • Executing network changes: Actually implementing remediation (Executor responsibility)
  • Orchestrating complex workflows: Coordinating multi-step remediation across multiple systems (Orchestrator responsibility)

This clear boundary ensures Observability focuses on detection and insights, while other building blocks handle intent definition, execution, and orchestration.

Transitioning to modern observability requires careful planning. Do not take it as something as simple as replacing one monitoring tool with another; you have to transform your mindset. With more power comes more responsibility, and you will have more to choose and adjust.

After this initial introduction that exposes the key concepts related to the Observability block, let’s go into each functionality next.

6.2. Functionalities#

The seven goals and pillars are realized through seven core functionalities. Each functionality maps to a goal and its supporting pillar, creating a direct chain from business requirements to technical implementation:

  1. Inventory: Consumes intent from the SoT and provides the metadata, device lists, and collection targets to all downstream components.
  2. Collector: Retrieves observed data from the network using multiple protocols and collection methods, both pull-based (polling) and push-based (streaming).
  3. Processor: Normalizes heterogeneous data into a common schema and enriches it with contextual metadata (tags, relationships, business context), besides doing other data operations.
  4. Distribution: Decouples data producers from consumers using distributed, asynchronous patterns. Moves data and events reliably from collectors through processors to persistence and alerting systems.
  5. Persistency: Stores normalized data in databases optimized for efficient ingestion, retention, and querying at scale.
  6. Alerting: Analyzes persisted data using flexible rules and thresholds to detect conditions of interest, generating events that trigger external systems (automation or human notifications).
  7. Visualization: Renders observed data and triggered events into dashboards, reports, and other visual interfaces tailored to different user audiences and use cases.
graph LR

    %% --- Subgraphs ---
    subgraph Goals
        direction TB
        A1[Observe all the network with minimal human effort]
        A2[Support heterogeneous network environments with enough data and accuracy]
        A3[Observe data from different IT layers with context]
        A4[Handle massive-scale network scenarios]
        A5[Offer access to observability data for sophisticated analysis in near real-time]
        A6[Be proactive to detect network issues and reduce time to recover]
        A7[Create tailored user-oriented visualizations]
    end

    subgraph Pillars
        direction TB
        B1[Close integration with SoT to understand what needs to be monitored]
        B2[Ability to collect data via different protocols supporting very frequent/on-demand updates]
        B3[Normalization of heterogeneous data with contextual metadata for richer analysis]
        B4[Scalable data distribution systems to support scale-out architectures]
        B5[Persistence layer supporting time-series data and powerful query languages]
        B6[Flexible rule definitions and routing scenarios with external system integration]
        B7[Custom visualizations and integration of multiple data stores]
    end

    subgraph Functionalities
        direction TB
        C1[Inventory]
        C2[Collector]
        C3[Processor]
        C4[Distribution]
        C5[Persistence]
        C6[Alerting]
        C7[Visualization]
    end


    %% --- Row connections ---
    A1 --> B1 --> C1
    A2 --> B2 --> C2
    A3 --> B3 --> C3
    A4 --> B4 --> C4
    A5 --> B5 --> C5
    A6 --> B6 --> C6
    A7 --> B7 --> C7

    %% --- Row gradient classes ---
    classDef row1 fill:#eef7ff,stroke:#4a90e2,stroke-width:1px;
    classDef row2 fill:#ddeeff,stroke:#4a90e2,stroke-width:1px;
    classDef row3 fill:#cce5ff,stroke:#4a90e2,stroke-width:1px;
    classDef row4 fill:#b3d8ff,stroke:#4a90e2,stroke-width:1px;
    classDef row5 fill:#99ccff,stroke:#4a90e2,stroke-width:1px;
    classDef row6 fill:#80bfff,stroke:#4a90e2,stroke-width:1px;
    classDef row7 fill:#66b2ff,stroke:#4a90e2,stroke-width:1px;

    %% --- Apply classes per row ---
    class A1,B1,C1 row1;
    class A2,B2,C2 row2;
    class A3,B3,C3 row3;
    class A4,B4,C4 row4;
    class A5,B5,C5 row5;
    class A6,B6,C6 row6;
    class A7,B7,C7 row7;

These components can be seen as a data pipeline or ETL (Extract, Transform, and Load) process, as shown in the following diagram:

flowchart TB
    A[Network] --> B[Collector]

    subgraph Observability
       direction LR
       B --> C[Distribution] -->  D[Persistence]
       B -.-> P[Processing]
       C -.-> P
       D -.-> P
       D --> E[Alerting]
       E -.-> P
       D --> G[Visualization]
       X[Inventory] -.-> B
       X -.-> G
       X -.-> P
    end

    E -.-> F[Orchestration]
    G -.-> H[Presentation]
    Y[SoT] -.-> X

Figure 1 — Observability Pipeline.

6.2.1. Inventory#

Goal: The Inventory component should provide an automated process to identify the targets to be observed, either passively (by receiving data) or actively (by requesting data). This requires managing information that comes from the network Intent (e.g., IP address, credentials, etc.) and adapting the data to be consumed by the observability functions.

Why is this so relevant? First, it removes human error from the equation (e.g., forgetting to update a list of devices). More importantly, by leveraging the DRY principle, reusing data defined in one place is faster and more reliable.

In the network intent there is a lot of data; from that, we need some basic information:

  • A unique name or identifier to facilitate unequivocal identification of the target device or service.
  • An IP address or FQDN, and the credentials to connect with, if the target is not automatically sending the data.
  • The type (including vendor information) and/or role of the service to allow customizing what has to be observed.
  • Status: to identify if the service is active, planned, or under maintenance and act accordingly.

On top of this basic information, extra data may be relevant depending on the use case:

  • More device-specific data, such as the OS, that would limit the data to collect.
  • Other contextual data, such as the owner or the location, so you can apply logic to adjust the observability profile.

Notice that the inventory information is relevant both when we connect to retrieve the data and when the network sends the data directly from its configuration. For example, when receiving logs, recognizing the source and applying the proper parser, or using the intent data to configure a side application that registers automatically to a central collector engine.

As reflected in Figure 1, the inventory data is already available in the Intent building block, so you just need to retrieve it. And we have two main options here:

  • Periodic pulls that refresh the observability inventory information at intervals. This is simpler, but data may be out of sync during the interval. It also has some scalability limitations as the size of the network and the amount of data required grow.
  • Event-based updates: when data changes in the Intent block, it would automatically signal Observability to take that into account via synchronous calls (e.g., webhooks) or asynchronously (e.g., message bus). The dynamic updates allow you to react to devices being added, removed, or having their IPs changed.

You can infer how relevant it is to have good data in the Intent. A wrong IP or bad credentials means a network device or service goes unobserved.

Once the inventory information is available, it has to be injected into the collector layer to start retrieving data or be ready to receive it. There are two basic approaches:

  • Static configuration: rendering configuration and restarting the collectors (obviously, via automation). This is the most basic option and should be used only when the next option is not available.
  • Dynamic configuration: via file locations or HTTP endpoints. This makes the collector autonomous, able to retrieve its configuration from a service discovery integration, like the Prometheus discovery endpoints or services like HashiCorp Consul (see the sketch below).
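As a minimal sketch of the dynamic approach, assuming the SoT (or a thin adapter in front of it) exposes an HTTP endpoint that returns targets in the Prometheus HTTP service discovery format (the URL, port, labels, and refresh interval below are placeholders):

# Prometheus scrape job fed dynamically from a SoT-backed HTTP SD endpoint.
# The endpoint is expected to return the standard HTTP SD payload, e.g.:
# [{"targets": ["leaf01:9273"], "labels": {"role": "leaf", "site": "BCN0001"}}]
scrape_configs:
  - job_name: "network-devices"
    http_sd_configs:
      - url: "https://sot.example.com/api/observability/targets"
        refresh_interval: 5m

With this pattern, adding a device to the SoT brings it under observation at the next refresh, without touching the collector configuration.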

Even though the collector feed is the most important outcome of the inventory layer, it is also necessary for other goals, such as data enrichment via Processing or customization in Visualization.

In short, in a network automation strategy, the inventory for Observability should not be built manually. The information is already there, and reusing it enables zero-touch provisioning and more reliable monitoring.

6.2.2. Collectors#

Goal: The Collector component is in charge of ingesting heterogeneous observed data into the observability pipeline with the necessary coverage and accuracy.

As anticipated in the inventory block, it can take two different approaches:

  • Passive: The collector listens to the data that is being sent from the network service or device. So, it’s the target that has the responsibility to choose the data to be observed and send it to the collector. A classic example is logs sent to syslog collectors.
  • Active: The collector takes the lead to connect to the network device or service to request the data, either directly or via subscription. Another classic example is SNMP GET pulls.

Another classification approach is between agent-based and agentless approaches.

  • Agentless: there is no separate, configurable piece of software dedicated to data collection. In this case, the collector (externally) initiates the data request, although the device can also take the active role when it uses a dial-out streaming approach.
  • Agent-based: there is a software piece running alongside the network service, offering data collection customization. The agent itself is part of the collector, in a distributed way.

Independently of the collection method, it’s important to highlight the different types of data that may be of interest in the network automation environment. I classify it into four broad categories:

  • Management plane: Used to manage the state of the device and to read data about the configuration, logging, or network statistics. Protocols in this group are SNMP, Syslog, gNMI, NETCONF, and RESTCONF.
  • Control plane: This is where the distributed protocols that determine the packet forwarding of the network, such as layer 2 or layer 3 forwarding tables, run. A few examples of control plane protocols are OSPF, IS-IS, and BGP. This plane can be observed via techniques such as Ping or Traceroute, or telemetry protocols such as BMP.
  • Forwarding plane: This plane is where the packets are moved (e.g., network interfaces), and it is the most demanding in terms of data volume and velocity. Naturally, when observing it, it’s also crucial to not impact the primary goal of the plane, which is to forward packets. In this group, we have tools such as TcpDump, IPFIX, sFlow, Netflow, Cisco SLA, PSAMP, and eBPF.
  • External data: This category includes everything that is not network device-specific. For instance, circuit provider information and the contact for a given interface, coming from an external asset manager system or physical Internet of Things (IoT) sensors, could fit into this broad field.
flowchart TB
    subgraph Network Device/Service
        direction TB
        A[Management Plane]
        B[Control Plane]
        C[Forwarding Plane]
        A --> B
        B --> C
    end

    D[External data]

Figure 2 — Scope of Collector.

Even though it should always be the last resort, CLI scraping still remains an option to get data from the management and control planes.

So, as you can see in the previous classification, there are two main questions to solve:

  • What to get: data (e.g., metrics, logs, flows, etc.) and its data models. Getting a reusable data model that could help to correlate data from different implementations has been the holy grail since the SNMP MIB days. Initiatives like OpenConfig and, more recently, OpenTelemetry have tried to solve this, but there is no universal solution yet, so you may need to develop your own normalized data model (more in the next section).
  • How to get it: the protocols and tools, which differ depending on the data.

A high-level view of the data we could collect from the network (ordered from less to more network-specific):

| Data Type | Protocols / Collection Methods | Notes / Examples |
|---|---|---|
| Metrics | SNMP, HTTP scraping, CLI polling, OpenTelemetry (OTLP), Streaming telemetry (gNMI) | Device metrics, host metrics, application metrics |
| Logs | OpenTelemetry (OTLP), file tailing, syslog | Application logs, system logs, structured logs |
| Traces | OpenTelemetry (OTLP) | Distributed tracing across services |
| Network Flows | NetFlow, IPFIX | Traffic flows, source/destination analysis |
| Protocol-specific | BMP, BGP, ARP, OSPF | BGP monitoring (BMP), ARP tables, BGP tables, OSPF tables |
| Packet Captures | PCAP (libpcap), SPAN / TAP | Full packet inspection, deep troubleshooting |

Table 1 — Data and Protocols to collect.

To get a deeper understanding of the different options, refer to the Modern Network Observability book.

Network data collection applies to many different network environments: networking gear in the data center and backbone, cloud network services, Linux kernel networking, or raw packet transmission. All of it is relevant for networking, depending on the scenario.

This leads to the key question of WHAT to collect and HOW to collect it. To answer it, start from the end: define the problem you are trying to solve with this data. This will guide your design process.

As you have seen in the previous table, the traditional monitoring protocols are still available (e.g., SNMP, Syslog, or NetFlow), but the limitations of poll-based, low-frequency data collection (which caps the achievable precision) and of push-based solutions (which have limited capabilities or very low sampling capacity) have led to new trends in data collection:

Streaming telemetry

It implements a push model to stream data from the network devices to the collector continuously. The goal is to provide near real-time access to the actual operational/configuration data from the devices (usually, using YANG-defined data models, but also via other transport options like JSON-RPC). There are two modes, but in both cases, it’s possible to define the subscription (this term refers to establishing a streaming telemetry session) to either send data at regular intervals or to only send data when the data changes.

  • Dial-In: the network device receives a subscription request from the collector (e.g., active collector)
  • Dial-Out: the network device has configured the subscription to the collector (e.g., passive collector)
flowchart TB
    A[Collector]
    B[Device]
    A -.->|Dial-In| B
    B -->|Streaming| A
    B -.->|Dial-Out| A

Figure 3 — Streaming Telemetry.
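As an illustration, a hedged sketch of a gNMIc configuration file establishing a dial-in sample subscription and exposing the results for scraping; the target address, credentials, paths, and output port are placeholders:

# gNMIc collector sketch: subscribe to interface counters every 10 seconds
# and expose the resulting telemetry as Prometheus metrics.
targets:
  "leaf01.example.com:57400":
    username: admin
    password: admin
    insecure: true
subscriptions:
  interface-counters:
    paths:
      - /interfaces/interface/state/counters
    stream-mode: sample
    sample-interval: 10s
outputs:
  prom-output:
    type: prometheus
    listen: ":9804"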

HTTP-exposed metrics

In network monitoring, scraping (collecting pull-based data) is a really common pattern because it’s simple, scalable, and works well with both “infrastructure” metrics and synthetic checks. SNMP is a popular example. But in the IT realm, a popular trend since Prometheus’s inception is HTTP scraping. Because of this, more and more network operating systems (NOS) can expose metrics directly over HTTP in the Prometheus format, for example SONiC, NVIDIA Cumulus, or Arista EOS.

| Vendor / OS | Metric Type | Example Metric |
|---|---|---|
| SONiC | Interface traffic | sonic_interface_rx_bytes_total{interface="Ethernet32"} 1.234e+12 |
| NVIDIA Cumulus | Interface traffic | node_network_receive_bytes_total{device="swp1"} 9.21e+10 |
| Arista EOS | Interface traffic | arista_interface_in_octets_total{interface="Ethernet1"} 8.3e+11 |

Table 2 — HTTP-exposed metrics.

The scraping approach provides low latency and near-real-time metrics, rich labels, and pull-based collection (central control of rate/timeout), connecting well with cloud-scale observability.

OpenTelemetry

OpenTelemetry is a vendor-neutral standard and toolkit for collecting, processing, and exporting telemetry data. Think about it as a common telemetry language and pipeline that unifies metrics, logs, and traces across networks, systems, and applications.

It does not replace network protocols like SNMP, NetFlow, gNMI, or BMP. Instead, it standardizes how telemetry is represented and transported after collection.

In traditional network monitoring, the data models are diverse and vendor-specific in schema and naming, which makes it hard to correlate across layers (network ↔ system ↔ application). In contrast, OpenTelemetry helps by providing:

  • A common data model for metrics, logs, and traces
  • A standard transport protocol (OTLP), over gRPC or HTTP
  • A single processing pipeline for multiple signal types

Grafana Alloy and Telegraf are examples of collectors implementing OTLP. They collect data from different exporters and export it to different backends, such as metrics (Prometheus-compatible TSDBs), logs (Loki, Elasticsearch, ClickHouse), and traces (Tempo, Jaeger).

And this brings us to the final consideration: the common structure of modern pluggable collectors, with Input, Processor, and Output stages. For example, in Telegraf, OTLP is available as an output plugin.

Collector Architecture

Simplistically, every collector can be broken down into three parts (sometimes these are pluggable, other times more hardcoded), as illustrated by the configuration sketch after the list below.

flowchart LR
    A[INPUT] --> B[PROCESSOR] --> C[OUTPUT]

Figure 4 — Collector's Architecture.

  • Input: It defines what has to be observed and under which parameters.
  • Processor: Even though it is optional, it’s very convenient to apply processing as soon as data enters the pipeline to ensure data structure consistency. Processing can become very complex and may impact performance at scale, so not all of it has to be done at this level.
  • Output: Defines how the collector moves the data into the pipeline. It may send it directly to other blocks, like processing or persistence, or use the distribution component to scale.
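As a hedged sketch of this three-stage pattern, a minimal OpenTelemetry Collector pipeline (the backend endpoint is an assumption, not a recommendation):

# Input: receive OTLP data over gRPC/HTTP.
receivers:
  otlp:
    protocols:
      grpc:
      http:
# Processor: batch data points before exporting to reduce overhead.
processors:
  batch:
# Output: write metrics to a Prometheus-compatible remote-write backend.
exporters:
  prometheusremotewrite:
    endpoint: "https://metrics.example.com/api/v1/push"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]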

There are many collectors (each one with different capabilities), such as Telegraf, Grafana Alloy, gNMIc, PMACCT, goflow, etc., but they all use a similar architecture. So when choosing one (sometimes you may need several), consider:

  1. Device capabilities: what protocols do your devices support?
  2. Data volume: high-volume needs streaming; low-volume can use polling.
  3. Latency requirements: near real-time vs. traditional intervals.
  4. Team skills and ecosystem fit with your backend.

After the data is collected, the next step, already introduced above, is the Processor, which manipulates the data at different stages of the pipeline.

6.2.3. Processor#

Goal: Once data is received, it must adhere to shared standards in order to converge signals from multiple sources, correlate them, and enrich them with additional context. Without this processing step, observability pipelines quickly become fragmented, difficult to query, and expensive to operate. There are multiple opportunities to process data along the pipeline, depending on scale, complexity, and operational requirements.

The following are common processing actions that apply to observability pipelines.

6.2.3.1. Normalization/Transformation#

Normalization ensures that data coming from different sources is processable, comparable, and semantically correct. This step aligns structure, naming, units, and semantics so downstream systems can reason over the data consistently.

Key normalization functions include:

Structuring
Raw telemetry often arrives in formats that are easy for humans or devices to emit, but difficult for machines to analyze at scale. Structuring converts this data into machine-friendly representations; a parsing sketch follows the examples below.

There is no unique solution for all cases (yet), even though there are some initiatives trying to standardize them (e.g., OpenTelemetry).

  • log-based:

    Mar 18 14:22:11 leaf01 IFACE-5-STATE: swp1 oper-state changed from UP to DOWN

    to

    {
      "timestamp": "2025-03-18T14:22:11Z",
      "level": "INFO",
      "device": "leaf01",
      "component": "interface",
      "event": "oper_state_change",
      "interface": "swp1",
      "previous_state": "UP",
      "current_state": "DOWN"
    }
  • metric-based: <metric name>{<labels>} <value>

    interface_admin_state{hostname="leaf01", ifname="swp1"} 1
    interface_oper_state{hostname="leaf01", ifname="swp1"} 0
    interface_speed_bps{hostname="leaf01", ifname="swp1"} 100000000000
    interface_in_errors_total{hostname="leaf01", ifname="swp1"} 0
    interface_out_errors_total{hostname="leaf01", ifname="swp1"} 12
  • Table-based: some tools (e.g., Suzieq) organize data into tabular, time-indexed state views:

    | hostname | ifname | adminState | operState | speed | inErrors | outErrors | timestamp |
    |----------|--------|------------|-----------|-------|----------|-----------|-----------|
    | leaf01   | swp1   | up         | up        | 100G  | 0        | 0         | t1        |
    | leaf01   | swp1   | up         | down      | 100G  | 0        | 12        | t2        |
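Building on the log-based example above, a hedged sketch of how this structuring could be applied at ingest time with Promtail pipeline stages (the regular expression and the labels promoted are illustrative assumptions that depend on your log format):

# Parse the interface state-change message into fields and promote a few
# of them to labels for later filtering and correlation.
pipeline_stages:
  - regex:
      expression: '(?P<device>\S+) IFACE-5-STATE: (?P<interface>\S+) oper-state changed from (?P<previous_state>\S+) to (?P<current_state>\S+)'
  - labels:
      device:
      interface:
      current_state: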

Renaming and semantic alignment

Different telemetry sources describe the same concept using different names, paths, and label conventions. For example:

OpenConfig: /interfaces/interface/state/oper-status value: UP tags: source=192.0.2.1 and interface_name=eth1
SNMP: ifOperStatus{ifName="GigabitEthernet0/1", device="router01"} 1
Native Prometheus: interface_oper_state{interface="swp1", host="leaf01"} 1

Normalization aligns them into a consistent model, including the object name, label renaming, and the value (using the same unit conversion):

intf_oper_state{name="eth1", device="192.0.2.1"} 1
intf_oper_state{name="GigabitEthernet0/1", device="router01"} 1
intf_oper_state{name="swp1", device="leaf01"} 1

6.2.3.2. Enrichment#

Enrichment adds extra content to the observability data beyond what is actually observed. These extra dimensions added to the data allow more sophisticated data consumption. For example, you may be able to understand that the metrics belong to a device that plays a specific role in the network and act accordingly.

There are two main approaches to enrichment:

  • Extending data: Adding extra metadata or labels to the observed data to complement it. This data could be static (e.g., org=my-company) to mark all your data, dynamic based on collection context (e.g., collector_id=1234), or dynamic based on the observed data itself (e.g., given hostname=rtr-1, create a label location=BCN-01 by correlating with the SoT).

    intf_oper_state{
        name="swp1", 
        device="leaf01",
        role="leaf",
        location="BCN0001"
    } 1
  • Creating new data: Following the Prometheus ecosystem’s “info metrics” pattern, we can generate metrics that do not represent actual state but intended state. These metrics are useful in later observability pipeline stages to add more dimensions to analysis, as you will discover in the Alerting section.

    device_info{
        name="leaf1",
        role="leaf",
        vendor="arista",
        model="7050SX3",
        platform="eos",
        os_version="4.29.2F",
        location="BCN0001",
        rack="AB1",
        rack_unit="U32",
        environment="prod"
    } 1

    Info metrics are a curious type of data: the relevant information is not in the value (e.g., the 1 in the previous metric) but in the labels. This trick allows reusing TSDBs that do not support certain value types (like strings).

In both cases, we are adding some relevant labels: role and location, which we could use to create more capable queries for alerts or analysis. But as you can imagine, this comes with a cost. When dealing with enrichment, keep this in mind:

  • Cardinality: Enrichment increases metric cardinality because every added label multiplies the number of unique time series, and network metrics already start from a high-cardinality baseline (devices × interfaces × metrics). Adding static or dynamic context indiscriminately can quickly degrade storage efficiency and query performance.
  • Update frequency: the info metrics are contextual information that, in many cases, do not change every second, or even every month. For example, the rack a device is placed in, or the management IP, are not attributes that change frequently, so choose the right level of polling frequency or, ideally, implement an event-based approach.
  • Failure scenarios: Enrichment is an extra, but it may become core for your automation strategy. Make it as resilient as possible so that if the external source goes down you can still operate with some degradation. For example, using caching mechanisms to reduce the bottleneck on your SoT could produce similar outcomes with fewer issues.
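As a small illustration of the static “extending data” case applied in the pipeline itself, a hedged sketch using the OpenTelemetry Collector resource processor (the attribute names and values are assumptions):

# Attach static organizational context to every data point passing through.
processors:
  resource/add-context:
    attributes:
      - key: org
        value: my-company
        action: insert
      - key: environment
        value: prod
        action: upsert

Dynamic enrichment against the SoT (e.g., deriving location from hostname) usually needs a dedicated processing service or a query-time join against info metrics, as shown in the next subsection.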

6.2.3.3. Transformation / Derivation / Aggregation#

Transformation/Derivation generates data from existing data to simplify its usage. For example, you may want to unify all interface input bits per second (bps) for fabric devices by combining data from devices with the role “leaf” or “spine”, to use later in dashboards or other data queries. In this case, an aggregator could create a new metric fabric:interface:in_bps that leverages the already available interface_in_octets_total metric (and metadata extracted from info metrics for the matching hostname and name):

- record: fabric:interface:in_bps
  expr: |
    (
      sum by (hostname, name) (
        rate(interface_in_octets_total[5m])
      ) * 8
    )
    * on (hostname, name) group_left (fabric, role)
    sot_interface_info{role=~"leaf|spine"}

In the Prometheus ecosystem, this is known as recording rules.

Another processing functionality for reducing the amount of data is aggregation: reducing dimensionality according to the final usage of the data, for example by summarizing interface information per device, or device information per site. Similarly, creating rate calculations from counter metrics, or histogram bucketing, can be useful for many analyses.

6.2.3.4. Filtering#

Not all the data collected is relevant or useful, so it is important to use filtering processors to keep pipelines clean and efficient and avoid wasting resources.

For example, you may not be interested in all the logs coming from your devices, only those matching parameters such as criticality. The earlier you remove irrelevant data from the pipeline, the fewer resources you waste. You can decide what not to keep at a general level (drop unused metrics, or use allowlists) or within a specific data object (drop high-cardinality labels, or remove sensitive metadata).
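As a hedged sketch of early log filtering, a Promtail drop stage that discards low-severity device logs before they are shipped (the severity pattern is an assumption about your log format); the same idea applies to metrics via relabeling with action: drop:

# Drop debug/informational device logs early in the pipeline.
pipeline_stages:
  - drop:
      expression: "severity=(debug|informational)"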

6.2.3.5. Sampling / Throttling#

After filtering out undesired data, you may still want to limit the amount of desired data collected to protect the backends and, especially, to limit processing and storage costs. Similarly to filtering, sampling or throttling happens early in the data pipeline to reduce data granularity.

This allows controlling volume under load via probabilistic sampling (discarding some samples once we have enough accuracy), focusing on top-K metrics, or just rate-limiting per source for less relevant network components.

The rollup process implemented in the persistence layer to reduce data frequency as data ages is another example.

Finally, all these processors may be run at different stages of the observability pipeline, depending on the use case:

  • Collector: Best for lightweight, early normalization and filtering
  • Dedicated processor: Required at scale for dynamic enrichment and complex transformations
  • Persistence layer: Suitable for recording rules and long-term rollups (Normalization should always happen before this)
  • Alerting layer: Derives events from stored data and applies business logic

In practice, effective observability pipelines distribute processing across layers, depending on tooling, scale, and operational constraints.

6.2.4. Distribution#

Goal: As observability systems scale, the volume, velocity, and diversity of telemetry data increase significantly. In this context, a simple linear or tightly coupled pipeline, where components are directly chained together, becomes difficult to operate and scale. Individual components often have different performance characteristics, failure modes, and scaling requirements, and coordinating them directly introduces complexity and fragility. In these cases, a distribution layer helps decouple these components and enables the system to scale reliably.

Why do direct, synchronous pipelines break down at scale?

In small or low-throughput environments, observability pipelines can communicate using synchronous connections, such as a collector writing metrics directly to a database or an agent pushing logs straight to a storage backend. While simple, this approach has several limitations at scale:

  • Backpressure propagation: If a downstream component slows down or becomes unavailable, upstream components are forced to block or drop data.
  • Tight coupling: Producers must be aware of consumer availability and performance.
  • Scaling challenges: Components must scale together, even if their workloads differ.
  • Fragile failure handling: Transient failures can result in data loss or cascading outages.

When additional processing stages (e.g., normalization, enrichment, aggregation, filtering) are introduced, these problems are amplified, so there is a need for a buffer and decoupling layer that:

  • Absorbs bursts in data volume
  • Decouples producers from consumers
  • Allows each stage to scale independently
  • Provides resilience to partial failures

This buffer becomes the distribution layer of the observability architecture. It is usually implemented with message brokers such as Apache Kafka or NATS, which introduce a publish-and-subscribe model where:

  • Many producers (agents, collectors, network devices, applications) publish telemetry data into topics or streams.
  • Many consumers (processors, aggregators, storage backends, alerting systems) independently consume the data at their own pace.
  • Producers and consumers are fully decoupled and do not block each other.
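For instance, a collector such as gNMIc can publish directly into a broker instead of writing to a backend; a hedged sketch of its Kafka output (broker addresses and topic are placeholders):

# Publish collected telemetry into a Kafka topic for downstream consumers.
outputs:
  kafka-telemetry:
    type: kafka
    address: "kafka-1:9092,kafka-2:9092"
    topic: "telemetry.gnmi"

Processors, persistence backends, and alerting engines can then subscribe to the topic independently, each at its own pace.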

The key benefits of using a distribution layer are:

  • Scalability: Producers and consumers scale independently, allowing each component to match its workload.
  • Fault tolerance: Temporary failures or slowdowns in consumers do not impact producers; data is buffered and replayed when consumers recover.
  • Backpressure management: Queues absorb traffic spikes and smooth ingestion rates without dropping data.
  • Fan-out and reuse: The same telemetry stream can feed multiple downstream systems (storage, alerting, analytics) without duplication at the source.
  • Operational simplicity: Components can be upgraded, restarted, or replaced independently.

You will learn more about this in Chapter 11 when discussing scalability and resiliency.

6.2.5. Persistency#

Goal: Once the data is finally ready after collection and processing, it has to be persisted to enable its later usage. At the heart of any observability platform lies the persistence layer: the databases where all data converges. Just as a well-designed network ensures optimal data flow, well-structured databases ensure seamless data storage and retrieval, allowing for historical analysis and real-time decision-making.

Observability databases must support high-ingestion, time-ordered, high-dimensional data while remaining fast, reliable, and cost-effective at scale. The database types discussed below share several characteristics that make them well suited for this role:

  • Time-centric data handling: Observability data is inherently time-based. These databases are optimized to efficiently store, query, and analyze timestamped data, enabling fast range queries, rollups, and time-windowed aggregations.
  • High-throughput, near–real-time performance: Observability systems continuously ingest large volumes of data. Databases must sustain high write rates while still supporting low-latency queries for dashboards, alerts, and interactive analysis.
  • Efficient metric storage and retrieval: Metrics are numeric and multi-dimensional, described by labels or tags. Suitable databases handle this model efficiently through compression, compaction, and optimized indexing, enabling fast aggregations and scans even at scale.
  • Support for structured and semi-structured data: Beyond metrics, observability includes logs, traces, and network telemetry that often take the form of structured or semi-structured records. These databases support flexible schemas and labeled data, enabling correlation and analysis across different signal types.
  • Data lifecycle and retention management: Observability generates massive data volumes, making lifecycle management critical. These systems support retention policies, rollups, aggregation, and deletion to control storage costs and maintain performance. Techniques such as downsampling allow high-frequency data to be retained in a more compact form over time.
  • Powerful query and access APIs: Observability relies on expressive query languages for aggregation, filtering, and exploration. Languages such as PromQL (and LogQL for logs) exemplify this need and have become widely supported, enabling consistent querying across different backends.
  • Schema flexibility with operational safety: Observability data evolves continuously. Flexible schemas allow new labels and fields to be introduced without costly migrations, while still preserving consistency and clarity for dashboards, alerts, and automation.
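As a combined illustration of lifecycle management and query power, a hedged sketch of a downsampling recording rule that keeps a compact 5-minute rollup for long-term retention (metric and label names follow the earlier examples):

groups:
  - name: rollups-5m
    interval: 5m
    rules:
      # Average per-device inbound bandwidth, evaluated every 5 minutes.
      - record: device:interface_in_bps:avg5m
        expr: avg by (device) (rate(interface_in_octets_total[5m]) * 8)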

Which database types are relevant to observability?

Observability systems deal with high-volume, high-cardinality, time-ordered data that must be ingested continuously and queried interactively. While there are many database paradigms, only a few are particularly well suited to these requirements. In practice, modern observability platforms rely on a combination of database types, each optimized for a specific access pattern.

The most relevant database types for observability are:

  1. Time-series databases (TSDB): Time-series databases are optimized for data where time is the primary dimension. Data points are typically append-only, arrive in chronological order, and are queried over time ranges. In the observability ecosystem, the time-series data model popularized by Prometheus has effectively become a de-facto standard. Its exposition format, label-based dimensional model, and query semantics (PromQL) are widely adopted well beyond Prometheus itself, influencing how metrics are produced, transported, stored, and queried across modern observability stacks.

    • Most observability signals, especially metrics and many forms of logs and traces, are naturally time-based. TSDBs are designed to:
      • Handle very high ingestion rates
      • Compress time-ordered data efficiently
      • Support fast range queries and aggregations over time windows
      • Provide native support for labels/tags (dimensions)
      • Offer fast queries such as rates, averages, percentiles, and rollups
      • Include built-in retention policies and downsampling
    • Typical observability use cases
      • Infrastructure and application metrics
      • Service-level indicators (SLIs)
      • Event-like logs stored as labeled time series
      • Trace-derived metrics
    • A few popular examples of this primary type: Prometheus, InfluxDB, VictoriaMetrics, Timescale or Loki (time-series indexed logs)
  2. Columnar databases: Columnar databases store data by column rather than by row, making them highly efficient for analytical queries that scan large datasets but only a subset of fields.

    • Observability data is often explored through aggregations, group-bys, and ad-hoc analytics across large time ranges. Columnar storage enables:
      • Extremely fast aggregations
      • Efficient compression for repetitive fields
      • Flexible and high-performance analytical queries without predefined schemas
      • Efficient group-by and filtering across many dimensions
      • Suitability for large-scale, long-term data retention
    • Typical observability use cases
      • Log and network flows analytics and exploration
      • Trace analysis at scale
      • Long-term historical analysis and capacity planning
      • Correlation across metrics, logs, and traces
    • Some notable examples of columnar approaches are ClickHouse or Apache Parquet (a columnar file format used by engines and tools such as Suzieq)
  3. Text-search databases: specialize in indexing and searching unstructured or semi-structured text, supporting full-text search, ranking, and complex filters.

    • Logs and events often contain free-form text that operators need to search quickly during incident response. Text-search databases excel at:
      • Keyword and pattern searches
      • Fast filtering across large log volumes
      • Interactive, exploratory DSL queries
      • Powerful full-text indexing and search
      • Good support for semi-structured data (e.g., logs)
    • The text-search databases are good for:
      • Application and system logs
      • Incident investigation and debugging
      • Error and exception analysis
      • Security and audit logs
    • Some popular examples of this database type are Elasticsearch or Splunk

But do you have to choose only one? No; no single database type optimally serves all observability workloads and use cases, so pick the combination that fits your own use case:

  • TSDBs are ideal for real-time monitoring and alerting.
  • Columnar databases excel at large-scale analytics and historical analysis.
  • Text-search databases provide fast, intuitive exploration of log data.

As a result, modern observability platforms often combine these systems or build layered architectures where:

  • Metrics (and logs) flow into TSDBs
  • Logs, packet flows, and traces are usually stored in columnar or search-oriented databases

This classification is not absolute; most tools have a primary classification but then implement some characteristics of others (especially time-series).

Among other considerations when selecting or designing an observability storage system, two closely related concepts are especially important:

  • Dimensionality: Dimensionality refers to the number of labels or attributes attached to a metric. Higher dimensionality provides richer context and more powerful analysis, but also increases storage requirements and query complexity.
  • Cardinality: Cardinality describes the number of unique values a label can take. High-cardinality labels can dramatically increase the number of time series or records stored, leading to higher storage costs and slower queries. Managing cardinality is one of the most critical challenges in observability system design.

We will dive deeper into scalability considerations in Chapter 11, but these two topics have a significant impact on this layer due to their effects on query performance. Moreover, as data volume grows, you may need to consider strategies such as sharding (splitting data across multiple instances) or federation (aggregating data).

Each database has its own characteristics that must be mapped to the use cases it needs to solve. For instance, Suzieq uses a columnar solution (Apache Parquet files) because the questions it tries to answer are relational rather than time-series based, for example: “Which routes exist on spines but not on all leaves?”

  • Requirements:
    • Filter across many attributes
    • Compare rows across devices
    • Join tables (interfaces, neighbors, routes)
    • Look at state at a point in time (not historical evolution)
  • Solution: This is what a columnar analytic solution is designed for. A TSDB could help with checking the number of routes, but identifying missing routes would require many labels, which is not its primary strength.

After all the data management, there are two final steps:

  • Create events for other automation to use or humans to intervene: Alerting
  • Visualize the data to provide information for decision-making: Visualization

6.2.6. Alerting#

Goal: At the vertex of the observability pyramid, the primary goal in a network automation strategy is to trigger other automation workflows via Orchestration. This does not mean that notifying humans is out of scope, but it should be the last resort for unsolvable cases (the less human intervention, the better).

There are different stages in the life of an alert:

  • Detection: Identifying Events from existing data (e.g., metrics, logs, flows, etc.). An Event is a representation of something that requires action.
  • Processing: As with data processing, after an event is created you may want to enrich it to add more context. Depending on the event’s processing and impact, it may be elevated to Alert level, meaning action is required. If no action is needed, it is simply stored for auditing and traceability purposes. At this stage, alert correlation is especially important to simplify alert management.
  • Routing: Once the alert is properly classified, it can be routed to the next steps (could be multiple). It may trigger an orchestration workflow to run automation, or notify a human. An example of a tool focusing on this layer is Alertmanager. This layer requires strong integration with other systems (for example, Slack for instant messaging or Alerta for management).
  • Escalation: If the alert cannot be resolved automatically and requires immediate human intervention, it becomes an Incident that must involve human judgment and management, such as silencing. A popular tool for incident management is PagerDuty.
flowchart LR
    A[Detection] --> B[Processing] --> C[Routing] --> D[Escalation]

Figure 5 — Alerting Stages.
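
To make the Processing and Routing stages more tangible, the following is a minimal, hypothetical Python sketch: an event is enriched with inventory context, classified, and then routed either to an orchestrator webhook or to a chat notification. The endpoints, field names, and classification logic are illustrative assumptions, not a reference implementation.

```python
import requests

# Hypothetical endpoints; replace with your orchestrator / chat integrations.
ORCHESTRATOR_WEBHOOK = "https://orchestrator.example.com/api/workflows/remediate"
CHAT_WEBHOOK = "https://chat.example.com/hooks/noc"


def process_event(event: dict, inventory: dict) -> dict:
    """Processing stage: enrich a raw event with inventory context and classify it."""
    device = inventory.get(event["device"], {})
    event["site"] = device.get("site", "unknown")
    event["role"] = device.get("role", "unknown")
    # Toy classification rule: only events on production devices become alerts.
    event["is_alert"] = device.get("status") == "production"
    return event


def route_alert(event: dict) -> None:
    """Routing stage: automation first, humans as a last resort."""
    if not event["is_alert"]:
        return  # stored elsewhere for auditing/traceability only
    if event.get("automation", True):
        requests.post(ORCHESTRATOR_WEBHOOK, json=event, timeout=5)
    else:
        requests.post(CHAT_WEBHOOK, json={"text": f"Alert: {event}"}, timeout=5)


inventory = {"leaf-01": {"site": "dc1", "role": "leaf", "status": "production"}}
route_alert(process_event({"device": "leaf-01", "type": "interface_down"}, inventory))
```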

Some common challenges in alerting are:

  • Alert fatigue: Too many alerts, often low-value or false positives, desensitize teams and reduce responsiveness.
  • Noise vs signal: Important alerts are easily buried among non-critical ones, delaying real issue detection.
  • Multiple sources: Alerts come from diverse systems (on-premises, cloud, SaaS) with different formats, complicating correlation.
  • Dynamic environments: Constant changes (e.g., Kubernetes, cloud scaling) make static alert thresholds quickly outdated.

6.2.6.1. The role of AI and AIOps in observability#

A separate but increasingly important consideration in observability is the use of AI and Machine Learning (AI/ML) to enhance how telemetry data is processed, analyzed, and acted upon. As observability systems grow in scale and complexity, traditional rule-based approaches to alerting and analysis become difficult to maintain and prone to noise. This has led to the emergence of AIOps, which applies AI/ML techniques to observability data in order to automate insight generation and operational decision-making.

In the context of alerting, AI/ML can provide value across multiple stages of the observability pipeline:

  1. AI-assisted detection. AI/ML significantly improves alert detection by moving beyond static thresholds and handcrafted rules:

    • Noise reduction and intelligent filtering: Machine learning models can identify recurring patterns, suppress redundant alerts, and group related signals, dramatically reducing alert fatigue.
    • Anomaly detection: Instead of relying on fixed thresholds, models can learn normal system behavior and detect deviations that indicate potential issues, even when absolute values remain within expected ranges.
    • Predictive alerting: By analyzing historical data and trends, AI models can forecast future behavior and surface alerts before incidents occur, shifting operations from a reactive to a proactive posture.
    • Impact-aware prioritization: Alerts can be ranked based on learned correlations with service health or business outcomes, ensuring that the most critical issues are addressed first.
  2. AI-enhanced alert processing and response. Beyond detection, AI/ML also plays a role in how alerts are processed and resolved:

    • Contextual enrichment: Alerts can be automatically enriched with related metrics, logs, traces, topology information, and change events to provide immediate context.
    • Root Cause Analysis (RCA): By correlating signals across multiple layers, AI models can suggest likely root causes rather than presenting isolated symptoms.
    • Guided remediation: Alerts can include recommended actions or relevant runbooks, derived from historical incident resolution patterns, reducing mean time to resolution (MTTR).
    • Automation and feedback loops: AI-assisted systems can learn from operator actions and outcomes, continuously improving alert quality and response recommendations over time.
  3. AIOps as an evolution of observability. AIOps does not replace observability fundamentals such as metrics, logs, traces, and strong data pipelines. Instead, it builds on top of them, leveraging high-quality telemetry to:

    • Reduce operational noise
    • Improve incident detection and diagnosis
    • Enable predictive and automated operations at scale

As observability data volumes continue to grow, AIOps becomes a key enabler for maintaining system reliability without proportional increases in human operational effort.
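
As a simple illustration of the starting point these capabilities build on, the following Python sketch flags samples that deviate strongly from a rolling baseline. This is a basic statistical approach, not a production AIOps model; real platforms apply far more sophisticated techniques such as seasonality-aware models.

```python
from statistics import mean, stdev


def detect_anomalies(samples: list[float], window: int = 30, threshold: float = 3.0):
    """Yield (index, value) pairs deviating > threshold * sigma from the rolling baseline."""
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(samples[i] - mu) > threshold * sigma:
            yield i, samples[i]


# Example: steady interface utilization (%) followed by a sudden spike.
utilization = [42 + (i % 5) for i in range(60)] + [95]
for idx, value in detect_anomalies(utilization):
    print(f"Anomaly at sample {idx}: {value}%")
```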

6.2.7. Visualization#

Goal: All the observed data should provide value to decision-makers, so visualizations must be crafted around the users and answer their needs.

The final block is perhaps the one most people picture when they think about monitoring tools: dashboards and reports. Traditionally, this has been fully integrated into a monolithic solution, but nowadays, following the principle of composability, most tools (e.g., Grafana) support integrating different persistence layers, which increases their capabilities.

Strictly speaking, this component belongs to the Presentation layer of the architecture, but I want to address it here before moving on.

The basic principles to adhere to for this function are:

  • Clarity and simplicity: Dashboards should be easy to understand at a glance. Every element, color, and panel must have a clear purpose and support decision-making.
  • Effective and accessible visuals: Use color intentionally to highlight meaning and urgency, follow accessibility guidelines, and always reinforce color cues with text or symbols.
  • Contextual relevance: Show only data that directly supports operational goals. Irrelevant metrics add noise and reduce insight.
  • User-centered design: Tailor dashboards to the audience’s role and expertise so they provide the right level of detail and remain genuinely useful.
  • Interactivity: Enable drilling down, zooming, and time-range adjustments so users can investigate causes, not just observe symptoms.
  • Clear hierarchy and layout: Present critical information first, with details layered logically. Use multiple dashboards (overview + deep dives) instead of overcrowding a single view.
flowchart TD
    A[Global Overview] --> B[Site Summary] --> C[Device Summary] --> D[Device Detail] --> E[Interface Detail]

Figure 6 — Hierarchical Drilldown.

In chapter 11 of the Modern Network Observability book (“Application of Your Observability Data”) you can find many more details on architecting dashboards.

Keep in mind that this block is closely tied to user perception, so do not forget to interview the users and involve them in the process.

Next, I want to provide an example of what a network observability solution could look like.

6.3. Implementation Example#

This section illustrates how the observability functionalities come together through a practical use case using the Telegraf, Prometheus, and Grafana stack and other tools.

This is not a tool recommendation at all, but because it’s fully built on top of open-source components, you can give it a try.

6.3.1. Use Case: Proactive Interface Saturation Detection#

Scenario: A data center fabric with 50 leaf and spine switches needs to detect interface saturation before it impacts applications and automatically trigger traffic engineering workflows.

Requirements:

  • Alert when interface utilization exceeds 80% for 5 minutes
  • Detect sudden traffic spikes (>50% change)
  • Maintain 30 days of high-resolution data (30s intervals)
  • Integrate with Nautobot (Source of Truth) for inventory
  • Trigger orchestration workflows for remediation

6.3.2. Solution Architecture#

Before starting the solution analysis, it’s important to estimate the scale of the scenario. In this case, 50 devices × 64 ports × 10 metrics ≈ 32K active time series, so it’s a fairly small scenario that doesn’t require advanced tooling.
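
A quick back-of-the-envelope estimate confirms this. The sketch below assumes roughly 2 bytes per sample (a rough figure for Prometheus with compression; the real number depends on churn and label sizes) and uses the 30-second interval and 30-day retention from the requirements:

```python
devices, ports, metrics_per_port = 50, 64, 10
active_series = devices * ports * metrics_per_port            # 32,000 series

scrape_interval_s = 30
samples_per_day = active_series * (86_400 // scrape_interval_s)

bytes_per_sample = 2      # rough assumption for compressed Prometheus samples
retention_days = 30
storage_gb = samples_per_day * retention_days * bytes_per_sample / 1e9

print(f"Active series:   {active_series:,}")
print(f"Samples per day: {samples_per_day:,}")
print(f"~Storage for 30 days: {storage_gb:.1f} GB")
```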

Component selection rationale:

  • Telegraf was chosen as the collector for its multi-protocol support (SNMP for legacy devices, gNMI for modern devices), extensive plugin ecosystem, and built-in processors for data normalization. It handles the 50-device scale easily on a single instance with 30-second polling intervals.
  • Prometheus serves as the persistence layer, optimized for time-series data with its powerful PromQL query language for complex alerting conditions. At 32K series, it operates well within its comfort zone while providing native integration with Alertmanager.
  • Grafana provides visualization with multi-datasource support, querying both Prometheus metrics and Nautobot metadata simultaneously to create context-rich dashboards tailored for different audiences (NOC, capacity planning, management).

Architecture

At a high level, the next figure depicts the main components and their roles:

flowchart TB
    subgraph Sources["Data Sources"]
        NB[Nautobot<br/>Source of Truth]
        SW[Network Devices<br/>SNMP/gNMI]
    end

    subgraph Collection["Collection Layer"]
        T[Telegraf<br/>Collectors]
        SD[Consul<br/>Service Discovery]
    end

    subgraph Storage["Storage"]
        P[Prometheus<br/>TSDB]
    end

    subgraph Alerting["Alerting"]
        AM[Alertmanager<br/>Routing]
    end

    subgraph Presentation["Visualization"]
        G[Grafana<br/>Dashboards]
    end

    subgraph Integration["External Systems"]
        ORCH[Orchestrator<br/>Automation]
        SLACK[Slack<br/>Notifications]
    end

    NB -->|Device Inventory| SD
    SD -->|Dynamic Targets| T
    SW -->|Metrics| T
    T -->|Expose HTTP| P
    P -->|Alert Rules| AM
    P <-->|Queries| G
    NB -->|Metadata| G
    AM -->|Webhook| ORCH
    AM -->|Alerts| SLACK

Figure 7 — Observability Solution Example.

This simplified solution architecture is covered extensively in the Modern Network Observability book. If you want a hands-on approach with a lab scenario to test, give it a try.

6.3.3. Implementation Flow#

Inventory Integration: Nautobot serves as the single source of truth, defining which devices to monitor along with monitoring profiles and SNMP credentials. A lightweight sync service (e.g., a Python script using webhooks) continuously updates Consul’s service registry with device information, enabling dynamic discovery by the collector.
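
As an illustration, here is a heavily simplified Python sketch of such a sync service. The Nautobot URL, token, and field names (which vary between Nautobot versions), as well as the Consul address, are assumptions that would need adapting; it pulls active devices from the Nautobot REST API and registers them via Consul’s agent API:

```python
import requests

NAUTOBOT_URL = "https://nautobot.example.com"    # hypothetical
NAUTOBOT_TOKEN = "CHANGE_ME"
CONSUL_URL = "http://consul.example.com:8500"    # hypothetical


def get_monitored_devices() -> list[dict]:
    """Fetch active devices from the Nautobot REST API."""
    resp = requests.get(
        f"{NAUTOBOT_URL}/api/dcim/devices/",
        headers={"Authorization": f"Token {NAUTOBOT_TOKEN}"},
        params={"status": "active"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["results"]


def register_in_consul(device: dict) -> None:
    """Register a device as a Consul service so the collector can discover it."""
    payload = {
        "Name": "network-device",
        "ID": device["name"],
        # Field names vary between Nautobot versions; adjust to your schema.
        "Address": device["primary_ip"]["address"].split("/")[0],
        "Tags": [device["role"]["name"], device["location"]["name"]],
    }
    requests.put(f"{CONSUL_URL}/v1/agent/service/register", json=payload, timeout=10)


for dev in get_monitored_devices():
    register_in_consul(dev)
```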

Data Collection: Telegraf uses Consul for service discovery, automatically polling SNMP from devices as they appear in Nautobot. Telegraf processors normalize and enrich data (converting status codes to labels, renaming fields to standard names, and adding contextual information from Nautobot) and expose metrics in Prometheus format on an HTTP endpoint.

Persistence and Analysis: Prometheus scrapes Telegraf endpoints using Consul service discovery, storing metrics in its time-series database. Recording rules pre-calculate interface utilization percentages and bandwidth rates to optimize query performance.

Alerting Logic: Alert rules in Prometheus define conditions (e.g., interface utilization >80% for 5 minutes, traffic spikes >50% increase). When conditions match, Alertmanager handles the routing: critical alerts carrying an automation: enabled label go to the orchestrator webhook, while the rest are routed to Slack or PagerDuty based on severity.
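
To give a flavor of the PromQL behind such a rule, here is a hedged Python sketch that evaluates an interface-utilization expression against the Prometheus HTTP API. The metric and label names (ifHCInOctets, ifHighSpeed, device, ifDescr) are assumptions that depend on how Telegraf names the SNMP fields in your setup:

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"    # hypothetical

# Received bits/s over interface speed, as a percentage, filtered to > 80%.
# Metric and label names are assumptions; they depend on the Telegraf SNMP config.
QUERY = "(rate(ifHCInOctets[5m]) * 8) / (ifHighSpeed * 1e6) * 100 > 80"

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    _, value = result["value"]
    print(f"{labels.get('device')} {labels.get('ifDescr')}: {float(value):.1f}% utilization")
```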

Visualization: Grafana dashboards provide multiple views: fabric-wide bandwidth trends, top saturated interfaces, per-device drill-downs with interface heatmaps. Template variables enable filtering by site, role, or device. Dashboards query both Prometheus (metrics) and Nautobot (device metadata) for contextual enrichment.

Closed-Loop Automation: When critical saturation alerts fire, Alertmanager sends webhooks to the orchestration platform, which triggers automated traffic engineering workflows to redistribute load across available paths. We cover this component in Chapter 7.
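
As a sketch of what the receiving end of that webhook could look like, the following minimal Flask app accepts Alertmanager’s webhook payload and forwards firing, automation-enabled alerts to a hypothetical orchestrator endpoint; the label names and orchestrator API are assumptions for illustration:

```python
from flask import Flask, jsonify, request
import requests

app = Flask(__name__)
# Hypothetical orchestrator job endpoint.
ORCHESTRATOR_API = "https://orchestrator.example.com/api/jobs/traffic-engineering"


@app.route("/alertmanager", methods=["POST"])
def handle_alert():
    payload = request.get_json()
    # Alertmanager sends grouped alerts; act only on firing, automation-enabled ones.
    for alert in payload.get("alerts", []):
        if alert["status"] == "firing" and alert["labels"].get("automation") == "enabled":
            requests.post(
                ORCHESTRATOR_API,
                json={
                    "alertname": alert["labels"].get("alertname"),
                    "device": alert["labels"].get("device"),
                    "interface": alert["labels"].get("ifDescr"),
                },
                timeout=10,
            )
    return jsonify({"status": "accepted"}), 202


if __name__ == "__main__":
    app.run(port=5001)
```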

6.3.4. Solution summary#

Operational Benefits:

  • Reduced manual monitoring effort through SoT integration
  • Proactive issue detection with sub-minute latency
  • Closed-loop remediation reducing MTTR from hours to minutes
  • Rich context combining metrics with inventory data

Scalability Considerations:

  • Current architecture handles 50 devices; can scale to ~500 devices before needing distributed collectors
  • Prometheus capacity sufficient up to ~1M active series; beyond that, you may need to consider other solutions or architectures

This brief solution exercise closes the chapter, which has defined the basic goals and functionalities of Observability within a network automation architecture.

6.4. Summary#

Observability in network automation extends far beyond traditional monitoring, providing the architectural foundation for understanding network behavior, detecting issues proactively, and enabling automated remediation at scale. Built on seven core goals, from automatic discovery with minimal human effort to sophisticated real-time analysis and user-centric visualizations, observability transforms how organizations respond to network events by shifting from reactive troubleshooting to proactive, data-driven operations.

The realization of these goals requires seven interconnected architectural pillars and functionalities: SoT integration for automatic inventory discovery, multi-protocol collectors supporting near-real-time data ingestion through streaming telemetry and modern protocols, processors that normalize and enrich heterogeneous data with contextual metadata, distributed systems for scalable data movement at high volumes, purpose-built databases (e.g., time-series, columnar, and text-search) optimized for different observability workloads, intelligent alerting with AI/ML-enhanced detection and routing to orchestration or human responders, and tailored visualizations that present information at appropriate abstraction levels for different audiences.

Implementing observability requires architectural decisions early in the design process. Organizations must choose between traditional on-premises platforms, cloud-native SaaS solutions offering rapid deployment and AI-powered analytics, or composable open-source stacks providing maximum flexibility. Each approach involves tradeoffs in cost, control, operational overhead, and capability. Success also depends on understanding data processing requirements, from normalization and enrichment to filtering and aggregation, and how emerging AIOps capabilities can reduce alert fatigue and enable predictive operations.

Observability is not a single tool but a coherent architectural pattern. Success depends on treating it as a system, where inventory drives collection, collection enables processing, processing feeds distribution, distribution connects to persistence, persistence informs alerting, and alerting ultimately drives visualization and automated response. By carefully designing each component and how they interact, organizations can build observability systems that scale with their networks, integrate with modern protocols like OpenTelemetry and gNMI, and transform operational visibility into actionable intelligence that powers closed-loop automation.
