
Chapter 5: Network Analysis — Deep Dive

Overview

Network problems are notoriously hard to diagnose. Is it the application? The kernel? The network? Packet loss? Buffer exhaustion?

melisai's NetworkCollector (internal/collector/network.go) collects data from multiple sources: per-interface counters, TCP protocol statistics, socket state summaries, conntrack table stats, softnet per-CPU counters, IRQ distribution, NIC hardware details, and TCP extended stats.

Source File: network.go

  • Lines: ~520
  • Functions: 12
  • Data Sources: /proc/net/dev, /proc/net/snmp, /proc/net/netstat, /proc/net/softnet_stat, /proc/softirqs, /proc/sys/net/, /sys/class/net/, ss, ethtool

Function Walkthrough

Collect() — Two-Point Sampling + Deep Diagnostics

The collector uses two-point sampling: it reads counters before and after a configurable interval to compute rates (errors/sec, retransmits/sec, IRQ deltas).

func (c *NetworkCollector) Collect(ctx context.Context, cfg CollectConfig) (*model.Result, error) {
    // Simplified excerpt: setup of data and interval, rate computation,
    // error handling, and the final return are omitted.

    // Phase 1: first sample
    ifaces1 := c.parseNetDev()          // /proc/net/dev
    snmp1 := c.parseSNMP()              // /proc/net/snmp
    irqSample1 := c.readNetRxSoftirqs() // /proc/softirqs

    // Wait for the interval (default 1s). time.After only returns a channel,
    // so it must be received from; also stop early if the context is cancelled.
    select {
    case <-time.After(interval):
    case <-ctx.Done():
        return nil, ctx.Err()
    }

    // Phase 2: second sample + rates derived from the ifaces1/snmp1 deltas
    data.Interfaces = c.parseNetDev()
    data.TCP = c.parseSNMP()
    c.parseSSConnections(ctx, data)    // ss command

    // Deep diagnostics (all Tier 1; no root needed for procfs)
    data.Conntrack = c.parseConntrack()
    data.SoftnetStats = c.parseSoftnetStat()
    data.IRQDistribution = c.computeIRQDistribution(irqSample1)
    c.parseNetstat(data)               // /proc/net/netstat
    c.enrichNICDetails(ctx, data)      // sysfs + ethtool
}
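
The rate arithmetic behind two-point sampling can be sketched as follows. This is a minimal illustration, not the collector's actual code: the helper name ratePerSec and the sample readings are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// ratePerSec converts two readings of a monotonically increasing counter
// into a per-second rate. It returns 0 if the counter went backwards
// (for example after a reset) or if the interval is not positive.
// Hypothetical helper, not part of melisai.
func ratePerSec(before, after uint64, interval time.Duration) float64 {
	if after < before || interval <= 0 {
		return 0
	}
	return float64(after-before) / interval.Seconds()
}

func main() {
	// Made-up RetransSegs readings taken one second apart.
	fmt.Printf("%.1f retransmits/sec\n", ratePerSec(45, 57, time.Second))
}
```

Returning 0 on a backwards counter is one possible policy; a real collector might instead report the sample as invalid.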

parseNetDev() — Per-Interface Traffic

// /proc/net/dev format:
// Inter-|   Receive                                                |  Transmit
//  face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets...
//   eth0: 1234567  8901   0    0    0     0          0         0  9876543  6789...

What errors and drops mean:

| Counter | Cause |
|---|---|
| rx_errors | CRC errors, frame alignment errors (hardware/cable issue) |
| rx_dropped | Kernel dropped packets (ring buffer full, no memory) |
| tx_errors | Carrier errors, aborts (cable/switch issue) |
| tx_dropped | Queueing discipline dropped packets (traffic shaping or queue overflow) |

Rule: Any non-zero error or drop counter warrants investigation.
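
To make the column layout concrete, here is a hedged sketch of parsing one data line of /proc/net/dev. The type ifaceCounters and the function parseNetDevLine are illustrative names, not melisai's; the column positions follow the standard 8 receive + 8 transmit layout shown in the header above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ifaceCounters holds a subset of the 16 per-interface columns.
// Hypothetical type for illustration only.
type ifaceCounters struct {
	Name                                    string
	RxBytes, RxPackets, RxErrors, RxDropped uint64
	TxBytes, TxPackets, TxErrors, TxDropped uint64
}

// parseNetDevLine parses a single "iface: v0 v1 ... v15" line.
// The two header lines contain no colon and are rejected with an error.
func parseNetDevLine(line string) (ifaceCounters, error) {
	name, rest, ok := strings.Cut(line, ":")
	if !ok {
		return ifaceCounters{}, fmt.Errorf("no interface separator in %q", line)
	}
	fields := strings.Fields(rest)
	if len(fields) < 16 {
		return ifaceCounters{}, fmt.Errorf("expected 16 columns, got %d", len(fields))
	}
	vals := make([]uint64, 16)
	for i := range vals {
		v, err := strconv.ParseUint(fields[i], 10, 64)
		if err != nil {
			return ifaceCounters{}, fmt.Errorf("column %d: %w", i, err)
		}
		vals[i] = v
	}
	// Columns 0-7 are receive counters, columns 8-15 are transmit counters.
	return ifaceCounters{
		Name:    strings.TrimSpace(name),
		RxBytes: vals[0], RxPackets: vals[1], RxErrors: vals[2], RxDropped: vals[3],
		TxBytes: vals[8], TxPackets: vals[9], TxErrors: vals[10], TxDropped: vals[11],
	}, nil
}

func main() {
	line := "  eth0: 1234567 8901 0 0 0 0 0 0 9876543 6789 0 0 0 0 0 0"
	c, err := parseNetDevLine(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s rx_errors=%d rx_dropped=%d tx_errors=%d tx_dropped=%d\n",
		c.Name, c.RxErrors, c.RxDropped, c.TxErrors, c.TxDropped)
}
```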

parseSNMP() — TCP Protocol Statistics

// /proc/net/snmp has paired header/value lines:
// Tcp: ... CurrEstab ActiveOpens PassiveOpens RetransSegs InErrs OutRsts ...
// Tcp: ... 234       5678       9012        45          2      89      ...

Key TCP metrics:

| Metric | Normal | Concern |
|---|---|---|
| RetransSegs | < 0.1% of total segments | > 1% indicates packet loss or congestion |
| InErrs | 0 | Any value = corrupted in-flight packets |
| OutRsts | Low | High = connections being refused or reset |
| ActiveOpens/PassiveOpens ratio | Depends on role | Server should have more PassiveOpens |
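
The paired header/value layout lends itself to a small zip-style parser. The sketch below is an illustration under assumptions: parsePaired and retransPct are hypothetical names, and using OutSegs as the "total segments" denominator is a choice made for this example, since the text above does not say which counter the collector divides by.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePaired zips a header line and a value line that share a prefix
// such as "Tcp:" into a name -> value map.
func parsePaired(header, values string) (map[string]int64, error) {
	h := strings.Fields(header)
	v := strings.Fields(values)
	if len(h) == 0 || len(h) != len(v) || h[0] != v[0] {
		return nil, fmt.Errorf("header and value lines do not match")
	}
	out := make(map[string]int64, len(h)-1)
	for i := 1; i < len(h); i++ {
		n, err := strconv.ParseInt(v[i], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("field %s: %w", h[i], err)
		}
		out[h[i]] = n
	}
	return out, nil
}

// retransPct returns RetransSegs as a percentage of OutSegs.
// Assumption: OutSegs stands in for "total segments".
func retransPct(tcp map[string]int64) float64 {
	if tcp["OutSegs"] <= 0 {
		return 0
	}
	return 100 * float64(tcp["RetransSegs"]) / float64(tcp["OutSegs"])
}

func main() {
	// Shortened, made-up sample; the real file has more columns.
	tcp, err := parsePaired(
		"Tcp: CurrEstab OutSegs RetransSegs InErrs OutRsts",
		"Tcp: 234 9000 45 2 89",
	)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("retransmits: %.2f%% of sent segments\n", retransPct(tcp))
}
```

Because the counters are cumulative since boot, a ratio computed from the deltas between two samples reflects current conditions better than one computed from the raw totals.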

parseSSConnections() — Socket State Summary

// `ss -s` → summary with TIME_WAIT count
// `ss -tn state close-wait` → CLOSE_WAIT connections (count lines - header)

TCP State Problems:

                  Normal flow
    ┌──────────┐          ┌──────────┐
    │ESTABLISHED├──close──►│FIN_WAIT_1│
    └──────────┘          └────┬─────┘
                          ┌────▼─────┐
                          │TIME_WAIT │  ← stays 60 seconds
                          └──────────┘

                  Problem: leaked connection
    ┌──────────┐          ┌──────────┐
    │ESTABLISHED├──peer───►│CLOSE_WAIT│  ← application never closes!
    └──────────┘  closes  └──────────┘

| State | Count | Assessment | Meaning |
|---|---|---|---|
| TIME_WAIT | < 1000 | Normal | Connections cooling down |
| TIME_WAIT | > 10000 | High churn | Many short-lived connections; consider keep-alive |
| TIME_WAIT | > 50000 | Port exhaustion risk | Ephemeral ports may run out |
| CLOSE_WAIT | > 0 | Bug! | Application received FIN but has not closed the socket; a bug if the count persists |
| CLOSE_WAIT | > 100 | Critical bug | Connection leak; application must be fixed |
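
The "count lines minus header" step for `ss -tn state close-wait` can be sketched as below. This is an assumption-laden illustration: countSockets is a hypothetical helper, the sample output is made up, and the code assumes ss prints exactly one header row before the socket rows.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countSockets counts the data rows in ss output: every non-blank line
// after the first non-blank line, which is assumed to be the header.
func countSockets(ssOutput string) int {
	sc := bufio.NewScanner(strings.NewReader(ssOutput))
	rows := 0
	sawHeader := false
	for sc.Scan() {
		if strings.TrimSpace(sc.Text()) == "" {
			continue
		}
		if !sawHeader {
			sawHeader = true // skip the column header row
			continue
		}
		rows++
	}
	return rows
}

func main() {
	// Made-up sample of `ss -tn state close-wait` output.
	sample := `Recv-Q Send-Q Local Address:Port Peer Address:Port
1      0      10.0.0.5:8080      10.0.0.9:51234
1      0      10.0.0.5:8080      10.0.0.9:51240
`
	fmt.Println("close_wait_count:", countSockets(sample))
}
```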

Deep Network Diagnostics

parseConntrack() — Connection Tracking Table

Reads conntrack table usage from /proc/sys/net/netfilter/:

type ConntrackStats struct {
    Count        int64   // current entries
    Max          int64   // nf_conntrack_max
    UsagePct     float64 // count/max * 100
    Drops        int64   // dropped due to full table
    InsertFailed int64   // failed to insert new entry
    EarlyDrop    int64   // entries dropped early to make room
}

| Metric | Warning | Critical | Meaning |
|---|---|---|---|
| UsagePct > 70% | Yes | > 90% | Table approaching capacity; new connections will be dropped |
| Drops > 0 | Yes | Yes | Connections already being dropped |

Fix: sysctl -w net.netfilter.nf_conntrack_max=<current*2>
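
The UsagePct computation and the thresholds from the table can be sketched as follows. The function names are hypothetical, and treating any drop as critical is one reading of the table, not a confirmed detail of the collector.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCounter parses the contents of a single-value procfs file,
// which typically ends with a newline.
func parseCounter(raw string) (int64, error) {
	return strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
}

// usagePct returns count/max * 100, or 0 when max is not positive
// (for example when conntrack is not loaded).
func usagePct(count, max int64) float64 {
	if max <= 0 {
		return 0
	}
	return float64(count) / float64(max) * 100
}

// conntrackSeverity applies the table above: more than 90% usage or any
// drops is critical, more than 70% usage is a warning.
func conntrackSeverity(pct float64, drops int64) string {
	switch {
	case pct > 90 || drops > 0:
		return "critical"
	case pct > 70:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	// Strings standing in for the contents of nf_conntrack_count and
	// nf_conntrack_max under /proc/sys/net/netfilter/.
	count, err1 := parseCounter("61000\n")
	max, err2 := parseCounter("65536\n")
	if err1 != nil || err2 != nil {
		fmt.Println("parse error")
		return
	}
	pct := usagePct(count, max)
	fmt.Printf("usage %.1f%% -> %s\n", pct, conntrackSeverity(pct, 0))
}
```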

parseSoftnetStat() — Per-CPU Packet Processing

Reads /proc/net/softnet_stat — hex columns per CPU line:

00beef02 00000002 00000005 ...   ← CPU 0
0000abcd 00000000 00000003 ...   ← CPU 1

| Column | Name | Meaning |
|---|---|---|
| 0 | processed | Total packets processed by this CPU |
| 1 | dropped | Packets dropped (softirq couldn't keep up) |
| 2 | time_squeeze | Times softirq budget ran out |

Any non-zero dropped = kernel is losing packets. Causes:

  • Single CPU handling all NIC interrupts (no RPS/RSS)
  • net.core.netdev_budget too low (default 300)
  • IRQ affinity pinning all interrupts to one core
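
Decoding the hex columns takes only a few lines. The sketch below is illustrative: softnetRow and parseSoftnet are hypothetical names, and it assumes the Nth line corresponds to CPU N, which may not hold on systems with offline CPUs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// softnetRow holds the first three columns of one softnet_stat line.
type softnetRow struct {
	CPU                             int
	Processed, Dropped, TimeSqueeze uint64
}

// parseSoftnet decodes the hexadecimal columns of /proc/net/softnet_stat.
// Assumption: line order matches CPU numbering.
func parseSoftnet(content string) ([]softnetRow, error) {
	var rows []softnetRow
	for _, line := range strings.Split(strings.TrimSpace(content), "\n") {
		f := strings.Fields(line)
		if len(f) < 3 {
			continue // blank or truncated line
		}
		var vals [3]uint64
		for i := range vals {
			v, err := strconv.ParseUint(f[i], 16, 64)
			if err != nil {
				return nil, fmt.Errorf("line %d column %d: %w", len(rows), i, err)
			}
			vals[i] = v
		}
		rows = append(rows, softnetRow{
			CPU:         len(rows),
			Processed:   vals[0],
			Dropped:     vals[1],
			TimeSqueeze: vals[2],
		})
	}
	return rows, nil
}

func main() {
	sample := "00beef02 00000002 00000005\n0000abcd 00000000 00000003\n"
	rows, err := parseSoftnet(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, r := range rows {
		fmt.Printf("cpu%d processed=%d dropped=%d time_squeeze=%d\n",
			r.CPU, r.Processed, r.Dropped, r.TimeSqueeze)
	}
}
```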

computeIRQDistribution() — NET_RX Softirq Delta

Two-point sampling of /proc/softirqs NET_RX line to show per-CPU interrupt processing rate:

type IRQDistribution struct {
    CPU        int   // CPU number
    NetRxDelta int64 // NET_RX interrupts processed during sample interval
}

What to look for: If one CPU has 10x the delta of others, that CPU is the NIC interrupt bottleneck. Fix with IRQ affinity or enable RPS.
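
A rough sketch of the delta computation and the "10x" imbalance check follows. All names here (parseNetRx, netRxDeltas, hotCPU) are hypothetical, and the rule "the busiest CPU has at least factor times the delta of every other CPU" is one possible interpretation of the guidance above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNetRx returns the per-CPU counters from the NET_RX line of
// /proc/softirqs content.
func parseNetRx(softirqs string) ([]int64, error) {
	for _, line := range strings.Split(softirqs, "\n") {
		f := strings.Fields(line)
		if len(f) < 2 || f[0] != "NET_RX:" {
			continue
		}
		counts := make([]int64, 0, len(f)-1)
		for _, s := range f[1:] {
			n, err := strconv.ParseInt(s, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("NET_RX column %q: %w", s, err)
			}
			counts = append(counts, n)
		}
		return counts, nil
	}
	return nil, fmt.Errorf("NET_RX line not found")
}

// netRxDeltas subtracts the first sample from the second, per CPU.
func netRxDeltas(before, after []int64) []int64 {
	n := len(before)
	if len(after) < n {
		n = len(after)
	}
	d := make([]int64, n)
	for i := 0; i < n; i++ {
		d[i] = after[i] - before[i]
	}
	return d
}

// hotCPU returns the index of a CPU whose delta is at least factor times
// that of every other CPU, or -1 if the load is not that skewed.
func hotCPU(deltas []int64, factor int64) int {
	if len(deltas) < 2 {
		return -1
	}
	hot := 0
	for i, d := range deltas {
		if d > deltas[hot] {
			hot = i
		}
	}
	if deltas[hot] <= 0 {
		return -1
	}
	for i, d := range deltas {
		if i != hot && deltas[hot] < factor*d {
			return -1
		}
	}
	return hot
}

func main() {
	// Made-up samples taken one interval apart.
	before, _ := parseNetRx("        CPU0 CPU1 CPU2 CPU3\n NET_RX: 100 100 100 100\n")
	after, _ := parseNetRx("        CPU0 CPU1 CPU2 CPU3\n NET_RX: 5100 150 160 140\n")
	d := netRxDeltas(before, after)
	fmt.Println("deltas:", d, "hot CPU:", hotCPU(d, 10))
}
```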

parseNetstat() — TCP Extended Counters

Reads /proc/net/netstat TcpExt: section for production-critical counters:

| Counter | Meaning | Action |
|---|---|---|
| ListenOverflows | Accept queue full; SYN dropped | Increase somaxconn, add SO_REUSEPORT |
| ListenDrops | Same as overflows but includes other causes | Check application accept() rate |
| TCPAbortOnMemory | Connection aborted due to memory pressure | Increase tcp_mem |
| PruneCalled | Kernel pruned TCP receive buffers | Increase tcp_mem limits |
| TCPOFOQueue | Out-of-order packets queued | Network reordering or congestion |

enrichNICDetails() — Hardware-Level Info

Uses sysfs and ethtool to gather NIC hardware details per interface:

| Source | Field | What it tells you |
|---|---|---|
| /sys/class/net/<iface>/speed | Speed | Link speed (1000Mbps, 10000Mbps) |
| /sys/class/net/<iface>/queues/ | RxQueues, TxQueues | Number of hardware queues |
| /sys/class/net/<iface>/queues/rx-0/rps_cpus | RPSEnabled | Whether RPS distributes packets across CPUs |
| /sys/class/net/<iface>/master | BondSlave | Whether this NIC is part of a bond |
| ethtool -i | Driver | NIC driver name (e.g., ixgbe, mlx5_core) |
| ethtool -g | RingRxCur, RingRxMax | Current/max ring buffer size |
| ethtool -S | RxDiscards, RxBufErrors | NIC-level packet drops |

Ring buffer overflow (RxDiscards > 0 with RingRxCur < RingRxMax):

# Increase ring buffer to max
ethtool -G eth0 rx 4096
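
Extracting RingRxCur and RingRxMax from `ethtool -g` could look like the sketch below. It assumes the common output layout with a "Pre-set maximums:" section followed by a "Current hardware settings:" section, each containing an "RX:" line; output varies between ethtool versions and drivers, and parseRingRx is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRingRx returns the current and maximum RX ring sizes found in
// `ethtool -g <iface>` output. Non-numeric values such as "n/a" are skipped.
func parseRingRx(out string) (cur, maxSize int, err error) {
	section := ""
	haveCur, haveMax := false, false
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case strings.HasPrefix(line, "Pre-set maximums"):
			section = "max"
		case strings.HasPrefix(line, "Current hardware settings"):
			section = "cur"
		case strings.HasPrefix(line, "RX:"): // does not match "RX Mini:" or "RX Jumbo:"
			n, convErr := strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "RX:")))
			if convErr != nil {
				continue
			}
			if section == "max" {
				maxSize, haveMax = n, true
			} else if section == "cur" {
				cur, haveCur = n, true
			}
		}
	}
	if !haveCur || !haveMax {
		return 0, 0, fmt.Errorf("RX ring sizes not found")
	}
	return cur, maxSize, nil
}

func main() {
	// Made-up sample in the assumed layout.
	sample := "Ring parameters for eth0:\n" +
		"Pre-set maximums:\nRX:\t\t4096\nRX Mini:\t0\nRX Jumbo:\t0\nTX:\t\t4096\n" +
		"Current hardware settings:\nRX:\t\t256\nRX Mini:\t0\nRX Jumbo:\t0\nTX:\t\t256\n"
	cur, maxSize, err := parseRingRx(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	if cur < maxSize {
		fmt.Printf("RX ring at %d of %d: room to grow\n", cur, maxSize)
	}
}
```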

Sysctl Parameters

| Parameter | Typical | Purpose |
|---|---|---|
| tcp_congestion_control | cubic | Congestion algorithm (bbr often performs better on lossy or WAN links) |
| tcp_rmem | 4096 131072 6291456 | TCP receive buffer (min/default/max), in bytes |
| tcp_wmem | 4096 16384 4194304 | TCP send buffer (min/default/max), in bytes |
| somaxconn | 4096 | Maximum listen backlog (increase for high-connection servers) |
| tcp_mem | three values, in pages | Global TCP memory limits (low/pressure/high) |
| tcp_max_tw_buckets | 65536 | Max TIME_WAIT sockets |
| tcp_keepalive_time | 7200 | Seconds before keepalive probes |
| netdev_budget | 300 | Max packets processed per softirq cycle |

Common tuning:

  • High-throughput: Increase tcp_rmem/tcp_wmem max to 16MB+
  • WAN optimization: Switch to bbr congestion control
  • Web servers: somaxconn=65535 to avoid connection drops under load
  • High PPS: Increase netdev_budget to 4096+, enable RPS

Anomaly Detection Rules (Network)

| Rule | Warning | Critical | Source |
|---|---|---|---|
| tcp_retransmits | 10/s | 50/s | /proc/net/snmp |
| tcp_timewait | 5000 | 20000 | ss |
| network_errors_per_sec | 1/s | 100/s | /proc/net/dev |
| conntrack_usage_pct | 70% | 90% | /proc/sys/net/netfilter/ |
| softnet_dropped | 1 | 10 | /proc/net/softnet_stat |
| listen_overflows | 1 | 100 | /proc/net/netstat |
| nic_rx_discards | 100 | 10000 | ethtool -S |
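
The threshold table above can be expressed as data plus a single comparison function. This is only a sketch: the rule type and evaluate function are hypothetical, and treating the thresholds as inclusive (value >= threshold) is an assumption, since the table does not say.

```go
package main

import "fmt"

// rule pairs a metric name with its warning and critical thresholds.
type rule struct {
	Name              string
	Warning, Critical float64
}

// networkRules mirrors the table above.
var networkRules = []rule{
	{"tcp_retransmits", 10, 50},
	{"tcp_timewait", 5000, 20000},
	{"network_errors_per_sec", 1, 100},
	{"conntrack_usage_pct", 70, 90},
	{"softnet_dropped", 1, 10},
	{"listen_overflows", 1, 100},
	{"nic_rx_discards", 100, 10000},
}

// evaluate classifies a value against a rule.
// Assumption: thresholds are inclusive.
func evaluate(r rule, value float64) string {
	switch {
	case value >= r.Critical:
		return "critical"
	case value >= r.Warning:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	// Made-up observations for two of the rules.
	observed := map[string]float64{"tcp_retransmits": 12, "conntrack_usage_pct": 93.1}
	for _, r := range networkRules {
		if v, ok := observed[r.Name]; ok {
			fmt.Printf("%s=%v -> %s\n", r.Name, v, evaluate(r, v))
		}
	}
}
```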

Diagnostic Examples

Healthy Web Server

{
  "interfaces": [
    {"name": "eth0", "rx_bytes": 5000000, "tx_bytes": 50000000,
     "rx_errors": 0, "tx_dropped": 0, "driver": "virtio_net",
     "rx_queues": 4, "ring_rx_current": 256, "ring_rx_max": 256}
  ],
  "tcp": {
    "curr_estab": 500, "retrans_segs": 2, "time_wait_count": 200, "close_wait_count": 0
  },
  "congestion_ctrl": "bbr",
  "conntrack": {"count": 500, "max": 65536, "usage_pct": 0.76},
  "listen_overflows": 0,
  "softnet_stats": [
    {"cpu": 0, "processed": 50000, "dropped": 0, "time_squeeze": 0}
  ]
}

No errors, no drops, low conntrack usage, zero ListenOverflows, no softnet drops.

Connection Leak

{
  "tcp": {
    "curr_estab": 12000,
    "close_wait_count": 8500,
    "retrans_segs": 0
  }
}

8500 CLOSE_WAIT = massive connection leak. The application is not closing sockets after the remote side disconnects.

NIC Ring Buffer Overflow

{
  "interfaces": [
    {"name": "eth0", "rx_dropped": 45000, "driver": "ixgbe",
     "ring_rx_current": 256, "ring_rx_max": 4096, "rx_discards": 12000}
  ],
  "softnet_stats": [
    {"cpu": 0, "processed": 5000000, "dropped": 200, "time_squeeze": 50}
  ]
}

NIC drops 12K packets at hardware level (ring buffer at 256/4096). Fix: ethtool -G eth0 rx 4096

Conntrack Table Full

{
  "conntrack": {"count": 61000, "max": 65536, "usage_pct": 93.1, "drops": 150}
}

Table at 93% with active drops. Fix: sysctl -w net.netfilter.nf_conntrack_max=131072


Next: Chapter 6 — Process Analysis