Chapter 11: Anomaly Detection

Overview

melisai's anomaly detection engine (internal/model/anomaly.go) applies 37 threshold rules based on Brendan Gregg's recommended values and production best practices. Each rule evaluates a specific metric from collected data and flags it as warning or critical.

All rate-based rules use two-point sampling (delta/interval), so they detect issues happening right now rather than reflecting counters accumulated over the system's uptime.

How It Works

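```go
type Threshold struct {
    Metric    string  // rule name, e.g. "cpu_utilization"
    Category  string  // group the rule belongs to
    Warning   float64 // warning threshold
    Critical  float64 // critical threshold
    Evaluator func(report *Report) (float64, bool) // returns (value, found)
    Message   func(value float64) string           // formats the anomaly text
}
```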

Each threshold has an evaluator function that extracts the metric from the report and returns (value, found). If found is false, the rule is skipped. If found is true and value >= Critical, a critical anomaly is created; otherwise, if value >= Warning, a warning anomaly is created.
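The evaluation flow described above can be sketched as follows. This is a minimal illustration, not the contents of internal/model/anomaly.go: the Report field CPUUtilization, the Anomaly type, the Evaluate function, and the cpuThreshold helper are hypothetical names introduced for the example.

```go
package main

import "fmt"

// Report is a hypothetical stand-in for the collected data.
type Report struct {
	CPUUtilization *float64 // nil when the metric was not collected
}

// Threshold mirrors the struct shown in this chapter.
type Threshold struct {
	Metric    string
	Category  string
	Warning   float64
	Critical  float64
	Evaluator func(report *Report) (float64, bool)
	Message   func(value float64) string
}

// Anomaly is a hypothetical result type for this sketch.
type Anomaly struct {
	Metric   string
	Severity string
	Value    float64
	Message  string
}

// Evaluate applies one threshold: critical is checked first, then warning.
// It returns false when the metric is missing or below both thresholds.
func Evaluate(th Threshold, r *Report) (Anomaly, bool) {
	value, found := th.Evaluator(r)
	if !found {
		return Anomaly{}, false
	}
	var severity string
	switch {
	case value >= th.Critical:
		severity = "critical"
	case value >= th.Warning:
		severity = "warning"
	default:
		return Anomaly{}, false
	}
	return Anomaly{Metric: th.Metric, Severity: severity, Value: value, Message: th.Message(value)}, true
}

// cpuThreshold builds rule 1 (cpu_utilization) with the values from the CPU table.
func cpuThreshold() Threshold {
	return Threshold{
		Metric: "cpu_utilization", Category: "cpu", Warning: 80, Critical: 95,
		Evaluator: func(r *Report) (float64, bool) {
			if r.CPUUtilization == nil {
				return 0, false
			}
			return *r.CPUUtilization, true
		},
		Message: func(v float64) string { return fmt.Sprintf("CPU utilization at %.1f%%", v) },
	}
}

func main() {
	v := 97.2
	if a, ok := Evaluate(cpuThreshold(), &Report{CPUUtilization: &v}); ok {
		fmt.Println(a.Severity+":", a.Message)
	}
}
```

Checking the critical threshold before the warning threshold means a value that exceeds both produces a single critical anomaly rather than two entries.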

The 37 Rules

CPU (5 rules)

| # | Metric | Warning | Critical | Source |
|---|--------|---------|----------|--------|
| 1 | cpu_utilization | > 80% | > 95% | /proc/stat (delta) |
| 2 | cpu_iowait | > 10% | > 30% | /proc/stat (delta) |
| 3 | load_average | > 2x CPUs | > 4x CPUs | /proc/loadavg |
| 4 | runqlat_p99 | > 10ms | > 50ms | BCC runqlat histogram |
| 5 | cpu_psi_pressure | > 5% | > 25% | /proc/pressure/cpu |

Memory (8 rules)

| # | Metric | Warning | Critical | Source |
|---|--------|---------|----------|--------|
| 6 | memory_utilization | > 85% | > 95% | /proc/meminfo MemAvailable |
| 7 | swap_usage | > 10% | > 50% | /proc/meminfo |
| 8 | memory_psi_pressure | > 5% | > 25% | /proc/pressure/memory |
| 9 | cache_miss_ratio | > 5% | > 15% | BCC cachestat |
| 10 | direct_reclaim_rate | > 10/s | > 1000/s | /proc/vmstat pgscan_direct (rate) |
| 11 | compaction_stall_rate | > 1/s | > 100/s | /proc/vmstat compact_stall (rate) |
| 12 | thp_split_rate | > 1/s | > 100/s | /proc/vmstat thp_split_page (rate) |
| 13 | numa_miss_ratio | > 5% | > 20% | /sys/devices/system/node numastat |
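As an illustration of how a percentage metric such as memory_utilization might be derived from its source, the sketch below computes (MemTotal - MemAvailable) / MemTotal from /proc/meminfo-style text. The formula is an assumption based on the "MemAvailable" note in the table, and parseMeminfo and MemoryUtilization are hypothetical names, not melisai functions.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo reads "Key:   value kB" lines into a map of kB values.
func parseMeminfo(text string) map[string]float64 {
	fields := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		parts := strings.Fields(scanner.Text())
		if len(parts) < 2 {
			continue
		}
		kb, err := strconv.ParseFloat(parts[1], 64)
		if err != nil {
			continue
		}
		fields[strings.TrimSuffix(parts[0], ":")] = kb
	}
	return fields
}

// MemoryUtilization returns used memory as a percentage of MemTotal.
// Assumption: "used" means MemTotal minus MemAvailable.
func MemoryUtilization(meminfo string) (float64, bool) {
	f := parseMeminfo(meminfo)
	total, okTotal := f["MemTotal"]
	avail, okAvail := f["MemAvailable"]
	if !okTotal || !okAvail || total <= 0 {
		return 0, false
	}
	return (total - avail) / total * 100, true
}

func main() {
	sample := "MemTotal:       16000000 kB\n" +
		"MemFree:         1000000 kB\n" +
		"MemAvailable:    2000000 kB\n"
	if pct, ok := MemoryUtilization(sample); ok {
		fmt.Printf("memory_utilization: %.1f%%\n", pct) // prints "memory_utilization: 87.5%"
	}
}
```

With the sample values, 87.5% exceeds the 85% warning threshold of rule 6 but stays below the 95% critical threshold.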

Disk (5 rules)

| # | Metric | Warning | Critical | Source |
|---|--------|---------|----------|--------|
| 14 | disk_utilization | > 70% | > 90% | /proc/diskstats |
| 15 | disk_avg_latency | > 5ms | > 50ms | /proc/diskstats |
| 16 | biolatency_p99_ssd | > 5ms | > 25ms | BCC biolatency |
| 17 | biolatency_p99_hdd | > 50ms | > 200ms | BCC biolatency |
| 18 | io_psi_pressure | > 10% | > 50% | /proc/pressure/io |

Network (15 rules)

| # | Metric | Warning | Critical | Source |
|---|--------|---------|----------|--------|
| 19 | tcp_retransmits | > 10/s | > 50/s | /proc/net/snmp (rate) |
| 20 | tcp_timewait | > 5000 | > 20000 | ss |
| 21 | network_errors_per_sec | > 10/s | > 100/s | /proc/net/dev (rate) |
| 22 | conntrack_usage_pct | > 70% | > 90% | /proc/sys/net/netfilter |
| 23 | softnet_dropped | > 1/s | > 100/s | /proc/net/softnet_stat (rate) |
| 24 | listen_overflows | > 1/s | > 100/s | /proc/net/netstat (rate) |
| 25 | nic_rx_discards | > 100 | > 10000 | ethtool -S |
| 26 | tcp_close_wait | > 1 | > 100 | ss (current state) |
| 27 | softnet_time_squeeze | > 1/s | > 100/s | /proc/net/softnet_stat (rate) |
| 28 | tcp_abort_on_memory | > 0.1/s | > 1/s | /proc/net/netstat (rate) |
| 29 | irq_imbalance | > 5x ratio | > 20x ratio | /proc/softirqs (rate delta) |
| 30 | udp_rcvbuf_errors | > 1/s | > 100/s | /proc/net/snmp (rate) |
| 31 | tcp_rcvq_drop | > 1/s | > 100/s | /proc/net/netstat (rate) |
| 32 | tcp_zero_window_drop | > 1/s | > 50/s | /proc/net/netstat (rate) |
| 33 | listen_queue_saturation | > 70% fill | > 90% fill | ss -tnl Recv-Q/Send-Q |

Container (2 rules)

| # | Metric | Warning | Critical | Source |
|---|--------|---------|----------|--------|
| 34 | cpu_throttling | > 100 periods | > 1000 periods | cgroup cpu.stat |
| 35 | container_memory_usage | > 80% | > 95% | cgroup memory.current/max |

System (1 rule)

| # | Metric | Warning | Critical | Source |
|---|--------|---------|----------|--------|
| 36 | gpu_nic_cross_numa | > 1 pair | > 1 pair | sysfs PCI NUMA node |

Other (1 rule)

| # | Metric | Warning | Critical | Source |
|---|--------|---------|----------|--------|
| 37 | dns_latency_p99 | > 50ms | > 200ms | BCC gethostlatency |

Rate-Based Detection

Rules marked (rate) use two-point sampling: the collector reads the counter before and after a 1-second interval and computes delta / seconds. This eliminates false positives from counters that have accumulated large totals on long-uptime systems.

Rate-based rules: tcp_retransmits, network_errors_per_sec, softnet_dropped, softnet_time_squeeze, listen_overflows, tcp_abort_on_memory, udp_rcvbuf_errors, tcp_rcvq_drop, tcp_zero_window_drop, direct_reclaim_rate, compaction_stall_rate, thp_split_rate. The irq_imbalance ratio is likewise computed from rate deltas.
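Two-point sampling reduces to a small calculation. The sketch below assumes the two counter readings and the elapsed interval are already available; the Rate function and its handling of counter resets are illustrative choices, not taken from the melisai source.

```go
package main

import (
	"fmt"
	"time"
)

// Rate converts two readings of a cumulative counter into events per second.
// It returns false when the interval is not positive or the counter went
// backwards (for example after a reset), since no meaningful rate exists then.
func Rate(before, after uint64, interval time.Duration) (float64, bool) {
	seconds := interval.Seconds()
	if seconds <= 0 || after < before {
		return 0, false
	}
	return float64(after-before) / seconds, true
}

func main() {
	// A counter that has reached one million over weeks of uptime,
	// but only grew by 42 during the 1-second sampling window.
	if perSec, ok := Rate(1_000_000, 1_000_042, time.Second); ok {
		fmt.Printf("listen_overflows: %.1f/s\n", perSec) // prints "listen_overflows: 42.0/s"
	}
}
```

Comparing the raw counter (1,000,042) against a threshold would always fire on a long-running host; the rate (42/s) reflects only what happened during the sampling window.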

Health Score

The health score (0-100) combines USE metric deductions and anomaly deductions:

USE deductions (weighted by resource importance):
- Weights: CPU 1.5x, Memory 1.5x, Disk 1.0x, Network 1.0x, Container 1.2x
- Utilization >= 95% → -15*weight; >= 85% → -8*weight; >= 70% → -3*weight
- Saturation > 50% → -15*weight; > 10% → -8*weight; > 1% → -3*weight
- Errors > 1000 → -10*weight; > 100 → -5*weight; > 0 → -2*weight

Anomaly deductions (flat, not weighted):
- Critical anomaly = -10 points
- Warning anomaly = -5 points

Score clamped to [0, 100].
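The scoring rules above can be worked through in a short sketch. It assumes the score starts at 100 before deductions and that a resource without a listed weight uses 1.0; the USE type, useDeduction, and HealthScore are hypothetical names chosen for the example.

```go
package main

import "fmt"

// USE holds one resource's metrics: utilization and saturation in percent,
// errors as a count.
type USE struct {
	Utilization float64
	Saturation  float64
	Errors      float64
}

// weights follows the per-resource weights listed in this chapter.
var weights = map[string]float64{
	"cpu": 1.5, "memory": 1.5, "disk": 1.0, "network": 1.0, "container": 1.2,
}

// useDeduction returns the points to subtract for one resource.
func useDeduction(m USE, weight float64) float64 {
	d := 0.0
	switch {
	case m.Utilization >= 95:
		d += 15 * weight
	case m.Utilization >= 85:
		d += 8 * weight
	case m.Utilization >= 70:
		d += 3 * weight
	}
	switch {
	case m.Saturation > 50:
		d += 15 * weight
	case m.Saturation > 10:
		d += 8 * weight
	case m.Saturation > 1:
		d += 3 * weight
	}
	switch {
	case m.Errors > 1000:
		d += 10 * weight
	case m.Errors > 100:
		d += 5 * weight
	case m.Errors > 0:
		d += 2 * weight
	}
	return d
}

// HealthScore subtracts weighted USE deductions and flat anomaly deductions
// from an assumed starting score of 100, then clamps to [0, 100].
func HealthScore(use map[string]USE, critical, warning int) float64 {
	score := 100.0
	for resource, m := range use {
		w, ok := weights[resource]
		if !ok {
			w = 1.0 // assumption: unlisted resources are unweighted
		}
		score -= useDeduction(m, w)
	}
	score -= float64(critical*10 + warning*5)
	if score < 0 {
		score = 0
	}
	if score > 100 {
		score = 100
	}
	return score
}

func main() {
	use := map[string]USE{
		"cpu":  {Utilization: 96, Saturation: 12}, // 15*1.5 + 8*1.5 = 34.5
		"disk": {Utilization: 72},                 // 3*1.0 = 3
	}
	fmt.Println(HealthScore(use, 1, 2)) // 100 - 37.5 - 20 = 42.5
}
```

In this example a single saturated CPU costs more points (34.5) than one critical and two warning anomalies combined (20), which shows how the resource weights dominate the score when a heavily weighted resource is under pressure.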

