Chapter 11: Anomaly Detection

Overview

melisai's anomaly detection engine (internal/model/anomaly.go) applies 37 threshold rules based on Brendan Gregg's recommended values and production best practices. Each rule evaluates a specific metric from collected data and flags it as warning or critical.

All rate-based rules use two-point sampling (delta/interval), so they reflect what is happening now rather than cumulative totals accumulated since boot.

How It Works

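type Threshold struct {
    Metric    string  // rule name, e.g. cpu_utilization
    Category  string  // rule group (CPU, Memory, Disk, ...)
    Warning   float64 // value at or above which a warning is raised
    Critical  float64 // value at or above which a critical is raised
    Evaluator func(report *Report) (float64, bool) // returns (value, found)
    Message   func(value float64) string           // formats the anomaly text
}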

Each threshold has an evaluator function that extracts the metric from the report and returns (value, found). If found is true and value >= Critical, a critical anomaly is created; otherwise, if value >= Warning, a warning anomaly is created. If found is false, the rule produces no anomaly.
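The evaluation order described above can be sketched as follows. This is a minimal illustration, not the code in internal/model/anomaly.go: the Report shape (a flat metric map), the Anomaly type, the Evaluate function, and the metricRule helper are all assumptions introduced for the example. Only the Threshold struct comes from the text.

```go
package main

import "fmt"

// Report is a hypothetical stand-in for melisai's collected data.
type Report struct {
	Metrics map[string]float64
}

// Threshold mirrors the struct shown in this chapter.
type Threshold struct {
	Metric    string
	Category  string
	Warning   float64
	Critical  float64
	Evaluator func(report *Report) (float64, bool)
	Message   func(value float64) string
}

// Anomaly is a hypothetical result type for this sketch.
type Anomaly struct {
	Metric   string
	Category string
	Severity string
	Value    float64
	Message  string
}

// Evaluate runs every rule and checks Critical before Warning, so a value
// that exceeds both limits is reported once, as critical.
func Evaluate(report *Report, rules []Threshold) []Anomaly {
	var out []Anomaly
	for _, t := range rules {
		value, found := t.Evaluator(report)
		if !found {
			continue // metric not collected: no anomaly
		}
		var severity string
		switch {
		case value >= t.Critical:
			severity = "critical"
		case value >= t.Warning:
			severity = "warning"
		default:
			continue
		}
		out = append(out, Anomaly{
			Metric:   t.Metric,
			Category: t.Category,
			Severity: severity,
			Value:    value,
			Message:  t.Message(value),
		})
	}
	return out
}

// metricRule is a hypothetical helper that builds a rule reading one key
// from Report.Metrics.
func metricRule(name, category string, warn, crit float64) Threshold {
	return Threshold{
		Metric: name, Category: category, Warning: warn, Critical: crit,
		Evaluator: func(r *Report) (float64, bool) {
			v, ok := r.Metrics[name]
			return v, ok
		},
		Message: func(v float64) string {
			return fmt.Sprintf("%s is %.1f", name, v)
		},
	}
}

func main() {
	report := &Report{Metrics: map[string]float64{
		"cpu_utilization": 97,
		"cpu_iowait":      12,
	}}
	rules := []Threshold{
		metricRule("cpu_utilization", "cpu", 80, 95),
		metricRule("cpu_iowait", "cpu", 10, 30),
		metricRule("swap_usage", "memory", 10, 50), // not in the report
	}
	for _, a := range Evaluate(report, rules) {
		fmt.Printf("[%s] %s\n", a.Severity, a.Message)
	}
}
```

Because the evaluator returns found separately from the value, a metric that was never collected (for example, a BCC tool that is unavailable) is skipped rather than being treated as zero.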

The 37 Rules

CPU (5 rules)

#	Metric	Warning	Critical	Source
1	cpu_utilization	> 80%	> 95%	/proc/stat (delta)
2	cpu_iowait	> 10%	> 30%	/proc/stat (delta)
3	load_average	> 2x CPUs	> 4x CPUs	/proc/loadavg
4	runqlat_p99	> 10ms	> 50ms	BCC runqlat histogram
5	cpu_psi_pressure	> 5%	> 25%	/proc/pressure/cpu

Memory (8 rules)

#	Metric	Warning	Critical	Source
6	memory_utilization	> 85%	> 95%	/proc/meminfo MemAvailable
7	swap_usage	> 10%	> 50%	/proc/meminfo
8	memory_psi_pressure	> 5%	> 25%	/proc/pressure/memory
9	cache_miss_ratio	> 5%	> 15%	BCC cachestat
10	direct_reclaim_rate	> 10/s	> 1000/s	/proc/vmstat pgscan_direct (rate)
11	compaction_stall_rate	> 1/s	> 100/s	/proc/vmstat compact_stall (rate)
12	thp_split_rate	> 1/s	> 100/s	/proc/vmstat thp_split_page (rate)
13	numa_miss_ratio	> 5%	> 20%	/sys/devices/system/node numastat

Disk (5 rules)

#	Metric	Warning	Critical	Source
14	disk_utilization	> 70%	> 90%	/proc/diskstats
15	disk_avg_latency	> 5ms	> 50ms	/proc/diskstats
16	biolatency_p99_ssd	> 5ms	> 25ms	BCC biolatency
17	biolatency_p99_hdd	> 50ms	> 200ms	BCC biolatency
18	io_psi_pressure	> 10%	> 50%	/proc/pressure/io

Network (15 rules)

#	Metric	Warning	Critical	Source
19	tcp_retransmits	> 10/s	> 50/s	/proc/net/snmp (rate)
20	tcp_timewait	> 5000	> 20000	ss
21	network_errors_per_sec	> 10/s	> 100/s	/proc/net/dev (rate)
22	conntrack_usage_pct	> 70%	> 90%	/proc/sys/net/netfilter
23	softnet_dropped	> 1/s	> 100/s	/proc/net/softnet_stat (rate)
24	listen_overflows	> 1/s	> 100/s	/proc/net/netstat (rate)
25	nic_rx_discards	> 100	> 10000	ethtool -S
26	tcp_close_wait	> 1	> 100	ss (current state)
27	softnet_time_squeeze	> 1/s	> 100/s	/proc/net/softnet_stat (rate)
28	tcp_abort_on_memory	> 0.1/s	> 1/s	/proc/net/netstat (rate)
29	irq_imbalance	> 5x ratio	> 20x ratio	/proc/softirqs (rate delta)
30	udp_rcvbuf_errors	> 1/s	> 100/s	/proc/net/snmp (rate)
31	tcp_rcvq_drop	> 1/s	> 100/s	/proc/net/netstat (rate)
32	tcp_zero_window_drop	> 1/s	> 50/s	/proc/net/netstat (rate)
33	listen_queue_saturation	> 70% fill	> 90% fill	ss -tnl Recv-Q/Send-Q

Container (2 rules)

#	Metric	Warning	Critical	Source
34	cpu_throttling	> 100 periods	> 1000 periods	cgroup cpu.stat
35	container_memory_usage	> 80%	> 95%	cgroup memory.current/max

System (1 rule)

#	Metric	Warning	Critical	Source
36	gpu_nic_cross_numa	> 1 pair	> 1 pair	sysfs PCI NUMA node

Other (1 rule)

#	Metric	Warning	Critical	Source
37	dns_latency_p99	> 50ms	> 200ms	BCC gethostlatency

Rate-Based Detection

Rules marked (rate) use two-point sampling: the collector reads the counter before and after a 1-second interval and computes delta / seconds. This eliminates false positives from cumulative counters on long-uptime systems, where a large total may reflect events that happened days ago.

Rate-based rules (those marked (rate) in the tables above): tcp_retransmits, network_errors_per_sec, softnet_dropped, softnet_time_squeeze, listen_overflows, tcp_abort_on_memory, irq_imbalance, udp_rcvbuf_errors, tcp_rcvq_drop, tcp_zero_window_drop, direct_reclaim_rate, compaction_stall_rate, thp_split_rate.
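The delta / seconds calculation can be sketched as below. The Rate function and its signature are assumptions made for illustration, not melisai's collector code. The guard against a counter that moves backwards (for example, after a reset) is also an added assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Rate converts two readings of a cumulative counter into a per-second rate.
// It returns ok=false when the interval is not positive or when the counter
// went backwards, since no meaningful rate can be computed in either case.
func Rate(before, after uint64, interval time.Duration) (float64, bool) {
	seconds := interval.Seconds()
	if seconds <= 0 || after < before {
		return 0, false
	}
	return float64(after-before) / seconds, true
}

func main() {
	// A counter with a large cumulative total but only 25 new events
	// during the 1-second sampling window.
	before, after := uint64(1_000_000), uint64(1_000_025)
	if r, ok := Rate(before, after, time.Second); ok {
		fmt.Printf("%.1f events/s\n", r)
	}
}
```

In this example the cumulative total of about one million would trip any absolute threshold, while the sampled rate of 25 events per second is what the rule actually compares against its warning and critical limits.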

Health Score

The health score (0-100) combines USE metric deductions and anomaly deductions:

USE deductions (weighted by resource importance):
- Weights: CPU 1.5x, Memory 1.5x, Disk 1.0x, Network 1.0x, Container 1.2x
- Utilization: >= 95% → -15*weight; >= 85% → -8*weight; >= 70% → -3*weight
- Saturation: > 50% → -15*weight; > 10% → -8*weight; > 1% → -3*weight
- Errors: > 1000 → -10*weight; > 100 → -5*weight; > 0 → -2*weight

Anomaly deductions (flat, not weighted):
- Critical anomaly = -10 points
- Warning anomaly = -5 points

Score clamped to [0, 100].
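A worked sketch of the scoring arithmetic follows. It rests on several assumptions that the text does not state: the score starts at 100, only the highest matching tier applies within each of utilization, saturation, and errors, and a resource without a listed weight contributes nothing. The USE type, the weights map, and the HealthScore function are hypothetical names.

```go
package main

import "fmt"

// USE holds the three USE-method readings for one resource (hypothetical type).
type USE struct {
	Utilization float64 // percent
	Saturation  float64 // percent
	Errors      float64 // count
}

// weights are the per-resource multipliers listed in this chapter.
var weights = map[string]float64{
	"cpu": 1.5, "memory": 1.5, "disk": 1.0, "network": 1.0, "container": 1.2,
}

// useDeduction returns the unweighted penalty for one resource.
// Assumption: only the highest matching tier in each group applies.
func useDeduction(m USE) float64 {
	var d float64
	switch {
	case m.Utilization >= 95:
		d += 15
	case m.Utilization >= 85:
		d += 8
	case m.Utilization >= 70:
		d += 3
	}
	switch {
	case m.Saturation > 50:
		d += 15
	case m.Saturation > 10:
		d += 8
	case m.Saturation > 1:
		d += 3
	}
	switch {
	case m.Errors > 1000:
		d += 10
	case m.Errors > 100:
		d += 5
	case m.Errors > 0:
		d += 2
	}
	return d
}

// HealthScore starts at 100 (assumption), subtracts weighted USE penalties
// and flat anomaly penalties, then clamps the result to [0, 100].
func HealthScore(use map[string]USE, criticals, warnings int) float64 {
	score := 100.0
	for resource, m := range use {
		// A resource missing from weights gets weight 0 in this sketch.
		score -= useDeduction(m) * weights[resource]
	}
	score -= float64(criticals)*10 + float64(warnings)*5
	if score < 0 {
		return 0
	}
	if score > 100 {
		return 100
	}
	return score
}

func main() {
	use := map[string]USE{
		"cpu":  {Utilization: 96}, // 15 * 1.5 = 22.5
		"disk": {Utilization: 72}, // 3 * 1.0 = 3
	}
	// One critical (-10) and one warning (-5) anomaly.
	fmt.Printf("%.1f\n", HealthScore(use, 1, 1)) // prints 59.5
}
```

The example works out as 100 - 22.5 - 3 - 10 - 5 = 59.5. Enough anomalies push the raw value below zero, which is where the clamp matters.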

Next: Chapter 12 — Recommendations Engine