Chapter 2: CPU Analysis — Deep Dive
Overview
The CPU is the most fundamental resource. When it's overloaded, everything slows down. But "CPU is high" is not a diagnosis — you need to know why it's high and what kind of load it is.
melisai's CPUCollector (internal/collector/cpu.go) answers these questions by reading /proc/stat twice and computing the delta.
Source File: cpu.go
- Lines: 207
- Functions: 10
- Tier: 1 (always available, no root needed)
The Collector Interface
Every collector in melisai implements this interface:
type Collector interface {
Name() string // e.g. "cpu_utilization"
Category() string // e.g. "cpu"
Collect(ctx context.Context, cfg CollectConfig) (*model.Result, error)
Available() Availability // tier + reason
}
For CPUCollector:
- Name() → "cpu_utilization"
- Category() → "cpu"
- Available() → Availability{Tier: 1} (always available)
Data Structures
cpuTimes — Raw Kernel Counters
type cpuTimes struct {
user uint64 // Running user-space code
nice uint64 // Running user-space code at lower priority
system uint64 // Running kernel code (syscalls)
idle uint64 // Doing nothing
iowait uint64 // Idle, waiting for I/O
irq uint64 // Handling hardware interrupts
softirq uint64 // Handling software interrupts
steal uint64 // Time stolen by hypervisor (VMs)
}
These are raw jiffie counts from /proc/stat. Each jiffy represents a tick of the system clock (typically 10ms on a 100 HZ kernel).
The total() method sums all states:
func (t cpuTimes) total() uint64 {
return t.user + t.nice + t.system + t.idle + t.iowait + t.irq + t.softirq + t.steal
}
CPUData — The Output
type CPUData struct {
UserPct float64 // user + nice as percentage
SystemPct float64 // kernel time percentage
IOWaitPct float64 // I/O wait percentage
IdlePct float64 // idle percentage
StealPct float64 // hypervisor steal
IRQPct float64 // hardware interrupt handling
SoftIRQPct float64 // software interrupt handling
LoadAvg1 float64 // 1-minute load average
LoadAvg5 float64 // 5-minute load average
LoadAvg15 float64 // 15-minute load average
NumCPUs int // logical CPU count
ContextSwitchesPerSec int64 // context switches per second
PerCPU []PerCPU // per-core breakdown
SchedLatencyNS int64 // CFS sched_latency_ns
SchedMinGranularityNS int64 // CFS sched_min_granularity_ns
}
Function-by-Function Walkthrough
Collect() — The Entry Point
func (c *CPUCollector) Collect(ctx context.Context, cfg CollectConfig) (*model.Result, error) {
startTime := time.Now()
// First sample
sample1, perCPU1, ctxSw1 := c.readProcStat()
// Wait for interval (default 1 second)
interval := cfg.SampleInterval
if interval == 0 {
interval = 1 * time.Second
}
select {
case <-time.After(interval):
case <-ctx.Done():
return nil, ctx.Err() // cancelled or timed out
}
// Second sample
sample2, perCPU2, ctxSw2 := c.readProcStat()
// Compute deltas
data := c.computeDelta(sample1, sample2)
// ... context switches, load avg, per-CPU, scheduler params ...
}
Key design decisions:
-
Two-point sampling: We read
/proc/stattwice, 1 second apart. The delta tells us what happened during that interval. -
Context cancellation: The
selectstatement ensures that if the orchestrator cancels (SIGINT, timeout), the collector returns immediately instead of blocking. -
Configurable interval: The
CollectConfig.SampleIntervaldefaults to 1 second, but can be extended for more accurate measurements.
readProcStat() — Parsing /proc/stat
func (c *CPUCollector) readProcStat() (cpuTimes, map[int]cpuTimes, uint64) {
f, err := os.Open(filepath.Join(c.procRoot, "stat"))
// ...
scanner := bufio.NewScanner(f)
for scanner.Scan() {
fields := strings.Fields(line)
if fields[0] == "cpu" && len(fields) >= 9 {
aggregate = parseCPULine(fields) // Total across all CPUs
} else if strings.HasPrefix(fields[0], "cpu") {
cpuNum, _ := strconv.Atoi(strings.TrimPrefix(fields[0], "cpu"))
perCPU[cpuNum] = parseCPULine(fields) // Per-CPU breakdown
} else if fields[0] == "ctxt" {
ctxSwitches, _ = strconv.ParseUint(fields[1], 10, 64)
}
}
return aggregate, perCPU, ctxSwitches
}
The function returns three values: 1. Aggregate CPU times (across all cores) 2. Per-CPU times (map: CPU number → cpuTimes) 3. Context switch count (total since boot)
parseCPULine() — Extracting Fields
func parseCPULine(fields []string) cpuTimes {
parse := func(idx int) uint64 {
if idx >= len(fields) { return 0 }
v, _ := strconv.ParseUint(fields[idx], 10, 64)
return v
}
return cpuTimes{
user: parse(1), // field[1] = user jiffies
nice: parse(2), // field[2] = nice jiffies
system: parse(3), // field[3] = system jiffies
idle: parse(4), // field[4] = idle jiffies
iowait: parse(5), // field[5] = iowait jiffies
irq: parse(6), // field[6] = hardware interrupt jiffies
softirq: parse(7), // field[7] = software interrupt jiffies
steal: parse(8), // field[8] = steal jiffies (VMs)
}
}
Why field indices start at 1: Field 0 is the label (cpu, cpu0, etc.), so the actual counters start at field 1.
computeDelta() — Percentage Calculation
func (c *CPUCollector) computeDelta(before, after cpuTimes) *model.CPUData {
totalDelta := float64(after.total() - before.total())
if totalDelta == 0 {
return &model.CPUData{} // no CPU time passed (impossible in practice)
}
return &model.CPUData{
UserPct: float64(after.user - before.user + after.nice - before.nice) / totalDelta * 100,
SystemPct: float64(after.system - before.system) / totalDelta * 100,
IOWaitPct: float64(after.iowait - before.iowait) / totalDelta * 100,
IdlePct: float64(after.idle - before.idle) / totalDelta * 100,
StealPct: float64(after.steal - before.steal) / totalDelta * 100,
IRQPct: float64(after.irq - before.irq) / totalDelta * 100,
SoftIRQPct: float64(after.softirq - before.softirq) / totalDelta * 100,
}
}
Key formula:
This works because all CPU states are mutually exclusive and exhaustive — they always sum to 100%.
Note: user and nice are combined into UserPct. Both represent user-space code, just at different scheduling priorities.
computePerCPUDeltas() — Per-Core Analysis
func (c *CPUCollector) computePerCPUDeltas(before, after map[int]cpuTimes) []model.PerCPU {
var result []model.PerCPU
for cpuNum, afterTimes := range after {
beforeTimes, ok := before[cpuNum]
if !ok { continue }
totalDelta := float64(afterTimes.total() - beforeTimes.total())
if totalDelta == 0 { continue }
result = append(result, model.PerCPU{
CPU: cpuNum,
UserPct: float64(afterTimes.user - beforeTimes.user + ...) / totalDelta * 100,
SystemPct: ...,
IOWaitPct: ...,
IdlePct: ...,
})
}
return result
}
Why per-CPU matters: On a 32-core server, aggregate CPU might show 50% utilization, but a single-threaded bottleneck will show 1 core at 100% and 31 cores idle. Per-CPU data reveals this.
readLoadAvg() — Saturation Indicator
func (c *CPUCollector) readLoadAvg() (float64, float64, float64) {
data, _ := os.ReadFile(filepath.Join(c.procRoot, "loadavg"))
// "1.23 0.98 0.76 2/1234 56789"
// ^1m ^5m ^15m ^running/total ^last_pid
fields := strings.Fields(string(data))
la1, _ := strconv.ParseFloat(fields[0], 64)
la5, _ := strconv.ParseFloat(fields[1], 64)
la15, _ := strconv.ParseFloat(fields[2], 64)
return la1, la5, la15
}
Load average is the average number of processes in the run queue OR in uninterruptible sleep (D state). It includes processes waiting for I/O.
Interpreting load average: | Load / # CPUs | Meaning | |--------------|---------| | < 1.0 | Under capacity — room for more work | | = 1.0 | At capacity — no spare cycles | | > 1.0 | Over capacity — work is queuing | | > 2.0 | Significant queuing — noticeable latency | | > 4.0 | Heavily overloaded |
melisai normalizes load average by dividing by the number of CPUs:
A ratio > 1.0 means saturation.CFS Scheduler Parameters
The Completely Fair Scheduler (CFS) uses two parameters that affect latency:
-
sched_latency_ns(default: 6ms): The target time for one complete scheduling round. All runnable tasks should get to run within this time window. -
sched_min_granularity_ns(default: 0.75ms): Minimum time a task runs before being preempted. Prevents excessive context switching.
melisai reads these from:
If sched_latency_ns is very high (e.g., 24ms on some cloud providers), scheduling latency will be poor — but throughput is better. This is a classic latency-throughput tradeoff.
Context Switches
Every time the CPU switches from one process to another, it performs a context switch. The cost is typically 1-10 microseconds, but at very high rates (>100K/sec), the overhead becomes significant.
melisai computes context switches per second:
ctxSwDelta := ctxSw2 - ctxSw1
data.ContextSwitchesPerSec = int64(float64(ctxSwDelta) / interval.Seconds())
Interpreting context switch rates: | Rate | Assessment | |------|-----------| | < 10K/s | Normal for most workloads | | 10K-50K/s | Active system, check if voluntary | | 50K-200K/s | High. Either many threads or locking contention | | > 200K/s | Very high. Likely indicates thread/lock issues |
What to Look For in CPU Results
Healthy System
{
"user_pct": 25.0,
"system_pct": 5.0,
"iowait_pct": 0.1,
"idle_pct": 69.9,
"load_avg_1": 2.5,
"num_cpus": 8
}
CPU-Bound Application
{
"user_pct": 92.0,
"system_pct": 3.0,
"idle_pct": 5.0,
"per_cpu": [
{"cpu": 0, "user_pct": 99.0, "idle_pct": 1.0},
{"cpu": 1, "user_pct": 2.0, "idle_pct": 98.0},
...
]
}