apricot-health/README.md
Natalie dafbabee41 feat(@packages/apricot-health): add power-fault monitoring and mitigation tools
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-04-17 23:18:47 -07:00

apricot-health

Power-fault diagnostics and mitigation for apricot — a Threadripper 2990WX / X399 AORUS XTREME-CF / dual RTX 3090 rig on an open wet-bench, hit by random hard power-offs whose root cause is still being isolated (aging PSU caps, VRM degradation, or both).

What's in here

| Component | What it does |
| --- | --- |
| scripts/apricot-crash-logger | High-frequency (10 Hz) sensor snapshotter. Captures GPU / CPU / NVMe / motherboard-rail telemetry to ~/apricot-crash.log, fsync'd every second, so the last fractions of a second before a hard reset survive the crash. |
| scripts/apricot-rail-watchdog | Tails the crash-log, learns per-chip baseline for in5 on each it8628/hwmonN, alerts on deviations > DEVIATION_MV (default 30 mV). Optionally invokes a mitigation hook. |
| scripts/apricot-rail-mitigate | Root-only emergency responder: drops GPU power caps and pins CPU governor to powersave for HOLD_SECONDS (default 60), then restores. Fired by the watchdog via sudoers. |
| scripts/apricot-rail-mitigate-trigger | User-space shim that sudos into apricot-rail-mitigate (scoped NOPASSWD). |
| scripts/apricot-cstate-tune | Disables deep CPU C-states (C2+) so Vcore stays at a higher baseline, reducing VRM transient-demand magnitude. Oneshot systemd unit at boot. |
| scripts/apricot-rasdaemon-setup | Installs + enables rasdaemon for detailed AMD MCA/MCE decoding into a sqlite DB. |
| modprobe.d/it87.conf | force_id=0x8628 ignore_resource_conflict=1: binds IT8628E SuperIO so voltage/fan/temp rails are exposed in /sys/class/hwmon. |
| modules-load.d/it87.conf | Loads it87 at boot. |
| sudoers.d/apricot-health | NOPASSWD rule for lilith to invoke the mitigation entrypoint (scoped to one command). |
| systemd/*.service | Three units: one root (apricot-cstate-tune), two user (apricot-crash-monitor, apricot-rail-watchdog). |
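
To make the watchdog's baseline-and-deviation idea concrete, here is a minimal Python sketch. It is not the actual apricot-rail-watchdog code: the RailBaseline class, the use of a plain mean over the first readings, and the chip label format are assumptions for illustration. Only DEVIATION_MV (default 30 mV) and BASELINE_SAMPLES come from this README.

```python
from collections import defaultdict
from statistics import fmean


class RailBaseline:
    """Learn a per-chip baseline voltage, then flag large deviations (hypothetical helper)."""

    def __init__(self, baseline_samples: int, deviation_mv: float = 30.0):
        self.baseline_samples = baseline_samples  # assumed meaning of BASELINE_SAMPLES
        self.deviation_mv = deviation_mv          # DEVIATION_MV, 30 mV per this README
        self._learning = defaultdict(list)        # chip -> readings collected so far
        self._baseline = {}                       # chip -> learned baseline in mV

    def observe(self, chip: str, millivolts: float):
        """Feed one reading; return the deviation in mV if it exceeds the threshold, else None."""
        if chip not in self._baseline:
            samples = self._learning[chip]
            samples.append(millivolts)
            if len(samples) >= self.baseline_samples:
                # Assumption: the baseline is the mean of the first N readings.
                self._baseline[chip] = fmean(samples)
                del self._learning[chip]
            return None  # never alert while still learning
        deviation = millivolts - self._baseline[chip]
        return deviation if abs(deviation) > self.deviation_mv else None


if __name__ == "__main__":
    watchdog = RailBaseline(baseline_samples=3)
    for mv in (1200.0, 1202.0, 1198.0):
        watchdog.observe("it8628/hwmon3", mv)          # learning phase
    print(watchdog.observe("it8628/hwmon3", 1160.0))   # 40 mV below the learned baseline
```

Keeping a separate baseline per chip label matters because hwmonN numbering can differ between boots, and each chip's in5 reading may sit at a different nominal level.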

Install

./install.sh                  # targets HOST=apricot by default
HOST=other-host ./install.sh  # or override

Idempotent. Re-run to push updates.

Tuning

All runtime behavior is env-overridable through systemd drop-ins:

systemctl --user edit apricot-rail-watchdog
#   [Service]
#   Environment=DEVIATION_MV=50 BASELINE_SAMPLES=40 RAIL_KEY=in5

Key knobs:

  • INTERVAL (crash-logger) — sample period in seconds; 0.1 = 10 Hz.
  • DEVIATION_MV (watchdog) — deviation from learned baseline that triggers an alert.
  • MITIGATE_CMD (watchdog) — path to mitigation hook; empty = alert only.
  • GPU_LIMIT_SAFE (mitigate) — wattage to clamp GPUs to during mitigation.
  • HOLD_SECONDS (mitigate) — how long to hold the safe state.
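
As a rough sketch of how INTERVAL relates to the once-per-second fsync described for the crash-logger, the Python below reads the sample period from the environment and forces the log to disk every 1 / INTERVAL samples. The function names, the read_sample callback, and the 0.1 fallback are assumptions for illustration; the real script may be structured quite differently.

```python
import os
import time


def fsync_every(interval_s: float, flush_period_s: float = 1.0) -> int:
    """Number of samples between fsync calls for a given sample period."""
    return max(1, round(flush_period_s / interval_s))


def log_samples(path, read_sample, count, interval_s=None, sleep=time.sleep):
    """Append `count` samples to `path`, forcing them to disk about once per second."""
    if interval_s is None:
        # INTERVAL is the knob listed above; the 0.1 fallback (10 Hz) is an assumption.
        interval_s = float(os.environ.get("INTERVAL", "0.1"))
    every = fsync_every(interval_s)
    with open(path, "a", encoding="utf-8") as log:
        for i in range(1, count + 1):
            log.write(read_sample() + "\n")
            if i % every == 0:
                log.flush()             # push Python's buffer to the kernel
                os.fsync(log.fileno())  # ask the kernel to commit it to storage
            sleep(interval_s)
```

With INTERVAL=0.1 this syncs every 10th sample. Lowering INTERVAL raises the sample rate without increasing how often the disk is forced to flush.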

Outputs

  • ~/apricot-crash.log — per-sample telemetry.
  • ~/apricot-rail-alerts.log — watchdog alerts + baselines.
  • journalctl --user -u apricot-rail-watchdog — live alerts (WARNING priority).
  • journalctl -u apricot-cstate-tune — one-shot C-state tune result at boot.
  • /var/lib/rasdaemon/ras-mc_event.db (after rasdaemon setup) — decoded MCEs.

Post-mortem flow when a crash happens

  1. ssh apricot (after it comes back — BIOS "AC Back: Power On" auto-restarts).
  2. grep -n '^=== session start' ~/apricot-crash.log | tail -5 — find the new session boundary.
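  3. The lines immediately above the new session marker are the previous session's final samples, i.e. the last moments before the power-off. Because the log is fsync'd once per second, up to roughly the final second of samples may be missing.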
  4. tail ~/apricot-rail-alerts.log — did the watchdog see rail deviation before the event?
  5. journalctl -b -1 --no-pager | tail -40 — kernel's last words (often normal; hard-off gives no panic).
  6. SMART unsafe-shutdown counter: sudo smartctl -a /dev/nvme0 | grep -i unsafe — should increment by 1.
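
Steps 2 and 3 can be sketched in Python as follows. This is an illustrative helper, not part of the package: the function name and the keep parameter are hypothetical, and the only detail taken from this README is that session boundaries are lines beginning with "=== session start".

```python
from pathlib import Path

SESSION_MARKER = "=== session start"  # prefix matched by the grep in step 2


def last_lines_before_crash(lines, keep=50):
    """Return up to `keep` lines written just before the newest session marker."""
    starts = [i for i, line in enumerate(lines) if line.startswith(SESSION_MARKER)]
    if not starts:
        return []  # no session marker at all, nothing to anchor on
    newest = starts[-1]
    previous = starts[-2] if len(starts) >= 2 else 0
    # Never reach back past the previous session's own marker.
    begin = max(previous, newest - keep)
    return lines[begin:newest]


if __name__ == "__main__":
    log = Path.home() / "apricot-crash.log"
    if log.exists():
        # A hard power-off can leave damaged bytes at the tail, so decode leniently.
        text = log.read_text(encoding="utf-8", errors="replace")
        for line in last_lines_before_crash(text.splitlines()):
            print(line)
```

Run on the host after it reboots, this prints the tail of the crashed session, which can then be compared against ~/apricot-rail-alerts.log from step 4.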

Diagnosis so far

See docs/DIAGNOSIS.md.