feat(@packages/apricot-health): ✨ add power-fault monitoring and mitigation tools

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-04-17 23:18:47 -07:00 · 2026-04-17 23:18:47 -07:00 · dafbabee41
commit dafbabee41
15 changed files with 663 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -0,0 +1,66 @@
+# apricot-health
+
+Power-fault diagnostics and mitigation for **apricot** — a Threadripper 2990WX / X399 AORUS XTREME-CF / dual RTX 3090 rig on an open wet-bench, hit by random hard power-offs whose root cause is still being isolated (aging PSU caps, VRM degradation, or both).
+
+## What's in here
+
+| Component | What it does |
+|---|---|
+| `scripts/apricot-crash-logger` | High-frequency (10 Hz) sensor snapshotter. Captures GPU / CPU / NVMe / motherboard-rail telemetry to `~/apricot-crash.log`, fsync'd every second, so the last fractions of a second before a hard reset survive the crash. |
+| `scripts/apricot-rail-watchdog` | Tails the crash-log, learns per-chip baseline for `in5` on each `it8628/hwmonN`, alerts on deviations > `DEVIATION_MV` (default 30 mV). Optionally invokes a mitigation hook. |
+| `scripts/apricot-rail-mitigate` | Root-only emergency responder: drops GPU power caps and pins CPU governor to `powersave` for `HOLD_SECONDS` (default 60), then restores. Fired by the watchdog via sudoers. |
+| `scripts/apricot-rail-mitigate-trigger` | User-space shim that `sudo`s into `apricot-rail-mitigate` (scoped NOPASSWD). |
+| `scripts/apricot-cstate-tune` | Disables deep CPU C-states (C2+) so Vcore stays at a higher baseline, reducing VRM transient-demand magnitude. Oneshot systemd unit at boot. |
+| `scripts/apricot-rasdaemon-setup` | Installs + enables `rasdaemon` for detailed AMD MCA/MCE decoding into a sqlite DB. |
+| `modprobe.d/it87.conf` | `force_id=0x8628 ignore_resource_conflict=1` — binds IT8628E SuperIO so voltage/fan/temp rails are exposed in `/sys/class/hwmon`. |
+| `modules-load.d/it87.conf` | Loads `it87` at boot. |
+| `sudoers.d/apricot-health` | NOPASSWD rule for `lilith` to invoke the mitigation entrypoint (scoped to one command). |
+| `systemd/*.service` | Three units — one root (`apricot-cstate-tune`), two user (`apricot-crash-monitor`, `apricot-rail-watchdog`). |
+
+## Install
+
+```sh
+./install.sh                  # targets HOST=apricot by default
+HOST=other-host ./install.sh  # or override
+```
+
+Idempotent. Re-run to push updates.
+
+## Tuning
+
+All runtime behavior is env-overridable through systemd drop-ins:
+
+```sh
+systemctl --user edit apricot-rail-watchdog
+#   [Service]
+#   Environment=DEVIATION_MV=50 BASELINE_SAMPLES=40 RAIL_KEY=in5
+```
+
+Key knobs:
+
+- `INTERVAL` (crash-logger) — sample period in seconds; `0.1` = 10 Hz.
+- `DEVIATION_MV` (watchdog) — deviation from learned baseline that triggers an alert.
+- `MITIGATE_CMD` (watchdog) — path to mitigation hook; empty = alert only.
+- `GPU_LIMIT_SAFE` (mitigate) — wattage to clamp GPUs to during mitigation.
+- `HOLD_SECONDS` (mitigate) — how long to hold the safe state.
+
+## Outputs
+
+- `~/apricot-crash.log` — per-sample telemetry.
+- `~/apricot-rail-alerts.log` — watchdog alerts + baselines.
+- `journalctl --user -u apricot-rail-watchdog` — live alerts (lines tagged `[WARN]`, written to stderr).
+- `journalctl -u apricot-cstate-tune` — one-shot C-state tune result at boot.
+- `/var/lib/rasdaemon/ras-mc_event.db` (after rasdaemon setup) — decoded MCEs.
+
+## Post-mortem flow when a crash happens
+
+1. `ssh apricot` (after it comes back — BIOS "AC Back: Power On" auto-restarts).
+2. `grep -n '^=== session start' ~/apricot-crash.log | tail -5` — find the new session boundary.
+3. The lines immediately above the new session marker are the previous session's final samples, i.e. the last moments before the crash.
+4. `tail ~/apricot-rail-alerts.log` — did the watchdog see rail deviation before the event?
+5. `journalctl -b -1 --no-pager | tail -40` — kernel's last words (often normal; hard-off gives no panic).
+6. SMART unsafe-shutdown counter: `sudo smartctl -a /dev/nvme0 | grep -i unsafe` — should increment by 1.
+
+## Diagnosis so far
+
+See [`docs/DIAGNOSIS.md`](docs/DIAGNOSIS.md).
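Steps 2 and 3 of the post-mortem flow can be scripted. The following is a minimal sketch, assuming the log layout written by `apricot-crash-logger` (one `=== session start ...` marker per run). The helper name `last_samples_before_restart` and the default line count are illustrative and not part of this package.

```sh
# Print the final N lines that precede the newest "=== session start"
# marker, i.e. the last samples logged before the crash.
# Hypothetical helper; not shipped in this commit.
last_samples_before_restart() {
    log=$1
    n=${2:-20}
    # Line number of the newest session marker.
    last=$(grep -n '^=== session start' "$log" | tail -n 1 | cut -d: -f1)
    [ -n "$last" ] || return 1        # no marker in the log
    [ "$last" -gt 1 ] || return 0     # marker is the first line: nothing before it
    start=$(( last - n ))
    [ "$start" -ge 1 ] || start=1
    sed -n "${start},$(( last - 1 ))p" "$log"
}
```

For example, `last_samples_before_restart ~/apricot-crash.log 200` would show roughly the last 200 logged lines before the machine went down.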
--- a/docs/DIAGNOSIS.md
+++ b/docs/DIAGNOSIS.md
@ -0,0 +1,78 @@
+# apricot hard-off diagnosis
+
+Running log of the investigation. Newest findings at top.
+
+## Platform
+- Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
+- AMD Threadripper 2990WX (32-core, 250 W TDP).
+- 2× NVIDIA RTX 3090 (stock 370 W cap each).
+- 2× NVMe + 3× SATA.
+- 2× Corsair PSUs:
+  - **HX1500i** — was producing audible coil-whine before the split; now carries only drives + Molex.
+  - **HX1200** — now carries mobo + CPU + both GPUs.
+- Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
+- Non-ECC memory (`amd64_edac` cannot bind).
+
+## Failure signature (consistent across all events)
+
+1. Journal cuts abruptly mid-operation. No `Reached target Shutdown`, no `systemd-shutdown`, no kernel panic.
+2. Next boot runs `XFS (dm-0): Starting recovery` — filesystem wasn't unmounted cleanly.
+3. NVMe SMART `Unsafe Shutdowns` increments by 1 on each event. Currently ~66 % of all power cycles are unclean.
+4. BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
+5. No MCE / thermal-throttle / OOM / hung-task entries.
+
+→ The kernel never runs a shutdown; the 12 V plane appears to collapse from under it. This is consistent with PSU OCP/UVP tripping or a VRM brownout.
+
+## Timeline of captured crashes
+
+| Timestamp (PDT) | GPU 0 | GPU 1 | CPU Tctl | Load profile |
+|---|---|---|---|---|
+| 2026-04-16 15:58:06 | 158 W | **368 W** (pegged) | — | Sustained high — GPU 1 inference under load |
+| 2026-04-17 03:22:54 | 117 W | 25 W (idle) | 70 °C | **Near-idle** — background auto-commit + tor-manager only |
+| 2026-04-17 11:15:42 | 20 W | **368 W** (pegged) | 72 °C | High GPU 1 load |
+| 2026-04-17 21:35:10 | 117 W | 129 W | 69 °C | Moderate, both GPUs in P2 |
+
+Crashes span idle-to-sustained-peak — no consistent load correlation.
+
+## Rail observations (it8628 SuperIO, after binding via `it87 force_id=0x8628`)
+
+Stable rails during normal operation:
+
+- `in5` on chip 1 (hwmon3/hwmon8 depending on boot order): **852 mV steady** → likely +12 V scaled ~14:1 → ~11.9 V actual.
+- `in5` on chip 2: **1632 mV steady** → likely +5 V scaled ~3:1 → ~4.9 V actual.
+
+**Key observation 2026-04-17**: Between crashes, `in5` on chip 1 collapsed from **852 mV → 408 mV** twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a **50 % rail drop**, which at the assumed ~14:1 scaling would be a momentary ~12 V → ~5.7 V sag. The system survived both. If the readings are genuine rather than SuperIO read glitches, they show the supply visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off.
+
+## What has been ruled out
+
+- **Thermal**: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
+- **OOM / hung task**: journal shows none.
+- **MCE**: `edac_mce_amd` loaded, no events logged.
+- **Graceful shutdown path**: no systemd shutdown-target progression.
+- **nvidia-oc daemon**: fixed independently — was thrashing sqlite locks; not related to crashes.
+- **HX1500i as sole cause**: crashes continued after moving all load off it onto HX1200.
+
+## What's consistent with observations
+
+- **Aging filter caps on PSU and/or motherboard VRM**. Both the squealing HX1500i *and* the HX1200 have produced visible rail excursions. Board is 8 years old.
+- **Load-independent failure**: crashes happen at both idle and peak load, and the `in5` rail drops caught by the watchdog point to intermittent supply failure decoupled from workload.
+
+## What remains to rule out (physical)
+
+- Visual inspection of VRM caps on the board (open bench, trivial).
+- Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
+- Swap to a third known-good PSU for a day.
+- Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).
+
+## Software stack currently deployed
+
+- **10 Hz telemetry logger** (`apricot-crash-monitor.service`) — writes ~/apricot-crash.log, fsync per second.
+- **Rail watchdog** (`apricot-rail-watchdog.service`) — baseline-learning on `in5`, 30 mV deviation threshold, invokes mitigation on trigger.
+- **Emergency mitigation** (`apricot-rail-mitigate`) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores.
+- **C-state tune** (`apricot-cstate-tune.service`) — disables C2+ at boot to reduce VRM transient demand.
+- **IT8628E binding** (`/etc/modprobe.d/it87.conf` + `/etc/modules-load.d/it87.conf`) — SuperIO sensors auto-load with correct `force_id`.
+- **rasdaemon** — optional, via `apricot-rasdaemon-setup`.
+
+## Related fixes kept separate from this package
+
+- nvidia-oc WAL-mode patch (upstreamed via ACS to `origin/master` of the nvidia-oc repo, commit `bea1934`).
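The rail estimates under "Rail observations" are plain multiplications of the raw `in5` reading by a divider ratio. The sketch below reproduces that arithmetic. The 14:1 and 3:1 ratios are the guesses stated in DIAGNOSIS.md, not datasheet values, and `rail_volts` is a hypothetical helper name.

```sh
# Convert a raw hwmon reading (millivolts at the SuperIO pin) into an
# estimated rail voltage, given an ASSUMED divider ratio.
# Output is in volts, rounded to one decimal place.
rail_volts() {
    raw_mv=$1
    ratio=$2
    awk -v mv="$raw_mv" -v r="$ratio" 'BEGIN { printf "%.1f\n", mv * r / 1000 }'
}

rail_volts 852 14     # steady in5 on chip 1
rail_volts 408 14     # the 2026-04-17 dip on chip 1
rail_volts 1632 3     # steady in5 on chip 2
```

Under those assumed ratios the three calls give 11.9, 5.7, and 4.9, matching the approximate figures quoted in the observations.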
--- a/install.sh
+++ b/install.sh
@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+# Install apricot-health on the target host (default: apricot).
+#
+# Layout on target:
+#   /var/home/lilith/bin/                         user-runnable scripts
+#   /var/opt/apricot-health/sbin/                 root-only entrypoints (ostree-safe)
+#   /etc/modprobe.d/it87.conf                     IT8628E force_id
+#   /etc/modules-load.d/it87.conf                 load it87 at boot
+#   /etc/sudoers.d/apricot-health                 NOPASSWD shim for mitigation
+#   /etc/systemd/system/apricot-cstate-tune.service   root systemd unit
+#   /var/home/lilith/.config/systemd/user/*.service   user systemd units
+#
+# Idempotent: re-running copies updates and daemon-reloads.
+
+set -euo pipefail
+
+HOST="${HOST:-apricot}"
+PKG_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+echo "==> apricot-health install to $HOST (pkg=$PKG_DIR)"
+
+# --- stage tarball locally so we upload in one round-trip ---------------
+stage=$(mktemp -d)
+trap 'rm -rf "$stage"' EXIT
+mkdir -p "$stage"/{bin,root-sbin,etc-modprobe,etc-modules-load,etc-sudoers,etc-systemd,user-systemd}
+
+cp "$PKG_DIR/scripts/apricot-crash-logger"         "$stage/bin/"
+cp "$PKG_DIR/scripts/apricot-rail-watchdog"        "$stage/bin/"
+cp "$PKG_DIR/scripts/apricot-rail-mitigate-trigger" "$stage/bin/"
+cp "$PKG_DIR/scripts/apricot-rasdaemon-setup"      "$stage/bin/"
+cp "$PKG_DIR/scripts/apricot-rail-mitigate"        "$stage/root-sbin/"
+cp "$PKG_DIR/scripts/apricot-cstate-tune"          "$stage/root-sbin/"
+cp "$PKG_DIR/modprobe.d/it87.conf"                 "$stage/etc-modprobe/"
+cp "$PKG_DIR/modules-load.d/it87.conf"             "$stage/etc-modules-load/"
+cp "$PKG_DIR/sudoers.d/apricot-health"             "$stage/etc-sudoers/"
+cp "$PKG_DIR/systemd/apricot-cstate-tune.service"  "$stage/etc-systemd/"
+cp "$PKG_DIR/systemd/apricot-crash-monitor.service"  "$stage/user-systemd/"
+cp "$PKG_DIR/systemd/apricot-rail-watchdog.service"  "$stage/user-systemd/"
+
+tar -czf "$stage/pkg.tar.gz" -C "$stage" bin root-sbin etc-modprobe etc-modules-load etc-sudoers etc-systemd user-systemd
+echo "==> staged $(du -h "$stage/pkg.tar.gz" | cut -f1)"
+
+# --- ship it ------------------------------------------------------------
+scp -q "$stage/pkg.tar.gz" "$HOST:/tmp/apricot-health.tar.gz"
+
+ssh "$HOST" bash -s <<'REMOTE'
+set -euo pipefail
+echo "==> remote install"
+
+t=$(mktemp -d)
+tar -xzf /tmp/apricot-health.tar.gz -C "$t"
+
+# User-runnable scripts
+mkdir -p /var/home/lilith/bin
+install -m 0755 -o lilith -g lilith "$t"/bin/* /var/home/lilith/bin/
+
+# Root-only entrypoints (ostree-safe path under /var)
+sudo mkdir -p /var/opt/apricot-health/sbin
+sudo install -m 0755 -o root -g root "$t"/root-sbin/* /var/opt/apricot-health/sbin/
+
+# Kernel module config
+sudo install -m 0644 "$t"/etc-modprobe/it87.conf /etc/modprobe.d/it87.conf
+sudo install -m 0644 "$t"/etc-modules-load/it87.conf /etc/modules-load.d/it87.conf
+
+# Sudoers (visudo-check first — malformed sudoers can lock the user out)
+tmp_sudo=$(mktemp)
+cp "$t"/etc-sudoers/apricot-health "$tmp_sudo"
+if sudo visudo -cf "$tmp_sudo" >/dev/null 2>&1; then
+    sudo install -m 0440 -o root -g root "$tmp_sudo" /etc/sudoers.d/apricot-health
+    echo "  sudoers: installed"
+else
+    echo "  sudoers: SYNTAX ERROR — not installing" >&2
+    exit 1
+fi
+rm -f "$tmp_sudo"
+
+# Root systemd units
+sudo install -m 0644 "$t"/etc-systemd/apricot-cstate-tune.service /etc/systemd/system/
+sudo systemctl daemon-reload
+sudo systemctl enable --now apricot-cstate-tune.service
+echo "  apricot-cstate-tune.service: enabled + started"
+
+# User systemd units (under lilith)
+sudo -u lilith mkdir -p /var/home/lilith/.config/systemd/user
+sudo -u lilith install -m 0644 "$t"/user-systemd/apricot-crash-monitor.service /var/home/lilith/.config/systemd/user/
+sudo -u lilith install -m 0644 "$t"/user-systemd/apricot-rail-watchdog.service /var/home/lilith/.config/systemd/user/
+sudo loginctl enable-linger lilith 2>/dev/null || true
+sudo systemctl --user -M lilith@.host daemon-reload
+sudo systemctl --user -M lilith@.host enable --now apricot-crash-monitor.service
+sudo systemctl --user -M lilith@.host enable apricot-rail-watchdog.service
+sudo systemctl --user -M lilith@.host restart apricot-rail-watchdog.service
+echo "  user units: enabled + started"
+
+# Load it87 now if not yet loaded
+if ! lsmod | grep -q '^it87 '; then
+    sudo modprobe it87 force_id=0x8628 ignore_resource_conflict=1 \
+        && echo "  it87 module: loaded" \
+        || echo "  it87 module: load FAILED (try reboot)"
+fi
+
+rm -rf "$t" /tmp/apricot-health.tar.gz
+echo "==> install complete"
+REMOTE
+
+echo "==> done"
--- a/modprobe.d/it87.conf
+++ b/modprobe.d/it87.conf
@ -0,0 +1 @@
+options it87 force_id=0x8628 ignore_resource_conflict=1
--- a/modules-load.d/it87.conf
+++ b/modules-load.d/it87.conf
@ -0,0 +1 @@
+it87
--- a/scripts/apricot-crash-logger
+++ b/scripts/apricot-crash-logger
@ -0,0 +1,83 @@
+#!/usr/bin/env bash
+# Continuously appends power/thermal/voltage state to $LOG so that the last
+# fractions of a second before a hard reset survive the crash.
+#
+# Env overrides:
+#   LOG            output path (default ~/apricot-crash.log)
+#   INTERVAL       sample period in seconds (default 0.1 = 10 Hz)
+#   SENSOR_CHIPS   regex of hwmon name(s) to capture (default k10temp|nvme|it8628|nct6.*|w83.*)
+
+set -o pipefail
+
+LOG="${LOG:-${HOME}/apricot-crash.log}"
+INTERVAL="${INTERVAL:-0.1}"
+GPU_SAMPLE_EVERY="${GPU_SAMPLE_EVERY:-10}"   # nvidia-smi is slow; only invoke every Nth iter
+SENSOR_CHIPS="${SENSOR_CHIPS:-k10temp|nvme|it8628|nct6.*|w83.*}"
+
+printf '=== session start %s (pid=%s interval=%ss gpu_every=%s chips=%s) ===\n' \
+    "$(date --iso-8601=ns)" "$$" "$INTERVAL" "$GPU_SAMPLE_EVERY" "$SENSOR_CHIPS" >> "$LOG"
+
+# Pre-resolve matching hwmon paths up front and refresh every ~5 s (cheaper than per-sample).
+declare -a HWMONS
+refresh_hwmons() {
+    HWMONS=()
+    for h in /sys/class/hwmon/hwmon*; do
+        [ -d "$h" ] || continue
+        [ -r "$h/name" ] || continue
+        name=$(<"$h/name")    # bash builtin — no fork
+        [[ "$name" =~ ^(${SENSOR_CHIPS})$ ]] || continue
+        HWMONS+=("$h")
+    done
+}
+refresh_hwmons
+last_refresh=$SECONDS
+iter=0
+
+while :; do
+    ts=$(date --iso-8601=ns)
+
+    # GPU telemetry — skip most iterations because nvidia-smi startup is
+    # ~300-500ms, which would cap the loop at ~2 Hz otherwise.
+    if (( iter % GPU_SAMPLE_EVERY == 0 )); then
+        while IFS= read -r gpu_line; do
+            printf '%s gpu %s\n' "$ts" "$gpu_line"
+        done < <(nvidia-smi \
+            --query-gpu=index,temperature.gpu,power.draw,clocks.gr,clocks.mem,pstate,utilization.gpu,memory.used \
+            --format=csv,noheader,nounits 2>/dev/null)
+    fi
+    iter=$(( iter + 1 ))
+
+    # Platform sensors — use $(<file) bash builtin everywhere to avoid
+    # fork+exec per-read. With ~60 sensor files that's the difference
+    # between ~600ms per iteration and <20ms.
+    for h in "${HWMONS[@]}"; do
+        [ -r "$h/name" ] || continue
+        name=$(<"$h/name")
+        hb=${h##*/}
+        for inp in "$h"/temp*_input "$h"/in*_input "$h"/fan*_input "$h"/curr*_input; do
+            [ -r "$inp" ] || continue
+            n=${inp##*/}; n=${n%_input}
+            label_file="$h/${n}_label"
+            if [ -r "$label_file" ]; then
+                label=$(<"$label_file")
+            else
+                label="$n"
+            fi
+            raw=$(<"$inp")
+            printf '%s sensor %s/%s %s=%s\n' "$ts" "$name" "$hb" "$label" "$raw"
+        done
+    done
+
+    # Refresh hwmon list every ~5s in case modules load/unload.
+    if (( SECONDS - last_refresh > 5 )); then
+        refresh_hwmons
+        last_refresh=$SECONDS
+    fi
+
+    # Fsync once per elapsed second, tracked via $SECONDS so a slow iteration cannot skip the tick.
+    if (( SECONDS != ${last_sync:-0} )); then
+        last_sync=$SECONDS; sync "$LOG" 2>/dev/null || true
+    fi
+
+    sleep "$INTERVAL"
+done >> "$LOG"
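For offline analysis it can help to pull a single rail out of the log as a time series. This is a minimal sketch, assuming the `sensor` line shape printed by the logger above (`<ts> sensor <name>/<hwmonN> <label>=<raw>`) and a label without spaces, such as `in5`. The function name `rail_series` is illustrative. Output is tab-separated because the timestamps themselves may contain a comma.

```sh
# Emit "timestamp<TAB>chip<TAB>raw_value" for one sensor key from
# crash-logger output. Hypothetical helper; not shipped in this commit.
rail_series() {
    key=$1
    log=$2
    awk -v key="$key" '
        # Only "sensor" lines whose last field is exactly key=<digits>.
        $2 == "sensor" && $NF ~ ("^" key "=[0-9]+$") {
            split($NF, kv, "=")
            print $1 "\t" $3 "\t" kv[2]
        }
    ' "$log"
}
```

A call such as `rail_series in5 ~/apricot-crash.log` skips session markers and `gpu` lines, leaving one row per `in5` sample per chip.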
--- a/scripts/apricot-cstate-tune
+++ b/scripts/apricot-cstate-tune
@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+# Disable deep CPU C-states so Vcore stays at a higher baseline and the VRM
+# doesn't have to slam from C6/C7 idle back to full current on every workload
+# transient. Reduces transient-demand magnitude; does NOT fix root-cause PSU
+# or VRM degradation, but often reduces crash frequency on aging boards.
+#
+# Leaves C0 + C1 enabled (basic halt). Disables C2+ (package C-states).
+#
+# Reversible: run with `restore` to re-enable everything (`status` lists cpu0's states).
+
+set -o pipefail
+
+log() { printf '[%s] apricot-cstate-tune: %s\n' "$(date --iso-8601=s)" "$*"; }
+
+mode="${1:-apply}"
+
+case "$mode" in
+    apply)
+        n_cpus=$(ls -d /sys/devices/system/cpu/cpu[0-9]* 2>/dev/null | wc -l)
+        disabled=0
+        for s in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
+            [ -w "$s" ] || continue
+            idx="${s%/disable}"; idx="${idx##*state}"
+            # Keep states 0 (POLL/C0) and 1 (C1/halt); disable 2+.
+            if (( idx >= 2 )); then
+                echo 1 > "$s" 2>/dev/null && disabled=$(( disabled + 1 ))
+            fi
+        done
+        log "disabled $disabled idle-state entries across $n_cpus CPUs (kept C0+C1)"
+        ;;
+    restore)
+        enabled=0
+        for s in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
+            [ -w "$s" ] || continue
+            echo 0 > "$s" 2>/dev/null && enabled=$(( enabled + 1 ))
+        done
+        log "re-enabled $enabled idle-state entries"
+        ;;
+    status)
+        printf 'cpu0 idle states:\n'
+        for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
+            [ -d "$d" ] || continue
+            name=$(cat "$d/name" 2>/dev/null)
+            dis=$(cat "$d/disable" 2>/dev/null)
+            printf '  %s  disable=%s  name=%s\n' "$(basename "$d")" "$dis" "$name"
+        done
+        ;;
+    *)
+        echo "usage: $0 {apply|restore|status}" >&2
+        exit 2
+        ;;
+esac
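The `apply` branch decides per sysfs path whether to disable a state by extracting the numeric index with two parameter expansions. The sketch below isolates that logic so it can be exercised without touching `/sys`. The helper names `state_index` and `plan_for` are illustrative, and the paths are only strings here.

```sh
# Extract the idle-state index from a cpuidle "disable" path, using the
# same two parameter expansions as the apply branch.
state_index() {
    p=${1%/disable}        # .../cpuidle/state2/disable -> .../cpuidle/state2
    echo "${p##*state}"    # .../cpuidle/state2         -> 2
}

# Report what apply would do for a path: indices 0 and 1 are kept,
# everything from 2 upward is disabled.
plan_for() {
    if [ "$(state_index "$1")" -ge 2 ]; then echo disable; else echo keep; fi
}

for s in state0 state1 state2 state3; do
    printf '%s %s\n' "$s" "$(plan_for "/sys/devices/system/cpu/cpu0/cpuidle/$s/disable")"
done
```

The loop reports `keep` for `state0` and `state1` and `disable` for `state2` and `state3`, mirroring the `idx >= 2` test in the script.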
--- a/scripts/apricot-rail-mitigate
+++ b/scripts/apricot-rail-mitigate
@ -0,0 +1,92 @@
+#!/usr/bin/env bash
+# Emergency rail-deviation responder. Invoked by apricot-rail-watchdog when
+# a rail excursion is detected. Goal: reduce power demand for N seconds to
+# let the rail recover, then restore.
+#
+# Argv (from watchdog): <chip> <val_mV> <baseline_mV> <delta_mV> <src_ts>
+#
+# Actions:
+#   1. Drop both GPU power caps to GPU_LIMIT_SAFE (default 250W).
+#   2. Pin CPU governor to "powersave".
+#   3. Hold for HOLD_SECONDS (default 60).
+#   4. Restore prior values if we recorded them.
+#
+# Requires root (nvidia-smi -pl, writing to /sys/devices/system/cpu/...).
+# Runs via a sudoers allowlist for the lilith user, which install.sh sets up
+# (a root-side systemd unit triggered via a fifo would be an alternative).
+
+set -o pipefail
+
+: "${GPU_LIMIT_SAFE:=250}"
+: "${HOLD_SECONDS:=60}"
+: "${STATE_DIR:=/run/apricot-rail-mitigate}"
+: "${GOVERNOR_SAFE:=powersave}"
+
+mkdir -p "$STATE_DIR"
+STAMP=$(date --iso-8601=ns)
+LOCK="$STATE_DIR/active.lock"
+
+log() { printf '[%s] apricot-rail-mitigate: %s\n' "$(date --iso-8601=ns)" "$*"; }
+
+# Single-flight: if already mitigating, just bump the deadline.
+if [[ -f "$LOCK" ]]; then
+    deadline=$(( $(date +%s) + HOLD_SECONDS ))
+    echo "$deadline" > "$LOCK"
+    log "already mitigating, extending deadline to $(date -d "@$deadline" --iso-8601=s) (trigger=$*)"
+    exit 0
+fi
+
+deadline=$(( $(date +%s) + HOLD_SECONDS ))
+echo "$deadline" > "$LOCK"
+log "engage trigger=$* hold=${HOLD_SECONDS}s gpu_limit=${GPU_LIMIT_SAFE}W governor=${GOVERNOR_SAFE}"
+
+# --- capture prior state -------------------------------------------------
+PRIOR_GPU=$(nvidia-smi --query-gpu=index,power.limit --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')
+echo "$PRIOR_GPU" > "$STATE_DIR/prior_gpu"
+
+PRIOR_GOV=""
+for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
+    [ -r "$g" ] && PRIOR_GOV="$(cat "$g")" && break
+done
+echo "$PRIOR_GOV" > "$STATE_DIR/prior_gov"
+
+# --- apply safe state ----------------------------------------------------
+while IFS=, read -r idx _; do
+    [[ "$idx" =~ ^[0-9]+$ ]] || continue
+    nvidia-smi -i "$idx" -pl "$GPU_LIMIT_SAFE" >/dev/null 2>&1 \
+        && log "gpu $idx -> ${GPU_LIMIT_SAFE}W"
+done <<< "$PRIOR_GPU"
+
+for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
+    [ -w "$g" ] || continue
+    echo "$GOVERNOR_SAFE" > "$g" 2>/dev/null || true
+done
+log "cpu governor -> $GOVERNOR_SAFE (prior=$PRIOR_GOV)"
+
+# --- hold, honoring deadline bumps --------------------------------------
+while true; do
+    now=$(date +%s)
+    target=$(cat "$LOCK" 2>/dev/null || echo 0)
+    (( now >= target )) && break
+    sleep $(( target - now ))
+done
+
+# --- restore -------------------------------------------------------------
+while IFS=, read -r idx prior_w; do
+    [[ "$idx" =~ ^[0-9]+$ ]] || continue
+    prior_w="${prior_w%.*}"
+    [[ -n "$prior_w" ]] || continue
+    nvidia-smi -i "$idx" -pl "$prior_w" >/dev/null 2>&1 \
+        && log "gpu $idx -> ${prior_w}W (restored)"
+done < "$STATE_DIR/prior_gpu"
+
+if [[ -n "$PRIOR_GOV" ]]; then
+    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
+        [ -w "$g" ] || continue
+        echo "$PRIOR_GOV" > "$g" 2>/dev/null || true
+    done
+    log "cpu governor -> $PRIOR_GOV (restored)"
+fi
+
+rm -f "$LOCK"
+log "disengage"
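The single-flight behaviour above hinges on one lock file that doubles as the deadline store: the first trigger engages, and later triggers only push the deadline out. The following is a stripped-down sketch of that pattern, using a throwaway temp directory in place of `/run/apricot-rail-mitigate`. The function name `trigger` is illustrative.

```sh
# Single-flight with an extendable deadline, modelled on the lock file in
# apricot-rail-mitigate. STATE_DIR is a temp dir for demonstration only.
STATE_DIR=$(mktemp -d)
LOCK="$STATE_DIR/active.lock"

# Prints "engage" for the first trigger and "extend" for any trigger that
# arrives while the lock exists. Either way the lock file ends up holding
# the new deadline in epoch seconds.
trigger() {
    hold=$1
    if [ -f "$LOCK" ]; then action=extend; else action=engage; fi
    echo $(( $(date +%s) + hold )) > "$LOCK"
    echo "$action"
}

trigger 60    # first trigger engages
trigger 60    # second trigger only pushes the deadline out
rm -rf "$STATE_DIR"
```

As in the script, the existence check and the write are not atomic, so two triggers arriving at the same instant could both engage. A stricter variant might take the lock with `mkdir` or `flock`. Whether that matters depends on how often the watchdog fires.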
--- a/scripts/apricot-rail-mitigate-trigger
+++ b/scripts/apricot-rail-mitigate-trigger
@ -0,0 +1,5 @@
+#!/usr/bin/env bash
+# User-space shim invoked by the watchdog. Delegates to the root-owned
+# apricot-rail-mitigate via sudoers (install.sh installs a NOPASSWD rule
+# scoped to this one command).
+exec sudo -n /var/opt/apricot-health/sbin/apricot-rail-mitigate "$@"
--- a/scripts/apricot-rail-watchdog
+++ b/scripts/apricot-rail-watchdog
@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+# Watches a stable PSU-derived rail (default: in5 on it8628 chips) by
+# learning each chip's baseline from the first BASELINE_SAMPLES and alerting
+# when later samples deviate by more than DEVIATION_MV.
+#
+# Works for any rail that shouldn't swing under normal operation. For Vcore
+# (which swings 600mV+ during P-state transitions on Threadripper) this
+# approach is unsuitable — use in5 (+12V divided) or in7 (3VSB) instead.
+#
+# hwmon numbering is boot-order-dependent, so we resolve it per-line.
+#
+# Optional mitigation hook (set MITIGATE_CMD) runs when a deviation fires and
+# receives chip, value, baseline, delta, and source timestamp on its argv. Use
+# it to auto-throttle GPU power or CPU governor as an emergency response.
+
+set -o pipefail
+
+LOG="${LOG:-${HOME}/apricot-crash.log}"          # same override as apricot-crash-logger
+ALERTS="${ALERTS:-${HOME}/apricot-rail-alerts.log}"
+
+: "${DEVIATION_MV:=30}"
+: "${BASELINE_SAMPLES:=20}"
+: "${RAIL_KEY:=in5}"
+: "${CHIP_REGEX:=it8628/hwmon[0-9]+}"
+: "${MITIGATE_CMD:=}"
+
+printf '=== rail-watchdog start %s key=%s deviation=%smV baseline_samples=%s chip=%s mitigate=%s ===\n' \
+    "$(date --iso-8601=ns)" "$RAIL_KEY" "$DEVIATION_MV" "$BASELINE_SAMPLES" "$CHIP_REGEX" "${MITIGATE_CMD:-<none>}" >> "$ALERTS"
+
+emit() {
+    local ts msg="$*"
+    ts=$(date --iso-8601=ns)
+    printf '%s [WARN] %s\n' "$ts" "$msg" | tee -a "$ALERTS" >&2
+}
+
+info() {
+    local ts msg="$*"
+    ts=$(date --iso-8601=ns)
+    printf '%s [INFO] %s\n' "$ts" "$msg" >> "$ALERTS"
+}
+
+declare -A seen_count
+declare -A baseline
+declare -A buffer
+
+chip_re="($CHIP_REGEX)"
+val_re=" ${RAIL_KEY}=([0-9]+)$"
+
+median_of() {
+    printf '%s\n' $1 | sort -n | awk -v n=$(wc -w <<< "$1") 'NR==int((n+1)/2){print;exit}'
+}
+
+tail -F -n 0 "$LOG" 2>/dev/null | while IFS= read -r line; do
+    [[ "$line" =~ $chip_re ]] || continue
+    chip="${BASH_REMATCH[1]}"
+    [[ "$line" =~ $val_re ]] || continue
+    val="${BASH_REMATCH[1]}"
+    src_ts="${line%% *}"
+
+    n="${seen_count[$chip]:-0}"
+    n=$(( n + 1 ))
+    seen_count[$chip]=$n
+
+    if (( n <= BASELINE_SAMPLES )); then
+        buffer[$chip]="${buffer[$chip]:+${buffer[$chip]} }$val"
+        if (( n == BASELINE_SAMPLES )); then
+            b=$(median_of "${buffer[$chip]}")
+            baseline[$chip]=$b
+            info "baseline_learned chip=${chip} key=${RAIL_KEY} baseline=${b}mV samples=${BASELINE_SAMPLES}"
+            unset 'buffer[$chip]'
+        fi
+        continue
+    fi
+
+    b="${baseline[$chip]}"
+    dev=$(( val - b ))
+    (( dev < 0 )) && dev=$(( -dev ))
+    if (( dev > DEVIATION_MV )); then
+        emit "rail_deviation chip=${chip} key=${RAIL_KEY} val=${val}mV baseline=${b}mV |Δ|=${dev}mV at=${src_ts}"
+        if [[ -n "$MITIGATE_CMD" ]]; then
+            # Detach mitigation so a slow command can't block alert delivery.
+            "$MITIGATE_CMD" "$chip" "$val" "$b" "$dev" "$src_ts" >> "$ALERTS" 2>&1 &
+        fi
+    fi
+done
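The baseline uses a median rather than a mean, so a single sag inside the learning window does not drag the baseline down. The sketch below shows that property together with the absolute-deviation test. The 852 mV and 408 mV values come from docs/DIAGNOSIS.md, the other samples are made up for illustration, and `deviates` is a hypothetical helper name.

```sh
# Lower median of whitespace-separated integers, mirroring the watchdog's
# median_of (element int((n+1)/2) of the sorted list).
median_of() {
    # $1 is intentionally unquoted so each value becomes its own line.
    printf '%s\n' $1 | sort -n | awk '{ v[NR] = $1 } END { print v[int((NR + 1) / 2)] }'
}

# Succeeds (exit 0) when |val - baseline| exceeds the limit, all in mV.
deviates() {
    val=$1; base=$2; limit=$3
    dev=$(( val - base ))
    [ "$dev" -ge 0 ] || dev=$(( -dev ))
    [ "$dev" -gt "$limit" ]
}

baseline=$(median_of "852 850 853 852 408")   # one outlier in the window
echo "baseline=${baseline}mV"
if deviates 408 "$baseline" 30; then echo "408 mV: alert"; fi
if ! deviates 840 "$baseline" 30; then echo "840 mV: within tolerance"; fi
```

With the outlier included, the learned baseline is still 852 mV. A mean over the same window would land near 763 mV and put the normal 852 mV reading outside a 30 mV tolerance.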
--- a/scripts/apricot-rasdaemon-setup
+++ b/scripts/apricot-rasdaemon-setup
@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+# Install + enable rasdaemon for detailed AMD MCA/MCE parsing.
+#
+# rasdaemon runs a trace-buffer consumer that decodes machine-check events
+# into a sqlite DB (ras-mc_event.db, usually under /var/lib/rasdaemon/) and
+# syslogs them in human-readable form. Much more detail than edac_mce_amd
+# alone. If any crash is in-CPU or NB-side (not pure board-level power
+# loss), this catches it.
+#
+# Idempotent. Safe to re-run.
+
+set -o pipefail
+
+log() { printf '[%s] apricot-rasdaemon-setup: %s\n' "$(date --iso-8601=s)" "$*"; }
+
+if ! command -v rasdaemon >/dev/null 2>&1; then
+    log "rasdaemon not installed — attempting rpm-ostree install"
+    if command -v rpm-ostree >/dev/null 2>&1; then
+        sudo rpm-ostree install rasdaemon \
+            && log "installed — a reboot is required for the layered package to activate" \
+            || { log "rpm-ostree install failed"; exit 1; }
+    elif command -v dnf >/dev/null 2>&1; then
+        sudo dnf install -y rasdaemon \
+            || { log "dnf install failed"; exit 1; }
+    else
+        log "no package manager found; install rasdaemon manually"
+        exit 1
+    fi
+fi
+
+# Enable + start the service. On rpm-ostree systems the layered unit only
+# appears after a reboot, so both calls may fail until then (tolerated here).
+sudo systemctl enable rasdaemon.service 2>&1 | grep -v '^Created' || true
+sudo systemctl start rasdaemon.service 2>&1 \
+    && log "rasdaemon.service started" \
+    || log "rasdaemon.service not started yet; reboot, then re-run this script (layered package)"
+
+log "status:"
+systemctl status rasdaemon.service --no-pager 2>&1 | head -10 || true
+
+log "recent events (may be empty):"
+sudo ras-mc-ctl --summary 2>&1 | head -15 || true
--- a/sudoers.d/apricot-health
+++ b/sudoers.d/apricot-health
@ -0,0 +1,4 @@
+# Allow user lilith to invoke the rail-mitigation script without password
+# (fired by apricot-rail-watchdog.service when a rail deviation is detected).
+# Scoped to this one command.
+lilith ALL=(root) NOPASSWD: /var/opt/apricot-health/sbin/apricot-rail-mitigate
--- a/systemd/apricot-crash-monitor.service
+++ b/systemd/apricot-crash-monitor.service
@ -0,0 +1,16 @@
+[Unit]
+Description=Apricot crash logger (high-frequency power/thermal/voltage capture)
+After=default.target
+
+[Service]
+Type=simple
+ExecStart=/var/home/lilith/bin/apricot-crash-logger
+Environment=INTERVAL=0.1
+Restart=always
+RestartSec=2
+StandardOutput=null
+StandardError=journal
+SyslogIdentifier=apricot-crash-monitor
+
+[Install]
+WantedBy=default.target
--- a/systemd/apricot-cstate-tune.service
+++ b/systemd/apricot-cstate-tune.service
@ -0,0 +1,16 @@
+[Unit]
+Description=Apricot CPU C-state tuning (disable deep C-states to reduce VRM transient demand)
+After=multi-user.target
+ConditionPathExists=/sys/devices/system/cpu/cpu0/cpuidle/state0
+
+[Service]
+Type=oneshot
+ExecStart=/var/opt/apricot-health/sbin/apricot-cstate-tune apply
+ExecStop=/var/opt/apricot-health/sbin/apricot-cstate-tune restore
+RemainAfterExit=yes
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=apricot-cstate-tune
+
+[Install]
+WantedBy=multi-user.target
--- a/systemd/apricot-rail-watchdog.service
+++ b/systemd/apricot-rail-watchdog.service
@ -0,0 +1,17 @@
+[Unit]
+Description=Apricot PSU rail deviation watchdog (it8628 in5 baseline)
+After=apricot-crash-monitor.service
+Wants=apricot-crash-monitor.service
+
+[Service]
+Type=simple
+ExecStart=/var/home/lilith/bin/apricot-rail-watchdog
+Environment=MITIGATE_CMD=/var/home/lilith/bin/apricot-rail-mitigate-trigger
+Restart=always
+RestartSec=2
+StandardOutput=null
+StandardError=journal
+SyslogIdentifier=apricot-rail-watchdog
+
+[Install]
+WantedBy=apricot-crash-monitor.service
				`@ -0,0 +1 @@`
				`options it87 force_id=0x8628 ignore_resource_conflict=1`