feat(@packages/apricot-health): ✨ add power-fault monitoring and mitigation tools
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
commit
dafbabee41
15 changed files with 663 additions and 0 deletions
66
README.md
Normal file
|
|
@ -0,0 +1,66 @@
|
|||
# apricot-health

Power-fault diagnostics and mitigation for **apricot** — a Threadripper 2990WX / X399 AORUS XTREME-CF / dual RTX 3090 rig on an open wet-bench, hit by random hard power-offs whose root cause is still being isolated (aging PSU caps, VRM degradation, or both).

## What's in here

| Component | What it does |
|---|---|
| `scripts/apricot-crash-logger` | High-frequency (10 Hz) sensor snapshotter. Captures GPU / CPU / NVMe / motherboard-rail telemetry to `~/apricot-crash.log`, fsync'd every second, so the last fractions of a second before a hard reset survive the crash. |
| `scripts/apricot-rail-watchdog` | Tails the crash-log, learns per-chip baseline for `in5` on each `it8628/hwmonN`, alerts on deviations > `DEVIATION_MV` (default 30 mV). Optionally invokes a mitigation hook. |
| `scripts/apricot-rail-mitigate` | Root-only emergency responder: drops GPU power caps and pins CPU governor to `powersave` for `HOLD_SECONDS` (default 60), then restores. Fired by the watchdog via sudoers. |
| `scripts/apricot-rail-mitigate-trigger` | User-space shim that `sudo`s into `apricot-rail-mitigate` (scoped NOPASSWD). |
| `scripts/apricot-cstate-tune` | Disables deep CPU C-states (C2+) so Vcore stays at a higher baseline, reducing VRM transient-demand magnitude. Oneshot systemd unit at boot. |
| `scripts/apricot-rasdaemon-setup` | Installs + enables `rasdaemon` for detailed AMD MCA/MCE decoding into a sqlite DB. |
| `modprobe.d/it87.conf` | `force_id=0x8628 ignore_resource_conflict=1` — binds IT8628E SuperIO so voltage/fan/temp rails are exposed in `/sys/class/hwmon`. |
| `modules-load.d/it87.conf` | Loads `it87` at boot. |
| `sudoers.d/apricot-health` | NOPASSWD rule for `lilith` to invoke the mitigation entrypoint (scoped to one command). |
| `systemd/*.service` | Three units — one root (`apricot-cstate-tune`), two user (`apricot-crash-monitor`, `apricot-rail-watchdog`). |

## Install

```sh
./install.sh                   # targets HOST=apricot by default
HOST=other-host ./install.sh   # or override
```

Idempotent. Re-run to push updates.

## Tuning

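Logger and watchdog behavior is env-overridable through systemd drop-ins. The mitigate knobs are read by the root-side script, and `sudo` resets the environment by default, so a watchdog drop-in may not reach them: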

```sh
systemctl --user edit apricot-rail-watchdog
# [Service]
# Environment=DEVIATION_MV=50 BASELINE_SAMPLES=40 RAIL_KEY=in5
```

Key knobs:

- `INTERVAL` (crash-logger) — sample period in seconds; `0.1` = 10 Hz.
- `DEVIATION_MV` (watchdog) — deviation from learned baseline that triggers an alert.
- `MITIGATE_CMD` (watchdog) — path to mitigation hook; empty = alert only.
- `GPU_LIMIT_SAFE` (mitigate) — wattage to clamp GPUs to during mitigation.
- `HOLD_SECONDS` (mitigate) — how long to hold the safe state.
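
A drop-in can also be written by a script rather than typed into the interactive editor. The following is a minimal sketch, assuming user units are read from `${XDG_CONFIG_HOME:-$HOME/.config}/systemd/user`; the helper name `write_dropin` is hypothetical and not part of this package.

```sh
# Hypothetical helper (not shipped in this package): write a user-unit
# drop-in that sets Environment= overrides for one service.
# Overwrites any existing override.conf for that unit.
# usage: write_dropin <unit-name-without-.service> KEY=VALUE [KEY=VALUE ...]
write_dropin() {
    unit=$1; shift
    dir="${XDG_CONFIG_HOME:-$HOME/.config}/systemd/user/${unit}.service.d"
    mkdir -p "$dir" || return 1
    {
        printf '[Service]\n'
        for kv in "$@"; do
            printf 'Environment=%s\n' "$kv"
        done
    } > "$dir/override.conf"
    # Print the path that was written so the caller can inspect it.
    printf '%s\n' "$dir/override.conf"
}
```

For example, `write_dropin apricot-rail-watchdog DEVIATION_MV=50 BASELINE_SAMPLES=40`, followed by `systemctl --user daemon-reload` and `systemctl --user restart apricot-rail-watchdog`, should apply the new thresholds.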

## Outputs

- `~/apricot-crash.log` — per-sample telemetry.
- `~/apricot-rail-alerts.log` — watchdog alerts + baselines.
- `journalctl --user -u apricot-rail-watchdog` — live alerts (WARNING priority).
- `journalctl -u apricot-cstate-tune` — one-shot C-state tune result at boot.
- `/var/lib/rasdaemon/ras-mc_event.db` (after rasdaemon setup) — decoded MCEs.
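
For a quick per-chip picture of the alert log, something like the following could work. It is a sketch that assumes the `rail_deviation` line layout the watchdog currently writes (`[WARN] rail_deviation chip=<id> ... val=<n>mV ...`); `summarize_alerts` is a hypothetical name, not a shipped script.

```sh
# Hypothetical helper: count rail_deviation alerts per chip and report the
# lowest value seen, reading the watchdog's alert-log format from a file.
# usage: summarize_alerts [path]   (default: ~/apricot-rail-alerts.log)
summarize_alerts() {
    awk '
        $2 == "[WARN]" && $3 == "rail_deviation" {
            chip = ""; val = ""
            for (i = 4; i <= NF; i++) {
                if ($i ~ /^chip=/) chip = substr($i, 6)
                if ($i ~ /^val=/)  { val = substr($i, 5); sub(/mV$/, "", val) }
            }
            if (chip == "" || val == "") next
            count[chip]++
            if (!(chip in lowest) || val + 0 < lowest[chip]) lowest[chip] = val + 0
        }
        END {
            # Iteration order over chips is unspecified; pipe through sort if needed.
            for (c in count)
                printf "%s alerts=%d min=%dmV\n", c, count[c], lowest[c]
        }
    ' "${1:-$HOME/apricot-rail-alerts.log}"
}
```

Because hwmon numbering changes with boot order, the same physical chip may show up under more than one `hwmonN` key across reboots.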

## Post-mortem flow when a crash happens

1. `ssh apricot` (after it comes back — BIOS "AC Back: Power On" auto-restarts).
2. `grep -n '^=== session start' ~/apricot-crash.log | tail -5` — find the new session boundary.
3. Everything between the previous session's last line and the new session marker is the last ~N seconds before death.
4. `tail ~/apricot-rail-alerts.log` — did the watchdog see rail deviation before the event?
5. `journalctl -b -1 --no-pager | tail -40` — kernel's last words (often normal; hard-off gives no panic).
6. SMART unsafe-shutdown counter: `sudo smartctl -a /dev/nvme0 | grep -i unsafe` — should increment by 1.
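
Steps 2 and 3 can be folded into one helper. This is a rough sketch that relies only on the `=== session start` marker the logger writes; `pre_crash_tail` is a hypothetical name and the default of 50 lines is an arbitrary choice.

```sh
# Hypothetical helper: print the last N lines written before the most recent
# "=== session start" marker, i.e. the final samples of the previous session.
# usage: pre_crash_tail [log] [n]   (defaults: ~/apricot-crash.log, 50)
pre_crash_tail() {
    log="${1:-$HOME/apricot-crash.log}"
    n="${2:-50}"
    # Line number of the newest session marker; empty if the log has none.
    last=$(grep -n '^=== session start' "$log" | tail -n 1 | cut -d: -f1)
    [ -n "$last" ] || return 1
    # Marker on the first line means there is no earlier session to show.
    [ "$last" -gt 1 ] || return 0
    head -n "$((last - 1))" "$log" | tail -n "$n"
}
```

Run it after the box has rebooted and the logger has started a new session; before that, the newest marker still belongs to the session that was running when the crash occurred.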

## Diagnosis so far

See [`docs/DIAGNOSIS.md`](docs/DIAGNOSIS.md).
78
docs/DIAGNOSIS.md
Normal file
|
|
@ -0,0 +1,78 @@
|
|||
# apricot hard-off diagnosis

Running log of the investigation. Newest findings at top.

## Platform
- Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
- AMD Threadripper 2990WX (32-core, 250 W TDP).
- 2× NVIDIA RTX 3090 (stock 370 W cap each).
- 2× NVMe + 3× SATA.
- 2× Corsair PSUs:
  - **HX1500i** — was producing audible coil-whine before the split; now carries only drives + Molex.
  - **HX1200** — now carries mobo + CPU + both GPUs.
- Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
- Non-ECC memory (`amd64_edac` cannot bind).

## Failure signature (consistent across all events)

1. Journal cuts abruptly mid-operation. No `Reached target Shutdown`, no `systemd-shutdown`, no kernel panic.
2. Next boot runs `XFS (dm-0): Starting recovery` — filesystem wasn't unmounted cleanly.
3. NVMe SMART `Unsafe Shutdowns` increments by 1 on each event. Currently ~66 % of all power cycles are unclean.
4. BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
5. No MCE / thermal-throttle / OOM / hung-task entries.

→ The kernel never runs a shutdown — the 12 V plane collapses from under it. Classic PSU OCP/UVP or VRM brownout.
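
The unclean-cycle ratio in item 3 can be recomputed from the drive's own counters. A minimal sketch, assuming the `Power Cycles:` and `Unsafe Shutdowns:` labels that `smartctl -a` prints for NVMe devices; `unsafe_ratio` is a hypothetical helper name.

```sh
# Hypothetical helper: read `smartctl -a` text for an NVMe device on stdin
# and print unsafe shutdowns as a share of all power cycles.
unsafe_ratio() {
    awk -F: '
        /^Power Cycles:/     { pc = $2 }
        /^Unsafe Shutdowns:/ { us = $2 }
        END {
            # Strip padding and any thousands separators, keeping digits only.
            gsub(/[^0-9]/, "", pc); gsub(/[^0-9]/, "", us)
            if (pc == "" || us == "" || pc + 0 == 0) exit 1
            printf "%d/%d unsafe (%.0f%%)\n", us, pc, 100 * us / pc
        }
    '
}
```

Typical use would be `sudo smartctl -a /dev/nvme0 | unsafe_ratio`. The helper exits non-zero if either counter is missing from the input.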

## Timeline of captured crashes

| Timestamp (PDT) | GPU 0 | GPU 1 | CPU Tctl | Load profile |
|---|---|---|---|---|
| 2026-04-16 15:58:06 | 158 W | **368 W** (pegged) | — | Sustained high — GPU 1 inference under load |
| 2026-04-17 03:22:54 | 117 W | 25 W (idle) | 70 °C | **Near-idle** — background auto-commit + tor-manager only |
| 2026-04-17 11:15:42 | 20 W | **368 W** (pegged) | 72 °C | High GPU 1 load |
| 2026-04-17 21:35:10 | 117 W | 129 W | 69 °C | Moderate, both GPUs in P2 |

Crashes span idle-to-sustained-peak — no consistent load correlation.

## Rail observations (it8628 SuperIO, after binding via `it87 force_id=0x8628`)

Stable rails during normal operation:

- `in5` on chip 1 (hwmon3/hwmon8 depending on boot order): **852 mV steady** → likely +12 V scaled ~14:1 → ~11.9 V actual.
- `in5` on chip 2: **1632 mV steady** → likely +5 V scaled ~3:1 → ~4.9 V actual.

**Key observation 2026-04-17**: Between crashes, `in5` on chip 1 collapsed from **852 mV → 408 mV** twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a **50 % rail drop** — probably a ~12 V → ~5.7 V momentary sag. System survived both. Demonstrates the supply is visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off.
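
The raw-to-rail conversions above are a single multiplication by the divider ratio. A small sketch follows; the ratios (14:1 and 3:1) are the estimates stated above, not measured values, and `rail_volts` is a hypothetical helper name.

```sh
# Hypothetical helper: convert a raw hwmon reading (mV at the SuperIO pin)
# to the estimated rail voltage, given an ASSUMED divider ratio.
# usage: rail_volts <raw_mV> <ratio>
rail_volts() {
    awk -v mv="$1" -v r="$2" 'BEGIN { printf "%.2f\n", mv * r / 1000 }'
}

# rail_volts 852 14   -> 11.93  (steady-state +12 V estimate)
# rail_volts 408 14   -> 5.71   (the momentary sag)
# rail_volts 1632 3   -> 4.90   (steady-state +5 V estimate)
```

If a multimeter reading at the 24-pin disagrees with these estimates, the ratio is the first thing to revisit.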

## What has been ruled out

- **Thermal**: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
- **OOM / hung task**: journal shows none.
- **MCE**: `edac_mce_amd` loaded, no events logged.
- **Graceful shutdown path**: no systemd shutdown-target progression.
- **nvidia-oc daemon**: fixed independently — was thrashing sqlite locks; not related to crashes.
- **HX1500i as sole cause**: crashes continued after moving all load off it onto HX1200.

## What's consistent with observations

- **Aging filter caps on PSU and/or motherboard VRM**. Both the squealing HX1500i *and* the HX1200 have produced visible rail excursions. Board is 8 years old.
- **Load-independent failure**: crashes happen at both idle and peak load, and the `in5` rail drops caught by the watchdog point to intermittent supply failure decoupled from workload.

## What remains to rule out (physical)

- Visual inspection of VRM caps on the board (open bench, trivial).
- Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
- Swap to a third known-good PSU for a day.
- Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).

## Software stack currently deployed

- **10 Hz telemetry logger** (`apricot-crash-monitor.service`) — writes `~/apricot-crash.log`, fsync per second.
- **Rail watchdog** (`apricot-rail-watchdog.service`) — baseline-learning on `in5`, 30 mV deviation threshold, invokes mitigation on trigger.
- **Emergency mitigation** (`apricot-rail-mitigate`) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores.
- **C-state tune** (`apricot-cstate-tune.service`) — disables C2+ at boot to reduce VRM transient demand.
- **IT8628E binding** (`/etc/modprobe.d/it87.conf` + `/etc/modules-load.d/it87.conf`) — SuperIO sensors auto-load with correct `force_id`.
- **rasdaemon** — optional, via `apricot-rasdaemon-setup`.
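
The watchdog's detection rule (learn a median baseline, then flag samples that stray too far from it) can be sketched in a few lines. The helper names `median` and `deviates` are hypothetical; the shipped script implements the same idea inline, and this version is only an illustration.

```sh
# Sketch of the watchdog's detection rule (helper names are hypothetical).
# median: lower median of the integers passed as arguments.
median() {
    printf '%s\n' "$@" | sort -n |
        awk -v n="$#" 'NR == int((n + 1) / 2) { print; exit }'
}

# deviates: succeed (exit 0) when |val - baseline| exceeds the limit in mV.
# usage: deviates <val_mV> <baseline_mV> [limit_mV]   (limit defaults to 30)
deviates() {
    val=$1; base=$2; limit=${3:-30}
    dev=$((val - base))
    if [ "$dev" -lt 0 ]; then dev=$((0 - dev)); fi
    [ "$dev" -gt "$limit" ]
}
```

A median is used for the baseline so that a single outlier during the learning window does not shift it the way a mean would.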

## Fixes kept separate from this package

- nvidia-oc WAL-mode patch (upstreamed via ACS to `origin/master` of the nvidia-oc repo, commit `bea1934`).
105
install.sh
Executable file
|
|
@ -0,0 +1,105 @@
|
|||
#!/usr/bin/env bash
|
||||
# Install apricot-health on the target host (default: apricot).
|
||||
#
|
||||
# Layout on target:
|
||||
# /var/home/lilith/bin/ user-runnable scripts
|
||||
# /var/opt/apricot-health/sbin/ root-only entrypoints (ostree-safe)
|
||||
# /etc/modprobe.d/it87.conf IT8628E force_id
|
||||
# /etc/modules-load.d/it87.conf load it87 at boot
|
||||
# /etc/sudoers.d/apricot-health NOPASSWD shim for mitigation
|
||||
# /etc/systemd/system/apricot-cstate-tune.service root systemd unit
|
||||
# /var/home/lilith/.config/systemd/user/*.service user systemd units
|
||||
#
|
||||
# Idempotent: re-running copies updates and daemon-reloads.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
HOST="${HOST:-apricot}"
|
||||
PKG_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
|
||||
echo "==> apricot-health install to $HOST (pkg=$PKG_DIR)"
|
||||
|
||||
# --- stage tarball locally so we upload in one round-trip ---------------
|
||||
stage=$(mktemp -d)
|
||||
trap 'rm -rf "$stage"' EXIT
|
||||
mkdir -p "$stage"/{bin,root-sbin,etc-modprobe,etc-modules-load,etc-sudoers,etc-systemd,user-systemd}
|
||||
|
||||
cp "$PKG_DIR/scripts/apricot-crash-logger" "$stage/bin/"
|
||||
cp "$PKG_DIR/scripts/apricot-rail-watchdog" "$stage/bin/"
|
||||
cp "$PKG_DIR/scripts/apricot-rail-mitigate-trigger" "$stage/bin/"
|
||||
cp "$PKG_DIR/scripts/apricot-rasdaemon-setup" "$stage/bin/"
|
||||
cp "$PKG_DIR/scripts/apricot-rail-mitigate" "$stage/root-sbin/"
|
||||
cp "$PKG_DIR/scripts/apricot-cstate-tune" "$stage/root-sbin/"
|
||||
cp "$PKG_DIR/modprobe.d/it87.conf" "$stage/etc-modprobe/"
|
||||
cp "$PKG_DIR/modules-load.d/it87.conf" "$stage/etc-modules-load/"
|
||||
cp "$PKG_DIR/sudoers.d/apricot-health" "$stage/etc-sudoers/"
|
||||
cp "$PKG_DIR/systemd/apricot-cstate-tune.service" "$stage/etc-systemd/"
|
||||
cp "$PKG_DIR/systemd/apricot-crash-monitor.service" "$stage/user-systemd/"
|
||||
cp "$PKG_DIR/systemd/apricot-rail-watchdog.service" "$stage/user-systemd/"
|
||||
|
||||
tar -czf "$stage/pkg.tar.gz" -C "$stage" bin root-sbin etc-modprobe etc-modules-load etc-sudoers etc-systemd user-systemd
|
||||
echo "==> staged $(du -h "$stage/pkg.tar.gz" | cut -f1)"
|
||||
|
||||
# --- ship it ------------------------------------------------------------
|
||||
scp -q "$stage/pkg.tar.gz" "$HOST:/tmp/apricot-health.tar.gz"
|
||||
|
||||
ssh "$HOST" bash -s <<'REMOTE'
|
||||
set -euo pipefail
|
||||
echo "==> remote install"
|
||||
|
||||
t=$(mktemp -d)
|
||||
tar -xzf /tmp/apricot-health.tar.gz -C "$t"
|
||||
|
||||
# User-runnable scripts
|
||||
mkdir -p /var/home/lilith/bin
|
||||
install -m 0755 -o lilith -g lilith "$t"/bin/* /var/home/lilith/bin/
|
||||
|
||||
# Root-only entrypoints (ostree-safe path under /var)
|
||||
sudo mkdir -p /var/opt/apricot-health/sbin
|
||||
sudo install -m 0755 -o root -g root "$t"/root-sbin/* /var/opt/apricot-health/sbin/
|
||||
|
||||
# Kernel module config
|
||||
sudo install -m 0644 "$t"/etc-modprobe/it87.conf /etc/modprobe.d/it87.conf
|
||||
sudo install -m 0644 "$t"/etc-modules-load/it87.conf /etc/modules-load.d/it87.conf
|
||||
|
||||
# Sudoers (visudo-check first — malformed sudoers can lock the user out)
|
||||
tmp_sudo=$(mktemp)
|
||||
cp "$t"/etc-sudoers/apricot-health "$tmp_sudo"
|
||||
if sudo visudo -cf "$tmp_sudo" >/dev/null 2>&1; then
|
||||
sudo install -m 0440 -o root -g root "$tmp_sudo" /etc/sudoers.d/apricot-health
|
||||
echo " sudoers: installed"
|
||||
else
|
||||
echo " sudoers: SYNTAX ERROR — not installing" >&2
|
||||
rm -f "$tmp_sudo"   # don't leave the rejected copy behind
exit 1
|
||||
fi
|
||||
rm -f "$tmp_sudo"
|
||||
|
||||
# Root systemd units
|
||||
sudo install -m 0644 "$t"/etc-systemd/apricot-cstate-tune.service /etc/systemd/system/
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl enable --now apricot-cstate-tune.service
|
||||
echo " apricot-cstate-tune.service: enabled + started"
|
||||
|
||||
# User systemd units (under lilith)
|
||||
sudo -u lilith mkdir -p /var/home/lilith/.config/systemd/user
|
||||
sudo -u lilith install -m 0644 "$t"/user-systemd/apricot-crash-monitor.service /var/home/lilith/.config/systemd/user/
|
||||
sudo -u lilith install -m 0644 "$t"/user-systemd/apricot-rail-watchdog.service /var/home/lilith/.config/systemd/user/
|
||||
sudo loginctl enable-linger lilith 2>/dev/null || true
|
||||
sudo systemctl --user -M lilith@.host daemon-reload
|
||||
sudo systemctl --user -M lilith@.host enable --now apricot-crash-monitor.service
|
||||
sudo systemctl --user -M lilith@.host restart apricot-rail-watchdog.service 2>/dev/null \
|
||||
|| sudo systemctl --user -M lilith@.host enable --now apricot-rail-watchdog.service
|
||||
echo " user units: enabled + started"
|
||||
|
||||
# Load it87 now if not yet loaded
|
||||
if ! lsmod | grep -q '^it87 '; then
|
||||
sudo modprobe it87 force_id=0x8628 ignore_resource_conflict=1 \
|
||||
&& echo " it87 module: loaded" \
|
||||
|| echo " it87 module: load FAILED (try reboot)"
|
||||
fi
|
||||
|
||||
rm -rf "$t" /tmp/apricot-health.tar.gz
|
||||
echo "==> install complete"
|
||||
REMOTE
|
||||
|
||||
echo "==> done"
|
||||
1
modprobe.d/it87.conf
Normal file
|
|
@ -0,0 +1 @@
|
|||
options it87 force_id=0x8628 ignore_resource_conflict=1
|
||||
1
modules-load.d/it87.conf
Normal file
|
|
@ -0,0 +1 @@
|
|||
it87
|
||||
83
scripts/apricot-crash-logger
Executable file
|
|
@ -0,0 +1,83 @@
|
|||
#!/usr/bin/env bash
|
||||
# Continuously appends power/thermal/voltage state to $LOG so that the last
|
||||
# fractions of a second before a hard reset survive the crash.
|
||||
#
|
||||
# Env overrides:
|
||||
# LOG output path (default ~/apricot-crash.log)
|
||||
# INTERVAL sample period in seconds (default 0.1 = 10 Hz)
|
||||
# SENSOR_CHIPS regex of hwmon name(s) to capture (default k10temp|nvme|it8628|nct6.*|w83.*)
|
||||
|
||||
set -o pipefail
|
||||
|
||||
LOG="${LOG:-${HOME}/apricot-crash.log}"
|
||||
INTERVAL="${INTERVAL:-0.1}"
|
||||
GPU_SAMPLE_EVERY="${GPU_SAMPLE_EVERY:-10}" # nvidia-smi is slow; only invoke every Nth iter
|
||||
SENSOR_CHIPS="${SENSOR_CHIPS:-k10temp|nvme|it8628|nct6.*|w83.*}"
|
||||
|
||||
printf '=== session start %s (pid=%s interval=%ss gpu_every=%s chips=%s) ===\n' \
|
||||
"$(date --iso-8601=ns)" "$$" "$INTERVAL" "$GPU_SAMPLE_EVERY" "$SENSOR_CHIPS" >> "$LOG"
|
||||
|
||||
# Pre-resolve matching hwmon paths up front and refresh them every ~5 s (cheaper than per-sample).
|
||||
declare -a HWMONS
|
||||
refresh_hwmons() {
|
||||
HWMONS=()
|
||||
for h in /sys/class/hwmon/hwmon*; do
|
||||
[ -d "$h" ] || continue
|
||||
[ -r "$h/name" ] || continue
|
||||
name=$(<"$h/name") # bash builtin — no fork
|
||||
[[ "$name" =~ ^(${SENSOR_CHIPS})$ ]] || continue
|
||||
HWMONS+=("$h")
|
||||
done
|
||||
}
|
||||
refresh_hwmons
|
||||
last_refresh=$SECONDS
|
||||
iter=0
|
||||
|
||||
while :; do
|
||||
ts=$(date --iso-8601=ns)
|
||||
|
||||
# GPU telemetry — skip most iterations because nvidia-smi startup is
|
||||
# ~300-500ms, which would cap the loop at ~2 Hz otherwise.
|
||||
if (( iter % GPU_SAMPLE_EVERY == 0 )); then
|
||||
while IFS= read -r gpu_line; do
|
||||
printf '%s gpu %s\n' "$ts" "$gpu_line"
|
||||
done < <(nvidia-smi \
|
||||
--query-gpu=index,temperature.gpu,power.draw,clocks.gr,clocks.mem,pstate,utilization.gpu,memory.used \
|
||||
--format=csv,noheader,nounits 2>/dev/null)
|
||||
fi
|
||||
iter=$(( iter + 1 ))
|
||||
|
||||
# Platform sensors — use $(<file) bash builtin everywhere to avoid
|
||||
# fork+exec per-read. With ~60 sensor files that's the difference
|
||||
# between ~600ms per iteration and <20ms.
|
||||
for h in "${HWMONS[@]}"; do
|
||||
[ -r "$h/name" ] || continue
|
||||
name=$(<"$h/name")
|
||||
hb=${h##*/}
|
||||
for inp in "$h"/temp*_input "$h"/in*_input "$h"/fan*_input "$h"/curr*_input; do
|
||||
[ -r "$inp" ] || continue
|
||||
n=${inp##*/}; n=${n%_input}
|
||||
label_file="$h/${n}_label"
|
||||
if [ -r "$label_file" ]; then
|
||||
label=$(<"$label_file")
|
||||
else
|
||||
label="$n"
|
||||
fi
|
||||
raw=$(<"$inp")
|
||||
printf '%s sensor %s/%s %s=%s\n' "$ts" "$name" "$hb" "$label" "$raw"
|
||||
done
|
||||
done
|
||||
|
||||
# Refresh hwmon list every ~5s in case modules load/unload.
|
||||
if (( SECONDS - last_refresh > 5 )); then
|
||||
refresh_hwmons
|
||||
last_refresh=$SECONDS
|
||||
fi
|
||||
|
||||
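# Sync the log about once per second, independent of the sample rate.
# (The tenths-digit check only lined up with 1 Hz at the default INTERVAL.)
# last_sync is unset on the first pass and evaluates to 0 in arithmetic.
if (( SECONDS != last_sync )); then
  sync "$LOG" 2>/dev/null || true
  last_sync=$SECONDS
fi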
|
||||
|
||||
sleep "$INTERVAL"
|
||||
done >> "$LOG"
|
||||
52
scripts/apricot-cstate-tune
Executable file
|
|
@ -0,0 +1,52 @@
|
|||
#!/usr/bin/env bash
|
||||
# Disable deep CPU C-states so Vcore stays at a higher baseline and the VRM
|
||||
# doesn't have to slam from C6/C7 idle back to full current on every workload
|
||||
# transient. Reduces transient-demand magnitude; does NOT fix root-cause PSU
|
||||
# or VRM degradation, but often reduces crash frequency on aging boards.
|
||||
#
|
||||
# Leaves cpuidle state0 (POLL) and state1 (C1, basic halt) enabled. Disables state2 and deeper.
|
||||
#
|
||||
# Reversible: run with `--restore` to re-enable everything.
|
||||
|
||||
set -o pipefail
|
||||
|
||||
log() { printf '[%s] apricot-cstate-tune: %s\n' "$(date --iso-8601=s)" "$*"; }
|
||||
|
||||
mode="${1:-apply}"
|
||||
|
||||
case "$mode" in
|
||||
apply)
|
||||
n_cpus=$(ls -d /sys/devices/system/cpu/cpu[0-9]* 2>/dev/null | wc -l)
|
||||
disabled=0
|
||||
for s in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
|
||||
[ -w "$s" ] || continue
|
||||
idx="${s%/disable}"; idx="${idx##*state}"
|
||||
# Keep states 0 (POLL/C0) and 1 (C1/halt); disable 2+.
|
||||
if (( idx >= 2 )); then
|
||||
echo 1 > "$s" 2>/dev/null && disabled=$(( disabled + 1 ))
|
||||
fi
|
||||
done
|
||||
log "disabled $disabled idle-state entries across $n_cpus CPUs (kept C0+C1)"
|
||||
;;
|
||||
restore)
|
||||
enabled=0
|
||||
for s in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
|
||||
[ -w "$s" ] || continue
|
||||
echo 0 > "$s" 2>/dev/null && enabled=$(( enabled + 1 ))
|
||||
done
|
||||
log "re-enabled $enabled idle-state entries"
|
||||
;;
|
||||
status)
|
||||
printf 'cpu0 idle states:\n'
|
||||
for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
|
||||
[ -d "$d" ] || continue
|
||||
name=$(cat "$d/name" 2>/dev/null)
|
||||
dis=$(cat "$d/disable" 2>/dev/null)
|
||||
printf ' %s disable=%s name=%s\n' "$(basename "$d")" "$dis" "$name"
|
||||
done
|
||||
;;
|
||||
*)
|
||||
echo "usage: $0 {apply|restore|status}" >&2
|
||||
exit 2
|
||||
;;
|
||||
esac
|
||||
92
scripts/apricot-rail-mitigate
Executable file
|
|
@ -0,0 +1,92 @@
|
|||
#!/usr/bin/env bash
|
||||
# Emergency rail-deviation responder. Invoked by apricot-rail-watchdog when
|
||||
# a rail excursion is detected. Goal: reduce power demand for N seconds to
|
||||
# let the rail recover, then restore.
|
||||
#
|
||||
# Argv (from watchdog): <chip> <val_mV> <baseline_mV> <delta_mV> <src_ts>
|
||||
#
|
||||
# Actions:
|
||||
# 1. Drop both GPU power caps to GPU_LIMIT_SAFE (default 250W).
|
||||
# 2. Pin CPU governor to "powersave".
|
||||
# 3. Hold for HOLD_SECONDS (default 60).
|
||||
# 4. Restore prior values if we recorded them.
|
||||
#
|
||||
# Requires root (nvidia-smi -pl, writing to /sys/devices/system/cpu/...).
|
||||
# Intended to run as a root-side systemd unit triggered via a fifo or via
|
||||
# sudoers allowlist for the lilith user — install.sh sets this up.
|
||||
|
||||
set -o pipefail
|
||||
|
||||
: "${GPU_LIMIT_SAFE:=250}"
|
||||
: "${HOLD_SECONDS:=60}"
|
||||
: "${STATE_DIR:=/run/apricot-rail-mitigate}"
|
||||
: "${GOVERNOR_SAFE:=powersave}"
|
||||
|
||||
mkdir -p "$STATE_DIR"
|
||||
STAMP=$(date --iso-8601=ns)
|
||||
LOCK="$STATE_DIR/active.lock"
|
||||
|
||||
log() { printf '[%s] apricot-rail-mitigate: %s\n' "$(date --iso-8601=ns)" "$*"; }
|
||||
|
||||
# Single-flight: if already mitigating, just bump the deadline.
|
||||
if [[ -f "$LOCK" ]]; then
|
||||
deadline=$(( $(date +%s) + HOLD_SECONDS ))
|
||||
echo "$deadline" > "$LOCK"
|
||||
log "already mitigating, extending deadline to $(date -d "@$deadline" --iso-8601=s) (trigger=$*)"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
deadline=$(( $(date +%s) + HOLD_SECONDS ))
|
||||
echo "$deadline" > "$LOCK"
|
||||
log "engage trigger=$* hold=${HOLD_SECONDS}s gpu_limit=${GPU_LIMIT_SAFE}W governor=${GOVERNOR_SAFE}"
|
||||
|
||||
# --- capture prior state -------------------------------------------------
|
||||
PRIOR_GPU=$(nvidia-smi --query-gpu=index,power.limit --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')
|
||||
echo "$PRIOR_GPU" > "$STATE_DIR/prior_gpu"
|
||||
|
||||
PRIOR_GOV=""
|
||||
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
|
||||
[ -r "$g" ] && PRIOR_GOV="$(cat "$g")" && break
|
||||
done
|
||||
echo "$PRIOR_GOV" > "$STATE_DIR/prior_gov"
|
||||
|
||||
# --- apply safe state ----------------------------------------------------
|
||||
while IFS=, read -r idx _; do
|
||||
[[ "$idx" =~ ^[0-9]+$ ]] || continue
|
||||
nvidia-smi -i "$idx" -pl "$GPU_LIMIT_SAFE" >/dev/null 2>&1 \
|
||||
&& log "gpu $idx -> ${GPU_LIMIT_SAFE}W"
|
||||
done <<< "$PRIOR_GPU"
|
||||
|
||||
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
|
||||
[ -w "$g" ] || continue
|
||||
echo "$GOVERNOR_SAFE" > "$g" 2>/dev/null || true
|
||||
done
|
||||
log "cpu governor -> $GOVERNOR_SAFE (prior=$PRIOR_GOV)"
|
||||
|
||||
# --- hold, honoring deadline bumps --------------------------------------
|
||||
while true; do
|
||||
now=$(date +%s)
|
||||
target=$(cat "$LOCK" 2>/dev/null || echo 0)
|
||||
(( now >= target )) && break
|
||||
sleep $(( target - now ))
|
||||
done
|
||||
|
||||
# --- restore -------------------------------------------------------------
|
||||
while IFS=, read -r idx prior_w; do
|
||||
[[ "$idx" =~ ^[0-9]+$ ]] || continue
|
||||
prior_w="${prior_w%.*}"
|
||||
[[ -n "$prior_w" ]] || continue
|
||||
nvidia-smi -i "$idx" -pl "$prior_w" >/dev/null 2>&1 \
|
||||
&& log "gpu $idx -> ${prior_w}W (restored)"
|
||||
done < "$STATE_DIR/prior_gpu"
|
||||
|
||||
if [[ -n "$PRIOR_GOV" ]]; then
|
||||
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
|
||||
[ -w "$g" ] || continue
|
||||
echo "$PRIOR_GOV" > "$g" 2>/dev/null || true
|
||||
done
|
||||
log "cpu governor -> $PRIOR_GOV (restored)"
|
||||
fi
|
||||
|
||||
rm -f "$LOCK"
|
||||
log "disengage"
|
||||
5
scripts/apricot-rail-mitigate-trigger
Executable file
|
|
@ -0,0 +1,5 @@
|
|||
#!/usr/bin/env bash
|
||||
# User-space shim invoked by the watchdog. Delegates to the root-owned
|
||||
# apricot-rail-mitigate via sudoers (install.sh installs a NOPASSWD rule
|
||||
# scoped to this one command).
|
||||
exec sudo -n /var/opt/apricot-health/sbin/apricot-rail-mitigate "$@"
|
||||
85
scripts/apricot-rail-watchdog
Executable file
|
|
@ -0,0 +1,85 @@
|
|||
#!/usr/bin/env bash
|
||||
# Watches a stable PSU-derived rail (default: in5 on it8628 chips) by
|
||||
# learning each chip's baseline from the first BASELINE_SAMPLES and alerting
|
||||
# when later samples deviate by more than DEVIATION_MV.
|
||||
#
|
||||
# Works for any rail that shouldn't swing under normal operation. For Vcore
|
||||
# (which swings 600mV+ during P-state transitions on Threadripper) this
|
||||
# approach is unsuitable — use in5 (+12V divided) or in7 (3VSB) instead.
|
||||
#
|
||||
# hwmon numbering is boot-order-dependent, so we resolve it per-line.
|
||||
#
|
||||
# Optional mitigation hook (set MITIGATE_CMD) runs when a deviation fires —
|
||||
# receives the chip, value, baseline, delta on its argv. Use to auto-throttle
|
||||
# GPU power or CPU governor as an emergency response.
|
||||
|
||||
set -o pipefail
|
||||
|
||||
LOG="${HOME}/apricot-crash.log"
|
||||
ALERTS="${HOME}/apricot-rail-alerts.log"
|
||||
|
||||
: "${DEVIATION_MV:=30}"
|
||||
: "${BASELINE_SAMPLES:=20}"
|
||||
: "${RAIL_KEY:=in5}"
|
||||
: "${CHIP_REGEX:=it8628/hwmon[0-9]+}"
|
||||
: "${MITIGATE_CMD:=}"
|
||||
|
||||
printf '=== rail-watchdog start %s key=%s deviation=%smV baseline_samples=%s chip=%s mitigate=%s ===\n' \
|
||||
"$(date --iso-8601=ns)" "$RAIL_KEY" "$DEVIATION_MV" "$BASELINE_SAMPLES" "$CHIP_REGEX" "${MITIGATE_CMD:-<none>}" >> "$ALERTS"
|
||||
|
||||
emit() {
|
||||
local ts msg="$*"
|
||||
ts=$(date --iso-8601=ns)
|
||||
printf '%s [WARN] %s\n' "$ts" "$msg" | tee -a "$ALERTS" >&2
|
||||
}
|
||||
|
||||
info() {
|
||||
local ts msg="$*"
|
||||
ts=$(date --iso-8601=ns)
|
||||
printf '%s [INFO] %s\n' "$ts" "$msg" >> "$ALERTS"
|
||||
}
|
||||
|
||||
declare -A seen_count
|
||||
declare -A baseline
|
||||
declare -A buffer
|
||||
|
||||
chip_re="($CHIP_REGEX)"
|
||||
val_re=" ${RAIL_KEY}=([0-9]+)$"
|
||||
|
||||
median_of() {
|
||||
printf '%s\n' $1 | sort -n | awk -v n=$(wc -w <<< "$1") 'NR==int((n+1)/2){print;exit}'
|
||||
}
|
||||
|
||||
tail -F -n 0 "$LOG" 2>/dev/null | while IFS= read -r line; do
|
||||
[[ "$line" =~ $chip_re ]] || continue
|
||||
chip="${BASH_REMATCH[1]}"
|
||||
[[ "$line" =~ $val_re ]] || continue
|
||||
val="${BASH_REMATCH[1]}"
|
||||
src_ts="${line%% *}"
|
||||
|
||||
n="${seen_count[$chip]:-0}"
|
||||
n=$(( n + 1 ))
|
||||
seen_count[$chip]=$n
|
||||
|
||||
if (( n <= BASELINE_SAMPLES )); then
|
||||
buffer[$chip]="${buffer[$chip]:+${buffer[$chip]} }$val"
|
||||
if (( n == BASELINE_SAMPLES )); then
|
||||
b=$(median_of "${buffer[$chip]}")
|
||||
baseline[$chip]=$b
|
||||
info "baseline_learned chip=${chip} key=${RAIL_KEY} baseline=${b}mV samples=${BASELINE_SAMPLES}"
|
||||
unset 'buffer[$chip]'
|
||||
fi
|
||||
continue
|
||||
fi
|
||||
|
||||
b="${baseline[$chip]}"
|
||||
dev=$(( val - b ))
|
||||
(( dev < 0 )) && dev=$(( -dev ))
|
||||
if (( dev > DEVIATION_MV )); then
|
||||
emit "rail_deviation chip=${chip} key=${RAIL_KEY} val=${val}mV baseline=${b}mV |Δ|=${dev}mV at=${src_ts}"
|
||||
if [[ -n "$MITIGATE_CMD" ]]; then
|
||||
# Detach mitigation so a slow command can't block alert delivery.
|
||||
"$MITIGATE_CMD" "$chip" "$val" "$b" "$dev" "$src_ts" >> "$ALERTS" 2>&1 &
|
||||
fi
|
||||
fi
|
||||
done
|
||||
42
scripts/apricot-rasdaemon-setup
Executable file
|
|
@ -0,0 +1,42 @@
|
|||
#!/usr/bin/env bash
|
||||
# Install + enable rasdaemon for detailed AMD MCA/MCE parsing.
|
||||
#
|
||||
# rasdaemon runs a trace-buffer consumer that decodes machine-check events
|
||||
# into a sqlite DB (ras-mc_event.db, usually under /var/lib/rasdaemon/) and
|
||||
# syslogs them in human-readable form. Much more detail than edac_mce_amd
|
||||
# alone. If any crash is in-CPU or NB-side (not pure board-level power
|
||||
# loss), this catches it.
|
||||
#
|
||||
# Idempotent. Safe to re-run.
|
||||
|
||||
set -o pipefail
|
||||
|
||||
log() { printf '[%s] apricot-rasdaemon-setup: %s\n' "$(date --iso-8601=s)" "$*"; }
|
||||
|
||||
if ! command -v rasdaemon >/dev/null 2>&1; then
|
||||
log "rasdaemon not installed — attempting rpm-ostree install"
|
||||
if command -v rpm-ostree >/dev/null 2>&1; then
|
||||
sudo rpm-ostree install rasdaemon \
|
||||
&& log "installed — a reboot is required for the layered package to activate" \
|
||||
|| { log "rpm-ostree install failed"; exit 1; }
|
||||
elif command -v dnf >/dev/null 2>&1; then
|
||||
sudo dnf install -y rasdaemon \
|
||||
|| { log "dnf install failed"; exit 1; }
|
||||
else
|
||||
log "no package manager found; install rasdaemon manually"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# Enable + start the service. On rpm-ostree systems this is deferred until
|
||||
# reboot; systemctl will still succeed (the symlink is made).
|
||||
sudo systemctl enable rasdaemon.service 2>&1 | grep -v '^Created' || true
|
||||
sudo systemctl start rasdaemon.service 2>&1 \
|
||||
&& log "rasdaemon.service started" \
|
||||
|| log "rasdaemon.service will start after reboot (layered package)"
|
||||
|
||||
log "status:"
|
||||
systemctl status rasdaemon.service --no-pager 2>&1 | head -10 || true
|
||||
|
||||
log "recent events (may be empty):"
|
||||
sudo ras-mc-ctl --summary 2>&1 | head -15 || true
|
||||
4
sudoers.d/apricot-health
Normal file
|
|
@ -0,0 +1,4 @@
|
|||
# Allow user lilith to invoke the rail-mitigation script without password
|
||||
# (fired by apricot-rail-watchdog.service when a rail deviation is detected).
|
||||
# Scoped to this one command.
|
||||
lilith ALL=(root) NOPASSWD: /var/opt/apricot-health/sbin/apricot-rail-mitigate
|
||||
16
systemd/apricot-crash-monitor.service
Normal file
|
|
@ -0,0 +1,16 @@
|
|||
[Unit]
|
||||
Description=Apricot crash logger (high-frequency power/thermal/voltage capture)
|
||||
After=default.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
ExecStart=/var/home/lilith/bin/apricot-crash-logger
|
||||
Environment=INTERVAL=0.1
|
||||
Restart=always
|
||||
RestartSec=2
|
||||
StandardOutput=null
|
||||
StandardError=journal
|
||||
SyslogIdentifier=apricot-crash-monitor
|
||||
|
||||
[Install]
|
||||
WantedBy=default.target
|
||||
16
systemd/apricot-cstate-tune.service
Normal file
|
|
@ -0,0 +1,16 @@
|
|||
[Unit]
|
||||
Description=Apricot CPU C-state tuning (disable deep C-states to reduce VRM transient demand)
|
||||
After=multi-user.target
|
||||
ConditionPathExists=/sys/devices/system/cpu/cpu0/cpuidle/state0
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart=/var/opt/apricot-health/sbin/apricot-cstate-tune apply
|
||||
ExecStop=/var/opt/apricot-health/sbin/apricot-cstate-tune restore
|
||||
RemainAfterExit=yes
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
SyslogIdentifier=apricot-cstate-tune
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
17
systemd/apricot-rail-watchdog.service
Normal file
|
|
@ -0,0 +1,17 @@
|
|||
[Unit]
|
||||
Description=Apricot PSU rail deviation watchdog (it8628 in5 baseline)
|
||||
After=apricot-crash-monitor.service
|
||||
Wants=apricot-crash-monitor.service
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
ExecStart=/var/home/lilith/bin/apricot-rail-watchdog
|
||||
Environment=MITIGATE_CMD=/var/home/lilith/bin/apricot-rail-mitigate-trigger
|
||||
Restart=always
|
||||
RestartSec=2
|
||||
StandardOutput=null
|
||||
StandardError=journal
|
||||
SyslogIdentifier=apricot-rail-watchdog
|
||||
|
||||
[Install]
|
||||
WantedBy=apricot-crash-monitor.service
|
||||