feat(@packages/apricot-health): ✨ add power-fault monitoring and mitigation tools
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
commit dafbabee41
15 changed files with 663 additions and 0 deletions
66
README.md
Normal file
@@ -0,0 +1,66 @@
# apricot-health

Power-fault diagnostics and mitigation for **apricot**, a Threadripper 2990WX / X399 AORUS XTREME-CF / dual RTX 3090 rig on an open wet-bench that suffers random hard power-offs. The root cause is still being isolated (aging PSU caps, VRM degradation, or both).

## What's in here

| Component | What it does |
|---|---|
| `scripts/apricot-crash-logger` | High-frequency (10 Hz) sensor snapshotter. Captures GPU / CPU / NVMe / motherboard-rail telemetry to `~/apricot-crash.log` (GPU telemetry on every 10th sample by default). The log is synced to disk about once per second, so at most roughly the final second before a hard reset is lost. |
| `scripts/apricot-rail-watchdog` | Tails the crash log, learns a per-chip baseline (median of the first `BASELINE_SAMPLES` readings) for `in5` on each `it8628/hwmonN`, and alerts when a reading deviates by more than `DEVIATION_MV` (default 30 mV). Optionally invokes a mitigation hook. |
| `scripts/apricot-rail-mitigate` | Root-only emergency responder: drops GPU power caps and pins the CPU governor to `powersave` for `HOLD_SECONDS` (default 60), then restores. Fired by the watchdog via sudoers. |
| `scripts/apricot-rail-mitigate-trigger` | User-space shim that `sudo`s into `apricot-rail-mitigate` (scoped NOPASSWD). |
| `scripts/apricot-cstate-tune` | Disables deep CPU C-states (C2+) so Vcore stays at a higher baseline, reducing VRM transient-demand magnitude. Runs as a oneshot systemd unit at boot. |
| `scripts/apricot-rasdaemon-setup` | Installs and enables `rasdaemon`, which decodes AMD MCA/MCE events into a sqlite DB. |
| `modprobe.d/it87.conf` | `force_id=0x8628 ignore_resource_conflict=1`: binds the IT8628E SuperIO so voltage/fan/temp readings are exposed in `/sys/class/hwmon`. |
| `modules-load.d/it87.conf` | Loads `it87` at boot. |
| `sudoers.d/apricot-health` | NOPASSWD rule for `lilith` to invoke the mitigation entrypoint (scoped to one command). |
| `systemd/*.service` | Three units: one root (`apricot-cstate-tune`) and two user (`apricot-crash-monitor`, `apricot-rail-watchdog`). |

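The watchdog's baseline-and-deviation logic can be sketched in a few lines of bash. This is a minimal illustration rather than the shipped script: the helper names `median_mv` and `deviates` and the sample readings are hypothetical. Only the median-of-first-samples idea and the 30 mV default are taken from the package.

```sh
#!/usr/bin/env bash
# Sketch only: median_mv and deviates are hypothetical helper names.

# Print the (lower) median of the integer millivolt readings given as arguments.
median_mv() {
  local n=$#
  printf '%s\n' "$@" | sort -n | sed -n "$(( (n + 1) / 2 ))p"
}

# Succeed (exit 0) when |value - baseline| exceeds the threshold, all in mV.
deviates() {
  local val=$1 base=$2 limit=${3:-30}
  local d=$(( val - base ))
  (( d < 0 )) && d=$(( -d ))
  (( d > limit ))
}

# Hypothetical readings: learn a baseline, then test two later samples.
baseline=$(median_mv 852 852 840 852 864)
echo "baseline=${baseline}mV"
deviates 408 "$baseline" && echo "408 mV: alert"
deviates 846 "$baseline" || echo "846 mV: within tolerance"
```

Using the median rather than the mean keeps a single outlier during the learning window from shifting the baseline.
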
## Install

```sh
./install.sh                    # targets HOST=apricot by default
HOST=other-host ./install.sh    # or override
```

Idempotent. Re-run to push updates.

## Tuning

Most runtime behavior is controlled by environment variables, which can be overridden through systemd drop-ins:

```sh
systemctl --user edit apricot-rail-watchdog
# [Service]
# Environment=DEVIATION_MV=50 BASELINE_SAMPLES=40 RAIL_KEY=in5
```

Key knobs:

- `INTERVAL` (crash-logger): sample period in seconds; `0.1` = 10 Hz.
- `DEVIATION_MV` (watchdog): deviation from the learned baseline, in mV, that triggers an alert.
- `MITIGATE_CMD` (watchdog): path to the mitigation hook; empty = alert only.
- `GPU_LIMIT_SAFE` (mitigate): wattage to clamp the GPUs to during mitigation.
- `HOLD_SECONDS` (mitigate): how long to hold the safe state, in seconds.

The two `mitigate` knobs are read by the root-side script, which is started through `sudo`. Since `sudo` resets the environment by default, a drop-in on the watchdog unit may not reach them.

## Outputs

- `~/apricot-crash.log`: per-sample telemetry.
- `~/apricot-rail-alerts.log`: watchdog alerts and learned baselines.
- `journalctl --user -u apricot-rail-watchdog`: live alerts (lines tagged `[WARN]`).
- `journalctl -u apricot-cstate-tune`: one-shot C-state tune result at boot.
- `/var/lib/rasdaemon/ras-mc_event.db` (after rasdaemon setup): decoded MCEs.

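Each sensor line in `~/apricot-crash.log` has the form `<timestamp> sensor <chip>/<hwmonN> <label>=<raw>` (GPU lines use `<timestamp> gpu <csv>`), so a single rail can be pulled out with standard tools. A minimal sketch, assuming that line format; the helper name `rail_samples` and the sample lines are hypothetical:

```sh
#!/usr/bin/env bash
# Sketch only: rail_samples is a hypothetical helper, not part of the package.

# Print "<timestamp> <chip>/<hwmonN> <millivolts>" for every sample of one rail.
#   $1 = log file, $2 = rail label (default in5), $3 = chip name (default it8628)
rail_samples() {
  local log=$1 key=${2:-in5} chip=${3:-it8628}
  awk -v key="$key" -v chip="$chip" '
    $2 == "sensor" && index($3, chip "/") == 1 {
      split($4, kv, "=")
      if (kv[1] == key) print $1, $3, kv[2]
    }' "$log"
}

# Demo on a few made-up lines in the logger's format.
demo=$(mktemp)
cat > "$demo" <<'EOF'
2026-04-17T18:50:42,900000000-07:00 sensor it8628/hwmon3 in5=852
2026-04-17T18:50:43,000000000-07:00 sensor it8628/hwmon3 in5=408
2026-04-17T18:50:43,000000000-07:00 sensor k10temp/hwmon1 Tctl=70000
EOF
rail_samples "$demo"
rm -f "$demo"
```

Against a real log, `rail_samples ~/apricot-crash.log | tail` would show the most recent readings. Labels that contain spaces are not handled by this sketch.
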
## Post-mortem flow when a crash happens

1. `ssh apricot` (after it comes back; BIOS "AC Back: Power On" auto-restarts it).
2. `grep -n '^=== session start' ~/apricot-crash.log | tail -5` to find the new session boundary.
3. The lines immediately above the newest session marker are the last samples that reached disk before the power loss.
4. `tail ~/apricot-rail-alerts.log`: did the watchdog see a rail deviation before the event?
5. `journalctl -b -1 --no-pager | tail -40`: the kernel's last words (often unremarkable; a hard-off leaves no panic).
6. SMART unsafe-shutdown counter: `sudo smartctl -a /dev/nvme0 | grep -i unsafe`; it should have incremented by 1.

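Steps 2 and 3 can be combined into a small helper that prints the final samples of the session that ended in the crash. A minimal sketch, assuming the `=== session start` marker written by the logger; the function name `last_before_crash` is hypothetical:

```sh
#!/usr/bin/env bash
# Sketch only: last_before_crash is a hypothetical helper name.

# Print the last N lines (default 20) that precede the newest session marker.
last_before_crash() {
  local log=$1 n=${2:-20} marker
  marker=$(grep -n '^=== session start' "$log" | tail -n 1 | cut -d: -f1)
  [ -n "$marker" ] || { echo "no session marker in $log" >&2; return 1; }
  (( marker > 1 )) || return 0    # newest session is also the first one
  head -n "$(( marker - 1 ))" "$log" | tail -n "$n"
}

# Demo on a made-up two-session log.
demo=$(mktemp)
printf '%s\n' \
  '=== session start A ===' \
  'T1 sensor it8628/hwmon3 in5=852' \
  'T2 sensor it8628/hwmon3 in5=408' \
  '=== session start B ===' \
  'T3 sensor it8628/hwmon3 in5=852' > "$demo"
last_before_crash "$demo" 2
rm -f "$demo"
```

Note that the logger also writes a session marker when its service is merely restarted, so the newest marker only corresponds to a crash if the box actually rebooted.
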
## Diagnosis so far

See [`docs/DIAGNOSIS.md`](docs/DIAGNOSIS.md).
78
docs/DIAGNOSIS.md
Normal file
@@ -0,0 +1,78 @@
# apricot hard-off diagnosis

Running log of the investigation. Newest findings at top.

## Platform
- Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench).
- AMD Threadripper 2990WX (32-core, 250 W TDP).
- 2× NVIDIA RTX 3090 (stock 370 W cap each).
- 2× NVMe + 3× SATA.
- 2× Corsair PSUs:
  - **HX1500i**: was producing audible coil-whine before the split; now carries only drives + Molex.
  - **HX1200**: now carries mobo + CPU + both GPUs.
- Fedora Bluefin (ostree), kernel 6.17.12-200.fc42.
- Non-ECC memory (`amd64_edac` cannot bind).

## Failure signature (consistent across all events)

1. Journal cuts abruptly mid-operation. No `Reached target Shutdown`, no `systemd-shutdown`, no kernel panic.
2. Next boot runs `XFS (dm-0): Starting recovery`, so the filesystem wasn't unmounted cleanly.
3. NVMe SMART `Unsafe Shutdowns` increments by 1 on each event. Currently about 66 % of all power cycles are unclean.
4. BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection.
5. No MCE / thermal-throttle / OOM / hung-task entries.

→ The kernel never runs a shutdown; the 12 V plane collapses from under it. This is consistent with a PSU OCP/UVP trip or a VRM brownout.

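The unclean-shutdown ratio in item 3 is the NVMe `Unsafe Shutdowns` counter divided by `Power Cycles`. A minimal sketch that computes it from `smartctl -a` text on stdin; the helper name `unclean_ratio` and the demo counter values are hypothetical (chosen only to land near 66 %), and the parsing assumes smartctl's usual `Field: value` layout with optional thousands separators:

```sh
#!/usr/bin/env bash
# Sketch only: unclean_ratio is a hypothetical helper name.

# Read `smartctl -a /dev/nvmeN` output on stdin and print the percentage of
# power cycles that ended in an unsafe shutdown (integer, rounded down).
unclean_ratio() {
  awk -F: '
    /^Power Cycles:/     { gsub(/[ ,]/, "", $2); cycles = $2 }
    /^Unsafe Shutdowns:/ { gsub(/[ ,]/, "", $2); unsafe = $2 }
    END {
      if (cycles + 0 == 0) exit 1
      printf "%d\n", int(unsafe * 100 / cycles)
    }'
}

# Demo with made-up counters.
unclean_ratio <<'EOF'
Power Cycles:                       150
Unsafe Shutdowns:                   99
EOF
```

On the real machine this would be fed from `sudo smartctl -a /dev/nvme0 | unclean_ratio`.
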
## Timeline of captured crashes

| Timestamp (PDT) | GPU 0 | GPU 1 | CPU Tctl | Load profile |
|---|---|---|---|---|
| 2026-04-16 15:58:06 | 158 W | **368 W** (pegged) | n/a | Sustained high: GPU 1 inference under load |
| 2026-04-17 03:22:54 | 117 W | 25 W (idle) | 70 °C | **Near-idle**: background auto-commit + tor-manager only |
| 2026-04-17 11:15:42 | 20 W | **368 W** (pegged) | 72 °C | High GPU 1 load |
| 2026-04-17 21:35:10 | 117 W | 129 W | 69 °C | Moderate, both GPUs in P2 |

Crashes span near-idle to sustained peak load, with no consistent load correlation.

## Rail observations (it8628 SuperIO, after binding via `it87 force_id=0x8628`)

Stable rails during normal operation:

- `in5` on chip 1 (hwmon3/hwmon8 depending on boot order): **852 mV steady** → likely +12 V scaled ~14:1 → ~11.9 V actual.
- `in5` on chip 2: **1632 mV steady** → likely +5 V scaled ~3:1 → ~4.9 V actual.

**Key observation 2026-04-17**: Between crashes, `in5` on chip 1 collapsed from **852 mV → 408 mV** twice (18:50:43-45, 19:20:50-52), recovering within one sample. That is roughly a **50 % rail drop**; if `in5` really tracks +12 V, it corresponds to a momentary sag from ~12 V to ~5.7 V. The system survived both. This suggests the supply is visibly failing at slow timescales too, not only at the microsecond scale that causes a hard-off.

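The conversions above are simple divider arithmetic: rail voltage ≈ raw reading × divider ratio. A minimal sketch using integer millivolts; the helper names `rail_mv` and `drop_pct` are hypothetical, and the 14:1 and 3:1 ratios are the estimates stated above, not measured values:

```sh
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; divider ratios are estimates.

# Scale a raw SuperIO reading (mV) by an assumed divider ratio; prints rail mV.
rail_mv() { echo $(( $1 * $2 )); }

# Percentage drop from a baseline reading to a sagged reading (rounded down).
drop_pct() { echo $(( ($1 - $2) * 100 / $1 )); }

echo "chip 1 steady: $(rail_mv 852 14) mV"    # 11928 mV, about 11.9 V
echo "chip 2 steady: $(rail_mv 1632 3) mV"    # 4896 mV, about 4.9 V
echo "chip 1 sag:    $(rail_mv 408 14) mV"    # 5712 mV, about 5.7 V
echo "sag depth:     $(drop_pct 852 408) %"   # 52 %
```
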
## What has been ruled out

- **Thermal**: all CPU/GPU/NVMe temps well below throttle thresholds at every crash.
- **OOM / hung task**: journal shows none.
- **MCE**: `edac_mce_amd` loaded, no events logged.
- **Graceful shutdown path**: no systemd shutdown-target progression.
- **nvidia-oc daemon**: fixed independently (it was thrashing sqlite locks); not related to crashes.
- **HX1500i as sole cause**: crashes continued after moving the mobo, CPU, and GPU load off it onto the HX1200.

## What's consistent with observations

- **Aging filter caps on PSU and/or motherboard VRM**. Both the squealing HX1500i *and* the HX1200 have produced visible rail excursions. Board is 8 years old.
- **Load-independent failure**: crashes happen at both idle and peak load, and the `in5` rail drops caught by the watchdog point to an intermittent supply failure decoupled from workload.

## What remains to rule out (physical)

- Visual inspection of VRM caps on the board (open bench, trivial).
- Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V.
- Swap to a third known-good PSU for a day.
- Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible).

## Software stack currently deployed

- **10 Hz telemetry logger** (`apricot-crash-monitor.service`): writes `~/apricot-crash.log`, synced once per second.
- **Rail watchdog** (`apricot-rail-watchdog.service`): baseline-learning on `in5`, 30 mV deviation threshold, invokes mitigation on trigger.
- **Emergency mitigation** (`apricot-rail-mitigate`): drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores.
- **C-state tune** (`apricot-cstate-tune.service`): disables C2+ at boot to reduce VRM transient demand.
- **IT8628E binding** (`/etc/modprobe.d/it87.conf` + `/etc/modules-load.d/it87.conf`): SuperIO sensors auto-load with correct `force_id`.
- **rasdaemon**: optional, via `apricot-rasdaemon-setup`.

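To confirm what the C-state tune actually did, the per-state `name` and `disable` files under `/sys/devices/system/cpu/cpuN/cpuidle/` can be read directly. A minimal sketch; the function name `idle_states` is hypothetical, and the optional second argument (an alternative sysfs root) exists only so the helper can be tried against a scratch directory:

```sh
#!/usr/bin/env bash
# Sketch only: idle_states is a hypothetical helper name.

# Print "<state> disable=<0|1> name=<name>" for each cpuidle state of one CPU.
#   $1 = CPU number (default 0), $2 = sysfs CPU root (default: the real one)
idle_states() {
  local cpu=${1:-0} root=${2:-/sys/devices/system/cpu} d
  for d in "$root/cpu$cpu"/cpuidle/state[0-9]*; do
    [ -d "$d" ] || continue
    printf '%s disable=%s name=%s\n' \
      "${d##*/}" "$(<"$d/disable")" "$(<"$d/name")"
  done
}

idle_states 0
```

The package's own `apricot-cstate-tune status` prints the same information for cpu0; after `apply`, every state with index 2 or higher should show `disable=1`.
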
## Non-software fixes kept separate from this package

- nvidia-oc WAL-mode patch (upstreamed via ACS to `origin/master` of the nvidia-oc repo, commit `bea1934`).
105
install.sh
Executable file
@@ -0,0 +1,105 @@
#!/usr/bin/env bash
# Install apricot-health on the target host (default: apricot).
#
# Layout on target:
#   /var/home/lilith/bin/                              user-runnable scripts
#   /var/opt/apricot-health/sbin/                      root-only entrypoints (ostree-safe)
#   /etc/modprobe.d/it87.conf                          IT8628E force_id
#   /etc/modules-load.d/it87.conf                      load it87 at boot
#   /etc/sudoers.d/apricot-health                      NOPASSWD shim for mitigation
#   /etc/systemd/system/apricot-cstate-tune.service    root systemd unit
#   /var/home/lilith/.config/systemd/user/*.service    user systemd units
#
# Idempotent: re-running copies updates and daemon-reloads.

set -euo pipefail

HOST="${HOST:-apricot}"
PKG_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

echo "==> apricot-health install to $HOST (pkg=$PKG_DIR)"

# --- stage tarball locally so we upload in one round-trip ---------------
stage=$(mktemp -d)
trap 'rm -rf "$stage"' EXIT
mkdir -p "$stage"/{bin,root-sbin,etc-modprobe,etc-modules-load,etc-sudoers,etc-systemd,user-systemd}

cp "$PKG_DIR/scripts/apricot-crash-logger"          "$stage/bin/"
cp "$PKG_DIR/scripts/apricot-rail-watchdog"         "$stage/bin/"
cp "$PKG_DIR/scripts/apricot-rail-mitigate-trigger" "$stage/bin/"
cp "$PKG_DIR/scripts/apricot-rasdaemon-setup"       "$stage/bin/"
cp "$PKG_DIR/scripts/apricot-rail-mitigate"         "$stage/root-sbin/"
cp "$PKG_DIR/scripts/apricot-cstate-tune"           "$stage/root-sbin/"
cp "$PKG_DIR/modprobe.d/it87.conf"                  "$stage/etc-modprobe/"
cp "$PKG_DIR/modules-load.d/it87.conf"              "$stage/etc-modules-load/"
cp "$PKG_DIR/sudoers.d/apricot-health"              "$stage/etc-sudoers/"
cp "$PKG_DIR/systemd/apricot-cstate-tune.service"   "$stage/etc-systemd/"
cp "$PKG_DIR/systemd/apricot-crash-monitor.service" "$stage/user-systemd/"
cp "$PKG_DIR/systemd/apricot-rail-watchdog.service" "$stage/user-systemd/"

tar -czf "$stage/pkg.tar.gz" -C "$stage" bin root-sbin etc-modprobe etc-modules-load etc-sudoers etc-systemd user-systemd
echo "==> staged $(du -h "$stage/pkg.tar.gz" | cut -f1)"

# --- ship it ------------------------------------------------------------
scp -q "$stage/pkg.tar.gz" "$HOST:/tmp/apricot-health.tar.gz"

ssh "$HOST" bash -s <<'REMOTE'
set -euo pipefail
echo "==> remote install"

t=$(mktemp -d)
tar -xzf /tmp/apricot-health.tar.gz -C "$t"

# User-runnable scripts
mkdir -p /var/home/lilith/bin
install -m 0755 -o lilith -g lilith "$t"/bin/* /var/home/lilith/bin/

# Root-only entrypoints (ostree-safe path under /var)
sudo mkdir -p /var/opt/apricot-health/sbin
sudo install -m 0755 -o root -g root "$t"/root-sbin/* /var/opt/apricot-health/sbin/

# Kernel module config
sudo install -m 0644 "$t"/etc-modprobe/it87.conf /etc/modprobe.d/it87.conf
sudo install -m 0644 "$t"/etc-modules-load/it87.conf /etc/modules-load.d/it87.conf

# Sudoers (visudo-check first: malformed sudoers can lock the user out)
tmp_sudo=$(mktemp)
cp "$t"/etc-sudoers/apricot-health "$tmp_sudo"
if sudo visudo -cf "$tmp_sudo" >/dev/null 2>&1; then
  sudo install -m 0440 -o root -g root "$tmp_sudo" /etc/sudoers.d/apricot-health
  echo "  sudoers: installed"
else
  echo "  sudoers: SYNTAX ERROR, not installing" >&2
  exit 1
fi
rm -f "$tmp_sudo"

# Root systemd units
sudo install -m 0644 "$t"/etc-systemd/apricot-cstate-tune.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now apricot-cstate-tune.service
echo "  apricot-cstate-tune.service: enabled + started"

# User systemd units (under lilith)
sudo -u lilith mkdir -p /var/home/lilith/.config/systemd/user
sudo -u lilith install -m 0644 "$t"/user-systemd/apricot-crash-monitor.service /var/home/lilith/.config/systemd/user/
sudo -u lilith install -m 0644 "$t"/user-systemd/apricot-rail-watchdog.service /var/home/lilith/.config/systemd/user/
sudo loginctl enable-linger lilith 2>/dev/null || true
sudo systemctl --user -M lilith@.host daemon-reload
sudo systemctl --user -M lilith@.host enable --now apricot-crash-monitor.service
# Enable first, then restart: "restart" also starts an inactive unit and
# succeeds, so chaining "restart || enable --now" would never enable the
# watchdog on a fresh install.
sudo systemctl --user -M lilith@.host enable apricot-rail-watchdog.service
sudo systemctl --user -M lilith@.host restart apricot-rail-watchdog.service
echo "  user units: enabled + started"

# Load it87 now if not yet loaded
if ! lsmod | grep -q '^it87 '; then
  sudo modprobe it87 force_id=0x8628 ignore_resource_conflict=1 \
    && echo "  it87 module: loaded" \
    || echo "  it87 module: load FAILED (try reboot)"
fi

rm -rf "$t" /tmp/apricot-health.tar.gz
echo "==> install complete"
REMOTE

echo "==> done"
1
modprobe.d/it87.conf
Normal file
@@ -0,0 +1 @@
options it87 force_id=0x8628 ignore_resource_conflict=1
1
modules-load.d/it87.conf
Normal file
@@ -0,0 +1 @@
it87
83
scripts/apricot-crash-logger
Executable file
@@ -0,0 +1,83 @@
#!/usr/bin/env bash
# Continuously appends power/thermal/voltage state to $LOG so that the last
# fractions of a second before a hard reset survive the crash.
#
# Env overrides:
#   LOG               output path (default ~/apricot-crash.log)
#   INTERVAL          sample period in seconds (default 0.1 = 10 Hz)
#   GPU_SAMPLE_EVERY  query nvidia-smi on every Nth sample (default 10)
#   SENSOR_CHIPS      regex of hwmon name(s) to capture
#                     (default k10temp|nvme|it8628|nct6.*|w83.*)

set -o pipefail

LOG="${LOG:-${HOME}/apricot-crash.log}"
INTERVAL="${INTERVAL:-0.1}"
GPU_SAMPLE_EVERY="${GPU_SAMPLE_EVERY:-10}"   # nvidia-smi is slow; only invoke every Nth iter
SENSOR_CHIPS="${SENSOR_CHIPS:-k10temp|nvme|it8628|nct6.*|w83.*}"

printf '=== session start %s (pid=%s interval=%ss gpu_every=%s chips=%s) ===\n' \
  "$(date --iso-8601=ns)" "$$" "$INTERVAL" "$GPU_SAMPLE_EVERY" "$SENSOR_CHIPS" >> "$LOG"

# Pre-resolve matching hwmon paths up front and refresh them every few
# seconds (cheaper than re-globbing on every sample).
declare -a HWMONS
refresh_hwmons() {
  HWMONS=()
  for h in /sys/class/hwmon/hwmon*; do
    [ -d "$h" ] || continue
    [ -r "$h/name" ] || continue
    name=$(<"$h/name")   # bash builtin, no fork
    [[ "$name" =~ ^(${SENSOR_CHIPS})$ ]] || continue
    HWMONS+=("$h")
  done
}
refresh_hwmons
last_refresh=$SECONDS
iter=0

while :; do
  ts=$(date --iso-8601=ns)

  # GPU telemetry: skip most iterations because nvidia-smi startup is
  # ~300-500ms, which would cap the loop at ~2 Hz otherwise.
  if (( iter % GPU_SAMPLE_EVERY == 0 )); then
    while IFS= read -r gpu_line; do
      printf '%s gpu %s\n' "$ts" "$gpu_line"
    done < <(nvidia-smi \
      --query-gpu=index,temperature.gpu,power.draw,clocks.gr,clocks.mem,pstate,utilization.gpu,memory.used \
      --format=csv,noheader,nounits 2>/dev/null)
  fi
  iter=$(( iter + 1 ))

  # Platform sensors: use the $(<file) bash builtin everywhere to avoid
  # fork+exec per read. With ~60 sensor files that's the difference
  # between ~600ms per iteration and <20ms.
  for h in "${HWMONS[@]}"; do
    [ -r "$h/name" ] || continue
    name=$(<"$h/name")
    hb=${h##*/}
    for inp in "$h"/temp*_input "$h"/in*_input "$h"/fan*_input "$h"/curr*_input; do
      [ -r "$inp" ] || continue
      n=${inp##*/}; n=${n%_input}
      label_file="$h/${n}_label"
      if [ -r "$label_file" ]; then
        label=$(<"$label_file")
      else
        label="$n"
      fi
      raw=$(<"$inp")
      printf '%s sensor %s/%s %s=%s\n' "$ts" "$name" "$hb" "$label" "$raw"
    done
  done

  # Refresh hwmon list every ~5s in case modules load/unload.
  if (( SECONDS - last_refresh > 5 )); then
    refresh_hwmons
    last_refresh=$SECONDS
  fi

  # Sync the log about once per second regardless of sample rate (amortized).
  # Keyed on $SECONDS rather than on a timestamp digit so that loop drift
  # cannot skip a second; an unset last_sync evaluates to 0 here.
  if (( SECONDS != last_sync )); then
    sync "$LOG" 2>/dev/null || true
    last_sync=$SECONDS
  fi

  sleep "$INTERVAL"
done >> "$LOG"
52
scripts/apricot-cstate-tune
Executable file
@@ -0,0 +1,52 @@
#!/usr/bin/env bash
# Disable deep CPU C-states so Vcore stays at a higher baseline and the VRM
# doesn't have to slam from C6/C7 idle back to full current on every workload
# transient. Reduces transient-demand magnitude; does NOT fix root-cause PSU
# or VRM degradation, but may reduce crash frequency on aging boards.
#
# Leaves cpuidle states 0 and 1 enabled (POLL and C1/halt). Disables states 2+.
#
# Usage: apricot-cstate-tune {apply|restore|status}   (default: apply)
# Reversible: run with `restore` to re-enable everything.

set -o pipefail

log() { printf '[%s] apricot-cstate-tune: %s\n' "$(date --iso-8601=s)" "$*"; }

mode="${1:-apply}"

case "$mode" in
  apply)
    n_cpus=$(ls -d /sys/devices/system/cpu/cpu[0-9]* 2>/dev/null | wc -l)
    disabled=0
    for s in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
      [ -w "$s" ] || continue
      idx="${s%/disable}"; idx="${idx##*state}"
      # Keep states 0 (POLL/C0) and 1 (C1/halt); disable 2+.
      if (( idx >= 2 )); then
        echo 1 > "$s" 2>/dev/null && disabled=$(( disabled + 1 ))
      fi
    done
    log "disabled $disabled idle-state entries across $n_cpus CPUs (kept C0+C1)"
    ;;
  restore)
    enabled=0
    for s in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
      [ -w "$s" ] || continue
      echo 0 > "$s" 2>/dev/null && enabled=$(( enabled + 1 ))
    done
    log "re-enabled $enabled idle-state entries"
    ;;
  status)
    printf 'cpu0 idle states:\n'
    for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
      [ -d "$d" ] || continue
      name=$(cat "$d/name" 2>/dev/null)
      dis=$(cat "$d/disable" 2>/dev/null)
      printf '  %s disable=%s name=%s\n' "$(basename "$d")" "$dis" "$name"
    done
    ;;
  *)
    echo "usage: $0 {apply|restore|status}" >&2
    exit 2
    ;;
esac
92
scripts/apricot-rail-mitigate
Executable file
@@ -0,0 +1,92 @@
#!/usr/bin/env bash
# Emergency rail-deviation responder. Invoked by apricot-rail-watchdog when
# a rail excursion is detected. Goal: reduce power demand for N seconds to
# let the rail recover, then restore.
#
# Argv (from watchdog): <chip> <val_mV> <baseline_mV> <delta_mV> <src_ts>
#
# Actions:
#   1. Drop both GPU power caps to GPU_LIMIT_SAFE (default 250W).
#   2. Pin CPU governor to "powersave".
#   3. Hold for HOLD_SECONDS (default 60).
#   4. Restore prior values if we recorded them.
#
# Requires root (nvidia-smi -pl, writing to /sys/devices/system/cpu/...).
# Intended to run either as a root-side systemd unit triggered via a fifo or
# through a sudoers allowlist entry for the lilith user; install.sh sets up
# the sudoers route.

set -o pipefail

: "${GPU_LIMIT_SAFE:=250}"
: "${HOLD_SECONDS:=60}"
: "${STATE_DIR:=/run/apricot-rail-mitigate}"
: "${GOVERNOR_SAFE:=powersave}"

mkdir -p "$STATE_DIR"
STAMP=$(date --iso-8601=ns)
LOCK="$STATE_DIR/active.lock"

log() { printf '[%s] apricot-rail-mitigate: %s\n' "$(date --iso-8601=ns)" "$*"; }

deadline=$(( $(date +%s) + HOLD_SECONDS ))

# Single-flight: create the lock atomically (noclobber makes ">" fail if the
# file already exists, which closes the check-then-create race). If it
# exists, another run is already mitigating, so just bump its deadline.
if ! ( set -o noclobber; echo "$deadline" > "$LOCK" ) 2>/dev/null; then
  echo "$deadline" > "$LOCK"
  log "already mitigating, extending deadline to $(date -d "@$deadline" --iso-8601=s) (trigger=$*)"
  exit 0
fi

log "engage trigger=$* hold=${HOLD_SECONDS}s gpu_limit=${GPU_LIMIT_SAFE}W governor=${GOVERNOR_SAFE}"

# --- capture prior state -------------------------------------------------
PRIOR_GPU=$(nvidia-smi --query-gpu=index,power.limit --format=csv,noheader,nounits 2>/dev/null | sed 's/ //g')
echo "$PRIOR_GPU" > "$STATE_DIR/prior_gpu"

PRIOR_GOV=""
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  [ -r "$g" ] && PRIOR_GOV="$(cat "$g")" && break
done
echo "$PRIOR_GOV" > "$STATE_DIR/prior_gov"

# --- apply safe state ----------------------------------------------------
while IFS=, read -r idx _; do
  [[ "$idx" =~ ^[0-9]+$ ]] || continue
  nvidia-smi -i "$idx" -pl "$GPU_LIMIT_SAFE" >/dev/null 2>&1 \
    && log "gpu $idx -> ${GPU_LIMIT_SAFE}W"
done <<< "$PRIOR_GPU"

for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  [ -w "$g" ] || continue
  echo "$GOVERNOR_SAFE" > "$g" 2>/dev/null || true
done
log "cpu governor -> $GOVERNOR_SAFE (prior=$PRIOR_GOV)"

# --- hold, honoring deadline bumps --------------------------------------
while true; do
  now=$(date +%s)
  target=$(cat "$LOCK" 2>/dev/null || echo 0)
  (( now >= target )) && break
  sleep $(( target - now ))
done

# --- restore -------------------------------------------------------------
while IFS=, read -r idx prior_w; do
  [[ "$idx" =~ ^[0-9]+$ ]] || continue
  prior_w="${prior_w%.*}"
  [[ -n "$prior_w" ]] || continue
  nvidia-smi -i "$idx" -pl "$prior_w" >/dev/null 2>&1 \
    && log "gpu $idx -> ${prior_w}W (restored)"
done < "$STATE_DIR/prior_gpu"

if [[ -n "$PRIOR_GOV" ]]; then
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -w "$g" ] || continue
    echo "$PRIOR_GOV" > "$g" 2>/dev/null || true
  done
  log "cpu governor -> $PRIOR_GOV (restored)"
fi

rm -f "$LOCK"
log "disengage"
5
scripts/apricot-rail-mitigate-trigger
Executable file
@@ -0,0 +1,5 @@
#!/usr/bin/env bash
# User-space shim invoked by the watchdog. Delegates to the root-owned
# apricot-rail-mitigate via sudoers (install.sh installs a NOPASSWD rule
# scoped to this one command).
exec sudo -n /var/opt/apricot-health/sbin/apricot-rail-mitigate "$@"
85
scripts/apricot-rail-watchdog
Executable file
@@ -0,0 +1,85 @@
#!/usr/bin/env bash
# Watches a stable PSU-derived rail (default: in5 on it8628 chips) by
# learning each chip's baseline from the first BASELINE_SAMPLES and alerting
# when later samples deviate by more than DEVIATION_MV.
#
# Works for any rail that shouldn't swing under normal operation. For Vcore
# (which swings 600mV+ during P-state transitions on Threadripper) this
# approach is unsuitable; use in5 (+12V divided) or in7 (3VSB) instead.
#
# hwmon numbering is boot-order-dependent, so we resolve it per-line.
#
# Optional mitigation hook (set MITIGATE_CMD) runs when a deviation fires and
# receives the chip, value, baseline, delta, and source timestamp on its
# argv. Use to auto-throttle GPU power or CPU governor as an emergency
# response.

set -o pipefail

LOG="${HOME}/apricot-crash.log"
ALERTS="${HOME}/apricot-rail-alerts.log"

: "${DEVIATION_MV:=30}"
: "${BASELINE_SAMPLES:=20}"
: "${RAIL_KEY:=in5}"
: "${CHIP_REGEX:=it8628/hwmon[0-9]+}"
: "${MITIGATE_CMD:=}"

printf '=== rail-watchdog start %s key=%s deviation=%smV baseline_samples=%s chip=%s mitigate=%s ===\n' \
  "$(date --iso-8601=ns)" "$RAIL_KEY" "$DEVIATION_MV" "$BASELINE_SAMPLES" "$CHIP_REGEX" "${MITIGATE_CMD:-<none>}" >> "$ALERTS"

# Write a warning to the alerts log and to stderr (which goes to the journal).
emit() {
  local ts msg="$*"
  ts=$(date --iso-8601=ns)
  printf '%s [WARN] %s\n' "$ts" "$msg" | tee -a "$ALERTS" >&2
}

# Write an informational line to the alerts log only.
info() {
  local ts msg="$*"
  ts=$(date --iso-8601=ns)
  printf '%s [INFO] %s\n' "$ts" "$msg" >> "$ALERTS"
}

declare -A seen_count
declare -A baseline
declare -A buffer

chip_re="($CHIP_REGEX)"
val_re=" ${RAIL_KEY}=([0-9]+)$"

# Lower median of a space-separated list of integers passed as one argument.
median_of() {
  # $1 is intentionally unquoted so it splits into one value per line.
  printf '%s\n' $1 | sort -n | awk -v n="$(wc -w <<< "$1")" 'NR==int((n+1)/2){print;exit}'
}

tail -F -n 0 "$LOG" 2>/dev/null | while IFS= read -r line; do
  [[ "$line" =~ $chip_re ]] || continue
  chip="${BASH_REMATCH[1]}"
  [[ "$line" =~ $val_re ]] || continue
  val="${BASH_REMATCH[1]}"
  src_ts="${line%% *}"

  n="${seen_count[$chip]:-0}"
  n=$(( n + 1 ))
  seen_count[$chip]=$n

  if (( n <= BASELINE_SAMPLES )); then
    buffer[$chip]="${buffer[$chip]:+${buffer[$chip]} }$val"
    if (( n == BASELINE_SAMPLES )); then
      b=$(median_of "${buffer[$chip]}")
      baseline[$chip]=$b
      info "baseline_learned chip=${chip} key=${RAIL_KEY} baseline=${b}mV samples=${BASELINE_SAMPLES}"
      unset 'buffer[$chip]'
    fi
    continue
  fi

  b="${baseline[$chip]}"
  dev=$(( val - b ))
  (( dev < 0 )) && dev=$(( -dev ))
  if (( dev > DEVIATION_MV )); then
    emit "rail_deviation chip=${chip} key=${RAIL_KEY} val=${val}mV baseline=${b}mV |Δ|=${dev}mV at=${src_ts}"
    if [[ -n "$MITIGATE_CMD" ]]; then
      # Detach mitigation so a slow command can't block alert delivery.
      "$MITIGATE_CMD" "$chip" "$val" "$b" "$dev" "$src_ts" >> "$ALERTS" 2>&1 &
    fi
  fi
done
42
scripts/apricot-rasdaemon-setup
Executable file
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
# Install + enable rasdaemon for detailed AMD MCA/MCE parsing.
#
# rasdaemon runs a trace-buffer consumer that decodes machine-check events
# into a sqlite DB (ras-mc_event.db, usually under /var/lib/rasdaemon/) and
# logs them to syslog in human-readable form. Much more detail than
# edac_mce_amd alone. If any crash is in-CPU or NB-side (not pure board-level
# power loss), this catches it.
#
# Idempotent. Safe to re-run.

set -o pipefail

log() { printf '[%s] apricot-rasdaemon-setup: %s\n' "$(date --iso-8601=s)" "$*"; }

if ! command -v rasdaemon >/dev/null 2>&1; then
  log "rasdaemon not installed; attempting rpm-ostree install"
  if command -v rpm-ostree >/dev/null 2>&1; then
    sudo rpm-ostree install rasdaemon \
      && log "installed; a reboot is required for the layered package to activate" \
      || { log "rpm-ostree install failed"; exit 1; }
  elif command -v dnf >/dev/null 2>&1; then
    sudo dnf install -y rasdaemon \
      || { log "dnf install failed"; exit 1; }
  else
    log "no package manager found; install rasdaemon manually"
    exit 1
  fi
fi

# Enable + start the service. On rpm-ostree systems a freshly layered package
# only becomes available after a reboot, so these calls may fail until then;
# failures are tolerated here and the script can simply be re-run afterwards.
sudo systemctl enable rasdaemon.service 2>&1 | grep -v '^Created' || true
sudo systemctl start rasdaemon.service 2>&1 \
  && log "rasdaemon.service started" \
  || log "rasdaemon.service will start after reboot (layered package)"

log "status:"
systemctl status rasdaemon.service --no-pager 2>&1 | head -10 || true

log "recent events (may be empty):"
sudo ras-mc-ctl --summary 2>&1 | head -15 || true
4
sudoers.d/apricot-health
Normal file
@@ -0,0 +1,4 @@
# Allow user lilith to invoke the rail-mitigation script without password
# (fired by apricot-rail-watchdog.service when a rail deviation is detected).
# Scoped to this one command.
lilith ALL=(root) NOPASSWD: /var/opt/apricot-health/sbin/apricot-rail-mitigate
16
systemd/apricot-crash-monitor.service
Normal file
@@ -0,0 +1,16 @@
[Unit]
Description=Apricot crash logger (high-frequency power/thermal/voltage capture)
After=default.target

[Service]
Type=simple
ExecStart=/var/home/lilith/bin/apricot-crash-logger
Environment=INTERVAL=0.1
Restart=always
RestartSec=2
StandardOutput=null
StandardError=journal
SyslogIdentifier=apricot-crash-monitor

[Install]
WantedBy=default.target
16
systemd/apricot-cstate-tune.service
Normal file
@@ -0,0 +1,16 @@
[Unit]
Description=Apricot CPU C-state tuning (disable deep C-states to reduce VRM transient demand)
After=multi-user.target
ConditionPathExists=/sys/devices/system/cpu/cpu0/cpuidle/state0

[Service]
Type=oneshot
ExecStart=/var/opt/apricot-health/sbin/apricot-cstate-tune apply
ExecStop=/var/opt/apricot-health/sbin/apricot-cstate-tune restore
RemainAfterExit=yes
StandardOutput=journal
StandardError=journal
SyslogIdentifier=apricot-cstate-tune

[Install]
WantedBy=multi-user.target
17
systemd/apricot-rail-watchdog.service
Normal file
@@ -0,0 +1,17 @@
[Unit]
Description=Apricot PSU rail deviation watchdog (it8628 in5 baseline)
After=apricot-crash-monitor.service
Wants=apricot-crash-monitor.service

[Service]
Type=simple
ExecStart=/var/home/lilith/bin/apricot-rail-watchdog
Environment=MITIGATE_CMD=/var/home/lilith/bin/apricot-rail-mitigate-trigger
Restart=always
RestartSec=2
StandardOutput=null
StandardError=journal
SyslogIdentifier=apricot-rail-watchdog

[Install]
WantedBy=apricot-crash-monitor.service