🛡️ Recovery & Stabilization Plan: NFS Hang Incident (2026-02-05)
🚨 Incident Summary
- Event: Fleet-wide NFS lockup causing DS9 (TrueNAS) to become unresponsive and Risa (Media) processes to enter Uninterruptible Sleep (D state).
- Root Cause (Confirmed): I/O Saturation. A scheduled Proxmox Backup job triggered during a period of high system activity, starving the DS9 VM of disk I/O.
- Status: ✅ System Fully Stabilized.
🗓️ Staged Execution Plan
🛑 Stage 1: Infrastructure Stabilization (Risa & DS9)
Goal: Harden the storage infrastructure against future I/O starvation and verify permissions.
1. DONE: Backup Schedule Audit: Proxmox jobs suspended.
2. DONE: DS9 Log Hardening: systemd-journald configured for persistent, aggressive disk sync.
3. DONE: Permission Alignment: GID 3000 applied to /mnt/TheWarpCore/vault.
4. DONE: Mount Verification: Risa mounts verified writable.
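The mount verification in step 4 could be repeated as a small write probe. The following is a minimal sketch, not the check that was actually run: the function name `probeMount` and the probe file naming are hypothetical, and the directory to test is whatever path Risa mounts the share at (not stated in this plan).

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: create a small file in `dir`, fsync it, delete it,
// and report whether the write succeeded and how long it took.
function probeMount(dir) {
  const probe = path.join(dir, `.write-probe-${process.pid}-${Date.now()}`);
  const start = process.hrtime.bigint();
  const elapsedMs = () => Number(process.hrtime.bigint() - start) / 1e6;
  let fd;
  try {
    fd = fs.openSync(probe, "wx"); // "wx" fails if the file already exists
    fs.writeSync(fd, "probe\n");
    fs.fsyncSync(fd);              // force the write through to the server
    fs.closeSync(fd);
    fd = undefined;
    fs.unlinkSync(probe);
    return { writable: true, ms: elapsedMs(), error: null };
  } catch (err) {
    if (fd !== undefined) {
      try { fs.closeSync(fd); } catch { /* ignore cleanup failure */ }
    }
    return { writable: false, ms: elapsedMs(), error: err.code || err.message };
  }
}

// Example against the local temp directory; substitute the real mount path.
console.log(probeMount(os.tmpdir()));
```

These calls are synchronous, so on a hung NFS mount the probe itself would block in D state. If it is used for monitoring, running it in a child process with an external timeout is likely the safer design.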
🤖 Stage 2: AI Pipeline Tuning (Starfleet)
Goal: Optimize AI performance for limited CPU hardware.
1. DONE: Implement Smart Router: Documents < 5k chars use llama3.2:3b.
2. DONE: Implement Failsafe: Documents > 50k chars are auto-skipped.
3. DONE: Fixed Timeout: Patched customService.js to honor AI_TIMEOUT (90m).
4. DONE: Model Switch: Set llama3.2:3b as default for configuration validation.
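The routing, failsafe, and timeout rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in customService.js: the names `routeDocument` and `resolveTimeoutMs` are hypothetical, the model used for mid-sized documents is passed in because the plan does not name it, and `AI_TIMEOUT` is assumed to hold milliseconds, with 90 minutes as the fallback.

```javascript
// Thresholds taken from the plan; everything else is illustrative.
const SMALL_DOC_MAX_CHARS = 5_000;  // "< 5k chars" goes to the small model
const SKIP_DOC_MIN_CHARS = 50_000;  // "> 50k chars" is auto-skipped
const SMALL_MODEL = "llama3.2:3b";

// Decide what to do with a document based on its length.
// `defaultModel` is a hypothetical parameter for mid-sized documents.
// Note: String.length counts UTF-16 code units, not grapheme clusters.
function routeDocument(text, defaultModel) {
  const length = text.length;
  if (length > SKIP_DOC_MIN_CHARS) {
    return { action: "skip", model: null, length };
  }
  if (length < SMALL_DOC_MAX_CHARS) {
    return { action: "process", model: SMALL_MODEL, length };
  }
  return { action: "process", model: defaultModel, length };
}

const DEFAULT_TIMEOUT_MS = 90 * 60 * 1000; // 90 minutes

// Honor AI_TIMEOUT when it is a positive integer (assumed milliseconds);
// otherwise fall back to the 90-minute default.
function resolveTimeoutMs(env = process.env) {
  const parsed = Number.parseInt(env.AI_TIMEOUT ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TIMEOUT_MS;
}

console.log(routeDocument("short note", SMALL_MODEL));
console.log(resolveTimeoutMs({ AI_TIMEOUT: "60000" }));
```

Keeping the thresholds as named constants makes the boundary behavior explicit: a document of exactly 5,000 or exactly 50,000 characters falls in the middle tier, which matches the strict "<" and ">" in the plan.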
▶️ Stage 3: Pipeline Resumption & Monitoring
Goal: Successfully process the backlog.
1. DONE: Process backlog: ID 22 successfully tagged in 3 mins.
2. DONE: Skip Blocker: ID 16 manually tagged as processed.
📝 Success Criteria
- [x] DS9 remains stable under moderate load.
- [x] Risa can write to NFS shares without noticeable latency.
- [x] AI pipeline processes new uploads autonomously and quickly.