Skip to content

🛡️ Recovery & Stabilization Plan: NFS Hang Incident (2026-02-05)

🚨 Incident Summary

  • Event: Fleet-wide NFS lockup causing DS9 (TrueNAS) to become unresponsive and Risa (Media) processes to enter Uninterruptible Sleep (D state).
  • Root Cause (Confirmed): I/O Saturation. A scheduled Proxmox Backup job triggered during a period of high system activity, starving the DS9 VM of disk I/O.
  • Status:System Fully Stabilized.

🗓️ Staged Execution Plan

🛑 Stage 1: Infrastructure Stabilization (Risa & DS9)

Goal: Harden the storage infrastructure against future I/O starvation and verify permissions. 1. DONE: Backup Schedule Audit: Proxmox jobs suspended. 2. DONE: DS9 Log Hardening: systemd-journald configured for persistent, aggressive disk sync. 3. DONE: Permission Alignment: GID 3000 applied to /mnt/TheWarpCore/vault. 4. DONE: Mount Verification: Risa mounts verified writable.

🤖 Stage 2: AI Pipeline Tuning (Starfleet)

Goal: Optimize AI performance for limited CPU hardware. 1. DONE: Implement Smart Router: Documents < 5k chars use llama3.2:3b. 2. DONE: Implement Failsafe: Documents > 50k chars are auto-skipped. 3. DONE: Fixed Timeout: Patched customService.js to honor AI_TIMEOUT (90m). 4. DONE: Model Switch: Set llama3.2:3b as default for configuration validation.

▶️ Stage 3: Pipeline Resumption & Monitoring

Goal: Successfully process the backlog. 1. DONE: Process backlog: ID 22 successfully tagged in 3 mins. 2. DONE: Skip Blocker: ID 16 manually tagged as processed.


📝 Success Criteria

  • [x] DS9 remains stable under moderate load.
  • [x] Risa can write to NFS shares without latency.
  • [x] AI pipeline processes new uploads autonomously and quickly.