Skip to content

Operator RunbooksΒΆ

Concrete, step-by-step incident response procedures for operating Moon.

Runbook When to reach for it
shard-count-change.md Startup refusal: ERR shard count changed (manifest=N, config=M)
corrupted-aof-recovery.md AOF corruption β€” partial replay and recovery fallback
disk-full-during-wal-rotation.md Persistence hits ENOSPC mid-rotation
oom-during-snapshot.md Memory pressure during BGSAVE
replica-fell-behind.md Diagnosing and remediating replication lag
rolling-restart.md Graceful drain + binary swap under load
multi-shard-aof-rewrite.md Per-shard BGREWRITEAOF operations
tls-cert-rotation.md Zero-downtime certificate rotation via SIGHUP

Read alongside the Production Contract (durability and availability guarantees) and the Operator Guide (memory accounting, sizing).