Runbook: multi-shard AOF (per-shard layout)
Status: Resolved. The startup refusal gate introduced in v0.1.13 was lifted in v0.1.13-patch (PR #129) once the per-shard AOF replay path was shipped and verified (CRASH-01-LITE: 200/200 SIGKILL recovery, 0 data loss).
Background (historical context)
Prior to PR #129, Moon refused to start with --shards >= 2 + --appendonly yes
because the single-writer AOF implementation lost ~50 % of writes on SIGKILL
in that configuration. The fix was a full per-shard AOF architecture (Option B):
each shard owns its own writer task and recovery walks every shard's segment
manifest independently.
If you are running Moon ≤ v0.1.13 and hit the old startup error, see the Upgrading section below.
Current architecture (v0.1.13-patch / PR #129 and later)
Per-shard AOF layout
    <persistence_dir>/
      appendonlydir/
        moon.aof.manifest        ← top-level manifest (layout: PerShard)
        shard-0/
          moon.aof.1.base.rdb    ← base snapshot for shard 0
          moon.aof.1.incr.aof    ← incremental log for shard 0
        shard-1/
          moon.aof.1.base.rdb
          moon.aof.1.incr.aof
        shard-N/
          ...
Each shard's writer task appends to its own .incr.aof file. On restart, Moon
opens the manifest, discovers all shard directories, and replays each shard's
log independently. Shard replay is parallel — recovery time does not grow
linearly with shard count.
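For a quick operator-side look at what recovery will find, the layout above can be walked with a few lines of Python. This is a minimal inspection sketch, not Moon's recovery logic: it only lists the `shard-N` directories on disk and does not read the manifest. The helper name `discover_shards` is hypothetical.

```python
import re
from pathlib import Path

# Directory names follow the shard-N pattern shown in the layout above.
SHARD_DIR = re.compile(r"^shard-(\d+)$")


def discover_shards(persistence_dir):
    """Return {shard_id: sorted file names} for every shard-N directory.

    Hypothetical helper for inspection only; Moon itself discovers shards
    through the manifest, which this sketch does not parse.
    """
    aof_root = Path(persistence_dir) / "appendonlydir"
    shards = {}
    for entry in aof_root.iterdir():
        match = SHARD_DIR.match(entry.name)
        if entry.is_dir() and match:
            shards[int(match.group(1))] = sorted(p.name for p in entry.iterdir())
    # Sort by shard id so gaps (a missing shard directory) are easy to spot.
    return dict(sorted(shards.items()))
```

Comparing the returned shard ids against the `--shards` value the server runs with is a cheap way to notice a missing shard directory before a restart.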
Durability invariants
| appendfsync | Guarantee |
|---|---|
| always | Write is on durable storage before +OK (AppendSync rendezvous) |
| everysec (default) | Fsync runs every second; at most ~1 s of writes at risk on crash |
| no | OS decides when to flush; fastest but weakest guarantee |
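To turn the table into a number for capacity or risk discussions, the guarantees can be read as a rough upper bound on acknowledged writes lost in a crash. The sketch below is back-of-the-envelope arithmetic only; the function name `writes_at_risk` and the idea of feeding it a steady write rate are assumptions for illustration, and the ~1 s window for everysec is approximate.

```python
def writes_at_risk(appendfsync, writes_per_second):
    """Rough upper bound on acknowledged writes lost on crash.

    Returns None when the table gives no bound (appendfsync no).
    """
    if appendfsync == "always":
        # The write is durable before +OK, so nothing acknowledged is at risk.
        return 0
    if appendfsync == "everysec":
        # About one second of traffic sits between fsync calls.
        return writes_per_second
    if appendfsync == "no":
        # The OS decides when to flush; the table states no bound.
        return None
    raise ValueError(f"unknown appendfsync policy: {appendfsync!r}")
```

For example, at a steady 50,000 writes per second, everysec puts on the order of 50,000 acknowledged writes at risk.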
BGREWRITEAOF in per-shard mode
BGREWRITEAOF is not yet supported for PerShard layouts. Issuing it on a
PerShard instance returns the following error immediately:
ERR BGREWRITEAOF is not yet supported under per-shard AOF layout; per-shard rewrite ships in step 6 of the per-shard AOF migration
Per-shard BGREWRITEAOF (each shard compacts its own log independently, with all N acks awaited before returning confirmation) is tracked for v0.2.
Upgrading from v0.1.13 (old TopLevel AOF) to per-shard layout
If you have an existing deployment with a TopLevel AOF manifest (single
writer, layout: TopLevel) and want to migrate to per-shard layout:
Option A: cold migration (recommended, zero-risk)
- Stop the server.
- Run BGSAVE on the last healthy instance, or copy dump.rdb from --dir.
- Remove appendonlydir/ entirely.
- Restart with --shards N --appendonly yes. Moon creates a fresh per-shard manifest. Recovery loads from RDB; the incremental AOF starts empty.
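The filesystem part of Option A can be scripted so that `appendonlydir/` is never taken away without a snapshot to recover from. The sketch below is a conservative variant, not the documented procedure: it moves the directory aside instead of deleting it, and it assumes the server is already stopped and that `dump.rdb` lives in the same directory as `appendonlydir/`. The helper name `prepare_cold_migration` and the backup naming scheme are hypothetical.

```python
import shutil
import time
from pathlib import Path


def prepare_cold_migration(persistence_dir, backup_suffix=None):
    """Move appendonlydir/ aside once an RDB snapshot is confirmed.

    Assumes the server is stopped. Returns the backup path, or None if
    there was no appendonlydir/ to move.
    """
    root = Path(persistence_dir)
    rdb = root / "dump.rdb"
    # Refuse to touch the AOF directory without a non-empty snapshot.
    if not rdb.is_file() or rdb.stat().st_size == 0:
        raise FileNotFoundError(f"no usable snapshot at {rdb}; aborting")

    aof_dir = root / "appendonlydir"
    if not aof_dir.exists():
        return None

    suffix = backup_suffix or time.strftime("%Y%m%dT%H%M%S")
    backup = root / f"appendonlydir.bak-{suffix}"
    # Renaming keeps the old logs for inspection; delete them once the
    # restarted instance has been verified.
    shutil.move(str(aof_dir), str(backup))
    return backup
```

After this returns, restart with `--shards N --appendonly yes` as described in the last step. Keeping the backup until the new instance is verified costs disk space but leaves a way back.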
Option B: in-place migration (future tooling)
An offline migration CLI subcommand is planned for v0.2. Until then, use Option A.
Safety guard: TopLevel manifest with multi-shard startup
If Moon detects an existing TopLevel AOF manifest at startup with
--shards >= 2, it refuses to start with exit code 2 and prints the following
to stderr:
REFUSING TO START: legacy TopLevel AOF manifest at <path> detected with
--shards N (>= 2). This combination silently loses data for shards 1..N-1.
See docs/runbooks/multi-shard-aof-rewrite.md for migration instructions.
(Exact text may include additional context such as the manifest path and shard
count substituted in; the key phrase to match in alerting is REFUSING TO START.)
This is intentional — a TopLevel log does not capture per-shard ordering, so replaying it on a multi-shard instance would produce incorrect key routing.
Resolution: Follow Option A above (remove appendonlydir/ and restart).
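A supervisor or alerting hook can tell this refusal apart from other startup failures by combining the two signals described above: exit code 2 and the key phrase on stderr. The following is a minimal sketch; the function name `is_aof_layout_refusal` is hypothetical, and it matches only the key phrase because the rest of the message may vary.

```python
# Key phrase this runbook recommends matching in alerting.
REFUSAL_MARKER = "REFUSING TO START"


def is_aof_layout_refusal(exit_code, stderr_text):
    """True when a failed start looks like the TopLevel-manifest safety guard.

    Both conditions come from this runbook: exit code 2 and the marker
    phrase somewhere in stderr.
    """
    return exit_code == 2 and REFUSAL_MARKER in stderr_text
```

Routing a match to the Option A migration steps, rather than to an automatic restart loop, avoids repeatedly relaunching a server that will keep refusing to start.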
Deprecated flag: --unsafe-multishard-aof
The --unsafe-multishard-aof flag was introduced in v0.1.13 as an escape hatch
to acknowledge the known ~50 % data-loss risk. It is now deprecated and will
be removed in v0.2:
- The underlying bug is fixed, so the flag no longer suppresses any safety gate.
- Passing it emits a [DEPRECATED] warning at startup.
- If you have scripts or systemd units that pass --unsafe-multishard-aof, remove the flag; it is a no-op.
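Finding leftover uses of the flag across unit files and launch scripts is easy to automate. The sketch below does a plain substring search over files you pass in; the helper name `find_deprecated_flag` is hypothetical, and which paths to scan (for example, your systemd unit directory) depends on your deployment.

```python
from pathlib import Path

DEPRECATED_FLAG = "--unsafe-multishard-aof"


def find_deprecated_flag(paths):
    """Return (path, line number, line) for every line that passes the flag."""
    hits = []
    for path in paths:
        # errors="replace" keeps the scan going on files with odd encodings.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if DEPRECATED_FLAG in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

An empty result means none of the scanned files still pass the flag.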
Monitoring and telemetry
INFO Persistence fields added in PR #129
A non-zero aof_backpressure_dropped value indicates that the AOF writer is falling behind write throughput.
Investigate disk I/O or increase aof-rewrite-min-size.
Prometheus / alerting
A dedicated gauge for aof_backpressure_dropped is planned for v0.2. Until
then, monitor via INFO persistence polling.
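Until the gauge exists, a polling job needs to pull the counter out of the INFO persistence text. The sketch below assumes the output uses `key:value` lines with `# Section` headers, which this runbook does not confirm; the function names are hypothetical, and fetching the text from the server is left to whatever client you already use.

```python
def parse_info(text):
    """Parse INFO-style text into a dict of string fields.

    Assumes one key:value pair per line and '#'-prefixed section headers.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def backpressure_dropped(info_text):
    """Return aof_backpressure_dropped as an int, or None if the field is absent."""
    value = parse_info(info_text).get("aof_backpressure_dropped")
    return int(value) if value is not None else None
```

Alerting on any non-zero result matches the guidance above; treating `None` as a separate condition helps catch instances that predate PR #129.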
Crash recovery verification
Run the CRASH-01-LITE suite to verify your configuration recovers cleanly:
Expected: 200/200 entries recovered across all shards after SIGKILL.
Escalation
If you observe data loss after a crash on v0.1.13-patch or later, collect:
- appendonlydir/moon.aof.manifest contents
- appendonlydir/shard-*/ file sizes and modification times
- Server log from the crashed process (look for AOF writer task exit reason)
- INFO persistence output from the recovered instance
File a P0 with these artifacts attached.
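The first two artifacts in the list can be captured in one pass before anyone touches the data directory. This is a minimal sketch with a hypothetical helper name, `collect_aof_artifacts`; it reads the manifest as text and records the size and modification time of every file under `shard-*/`. The server log and INFO persistence output still have to be gathered separately.

```python
from pathlib import Path


def collect_aof_artifacts(persistence_dir):
    """Gather manifest contents plus size and mtime for every shard file."""
    aof_root = Path(persistence_dir) / "appendonlydir"
    manifest = aof_root / "moon.aof.manifest"
    report = {
        # None signals a missing manifest, which is itself worth reporting.
        "manifest": manifest.read_text(errors="replace") if manifest.is_file() else None,
        "shard_files": [],
    }
    for path in sorted(aof_root.glob("shard-*/*")):
        st = path.stat()
        report["shard_files"].append({
            "path": str(path.relative_to(aof_root)),
            "size_bytes": st.st_size,
            "mtime": st.st_mtime,  # seconds since the epoch
        })
    return report
```

Serializing the returned dict (for example with `json.dumps`) gives a single attachment for the P0.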