Point-in-Time Recovery (PITR)
Moon v0.2 adds point-in-time recovery on top of the existing per-shard WAL + RDB snapshot stack. Operators can rewind a shard to any LSN that still lives in the WAL, or to a wall-clock time anchored by a temporal record.
Status: PITR ships as flags-only; there is no separate restore binary. Stop the server, restart with a target flag, and the recovery pipeline does the rest.
Flags
| Flag | Type | Meaning |
|---|---|---|
| `--recovery-target-lsn <N>` | u64 | Stop replay at the last record with `lsn <= N`. |
| `--recovery-target-time <RFC3339>` | string | Resolve to an LSN via the WAL's temporal anchors, then stop. |
The two flags are mutually exclusive in practice: if both are set, the explicit LSN wins. Omitting both yields a normal full-replay restart.
# Rewind to LSN 12345 (one shard)
./moon --port 6399 --dir /var/lib/moon --recovery-target-lsn 12345
# Rewind to a wall-clock instant (UTC, RFC3339)
./moon --port 6399 --dir /var/lib/moon \
--recovery-target-time 2026-05-12T09:30:00Z
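The precedence between the two flags can be sketched in a few lines. This is an illustrative Python model, not Moon's actual code: `parse_rfc3339_ms` and `choose_target` are hypothetical names, and the only behaviour taken from this page is that an explicit LSN wins over a time target and that omitting both means a full replay.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_rfc3339_ms(value: str) -> int:
    """Convert an RFC3339 timestamp to Unix milliseconds (hypothetical helper)."""
    # datetime.fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalise it to an explicit UTC offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("an RFC3339 timestamp must carry a UTC offset")
    # Integer division of timedeltas avoids float rounding on the millisecond.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def choose_target(
    target_lsn: Optional[int], target_time: Optional[str]
) -> Tuple[str, Optional[int]]:
    """Return (mode, value), modelling the documented precedence."""
    if target_lsn is not None:
        return ("lsn", target_lsn)  # explicit LSN wins, even if a time is also set
    if target_time is not None:
        return ("time", parse_rfc3339_ms(target_time))
    return ("full", None)  # neither flag: normal full-replay restart
```

For example, `choose_target(12345, "2026-05-12T09:30:00Z")` selects the LSN and ignores the timestamp, which mirrors the silent "LSN wins" rule.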
How it works
- Snapshot selection. Recovery scans available `.rdb` files and picks the newest snapshot whose embedded `last_lsn` is `<= target_lsn`. v1 snapshots (pre-0.2) ship with `last_lsn = 0` and are skipped conservatively; replay falls back to a full WAL scan from segment 0 in that case.
- WAL replay. `replay_wal_v3_dir_until(target_lsn)` walks segments in order. The loop breaks on the first record with `lsn > target_lsn`, and the resumed `wal_flush_lsn` is not advanced past the target. This keeps the control file truthful in case a subsequent restart drops the flag.
- Time resolution. `resolve_target_time_to_lsn` scans the WAL for `TemporalUpsert` and `GraphTemporal` records (the only record types that carry `system_from` timestamps) and returns the LSN of the last record with `ts <= target_time`. Workloads without temporal commands have no anchors; for those, prefer `--recovery-target-lsn`.
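The three steps above can be modelled in memory as follows. This is a minimal Python sketch under stated assumptions, not the Moon implementation: the `Snapshot` and `Record` types and the function names are hypothetical, the WAL is a plain list ordered by LSN, and records already covered by the chosen snapshot are assumed to be skipped during replay.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:  # hypothetical stand-in for an .rdb file
    path: str
    last_lsn: int  # 0 means "unknown", as with v1 headers


@dataclass(frozen=True)
class Record:  # hypothetical stand-in for a WAL record
    lsn: int
    payload: str
    system_from_ms: Optional[int] = None  # set only on temporal records


def select_snapshot(snapshots: Iterable[Snapshot], target_lsn: int) -> Optional[Snapshot]:
    """Newest snapshot with 0 < last_lsn <= target_lsn, else None (full WAL scan)."""
    usable = [s for s in snapshots if 0 < s.last_lsn <= target_lsn]
    return max(usable, key=lambda s: s.last_lsn, default=None)


def replay_until(records: Iterable[Record], start_lsn: int, target_lsn: int) -> Tuple[List[str], int]:
    """Apply records in order and stop at the first one past the target."""
    applied: List[str] = []
    flush_lsn = start_lsn
    for rec in records:
        if rec.lsn > target_lsn:
            break  # first record beyond the cutoff ends the loop
        if rec.lsn <= start_lsn:
            continue  # ASSUMPTION: already contained in the snapshot
        applied.append(rec.payload)
        flush_lsn = rec.lsn  # never advances past target_lsn
    return applied, flush_lsn


def resolve_time_to_lsn(records: Iterable[Record], target_ms: int) -> Optional[int]:
    """LSN of the last temporal anchor with ts <= target_ms, or None without anchors."""
    resolved = None
    for rec in records:
        if rec.system_from_ms is not None and rec.system_from_ms <= target_ms:
            resolved = rec.lsn
    return resolved
```

With a 100-record WAL and no usable snapshot, `replay_until(wal, 0, 50)` applies exactly the first 50 records and reports a flush LSN of 50. A WAL without temporal records makes `resolve_time_to_lsn` return `None`, which is the case where the LSN flag is the better choice.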
Snapshot LSN provenance
The snapshot file header was bumped to v2 (`SHARD_RDB_VERSION = 2`) and now
carries:
| Field | Bytes | Purpose |
|---|---|---|
| `last_lsn` | 8 | WAL LSN captured at snapshot time |
| `created_at_unix_ms` | 8 | Wall-clock time when the snapshot was sealed |
v1 snapshots load with `last_lsn = 0` and `created_at_unix_ms = 0`. A zero
`last_lsn` is a safety signal: PITR refuses to use such a snapshot as a starting point.
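To make the two header fields concrete, here is a small Python sketch of how they could be encoded and read back. The byte layout is an assumption made for illustration: this page only states that each field occupies 8 bytes, so the little-endian order, the back-to-back placement, and the helper names are all hypothetical.

```python
import struct
from typing import Tuple

SHARD_RDB_VERSION = 2

# ASSUMPTION: two unsigned 64-bit little-endian integers stored back to back.
# The real on-disk layout may differ.
LSN_FIELDS = struct.Struct("<QQ")


def encode_lsn_fields(last_lsn: int, created_at_unix_ms: int) -> bytes:
    """Pack last_lsn and created_at_unix_ms into 16 bytes."""
    return LSN_FIELDS.pack(last_lsn, created_at_unix_ms)


def decode_lsn_fields(version: int, raw: bytes) -> Tuple[int, int]:
    """Return (last_lsn, created_at_unix_ms); v1 headers yield (0, 0)."""
    if version < SHARD_RDB_VERSION:
        return (0, 0)  # v1 snapshots carry no LSN provenance
    return LSN_FIELDS.unpack(raw[: LSN_FIELDS.size])


def usable_for_pitr(last_lsn: int) -> bool:
    """A zero LSN marks the snapshot as unusable as a PITR starting point."""
    return last_lsn > 0
```

Treating zero as "unknown" rather than as a valid LSN is what lets old and new snapshot files share one selection rule.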
Operator note. Live snapshots produced by the persistence tick still embed `last_lsn = 0` in v0.2; the wiring to `wal_flush_lsn` ships in P3c. Until then, PITR effectively performs a full WAL replay up to the target. Plan retention accordingly: keep enough WAL segments to cover the recovery window.
Verification
The CI suite asserts:
- `test_recovery_stops_at_target_lsn`: write 100 commands, restart with `target_lsn = 50`, only the first 50 SETs are visible.
- `test_recovery_target_time_resolves`: temporal upserts at known timestamps; restart with a mid-stream target time; correct cutoff.
- `test_v1_snapshot_loads_with_zero_lsn`: backward-compat round-trip.
What is not in v0.2
- Cross-shard global cutoff. Each shard PITRs independently; cross-shard transactional consistency lands with the v0.3 distributed-txn work.
- Logical undo. PITR is forward replay up to a cutoff, not rollback of individual operations.
- Mixed `target_lsn` + `target_time` arbitration UI. LSN wins silently.