Status: Active (v0.1.13 Wave 1) · Owner: Storage/Reclamation · Version: v0.1.13

Reclamation Runbook

This runbook covers the four reclamation-induced failure modes that operators are most likely to encounter. Each scenario opens with violated SLOs, shows which INFO reclamation fields to check, provides a decision tree, and lists concrete mitigation commands.

Companion documents: see the resource table at the end of this runbook.

Wave-1 commands available (v0.1.13 P8):

The manual reclamation commands below ship as part of Wave-1 P8 in the same v0.1.13 release:

VACUUM                           # reclaim across all subsystems, returns bytes reclaimed per category
VACUUM VECTOR <index> [VERBOSE]  # compact a specific vector index
VACUUM GRAPH [VERBOSE]           # compact graph dead edges
VACUUM MVCC [VERBOSE]            # sweep MVCC zombies and prune committed versions
KILL SNAPSHOT <snapshot-id>      # force-close a stuck snapshot (use with care; see SLO 5)

How to Start Any Diagnosis

Before diving into a scenario, run these two commands. They give you the full picture in under 30 seconds:

# 1. Snapshot the reclamation section
redis-cli -p 6399 INFO reclamation

# 2. Identify which SLO is breached (scan the key fields)
redis-cli -p 6399 INFO reclamation | grep -E \
  'disk_free_bytes|wal_bytes|write_stall_active|dead_fraction_max|manifest_tombstones|immutable_segments|read_amp_p99|mvcc_oldest_snapshot_age_secs|compaction_pending_bytes|autovacuum_throttled'

Then match the output to the scenario table below:

What you see | Go to
reclamation_write_stall_active:true OR reclamation_disk_free_bytes critically low | Scenario 1
reclamation_manifest_tombstones > 80 K OR reclamation_immutable_segments > 16 | Scenario 2
reclamation_mvcc_oldest_snapshot_age_secs > 600 OR process RSS growing while keyspace is stable | Scenario 3
reclamation_dead_fraction_max > 0.35 OR reclamation_compaction_pending_bytes high but not stalled | Scenario 4
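
The table above can be applied mechanically. The following is a minimal sketch and not part of Moon: `classify_reclamation` is a hypothetical helper name, and it assumes `INFO reclamation` emits `field:value` lines using the field names and numeric thresholds from the table. It skips the conditions the table does not quantify (critically low free disk, RSS growth, "high" pending bytes), so those still need a human look.

```shell
# classify_reclamation: read `INFO reclamation` text on stdin and print every
# scenario whose numeric trigger from the table is met.
# Hypothetical helper; thresholds are copied from the scenario table.
classify_reclamation() {
  awk -F: '
    { sub(/\r$/, "", $2); v[$1] = $2 }   # INFO lines end in CRLF; drop the CR
    END {
      hit = 0
      if (v["reclamation_write_stall_active"] == "true") { print "Scenario 1"; hit = 1 }
      if (v["reclamation_manifest_tombstones"] + 0 > 80000 ||
          v["reclamation_immutable_segments"] + 0 > 16) { print "Scenario 2"; hit = 1 }
      if (v["reclamation_mvcc_oldest_snapshot_age_secs"] + 0 > 600) { print "Scenario 3"; hit = 1 }
      if (v["reclamation_dead_fraction_max"] + 0 > 0.35) { print "Scenario 4"; hit = 1 }
      if (!hit) print "no match"
    }'
}
```

Typical use would be `redis-cli -p 6399 INFO reclamation | classify_reclamation`. More than one scenario can print at once; work them in numeric order, since Scenario 1 blocks writes.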

Scenario 1: Disk Filling Up

Violated SLOs: SLO 2 (WAL Size)

Symptoms:

  • Write commands return ERR WRITE_STALL disk usage too high
  • reclamation_write_stall_active:true in INFO reclamation
  • Disk partition at > 95% usage (i.e., free space below --disk-free-min-pct, default 5%)
  • Moon logs: write stall activated at WARN level

This scenario is the reclamation lens on disk pressure: the disk is filling because compaction has fallen behind and dead bytes have not been reclaimed. This is distinct from WAL-rotation ENOSPC (see Disk Full During WAL Rotation, which covers hard ENOSPC before write-stall kicks in).

Step 1: Confirm write-stall and free headroom

# Check write-stall flag and free bytes
redis-cli -p 6399 INFO reclamation | grep -E 'write_stall|disk_free_bytes|wal_bytes|wal_segments'

# Check OS-level disk usage on the persistence partition
df -h <persistence-dir>    # --persistence-dir value

Decision:

reclamation_write_stall_active:true?
├── YES → write-stall is active, new writes are rejected
│   ├── reclamation_disk_free_bytes < 2 GB → CRITICAL: proceed to Step 2 immediately
│   └── reclamation_disk_free_bytes ≥ 2 GB → stall may be a threshold misconfiguration; see Step 4
└── NO → write-stall not yet active but disk is trending full → proceed to Step 3 proactively
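
The decision above could be scripted roughly as follows. This is a sketch under stated assumptions: `scenario1_next_step` is a hypothetical name, and "2 GB" is read as 2 GiB (2 × 1024³ bytes), which the runbook does not specify.

```shell
# scenario1_next_step: read `INFO reclamation` text on stdin and print the
# step the Scenario 1 decision tree points to.
# Assumption: "2 GB" is interpreted as 2 GiB. A missing disk_free_bytes field
# counts as 0, so the function errs toward the CRITICAL branch.
scenario1_next_step() {
  awk -F: '
    { sub(/\r$/, "", $2); v[$1] = $2 }   # drop the CR from CRLF line endings
    END {
      limit = 2 * 1024 * 1024 * 1024
      if (v["reclamation_write_stall_active"] != "true") { print "Step 3"; exit }
      if (v["reclamation_disk_free_bytes"] + 0 < limit)  { print "Step 2 (CRITICAL)"; exit }
      print "Step 4"
    }'
}
```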

Step 2: Immediate space recovery (write-stall active)

# Force WAL checkpoint to seal and archive segments (frees in-flight WAL space)
redis-cli -p 6399 DEBUG RELOAD

# Run VACUUM to reclaim dead bytes across all subsystems
redis-cli -p 6399 VACUUM
# Output: "reclaimed: vector=<N>MB graph=<N>MB mvcc=<N>MB manifest=<N>MB total=<N>MB"

# If vector indexes are large, compact them explicitly
redis-cli -p 6399 VACUUM VECTOR <index-name> VERBOSE

# Recheck disk free
redis-cli -p 6399 INFO reclamation | grep disk_free_bytes

Write-stall clears automatically when reclamation_disk_free_bytes rises above the threshold. You do not need to restart Moon.

Step 3: Proactive space recovery (write-stall not yet active)

# How much WAL is outstanding?
redis-cli -p 6399 INFO reclamation | grep -E 'wal_bytes|wal_segments'

# Run compaction to free dead bytes before stall kicks in
redis-cli -p 6399 VACUUM

# Reduce WAL retention if checkpoint lag is acceptable
# (increase checkpoint frequency β†’ smaller WAL footprint)
# Edit config and SIGHUP or restart with:
#   --wal-max-checkpoint-lag-ms 5000   (default 10000)

Step 4: If write-stall threshold is misconfigured

# Check configured threshold
redis-cli -p 6399 INFO reclamation | grep write_stall_threshold_pct
# If threshold is too aggressive (e.g. 20%) and disk is not actually critical:
# Restart with a lower threshold, e.g.:
#   --disk-free-min-pct 5   (default)

Step 5: Prevent recurrence

  • Alert at reclamation_disk_free_bytes < 20% of partition (before stall triggers at 5%).
  • Place WAL and vector segment directories on a dedicated partition with monitoring.
  • If compaction consistently falls behind ingest, this is Scenario 4; plan for Wave-2 autovacuum.
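
The 20% alert is stated relative to the partition, while INFO reports free space in bytes, so an alert rule has to convert one to the other. A minimal sketch, assuming you supply the partition size yourself; `disk_free_alert` is a hypothetical helper and not a Moon command:

```shell
# disk_free_alert FREE_BYTES TOTAL_BYTES [WARN_PCT]
# Prints the free percentage. Exit status: 0 = at or above WARN_PCT (default 20),
# 1 = below it, 2 = invalid partition size.
disk_free_alert() {
  awk -v free="$1" -v total="$2" -v warn="${3:-20}" 'BEGIN {
    if (total + 0 <= 0) { print "invalid partition size"; exit 2 }
    pct = 100 * free / total
    printf "%.1f%% free\n", pct
    exit ((pct < warn + 0) ? 1 : 0)
  }'
}
```

FREE_BYTES would come from reclamation_disk_free_bytes. On systems with GNU coreutils, `df -B1 --output=size <persistence-dir> | tail -1` is one way to obtain TOTAL_BYTES.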

See also: Disk Full During WAL Rotation for ENOSPC that occurs before write-stall activates.


Scenario 2: Query Latency Cliffed

Violated SLOs: SLO 3 (Manifest Commit Latency), SLO 4 (Read Amplification)

Symptoms:

  • FT.SEARCH latency has increased as a step function (a "cliff", not gradual degradation)
  • P99 read latency is elevated but P50 is normal, which indicates multi-segment probe overhead
  • reclamation_immutable_segments > 16
  • reclamation_manifest_tombstones > 80 000 (manifest commit adds latency to every write)
  • reclamation_read_amp_p99 > 20 (v0.1.14+ only; in v0.1.13 use immutable_segments as proxy)

Step 1: Identify the root cause

redis-cli -p 6399 INFO reclamation | grep -E \
  'immutable_segments|warm_segments|cold_segments|graph_segments|manifest_tombstones|manifest_active|read_amp'

Decision tree:

reclamation_immutable_segments > 16?
├── YES → vector query read amplification is elevated
│   └── FT.SEARCH query is hitting N segments instead of ≤ 16 → goto Step 2 (vector compaction)
└── NO
    └── reclamation_manifest_tombstones > 80000?
        ├── YES → manifest commit latency is elevated → goto Step 3 (manifest GC)
        └── NO → latency cliff may not be reclamation-related
                 Check: redis-cli SLOWLOG GET 10 and server CPU/memory
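
As a sketch of the same branching, assuming the field names above appear as `field:value` lines in the INFO output (`scenario2_next_step` is a hypothetical name):

```shell
# scenario2_next_step: read `INFO reclamation` text on stdin and print the
# branch of the Scenario 2 decision tree that applies.
scenario2_next_step() {
  awk -F: '
    { sub(/\r$/, "", $2); v[$1] = $2 }   # drop the CR from CRLF line endings
    END {
      if (v["reclamation_immutable_segments"] + 0 > 16) {
        print "Step 2 (vector compaction)"
      } else if (v["reclamation_manifest_tombstones"] + 0 > 80000) {
        print "Step 3 (manifest GC)"
      } else {
        print "not reclamation-related: check SLOWLOG and CPU/memory"
      }
    }'
}
```

Segment count is checked first, as in the tree, so an instance that breaches both thresholds is sent to vector compaction before manifest GC.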

Step 2: Compact vector indexes (too many immutable segments)

# Inspect an index (repeat for each index if you have several)
redis-cli -p 6399 FT.INFO <index-name>

# Compact a specific index: forces mutable → immutable segment flush and HNSW graph build
redis-cli -p 6399 FT.COMPACT <index-name>

# Or use VACUUM VECTOR for each index with verbose output
redis-cli -p 6399 VACUUM VECTOR <index-name> VERBOSE
# Output shows segments merged and recall delta

# Recheck segment count
redis-cli -p 6399 INFO reclamation | grep immutable_segments

Caution: FT.COMPACT is a no-op if the mutable segment has fewer entries than COMPACT_THRESHOLD. If the command is silent, either set COMPACT_THRESHOLD to match your dataset size or use VACUUM VECTOR which bypasses this gate.

After compaction, segment count should drop to 1–3. Re-run the query to confirm latency returned.

Step 3: Force manifest GC (too many tombstones)

# Check tombstone count and retention config
redis-cli -p 6399 INFO reclamation | grep -E 'manifest_tombstones|manifest_active'

# Run VACUUM to trigger manifest GC (tombstones below retain threshold are pruned)
redis-cli -p 6399 VACUUM

# If tombstones are not dropping, the retain-epochs or retain-secs window is too long.
# Restart with a shorter window (hot-path config; requires restart):
#   --manifest-tombstone-retain-epochs 1   (default 2)
#   --manifest-tombstone-retain-secs 60    (default 300)

Step 4: Confirm latency restored

# Run a representative FT.SEARCH and observe response time
time redis-cli -p 6399 FT.SEARCH <index-name> "*" LIMIT 0 10

# Recheck read-amp proxy
redis-cli -p 6399 INFO reclamation | grep -E 'immutable_segments|read_amp_p99'

Step 5: Prevent recurrence

  • Alert on reclamation_immutable_segments > 16 (available v0.1.13).
  • Alert on reclamation_manifest_tombstones > 80000.
  • Plan for Wave-2 P2 (immutable segment merge daemon) which automates this compaction. Until then, schedule FT.COMPACT during off-peak windows if ingest is continuous.

Scenario 3: OOM / Memory Growth

Violated SLOs: SLO 5 (MVCC Pinning)

Symptoms:

  • Process RSS growing steadily while keyspace size is stable
  • reclamation_mvcc_committed count is rising without bound
  • reclamation_mvcc_oldest_snapshot_age_secs > 600 (default threshold)
  • In extreme cases: OOM killer terminates Moon (check dmesg | grep -i oom)

This scenario is distinct from the BGSAVE memory spike (see OOM During Snapshot). Here the growth is gradual and driven by MVCC version accumulation behind a pinned snapshot.

Step 1: Confirm MVCC pinning

redis-cli -p 6399 INFO reclamation | grep -E \
  'mvcc_committed|mvcc_active|mvcc_oldest_snapshot_age_secs|mvcc_oldest_snapshot_lag|mvcc_zombies_swept_total|delete_pending_visible_lsn'

Decision tree:

reclamation_mvcc_oldest_snapshot_age_secs > 600?
├── YES → at least one snapshot is older than the threshold
│   ├── reclamation_mvcc_active > 0 → open snapshots exist (expected; check age)
│   └── reclamation_mvcc_committed growing → versions are accumulating behind pinned snapshot
│       → goto Step 2 (identify and close the stuck snapshot)
└── NO
    └── reclamation_mvcc_committed very large but age < threshold?
        → high write throughput with short-lived snapshots is normal; check ingest rate
        → if committed count > 1M, goto Step 3 (force prune)
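
A single-sample approximation of this tree is sketched below. It is deliberately simplified: detecting that reclamation_mvcc_committed is "growing" needs two samples taken some time apart, which this hypothetical `scenario3_next_step` helper does not do, so it sends any snapshot older than 600 s straight to Step 2.

```shell
# scenario3_next_step: read `INFO reclamation` text on stdin and print the
# step the Scenario 3 decision tree points to.
# Simplification: the "committed count is growing" check is omitted because it
# requires comparing two samples over time.
scenario3_next_step() {
  awk -F: '
    { sub(/\r$/, "", $2); v[$1] = $2 }   # drop the CR from CRLF line endings
    END {
      if (v["reclamation_mvcc_oldest_snapshot_age_secs"] + 0 > 600) {
        print "Step 2 (identify and close the stuck snapshot)"
      } else if (v["reclamation_mvcc_committed"] + 0 > 1000000) {
        print "Step 3 (force prune)"
      } else {
        print "no MVCC pinning indicated"
      }
    }'
}
```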

Step 2: Identify and close the stuck snapshot

# Get the age and lag of the oldest snapshot
redis-cli -p 6399 INFO reclamation | grep -E 'mvcc_oldest|delete_pending_visible_lsn'

# If the snapshot is from a long-running FT.SEARCH or graph query,
# identify the client connection holding it:
redis-cli -p 6399 CLIENT LIST | grep -v "cmd=ping\|cmd=client"
# Look for a client with a very old "age" or an in-flight FT.SEARCH

# Force-close the stuck snapshot (Wave-1 P8 command)
# WARNING: this aborts the query associated with the snapshot.
# Use the snapshot ID from the MVCC manager log or INFO output.
redis-cli -p 6399 KILL SNAPSHOT <snapshot-id>

Step 3: Force MVCC sweep

# Trigger a MVCC zombie sweep and committed-set prune
redis-cli -p 6399 VACUUM MVCC VERBOSE
# Output: "swept=<N> zombies, pruned=<N> committed entries, freed=<N>MB"

# Recheck committed count
redis-cli -p 6399 INFO reclamation | grep mvcc_committed

The VACUUM MVCC command runs the same sweep that Wave-2 autovacuum will run automatically. In v0.1.13, it must be triggered manually.
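
Because the sweep is manual in v0.1.13, it can help to log what each run achieved. The sketch below assumes the verbose output has exactly the shape shown in the comment above (`swept=<N> zombies, pruned=<N> committed entries, freed=<N>MB`); `parse_vacuum_mvcc` is a hypothetical helper, and a different output format would need a different parser.

```shell
# parse_vacuum_mvcc: read one line of `VACUUM MVCC VERBOSE` output on stdin and
# print the three counters in a log-friendly key=value form.
# Assumes the output format quoted in this runbook; absent fields print as 0.
parse_vacuum_mvcc() {
  tr -d '\r"' | tr ' ,' '\n\n' | awk -F= '
    BEGIN { swept = 0; pruned = 0; freed = 0 }
    $1 == "swept"  { swept = $2 }
    $1 == "pruned" { pruned = $2 }
    $1 == "freed"  { sub(/MB$/, "", $2); freed = $2 }
    END { printf "swept=%s pruned=%s freed_mb=%s\n", swept, pruned, freed }'
}
```

For example, `redis-cli -p 6399 VACUUM MVCC VERBOSE | parse_vacuum_mvcc >> vacuum-mvcc.log`, where the log file name is arbitrary.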

Step 4: Monitor RSS after the fix

# Check process RSS on Linux
grep VmRSS /proc/$(pgrep moon)/status

# On macOS (dev only)
ps -o pid,rss -p $(pgrep moon)

RSS should stabilize within a few minutes of closing the pinned snapshot.

Step 5: Prevent recurrence

  • Alert on reclamation_mvcc_oldest_snapshot_age_secs > 600.
  • Add a client-side timeout on FT.SEARCH and graph queries; long-running queries are the primary source of stuck snapshots.
  • Tune --mvcc-old-snapshot-threshold-secs down to match your P99 query latency SLO. A value 3× your expected max query duration is a reasonable starting point.
  • Plan for Wave-2 MA2 (old-snapshot detection daemon) which auto-kills zombied snapshots.

If Moon was OOM-killed: restart it. Moon recovers from WAL v3 on startup. After restart, run VACUUM MVCC immediately to prune any MVCC state that accumulated pre-crash.


Scenario 4: Compaction Not Keeping Up

Violated SLOs: SLO 1 (Bloat), SLO 6 (Request Impact)

Symptoms:

  • reclamation_dead_fraction_max > 0.35 and rising
  • reclamation_compaction_pending_bytes large but not triggering a stall
  • Disk usage growing even though no new keys are being added (dead bytes accumulating)
  • Optionally: reclamation_autovacuum_throttled_due_to_load:1 continuously (v0.1.14+)

This is the slow-burn failure mode: compaction exists but runs too slowly for the ingest rate. The system is not yet stalled, but it will be if nothing changes.

Step 1: Quantify the backlog

redis-cli -p 6399 INFO reclamation | grep -E \
  'dead_fraction_max|compaction_pending_bytes|compaction_throughput_bps|immutable_segments|warm_segments|cold_segments|graph_segments|autovacuum'

Decision tree:

reclamation_dead_fraction_max > 0.35?
├── YES → bloat is above SLO threshold
│   ├── reclamation_compaction_pending_bytes > 0 → compaction backlog exists
│   │   ├── reclamation_write_stall_active:true → URGENT: this is Scenario 1, go there first
│   │   └── write_stall_active:false → compaction is behind but not stalled → goto Step 2
│   └── compaction_pending_bytes ≈ 0 → dead fraction is high but no pending work
│       → dead bytes are in immutable segments awaiting merge (Wave-2 P2)
│       → manual FT.COMPACT is the only recourse until Wave-2 ships → goto Step 3
└── NO (≤ 0.35) → bloat is within SLO; compaction may be slow for a different reason
    → check reclamation_autovacuum_throttled_due_to_load (v0.1.14+) → goto Step 4
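
The tree above could be encoded as follows. This is a sketch: `scenario4_next_step` is a hypothetical name, and "≈ 0" pending bytes is treated as exactly 0, which may be too strict if the counter never fully drains on your instance.

```shell
# scenario4_next_step: read `INFO reclamation` text on stdin and print the
# branch of the Scenario 4 decision tree that applies.
# Assumption: "pending ≈ 0" is implemented as pending == 0.
scenario4_next_step() {
  awk -F: '
    { sub(/\r$/, "", $2); v[$1] = $2 }   # drop the CR from CRLF line endings
    END {
      if (v["reclamation_dead_fraction_max"] + 0 <= 0.35) {
        print "Step 4 (check autovacuum throttling, v0.1.14+)"
      } else if (v["reclamation_compaction_pending_bytes"] + 0 == 0) {
        print "Step 3 (compact immutable segments)"
      } else if (v["reclamation_write_stall_active"] == "true") {
        print "Scenario 1 (URGENT)"
      } else {
        print "Step 2 (drive compaction manually)"
      }
    }'
}
```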

Step 2: Manually drive compaction

# Run a full VACUUM across all subsystems
redis-cli -p 6399 VACUUM
# Check returned bytes per category and total

# If vector indexes dominate the dead fraction:
redis-cli -p 6399 VACUUM VECTOR <index-name> VERBOSE

# If graph dead-edge fraction is high:
redis-cli -p 6399 VACUUM GRAPH VERBOSE

# Recheck dead fraction after each pass
redis-cli -p 6399 INFO reclamation | grep dead_fraction_max

Run VACUUM in a loop during off-peak hours until dead_fraction_max < 0.20. This is manual work in v0.1.13; Wave-2 P4 ships the autovacuum daemon.
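
One way to write that loop is sketched below. The two `reclaim_*` hook functions and `vacuum_until_below` are hypothetical names introduced here. The hooks wrap the same redis-cli commands used throughout this runbook so they can be pointed at another port or stubbed. The pass limit is an added safeguard, not something Moon requires.

```shell
# Hypothetical hooks: redefine these to change the port or to stub them in tests.
reclaim_info()   { redis-cli -p 6399 INFO reclamation; }
reclaim_vacuum() { redis-cli -p 6399 VACUUM; }

# vacuum_until_below [TARGET] [MAX_PASSES]
# Runs VACUUM until reclamation_dead_fraction_max < TARGET (default 0.20) or
# MAX_PASSES (default 10) passes have run. Returns 0 on success, 1 if the
# target was not reached, 2 if the field is missing from the INFO output.
# The body runs in a subshell so its variables stay local.
vacuum_until_below() (
  target=${1:-0.20}
  max_passes=${2:-10}
  pass=0
  while :; do
    frac=$(reclaim_info | tr -d '\r' |
      awk -F: '$1 == "reclamation_dead_fraction_max" { print $2 }')
    if [ -z "$frac" ]; then
      echo "reclamation_dead_fraction_max not found in INFO output" >&2
      return 2
    fi
    # awk does the floating-point comparison; exit status 0 means frac < target.
    if awk -v f="$frac" -v t="$target" 'BEGIN { exit !(f + 0 < t + 0) }'; then
      echo "dead_fraction_max=$frac below $target after $pass passes"
      return 0
    fi
    if [ "$pass" -ge "$max_passes" ]; then
      echo "dead_fraction_max=$frac still at or above $target after $pass passes" >&2
      return 1
    fi
    reclaim_vacuum >/dev/null
    pass=$((pass + 1))
  done
)
```

If the function returns 1, the backlog is not shrinking under manual VACUUM, which is the situation Step 3 and the escalation criteria address.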

Step 3: Compact immutable segments (dead fraction in immutable tier)

# Force compact each vector index
# Note: FT.COMPACT is a no-op if mutable_len < compact_threshold.
# To force: set COMPACT_THRESHOLD to 1 temporarily or use VACUUM VECTOR.
redis-cli -p 6399 VACUUM VECTOR <index-name> VERBOSE

# Check if immutable segment count drops
redis-cli -p 6399 INFO reclamation | grep immutable_segments

Dead bytes in immutable segments can only be reclaimed by the Wave-2 segment merge (P2). If the dead fraction is dominated by immutable-segment tombstones, the mitigation is to limit ingest rate until Wave-2 ships, or to accept elevated disk usage as a known gap.

Step 4: Autovacuum throttling (v0.1.14+)

# Check if autovacuum is throttled due to high request load
redis-cli -p 6399 INFO reclamation | grep autovacuum_throttled_due_to_load
# 1 = throttled, 0 = running at full budget

# If throttled continuously: autovacuum cannot catch up under current load
# Options:
#   A. Schedule manual VACUUM during a low-traffic window
#   B. Increase autovacuum budget (v0.1.14 flag: --autovacuum-max-budget-ms)
#   C. Reduce ingest rate temporarily

Step 5: Prevent the backlog from growing

  • Alert on reclamation_dead_fraction_max > 0.35.
  • If dead fraction is consistently above 0.20 at steady-state, the ingest rate exceeds the single-pass compaction budget. This is a capacity planning signal: either the dataset is growing beyond the compaction SLO boundary, or Wave-2 weighted compaction (MA4) is needed.
  • Review vector index configuration: high COMPACT_THRESHOLD delays compaction, allowing the mutable segment (brute-force search) to grow and degrade query performance before compaction fires.
  • Check the OPERATOR-GUIDE.md memory accounting section for guidance on RSS growth that may accompany a large compaction backlog.

When to Escalate

Escalate to the on-call engineering lead when:

  1. Write-stall persists > 10 minutes after running VACUUM and freeing disk space. Indicates a compaction bug or WAL checkpoint failure.

  2. MVCC committed count growing unbounded (> 10M entries) after VACUUM MVCC and KILL SNAPSHOT. Indicates a snapshot handle leak in the MVCC manager.

  3. Moon OOM-killed more than once in 24 hours on the same instance with stable keyspace size. Indicates a memory leak outside the MVCC path.

  4. reclamation_dead_fraction_max > 0.70 for > 30 minutes. Compaction is critically behind, with a risk of write-stall cascading into disk full.

  5. FT.SEARCH latency did not improve after FT.COMPACT / VACUUM VECTOR. The read amplification cliff may have a different root cause (HNSW graph corruption, or TQ4 quantization degradation on low-dimensional embeddings; see CLAUDE.md gotchas).

  6. Manifest tombstones not dropping after VACUUM with --manifest-tombstone-retain-epochs 1. Indicates a manifest GC bug.


Resource | Location
Reclamation SLO contract | docs/operations/reclamation-slo.md
WAL disk-full runbook | docs/runbooks/disk-full-during-wal-rotation.md
OOM / snapshot runbook | docs/runbooks/oom-during-snapshot.md
AOF corruption recovery | docs/runbooks/corrupted-aof-recovery.md
Memory accounting guide | docs/OPERATOR-GUIDE.md
Production contract | docs/PRODUCTION-CONTRACT.md
Wave 1/2 roadmap | TODO.md
Coding rules and gotchas | CLAUDE.md