Operator Guide
Practical reference for running Moon in production: memory accounting, allocator tuning, and observability. Companion to PRODUCTION-CONTRACT.md (SLOs and durability) and BENCHMARK.md (throughput methodology).
Memory Accounting
Moon allocates memory in three logically-separable layers: keyspace data (DashTable + per-subsystem indexes), allocator metadata (arenas, slabs, spare runs), and operating-system page mappings (VSZ). Each layer can inflate independently; this section explains how to read each one and when a number indicates a real problem.
1. VSZ vs RSS: What They Mean
Virtual memory (VSZ) is the total address space a process has reserved from the operating system. It includes pages that the process has mapped but never written to, shared library segments, and memory-mapped I/O regions that exist in the virtual address space but are not backed by physical RAM until they are first accessed. VSZ is a reservation, not consumption. The OS will never charge your RAM or swap budget for a virtual page that has never been touched.
Resident set size (RSS) is the amount of RAM the process is actually using right now: pages that are physically in RAM, whether written by the process or read from disk. RSS is what you pay for on hosted infrastructure and what the OOM killer tracks. It is the correct number to monitor.
Reading tools:
- Linux: cat /proc/$PID/status and look for VmRSS: (current resident) and VmHWM: (resident high-water mark). VmPeak: in the same file is the peak virtual size, not peak RSS. ps -o pid,vsz,rss -p $PID also works.
- macOS: ps -o pid,vsz,rss -p $(pgrep moon) reports VSZ and RSS in kilobytes. Activity Monitor's "Memory" column approximates RSS (compressed memory included), not VSZ.
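As a minimal sketch of the Linux reading, the helper below condenses the text of /proc/<pid>/status into one line. The function name mem_summary is hypothetical; VmSize, VmRSS, and VmHWM are standard kernel field names, reported in kB.

```shell
# Summarize VSZ and RSS from /proc/<pid>/status text read on stdin.
# mem_summary is an illustrative helper, not part of Moon.
mem_summary() {
  awk '
    $1 == "VmSize:" { vsz = $2 }   # virtual size (VSZ), kB
    $1 == "VmRSS:"  { rss = $2 }   # current resident set, kB
    $1 == "VmHWM:"  { hwm = $2 }   # peak resident set, kB
    END {
      if (rss > 0)
        printf "vsz_kb=%d rss_kb=%d hwm_kb=%d ratio=%.0fx\n", vsz, rss, hwm, vsz / rss
      else
        print "no VmRSS field found"
    }
  '
}

# Usage on a live host (the pgrep pattern assumes the process is named moon):
#   mem_summary < "/proc/$(pgrep -o moon)/status"
```

A very large ratio together with a small rss_kb is the expected idle profile described in this section.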
The 44 GB incident: On a 16-core Linux host with default jemalloc settings (64 arenas), Moon at idle can show VSZ ≈ 438 GB while RSS ≈ 228 MB. The 438 GB is jemalloc's reserved virtual address space (6.8 GB per arena, 64 arenas). Zero of those pages are backed by RAM. See section 2 for how Moon caps this by default.
macOS specifics: On macOS aarch64 with narenas:8 active (post-PERF-10), Moon has been measured at VSZ ≈ 391.61 GB, RSS ≈ 8.31 MB at idle. The large VSZ on macOS is dominated by monoio's mmap region allocations (monoio uses kqueue on macOS rather than io_uring, but its internal buffer rings are still mmap'd at startup). The jemalloc arena cap has a smaller effect on macOS than on Linux because monoio mmap is the dominant contributor. On Linux, capping arenas from 64 to 8 reduces VSZ significantly because jemalloc arenas are the dominant reservation. This platform difference is expected; RSS behavior on both platforms is identical.
The authoritative check for whether Moon is consuming real RAM is:
2. Jemalloc Arena Layout
Jemalloc divides its heap into arenas to reduce contention between threads.
By default, jemalloc creates 4 × ncpus arenas. On first use each arena
reserves a contiguous block of virtual address space, approximately 6.8 GB
per arena on 64-bit systems. The reservation is virtual-only; only pages
actually written are backed by RAM.
On a 16-core Linux host: 64 arenas × 6.8 GB ≈ 438 GB VSZ at idle. This
is the source of the "44 GB" (or higher) reading operators see in Activity
Monitor or top's VIRT column. The process's actual RAM consumption is in the
hundreds of megabytes.
Moon's default cap (PERF-10): Moon exports a static _rjem_malloc_conf
symbol that bakes narenas:8 into the binary. This is read by jemalloc before
any allocation, before main() runs. Result on a 16-core host: 8 arenas × 6.8 GB
≈ 54 GB VSZ, a linear reduction in virtual address reservation with zero
impact on RSS, throughput, or latency.
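The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and 6.8 GB per arena is the approximate figure quoted in this section rather than a jemalloc constant, so the uncapped estimate lands slightly under the measured ≈ 438 GB.

```shell
# Estimate jemalloc's virtual reservation in GB for a given arena count.
# The 6.8 GB-per-arena factor is the approximation used in this guide.
arena_vsz_gb() {
  awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 6.8 }'
}

# jemalloc's default arena count is 4 x ncpus.
default_arenas() {
  echo $(( 4 * $1 ))
}

arena_vsz_gb "$(default_arenas 16)"   # uncapped on 16 cores: prints 435.2
arena_vsz_gb 8                        # Moon's default cap:   prints 54.4
```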
CLI override: Pass --memory-arenas-cap N (N in 1–256, default 8) to spawn
Moon with a different arena count. The flag works by re-spawning the process
with _RJEM_MALLOC_CONF=narenas:N injected into the environment before
jemalloc initializes. A sentinel environment variable (MOON_ARENAS_CAP_APPLIED)
prevents infinite re-spawn loops.
Environment variable precedence: If _RJEM_MALLOC_CONF is already set in
the environment when Moon starts, Moon detects this and logs a warning:
The operator-supplied value always wins over --memory-arenas-cap. This
allows advanced jemalloc tuning (e.g., narenas:16,background_thread:true)
without rebuilding the binary. See https://jemalloc.net/jemalloc.3.html §OPTIONS
for the full jemalloc option reference.
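The interaction between the flag, the sentinel, and an operator-set variable can be modeled as a small decision function. This is a sketch of the behavior described above, not Moon's actual launcher code: the function name, the order of the checks, and the sentinel value 1 are all assumptions.

```shell
# Model of the arena-cap decision at startup (illustrative only).
#   $1 = value of _RJEM_MALLOC_CONF      ("" if unset)
#   $2 = value of MOON_ARENAS_CAP_APPLIED ("" if unset)
#   $3 = N from --memory-arenas-cap       ("" if the flag was not given)
arena_cap_decision() {
  if [ -n "$2" ]; then
    # Sentinel present: this is the re-spawned child, so do not loop.
    echo "continue: cap already applied ($1)"
  elif [ -n "$1" ]; then
    # Operator set the variable: it wins over the flag (a warning is logged).
    echo "continue: operator value wins ($1)"
  elif [ -n "$3" ]; then
    # Flag given and nothing in the environment: re-spawn with it injected.
    echo "respawn: _RJEM_MALLOC_CONF=narenas:$3 MOON_ARENAS_CAP_APPLIED=1"
  else
    echo "continue: built-in narenas:8"
  fi
}

arena_cap_decision "" "" 16
arena_cap_decision "narenas:16,background_thread:true" "" 4
```

Checking the sentinel first matters in this model: after a re-spawn the variable is set by Moon itself, and without the sentinel it would be indistinguishable from an operator-supplied value.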
Non-jemalloc builds: If Moon is built with mimalloc-alt (see section 5),
the _rjem_malloc_conf static symbol is absent and --memory-arenas-cap is
accepted but logs a no-op warning. The arena concept does not apply to mimalloc.
Contention note: Reducing narenas below ncpus will cause arena sharing
across threads. Moon is a thread-per-core server β each shard event loop runs
on its own OS thread and allocates predominantly within its own working set.
8 arenas is comfortably above the contention threshold for typical 8-shard
deployments. If you run more than 8 shards, consider --memory-arenas-cap N
where N ≥ --shards.
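One way to turn the contention note into a rule of thumb is sketched below: keep the default of 8 up to 8 shards, otherwise use one arena per shard, clamped to the flag's accepted range. The helper name is hypothetical and the rule is only the guidance above restated, not a tuned recommendation.

```shell
# Suggest a value for --memory-arenas-cap from the shard count.
# Illustrative helper: default 8, one arena per shard above that, max 256.
suggest_arenas_cap() {
  if [ "$1" -gt 256 ]; then
    echo 256
  elif [ "$1" -gt 8 ]; then
    echo "$1"
  else
    echo 8
  fi
}

suggest_arenas_cap 4    # prints 8
suggest_arenas_cap 12   # prints 12
```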
3. Reading MEMORY DOCTOR
MEMORY DOCTOR is a Moon admin command that returns a multi-line breakdown of
process memory across all instrumented subsystems. Run it with any Redis client:
Sample output from a live Moon instance with 100 keys loaded on macOS aarch64 (jemalloc build, 1 shard):
Sample of Moon memory usage at 2026-04-27T17:23:51Z
Process:
RSS: 8.34 MB
VSZ: 391.62 GB
Allocator: jemalloc
Arenas: 8
Per-subsystem (resident):
DashTable + entries: 24.74 KB (0.3%)
HNSW (vector): 0 B (0.0%)
CSR (graph): 0 B (0.0%)
WAL writers: 0 B (0.0%)
Sealed segments: 0 B (0.0%)
Replication backlog: 0 B (0.0%)
Allocator overhead: 8.32 MB (99.7%)
Mapped regions:
File-backed mmap: n/a
Anonymous mmap: n/a
Recommendations:
- VSZ-vs-RSS ratio is 48061x (high -- consider --memory-arenas-cap 8)
- Allocator overhead dominates RSS (>50%). Possible fragmentation -- consider MEMORY PURGE or restart.
Interpreting each field:
- Process: RSS is the current resident set size. This is your real memory consumption.
- Process: VSZ is the virtual address space reserved. Large values are expected; see section 1.
- Process: Allocator is jemalloc (default) or mimalloc (mimalloc-alt build).
- Process: Arenas is the configured narenas cap, read live via tikv-jemalloc-ctl. It confirms which of (a) the built-in default 8, (b) --memory-arenas-cap N, or (c) operator-set _RJEM_MALLOC_CONF=narenas:N is active. Shows n/a on non-jemalloc builds.
- Per-subsystem (resident) lists the bytes attributed to each subsystem, derived from resident_bytes() accessors added in Phase 190. The percentages are relative to RSS. The 7 fixed labels are:
  - DashTable + entries: key-value storage (DashTable structural overhead + entry bytes)
  - HNSW (vector): HNSW graph nodes and edges in active vector indexes
  - CSR (graph): property graph adjacency and MemGraph node/edge SlotMaps
  - WAL writers: per-shard write-ahead log buffers (note: always 0, because WAL writers are stack-owned by shard event loops and not reachable from command dispatch)
  - Sealed segments: immutable text-search index segments pending compaction
  - Replication backlog: replica replication buffer VecDeques
  - Allocator overhead: computed as max(0, RSS - sum(other six)). This includes jemalloc thread caches, slab metadata, spare runs, and any subsystem not yet instrumented. Healthy values are 5–15% of RSS. Sustained >25% on a steady workload suggests fragmentation; consider the mimalloc-alt A/B (see section 5).
- Mapped regions shows n/a until a future phase implements /proc/self/smaps parsing (Linux) or vmmap enumeration (macOS). These fields are reserved.
- Recommendations are diagnostic hints generated automatically from the observed ratios. A high VSZ-vs-RSS ratio on a fresh install is normal and expected.
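The Allocator overhead formula can be checked by hand with a short sketch. The helper name is hypothetical, and the byte counts are conversions of the sample's rounded figures (RSS 8.34 MB, DashTable 24.74 KB, all other subsystems zero), so they approximate the live values rather than reproduce them.

```shell
# allocator_overhead RSS_BYTES SUBSYSTEM_BYTES...
# Prints "<overhead_bytes> <percent_of_rss>" using max(0, RSS - sum(others)).
allocator_overhead() {
  printf '%s\n' "$*" | awk '{
    sum = 0
    for (i = 2; i <= NF; i++) sum += $i     # the six instrumented subsystems
    over = $1 - sum
    if (over < 0) over = 0                  # clamp, as in max(0, ...)
    pct = ($1 > 0) ? 100 * over / $1 : 0
    printf "%d %.1f\n", over, pct
  }'
}

# Approximation of the sample output above.
allocator_overhead 8745123 25334 0 0 0 0 0   # prints 8719789 99.7
```

On a nearly empty instance almost all of RSS is allocator and runtime baseline, which is why the sample shows 99.7% and triggers the fragmentation hint even though nothing is wrong.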
4. Prometheus moon_memory_bytes{kind=...}
Moon emits a labeled Prometheus gauge moon_memory_bytes with 7 kind labels.
The metric is updated every 15 seconds by a background publisher on the admin-HTTP
thread. All 7 labels are always present regardless of which features are enabled;
disabled subsystems emit zero-valued series (stable label set for Grafana dashboards).
Scrape endpoint: GET http://<admin-host>:<admin-port>/metrics (default admin port
is 6380). The metric appears in standard Prometheus text format.
| kind | Source | Notes |
|---|---|---|
| dashtable | DashTable structural overhead + per-entry bytes | Scales with key count and average value size |
| hnsw | HNSW graph nodes and edges across all mutable vector segments | Grows on HSET into indexed fields; resets to mutable after FT.COMPACT |
| csr | Graph adjacency (CSR storage) + MemGraph SlotMaps | Grows with GRAPH.ADDNODE / GRAPH.ADDEDGE |
| wal | Per-shard WAL buffer capacity | Always 0: WAL writers are stack-owned by event loops, not accessible from admin thread |
| sealed | Immutable text-search index segment buffers | Grows until FT.COMPACT flushes mutable to immutable |
| replication_backlog | Replica backlog VecDeque allocated capacity | Bounded by repl-backlog-size; 0 when no replicas connected |
| allocator_overhead | max(0, RSS - sum of other 6 kinds) | Includes jemalloc thread caches, spare runs, and any uninstrumented subsystem |
Querying:
# Total memory by subsystem (most recent values)
sum by (kind) (moon_memory_bytes)
# Allocator overhead as percentage of RSS
moon_memory_bytes{kind="allocator_overhead"} / ignoring(kind) moon_rss_bytes * 100
# Dashtable growth rate (bytes per minute)
deriv(moon_memory_bytes{kind="dashtable"}[5m]) * 60
Coverage: The sum of all 7 kind values is expected to equal or slightly exceed RSS by construction (allocator_overhead absorbs the difference). The Phase 190 milestone gate (OBS-04) requires that the sum covers ≥95% of RSS. Lower coverage would indicate a subsystem not yet instrumented; file a bug if the sum falls below 80% of RSS on a loaded server.
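A coverage check against a scrape can be sketched with awk. The helper name is hypothetical, and the sketch assumes moon_rss_bytes is exposed on the same /metrics endpoint as moon_memory_bytes and that samples carry no trailing timestamp.

```shell
# Read Prometheus text format on stdin; print each kind, then the sum of all
# kinds as a percentage of moon_rss_bytes. Illustrative helper only.
memory_coverage() {
  awk '
    /^#/ { next }                                  # skip HELP/TYPE lines
    index($0, "moon_memory_bytes{") == 1 {
      kind = $1
      sub(/^.*kind="/, "", kind)                   # drop everything before the label value
      sub(/".*$/, "", kind)                        # drop everything after it
      printf "%s=%d\n", kind, $NF
      sum += $NF
    }
    index($0, "moon_rss_bytes") == 1 { rss = $NF }
    END { if (rss > 0) printf "coverage=%.1f%%\n", 100 * sum / rss }
  '
}

# Usage against a local instance on the default admin port:
#   curl -s http://localhost:6380/metrics | memory_coverage
```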
Scrape interval: The publisher updates every 15 seconds. Prometheus scrape
intervals faster than 15s will see stale values between publisher ticks. For
real-time memory investigation, use MEMORY DOCTOR directly, since it reads
live values on each invocation.
5. mimalloc-alt Opt-In
mimalloc-alt is an optional build feature that replaces jemalloc with
Microsoft's mimalloc as the global allocator. It is provided for A/B evaluation
and allocator-specific performance investigation. It is not the supported
production default.
When to use mimalloc-alt:
- Investigating allocator-bound performance on synthetic microbenchmarks where jemalloc's thread-cache warm-up latency is a confounder.
- Evaluating a smaller VSZ profile on macOS development hosts (mimalloc does not use the same large-arena reservation model as jemalloc).
- Diagnosing suspected jemalloc-specific fragmentation: if MEMORY DOCTOR shows Allocator overhead climbing steadily over days without keyspace growth, an A/B comparison can confirm whether the fragmentation is allocator-specific.
When NOT to use mimalloc-alt:
Production deployments. jemalloc is the validated allocator, with a stronger thread cache and better fragmentation behavior under load for Moon's thread-per-core workload pattern. mimalloc is the experimental knob.
Build command:
cargo build --release --no-default-features \
--features runtime-monoio,mimalloc-alt,graph,text-index
For the tokio runtime (CI parity):
cargo build --release --no-default-features \
--features runtime-tokio,mimalloc-alt,graph,text-index
Mutual exclusion: jemalloc and mimalloc-alt are mutually exclusive.
Enabling both features at once produces a compile_error! at build time:
A/B benchmark: The script scripts/bench-allocator-ab.sh builds both
allocator variants and runs identical workloads against each:
Options:
- --quick: reduced request count for fast iteration
- --requests N: override default request count (default: 500000)
- --shards N: number of shards (default: 8)
- --clients N: number of client connections (default: 50)
Output is written to tmp/allocator-ab-<timestamp>.txt. The script runs
SET p=64 and GET p=64 against both binaries and prints throughput side-by-side.
Trade-off summary:
| Dimension | jemalloc (default) | mimalloc-alt |
|---|---|---|
| VSZ profile | High (large arena reservations) | Lower (no arena model) |
| Thread-cache | Strong; benefits long-running shards | Per-thread heap pools |
| Fragmentation under load | Better on sustained write workloads | Better on allocation-heavy micro-tasks |
| --memory-arenas-cap flag | Supported | No-op (logged warning) |
| MEMORY DOCTOR Arenas: | Numeric (e.g., 8) | n/a |
| Production support | Validated | Experimental |
6. Troubleshooting
Q: Activity Monitor shows Moon using 44 GB (or 200+ GB). Is something broken?
No. The large number is jemalloc's reserved virtual address space; see section 1 (VSZ vs RSS) and section 2 (Jemalloc Arena Layout). Virtual address reservations cost zero RAM, zero swap, and zero disk. The number the operator is seeing is VSZ, not RSS.
To confirm: run ps -o pid,vsz,rss -p $(pgrep moon) or send MEMORY DOCTOR to
the server. The RSS line in the MEMORY DOCTOR output is the real memory
consumption, typically in the hundreds of megabytes rather than tens of gigabytes.
On macOS, Activity Monitor's "Memory" column approximates RSS (including
compressed memory), not VSZ. If Activity Monitor shows a number much larger than
what MEMORY DOCTOR reports for RSS, the discrepancy is almost certainly
compressed memory or macOS's own memory management heuristics, not a Moon leak.
Q: How do I detect a real memory leak?
A real leak shows RSS growth over time without a corresponding increase in stored data. Watch:
If RSS is climbing >5% per hour while DBSIZE, vector segment count, and graph
node count are flat, that is a leak signal. Cross-check MEMORY DOCTOR: if
Allocator overhead is climbing while per-subsystem totals are flat, suspect
allocator fragmentation rather than a data-structure leak.
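The 5% per hour threshold can be applied to two RSS samples taken some time apart, as in the sketch below. Both helper names are hypothetical; the threshold is the one stated above, and it is only meaningful while DBSIZE, vector segment count, and graph node count are flat.

```shell
# rss_growth_pct_per_hour RSS_START RSS_END ELAPSED_SECONDS
# Prints RSS growth normalized to percent per hour (illustrative helper).
rss_growth_pct_per_hour() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.1f\n", (b - a) / a * 100 * 3600 / s }'
}

# is_leak_signal RATE: exit status 0 when the rate exceeds 5% per hour.
is_leak_signal() {
  awk -v r="$1" 'BEGIN { exit !(r > 5) }'
}

# Example: 1000 kB growing to 1030 kB over 30 minutes.
rate=$(rss_growth_pct_per_hour 1000 1030 1800)
if is_leak_signal "$rate"; then
  echo "RSS growing at ${rate}% per hour: possible leak"
fi
```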
To identify which subsystem is responsible, monitor the Prometheus gauge:
# One-shot scrape of per-kind values
curl -s http://localhost:6380/metrics | grep 'moon_memory_bytes'
If a per-subsystem kind is climbing without corresponding keyspace growth (DBSIZE or index size unchanged), file a bug with the subsystem name and the rate of growth.
To rule out fragmentation as the root cause, run the allocator A/B comparison:
If mimalloc-alt shows meaningfully lower allocator overhead under the same workload, the issue is jemalloc fragmentation rather than a logic-level leak.
Q: Can I disable the arena cap entirely and use jemalloc's default (4 × ncpus arenas)?
The narenas:8 cap is always active on default jemalloc builds. To use a
different value, pass --memory-arenas-cap N where N is in 1–256. For example,
to match jemalloc's default on a 4-core host:
To bypass both the static symbol and the CLI flag entirely, set _RJEM_MALLOC_CONF
in the environment directly; it takes precedence over both:
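To run with jemalloc's default arena count (4 × ncpus, potentially hundreds of GB VSZ),
set narenas to that value explicitly via --memory-arenas-cap or _RJEM_MALLOC_CONF.
Building Moon with the mimalloc-alt feature is not equivalent: that path omits the static
_rjem_malloc_conf symbol and applies no arena limit, but only because it replaces jemalloc
with mimalloc, which has no arenas to cap.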
Advanced jemalloc options are documented at https://jemalloc.net/jemalloc.3.html §OPTIONS.
Q: What is a healthy Allocator overhead percentage?
- 5–15% of RSS: normal and expected.
- 15–25%: acceptable on bursty write workloads where jemalloc retains thread caches between bursts.
- Sustained >25% on a steady workload: investigate. Run the allocator A/B (section 5). If mimalloc-alt shows significantly lower overhead, the issue is jemalloc fragmentation. If both allocators show the same overhead, there may be an uninstrumented subsystem accumulating memory.
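The bands above can be encoded in a small classifier for use in scripts or alert annotations. The helper name is hypothetical, and treating values under 5% as normal is an assumption, since the guide does not define that range.

```shell
# classify_overhead PERCENT_OF_RSS
# Maps an Allocator overhead percentage to the bands listed above.
classify_overhead() {
  awk -v p="$1" 'BEGIN {
    if (p <= 15)      print "normal"
    else if (p <= 25) print "acceptable on bursty write workloads"
    else              print "investigate if sustained"
  }'
}

classify_overhead 12     # prints normal
classify_overhead 31.5   # prints investigate if sustained
```

A single reading in the top band is not conclusive; the guidance applies to sustained values on a steady workload.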
Q: The CI memory-steady-state job failed. What do I do?
The job runs scripts/bench-memory-steady-state.sh and compares per-kind
values against the committed baseline in tests/fixtures/memory-baseline.json
with a ±5% tolerance.
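The tolerance comparison itself is simple enough to reproduce locally before regenerating a baseline. The sketch below is not the CI script; the helper name is hypothetical, and how the real job treats a zero baseline is not specified here (this version accepts only an exact zero).

```shell
# within_tolerance BASELINE CURRENT [PERCENT]
# Exit status 0 when CURRENT is within +/-PERCENT of BASELINE (default 5).
within_tolerance() {
  awk -v base="$1" -v cur="$2" -v pct="${3:-5}" 'BEGIN {
    diff = cur - base
    if (diff < 0) diff = -diff
    exit !(diff <= base * pct / 100)
  }'
}

# Example: a dashtable value of 1040 against a baseline of 1000 passes,
# while 1060 would fail.
if within_tolerance 1000 1040; then echo "dashtable: ok"; fi
if ! within_tolerance 1000 1060; then echo "dashtable: regression"; fi
```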
If the failure is expected (intentional data structure change, new entry overhead):
# Regenerate the baseline
bash scripts/bench-memory-steady-state.sh \
--write-baseline tests/fixtures/memory-baseline.json
# Verify the new baseline
jq . tests/fixtures/memory-baseline.json
# Commit with the required tag in the subject
git commit -m "chore(190-04): [memory-baseline-update] reason: <why the change is expected>"
The [memory-baseline-update] tag in the commit subject is the project
convention for baseline updates and is required for the CI job to recognize
the change as intentional.
If the failure is unexpected, compare the captured values to the baseline to identify which subsystem regressed, then follow the real-leak detection procedure above.
See also:
- PRODUCTION-CONTRACT.md for memory-related SLOs and platform guarantees.
- production-guide.md for deployment configuration and tuning recommendations.
- Phase 190 plans for MEMORY DOCTOR and Prometheus internals.
- Phase 191 plans for the arena-cap and mimalloc-alt feature design notes.