Architecture¶

Moon uses a thread-per-core shared-nothing architecture where each shard runs independently on its own thread with no shared mutable state.

flowchart TD
    C["Client connections"] --> L["TCP / TLS listener"]
    L --> R{"Shard router<br/>hash(key) % N"}
    R --> S0["Shard 0"]
    R --> S1["Shard 1"]
    R --> SN["Shard N-1"]
    S0 --> D0[("DashTable")]
    S1 --> D1[("DashTable")]
    SN --> DN[("DashTable")]
    D0 --> W0["Per-shard WAL"]
    D1 --> W1["Per-shard WAL"]
    DN --> WN["Per-shard WAL"]

The full picture — request path, shard internals, storage layer, persistence, and the feature satellites (Lua, PubSub, cluster, ACL, replication) — is shown below:

Per-shard components¶

Each shard runs on its own thread with:

Event loop — Tokio current_thread or Monoio FusionDriver runtime (io_uring on Linux, kqueue on macOS)
DashTable — Segmented hash table with SIMD probing
WAL writer — Per-shard persistence with no global lock
PubSub registry — Cross-shard fan-out via flume (MPSC) channels
Lua VM — Lazy-initialized on first EVAL command

All cross-shard communication uses lock-free message passing. No mutexes on the hot path.

DashTable¶

The primary key-value engine is a segmented Swiss-table hash map with SIMD probing. Each segment holds 60 slots with a 90% load threshold.

Property	Value
Segment size	~3 KB (64B ctrl + 8B meta + 60 slots)
Load threshold	90% (54/60 slots)
Average fill	~67%
Probe method	16-way SIMD parallel match (SSE2/NEON)

Keys are routed to segments via xxhash. Within a segment, SIMD control-byte matching finds the target slot in a single instruction.

Memory layout¶

Moon uses specialized compact types to minimize per-key overhead:

Struct	Size	Description
CompactKey	24 B	Inline keys up to 23 bytes (zero heap allocation)
CompactEntry	24 B	CompactValue (16B) + TTL delta (4B) + metadata (4B)
CompactValue	16 B	SSO for values up to 12 bytes inline
HeapString	24 B	`Vec<u8>` with no Arc overhead for non-shared values

Why Moon uses less memory at larger values¶

Moon:  CompactValue(16B) → Box<HeapString> → Vec<u8>(24B) → data
       Total overhead: 48 bytes + data

Redis: dictEntry(24B) → robj(16B) → SDS(8-17B + data) + jemalloc rounding
       Total overhead: ~64-80 bytes + data

TTL is packed as a 4-byte delta inside CompactEntry, costing zero extra bytes per expiring key. Redis maintains a separate expires hash table with a full dictEntry (24 bytes) per expiring key.

Dual runtime¶

Moon supports two async runtimes:

Runtime	Platform	Use case
Monoio (default)	Linux (io_uring), macOS (kqueue)	Peak performance, production
Tokio	All platforms	Portability, CI, GitHub Actions

Monoio's thread-per-core model avoids work-stealing overhead. On Linux, io_uring batch I/O amortizes syscall costs across multiple commands.

Key optimizations¶

Optimization	Impact	Component
SIMD probing	16-way parallel key match	DashTable segments
Lock-free oneshot	12% CPU reduction	Cross-shard dispatch
CachedClock	4% throughput gain	Per-shard event loop
Lazy Lua VM	-18 MB baseline	Shard startup
Lazy replication backlog	-12 MB baseline	Shard startup
Per-shard WAL	Linear scaling with shards	Persistence
io_uring batch I/O	Amortized syscalls	Network layer
Software prefetch	Overlapped fetch + hash	DashTable lookup
Zero-copy argument slicing	Eliminates parse buffer copies	RESP parser
Direct GET serialization	Bypasses Frame allocation	Response path

Vector search engine¶

Moon ships an in-process vector search engine accessed via Redis-compatible FT.CREATE / FT.SEARCH commands. It uses TurboQuant 4-bit quantization to compress f32 vectors to ~4 bits per dimension while preserving rank-order similarity.

The diagram's insert rate (~30K vec/s) measures raw mutable-segment append (brute-force, before indexing). End-to-end insert throughput including HNSW build and compaction is ~10K vec/s — see the Performance vs Qdrant table below.

Tiered segment architecture¶

Segment	Backing	Search algorithm	Use case
Mutable	RAM, append-only	Brute-force TQ-ADC	Active inserts
Immutable	RAM, frozen	HNSW + TQ-ADC	Hot data, post-compact
Warm	mmap'd .mpf files	HNSW + TQ-ADC	Aged-out data
Cold	DiskANN	Vamana + PQ	Massive datasets

HSET key field <f32_blob> automatically encodes + indexes vectors. When the mutable segment hits COMPACT_THRESHOLD, the next FT.SEARCH triggers asynchronous compaction into a frozen HNSW immutable segment. Explicit FT.COMPACT forces unconditional compaction (e.g., end of bulk load).

TurboQuant 4-bit ADC¶

The search hot path uses Asymmetric Distance Computation (ADC) with a per-query lookup table:

Query vector is FWHT-rotated and normalized once per query
A 16-entry LUT (or 32-entry with sub-centroid signs) is built per coordinate
HNSW beam search computes per-candidate distance via 192 nibble-indexed LUT lookups (for 384d) instead of 384 f32 multiply-adds
Distance kernel is 8-way ILP unrolled with unsafe pointer arithmetic and 8 independent f32 accumulators (verified via objdump: 8 parallel vaddss into xmm3-xmm8 on x86)

The LUT is pre-allocated in SearchScratch (zero alloc per query). Sub-centroid sign bits provide 2× quantization resolution at zero memory cost in the search path.

Performance vs Qdrant (10K MiniLM, 384d, real semantic embeddings)¶

	Moon ARM64	Moon x86	Qdrant FP32 x86
Recall@10	0.967	0.967	~0.95
Search QPS	843	1,296	507
Search p50	1.20 ms	0.78 ms	1.79 ms
Insert	9,950 v/s	11,270 v/s	~2,600 v/s

Moon beats Qdrant on QPS (2.56×), latency (2.3× lower), recall (+1.7%), insert throughput (4.3×), and memory (~20% less per vector via TQ4).

Graph engine¶

Moon ships an in-process property-graph engine queried with openCypher via GRAPH.QUERY / GRAPH.RO_QUERY. It reuses the same per-shard, thread-per-core foundations as the key-value and vector engines: a graph lives inside the shard that owns it, mutates through the per-shard WAL, and snapshots to immutable segments — no global lock, no separate process.

flowchart LR
    Q["GRAPH.QUERY<br/>(openCypher)"] --> P["logos lexer +<br/>recursive-descent parser"]
    P --> PL["Plan compiler<br/>(+ xxhash plan cache)"]
    PL --> EX["Row pipeline:<br/>NodeScan · Expand · Filter<br/>Project · Merge · ShortestPath"]
    EX --> M[("MemGraph<br/>mutable write buffer")]
    M -- "freeze + compact<br/>(~64K edges)" --> C[("CSR segment<br/>immutable, mmap")]
    M --> T["Traversal:<br/>BFS · DFS · Dijkstra"]
    C --> T
    C -. "graph-filtered KNN" .-> V["Vector engine"]
    M --> W["Per-shard WAL<br/>+ manifest"]

Two-tier storage¶

Like the vector engine, the graph uses a mutable write buffer that freezes into immutable read-optimized segments.

Tier	Backing	Writes	Reads
MemGraph	RAM, generational `SlotMap` + inline bidirectional adjacency	O(1) append, instant	adjacency-list walk
CSR segment	RAM heap or mmap'd `.csr`	frozen at ~64K edges, then read-only	contiguous neighbor scan + minimal-perfect-hash node lookup

Nodes and edges carry properties inline (SmallVec<[(u16, PropertyValue)]>); labels and relationship types are dictionary-encoded to u16. Immutable segments store edges in Compressed Sparse Row form (row_offsets + col_indices), so a neighbor scan is one contiguous memory read. Deletes flip a bit in a Roaring validity bitmap — the CSR arrays are never rewritten until compaction merges segments and drops tombstones.

The segment list is swapped with ArcSwap, so reads are a single lock-free atomic load and in-flight traversals keep old segments alive across compaction.

Cypher surface¶

The query path is a logos lexer → recursive-descent parser → linear physical-plan compiler (with a per-graph xxhash plan cache) → row-based execution pipeline. Supported clauses include MATCH / WHERE / RETURN (with DISTINCT, ORDER BY, LIMIT, SKIP), CREATE, MERGE (with ON CREATE / ON MATCH), SET, DELETE / DETACH DELETE, WITH, UNWIND, variable-length paths, and shortestPath() (Dijkstra).

Fourteen GRAPH.* commands are exposed: CREATE, ADDNODE, ADDEDGE, DELETE, DROP, QUERY, RO_QUERY, EXPLAIN, PROFILE, NEIGHBORS, INFO, LIST, VSEARCH, and HYBRID. The last two bridge to the vector engine for graph-filtered KNN, vector-guided walks, and graph-constrained re-ranking.

Persistence & consistency¶

Mutations append to the per-shard WAL as RESP records and replay on recovery. Frozen segments are written as CRC-checked .csr files tracked by a JSON manifest; recovery loads the manifest, maps the CSR files, then replays the WAL tail. Every node and edge carries MVCC version stamps plus bi-temporal valid_from / valid_to, and TXN.ABORT rolls back CREATE / SET / DELETE / MERGE via an undo log. Cross-shard traversal uses SPSC scatter-gather across shards.

Implemented scope

The engine is production-grade for property-graph storage and the Cypher subset above (well covered by ~9K lines of integration tests). Tracked gaps, each marked with a phase number in the source: aggregation (count / collect are currently per-row no-ops), OPTIONAL MATCH left-join semantics, regex match (=~), multi-hop edge-variable binding ([r*2..5]), and wiring the label index into NodeScan (label scans are currently linear over the write buffer).

Design inspirations¶

Dragonfly — shared-nothing thread-per-core architecture (C++)
Dash (VLDB 2020) — segmented hash table design
Swiss Table / Abseil — SIMD control-byte probing
VLL (VLDB 2012) — lightweight multi-key coordination
ScyllaDB / Seastar — thread-per-core for databases