Phase 7 Benchmarks Live

Uncompromising Speed. Absolute Determinism.

A high-frequency order matching engine engineered in Rust. Utilizing an LMAX-inspired single-threaded core, benchmarked at sub-microsecond latency and 4.7M simulated orders/sec on a single machine.

End-to-End Latency Profile

Windows 11 · Rust 2024 · criterion 0.8.2 · HdrHistogram · release profile

P50100ns
P90200ns
P99500ns
P99.9800ns
Mean151ns
Max~38µs
4.7M
Orders / Second
Single-threaded, local benchmark
Zero
Heap Allocations
Via borrowed &[Fill] slices
56ns
WAL Overhead
Encode + CRC32 append
1.4ms
Crash Recovery
10K record deterministic replay

Cross-validated via criterion: 34–64 ns/order across 1K–100K crossing and mixed workloads

LMAX-Inspired Data Flow

Thread 1: Ingestion
TCP Network Rx
Binary Decoder
SPSC Ring Buffer
Atomic Acquire/Release
Thread 2: Matching
Matching Engine
BTreeMap & 1M-slot Arena
mmap WAL
UDP Multicast

Path to Sub-Microsecond

Phase 1Domain Model & Correctness
Proptest Baseline

Established the baseline logic with HashMap and VecDeque. Fixed-point i64 prices were implemented on day one to prevent IEEE 754 floating-point rounding errors.

Phase 2Zero Hot-Path Allocations
-58% Cancel Latency

Replaced collections with a 1M-slot Arena object pool and custom intrusive doubly linked lists. Eliminated all OS malloc contention.

Phase 3Sorted Price Levels
-67% Sweep Latency

Swapped HashMaps for BTreeMaps. Best-price recomputation went from an O(n) linear scan to an O(log n) tree lookup upon level exhaustion.

Phase 4Lock-Free Concurrency
8.8x Throughput

Implemented a Disruptor-style SPSC ring buffer for inter-thread communication. Engineered with CachePadded atomics and Acquire/Release memory ordering.

Phase 5Binary Protocol & Multicast
Zero-copy codecs

Dropped JSON overhead for custom fixed-size little-endian structs. Integrated TCP ingestion and UDP multicast execution report broadcasting.

Phase 6Deterministic Recovery
56ns WAL append

Added a memory-mapped Write-Ahead Log (WAL) and bincode snapshots. Ensures bit-exact state recovery in under 1.4 milliseconds after a crash.

Phase 7Observability & Proof
500ns P99 / 4.7M Ops

Eliminated the final Vec<Fill> allocation via borrowed slices. Verified architecture limits using HdrHistogram and a custom load generator.

Systems Engineering

Lock-Free SPSC Pipeline

A Disruptor-style ring buffer bridges ingestion to matching. Engineered with atomic Acquire/Release memory ordering and CachePadded structures, delivering an 8.8x throughput increase over standard mpsc channels.

Arena Memory & Intrusive Lists

The order book is backed by a pre-allocated 1M-slot arena. Custom intrusive doubly linked lists bypass OS malloc contention entirely. Cancel complexity reduced from O(n) to O(1) via direct index unlinking.

BTreeMap Best-Price Tracking

Replaced naive HashMaps with BTreeMaps for sorted price levels. Best-price recomputation upon level exhaustion improved from an O(n) linear scan to an O(log n) tree lookup, cutting multi-level sweep latency by 67%.

Cache-Line Optimization

Structs are meticulously padded to 64 bytes (#[repr(C, align(64))]) to occupy exactly one x86 cache line. This actively prevents false sharing and CPU cache invalidation during multi-threaded ring buffer access.

Deterministic Recovery

Load the latest snapshot, replay the WAL, resume. Recovery completes in under 1.4 milliseconds with bit-exact state reconstruction. The recovered book is identical to pre-crash state.

Data Integrity

Every WAL record carries a CRC32 checksum. Corrupted records from power loss or disk faults are detected on replay and cleanly truncated to the last valid entry.

Graceful Back-Pressure

When the matching engine falls behind ingestion rate, the 65,536-slot ring buffer absorbs burst traffic. Inbound orders are delayed via TCP flow control, never dropped.

Proven Under Pressure

Order Cancellation

Middle of 1,000 resting orders

2.16 µs0.91 µs
-58%

Multi-Level Price Sweep

100 ask levels exhausted sequentially

45.14 µs14.73 µs
-67%

Worst-Case Cancel

1,000 orders across 1,000 distinct prices

696.94 µs124.07 µs
-82%

Cross-Thread Throughput

SPSC ring buffer vs std::sync::mpsc

16.35 ms1.85 ms
8.8x
137
Tests Passing
6
Unsafe Blocks
Each with documented safety invariants
Zero
External Dependencies
On the matching hot path
Bit-Exact
Deterministic Replay
Same input always produces same state

Production Horizon

Deliberately out of scope for this project — but engineered with these production realities in mind.

DPDK / AF_XDP

Kernel bypass networking for true wire-to-wire latency elimination

Multi-Instrument Routing

Gateway distributing orders to per-instrument engine instances

Aeron Transport

Reliable UDP multicast with built-in flow control and backpressure

FIX Protocol Gateway

Industry-standard order entry interface for external client connectivity

Hot-Hot Failover

Secondary engine replaying the same WAL for zero-downtime recovery