Uncompromising Speed.
Absolute Determinism.
A high-frequency order matching engine engineered in Rust. Utilizing an LMAX-inspired single-threaded core, benchmarked at sub-microsecond latency and 4.7M simulated orders/sec on a single machine.
End-to-End Latency Profile
Windows 11 · Rust 2024 · criterion 0.8.2 · HdrHistogram · release profile
Cross-validated via criterion: 34–64 ns/order across 1K–100K crossing and mixed workloads
LMAX-Inspired Data Flow
Path to Sub-Microsecond
Established the baseline logic with HashMap and VecDeque. Fixed-point i64 prices were implemented on day one to prevent IEEE 754 floating-point rounding errors.
Replaced collections with a 1M-slot Arena object pool and custom intrusive doubly linked lists. Eliminated all OS malloc contention.
Swapped HashMaps for BTreeMaps. Best-price recomputation went from an O(n) linear scan to an O(log n) tree lookup upon level exhaustion.
Implemented a Disruptor-style SPSC ring buffer for inter-thread communication. Engineered with CachePadded atomics and Acquire/Release memory ordering.
Dropped JSON overhead for custom fixed-size little-endian structs. Integrated TCP ingestion and UDP multicast execution report broadcasting.
Added a memory-mapped Write-Ahead Log (WAL) and bincode snapshots. Ensures bit-exact state recovery in under 1.4 milliseconds after a crash.
Eliminated the final Vec<Fill> allocation via borrowed slices. Verified architecture limits using HdrHistogram and a custom load generator.
Systems Engineering
Lock-Free SPSC Pipeline
A Disruptor-style ring buffer bridges ingestion to matching. Engineered with atomic Acquire/Release memory ordering and CachePadded structures, delivering an 8.8x throughput increase over standard mpsc channels.
Arena Memory & Intrusive Lists
The order book is backed by a pre-allocated 1M-slot arena. Custom intrusive doubly linked lists bypass OS malloc contention entirely. Cancel complexity reduced from O(n) to O(1) via direct index unlinking.
BTreeMap Best-Price Tracking
Replaced naive HashMaps with BTreeMaps for sorted price levels. Best-price recomputation upon level exhaustion improved from an O(n) linear scan to an O(log n) tree lookup, cutting multi-level sweep latency by 67%.
Cache-Line Optimization
Structs are meticulously padded to 64 bytes (#[repr(C, align(64))]) to occupy exactly one x86 cache line. This actively prevents false sharing and CPU cache invalidation during multi-threaded ring buffer access.
Deterministic Recovery
Load the latest snapshot, replay the WAL, resume. Recovery completes in under 1.4 milliseconds with bit-exact state reconstruction. The recovered book is identical to pre-crash state.
Data Integrity
Every WAL record carries a CRC32 checksum. Corrupted records from power loss or disk faults are detected on replay and cleanly truncated to the last valid entry.
Graceful Back-Pressure
When the matching engine falls behind ingestion rate, the 65,536-slot ring buffer absorbs burst traffic. Inbound orders are delayed via TCP flow control, never dropped.
Proven Under Pressure
Order Cancellation
Middle of 1,000 resting orders
Multi-Level Price Sweep
100 ask levels exhausted sequentially
Worst-Case Cancel
1,000 orders across 1,000 distinct prices
Cross-Thread Throughput
SPSC ring buffer vs std::sync::mpsc
Production Horizon
Deliberately out of scope for this project — but engineered with these production realities in mind.
DPDK / AF_XDP
Kernel bypass networking for true wire-to-wire latency elimination
Multi-Instrument Routing
Gateway distributing orders to per-instrument engine instances
Aeron Transport
Reliable UDP multicast with built-in flow control and backpressure
FIX Protocol Gateway
Industry-standard order entry interface for external client connectivity
Hot-Hot Failover
Secondary engine replaying the same WAL for zero-downtime recovery