Phase 7 Benchmarks Live

Uncompromising Speed.
Absolute Determinism.

A high-frequency order matching engine written in Rust. Its LMAX-inspired single-threaded core is benchmarked at sub-microsecond latency (P99 of 500 ns) and 4.7M simulated orders/sec on a single machine.

End-to-End Latency Profile

Windows 11 · Rust 2024 · criterion 0.8.2 · HdrHistogram · release profile

P50: 100 ns

P90: 200 ns

P99: 500 ns

P99.9: 800 ns

Mean: 151 ns

Max: ~38 µs

4.7M

Orders / Second

Single-threaded, local benchmark

Zero

Heap Allocations

Via borrowed &[Fill] slices

56ns

WAL Overhead

Encode + CRC32 append

1.4ms

Crash Recovery

10K record deterministic replay

Cross-validated via criterion: 34–64 ns/order across 1K–100K crossing and mixed workloads

LMAX-Inspired Data Flow

Thread 1: Ingestion

TCP Network Rx

Binary Decoder

SPSC Ring Buffer

Atomic Acquire/Release

Thread 2: Matching

Matching Engine

BTreeMap & 1M-slot Arena

mmap WAL

UDP Multicast

Path to Sub-Microsecond

Phase 1: Domain Model & Correctness

Proptest Baseline

Established the baseline logic with HashMap and VecDeque. Fixed-point i64 prices were implemented on day one to prevent IEEE 754 floating-point rounding errors.
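A minimal sketch of the fixed-point idea, assuming a scale of 10,000 ticks per unit. The `Price` type, the `SCALE` constant, and the helper names are hypothetical; the source only states that prices are stored as i64.

```rust
use std::ops::Add;

/// Ticks per whole currency unit. This scale is an assumption for the sketch.
pub const SCALE: i64 = 10_000;

/// Hypothetical fixed-point price: an integer count of 1/SCALE units.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Price(pub i64);

impl Price {
    /// Build a price from whole units plus a fractional part in ticks.
    pub fn from_parts(units: i64, ticks: i64) -> Price {
        Price(units * SCALE + ticks)
    }
}

impl Add for Price {
    type Output = Price;

    /// Integer addition is exact, so no rounding step exists.
    fn add(self, rhs: Price) -> Price {
        Price(self.0 + rhs.0)
    }
}

/// The same sum in f64, to show the rounding the integer form avoids.
pub fn float_sum_is_exact() -> bool {
    0.1_f64 + 0.2_f64 == 0.3_f64
}
```

With this layout, 0.1 + 0.2 is 1,000 ticks + 2,000 ticks = 3,000 ticks exactly, while the f64 comparison above is false. Integer keys also give a total order, which sorted price levels rely on.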

Phase 2: Zero Hot-Path Allocations

-58% Cancel Latency

Replaced the HashMap and VecDeque collections with a pre-allocated 1M-slot arena (object pool) and custom intrusive doubly linked lists, removing allocator calls from the hot path.

Phase 3: Sorted Price Levels

-67% Sweep Latency

Swapped HashMaps for BTreeMaps. Best-price recomputation went from an O(n) linear scan to an O(log n) tree lookup upon level exhaustion.

Phase 4: Lock-Free Concurrency

8.8x Throughput

Implemented a Disruptor-style SPSC ring buffer for inter-thread communication. Engineered with CachePadded atomics and Acquire/Release memory ordering.

Phase 5: Binary Protocol & Multicast

Zero-copy codecs

Replaced JSON with custom fixed-size little-endian structs. Added TCP ingestion and UDP multicast broadcasting of execution reports.
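A fixed-size little-endian message can be sketched as follows. The 25-byte layout, the `NewOrder` fields, and the function names are all assumptions for illustration; the project's actual wire format is not shown in the source. This version copies fields out of the buffer, whereas a zero-copy codec would read them in place.

```rust
/// Hypothetical wire size: 8 (id) + 8 (price) + 8 (qty) + 1 (side).
pub const ORDER_LEN: usize = 25;

/// Hypothetical order message; field set and order are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NewOrder {
    pub order_id: u64,
    pub price: i64, // fixed-point ticks
    pub qty: u64,
    pub side: u8, // 0 = buy, 1 = sell
}

/// Copy 8 bytes starting at `at` into a fixed array.
fn read8(buf: &[u8], at: usize) -> [u8; 8] {
    let mut out = [0u8; 8];
    out.copy_from_slice(&buf[at..at + 8]);
    out
}

/// Encode into a stack buffer: no heap allocation, fixed offsets.
pub fn encode(o: &NewOrder) -> [u8; ORDER_LEN] {
    let mut buf = [0u8; ORDER_LEN];
    buf[0..8].copy_from_slice(&o.order_id.to_le_bytes());
    buf[8..16].copy_from_slice(&o.price.to_le_bytes());
    buf[16..24].copy_from_slice(&o.qty.to_le_bytes());
    buf[24] = o.side;
    buf
}

/// Decode a message; returns None unless the length matches exactly.
pub fn decode(buf: &[u8]) -> Option<NewOrder> {
    if buf.len() != ORDER_LEN {
        return None;
    }
    Some(NewOrder {
        order_id: u64::from_le_bytes(read8(buf, 0)),
        price: i64::from_le_bytes(read8(buf, 8)),
        qty: u64::from_le_bytes(read8(buf, 16)),
        side: buf[24],
    })
}
```

Because every message has the same length, the decoder needs no delimiter scanning or length parsing, which is a large part of the saving over a text format.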

Phase 6: Deterministic Recovery

56ns WAL append

Added a memory-mapped Write-Ahead Log (WAL) and bincode snapshots. Replaying a 10K-record log restores bit-exact state in under 1.4 milliseconds after a crash.

Phase 7: Observability & Proof

500ns P99 / 4.7M Ops

Eliminated the final Vec<Fill> allocation via borrowed slices. Verified architecture limits using HdrHistogram and a custom load generator.
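The borrowed-slice pattern can be sketched like this. `ToyEngine`, its fields, and the simplified ask list are hypothetical stand-ins, not the project's matching logic; the point is only that the fill buffer is owned by the engine, cleared on each call, and lent out as `&[Fill]`.

```rust
/// Hypothetical fill record; the real field set is not shown in the source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Fill {
    pub maker_id: u64,
    pub price: i64,
    pub qty: u64,
}

/// Toy engine: resting asks as (id, price, qty), assumed sorted best-first.
pub struct ToyEngine {
    asks: Vec<(u64, i64, u64)>,
    fills: Vec<Fill>, // reusable scratch buffer
}

impl ToyEngine {
    pub fn with_capacity(max_fills: usize) -> ToyEngine {
        ToyEngine { asks: Vec::new(), fills: Vec::with_capacity(max_fills) }
    }

    /// Caller is assumed to add asks in ascending price order.
    pub fn rest_ask(&mut self, id: u64, price: i64, qty: u64) {
        self.asks.push((id, price, qty));
    }

    pub fn fill_capacity(&self) -> usize {
        self.fills.capacity()
    }

    /// Match a buy order and lend out the fills without returning a new Vec.
    pub fn buy(&mut self, limit: i64, mut qty: u64) -> &[Fill] {
        self.fills.clear(); // keeps the existing capacity
        let mut done = 0; // number of fully consumed asks
        while qty > 0 && done < self.asks.len() && self.asks[done].1 <= limit {
            let (id, price, avail) = self.asks[done];
            let take = qty.min(avail);
            self.fills.push(Fill { maker_id: id, price, qty: take });
            self.asks[done].2 -= take;
            qty -= take;
            if self.asks[done].2 == 0 {
                done += 1;
            }
        }
        self.asks.drain(..done);
        &self.fills
    }
}
```

The borrow checker enforces the contract: the caller must finish with the returned slice before calling `buy` again, so the buffer can be reused safely as long as the fill count stays within the reserved capacity.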

Systems Engineering

Lock-Free SPSC Pipeline

A Disruptor-style ring buffer bridges ingestion to matching. Engineered with atomic Acquire/Release memory ordering and CachePadded structures, delivering an 8.8x throughput increase over standard mpsc channels.
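A minimal single-producer, single-consumer ring in safe Rust, as a sketch of the Acquire/Release handshake. It assumes a `u64` payload stored in atomic slots so that no `unsafe` is needed, and it uses a hand-rolled `Padded` wrapper as a stand-in for `CachePadded`. The `Ring` type and its method names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Stand-in for CachePadded: keeps each cursor on its own 64-byte line.
#[repr(align(64))]
struct Padded(AtomicUsize);

pub struct Ring {
    slots: Box<[AtomicU64]>,
    mask: usize,
    head: Padded, // next index to read; written only by the consumer
    tail: Padded, // next index to write; written only by the producer
}

impl Ring {
    /// Capacity must be a power of two so indices wrap with a mask.
    pub fn new(capacity: usize) -> Ring {
        assert!(capacity.is_power_of_two());
        Ring {
            slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
            mask: capacity - 1,
            head: Padded(AtomicUsize::new(0)),
            tail: Padded(AtomicUsize::new(0)),
        }
    }

    /// Producer side. Returns the value back when the ring is full.
    pub fn try_push(&self, value: u64) -> Result<(), u64> {
        let tail = self.tail.0.load(Ordering::Relaxed);
        // Acquire pairs with the consumer's Release store to `head`.
        let head = self.head.0.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == self.slots.len() {
            return Err(value);
        }
        self.slots[tail & self.mask].store(value, Ordering::Relaxed);
        // Release publishes the slot write before the new tail is visible.
        self.tail.0.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    /// Consumer side. Returns None when the ring is empty.
    pub fn try_pop(&self) -> Option<u64> {
        let head = self.head.0.load(Ordering::Relaxed);
        // Acquire pairs with the producer's Release store to `tail`.
        let tail = self.tail.0.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let value = self.slots[head & self.mask].load(Ordering::Relaxed);
        // Release tells the producer the slot may be overwritten.
        self.head.0.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}
```

Each cursor has exactly one writer, so no compare-and-swap is required. A full ring surfaces as `Err` from `try_push`, which is the point where a caller can stop reading from the socket and let back-pressure build.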

Arena Memory & Intrusive Lists

The order book is backed by a pre-allocated 1M-slot arena. Custom intrusive doubly linked lists keep the hot path free of allocator calls. Cancel complexity drops from O(n) to O(1) via direct index unlinking.
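The arena and intrusive list can be sketched as below, assuming `u32` slot indices and a `NIL` sentinel. The `Arena`, `Level`, and `Slot` names are hypothetical, and the sketch does not guard against unlinking a slot that is already free.

```rust
/// Sentinel index meaning "no slot"; a hypothetical choice for this sketch.
const NIL: u32 = u32::MAX;

#[derive(Clone, Copy)]
struct Slot {
    order_id: u64,
    prev: u32,
    next: u32,
}

/// Pre-allocated pool; free slots are chained through their `next` field.
pub struct Arena {
    slots: Vec<Slot>,
    free_head: u32,
}

/// One price level: a FIFO queue held as head/tail indices into the arena.
pub struct Level {
    head: u32,
    tail: u32,
}

impl Level {
    pub fn new() -> Level {
        Level { head: NIL, tail: NIL }
    }
}

impl Arena {
    /// Allocate every slot up front and thread them onto the free list.
    pub fn with_capacity(n: u32) -> Arena {
        let slots = (0..n)
            .map(|i| Slot { order_id: 0, prev: NIL, next: if i + 1 < n { i + 1 } else { NIL } })
            .collect();
        Arena { slots, free_head: if n > 0 { 0 } else { NIL } }
    }

    /// Append an order to the back of a level; None when the pool is full.
    pub fn push_back(&mut self, level: &mut Level, order_id: u64) -> Option<u32> {
        let idx = self.free_head;
        if idx == NIL {
            return None;
        }
        self.free_head = self.slots[idx as usize].next;
        self.slots[idx as usize] = Slot { order_id, prev: level.tail, next: NIL };
        if level.tail == NIL {
            level.head = idx;
        } else {
            self.slots[level.tail as usize].next = idx;
        }
        level.tail = idx;
        Some(idx)
    }

    /// O(1) cancel: patch the two neighbours, then recycle the slot.
    pub fn unlink(&mut self, level: &mut Level, idx: u32) {
        let Slot { prev, next, .. } = self.slots[idx as usize];
        if prev == NIL {
            level.head = next;
        } else {
            self.slots[prev as usize].next = next;
        }
        if next == NIL {
            level.tail = prev;
        } else {
            self.slots[next as usize].prev = prev;
        }
        self.slots[idx as usize].prev = NIL;
        self.slots[idx as usize].next = self.free_head;
        self.free_head = idx;
    }

    /// Walk a level front to back (for inspection, not the hot path).
    pub fn order_ids(&self, level: &Level) -> Vec<u64> {
        let mut out = Vec::new();
        let mut cur = level.head;
        while cur != NIL {
            out.push(self.slots[cur as usize].order_id);
            cur = self.slots[cur as usize].next;
        }
        out
    }
}
```

The index returned by `push_back` is what makes the cancel O(1): an order-id to index lookup leads straight to the slot, with no scan of the queue.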

BTreeMap Best-Price Tracking

Replaced naive HashMaps with BTreeMaps for sorted price levels. Best-price recomputation upon level exhaustion improved from an O(n) linear scan to an O(log n) tree lookup, cutting multi-level sweep latency by 67%.
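A sketch of why the sorted map helps during a sweep, assuming one aggregate quantity per price level. The `Asks` type and its methods are hypothetical; the real book stores order queues per level.

```rust
use std::collections::BTreeMap;

/// Ask side keyed by fixed-point price; value is total resting quantity.
pub struct Asks {
    levels: BTreeMap<i64, u64>,
}

impl Asks {
    pub fn new() -> Asks {
        Asks { levels: BTreeMap::new() }
    }

    pub fn add(&mut self, price: i64, qty: u64) {
        *self.levels.entry(price).or_insert(0) += qty;
    }

    /// Best ask is the smallest key: an O(log n) walk to the leftmost entry.
    pub fn best(&self) -> Option<i64> {
        self.levels.keys().next().copied()
    }

    /// Consume up to `qty` from the best level upward; returns the amount filled.
    pub fn sweep(&mut self, mut qty: u64) -> u64 {
        let mut filled = 0;
        while qty > 0 {
            // After a level is removed, the next best is again the first entry.
            let mut entry = match self.levels.first_entry() {
                Some(e) => e,
                None => break,
            };
            let take = qty.min(*entry.get());
            *entry.get_mut() -= take;
            filled += take;
            qty -= take;
            if *entry.get() == 0 {
                entry.remove();
            }
        }
        filled
    }
}
```

With a HashMap, each exhausted level would force a scan over every remaining price to find the new minimum. Here the new best price falls out of the tree's ordering.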

Cache-Line Optimization

Shared structs are aligned and padded to 64 bytes (#[repr(C, align(64))]) so that each occupies its own x86 cache line. This prevents false sharing, where a write by one thread invalidates a cache line that another thread is using, during multi-threaded ring buffer access.
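The effect of the attribute on layout can be checked directly. `PaddedCounter` and `Cursors` are hypothetical names for this sketch.

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicUsize;

/// A counter forced onto its own 64-byte cache line.
#[repr(C, align(64))]
pub struct PaddedCounter {
    pub value: AtomicUsize,
}

/// Two cursors written by different threads, as in an SPSC ring.
pub struct Cursors {
    pub head: PaddedCounter,
    pub tail: PaddedCounter,
}

/// Returns (size of one counter, its alignment, size of the pair).
pub fn layout() -> (usize, usize, usize) {
    (
        size_of::<PaddedCounter>(),
        align_of::<PaddedCounter>(),
        size_of::<Cursors>(),
    )
}
```

`align(64)` rounds the struct size up to a multiple of 64, so the counter occupies one full line and the pair occupies two. A struct whose fields exceed 64 bytes would span more than one line, so the attribute alone does not guarantee a single-line footprint.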

Deterministic Recovery

Load the latest snapshot, replay the WAL, resume. Recovery completes in under 1.4 milliseconds, and the recovered book is bit-for-bit identical to the pre-crash state.
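The snapshot-plus-replay sequence can be sketched as follows, assuming each event carries a sequence number and the snapshot records the last sequence it includes. The `Book`, `Event`, and `recover` names are hypothetical, and the book is reduced to an order-id to quantity map.

```rust
use std::collections::BTreeMap;

/// Hypothetical WAL event with a monotonically increasing sequence number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Add { seq: u64, order_id: u64, qty: u64 },
    Cancel { seq: u64, order_id: u64 },
}

impl Event {
    pub fn seq(&self) -> u64 {
        match *self {
            Event::Add { seq, .. } | Event::Cancel { seq, .. } => seq,
        }
    }
}

/// Simplified book state: resting quantity per order id.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct Book {
    pub last_seq: u64,
    pub resting: BTreeMap<u64, u64>,
}

impl Book {
    /// Pure state transition: same event on same state gives same result.
    pub fn apply(&mut self, ev: Event) {
        match ev {
            Event::Add { order_id, qty, .. } => {
                self.resting.insert(order_id, qty);
            }
            Event::Cancel { order_id, .. } => {
                self.resting.remove(&order_id);
            }
        }
        self.last_seq = ev.seq();
    }
}

/// Start from the snapshot, then apply only events newer than it.
pub fn recover(snapshot: &Book, wal: &[Event]) -> Book {
    let mut book = snapshot.clone();
    for ev in wal.iter().filter(|e| e.seq() > snapshot.last_seq) {
        book.apply(*ev);
    }
    book
}
```

Determinism here rests on `apply` depending only on the current state and the event, with no clocks, randomness, or hash-order iteration involved.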

Data Integrity

Every WAL record carries a CRC32 checksum. Records corrupted by power loss or disk faults are detected on replay, and the log is truncated to the last valid entry.
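A sketch of checksum-guarded replay, assuming a record framing of length, CRC, then payload, and a bitwise CRC-32 written out by hand to stay dependency-free. The framing, the zero-length end marker, and the function names are assumptions; the project's on-disk format is not shown in the source.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, else zero
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn read_u32(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Hypothetical framing: [len: u32 LE][crc: u32 LE][payload bytes].
pub fn append(log: &mut Vec<u8>, payload: &[u8]) {
    log.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    log.extend_from_slice(&crc32(payload).to_le_bytes());
    log.extend_from_slice(payload);
}

/// Returns the valid payloads and the byte offset where the valid prefix ends.
pub fn replay(log: &[u8]) -> (Vec<&[u8]>, usize) {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= 8 {
        let len = read_u32(log, pos) as usize;
        let stored_crc = read_u32(log, pos + 4);
        let start = pos + 8;
        // Assumption: zero length marks unwritten (zero-filled) space.
        if len == 0 || log.len() - start < len {
            break; // end of data or torn final record
        }
        let payload = &log[start..start + len];
        if crc32(payload) != stored_crc {
            break; // corrupted record: stop at the last valid entry
        }
        records.push(payload);
        pos = start + len;
    }
    (records, pos)
}
```

After replay, truncating the log to the returned offset discards the damaged tail, so later appends continue from a clean boundary.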

Graceful Back-Pressure

When the matching engine falls behind the ingestion rate, the 65,536-slot ring buffer absorbs burst traffic. Once it fills, inbound orders are delayed via TCP flow control, never dropped.

Proven Under Pressure

Order Cancellation

Middle of 1,000 resting orders

2.16 µs → 0.91 µs

-58%

Multi-Level Price Sweep

100 ask levels exhausted sequentially

45.14 µs → 14.73 µs

-67%

Worst-Case Cancel

1,000 orders across 1,000 distinct prices

696.94 µs → 124.07 µs

-82%

Cross-Thread Throughput

SPSC ring buffer vs std::sync::mpsc

16.35 ms → 1.85 ms

8.8x

137

Tests Passing

Unsafe Blocks

Each with documented safety invariants

Zero

External Dependencies

On the matching hot path

Bit-Exact

Deterministic Replay

Same input always produces same state

Production Horizon

Deliberately out of scope for this project, but the design keeps these production realities in mind.

DPDK / AF_XDP

Kernel-bypass networking to cut wire-to-wire latency

Multi-Instrument Routing

Gateway distributing orders to per-instrument engine instances

Aeron Transport

Reliable UDP multicast with built-in flow control and backpressure

FIX Protocol Gateway

Industry-standard order entry interface for external client connectivity

Hot-Hot Failover

Secondary engine replaying the same WAL for zero-downtime recovery
