MPMC Queue Algorithm
This document explains the core algorithm and data structures used in the lockless MPMC queue.
Algorithm Background
The queue combines ideas from several research algorithms:
- Michael & Scott (1996): Lock-free queue with atomic operations
- LMAX Disruptor: Sequence-based coordination and cache optimization
- 1024cores research: Memory ordering optimization
Key Innovations
- Ring Buffer: Fixed-size buffer instead of linked list for better cache performance
- Sequence Numbers: Each slot uses atomic sequence numbers for coordination
- Power-of-2 Sizing: Enables fast bitwise AND instead of expensive modulo
- Cache-Line Alignment: Separate cache lines for producer/consumer positions
Memory Layout
struct MpmcQueue<T> {
buffer: Box<[Slot<T>]>,
capacity: usize,
mask: usize, // capacity - 1 for fast indexing
producer_pos: ProducerPos, // 64-byte aligned
consumer_pos: ConsumerPos, // 64-byte aligned
}
struct Slot<T> {
sequence: AtomicUsize, // Coordination mechanism
data: UnsafeCell<MaybeUninit<T>>, // Storage
}
Cache Optimization
- Producer and consumer positions are in separate cache lines (64-byte aligned)
- Prevents false sharing between producers and consumers
- Each slot is independently cacheable
Sequence Number States
Slot States (8-element queue example):
Initial: [0][1][2][3][4][5][6][7] ← Sequence numbers
[∅][∅][∅][∅][∅][∅][∅][∅] ← Data (empty)
After producer writes to slot 0:
[1][1][2][3][4][5][6][7] ← Sequences
[A][∅][∅][∅][∅][∅][∅][∅] ← Data
After consumer reads from slot 0:
[8][1][2][3][4][5][6][7] ← Sequences
[∅][∅][∅][∅][∅][∅][∅][∅] ← Data (consumed)
Sequence States:
• seq == index: Ready for producer
• seq == index + 1: Ready for consumer
• seq == index + capacity: Available after wrap-around
Operation Flow
Producer (Send)
- Load current head position
- Calculate slot index using
head & mask - Check if slot sequence equals expected value
- If yes, try CAS to claim slot
- Store data and update sequence to signal completion
Consumer (Receive)
- Load current tail position
- Calculate slot index using
tail & mask - Check if slot sequence equals
tail + 1(data ready) - If yes, try CAS to claim slot
- Read data and advance sequence to mark slot available
Performance Characteristics
- Sub-10ns latency per operation
- Linear scaling up to 8 producer/consumer threads
- No convoy effects due to lockless design
- Optimized cache access patterns minimize memory latency
Load Balancing
The queue automatically balances work between consumers based on their processing speed:
- Fast consumers naturally get more items due to higher CAS success rates
- Slow consumers are never starved due to temporal windows when fast consumers are processing
- No explicit priority system needed - pure competition through atomic operations
- Queue buffering prevents deadlock scenarios
Key Benefits
- Automatic fairness: Work distribution matches consumer capabilities
- No starvation: All consumers eventually get items to process
- Natural backpressure: Queue fills if consumers can't keep up with producers
- Scalable: Performance scales linearly with thread count up to hardware limits