🔬 Implementation Deep Dive

Technical details and performance engineering


Implementation Details

This document explains the key implementation decisions and technical details of the lockless MPMC queue.

Core Algorithm

Sequence-Based Coordination

The algorithm uses sequence numbers instead of flags for coordination:



struct Slot<T> {
    sequence: AtomicUsize,              // Single coordination point
    data: UnsafeCell<MaybeUninit<T>>,   // Raw storage
}

Benefits:

  - A single atomic field per slot coordinates producers and consumers
  - Sequence values only advance, which rules out the ABA problem
  - Slot availability is checked with one atomic load instead of multiple flags

Memory Ordering

Careful memory ordering ensures correctness:



// Producer stores data BEFORE updating sequence
unsafe { (*slot.data.get()).write(item); }
slot.sequence.store(new_seq, Ordering::Release);
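The consumer side mirrors this pairing; a minimal sketch with illustrative names (not the crate's actual API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Consumer-side counterpart: the Acquire load pairs with the producer's
// Release store above, so observing the expected sequence guarantees
// the producer's data write is also visible to this thread.
fn slot_ready(sequence: &AtomicUsize, expected: usize) -> bool {
    sequence.load(Ordering::Acquire) == expected
}

fn main() {
    let seq = AtomicUsize::new(1); // producer published with Release
    assert!(slot_ready(&seq, 1));  // consumer may now read the data
    assert!(!slot_ready(&seq, 2)); // next lap's value not yet published
}
```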

ABA Problem Prevention

Sequence numbers prevent the ABA problem by always advancing:



Slot 0: 0 → 1 → 8 → 9 → 16 → 17 → ...
        ↑   ↑   ↑   ↑    ↑    ↑
       Init Prod Cons Prod Cons Prod

Even if the same logical state occurs, the sequence number is different.
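The lap schedule above can be sketched as plain arithmetic (helper names are illustrative; capacity 8 as in the diagram):

```rust
// Producer expects `seq == pos` and publishes `pos + 1`;
// consumer expects `seq == pos + 1` and releases `pos + capacity`.
fn next_producer_seq(pos: usize) -> usize {
    pos.wrapping_add(1)
}
fn next_consumer_seq(pos: usize, capacity: usize) -> usize {
    pos.wrapping_add(capacity)
}

fn main() {
    let capacity = 8;
    let mut seq = 0; // slot 0 after initialization
    let mut pos = 0;
    for _ in 0..3 {
        assert_eq!(seq, pos);                   // producer sees slot free
        seq = next_producer_seq(pos);           // 1, 9, 17, ...
        assert_eq!(seq, pos + 1);               // consumer sees slot full
        seq = next_consumer_seq(pos, capacity); // 8, 16, 24, ...
        pos += capacity;                        // slot 0 reused next lap
    }
    assert_eq!(seq, 24); // never repeats an earlier value
}
```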

Power-of-2 Optimization

Capacity is rounded to power-of-2 for fast indexing:



// Instead of expensive modulo:
position % capacity  // ~20-40 cycles
// Use fast bitwise AND:
position & mask      // ~1 cycle
// Example: capacity=1024, mask=1023
// 5000 % 1024 = 904  (slow)
// 5000 & 1023 = 904  (fast)

Capacity is automatically rounded up to the next power-of-2.
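The rounding is a single standard-library call; a plausible sketch:

```rust
// usize::next_power_of_two rounds up: 1000 -> 1024, 1024 -> 1024.
fn round_capacity(requested: usize) -> usize {
    requested.next_power_of_two()
}

fn main() {
    let capacity = round_capacity(1000);
    let mask = capacity - 1;
    assert_eq!(capacity, 1024);
    assert_eq!(5000 % capacity, 5000 & mask); // bitwise AND matches modulo
}
```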

Cache-Line Alignment

Prevents false sharing by separating producer and consumer positions:



#[repr(align(64))]  // 64-byte cache line alignment
struct ProducerPos {
    head: AtomicUsize,
    // 56 bytes padding
}
#[repr(align(64))]
struct ConsumerPos {
    tail: AtomicUsize, 
    // 56 bytes padding
}

Benefit: ~40% performance improvement by eliminating cache invalidations between cores.
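A quick way to verify the layout, assuming the struct definitions above:

```rust
use std::sync::atomic::AtomicUsize;

// #[repr(align(64))] pads each struct to a full cache line, so the two
// hot counters can never share a line and ping-pong between cores.
#[repr(align(64))]
struct ProducerPos {
    head: AtomicUsize,
}

#[repr(align(64))]
struct ConsumerPos {
    tail: AtomicUsize,
}

fn main() {
    // 8 bytes of counter + 56 bytes of implicit padding = 64 bytes.
    assert_eq!(std::mem::size_of::<ProducerPos>(), 64);
    assert_eq!(std::mem::align_of::<ProducerPos>(), 64);
    assert_eq!(std::mem::size_of::<ConsumerPos>(), 64);
    let _ = ProducerPos { head: AtomicUsize::new(0) };
    let _ = ConsumerPos { tail: AtomicUsize::new(0) };
}
```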

Atomic Operations

Compare-Exchange-Weak

Uses compare_exchange_weak instead of compare_exchange: the weak form is allowed to fail spuriously, which is harmless inside a retry loop and compiles to cheaper code on LL/SC architectures such as ARM.
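A sketch of what the position-claim loop likely looks like (illustrative, not the crate's exact code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// compare_exchange_weak may fail spuriously; inside a retry loop that
// only costs one extra iteration, so the cheaper weak form is preferred.
fn claim(head: &AtomicUsize) -> usize {
    loop {
        let pos = head.load(Ordering::Relaxed);
        match head.compare_exchange_weak(
            pos,
            pos.wrapping_add(1),
            Ordering::Relaxed, // success: only slot ownership is at stake
            Ordering::Relaxed, // failure: just retry with a fresh value
        ) {
            Ok(claimed) => return claimed,
            Err(_) => continue, // lost the race, or spurious failure
        }
    }
}

fn main() {
    let head = AtomicUsize::new(0);
    assert_eq!(claim(&head), 0);
    assert_eq!(claim(&head), 1);
    assert_eq!(head.load(Ordering::Relaxed), 2);
}
```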

Memory Ordering Strategy
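A typical strategy for this family of queues, consistent with the snippets above (an assumption, not verified against the crate's source): Relaxed for position counters, Acquire for sequence loads, Release for sequence stores.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Relaxed : position counters -- they only hand out slot ownership
// Acquire : sequence loads    -- observe the publisher's data writes
// Release : sequence stores   -- publish data writes to other threads
fn publish(head: &AtomicUsize, sequence: &AtomicUsize) -> usize {
    let pos = head.fetch_add(1, Ordering::Relaxed); // claim a slot
    debug_assert_eq!(sequence.load(Ordering::Acquire), pos); // slot free
    // ... write the item into the slot here ...
    sequence.store(pos.wrapping_add(1), Ordering::Release); // publish
    pos
}

fn main() {
    let (head, seq) = (AtomicUsize::new(0), AtomicUsize::new(0));
    assert_eq!(publish(&head, &seq), 0);
    assert_eq!(seq.load(Ordering::Acquire), 1);
}
```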

Memory Safety

Achieves memory safety without garbage collection:



struct Slot<T> {
    sequence: AtomicUsize,              // Coordination
    data: UnsafeCell<MaybeUninit<T>>,   // Safe uninitialized storage
}

Safety invariants:

  1. Data written only when sequence == expected
  2. Data read only when sequence == expected + 1
  3. Sequence coordination prevents access races
  4. Proper cleanup in Drop implementation

Send/Sync traits: Safely implemented using sequence coordination to prevent data races.
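The impls likely look something like this sketch (an assumption; UnsafeCell makes the type !Sync by default, so the queue must opt in explicitly):

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

struct Slot<T> {
    sequence: AtomicUsize,
    data: UnsafeCell<MaybeUninit<T>>,
}

// Safety argument: the sequence protocol guarantees `data` is never
// accessed by two threads at once, and T: Send because items move
// between producer and consumer threads.
unsafe impl<T: Send> Send for Slot<T> {}
unsafe impl<T: Send> Sync for Slot<T> {}

fn main() {
    fn assert_send_sync<S: Send + Sync>() {}
    assert_send_sync::<Slot<u64>>(); // compiles only thanks to the impls
}
```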

Performance Optimizations

Branch Prediction

Hot paths are laid out so the common case (a successful operation) is the straight-line, fall-through branch, which keeps the CPU branch predictor accurate.
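One common way to express this in Rust is to keep the success path straight-line and move the rare path out of line; a hedged sketch (names illustrative):

```rust
// #[cold] tells the compiler this function is rarely called, so it is
// moved out of the hot path and the predictor favors the success branch.
#[cold]
fn on_contention() {
    std::hint::spin_loop(); // back off briefly before retrying
}

fn try_op(slot_available: bool) -> bool {
    if slot_available {
        true // common case: falls straight through
    } else {
        on_contention(); // rare case: contention handling out of line
        false
    }
}

fn main() {
    assert!(try_op(true));
    assert!(!try_op(false));
}
```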

CPU Cache Utilization

Slots are stored contiguously, so successive queue positions walk memory sequentially and benefit from hardware prefetching, while the hot head/tail counters live on separate cache lines as described above.

Common Pitfalls

Sequence Number Overflow

Wrapping arithmetic is intentional and safe - maintains ordering even when numbers wrap around.
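A small demonstration of why wrapping is safe here:

```rust
// Positions and sequences use wrapping arithmetic: hitting usize::MAX
// wraps to 0 instead of panicking, and because producer and consumer
// compute expected values the same way, their comparisons still agree.
fn main() {
    let pos: usize = usize::MAX;        // position about to wrap
    let seq: usize = usize::MAX;        // sequence for the same lap
    assert_eq!(seq, pos);               // availability check still holds
    let next_pos = pos.wrapping_add(1); // wraps to 0, no overflow panic
    let next_seq = seq.wrapping_add(1); // producer's published value
    assert_eq!(next_pos, 0);
    assert_eq!(next_seq, next_pos);     // consumer's expectation agrees
}
```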

Memory Ordering

Critical to use Release ordering when storing sequences to make data writes visible.

Capacity Requirements

Capacity is automatically rounded up to next power-of-2 for optimal performance.

Performance Tips

Profiling

Common Issues

Debugging


// Queue provides introspection methods
queue.capacity();  // Power-of-2 capacity
queue.len();       // Approximate current length
queue.is_empty();  // Snapshot view
queue.is_full();   // Snapshot view

SIMD Optimizations (Nightly Rust)

The MPMC queue includes SIMD (Single Instruction Multiple Data) optimizations for u64 data types when compiled with nightly Rust and the simd feature.

SIMD Architecture


use std::simd::{u64x4, SimdPartialEq};
struct SimdMpmcQueue<T> {
    buffer: Box<[SimdSlot<T>]>,
    simd_batch_size: usize,    // 4 for u64x4 SIMD width
    // ... other fields
}

Vectorized Operations

Batch Sequence Checking

The SIMD implementation can check 4 slot sequences simultaneously:



// Load 4 sequence numbers using SIMD
let sequences = unsafe { self.load_sequences_simd(head, 4) };
let expected = self.generate_expected_sequences_simd(head, 4);
// Compare all 4 sequences at once
let mask = sequences.simd_eq(expected);
if mask.all() {
    // All 4 slots are available
}

Batch Operations

Process 4 u64 elements in a single operation:



// Send batch of 4 u64 values
let batch = vec![1u64, 2u64, 3u64, 4u64];
producer.send_batch(&batch)?;
// Receive batch of 4 u64 values
let mut buffer = vec![0u64; 4];
let count = consumer.recv_batch(&mut buffer);

SIMD Performance Characteristics

Throughput Improvements:


Memory Requirements:

SIMD Algorithm Details

Vectorized Availability Check

Instead of checking each slot individually:



// Traditional approach (4 separate checks)
for i in 0..4 {
    let slot = &buffer[(head + i) & mask];
    if slot.sequence.load(Acquire) == head + i {
        // Slot available
    }
}
// SIMD approach (single vectorized check)
let sequences = u64x4::from_array([
    slot0.sequence.load(Acquire) as u64,
    slot1.sequence.load(Acquire) as u64,
    slot2.sequence.load(Acquire) as u64,
    slot3.sequence.load(Acquire) as u64,
]);
let expected = u64x4::from_array(
    [head as u64, head as u64 + 1, head as u64 + 2, head as u64 + 3],
);
let all_ready = sequences.simd_eq(expected).all();

Batch Memory Operations

Leverages CPU's vectorized memory instructions:



// Store 4 u64 values efficiently
let simd_data = u64x4::from_slice(items);
// Hardware can optimize this to vectorized stores

SIMD Usage Guidelines

Best Performance:

  - u64 payloads sent and received in batches that are multiples of 4 (the u64x4 width)
  - Steady, high-volume traffic where per-element overhead can be amortized across a batch

When to Use Regular Queue:

  - Element types other than u64, irregular or single-item traffic, or builds on stable Rust

Hybrid Strategy:


// SIMD queue provides both interfaces
let chunks = data.chunks_exact(4);
let tail = chunks.remainder();
for chunk in chunks {
    // Use SIMD batch operations for each full group of 4
    producer.send_batch(chunk)?;
}
for &item in tail {
    // Fall back to single operations for leftover elements
    producer.send(item)?;
}

SIMD Compilation Requirements

Toolchain:


rustup default nightly

Features:


[features]
simd = []
default = ["simd"]

Build Command:


cargo +nightly build --features simd

The SIMD implementation maintains all wait-free, lockless guarantees while providing significant performance improvements for suitable workloads.