SIMD Optimization Guide
This guide covers the SIMD-optimized MPMC queue implementation, providing detailed usage instructions, performance analysis, and best practices for achieving optimal throughput.
Overview
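The SIMD-optimized MPMC queue (SimdMpmcQueue) uses vectorized instructions to process multiple elements at once. It targets u64 numeric workloads: in the benchmarks below, batch operations improve throughput by 88-133% at 4-8 producer/consumer pairs, but run slower than the regular queue with a single pair.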
Quick Start
Prerequisites
Rust Toolchain:
rustup toolchain install nightly
rustup default nightly
Project Configuration:
# Cargo.toml
[features]
simd = []
default = ["simd"]
[dependencies]
mpmc-std = { version = "0.1.0", features = ["simd"] }
Basic Usage
use mpmc_std::simd_queue::{SimdMpmcQueue, SimdProducer, SimdConsumer};
use std::sync::Arc;
fn main() {
    // Create SIMD-optimized queue for u64 data
    let queue = Arc::new(SimdMpmcQueue::<u64>::new(64));
    let producer = SimdProducer::new(Arc::clone(&queue));
    let consumer = SimdConsumer::new(Arc::clone(&queue));

    // Batch operations (optimal performance)
    let data = vec![100u64, 200u64, 300u64, 400u64];
    producer.send_batch(&data).unwrap();
    let mut buffer = vec![0u64; 4];
    let count = consumer.recv_batch(&mut buffer);
    assert_eq!(count, 4);
    assert_eq!(buffer, data);

    // Single operations (still optimized)
    producer.send(500u64).unwrap();
    assert_eq!(consumer.recv(), Some(500u64));
}
Performance Characteristics
Benchmark Results
Single-Threaded Performance:
Scenario            Regular Queue   SIMD Queue    Improvement
------------------------------------------------------------
Single operations   41M ops/sec     46M ops/sec   +12%
Batch operations    -               18M ops/sec   N/A (4.5 ns/element)
Multi-Threaded Performance:
Thread Pairs   Regular Queue   SIMD Batch    Improvement
----------------------------------------------------------
1 pair         43M ops/sec     24M ops/sec   -44% (overhead)
2 pairs        31M ops/sec     31M ops/sec   ~0% (equal)
4 pairs        17M ops/sec     32M ops/sec   +88%
8 pairs        12M ops/sec     28M ops/sec   +133%
Key Insights:
- SIMD batching excels in high-contention scenarios (4+ producer/consumer pairs)
- Single-threaded batch operations have higher per-element overhead
- Best performance with u64 data in exactly 4-element batches
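The Improvement column can be reproduced from the throughput figures in the tables. The sketch below shows the arithmetic; `improvement_pct` is a helper name introduced here for illustration and is not part of mpmc-std.

```rust
/// Relative change from `regular` to `simd`, in percent, rounded to the
/// nearest whole number. Both inputs must use the same unit
/// (for example, millions of ops/sec).
fn improvement_pct(regular: f64, simd: f64) -> i64 {
    ((simd - regular) / regular * 100.0).round() as i64
}
```

For example, the 4-pair row works out as (32 - 17) / 17, which rounds to +88%, and the 1-pair row as (24 - 43) / 43, which rounds to -44%.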
Memory and CPU Requirements
Memory Layout:
- Minimum capacity: 8 elements (2x the 4-lane u64 SIMD width)
- Cache-line aligned slots (64-byte alignment)
- Power-of-2 capacity requirement maintained
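To make the layout constraints concrete, here is a minimal sketch of a cache-line-aligned slot and of index wrapping with a power-of-2 capacity. `Slot`, `slot_layout`, and `slot_index` are hypothetical names for illustration; the queue's real internals may differ, and the use of a bit mask is an assumption about why the power-of-2 requirement exists.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical slot type: `repr(align(64))` pads each slot to its own
/// 64-byte cache line so neighbouring slots do not share a line.
#[repr(align(64))]
struct Slot {
    value: u64,
}

/// Size and alignment of one slot, in bytes.
fn slot_layout() -> (usize, usize) {
    (size_of::<Slot>(), align_of::<Slot>())
}

/// Hypothetical ring index: with a power-of-2 capacity, wrapping an
/// ever-increasing position reduces to a single bit mask.
fn slot_index(position: usize, capacity: usize) -> usize {
    debug_assert!(capacity.is_power_of_two());
    position & (capacity - 1)
}
```

The trade-off is memory: under this layout each u64 occupies a full 64-byte line, which is worth weighing in memory-constrained environments.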
CPU Requirements:
- x86-64 with AVX2 support (u64x4 SIMD)
- ARM64 with NEON (future support)
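Whether the running CPU reports AVX2 can be checked at runtime with the standard library's `is_x86_feature_detected!` macro. The sketch below is one way to do that; `avx2_available` and `queue_hint` are illustrative names, not mpmc-std APIs, and how the library itself selects a code path is not covered here.

```rust
/// Returns true when the running CPU reports AVX2 support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available() -> bool {
    is_x86_feature_detected!("avx2")
}

/// On other architectures there is no AVX2 to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available() -> bool {
    false
}

/// Hypothetical helper: a human-readable hint for logs or diagnostics.
fn queue_hint() -> &'static str {
    if avx2_available() {
        "AVX2 detected: the SIMD queue is a candidate"
    } else {
        "AVX2 not detected: prefer the regular queue"
    }
}
```

Logging `queue_hint()` once at startup is a cheap way to rule out missing CPU support when throughput is lower than expected.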
Advanced Usage Patterns
Hybrid Processing Strategy
For mixed workloads, use both batch and single operations:
fn process_data_stream(producer: &SimdProducer<u64>, data: &[u64]) -> Result<(), ()> {
    let mut i = 0;
    // Process in SIMD batches when possible
    while i + 4 <= data.len() {
        let batch = &data[i..i + 4];
        match producer.send_batch(batch) {
            Ok(sent) => i += sent,
            Err(_) => {
                // Queue full, fall back to single operations
                producer.send(data[i]).map_err(|_| ())?;
                i += 1;
            }
        }
    }
    // Handle remaining elements individually
    while i < data.len() {
        producer.send(data[i]).map_err(|_| ())?;
        i += 1;
    }
    Ok(())
}
High-Throughput Consumer Pattern
fn consume_with_batching(consumer: &SimdConsumer<u64>) -> Vec<u64> {
    let mut results = Vec::new();
    let mut batch_buffer = vec![0u64; 4];
    loop {
        // Try batch receive first
        let batch_count = consumer.recv_batch(&mut batch_buffer);
        if batch_count > 0 {
            results.extend_from_slice(&batch_buffer[..batch_count]);
            continue;
        }
        // Fall back to single receive
        match consumer.recv() {
            Some(item) => results.push(item),
            None => break, // Queue empty
        }
    }
    results
}
Multi-Producer Coordination
use std::sync::Arc;
use std::thread;

fn spawn_simd_producers(queue: Arc<SimdMpmcQueue<u64>>, data_sets: Vec<Vec<u64>>) {
    let handles: Vec<_> = data_sets.into_iter().enumerate().map(|(id, data)| {
        let producer = SimdProducer::new(Arc::clone(&queue));
        thread::spawn(move || {
            for chunk in data.chunks(4) {
                if chunk.len() == 4 {
                    // Optimal: Full SIMD batch
                    while producer.send_batch(chunk).is_err() {
                        thread::yield_now(); // Wait for space
                    }
                } else {
                    // Partial batch: Use single operations
                    for &item in chunk {
                        while producer.send(item).is_err() {
                            thread::yield_now();
                        }
                    }
                }
            }
            println!("Producer {} completed", id);
        })
    }).collect();
    // Wait for every producer thread to finish
    for handle in handles {
        handle.join().unwrap();
    }
}
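The retry-and-yield pattern above can be tried without the SIMD queue. The sketch below swaps in the standard library's bounded `sync_channel` as a stand-in (it is not the SIMD queue and has no batch operations); `produce_and_collect` is a name made up for this example.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Spawns one producer per data set, retries with `yield_now` while the
/// bounded channel is full, and returns everything received, sorted.
fn produce_and_collect(data_sets: Vec<Vec<u64>>, capacity: usize) -> Vec<u64> {
    let (tx, rx) = sync_channel::<u64>(capacity);
    let handles: Vec<_> = data_sets
        .into_iter()
        .map(|data| {
            let tx = tx.clone();
            thread::spawn(move || {
                for item in data {
                    let mut pending = item;
                    loop {
                        match tx.try_send(pending) {
                            Ok(()) => break,
                            // Channel full: take the item back and wait for space
                            Err(TrySendError::Full(v)) => {
                                pending = v;
                                thread::yield_now();
                            }
                            // Receiver gone: nothing left to do
                            Err(TrySendError::Disconnected(_)) => return,
                        }
                    }
                }
            })
        })
        .collect();
    // Drop the original sender so the receive loop ends once producers finish
    drop(tx);
    let mut received: Vec<u64> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    received.sort_unstable();
    received
}
```

The receiver is drained before the threads are joined; joining first could spin forever, because a full channel is only relieved by a consumer.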
Performance Optimization Tips
1. Batch Size Alignment
Optimal:
// Perfect alignment for u64x4 SIMD
let data = vec![1u64, 2u64, 3u64, 4u64];
producer.send_batch(&data).unwrap(); // Single SIMD operation
Suboptimal:
// Non-aligned batch sizes
let data = vec![1u64, 2u64, 3u64]; // Falls back to single operations
producer.send_batch(&data).unwrap();
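One way to keep batches aligned is to split the input into full 4-element chunks and handle the leftover separately. A minimal sketch using the standard `chunks_exact`; `split_batches` is an illustrative helper, not an mpmc-std API.

```rust
/// Splits `data` into full batches of `width` elements plus a remainder
/// of fewer than `width` elements. Panics if `width` is 0.
fn split_batches(data: &[u64], width: usize) -> (Vec<&[u64]>, &[u64]) {
    let chunks = data.chunks_exact(width);
    // The leftover tail that did not fill a whole batch
    let remainder = chunks.remainder();
    (chunks.collect(), remainder)
}
```

With ten elements and a width of 4, this yields two full batches and a two-element remainder; the full batches can go through send_batch and the remainder through send.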
2. Queue Sizing
Recommended Capacities:
// High contention: Larger capacity reduces blocking
let queue = SimdMpmcQueue::<u64>::new(1024); // Good for 8+ threads
// Low contention: Smaller capacity improves cache locality
let queue = SimdMpmcQueue::<u64>::new(64); // Good for 2-4 threads
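Because capacities must be powers of 2 and at least the queue's minimum, a requested size can be normalized before construction. A small sketch; `round_capacity` is a hypothetical helper, and the minimum is passed in rather than hard-coded.

```rust
/// Rounds `requested` up to the next power of two, never going below
/// `min_capacity`. Overflows for values above the largest power of two
/// representable in usize.
fn round_capacity(requested: usize, min_capacity: usize) -> usize {
    requested.max(min_capacity).next_power_of_two()
}
```

For instance, a request for 100 slots becomes 128, and a request for 3 slots with a minimum of 8 becomes 8.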
3. Memory Access Patterns
Cache-Friendly:
// Process data in sequential batches
for chunk in data.chunks(4) {
    producer.send_batch(chunk)?;
}
Cache-Unfriendly:
// Random access patterns hurt SIMD performance
for &index in random_indices {
    producer.send(data[index])?; // Consider regular queue
}
4. Thread Scaling
Optimal Thread Distribution:
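let cpu_cores = num_cpus::get(); // requires the num_cpus crate
let optimal_pairs = std::cmp::min(cpu_cores, 8); // Diminishing returns after 8 pairs
// Create balanced producer/consumer pairs
// Note: each pair runs two threads, so this spawns up to 2 * optimal_pairs threads
for _ in 0..optimal_pairs {
    spawn_producer();
    spawn_consumer();
}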
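If adding the num_cpus dependency is undesirable, the standard library's `available_parallelism` gives a comparable core estimate. The sketch below also budgets two threads per pair so the total stays within the core count; that halving is an assumption of this example, not a rule stated by the benchmarks, and both function names are illustrative.

```rust
use std::thread;

/// Best-effort core count from the standard library; falls back to 1
/// when the platform cannot report it.
fn detected_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Number of producer/consumer pairs to spawn: half the cores (each
/// pair runs two threads), at least one, and at most `max_pairs`.
fn pair_count(cores: usize, max_pairs: usize) -> usize {
    (cores / 2).max(1).min(max_pairs.max(1))
}
```

A call such as `pair_count(detected_cores(), 8)` then gives the loop bound for spawning pairs.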
Compilation and Build Options
Feature Flags
[features]
default = ["simd"]
simd = []
# Optional: Disable SIMD for stable Rust
# default = []
Build Commands
# Development (with SIMD)
cargo build --features simd
# Release optimization
cargo build --release --features simd
# Benchmarking
cargo bench --features simd
# Testing
cargo test --features simd
Conditional Compilation
use std::sync::Arc;
#[cfg(feature = "simd")]
use mpmc_std::simd_queue::{SimdMpmcQueue, SimdProducer, SimdConsumer};
#[cfg(not(feature = "simd"))]
use mpmc_std::{MpmcQueue as SimdMpmcQueue, Producer as SimdProducer, Consumer as SimdConsumer};

// Code works with both variants
fn generic_processing() {
    let queue = Arc::new(SimdMpmcQueue::<u64>::new(64));
    let producer = SimdProducer::new(Arc::clone(&queue));
    // ...
}
Troubleshooting
Common Issues
1. Compilation Errors:
error[E0658]: use of unstable library feature `portable_simd`
Solution: Build with the nightly toolchain: rustup default nightly
2. Performance Lower Than Expected:
- Check batch sizes are multiples of 4
- Verify CPU supports AVX2 instructions
- Measure with high-contention scenarios (4+ producer/consumer pairs)
3. Queue Capacity Issues:
assertion failed: capacity > 0
Solution: Pass a power-of-2 capacity of at least 8; the SIMD queue enforces a minimum of 8 elements
Performance Debugging
// Add timing measurements
use std::time::Instant;
let start = Instant::now();
for batch in data.chunks(4) {
    producer.send_batch(batch)?;
}
let duration = start.elapsed();
println!("Throughput: {:.0} ops/sec",
    (data.len() as f64) / duration.as_secs_f64());
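The same measurement can be wrapped in small helpers that report both throughput and per-element cost. A sketch using only the standard library; the function names are illustrative.

```rust
use std::time::{Duration, Instant};

/// Runs `work` once and returns how long it took.
fn time_it<F: FnMut()>(mut work: F) -> Duration {
    let start = Instant::now();
    work();
    start.elapsed()
}

/// Elements processed per second.
fn ops_per_sec(elements: usize, elapsed: Duration) -> f64 {
    elements as f64 / elapsed.as_secs_f64()
}

/// Average cost of one element in nanoseconds.
fn ns_per_element(elements: usize, elapsed: Duration) -> f64 {
    elapsed.as_nanos() as f64 / elements as f64
}
```

Reporting both figures makes it easier to compare a run against tables that mix ops/sec and ns/element. Measure a release build, and repeat the run several times, since a single short measurement is noisy.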
Profiling Tools
CPU Profiling:
# Record cycle, instruction, and cache-miss counts for the benchmark run
perf record -e cycles,instructions,cache-misses cargo bench --features simd
perf report
Memory Analysis:
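# Check cache performance (run the built binary directly; valgrind does not follow child processes by default)
cargo build --release --features simd --example simd_benchmark
valgrind --tool=cachegrind target/release/examples/simd_benchmark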
When to Use SIMD vs Regular Queue
Choose SIMD Queue When:
- Processing u64 numeric data
- Batch sizes are multiples of 4
- High-contention scenarios (4+ producer/consumer pairs)
- Throughput-sensitive applications with numeric workloads
- CPU supports AVX2+ instructions
Choose Regular Queue When:
- Mixed data types or generic types
- Variable or small batch sizes
- Low-contention scenarios (1-2 producer/consumer pairs)
- Stable Rust requirement
- Memory-constrained environments
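The checklist above can be condensed into a small decision helper. This is a sketch of the guidance only, reading the contention threshold as four producer/consumer pairs per the benchmark tables; `QueueChoice` and `choose_queue` are hypothetical names, not mpmc-std APIs.

```rust
#[derive(Debug, PartialEq)]
enum QueueChoice {
    Simd,
    Regular,
}

/// Picks a queue kind from the criteria listed in this guide.
/// `nightly_ok` says whether the project may depend on nightly Rust.
fn choose_queue(
    element_is_u64: bool,
    batch_len: usize,
    thread_pairs: usize,
    nightly_ok: bool,
) -> QueueChoice {
    // Batches should be non-empty multiples of the 4-lane SIMD width
    let batch_aligned = batch_len >= 4 && batch_len % 4 == 0;
    if element_is_u64 && batch_aligned && thread_pairs >= 4 && nightly_ok {
        QueueChoice::Simd
    } else {
        QueueChoice::Regular
    }
}
```

Treat the result as a starting point for measurement rather than a verdict; the break-even point depends on the hardware and workload.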
Migration Strategy
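// Gradual migration approach: route u64 workloads through a cfg-gated alias,
// so call sites stay unchanged when the "simd" feature is toggled.
// Assumes both queue types expose new(capacity), as in the examples above.
use std::sync::Arc;
#[cfg(feature = "simd")]
use mpmc_std::simd_queue::SimdMpmcQueue as NumericQueue;
#[cfg(not(feature = "simd"))]
use mpmc_std::MpmcQueue as NumericQueue;

fn create_numeric_queue(capacity: usize) -> Arc<NumericQueue<u64>> {
    Arc::new(NumericQueue::<u64>::new(capacity))
}
// Non-u64 or generic element types keep using MpmcQueue<T> directly.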
This guide provides comprehensive coverage of SIMD optimization techniques for achieving maximum throughput with the mpmc-std library.