SIMD & Performance
Vectorized operations for AI workloads
What is SIMD?
SIMD (Single Instruction, Multiple Data) allows the CPU to process multiple data elements with a single instruction. This is the key to high performance in numerical computing and AI workloads.
Without SIMD: process 1 number → 1 operation per instruction
With AVX2 (256-bit registers): process 8 single-precision floats → 8 operations per instruction
With AVX-512 (512-bit registers): process 16 single-precision floats → 16 operations per instruction
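For comparison, the scalar version of a float add loop in plain C++ looks like this; each iteration handles a single element, which is the baseline that the AVX2 kernel shown under How It Works below speeds up eight-fold.

#include <cstddef>

// Scalar add: one float per loop iteration (illustrative baseline, not part of the runtime)
void vecAdd_scalar(float* dst, const float* a, const float* b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];   // one addition per instruction
    }
}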
Check SIMD Support
// Check your CPU's SIMD capabilities
info = simdInfo()
print("SIMD Level:", info.level)
print("Has AVX2:", info.hasAVX2)
print("Has AVX-512:", info.hasAVX512)
Example output:
SIMD Level: AVX2
Has AVX2: true
Has AVX-512: false
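On GCC or Clang, this kind of capability probe can be written with the compiler's CPU-feature builtins. The sketch below is only an illustration of such a check, not the implementation behind simdInfo():

#include <cstdio>

// Query x86 CPU features at runtime (GCC/Clang builtins)
int main() {
    __builtin_cpu_init();
    bool hasAVX2   = __builtin_cpu_supports("avx2");
    bool hasAVX512 = __builtin_cpu_supports("avx512f");
    std::printf("Has AVX2: %s\n", hasAVX2 ? "true" : "false");
    std::printf("Has AVX-512: %s\n", hasAVX512 ? "true" : "false");
    return 0;
}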
SIMD Functions
SIMD Sum Loop
Process 4 integers at once using AVX2:
// SIMD-accelerated sum loop
iterations = 500000000
result = simdSumLoop(iterations)
sum = result[0]
opsPerSec = result[1]
print("Sum:", sum)
print("Performance:", int(opsPerSec / 1000000000), "BILLION ops/sec")
SIMD Dot Product
Vector dot product using SIMD instructions:
// Dot product of two 10M element vectors
vectorSize = 10000000
result = simdDotProduct(vectorSize)
dotProduct = result[0]
opsPerSec = result[1]
print("Dot product:", dotProduct)
print("Performance:", int(opsPerSec / 1000000000), "BILLION ops/sec")
Matrix Operations
Cache-blocked matrix multiplication with SIMD acceleration:
// 512×512 matrix multiplication
result = matmulBenchmark(512)
gflops = result[0]
seconds = result[1]
print("Time:", seconds, "seconds")
print("Performance:", int(gflops), "GFLOPS")
Matrix Multiplication Performance
| Matrix Size | Time | GFLOPS |
|---|---|---|
| 256×256 | ~2 ms | ~15 |
| 512×512 | ~15 ms | ~18 |
| 1024×1024 | ~100 ms | ~21 |
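These figures follow the usual convention of counting 2·n³ floating-point operations for an n×n matrix multiply (one multiply and one add per inner-product term): at 512×512, for example, 2 × 512³ ≈ 268 million operations in roughly 15 ms works out to about 18 GFLOPS.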
AI Activation Functions
SIMD-accelerated activation functions for neural networks:
// ReLU benchmark: max(0, x) for 10M elements
reluResult = reluBenchmark(10000000)
sum = reluResult[0]
opsPerSec = reluResult[1]
print("ReLU result sum:", sum)
print("Performance:", int(opsPerSec / 1000000000), "BILLION ops/sec")
How It Works
AVX2 Vector Add
// C++ with AVX2 intrinsics
#include <immintrin.h>
#include <cstddef>

void vecAdd_f32(float* dst, const float* a, const float* b, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);    // Load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);    // Load 8 floats
        __m256 vc = _mm256_add_ps(va, vb);     // Add 8 floats in one instruction
        _mm256_storeu_ps(dst + i, vc);         // Store 8 floats
    }
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];                  // Scalar tail for the last n % 8 elements
    }
}
One vaddps instruction adds 8 floating-point numbers simultaneously.
Cache Blocking for MatMul
Large matrices don't fit in CPU cache. Cache blocking divides the computation into smaller tiles that fit in L1 cache:
Standard matmul: long strided passes over the matrices → frequent cache misses
Blocked matmul: tile-by-tile computation → the working set stays in cache
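A minimal C++ sketch of the blocking idea, with the inner loop left scalar for clarity (the 64-element tile size is an assumed value; the real kernel's tiling and AVX2 vectorization may differ):

#include <cstddef>

// Cache-blocked matrix multiply: C += A * B, all matrices n×n row-major
void matmulBlocked(float* C, const float* A, const float* B, size_t n) {
    const size_t BS = 64;                                   // tile size chosen to fit L1 cache
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                // multiply one BS×BS tile; the three tiles stay cache-resident
                for (size_t i = ii; i < ii + BS && i < n; ++i)
                    for (size_t k = kk; k < kk + BS && k < n; ++k) {
                        float aik = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}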
Running SIMD Benchmarks
./bin/nevaarize examples/benchmarks/simd.nva
./bin/nevaarize examples/benchmarks/aiCore.nva
Expected output:
=== SIMD BENCHMARK ===
Testing AVX2 SIMD vectorization
SIMD Level: AVX2
Has AVX2: true
Has AVX-512: false
=== SIMD Sum Loop (500000000 iterations) ===
Sum: 125000000250000000
Operations per second: 4200000000
Performance: 4 BILLION ops/sec
=== SIMD Dot Product (10000000 elements) ===
Dot product result: 20000000.0
Operations per second: 2500000000
Performance: 2 BILLION ops/sec