Chapter 10

SIMD & Performance

Vectorized operations for AI workloads

What is SIMD?

SIMD (Single Instruction, Multiple Data) allows the CPU to process multiple data elements with a single instruction. It is one of the main techniques behind high-performance numerical computing and AI workloads.

Without SIMD:  Process 1 value per instruction
With AVX2:     Process 8 32-bit floats per instruction   (256-bit registers)
With AVX-512:  Process 16 32-bit floats per instruction  (512-bit registers)

Check SIMD Support

// Check your CPU's SIMD capabilities
info = simdInfo()

print("SIMD Level:", info.level)
print("Has AVX2:", info.hasAVX2)
print("Has AVX-512:", info.hasAVX512)

Example output:

SIMD Level: AVX2
Has AVX2: true
Has AVX-512: false
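
Feature flags like these ultimately come from the CPU itself. As an illustration of what such a check looks like in native code, GCC and Clang expose the same information through __builtin_cpu_supports. This is a hypothetical sketch, not the actual simdInfo implementation:

// Hypothetical sketch of CPU feature detection (not the simdInfo() implementation)
#include <cstdio>

int main() {
    __builtin_cpu_init();                                   // Populate CPU feature data
    bool hasAVX2   = __builtin_cpu_supports("avx2");
    bool hasAVX512 = __builtin_cpu_supports("avx512f");     // AVX-512 Foundation

    std::printf("Has AVX2: %s\n",    hasAVX2   ? "true" : "false");
    std::printf("Has AVX-512: %s\n", hasAVX512 ? "true" : "false");
    return 0;
}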

SIMD Functions

SIMD Sum Loop

Process four 64-bit integers at once using AVX2:

// SIMD-accelerated sum loop
iterations = 500000000

result = simdSumLoop(iterations)
sum = result[0]
opsPerSec = result[1]

print("Sum:", sum)
print("Performance:", int(opsPerSec / 1000000000), "BILLION ops/sec")

SIMD Dot Product

Vector dot product using SIMD instructions:

// Dot product of two 10M element vectors
vectorSize = 10000000

result = simdDotProduct(vectorSize)
dotProduct = result[0]
opsPerSec = result[1]

print("Dot product:", dotProduct)
print("Performance:", int(opsPerSec / 1000000000), "BILLION ops/sec")

Matrix Operations

Cache-blocked matrix multiplication with SIMD acceleration:

// 512×512 matrix multiplication
result = matmulBenchmark(512)
gflops = result[0]
seconds = result[1]

print("Time:", seconds, "seconds")
print("Performance:", int(gflops), "GFLOPS")

Matrix Multiplication Performance

Matrix Size    Time        GFLOPS
256×256        ~2 ms       ~15
512×512        ~15 ms      ~18
1024×1024      ~100 ms     ~21
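
These figures follow from the operation count: an N×N matrix multiplication performs roughly 2·N³ floating-point operations (N³ multiplies plus N³ adds). For 512×512 that is 2 · 512³ ≈ 268 million FLOPs; finishing in ~15 ms gives about 268×10⁶ / 0.015 ≈ 18 GFLOPS, matching the table.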

AI Activation Functions

SIMD-accelerated activation functions for neural networks:

// ReLU benchmark: max(0, x) for 10M elements
reluResult = reluBenchmark(10000000)
sum = reluResult[0]
opsPerSec = reluResult[1]

print("ReLU result sum:", sum)
print("Performance:", int(opsPerSec / 1000000000), "BILLION ops/sec")

How It Works

AVX2 Vector Add

// C++ with AVX2 intrinsics
#include <immintrin.h>
#include <cstddef>

void vecAdd_f32(float* dst, const float* a, const float* b, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // Load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);  // Load 8 floats
        __m256 vc = _mm256_add_ps(va, vb);   // Add 8 floats
        _mm256_storeu_ps(dst + i, vc);       // Store 8 floats
    }
    for (; i < n; ++i)                       // Scalar tail for the remaining elements
        dst[i] = a[i] + b[i];
}

One vaddps instruction adds 8 floating-point numbers simultaneously.

Cache Blocking for MatMul

Large matrices don't fit in CPU cache. Cache blocking divides the computation into smaller tiles that fit in L1 cache:

Standard matmul:  Strides across whole rows/columns → frequent cache misses
Blocked matmul:   Works tile-by-tile                → data stays in cache
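
The tiled loop structure looks roughly like the following C++ sketch (the tile size and loop order are illustrative, and C is assumed to be zero-initialized; the engine's actual kernel may differ):

// Cache-blocked C = A * B for row-major N×N matrices (illustrative sketch)
#include <algorithm>
#include <cstddef>

constexpr size_t BLOCK = 64;   // Tile size chosen so tiles fit in L1/L2 cache

void matmulBlocked(const float* A, const float* B, float* C, size_t n) {
    for (size_t i0 = 0; i0 < n; i0 += BLOCK)
        for (size_t k0 = 0; k0 < n; k0 += BLOCK)
            for (size_t j0 = 0; j0 < n; j0 += BLOCK)
                // One tile of A times one tile of B; the inner j loop is
                // contiguous in memory, so the compiler can vectorize it.
                for (size_t i = i0; i < std::min(i0 + BLOCK, n); ++i)
                    for (size_t k = k0; k < std::min(k0 + BLOCK, n); ++k) {
                        float a = A[i * n + k];
                        for (size_t j = j0; j < std::min(j0 + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}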

Running SIMD Benchmarks

./bin/nevaarize examples/benchmarks/simd.nva
./bin/nevaarize examples/benchmarks/aiCore.nva

Expected output:

=== SIMD BENCHMARK ===
Testing AVX2 SIMD vectorization

SIMD Level: AVX2
Has AVX2: true
Has AVX-512: false

=== SIMD Sum Loop (500000000 iterations) ===
Sum: 125000000250000000
Operations per second: 4200000000
Performance: 4 BILLION ops/sec

=== SIMD Dot Product (10000000 elements) ===
Dot product result: 20000000.0
Operations per second: 2500000000
Performance: 2 BILLION ops/sec