GPGPU

Notes on GPU architecture: grids, blocks, warps, memory hierarchy, and CUDA execution model.

  • Grid

    • Array of Blocks (1D, 2D or 3D)
    • Blocks of the same size (up to 1024 Threads)
    • $i = \text{blockIdx.x} \cdot \text{blockDim.x} + \text{threadIdx.x}$
  • Block

    • Array of Threads (1D, 2D or 3D)
  • Barrier Synchronization

    • Threads wait for each other at the same point
    • If waiting in different branches $\Rightarrow$ wait for each other forever (deadlock)
    • Not allowed between Blocks $\Rightarrow$ Blocks can execute in any order
  • Transparent Scalability

    • Same code works on different hardware resources
    • Automatic scaling across different GPU architectures
  • Streaming Multiprocessor (SM)

    • Device defines limit of Blocks and Threads per SM
    • e.g. Blackwell: 32 Blocks; 2048 Threads
      • $256 \text{ (Threads/block)} \cdot 8 \text{ (Blocks)} = 2048 \text{ (Threads)}$
      • $512 \text{ (Threads/block)} \cdot 4 \text{ (Blocks)} = 2048 \text{ (Threads)}$
  • Warp

    • Unit of thread scheduling (the fundamental execution unit of the SM)

    • A Block assigned to an SM is divided into scheduling units of 32 Threads, called Warps

    • Threads in warp executed following SIMD model

      • One instruction fetched and executed for all Threads in the warp
      • All Threads have the same execution timing (less control overhead)
      • Most efficient when same control-flow and branching
    • While a Warp waits for a long-latency instruction, a ready Warp is selected for execution (context switch within 1 cycle)

    • Latency tolerance / hiding (zero-overhead context switching)

    • Warp size is implementation-specific (currently 32 threads)

    • e.g.

      • Each Block has 256 Threads
      • Each Block has $256/32 = 8$ Warps
      • With 3 Blocks in each SM $\Rightarrow$ $8 \cdot 3 = 24$ Warps in each SM
    • e.g. 8 Blocks and 1024 Threads per SM, 512 Threads per Block. Best thread block size?

      • 8x8 Blocks:
        • $8 \cdot 8 = 64$ Threads/Block
        • $1024/64 = 16$ Blocks/SM (but limited to 8 Blocks/SM)
        • $16 > 8 \Rightarrow$ limited by Block count
        • $64 \cdot 8 = 512$ (Threads/SM) $\Rightarrow$ resources underutilized because of fewer warps
      • 16x16 Blocks:
        • $16 \cdot 16 = 256$ Threads/Block
        • $1024/256 = 4$ Blocks/SM
        • $4 < 8 \Rightarrow$ only 50% of Block capacity used
        • $256 \cdot 4 = 1024$ Threads/SM $\Rightarrow$ Full Thread utilization
      • 32x32 Blocks:
        • $32 \cdot 32 = 1024$ Threads/Block
        • $1024 > 512$ Threads/Block limitation $\Rightarrow$ exceeds hardware limit
    • Linearized Thread Blocks are partitioned into Warps (Warp 0 starts with Thread 0)

    • Warp size can change with new architectures (currently 32 on NVIDIA GPUs)

    • Dependencies between threads must be handled with __syncthreads() - never rely on Warp execution order

  • Control Divergence

    • Threads in a warp take different branches (if-then-else / different loop iterations)

    • Branches are serialized (performance penalty)

      • Only threads in same current branch will be executed in parallel
      • Other threads wait idle
    • If branches on threadIdx $\Rightarrow$ divergence (threads within same warp differ)

    • If branches on blockIdx $\Rightarrow$ no divergence (all threads in warp take same path)

    • e.g.

      __global__ void vecAddKernel(float* A, float* B, float* C, int n) {
          int i = threadIdx.x + blockDim.x * blockIdx.x;
          if (i<n) {
              C[i] = A[i] + B[i];  // boundary check causes divergence
          }
      }
      
    • Calculate how many warps are outside and how many inside the valid range

    • Within a boundary warp, check whether its threads actually take different paths (a warp entirely inside or entirely outside the valid range does not diverge)

    • Differentiate between:

      • Type 1: Blocks fully inside the matrix in y $\Rightarrow$ Warps diverge only in the last phase (tile runs past the matrix edge in x)
      • Type 2: Blocks at the bottom edge, partly outside the matrix in y
      • e.g. Matrix Multiplication
        • 16x16 Tiles and Thread Blocks
        • $16 \times 16 / 32 = 256/32 = 8$ Warps
        • Matrix is 100x100
        • Each Thread has 7 Phases (ceiling of $100/16 = 6.25$)
        • 49 Thread Blocks ($7 \times 7$ grid)
        • Type 1:
          • $6 \cdot 7= 42$ Blocks with $8 \times 42 = 336$ Warps
          • $7 \cdot 336 = 2352$ Warp Phases
          • Only diverge in last phase
          • $1 \cdot 336 = 336$ Phases diverge
        • Type 2:
          • $7$ Blocks with $8 \times 7 = 56$ Warps
          • $7 \cdot 56 = 392$ Warp Phases
          • First 2 Warps in each Block cover valid rows $\Rightarrow$ diverge only in the last phase (like Type 1)
          • Other 6 Warps in each Block are entirely outside the valid range $\Rightarrow$ no divergence (all threads take the same path)
          • $2 \cdot 7 = 14$ Warp Phases diverge
        • Impact:
          • $(336+14)$ [Divergent Warp Phases] $/(2352+392)$ [Warp Phases] $= 0.1276$ (12.76% divergence)
  • Shared Memory

    • Physically per SM; allocated per Block (the SM's capacity is partitioned among resident Blocks, contents are not visible to other Blocks)
    • Way faster (latency and throughput) than global memory
    • Lifetime of thread block (released when block completes)
    • Load/store instructions (programmer-managed)
    • Temporary arena or cache-like usage
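    • e.g. minimal usage sketch (hypothetical kernel, not from the source; assumes a 256-thread block):

      __global__ void scaleViaShared(const float* in, float* out, int n) {
          __shared__ float tile[256];                    // one element per thread of the block
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) tile[threadIdx.x] = in[i];          // cooperative load into shared memory
          __syncthreads();                               // barrier: tile fully loaded
          if (i < n) out[i] = 2.0f * tile[threadIdx.x];  // fast read from shared memory
      }
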
  • Tiling/Blocking

    • Content from Global Memory divided into Tiles
    • Threads focus on few Tiles at a time
    • Good when Threads have similar memory access patterns
    • Core idea:
      • Identify Tile in Global Memory
      • Load Tile into Shared Memory cooperatively
      • Synchronize threads to start a Phase (e.g. Tile loaded)
      • Let Threads read from Shared Memory (fast access)
      • Let Threads compute using Shared Memory data
      • Synchronize threads to end a Phase (e.g. Tile processed)
      • Move to next Tile
    • Each Block should aim for max Threads
      • e.g. 16x16 Tile = 256 Threads
        • $2 \times 256$ loads
        • $256 \times (2 \times 16) = 8192$ mul/add ops
        • 16 flops/load
        • $2 \times 16 \times 16 \times 4\text{B} = 2048\text{B} = 2\text{KB}$ Shared Memory required per Block
          • If SM has 48KB Shared Memory $\Rightarrow$ $48\text{KB}/2\text{KB} = 24$ Blocks fit $\Rightarrow$ Shared Memory is not the limiting factor here
      • e.g. 32x32 Tile = 1024 Threads
        • $2 \times 1024$ loads
        • $1024 \times (2 \times 32) = 65536$ mul/add ops
        • 32 flops/load
        • $2 \times 32 \times 32 \times 4\text{B} = 8192\text{B} = 8\text{KB}$ Shared Memory required per Block
          • If SM with 48KB Shared Memory $\Rightarrow$ $48\text{KB}/8\text{KB} = 6$ Simultaneous Blocks in SM (larger tiles make Shared Memory a tighter constraint)
    • Each __syncthreads() temporarily reduces the number of threads doing useful work (threads idle at the barrier)
      • $\Rightarrow$ More resident Blocks/Warps per SM give the scheduler other work to hide this latency
  • Matrix Multiplication Tile Size Handling

    • If outside of valid range load 0 into __shared__
    • No impact on multiply + add (neutral element for addition)
    • Ensures correct results without boundary checks in computation loop
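    • e.g. boundary-checked tile load (sketch; the names Row, Col, Width, ph, tx, ty, ds_M, ds_N, TILE_WIDTH follow the Tiled Matrix Multiplication kernel in the cheatsheet below, which itself omits these checks):

      // Inside the phase loop: load 0.0f whenever the source element lies outside the matrix
      ds_M[ty][tx] = (Row < Width && ph*TILE_WIDTH + tx < Width)
                         ? M[Row*Width + ph*TILE_WIDTH + tx] : 0.0f;   // 0 is neutral for the += accumulation
      ds_N[ty][tx] = (ph*TILE_WIDTH + ty < Width && Col < Width)
                         ? N[(ph*TILE_WIDTH + ty)*Width + Col] : 0.0f;
      // ... and guard the final write: if (Row < Width && Col < Width) P[Row*Width + Col] = P_result_value;
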
  • Padding

    • Add (zero) elements so the matrix dimensions become a multiple of TILE_SIZE
    • Big memory and transfer overhead (trade-off)
    • Simplifies indexing and removes boundary checks
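    • e.g. rounding a dimension up to the next multiple of TILE_SIZE (host-side sketch, illustrative names):

      int padded_width = ((width + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE;  // e.g. 100 -> 112 for TILE_SIZE = 16
      // allocate padded_width x padded_height elements and fill the extra rows/columns with 0 (neutral element)
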
  • Row Major

    matrix = AAAA
             BBBB
             CCCC
             DDDD
    
    matrix_linearized = AAAABBBBCCCCDDDD
    
    • Default memory layout in C/CUDA
    • Adjacent columns in same row are adjacent in memory
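    • e.g. linearized access to element (r, c) of a matrix with Width columns (illustrative names):

      float v = matrix[r * Width + c];  // row-major: stride Width between rows, stride 1 along a row
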
  • Global Memory

    • All Threads read from Global Memory
    • 4 bytes of global memory traffic per floating-point operation $\Rightarrow$ 4 B/FLOP bandwidth requirement
    • e.g. 1500 GFLOPS peak, 200GB/s DRAM bandwidth
      • $\frac{200[\text{GB/s}]}{4[\text{B/FLOP}]} = 50$ GFLOPS limit (memory-bound)
      • $50/1500 = 3.33\%$ of the GPU's peak compute throughput
      • $\Rightarrow$ Reduce memory access count via reuse and shared memory
  • Textures

    • Faster parallel access via graphic-specific caches (texture cache)
    • Read-only from kernel perspective
    • Good if locality (e.g. image processing, spatial locality)
    • Good if same data accessed by multiple threads (broadcast)
    • Hardware interpolation support
  • Surfaces

    • Like Textures but writable
    • 1 mipmap level only
    • 2 APIs:
      • Texture Reference API (legacy, limited, available on all compute capabilities)
      • Texture Object API (Compute Capability 3.0+, more flexible)
  • Memory Coalescing

    • Combine many small memory accesses of parallel Threads into one large memory access
    • If the Threads of a warp access the same burst section in the same instruction (SIMD), only one DRAM access is needed
    • If the accesses span multiple burst sections (no coalescing) $\Rightarrow$ more than one DRAM access is needed
    • If not coalesced, unused data is transferred along with each burst section (wasted bandwidth)
    • Memory accesses within a Warp are coalesced when the pattern is A[(expression independent of threadIdx.x) + threadIdx.x]
    • Maximizes memory throughput efficiency
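    • e.g. coalesced vs. strided access (sketch with illustrative kernels, not from the source):

      // Coalesced: consecutive threads of a warp read consecutive addresses -> one burst per warp
      __global__ void readCoalesced(const float* A, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = A[i];                      // A[(expr independent of threadIdx.x) + threadIdx.x]
      }

      // Strided: consecutive threads read addresses far apart -> each thread touches its own burst section
      __global__ void readStrided(const float* A, float* out, int n, int stride) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i * stride < n) out[i] = A[i * stride];
      }
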
  • Corner Turning

    • Row-wise access (coalesced) instead of Column-wise (strided)
    • Used for loading Tiles efficiently
    • Transpose data layout in shared memory if needed
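    • e.g. sketch of the idea (illustrative names; matrix B is needed column-wise by the computation):

      __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];
      // Consecutive tx read consecutive elements of one row of B -> coalesced global load ...
      tileB[tx][ty] = B[(by*TILE_WIDTH + ty)*Width + bx*TILE_WIDTH + tx];
      // ... stored transposed in shared memory, so later column-wise reads stay on-chip
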
  • DRAM Burst

[0, 1, 2, 3][4, 5, 6, 7][8, 9, 10, 11][12, 13, 14, 15]
^           ^           ^              ^
Burst sections

16 Byte Memory Space
4 Byte Burst Sections
- Access to data transports entire burst section to GPU (not just requested byte)
- Minimum granularity of memory transfers
  • DRAM Core Organization

    • DRAM Core Array around 16 Mbits
    • Stored in capacitor (needs refresh)
                    ┌─────────────┐      ┌────────────────┐
       Row   ──────►│    Row      │─────►│   Memory Cell  │
       Addr         │   Decoder   │      │   Core Array   │
                    └─────────────┘      └───────┬────────┘
                                                 ▼
                                         ┌────────────────┐
                                         │   Sense Amps   │
                                         └───────┬────────┘
                                                 ▼
                                         ┌────────────────┐
                                         │ Column Latches │
                                         └───────┬────────┘
                                                 │ Wide
                                                 ▼
                    Column  ────────────►┌──────────────┐
                    Addr                 │     Mux      │
                                         └──────┬───────┘
                                                │ Narrow
                                                ▼
                                          Pin Interface
                                         ───────────────
                                          Off-chip Data
    
    • Reading a cell from an Array is very slow (e.g. DDR3/GDDR4 Core Speed = 1/8 of Interface Speed)
    • SDRAM Cores cycle at 1/N of Interface Speed
      • Load N x Interface-Width bits from the same row into an internal buffer, then transfer them in N steps
      • DDR3/DDR4: Buffer-Width = 8x Interface-Width
      • Like caching the whole row, then streaming it out at interface speed
      • More Banks/Channels for better bandwidth (parallel access)
  • Intrinsics

    • Functions that compile to single instructions (e.g., __fmaf_rn() for fused multiply-add)
    • Provide direct access to hardware operations
    • Typically faster than regular functions
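    • e.g. (sketch with an illustrative kernel, not from the source):

      __global__ void fmaKernel(const float* a, const float* b, float* c, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) c[i] = __fmaf_rn(a[i], b[i], c[i]);  // a*b+c as one fused instruction, round-to-nearest
      }
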
  • Atomic Operations

    • Hardware-atomic read-modify-write operation (>1000 cycles in DRAM global memory)
    • Other threads wait in queue (serialization)
    • Other atomic operations are sequential (one at a time)
    • Operations: add, sub, inc, dec, min, max, exch, CAS (Compare and Swap)
    • atomicAdd(int *address, int value): read address, add value, store result at address and return old value
    • DRAM atomic operation starts with read: 100s of cycles
    • DRAM atomic operation ends with write: 100s of cycles
    • Effective memory throughput can drop below 1/1000 of peak bandwidth if too many atomic ops serialize (severe bottleneck)
    • Since Fermi: L2 Cache atomic ops latency average 1/10 of DRAM latency (shared between all Blocks)
    • Improvement via shared memory possible (much faster atomics)
  • Privatization

    • Operations must be commutative and associative
    • Private data must fit in shared memory
    • Costs:
      • Overhead of preparing private copies (initialization)
      • Overhead of committing private copy to final copy (final atomic adds)
    • Benefits:
      • Less access control and serialization/sequential access
      • 10x+ performance improvements possible
      • Reduces global memory atomic contention
  • Naive Histogram: Partitions

    • Split Input into Partitions per Thread
    • Add to bins sequentially
    • Contiguous threads do not access contiguous memory in parallel
    • No coalesced access $\Rightarrow$ Bad bandwidth usage
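    • e.g. partitioned access pattern (sketch; contrast with the interleaved/grid-stride kernel in the cheatsheet below):

      __global__ void histo_partitioned(unsigned char *buffer, long size, unsigned int *histo) {
          int tid = threadIdx.x + blockIdx.x * blockDim.x;
          long threads = (long)blockDim.x * gridDim.x;
          long per_thread = (size + threads - 1) / threads;     // size of each thread's contiguous partition
          long start = (long)tid * per_thread;
          for (long i = start; i < start + per_thread && i < size; i++)
              atomicAdd(&(histo[buffer[i]]), 1);                // neighboring threads touch far-apart addresses -> no coalescing
      }
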
  • Naive Histogram: Interleaved

    • Contiguous threads access contiguous memory in parallel
    • Coalesced access $\Rightarrow$ Good bandwidth usage
    • Still suffers from atomic contention
  • Reduction

    • Input data order for processing not required (Associative and Commutative)
    • Divide into multiple parts
    • Let Threads process these parts in parallel
    • Reduction-tree to aggregate results (logarithmic depth)
    • Reduction operation (max, min, sum, mul)
    • User defined reduction operation possible:
      • Must be associative and commutative
      • Requires neutral element (identity element)
  • Efficient Sequential Reduction O(N)

    • Initialize with neutral Element
    • Do reduction iterating input and updating result
    • Each input read only one time (efficient)
    • Single-threaded approach
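    • e.g. sequential sum reduction (host-side sketch, illustrative names):

      float sum = 0.0f;                // neutral element of addition
      for (int i = 0; i < N; i++)
          sum += input[i];             // every input element read exactly once
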
  • Efficient Parallel Reduction O(log(N))

    • Work: $(1/2)N + (1/4)N + (1/8)N + … = N-1$
    • Average parallelism: $(N-1)/\log_2(N)$
      • Average: ~50000 threads (e.g. for $N = 10^6$: $(N-1)/\log_2(N) \approx 50000$)
      • Peak: $N/2 = 500000$ threads (first step for $N = 10^6$)
      • Not resource efficient (many threads idle after early steps)
    • Tournament-like structure (tree reduction)
    • Workload comparable to sequential (same total work)
    • Not all parallel algorithms are compute-efficient (some sacrifice efficiency for speed)
  • Parallel Sum Reduction

    • 2 values per thread initially
    • Needed threads halved each iteration
    • $\log_2(n)$ steps and $n/2$ Threads needed initially
    • In-place implementation:
      • Input in global memory
      • Partial sum in shared memory
      • At start just copy from global memory
      • Block size limits n (shared memory size constraint)
    • Try to keep contiguous threads active to avoid divergence
      • Use threadIdx offsets (not modulo operations)
  • Direct Memory Access (DMA)

    • Used with cudaMemcpy()
      • Performs 1 or more DMA transfers
      • Map address, check if page exists, check source and destination
      • For same DMA only 1 mapping necessary (cached)
    • Specialized HW data transport (no CPU involvement after setup)
    • Uses physical addresses or mapped IO spaces
    • PCIe interface (current standard)
    • The OS could page out memory involved in a DMA transfer, so the driver stages transfers through pinned (page-locked) memory
    • cudaMemcpy() is roughly 2x faster if the host memory is already pinned (no staging copy needed)
  • Virtual Memory

    • Maps virtual pointer addresses to physical addresses (via MMU hardware)
    • Paging mechanism (pages typically 4KB)
    • Allows isolation and flexible memory management
  • Device Overlapping

    • Run Kernel while copying data (concurrent execution)
    • Divide into segments and do copying and compute in parallel for following segments
    • Pipeline approach for better throughput
  • Streams

    • cudaMemcpy() and Kernel can execute in parallel (different streams)
    • Commands in FIFO-Queues (kernel and memcpy calls)
      • Driver and Device asynchronous
      • Driver makes sure commands run sequentially within a stream
    • Ops/Tasks in different streams can run in parallel (if hardware supports)
    • Host can check status with events (cudaEventQuery())
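    • e.g. host-side progress check with an event (sketch; vecAddKernel, stream0 and the device buffers are taken from the cheatsheet below):

      cudaEvent_t done;
      cudaEventCreate(&done);
      vecAddKernel<<<(n + 255) / 256, 256, 0, stream0>>>(d_A, d_B, d_C, n);
      cudaEventRecord(done, stream0);                      // enqueued after the kernel in stream0
      while (cudaEventQuery(done) == cudaErrorNotReady) {
          // host can do other work while stream0 is still busy
      }
      cudaEventDestroy(done);
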
  • Hyper Queues

    • Multiple queues per engine (compute capability 3.5+)
    • More parallelism: some Streams of an Engine keep working, while others are blocked
    • Better resource utilization
  • Dynamic Parallelism

    • Recursive Kernels (kernel launches kernel)
    • Compute capability >= 3.5 required
    • -rdc=true flag: multiple Device Code Objects linked (relocatable device code)
    • cudadevrt.lib makes part of Host API available for Device
    • Limits:
      • cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, n); - recursion depth limit
      • cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, n); - waiting kernel limit
    • No shared/local memory nor registers can be passed between parent and child kernels
    • Synchronization:
      • All threads in block wait at same point: __syncthreads()
      • Wait for child kernels: cudaDeviceSynchronize()
      • Usually: 1 Thread per block starts recursive call (avoid massive duplication)
  • Compute Graphics Interop

    • Avoid GPU (Compute) -> CPU -> GPU (Graphics) data transfers
    • Direct sharing between compute and graphics pipelines
    • OpenGL:
      • Buffer Object (Vertex Buffer) maps to CUDA Buffer (1D or 2D array)
      • Textures map to CUDA Arrays
    • Steps:
      • Create Object (Graphics Context)
      • Register Object (Compute Context)
      • Map or Unmap Objects (exclusive access)
    • Textures:
      • Surface (write access from CUDA)
      • CUDA-Array (device arrays for copy operations, etc.)

Cheatsheet

// Function qualifiers
__global__          // declares kernel function (called by host, runs on device)
__host__            // declares host only function (default)
__device__          // declares device only function
__host__ __device__ // same function for both host and device

// Variables                                 //  MEMORY      SCOPE   LIFETIME
                        int local_var;       //  register    thread  thread
__device__              int global_var;      //  global      grid    application
__device__ __shared__   int shared_var;      //  shared      block   block
__device__ __constant__ int constant_var;    //  constant    grid    application
// __device__ is optional for __shared__ and __constant__
// Automatic variables are stored in Registers
// Exceptions:
//  - per-Thread Arrays      -> global memory (spilled)
//  - no space in registers  -> global memory (register spilling)
//  - local memory is still slow (backed by global), same for stack (recursion)

// Malloc
cudaMalloc(&device_A, size);

// Free
cudaFree(device_A);

// Memcpy
cudaMemcpy(device_A, host_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(host_A, device_A, size, cudaMemcpyDeviceToHost);

// Host malloc (pinned memory for faster transfers)
cudaHostAlloc((void**)&ptr, size, cudaHostAllocDefault);

// Host free
cudaFreeHost(ptr);

// Error handling
cudaError_t err = cudaMalloc(&device_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}

// Kernel definition
__global__
void vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

// Kernel launch
vecAddKernel<<<block_count, threads_per_block>>>(device_A, device_B, device_C, n);

// Grid and block dimensions
dim3 dimGrid(32, 1, 1);
dim3 dimBlock(128, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

dim3 dimGrid(ceil(n/256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

vecAddKernel<<<ceil(n/256.0), 256>>>(...); // 1D only (simplified syntax)


// Synchronization
__syncthreads() // barrier synchronization (all threads in block wait for each other)
                // if put in different if statements they would wait for each other forever (deadlock)

Tiled Matrix Multiplication

#define TILE_WIDTH 16  // tile edge length; Width is assumed to be a multiple of TILE_WIDTH (see Padding above)

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {

    // Define Tiles in shared memory
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    // Get Block indices
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Calculate global Row and Col
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;

    // Accumulator for result
    float P_result_value = 0;

    // Loop over the M and N tiles required to compute the P element
    for (int ph = 0; ph < Width/TILE_WIDTH; ++ph) {

        // Collaborative loading of M and N tiles into shared memory
        // Each thread loads one element from M and one from N
        ds_M[ty][tx] = M[Row*Width + ph*TILE_WIDTH+tx];
        ds_N[ty][tx] = N[(ph*TILE_WIDTH+ty)*Width + Col];

        __syncthreads(); // Wait for all threads to finish loading
        
        // Calculate using all data loaded in shared memory
        for (int i = 0; i < TILE_WIDTH; ++i) {
            P_result_value += ds_M[ty][i] * ds_N[i][tx];
        }
        __syncthreads(); // Wait before loading next tile
    }
    // Write result to global memory
    P[Row*Width+Col] = P_result_value;
}

Textures and Surfaces Cheatsheet


struct cudaChannelFormatDesc {
    int x, y, z, w; // Bits per channel (e.g., 8, 16, 32)
    enum cudaChannelFormatKind f;
};

enum cudaChannelFormatKind {
    cudaChannelFormatKindSigned,      // Signed integer
    cudaChannelFormatKindUnsigned,    // Unsigned integer
    cudaChannelFormatKindFloat        // Floating point
};

struct cudaTextureDesc
{
    enum cudaTextureAddressMode addressMode[3]; // cudaAddressModeBorder: specified border color
                                                // cudaAddressModeClamp: value of border pixel
                                                // cudaAddressModeWrap: modulo, loop, repeat (only normalized coords)
                                                // cudaAddressModeMirror: mirrored (only normalized coords)
    enum cudaTextureFilterMode filterMode; // Interpolation if normalized-float access
                                           // cudaFilterModePoint: nearest neighbor
                                           // cudaFilterModeLinear: linear interpolation
    enum cudaTextureReadMode readMode; // cudaReadModeNormalizedFloat: [0.0..1.0] unsigned, [-1.0..1.0] signed
                                       // cudaReadModeElementType: raw values (pixel coords)
    int sRGB;                          // sRGB color space conversion
    int normalizedCoords;              // normalized coords [0..1]
                                       // alternative: pixel coords [0..width-1]
                                       // surfaces only support pixel coords
    unsigned int maxAnisotropy;
    enum cudaTextureFilterMode mipmapFilterMode;
    float mipmapLevelBias;
    float minMipmapLevelClamp;
    float maxMipmapLevelClamp;
};

// Kernel example
__global__ void transformKernel(float* output, cudaTextureObject_t texObj, int width, int height, float theta) {
    // Calculate normalized texture coordinates
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return; // guard partial blocks at the image edges
    float u = x / (float)width;
    float v = y / (float)height;
    
    // Transform coordinates (rotation)
    u -= 0.5f;
    v -= 0.5f;
    float tu = u * cosf(theta) - v * sinf(theta) + 0.5f;
    float tv = v * cosf(theta) + u * sinf(theta) + 0.5f;
    
    // Read from texture and write to global memory
    output[y * width + x] = tex2D<float>(texObj, tu, tv);
}

int main() {
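    // Hypothetical example inputs (not part of the original snippet): image size, rotation angle, host data
    const int width = 1024, height = 1024;
    const float angle = 0.5f;
    float* h_data = (float*)malloc(width * height * sizeof(float)); // host image, assumed filled elsewhere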
    // Allocate CUDA array in device memory
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray* cuArray; // Opaque type, only access via API functions, can change in future versions
    cudaMallocArray(&cuArray, &channelDesc, width, height); // Alloc and link array to format description

    // Copy to device memory some data located at address h_data in host memory
    cudaMemcpy2DToArray(cuArray, 0, 0, h_data, width*sizeof(float), width*sizeof(float), height, cudaMemcpyHostToDevice);

    // Specify texture resource
    struct cudaResourceDesc resDesc; // Texture object description
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    // Specify texture object parameters
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeWrap;
    texDesc.addressMode[1] = cudaAddressModeWrap;
    texDesc.filterMode = cudaFilterModeLinear;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 1;

    // Create texture object
    cudaTextureObject_t texObj = 0;
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

    // Allocate result of transformation in device memory
    float* output;
    cudaMalloc(&output, width * height * sizeof(float));

    // Invoke kernel
    dim3 dimBlock(16, 16);
    dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x, (height + dimBlock.y - 1) / dimBlock.y);
    transformKernel<<<dimGrid, dimBlock>>>(output, texObj, width, height, angle);

    // Destroy texture object
    cudaDestroyTextureObject(texObj);

    // Free device memory
    cudaFreeArray(cuArray);
    cudaFree(output);

    return 0;
}

Histogram (Naive with Atomics)

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) {

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // Stride is total number of threads (grid-stride loop)
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements (grid-stride pattern)
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

Histogram with Privatization (Shared Memory)

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) {

    // Create private histogram in shared memory (bins)
    __shared__ unsigned int histo_private[7];

    // Initialize private histogram (first 7 threads do this)
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;

    __syncthreads(); // Wait for initialization to complete
    
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // Stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    
    // Update private histogram (much faster atomics in shared memory)
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]/4]), 1);
        i += stride;
    }
    __syncthreads(); // Wait for all updates to complete

    // Commit private histogram to global final histogram (only once per block)
    if (threadIdx.x < 7) {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
}

Naive Sum Reduction (Divergent)

__shared__ float partialSum[2*BLOCK_SIZE];

unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;

// Load two elements per thread from global to shared memory
partialSum[t] = input[start + t];
partialSum[blockDim.x+t] = input[start + blockDim.x+t];

// Reduction in shared memory (divergent approach)
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % stride == 0)  // Causes divergence (modulo operation)
        partialSum[2*t] += partialSum[2*t+stride];
}

// Host could add results itself or start another kernel

// Problems:
// - Half of threads in warps do nothing in every iteration (divergent warps)
// - Modulo operation causes branch divergence
// - Non-contiguous threads remain active

Better Sum Reduction (Minimized Divergence)

// Same loading as above...

// Better reduction pattern (fewer divergent warps)
for (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
    __syncthreads();
    if (t < stride) {  // Contiguous threads stay active
        partialSum[t] += partialSum[t+stride];
    }
}
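
// At this point partialSum[0] holds the block's sum; thread 0 typically writes it out,
// e.g. if (t == 0) output[blockIdx.x] = partialSum[0];  ('output' is a hypothetical per-block result array)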

// Benefits:
// - First warp fully active (no divergence within warp)
// - Contiguous threads reduce divergence
// - Better memory access patterns

Multi-Stream Example

cudaStream_t stream0, stream1;

cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

float *d_A0, *d_B0, *d_C0; // Device memory for stream 0
float *d_A1, *d_B1, *d_C1; // Device memory for stream 1

// cudaMalloc() calls for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here

// Naive approach (limited overlap)
for (int i=0; i<n; i+=SegSize*2) {
    cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, d_C0, SegSize);
    cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);
    
    cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, d_C1, SegSize);
    cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
}

// Better approach (maximal overlap - batch operations)
for (int i=0; i<n; i+=SegSize*2) {
    // Batch all H2D copies together
    cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);

    // Batch all kernel launches
    vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, d_C0, SegSize);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, d_C1, SegSize);

    // Batch all D2H copies together
    cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
}

cudaStreamSynchronize(stream_id); // Wait for all tasks in THAT STREAM to finish
cudaDeviceSynchronize();          // Wait for all tasks in ALL STREAMS to finish

Dynamic Parallelism Example

__managed__ unsigned long long int facRes;

__global__ void
facCuda(const unsigned int n, unsigned long long int* res){

    if (n <= 1) {
        *res = 1; 
        return;
    }

    // Recursive kernel launch (only from device)
    facCuda<<<1, 1>>>(n - 1, res);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess){
        printf("%d!: %s\n", n, cudaGetErrorString(err));
    }

    cudaDeviceSynchronize(); // Wait for child kernel to finish
    __syncthreads();         // Block-level synchronization

    *res = n * *res; // Multiply current n with factorial(n-1)
}

int main(void) {

    unsigned int n = 10;

    // Set recursion limits (important for dynamic parallelism)
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, n+1);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);

    facCuda<<<1, 1>>>(n, &facRes);
    cudaDeviceSynchronize(); // Wait for entire recursion tree

    printf("%d! = %lld\n", n, facRes);
}

Compute Graphics Interop Examples


// ============================================
// Vertex Buffer Object (VBO) Interop
// ============================================
GLuint vboId;                   // OpenGL resource ID
cudaGraphicsResource_t vboRes;  // CUDA resource handle

// Reserve and create OpenGL buffer
glGenBuffers(1, &vboId);
glBindBuffer(GL_ARRAY_BUFFER, vboId);
glBufferData(GL_ARRAY_BUFFER, vboSize, 0, GL_DYNAMIC_DRAW); // Initialize
glBindBuffer(GL_ARRAY_BUFFER, 0);

// Register with CUDA for interop
cudaGraphicsGLRegisterBuffer(&vboRes, vboId, cudaGraphicsRegisterFlagsNone);

float* vboPtr;
while (!done) {
    // Reserve buffer for CUDA compute (lock from OpenGL)
    cudaGraphicsMapResources(1, &vboRes, 0); // count, resourceRef, stream

    // Get device pointer to mapped buffer
    size_t size;
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr); // Compute kernel modifies vertex data

    // Return buffer to OpenGL (unlock)
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    runGL(vboId); // Render using updated vertices
}

// ============================================
// Texture Interop
// ============================================
GLuint texId;
cudaGraphicsResource_t texRes;

// Reserve and create OpenGL texture
glGenTextures(1, &texId);
glBindTexture(GL_TEXTURE_2D, texId);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8UI_EXT, texWidth, texHeight, 0, GL_RGBA_INTEGER_EXT, GL_UNSIGNED_BYTE, 0);
glBindTexture(GL_TEXTURE_2D, 0);

// Register with CUDA
cudaGraphicsGLRegisterImage(&texRes, texId, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsNone);

cudaArray* texArray;
<pixelType>* d_buffer; // Needs to get allocated as device memory

while (!done) {
    // Reserve texture for CUDA compute
    cudaGraphicsMapResources(1, &texRes, 0);

    // Get CUDA array pointer (texture data)
    cudaGraphicsSubResourceGetMappedArray(&texArray, texRes, 0, 0);
    
    runCUDA(d_buffer); // Compute kernel generates texture data
    
    // Copy computed data to texture array (device-to-device)
    cudaMemcpyToArray(texArray, 0, 0, d_buffer, d_buffer_size, cudaMemcpyDeviceToDevice);

    // Return texture to OpenGL
    cudaGraphicsUnmapResources(1, &texRes, 0);
    runGL(texId); // Render using updated texture
}