GPU Notes
Concepts:
- Shared memory is divided into banks (usually 32 banks), so warp-level accesses can be parallelized if threads don’t conflict. If warp threads access different banks → all 32 loads happen in parallel. If multiple threads hit the same bank → bank conflict, serialized access, slower. That’s why memory access patterns matter (e.g. row-major vs col-major).
- There is an upper bound on the # threads per block, and similarly an upper bound on the # blocks per SM. This comes from the hardware: the SM schedules warps against fixed resources. Current hardware caps a block at 1024 threads (with on the order of 1536–2048 resident threads per SM). These upper bounds are more than enough for most applications.
- Compute capability: a property of the hardware
- The hardware pays for branch divergence (both paths execute, with inactive lanes masked off) rather than trying to re-pack idle SPs or inactive threads in a warp, because masking is far cheaper than rescheduling.
- If you need a global/kernel-level sync, that is a natural point to cut the kernel and start a new kernel from there.
- Don’t call __syncthreads() inside an if-else where only some threads of the block take the branch; it leads to undefined behavior (the block can hang). See the sketch below.
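A minimal sketch of the hazard and the fix (hypothetical kernels):

```cuda
// BAD: only even-indexed threads reach the barrier -> undefined behavior.
__global__ void bad_sync(float *data) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        data[i] *= 2.0f;
        __syncthreads();          // not reached by odd threads
    }
}

// GOOD: every thread in the block reaches the same barrier.
__global__ void good_sync(float *data) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        data[i] *= 2.0f;
    }
    __syncthreads();              // uniform across the block
}
```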
- CUDA assigns a block to an SM only if all of the block’s resources (registers, shared memory, thread slots) are available beforehand. This is what makes warp context switches on the GPU essentially free.
- Thread scheduling is totally outside our control. So, don’t make assumptions about which warp finishes before another.
- Maximize threads per SM toward the upper bound, rather than # blocks per SM: more resident warps give the scheduler more candidates to switch to, i.e. better latency hiding.
- Warps are the unit of scheduling inside an SM, and a warp executing one instruction across its threads is the SIMD (SIMT) part of the model.
- Branch prediction is a CPU/multicore technique; GPUs don’t rely on it. On the GPU we instead try to avoid data-dependent if-else conditions, as they cause warp divergence.
- Latency hiding / keeping the GPU busy by switching to ready warps while other warps wait for data. This is also called latency tolerance.
- The number of warps and the problem size are two key factors in deciding threads/block.
- Rule of thumb: use cudaGetDeviceProperties and maximize threads/SM for maximum parallelism and best performance, because that gives the highest latency tolerance: there will almost always be some warp ready to execute (see the sketch below).
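A minimal host-side sketch of that rule of thumb; the block size of 256 is just an illustrative choice:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // device 0

    printf("Name: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);

    // Illustrative choice: a block size that divides maxThreadsPerMultiProcessor
    // evenly tends to keep occupancy (and latency tolerance) high.
    int threadsPerBlock = 256;
    printf("Blocks per SM at %d threads/block: %d\n",
           threadsPerBlock, prop.maxThreadsPerMultiProcessor / threadsPerBlock);
    return 0;
}
```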
- Maximizing blocks/SM is not in your control; that scheduling happens at the chip level.
- Never make your code dependent on a hardware feature such as the warp size; query it (e.g. warpSize) instead of hardcoding it.
- CGMA ratio: compute to global memory access, also called arithmetic intensity. See the naive matmul sketch below for how to count it.
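To make the counting concrete, here is an assumed naive matmul kernel; the CGMA numbers in the comment come from counting its inner loop:

```cuda
// Naive matmul kernel (illustrative). Per inner-loop iteration:
//   2 global loads (M[..], N[..]) and 2 FLOPs (1 mul + 1 add)
//   => CGMA = 2 FLOPs / 2 loads = 1.0 (0.25 FLOP/byte) -> memory bound.
__global__ void matmul_naive(const float *M, const float *N, float *P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= width || col >= width) return;

    float sum = 0.0f;
    for (int k = 0; k < width; ++k)
        sum += M[row * width + k] * N[k * width + col];
    P[row * width + col] = sum;
}
```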
- Global memory persists for an application and not just the kernel execution
- Shared memory of an SM is virtualized: if two blocks are assigned to the same SM, each block sees and accesses only its own partition of the shared memory.
- Constant memory is written only by the host and is read-only for the device. It is much faster to access than global memory (it is cached and broadcast to the warp). It is visible to every thread and is in our direct control. Its size (64 KB) is comparable to that of shared memory (see the sketch below).
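A minimal sketch of constant memory usage; the symbol name coeffs and the kernel are illustrative:

```cuda
#include <cuda_runtime.h>

#define NUM_COEFFS 16

// Read-only for the device; written only from the host.
__constant__ float coeffs[NUM_COEFFS];

__global__ void apply_coeffs(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % NUM_COEFFS];   // served by the constant cache
}

void setup_coeffs(const float *host_coeffs) {
    // Host-side write into constant memory (total constant memory is 64 KB).
    cudaMemcpyToSymbol(coeffs, host_coeffs, NUM_COEFFS * sizeof(float));
}
```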
- Registers, shared memory, constant memory and global memory are in our direct control. The caches are only under indirect control.
- The life-time of a register is related to the thread that holds it.
- Each access to registers involves fewer machine instructions than global memory.
- A register file is the group of registers inside an SM.
- Address space is the set of addresses your program can access.
- If a kernel declares int x, every thread has its own copy of that variable in its registers.
- If it is __shared__ int y, every block (on whichever SM it runs) has its own copy of y.
- __device__ int z lives in global memory, so every thread in the grid can access it.
- Automatic array variables local to a thread reside in local memory. Local memory doesn’t physically exist as separate storage: it is an abstraction for thread-local scope that the compiler places in global memory. Because it sits in global memory, it is on the order of 100x slower to access than a register even though it looks local, so keep variables in registers as much as possible and reduce lmem (local memory) usage for performance. The sketch below shows these variable scopes side by side.
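A sketch putting the four scopes side by side (names are illustrative):

```cuda
__device__ int z;                     // global memory: visible to every thread in the
                                      // grid, lives for the whole application

__global__ void scopes_demo(float *out) {
    int x = threadIdx.x;              // register: one private copy per thread
    __shared__ int y;                 // shared memory: one copy per block
    float local_buf[32];              // automatic array: "local memory", actually
                                      // placed in global memory by the compiler
    local_buf[x % 32] = (float)x;
    if (threadIdx.x == 0) y = blockIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + x] = local_buf[x % 32] + y + z;
}
```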
- Global memory access is a performance bottleneck, so we try to reduce it. A common strategy is tiling: partition the data into subsets called tiles such that each tile fits into shared memory (see the tiled matmul sketch below).
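A sketch of the tiling idea applied to matmul; TILE = 16 and the assumption that width is a multiple of TILE are mine:

```cuda
#define TILE 16

__global__ void matmul_tiled(const float *M, const float *N, float *P, int width) {
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / TILE; ++t) {             // assumes width % TILE == 0
        // Each thread loads one element of each tile (coalesced loads),
        // so each global element is read once per tile instead of once per thread.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();                                 // tile fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                                 // done with this tile
    }
    P[row * width + col] = sum;
}
```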
- For GPUs we want performance/watt to be high, and they’re quite good at it given the right problem.
- Global synchronization across the blocks of a grid isn’t possible and isn’t expected. The reason is deadlock: there can be more blocks than fit on the SMs, so some wait to be scheduled. If the blocks already executing reached a global barrier, they would wait for blocks that haven’t been scheduled yet; but those blocks can’t be scheduled because the waiting blocks still hold the SM resources, and that is a deadlock.
- Registers are faster, but their values still have to be fetched from global memory. If every thread needs the same value x, keeping x in per-thread registers means every thread fetches it from global memory, whereas putting x in shared memory means it is fetched from global memory only once per block. In that case shared memory is the better choice.
- A memory-bound kernel is a signal that it still needs optimization. Aim for the kernel to be compute bound, with most of its time spent in actual computation. To determine whether a kernel is compute- or memory-bound we need both software details (the kernel’s arithmetic intensity) and hardware details (the device’s FLOP rate and memory bandwidth).
- With thread divergence you lose ~50% regardless of the split: 31 threads in the if and 1 in the else costs the same as 16/16, because both paths execute one after the other. The power consumption also roughly doubles, as both paths are executed.
- The system-to-device (host↔device) bandwidth link is a bottleneck. The on-device link between the GPU (SMs & shared memory) and device (global) memory is better, as it is built by NVIDIA and optimized for GPU usage.
- Sending a large amount of data in one transfer is better than several small transfers, because we avoid the per-transfer overheads.
- Coalescing of the memory access is done at a per warp level, as the threads in the warp execute the same instructions at once.
- Having each thread walk down a column of a row-major 2D matrix is better than having each thread walk along a row: at every step the warp as a whole then touches consecutive memory locations (draw a diagram to see it). So in a matmul, arranging the accesses so both inputs are read this way is better; there is a way to do it, check about it. This is a warp-level property, not a per-thread one (see the sketch below).
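A sketch of the warp-level difference, assuming a row-major width x width matrix and illustrative copy kernels:

```cuda
// Coalesced: consecutive threads (consecutive `col`) touch consecutive addresses
// at every iteration -> one wide transaction per warp per step.
__global__ void copy_coalesced(const float *in, float *out, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // each thread walks down a column
    for (int row = 0; row < width; ++row)
        out[row * width + col] = in[row * width + col];
}

// Strided: consecutive threads are `width` elements apart at every iteration
// -> up to 32 separate transactions per warp per step.
__global__ void copy_strided(const float *in, float *out, int width) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // each thread walks along a row
    for (int col = 0; col < width; ++col)
        out[row * width + col] = in[row * width + col];
}
```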
- Fewer global memory instructions -> more opportunity to coalesce: if the data a warp needs is fetched in coalesced blocks, it takes fewer trips to memory.
- Make memory accesses as contiguous as possible. The OS/runtime does some alignment, but as a programmer you also need to keep data aligned to get good performance. It reduces the number of trips to memory, thereby increasing coalescing. The cache also fetches whole blocks (lines) from memory, so if the data you want is already contiguous you avoid a lot of cache misses.
- The # registers used per thread/warp is determined directly by the code we write; the variables and the structure of the code decide it.
- Latency hiding - something must be executing while you’re retrieving the memory.
- If there are many independent instructions between a memory access and the first use of its result, the scheduler can fill that gap with computation, making the kernel a little more compute bound.
- Loop unrolling reduces the number of instructions the GPU executes (loop counters, branches) in the generated assembly. Do it once your code is already optimized for memory, resources, etc.; follow the order of optimizations from the class slides. A small sketch follows.
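A minimal sketch of unrolling with #pragma unroll; the kernel and the factor of 4 are illustrative:

```cuda
// Unrolling removes loop-counter and branch instructions and exposes more
// independent instructions to the scheduler.
__global__ void scale_unrolled(float *data, float a, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;   // 4 elements per thread
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        if (i + k < n)
            data[i + k] *= a;
}
```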
- Reduce global memory trips using tiling, shared memory etc.
- Data prefetching becomes less beneficial as thread granularity increases, because at coarser granularity the threads are already using the registers/shared memory close to the maximum, leaving no room to hold prefetched data.
- What if we don’t know the shared memory size beforehand? Dynamic shared memory. Even here you must know the size before the kernel launch, just not at compile time. You can’t have more than one dynamic allocation declared explicitly; you can carve one allocation up with pointers, but that is rarely done because shared memory is too small for that sophistication. A sketch follows.
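A minimal sketch of dynamic shared memory, with an illustrative kernel:

```cuda
__global__ void uses_dynamic_smem(const float *in, float *out, int n) {
    // One unsized extern array per kernel; its size comes from the 3rd launch parameter.
    extern __shared__ float smem[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = smem[threadIdx.x] * 2.0f;
}

// Host side: the shared-memory size is chosen at launch time, after program start.
// uses_dynamic_smem<<<grid, block, block.x * sizeof(float)>>>(d_in, d_out, n);
```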
- If two C/C++ files both define int x and you link them together, the linker may silently merge or pick one of the definitions, which can be a source of bugs. To avoid this, use the extern keyword to declare the variable and define it in exactly one file, so the linker knows where to look.
- The memory is cut into pieces so that no single piece is big, power-hungry and slow; these pieces are called banks. How is the address space divided across them? (Answered next.)
- Shared memory accesses that span b distinct banks yield an effective bandwidth b times as high as when all accesses map to the same bank. Shared memory currently has 32 banks with a fixed width of four bytes, and consecutive four-byte words map to consecutive banks. So if the threads of a warp access contiguous memory locations, the data comes from 32 independent banks, b times faster than hitting one bank. Coalescing-style access patterns matter in this respect too.
- Shared memory bank conflicts: since a bank’s width is small, it can serve only one access at a time, so conflicting accesses are serialized and take longer than accesses spread over different banks, which proceed in parallel (see the transpose sketch below).
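A sketch of the classic case, a shared-memory transpose tile; the +1 padding removes the 32-way conflict on column reads (assumes a 32x32 block and width a multiple of 32):

```cuda
#define TILE 32

// With a [TILE][TILE] array, reading a column means every thread of the warp
// lands in the same bank (32-way conflict). Padding each row by one element
// shifts successive rows by one bank and removes the conflict.
__global__ void transpose_padded(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];        // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;     // transposed block origin
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free
}
```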
- IEEE floating point standard (single-precision): (–1)^sign x (1+mantissa) x 2^(exponent – 127)
- With just this encoding we can’t represent 0, so there are 3 encoding schemes based on the exponent field (normalized, denormalized, and special values).
- In tensor cores, the FMA is over 3 matrices: multiply two and add the third. The multiply operands are half precision and the accumulation is done in single precision.
- When multiplying two vectors (a dot product), the parallel (tree-style) reduction is more accurate than the serial sum, and FMA is the most accurate compared to the exact value. The GPU always uses FMA while the CPU sums serially, which is why the GPU is more favorable in terms of floating-point accuracy than the CPU here. Also, SPs are optimized to do an FMA in a single cycle.
- Asynchronous = returns to host right-away and doesn’t wait for device. (Non-blocking).
- Some CUDA API calls and all kernel launches are asynchronous with respect to the host code. This means error-reporting is also asynchronous.
- If the kernel launch is non-blocking and the next statement is a device-to-host memcpy, shouldn’t that be an error, since the kernel isn’t done yet? What happens under the hood is that the device maintains a queue (a stream): the kernel launch is enqueued first, then the memcpy, and the operations in the queue execute sequentially. That’s why it works.
- You can have many queues, called streams, and CUDA gives you control to put operations in different streams. Until now we’ve dealt with only one stream, so all operations executed sequentially in that single device queue.
- Operations in different streams can be interleaved and, when possible, they can even run concurrently. We can have several open streams to the same device at once. Need GPUs with concurrent transfer/execution capability.
- All operations to non-default streams are non-blocking wrt the host code.
- Sometimes, for correctness reasons, you need to synchronize the host code with the operations in a stream.
- Three options: cudaDeviceSynchronize() blocks the host for all streams; cudaStreamSynchronize(stream) blocks the host for that particular stream; cudaStreamQuery(stream) does not block the host, it just reports whether the stream is done.
- Streams are a good way to overlap execution and transfer, hardware permitting (see the sketch below).
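A sketch of overlapping transfers and kernels with two streams; the kernel, chunking and sizes are illustrative, and the host buffer must be pinned for the async copies to overlap:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

void run_chunks(float *h_pinned, float *d_buf, int n) {
    const int nStreams = 2, chunk = n / nStreams;       // assumes n % 2 == 0
    cudaStream_t streams[2];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        // Copy-in, kernel, copy-out of one chunk queued in the same stream;
        // different streams can overlap with each other.
        cudaMemcpyAsync(d_buf + off, h_pinned + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_buf + off, 2.0f, chunk);
        cudaMemcpyAsync(h_pinned + off, d_buf + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```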
- Accessing host memory from the device without an explicit copy is called the zero-copy mechanism. For small data accesses, zero-copy is better because it avoids the overhead of transferring the data to global memory and then accessing it from the device.
- As we program GPUs we need to pay attention to several performance bottlenecks:
- Branch divergence
- Global memory latency
- Global memory bandwidth
- Shared memory bank conflicts
- Communication
- Limited resources
- We have several techniques in our arsenal to enhance performance
- Try to make threads in the same warp follow the same control flow
- Tiling
- Coalescing
- Loop unrolling
- Increase thread granularity
- Trade one resource for another
- Memory access pattern
- Streams
- __syncthreads() - within one block - inside a kernel - waits for all threads of the same block to reach that point before continuing.
- cudaDeviceSynchronize() - across the entire device - on the host side - waits until all previously launched work on the GPU is complete.
- cudaEvent_t - used to measure actual GPU execution time (the time the GPU was busy running your kernels); it measures pure GPU time (see the sketch below).
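A minimal sketch of event timing; the kernel and launch configuration are placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *d, int n) {          // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

float time_kernel(float *d_data, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                           // recorded into the stream
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                       // wait for the stop event

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);           // pure GPU time in ms
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```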
- General rule of thumb: as the problem size increases, scale by increasing the number of threads, not the amount of work done by each thread; that is what increases performance. This applies to parallel computing in general.
- Order of cost: communication & memory access > computation
- In a CUDA program, if we suspect an error occurred during a kernel launch, we must explicitly check for it after the kernel has executed (e.g. with cudaGetLastError()).
- When debugging, compile with nvcc -DDEBUG code.cu -o code. This defines the DEBUG macro so that error-handling code guarded by it (like the check macro sketched below) is compiled in.
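One common pattern for such guarded error handling; the macro name CUDA_CHECK is illustrative, not necessarily the course’s macro:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err = (call);                                             \
        if (err != cudaSuccess) {                                             \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                       \
                    cudaGetErrorString(err), __FILE__, __LINE__);             \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

// Usage: API calls return an error code directly; kernel launches do not,
// so check cudaGetLastError() (and optionally synchronize) after the launch.
//   CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
//   my_kernel<<<grid, block>>>(d_ptr);
//   CUDA_CHECK(cudaGetLastError());
//   CUDA_CHECK(cudaDeviceSynchronize());
```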
- MPI, OpenMP and CUDA: the multicore + multi-GPU communication setup. MPI handles the multi-node GPU communication.
- Zero-copy: nothing is copied; the memory is accessed directly, host-to-device or device-to-device. The transfer happens from the other memory into the current device’s L2 cache directly.
- Unified virtual addressing (UVA): puts host memory and the memory of all devices into a single virtual address space.
- Unified memory (UM): creates a pool of managed memory shared between the CPU and GPU. Under the hood, the data (pages) migrates automatically between the CPU and GPUs to whichever processor needs it.
- All of this is for ease of use, not for performance. Manual control is sometimes better/best, because with the automatic schemes you might not know exactly where the data actually resides.
- UM is built on top of UVA; UM adds the extra capability of automatic data (page) movement.
- All of this applies only to the GPU’s global memory. Shared memory, tiling and the L1 cache are still local to and unique for each device.
- Zero-copy: use it for a small piece of data read a few times. Manual copies: for complex access patterns between host and device. Unified memory: for structured, regular access.
- Within UVA, when we need data we use the copy… option, but it actually transfers the pages. Without this, the copies could easily fill up the space and be redundant.
- UM is more performant than UVA.
- Coherence: simultaneous writes to the same page are not allowed, even when enforcing this lowers performance.
- Dynamic parallelism: related to nested code/recursion. On GPUs it means a kernel can launch new kernels, and streams can spawn new streams. It permits dynamic runtime decisions.
- The parent kernel isn’t done until its child kernels finish, subject to resource availability; if there aren’t enough resources for the children, a deadlock is possible.
- To speed things up, launch a dummy (warm-up) kernel first: setting up the CUDA runtime/context takes time, so the first kernel launch is slower than the subsequent ones.
- cudaThreadSynchronize(): the device waits for work done elsewhere, e.g. the work sent to a child kernel by the parent kernel. Even without it, the parent waits until the child kernel finishes. But does the parent kernel block, or can it do some work itself? Check??
- Alignment with cudaMallocPitch for 2D arrays/kernels: CUDA ensures each row starts at an address aligned to a 64 B or 128 B multiple. Aligned memory → fewer memory transactions → higher bandwidth utilization (see the sketch below).
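A minimal sketch of a pitched 2D allocation (illustrative sizes):

```cuda
#include <cuda_runtime.h>

// 2D allocation with padded (pitched) rows so every row starts on an aligned
// address. `pitch` is the actual row size in bytes chosen by the runtime.
void alloc_2d(int rows, int cols) {
    float *d_mat = nullptr;
    size_t pitch = 0;
    cudaMallocPitch((void **)&d_mat, &pitch, cols * sizeof(float), rows);

    // Index with the pitch, not with cols:
    //   row pointer = (float *)((char *)d_mat + r * pitch);
    cudaFree(d_mat);
}
```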
- CUDA compilation: CUDA file (.cu) → PTX (intermediate representation) → SASS (or other machine assembly) → cubin (CUDA binary) → execute.
- -arch: the virtual compute architecture used for generating the PTX code.
- -code: specifies the actual device targeted by the SASS and the cubin binary.
- Without -code, the final form is PTX, so the GPU machine code is generated every time by a JIT compiler.
- Fat binary: an executable or object file that contains multiple versions of the GPU code: one or more machine-specific binaries (SASS) compiled for concrete GPU architectures like sm_30, sm_35, etc., and optionally one or more PTX versions (virtual assembly) for JIT compilation on future GPUs.
- When a CUDA kernel is launched, the driver checks the GPU’s SM version and looks in the fat binary for a matching compiled binary (cubin). If it finds one, it runs it directly for maximum speed; otherwise, it falls back to the embedded PTX, JIT-compiles it into a cubin, and caches it (in ~/.nv/ComputeCache) for future use. This allows the driver to automatically choose the optimal version at runtime with no code changes needed.
- In the nvcc command, -arch gets you the PTX for that (virtual) architecture and -code gives you the assembly and binary for your specific device.
- (-arch=compute_Xi works with -code=sm_Xj, where i<=j) - Check
- With just -arch and without -code, you get only the PTX; at runtime the JIT compiler turns it into assembly and an executable binary, so it takes a bit longer.
- Write code that’s warp friendly (reduces thread divergence) and cache friendly (good memory access patterns). This can be cultivated with experience and systems-design thinking.
- __shfl_sync: the fastest way to share data between threads within a warp, instead of going through shared memory or cache. It maps to low-level hardware and is used by NVIDIA’s libraries (see the warp-reduction sketch below).
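A sketch of a warp-level sum using the __shfl_down_sync variant; the kernel around it is illustrative:

```cuda
// Warp-level sum using shuffles: at each step, every lane adds the value held
// by the lane `offset` above it, without touching shared memory.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);   // full-warp mask
    return val;                                             // lane 0 holds the sum
}

__global__ void block_sums(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warp_reduce_sum(v);
    if ((threadIdx.x & 31) == 0)                            // one atomic per warp
        atomicAdd(out, v);
}
```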
- Thread block cluster: a group of thread blocks, a new layer between blocks and the grid (the next layer up is the grid with clusters). All blocks of a cluster are guaranteed to be co-scheduled, each on an SM, before the cluster executes. The shared memory of all blocks within a cluster is mutually accessible and is called distributed shared memory. This needs CUDA C++ and the Hopper (compute capability 9.0) or newer architecture. One advantage is that this gives us block-level (cluster-wide) synchronization. - Check last point.
- How do we reduce the performance dip from atomicAdd while still maintaining atomicity? Privatization. It is always worth considering when there is severe collision, but it’s costly: there is overhead for creating and initializing a private copy per thread block and for accumulating the private copies into the final copy. The benefit is much less contention and serialization on both the private copies and the final copy; overall performance can often improve by more than 10x. The private copies are stored in shared memory. atomicAdd now also has variants by scope of atomicity; for block scope, atomicAdd_block combined with privatization gives the best performance.
- What if the copy is too large to privatize? Sometimes one can partially privatize the output and use range testing to direct each update to either global memory or shared memory (see the histogram sketch below).
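A sketch of privatization for a 256-bin histogram; names and sizes are illustrative:

```cuda
#define NUM_BINS 256

// Privatization: each block accumulates into its own shared-memory histogram,
// then flushes it to the global histogram once, drastically reducing contention
// on the global atomics.
__global__ void histogram_privatized(const unsigned char *data, int n,
                                     unsigned int *global_hist) {
    __shared__ unsigned int local_hist[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_hist[b] = 0;                           // init private copy
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[data[i]], 1u);         // cheap shared-memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);   // one flush per block
}
```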
Lecture doubts:
- If the kernel launch API is non-blocking, then the subsequent steps, like copying the results back to the host, start immediately; how is that possible? Or why is the kernel launch API non-blocking? - A: CUDA streams (operations queued in the same stream execute in order).
- What is cudaMalloc(void**)? What does the void** mean? cudaMalloc() allocates memory on the GPU and writes the GPU address into your device pointer variable. To let CUDA modify that pointer, you must pass its address (i.e., a pointer to your pointer). Since cudaMalloc() expects a void**, we cast our variable’s address, e.g. (void**)&d_curr, to match its signature. This cast simply tells CUDA, “here’s the address of my pointer; fill it with the device memory location.”
- As a programmer, can we arrange/coalesce the memory accesses before the actual access to memory happens?
- What is data prefetch? A technique used to hide memory access latency by loading data into faster memory (like shared memory or registers) before it is actually needed for computation.
- What is presorting overhead in floating point operations?
- For a memcpy from host to device, will it be done via pinned pages, or is the usage of pinned pages in the programmer’s control? If so, what are the specific APIs that allow us to do this?
- Nvidia doesn’t support backward compatibility??
- Access vs transfer in GPU peer-to-peer?? Access goes up to the L2 cache, while transfer goes all the way to the GPU’s global memory. Check more??
- What is zero copy and how its different from/related to peer-to-peer copy??
- cudaHostAlloc() ??
- By default, grids launched within a thread block are executed sequentially.
  - This happens even if the grids are launched by different threads within the block.
  - To deal with this drawback → streams.
  - Streams created on the host cannot be used on the device.
  - Streams created in a block can be used by all threads in that block. ??
- __shfl_sync and its variations: how are they used, what happens under the hood, and what happens without them?
- Q: If threads in a warp should access contiguous memory for performance, but shared memory accesses must avoid bank conflicts to be parallel, isn’t this contradictory?
- A: No, because these rules apply to two different memory systems. Contiguous access refers to global memory, where consecutive addresses allow requests from a warp to be coalesced into fewer DRAM transactions, maximizing bandwidth. Bank conflicts apply to shared memory, which is divided into banks; here, threads must access different banks to avoid serialization. In practice, an optimal kernel loads data from global memory using coalesced (contiguous) accesses, then rearranges it in shared memory into a layout that avoids bank conflicts for computation. Both conditions are required: coalescing ensures efficient data movement into the SM, while conflict-free shared memory ensures fast, parallel use of that data once it is on-chip.
C++ & CUDA:
- std::shared_ptr → smart pointer, automatic memory cleanup.
- Custom deleter → control how memory is freed (e.g. cudaFree); see the sketch below.
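A minimal sketch of a shared_ptr owning device memory with a cudaFree deleter; the helper name is illustrative:

```cuda
#include <memory>
#include <cuda_runtime.h>

// A shared_ptr that owns device memory and releases it with cudaFree
// (illustrative helper, not part of CUDA itself).
std::shared_ptr<float> make_device_buffer(size_t n) {
    float *raw = nullptr;
    cudaMalloc(&raw, n * sizeof(float));
    return std::shared_ptr<float>(raw, [](float *p) { cudaFree(p); });
}

// Usage: the buffer is freed automatically when the last copy goes out of scope.
// auto d_buf = make_device_buffer(1 << 20);
```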
- Constructor initializer list (: h(h_), w(w_)) → compact way to set members.
- Templates (template <typename T>) → write generic code for any type.
- Macros (#define Index(…)) → text substitution, quick shorthand.
- Exceptions (throw std::runtime_error(“msg”)) → safe error reporting.
- Header files:
- #include = “paste the file here.”
- <…> = system/standard headers
- “…” = local project headers
- #pragma once -> Ensures header file is included only once and prevents duplicate definitions.
- Files:
- .hh / .hpp → C++ headers (project-specific)
- .cuh → CUDA headers (contain CUDA-related declarations)
- .cpp / .cu → source files (where you usually put bigger function definitions, kernels, training loops)
- operator() makes an object callable like a function → functors (function objects)
- kernels can’t take references (&) — arguments must be passed by value