
Accelerators


GPUs

  • DRAM
  • L2 cache
  • Streaming multiprocessors
  • (Within SMs) [FP32|FP64|Tensor] Cores
Simplified version of an Nvidia GPU

Execution Model

2-level parallelism hierarchy:
  1. Parallel functions broken into a set of thread blocks, each of which is assigned to ➡️ an SM
      • Thread blocks executed concurrently = a wave
  1. Within an SM, each thread is assigned to ➡️ an instruction pipeline
      • threads within an SM can communicate via shared memory & synchronise
 
SIMT:
  • For SIMD we operate on contiguous vectors
  • Here we have multiple same-instruction threads operating on scalars
  • This removes the contiguity constraint
We need more thread blocks than SMs (and more threads than instruction pipelines) because:
  1. Switching between threads hides instruction-dependency (and memory-latency) stalls
  1. Less time is wasted waiting for the "tail" of thread blocks to finish at the end of a wave (see the sketch below)
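To make the launch hierarchy concrete, a minimal Numba CUDA sketch (illustrative only: the kernel, array, and sizes are my own choices, and it assumes Numba plus a CUDA-capable GPU):
```python
from numba import cuda
import numpy as np

@cuda.jit
def add_one(x):
    i = cuda.grid(1)                  # global thread index across the whole grid
    if i < x.size:                    # guard: the last block may be partly unused
        x[i] += 1.0

x = cuda.to_device(np.zeros(1_000_000, dtype=np.float32))
threads_per_block = 256               # threads in a block run on one SM and can use shared memory
blocks = (x.size + threads_per_block - 1) // threads_per_block   # far more blocks than SMs
add_one[blocks, threads_per_block](x)
```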

Core Types

CUDA cores: general-purpose cores
Tensor cores: faster matmuls on small matrices, e.g. FP16 inputs accumulated in FP32

Performance

FLOPS (but don't forget memory bandwidth!)
💡
Example FLOPS calculation: An A100 has 108 SMs, a 1.41 GHz clock rate, and can do 1024 FP16 operations per clock cycle. Multiply these together to get the FLOPS, which = 108 * 1024 * 1.41 * 10^9 = 156 TFLOPS
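The same arithmetic as a tiny Python helper (the function name and arguments are just illustrative):
```python
def peak_flops(num_sms: int, ops_per_sm_per_clock: int, clock_hz: float) -> float:
    """Theoretical peak = SMs x ops per SM per clock x clock rate."""
    return num_sms * ops_per_sm_per_clock * clock_hz

# A100 FP16 example from above: 108 SMs, 1024 FP16 ops/SM/clock, 1.41 GHz
print(f"{peak_flops(108, 1024, 1.41e9) / 1e12:.0f} TFLOPS")  # ~156 TFLOPS
```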
For an algorithm, three factors determine performance:
  1. Memory bandwidth: time spent on memory access, $T_{mem} = \#\text{bytes} / BW_{mem}$
  1. Math bandwidth: time spent on computation, $T_{math} = \#\text{ops} / BW_{math}$
  1. Latency
Latency only tends to be an issue if there's not sufficient parallelism.
We assume memory access time and math time can largely be overlapped (parallelised).
Therefore, we are math limited if $T_{math} > T_{mem}$, i.e. $\#\text{ops} / \#\text{bytes} > BW_{math} / BW_{mem}$, and otherwise memory limited.
Hence if an algorithm's arithmetic intensity ($\#\text{ops} / \#\text{bytes}$) is higher than the processor's ops:byte ratio ($BW_{math} / BW_{mem}$), we are math limited; otherwise we are memory limited.
💡
Example FLOP/B calculation: "A V100 has a peak math rate of 125 FP16 Tensor TFLOPS, an on-chip L2 bandwidth of 3.1 TB/s, and an off-chip memory bandwidth of approx. 0.9 TB/s, giving it an ops:byte ratio between 40 and 139"
Most operations have low arithmetic intensity and so are memory bounded! The main exception is large linear algebra operations.
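A rough sketch of why: compare the arithmetic intensity of a large matmul with an element-wise op (the perfect-caching assumption below is mine, not from the source):
```python
def matmul_arith_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C = A @ B, assuming each matrix touches memory exactly once."""
    flops = 2 * m * k * n                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved

print(matmul_arith_intensity(4096, 4096, 4096))  # ~1365 FLOP/B: above the V100's ~139, so math limited
# An FP16 element-wise op is ~1 op per 4 bytes (read + write) = 0.25 FLOP/B: memory limited
```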
Real-terms considerations:
  1. Throughput falls significantly below the theoretical peak if memory access is non-contiguous (uncoalesced)
  1. For best performance, instantiate large blocks of threads that share the same program, because execution within an SM is SIMT (SIMD-like)

Tesla V100 Specs

SMs: 80
[FP32|FP64|Tensor] Cores / SM: 64 | 32 | 8 (Total: 5120 | 2560 | 640)
Peak [FP32|FP64|Tensor] TFLOPS: 15.7 | 7.8 | 125
Memory Size & Bandwidth: 16 GB at 900 GB/s (HBM2)
L2 Cache Size & Bandwidth: 6144 KB at 2 TB/s (est)
L1 Cache + Shared Mem (combined in V100) Size & Bandwidth: 128 KB at 150 TB/s (est)
Inter-chip Bandwidth: 32 GB/s (PCIe), or 300 GB/s (NVLINK)

Nvidia A100 Specs

 


IPUs

Design Philosophy

Designed to satisfy:
  1. Irregular fine-grained computation → true MIMD
  1. Irregular data accesses
  1. High bandwidth, low latency memory access
Key design decisions:
  1. No penalty for different instructions
  1. No penalty for irregular memory access
  1. No shared memory, just v fast local scratchpad SRAM

Colossus™ MK2 GC200 IPU

IPU-Tiles: (1472)
Like an SM, each tile has its own memory and runs its own independent program. Unlike an SM, there are many more of them, and each tile has a 1-to-1 correspondence with a core.
IPU Cores: (1 per tile)
Each has 6 independent program threads, which can each execute different instructions (true MIMD)
Accumulating Matrix Product (AMP) units: (1 within each IPU core)
Similar idea to tensor cores
64 mixed-precision or 16 single-precision floating point operations per clock cycle
In-Processor Memory:
Size: (624KB per-tile, 900MB total; SRAM)
Local (i.e. non-shared)
For code and data
20x less total memory than GPU DRAM; 150x more total memory than GPU L2 cache; 5x more per-tile memory than GPU L1 cache
Bandwidth: (47.5 TB/s)
Far higher than GPU DRAM (50x), and even than L2 cache (10x, though that's an estimate)
Inter-tile bandwidth: 8 TB/s IPU-Exchange, any communication pattern (10x GPU DRAM)
Inter-chip bandwidth: 320 GB/s IPU-Links (same as GPU NVLINK, 10x GPU PCIe)
 
 
Support for single and half-precision.
Can connect multiple IPUs and treat them as one large one (from programmer's perspective)

BSP

Designed around the idea of BSP (Bulk Synchronous Parallel):
  1. all do local computation
  1. sync = wait for all to finish
  1. exchange across IPU-exchange
Poplar SDK describes computation as vertices, data exchange as (static) edges, and data as tensors ➡️ rest is done by compiler
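A toy Python sketch of one BSP superstep, with threads standing in for tiles (illustrative only; this is not the Poplar API):
```python
import threading

N_TILES = 4
barrier = threading.Barrier(N_TILES)          # the "sync" phase
local = [[i] for i in range(N_TILES)]         # per-tile local memory
inbox = [[] for _ in range(N_TILES)]          # messages received after exchange

def tile(i):
    # 1. compute: purely local work on this tile's own memory
    local[i] = [x * x for x in local[i]]
    # 2. sync: wait until every tile has finished its compute phase
    barrier.wait()
    # 3. exchange: send results to another tile (any communication pattern is allowed)
    inbox[(i + 1) % N_TILES].extend(local[i])
    barrier.wait()                            # sync again before the next superstep

threads = [threading.Thread(target=tile, args=(i,)) for i in range(N_TILES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(inbox)
```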

Mixed Precision Training

(see 🔟 Number Formats for more info on formats)
Benefits of sub-32-bit fp training:
  1. Less memory usage
  1. Less memory bandwidth required (local and network)
  1. Faster math
Mixed precision training: Identifies the steps that require 32-bit, and uses 16-bit elsewhere.
Steps to use:
  1. Porting the model to use the FP16 data type where appropriate.
  1. Adding loss scaling to preserve small gradient values.

Tensor Core Math

Tensor Cores perform D = A x B + C
A and B are half precision 4x4 matrices
D and C can be either half or single precision 4x4 matrices ➡️ which determines the precision of the output
8x throughput compared with single-precision math pipelines
🛠
We can split a large matrix into blocks, and then do a matmul in the "standard way", with our blocks in the position of scalars. The complexity is unchanged, and we can now decompose into operations of the right size for e.g. tensor cores.
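A NumPy sketch of this blocked decomposition, mimicking tensor-core behaviour (FP16 inputs, FP32 accumulation); the block size and names are illustrative, and dimensions are assumed divisible by the block size:
```python
import numpy as np

def blocked_matmul_fp16_accum_fp32(A, B, block=4):
    """Blocked matmul: FP16 inputs, FP32 accumulation. Each block-level
    multiply-accumulate plays the role of one tensor-core D = A x B + C."""
    A, B = A.astype(np.float16), B.astype(np.float16)
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and m % block == 0 and n % block == 0 and k % block == 0
    C = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, block):
        for j in range(0, n, block):
            for p in range(0, k, block):
                # small matmul on FP16 blocks, accumulated into the FP32 output block
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block].astype(np.float32)
                    @ B[p:p+block, j:j+block].astype(np.float32)
                )
    return C
```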

Scaling Issues

The following image demonstrates a potential problem with half-precision training.
It shows the distribution of values encountered when training in FP32. Most of the FP16 equivalent values would either end up as zero or as inaccurate denorm values. However, as most of the FP16 range is unused, some scaling would mitigate much of the problem.
(Maths: FP16 has 5 exponent bits, giving a bias of 15. The minimum normal value is therefore $2^{-14} \approx 6 \times 10^{-5}$, giving the blue vertical line. The 10 significand bits allow denorms down to $2^{-24} \approx 6 \times 10^{-8}$, giving the red line.)
Solution: loss scaling ⬇️ (all in FP16)
  1. Multiply the loss by a scale factor $S$
  1. Standard backprop (the chain rule ensures the scaling propagates to all gradients)
  1. Multiply the weight gradients by $1/S$ and feed them to the optimiser (see the sketch below)
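A minimal NumPy sketch of these steps for a toy linear model (the model, learning rate, and scale factor are arbitrary choices of mine; it also keeps the FP32 master weights mentioned in the exceptions below):
```python
import numpy as np

S = 1024.0  # fixed loss scale factor

def scaled_step(w_fp32, x, y, lr=0.01):
    """One loss-scaled training step for a toy linear model y_hat = w @ x.
    Forward and backward run in FP16; the FP32 master weights take the update."""
    w16, x16, y16 = w_fp32.astype(np.float16), x.astype(np.float16), y.astype(np.float16)
    err = w16 @ x16 - y16                                    # forward pass in FP16
    loss = 0.5 * np.float16(S) * np.sum(err ** 2)            # loss multiplied by S
    grad_w16 = np.float16(S) * err[:, None] * x16[None, :]   # backprop: scaling propagates
    grad_w32 = grad_w16.astype(np.float32) / S               # unscale by 1/S in FP32
    return w_fp32 - lr * grad_w32, loss / S                  # FP32 master-weight update

w = np.random.randn(3, 5).astype(np.float32)
x, y = np.random.randn(5), np.random.randn(3)
w, loss = scaled_step(w, x, y)
```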
 
Exceptions to the "all in FP16" rule:
  1. Some cases will require an FP32 copy of the weights to be maintained and updated, though fwd & back-prop still FP16.
  1. Large reductions (e.g. batch-norm mean & var statistics) should still be FP32.
 
Choosing a scaling factor:
Fixed scaling factors can work well (e.g. 8-32K). If gradient statistics are available, a suitable factor can be computed directly (choose $S$ so the maximum scaled gradient stays below the FP16 maximum of 65,504); otherwise treat it as a hyperparameter.
For dynamic scaling factors, if no overflow occurs for a chosen number of iterations N then increase the scaling factor. If an overflow occurs, skip the weight update and decrease the scaling factor.
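A sketch of that dynamic scheme (the constants are typical defaults, similar to those in PyTorch's GradScaler, not values from the source):
```python
class DynamicLossScaler:
    """Grow the scale after N overflow-free steps; shrink it and skip the update on overflow."""
    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Returns True if the weight update should be applied this step."""
        if found_overflow:
            self.scale *= self.backoff_factor   # decrease the scale, skip the update
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor    # increase the scale after N clean steps
            self._good_steps = 0
        return True
```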

Automatic Mixed Precision

Libraries that implement AMP do the following (see the PyTorch sketch below):
  1. Convert the model to use the float16 data type where possible.
  1. Keep float32 master weights to accumulate per-iteration weight updates.
  1. Use loss scaling to preserve small gradient values.
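For example, in PyTorch the pattern looks roughly like this (a minimal sketch assuming a CUDA device; the model and data are dummies):
```python
import torch

model = torch.nn.Linear(32, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling

inputs = torch.randn(64, 32, device="cuda")
targets = torch.randint(0, 10, (64,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                   # FP16 where safe, FP32 elsewhere
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()                     # backprop on the scaled loss
scaler.step(optimizer)                            # unscales grads; skips the step on overflow
scaler.update()                                   # adjusts the scale factor
```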

Inference Quantisation

We can use even lower-precision formats like INT8 during inference.
One way to counter the performance loss here is quantisation-aware training, where we simulate lower-precision arithmetic in the forward pass and loss, but update full-precision weights.
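A minimal NumPy sketch of the fake-quantisation step used in quantisation-aware training (symmetric per-tensor INT8 is my assumed scheme; real frameworks also pair this with a straight-through estimator so gradients reach the full-precision weights):
```python
import numpy as np

def fake_quantise_int8(x: np.ndarray) -> np.ndarray:
    """Quantise to symmetric INT8 and immediately dequantise, so the forward pass
    sees the rounding error while weights stay (and are updated) in full precision."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127)   # the INT8 representation
    return (q * scale).astype(x.dtype)            # back to float for the forward pass
```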