Contents: GPUs (Execution Model · Core Types · Performance · Tesla V100 Specs · Nvidia A100 Specs) · IPUs (Design Philosophy · Colossus™ MK2 GC200 IPU · BSP) · Mixed Precision Training (Tensor Core Math · Scaling Issues · Automatic Mixed Precision · Inference Quantisation)
- L2 cache
- Streaming multiprocessors
- (Within SMs) [FP32|FP64|Tensor] Cores
2-level parallelism hierarchy:
- Parallel functions broken into a set of thread blocks, each of which is assigned to ➡️ an SM
- Thread blocks executed concurrently = a wave
- Within an SM, each thread is assigned to ➡️ an instruction pipeline
- threads within an SM can communicate via shared memory & synchronise
- For SIMD we operate on contiguous vectors
- Here we have multiple same-instruction threads operating on scalars
- This removes the contiguity constraint
We need more thread blocks than SMs / more threads than instruction pipelines because
- Switching between threads hides instruction (and memory-access) latency
- Less waiting on the "tail": the final, partially-full wave of blocks
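A toy model of the tail effect (this assumes one resident block per SM per wave, which is a simplification — real occupancy allows several):

```python
import math

def wave_utilization(n_blocks, n_sms=108):
    """Fraction of SM slots doing useful work, assuming one block
    per SM per wave (a simplification of real occupancy)."""
    waves = math.ceil(n_blocks / n_sms)
    return n_blocks / (waves * n_sms)

print(wave_utilization(108))  # 1.0  : one exactly-full wave
print(wave_utilization(110))  # ~0.51: the 2-block tail wave idles 106 SMs
```

Oversubscribing with many more blocks than SMs shrinks the relative cost of that final partial wave.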
CUDA cores: general cores
Tensor cores: faster matmuls on small matrices - e.g. fp16 inputs but accumulated in fp32
FLOPS isn't everything: memory bandwidth matters too!
Example FLOPS calculation: An A100 has 108 SMs, a 1.41 GHz clock rate, and can do 1024 FP16 FMAs per SM per clock cycle. Counting each FMA as 2 FLOPs: 108 * 1024 * 2 * 1.41 * 10^9 ≈ 312 TFLOPS
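The same calculation as a few lines of Python (each FMA counted as two FLOPs, which yields the commonly quoted 312 TFLOPS dense FP16 figure):

```python
sms = 108                     # streaming multiprocessors
clock_hz = 1.41e9             # clock rate
fmas_per_sm_per_clock = 1024  # FP16 Tensor Core FMAs per SM per cycle
flops_per_fma = 2             # one multiply + one add

peak_flops = sms * clock_hz * fmas_per_sm_per_clock * flops_per_fma
print(f"{peak_flops / 1e12:.0f} TFLOPS")  # 312 TFLOPS
```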
For an algorithm:
- Memory time: T_mem = (bytes accessed) / BW_mem (memory bandwidth)
- Math time: T_math = (ops) / BW_math (math bandwidth)
Latency only tends to be an issue if there's not sufficient parallelism.
We assume memory access time and math time can largely be overlapped.
Therefore, we are math limited if T_math > T_mem, i.e. if ops / bytes > BW_math / BW_mem, and otherwise memory limited.
Hence if an algorithm's arithmetic intensity (ops / bytes) is higher than the processor's ops:byte ratio (BW_math / BW_mem) then we are math limited, and otherwise memory limited.
Example FLOP/B calculation: "A V100 has a peak math rate of 125 FP16 Tensor TFLOPS, an on-chip L2 bandwidth of 3.1 TB/s, and an off-chip memory bandwidth of approx. 0.9 TB/s, giving it an ops:byte ratio between 40 and 139"
Most operations have low arithmetic intensity and so are memory bounded! The main exception is large linear algebra operations.
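A rough check of this for matmuls, using the V100 figures above (2 bytes per FP16 element; this ignores caching and re-use effects):

```python
def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """Ops/byte for an (m x k) @ (k x n) matmul: 2*m*n*k FLOPs
    (multiply + add per MAC) over reading A, B and writing C once."""
    ops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return ops / bytes_moved

ops_to_byte = 125e12 / 0.9e12  # V100: 125 TFLOPS / 0.9 TB/s ≈ 139

print(arithmetic_intensity(4096, 4096, 4096))  # ~1365: math limited
print(arithmetic_intensity(1, 4096, 4096))     # ~1.0:  memory limited
```

The second case (a matrix-vector product, as in batch-1 inference) is why small-batch workloads are so often bandwidth-bound.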
- Throughput significantly below theoretical peak if memory access is non-contiguous
- For best performance, instantiate large blocks of threads that share the same program - because of SIMD (within SMs?)
[FP32|FP64|Tensor] Cores / SM: 64 | 32 | 8 (Total: 5120 | 2560 | 640)
Peak [FP32|FP64|Tensor] TFLOPS: 15.7 | 7.8 | 125
Memory Size & Bandwidth: 16 GB at 900 GB/s (HBM2)
L2 Cache Size & Bandwidth: 6144 KB at 2 TB/s (est)
L1 Cache + Shared Mem (combined in V100) Size & Bandwidth: 128 KB at 150 TB/s (est)
Inter-chip Bandwidth: 32 GB/s (PCIe), or 300 GB/s (NVLINK)
Designed to satisfy:
- Irregular fine-grained computation → true MIMD
- Irregular data accesses
- High bandwidth, low latency memory access
Key design decisions:
- No penalty for different instructions
- No penalty for irregular memory access
- No shared memory, just v fast local scratchpad SRAM
Tiles:
- Like an SM: each has its own memory and an independent program
- Unlike an SM: there are far more of them, and each has a 1-1 correspondence with its core
IPU Cores: (1-per tile)
Each has 6 independent (different instructions?) program threads
Accumulating Matrix Product (AMP) units: (1 within each IPU core)
Similar idea to tensor cores
64 mixed-precision or 16 single-precision floating point operations per clock cycle
Size: (624KB per-tile, 900MB total; SRAM)
Local (i.e. non-shared)
For code and data
- 20x less total mem than GPU DRAM
- 150x more total mem than GPU L2 cache
- 5x more per-tile mem than GPU L1 cache
Bandwidth: (47.5 TB/s)
Far higher than GPU DRAM (50x), and even L2 cache (10x, though that's an estimate)
Inter-tile bandwidth: 8 TB/s IPU-Exchange, any communication pattern (10x GPU DRAM)
Inter-chip bandwidth: 320 GB/s IPU-Links (same as GPU NVLINK, 10x GPU PCIe)
Support for single and half-precision.
Can connect multiple IPUs and treat them as one large one (from programmer's perspective)
Designed around idea of BSP:
- all do local computation
- sync = wait for all to finish
- exchange across IPU-exchange
Poplar SDK describes computation as vertices, data exchange as (static) edges, and data as tensors ➡️ rest is done by compiler
(see Number Formats for more info on formats)
Benefits of sub-32-bit fp training:
- Less memory usage
- Less memory bandwidth required (local and network)
- Math faster
Mixed precision training: Identifies the steps that require 32-bit, and uses 16-bit elsewhere.
Steps to use:
- Porting the model to use the FP16 data type where appropriate.
- Adding loss scaling to preserve small gradient values.
Tensor Cores perform
D = A x B + C
A and B are half-precision 4x4 matrices
C can be either half or single precision ➡️ which determines the precision of the output
8x throughput compared with single-precision math pipelines
We can split a large matrix into blocks, and then do a matmul in the "standard way", with our blocks in the position of scalars. The complexity is unchanged, and we can now decompose into operations of the right size for e.g. tensor cores.
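A sketch of this blocked decomposition in numpy (the block size is arbitrary here; Tensor Cores consume 4x4 tiles):

```python
import numpy as np

def blocked_matmul(a, b, block=4):
    """Matmul by tiling into block x block sub-matrices.
    Dimensions are assumed to be multiples of `block` for simplicity."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % block == 0 and k % block == 0 and n % block == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, block):
        for j in range(0, n, block):
            for p in range(0, k, block):
                # Same triple loop as a scalar matmul, but each "multiply"
                # is itself a small matmul: the shape a Tensor Core consumes.
                c[i:i+block, j:j+block] += (
                    a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
                )
    return c

a = np.random.rand(8, 8).astype(np.float32)
b = np.random.rand(8, 8).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-5)
```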
(Missing image: histogram of gradient magnitudes.) The potential problem with half-precision training: many gradient values fall below FP16's representable range and underflow to zero.
Solution: loss scaling ⬇️ (all in FP16)
- Multiply the loss by a scale factor S
- Standard backprop (chain rule ensures the scaling propagates to all gradients)
- Multiply the weight gradient by 1/S and feed it to the optimiser
Exceptions to the all in FP16 rule:
- Some cases will require an FP32 copy of the weights to be maintained and updated, though fwd & back-prop still FP16.
- Large reductions (e.g. batch-norm mean & var statistics) should still be FP32.
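Both the scaling trick and the FP32-master-weights exception can be seen in a few lines of numpy (scale value illustrative):

```python
import numpy as np

S = 1024.0  # scale factor (illustrative; 8-32K is the typical range)

# A gradient this small underflows to zero in FP16...
grad = 1e-8
assert np.float16(grad) == 0.0

# ...but scaling the loss (and hence every gradient) first preserves it.
scaled = np.float16(grad * S)       # representable as an FP16 subnormal
recovered = np.float32(scaled) / S  # unscale before the optimiser step
assert abs(recovered - grad) / grad < 0.01

# Why keep FP32 master weights: a small update vanishes if applied in
# FP16 (FP16 spacing around 1.0 is ~1e-3), but accumulates in FP32.
assert np.float16(1.0) + np.float16(1e-4) == np.float16(1.0)
assert np.float32(1.0) + np.float32(1e-4) > np.float32(1.0)
```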
Choosing a scaling factor:
Fixed scaling factors can work well (e.g. 8-32K). If gradient statistics are available then a suitable factor can easily be computed; otherwise consider treating it as a hyperparameter.
For dynamic scaling factors, if no overflow occurs for a chosen number of iterations N then increase the scaling factor. If an overflow occurs, skip the weight update and decrease the scaling factor.
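A minimal sketch of that dynamic scheme (class and parameter names are illustrative; real implementations such as PyTorch's GradScaler differ in detail):

```python
class DynamicLossScaler:
    """Sketch of dynamic loss scaling: grow the scale after N clean
    iterations, back off and skip the update on overflow."""

    def __init__(self, init_scale=2.0**15, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval  # N overflow-free iterations
        self.factor = factor
        self._good_steps = 0

    def update(self, overflowed):
        """Call once per iteration; returns True if the weight update
        should be applied, False if it should be skipped."""
        if overflowed:
            self.scale /= self.factor  # back off
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.factor  # no overflow for N steps: grow
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
scaler.update(False); scaler.update(False)
print(scaler.scale)         # 16.0  : grew after 2 clean steps
print(scaler.update(True))  # False : overflow, skip the weight update
print(scaler.scale)         # 8.0
```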
Libraries that implement AMP do the following:
- Convert the model to use the float16 data type where possible.
- Keep float32 master weights to accumulate per-iteration weight updates.
- Use loss scaling to preserve small gradient values.
We can use even lower-precision formats like INT8 during inference.
One way to counter the performance loss here is quantisation-aware training, where we simulate lower precision in the forward pass and loss, but update full-precision weights.
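As a concrete example, symmetric per-tensor INT8 quantisation can be sketched as follows (helper names are illustrative):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantisation: x ~= scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(256).astype(np.float32)
q, s = quantize_int8(x)
# Round-trip error is bounded by half a quantisation step.
assert np.abs(dequantize(q, s) - x).max() <= s / 2 + 1e-6
```

Quantisation-aware training inserts this round-trip ("fake quantisation") into the forward pass so the loss sees INT8 effects, while gradients still update the FP32 weights.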