### Contents

- GPUs
  - Execution Model
  - Core Types
  - Performance
  - Tesla V100 Specs
- Nvidia A100 Specs
- IPUs
  - Design Philosophy
  - Colossus™ MK2 GC200 IPU
  - BSP
- Mixed Precision Training
  - Tensor Core Math
- Scaling Issues
  - Automatic Mixed Precision
  - Inference Quantisation

### GPUs

- DRAM

- L2 cache

- Streaming multiprocessors

- (Within SMs) [FP32|FP64|Tensor] Cores

#### Execution Model

2-level parallelism hierarchy:

- Parallel functions are broken into a *set of* **thread blocks**, each of which is assigned to an **SM**. The set of thread blocks executed concurrently is a **wave**.

- Within an SM, each thread is assigned to an **instruction pipeline**. Threads within an SM can communicate via shared memory and synchronise.

SIMT:

- For SIMD we operate on contiguous vectors

- Here we have multiple same-instruction threads operating on scalars

- This removes the contiguity constraint

We need more thread blocks than there are SMs to run them (and more threads than instruction pipelines) because oversubscription:

- Hides instruction dependencies by switching between threads

- Reduces waiting for the "tail" of instructions to finish

#### Core Types

**CUDA cores:** general-purpose cores

**Tensor cores:** faster matmuls on small matrices, e.g. FP16 inputs accumulated in FP32
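A quick way to see why accumulating in FP32 matters: repeatedly adding small FP16 values stalls once the gap between representable FP16 numbers exceeds the increment. A minimal sketch (illustrative numbers):

```python
import numpy as np

# Summing 20,000 copies of 1e-4 (true sum: 2.0). An FP16 accumulator
# stalls once increments fall below half the accumulator's ulp; an FP32
# accumulator (as tensor cores use) stays close to the true sum.
vals = np.full(20000, 1e-4, dtype=np.float16)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)   # FP16 accumulate: stalls around 0.25
    acc32 = acc32 + np.float32(v)   # FP32 accumulate: close to 2.0

print(float(acc16), float(acc32))
```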

#### Performance

**FLOPS:** not the only performance measure (memory bandwidth matters too!)

**Example FLOPS calculation:** An A100 has 108 SMs, a 1.41 GHz clock rate, and can do 1024 FP16 operations per SM per clock cycle. Multiply these together to get the FLOPS: 108 × 1024 × 1.41 × 10⁹ ≈ 156 TFLOPS.
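The same calculation as a short sketch:

```python
# Peak FLOPS = number of SMs × ops per clock per SM × clock rate (Hz).
# Figures from the A100 example above.
sms = 108
fp16_ops_per_clock_per_sm = 1024
clock_hz = 1.41e9

peak_flops = sms * fp16_ops_per_clock_per_sm * clock_hz
print(f"{peak_flops / 1e12:.0f} TFLOPS")  # 156 TFLOPS
```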

**For an algorithm:**

Factors:

- Memory bandwidth: time spent accessing memory ≈ bytes accessed / memory bandwidth

- Math bandwidth: time spent computing ≈ ops / math bandwidth

- Latency

Latency only tends to be an issue if there's not sufficient parallelism.

We assume memory access time and math time can largely be overlapped, so the larger of the two determines runtime.

Therefore, we are math limited if `ops / math_bandwidth > bytes / memory_bandwidth`, and otherwise memory limited.

Equivalently: if an algorithm's arithmetic intensity (`ops / bytes`) is higher than the processor's ops:byte ratio (`math_bandwidth / memory_bandwidth`) then we are math limited, and otherwise memory limited.

**Example FLOP/B calculation:** "A V100 has a peak math rate of 125 FP16 Tensor TFLOPS, an on-chip L2 bandwidth of 3.1 TB/s, and an off-chip memory bandwidth of approx. 0.9 TB/s, giving it an ops:byte ratio between 40 and 139"

Most operations have low arithmetic intensity and so are memory-bound! The main exception is large linear algebra operations.
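A sketch of the math-vs-memory-limited check, using the V100 off-chip figures above (the helper name and the matmul sizes are made up for illustration):

```python
def is_math_limited(flops, bytes_accessed, peak_flops, mem_bw):
    """Math-limited iff arithmetic intensity (FLOPs/byte) exceeds the
    processor's ops:byte ratio (peak FLOPS / memory bandwidth)."""
    return flops / bytes_accessed > peak_flops / mem_bw

# V100 FP16 tensor figures: 125 TFLOPS peak, 0.9 TB/s off-chip bandwidth,
# so the ops:byte ratio is ~139.
PEAK, BW = 125e12, 0.9e12

# Large FP16 matmul, M = N = K = 4096: 2*M*N*K FLOPs vs. reading A and B
# and writing the output, 2 bytes per element.
M = N = K = 4096
flops = 2 * M * N * K
bytes_moved = 2 * (M * K + K * N + M * N)
print(is_math_limited(flops, bytes_moved, PEAK, BW))  # True: math-limited

# Elementwise add of two such matrices: 1 FLOP per element, 3 tensors
# moved, so intensity is ~1/6 FLOP/byte.
n = M * N
print(is_math_limited(n, 3 * 2 * n, PEAK, BW))        # False: memory-limited
```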

**Real-terms considerations:**

- Throughput significantly below theoretical peak if memory access is non-contiguous

- For best performance, instantiate large blocks of threads that share the same program - because of SIMD (within SMs?)

#### Tesla V100 Specs

**SMs:** 80

**[FP32|FP64|Tensor] Cores / SM:** 64 | 32 | 8 (Total: 5120 | 2560 | 640)

**Peak [FP32|FP64|Tensor] TFLOPS:** 15.7 | 7.8 | 125

**Memory Size & Bandwidth:** 16 GB at 900 GB/s (HBM2)

**L2 Cache Size & Bandwidth:** 6144 KB at 2 TB/s (est)

**L1 Cache + Shared Mem (combined in V100) Size & Bandwidth:** 128 KB at 150 TB/s (est)

**Inter-chip Bandwidth:** 32 GB/s (PCIe), or 300 GB/s (NVLink)

### Nvidia A100 Specs


### IPUs

#### Design Philosophy

Designed to satisfy:

- Irregular fine-grained computation → true MIMD

- Irregular data accesses

- High bandwidth, low latency memory access

Key design decisions:

- No penalty for different instructions

- No penalty for irregular memory access

- No shared memory, just very fast local scratchpad SRAM

#### Colossus™ MK2 GC200 IPU

**IPU-Tiles:** (1472)

**Like an SM:** each has its own memory and an independent program

**Unlike an SM:** there are far more of them, and each has a 1-1 correspondence with a core

**IPU Cores:** (1 per tile)

Each has 6 independent (different instructions?) program threads

**Accumulating Matrix Product (AMP) units:** (1 within each IPU core)

Similar idea to tensor cores

64 mixed-precision or 16 single-precision floating-point operations per clock cycle

**In-Processor Memory:**

**Size:** (624 KB per tile, 900 MB total; SRAM)

Local (i.e. non-shared)

For code and data

**Less total memory** than GPU DRAM (20x)

**More total memory** than GPU L2 cache (150x)

**More per-tile memory** than GPU L1 cache (5x)

**Bandwidth:** (47.5 TB/s)

**Far higher** than GPU DRAM (50x), and even L2 cache (10x, *though estimated*)

**Inter-tile bandwidth:** 8 TB/s IPU-Exchange, supporting *any* communication pattern (10x GPU DRAM)

**Inter-chip bandwidth:** 320 GB/s IPU-Links (same as GPU NVLink, 10x GPU PCIe)

Support for single and half-precision.

Can connect multiple IPUs and treat them as one large one (from programmer's perspective)

#### BSP

Designed around the idea of BSP (bulk synchronous parallelism):

- **Local computation**: all tiles compute independently

- **Sync**: wait for all tiles to finish

- **Exchange**: data moves across the IPU-Exchange

Poplar SDK describes computation as vertices, data exchange as (static) edges, and data as tensors; the rest is done by the compiler.
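The three-phase superstep can be sketched with ordinary threads (illustrative only; the `tile` function and buffers are made up, and this is not the Poplar API):

```python
import threading

# Toy BSP: each "tile" computes locally, waits at a barrier (sync),
# then exchanges by reading its neighbour's published result.
N_TILES = 4
local_out = [None] * N_TILES
received = [None] * N_TILES
barrier = threading.Barrier(N_TILES)

def tile(i):
    local_out[i] = i * i                         # 1. local computation
    barrier.wait()                               # 2. sync: wait for all tiles
    received[i] = local_out[(i + 1) % N_TILES]   # 3. exchange with neighbour

threads = [threading.Thread(target=tile, args=(i,)) for i in range(N_TILES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)  # [1, 4, 9, 0]
```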

### Mixed Precision Training

(see Number Formats for more info on formats)

Benefits of sub-32-bit fp training:

- Less memory usage

- Less memory bandwidth required (local and network)

- Math faster

**Mixed precision training:** Identifies the steps that require 32-bit precision, and uses 16-bit elsewhere.

Steps to use:

- Porting the model to use the FP16 data type where appropriate.

- Adding loss scaling to preserve small gradient values.

#### Tensor Core Math

Tensor Cores perform `D = A x B + C`:

- `A` and `B` are half-precision 4x4 matrices

- `D` and `C` can be either half or single precision 4x4 matrices, which determines the precision of the output

This gives 8x throughput compared with single-precision math pipelines.

We can split a large matrix into blocks, and then do a matmul in the "standard way", with our blocks in the position of scalars. The complexity is unchanged, and we can now decompose into operations of the right size for e.g. tensor cores.
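A minimal NumPy sketch of this block decomposition (pure illustration, with 4x4 blocks standing in for tensor-core tiles):

```python
import numpy as np

def blocked_matmul(A, B, bs=4):
    """Matrix multiply decomposed into bs×bs blocks: blocks play the role
    of scalars in the standard algorithm, and block products accumulate
    into the output. Same arithmetic, just regrouped."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):  # accumulate block products
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
assert np.allclose(blocked_matmul(A, B), A @ B)  # matches the full matmul
```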

### Scaling Issues

A potential problem with half-precision training: many gradient values are smaller than FP16 can represent, and underflow to zero.

**Solution:** loss scaling ⬇️ (all in FP16)

- Multiply the loss by a scale factor

- Standard backprop (the chain rule ensures the scaling propagates to all gradients)

- Multiply the weight gradient by the inverse of the scale factor and feed it to the optimiser
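The steps above, sketched numerically (the scale factor of 1024 and the gradient value are illustrative):

```python
import numpy as np

# A gradient of 1e-8 is below FP16's smallest subnormal (~6e-8) and
# underflows to zero. Scaling the loss by S first keeps it representable
# through FP16 backprop; dividing by S recovers it for the optimiser.
S = np.float32(1024.0)
true_grad = np.float32(1e-8)

without_scaling = np.float16(true_grad)   # underflows to 0.0
scaled_grad = np.float16(true_grad * S)   # survives in FP16
recovered = np.float32(scaled_grad) / S   # unscale before the update

print(float(without_scaling), float(recovered))
```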

**Exceptions to the all in FP16 rule:**

- Some cases will require an FP32 copy of the weights to be maintained and updated, though fwd & back-prop still FP16.

- Large reductions (e.g. batch-norm mean & var statistics) should still be FP32.

**Choosing a scaling factor:**

*Fixed scaling factors* can work well (e.g. 8-32K). If gradient statistics are available then a suitable factor can easily be computed; otherwise consider treating it as a hyperparameter.

For *dynamic scaling factors*: if no overflow occurs for a chosen number of iterations N, increase the scaling factor. If an overflow occurs, skip the weight update and decrease the scaling factor.

#### Automatic Mixed Precision

Libraries that implement AMP do the following:

- Convert the model to use the **float16** data type **where possible**.

- Keep **float32 master weights** to accumulate per-iteration weight updates.

- Use **loss scaling** to preserve small gradient values.

#### Inference Quantisation

We can use even lower-precision formats like INT8 during inference.

One way to counter the performance loss here is **quantisation-aware training**, where we simulate lower precision in the forward pass and loss, but update full-precision weights.
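A minimal sketch of symmetric per-tensor INT8 quantisation (illustrative only; `quantise`/`dequantise` are made-up helpers, not any particular library's API):

```python
import numpy as np

def quantise(x):
    """Map floats to int8 with a single per-tensor scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 2.0], dtype=np.float32)
q, s = quantise(w)
w_hat = dequantise(q, s)
print(np.abs(w - w_hat).max() <= s / 2)  # error bounded by half a step
```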