Contents: GPUs (Execution Model · Core Types · Performance · Tesla V100 Specs · Nvidia A100 Specs) · IPUs (Design Philosophy · Colossus™ MK2 GC200 IPU · BSP) · Mixed Precision Training (Tensor Core Math · Scaling Issues · Automatic Mixed Precision · Inference Quantisation)
- L2 cache
- Streaming multiprocessors
- (Within SMs) [FP32|FP64|Tensor] Cores
2-level parallelism hierarchy:
- Parallel functions broken into a set of thread blocks, each of which is assigned to ➡️ an SM
- Thread blocks executed concurrently = a wave
- Within an SM, each thread is assigned to ➡️ an instruction pipeline
- threads within an SM can communicate via shared memory & synchronise
- For SIMD we operate on contiguous vectors
- Here we have multiple same-instruction threads operating on scalars
- This removes the contiguity constraint
We need more thread blocks than SMs / more threads than instruction pipelines because
- Switching between threads hides instruction (and memory-access) latency
- Less waiting on the "tail": the final, partially-full wave of blocks
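A toy model of the tail effect (this assumes one resident block per SM per wave, which is a simplification — real occupancy allows several):

```python
import math

def wave_utilization(n_blocks, n_sms=108):
    """Fraction of SM slots doing useful work, assuming one block
    per SM per wave (a simplification of real occupancy)."""
    waves = math.ceil(n_blocks / n_sms)
    return n_blocks / (waves * n_sms)

print(wave_utilization(108))  # 1.0  : one exactly-full wave
print(wave_utilization(110))  # ~0.51: the 2-block tail wave idles 106 SMs
```

Oversubscribing with many more blocks than SMs shrinks the relative cost of that final partial wave.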
CUDA cores: general cores
Tensor cores: faster matmuls on small matrices - e.g. fp16 inputs but accumulated in fp32
FLOPS isn't everything: memory bandwidth matters too!
Example FLOPS calculation: An A100 has 108 SMs, a 1.41 GHz clock rate, and can do 1024 FP16 FMAs per SM per clock cycle. Counting each FMA as 2 FLOPs: 108 * 1024 * 2 * 1.41 * 10^9 ≈ 312 TFLOPS
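The same calculation as a few lines of Python (each FMA counted as two FLOPs, which yields the commonly quoted 312 TFLOPS dense FP16 figure):

```python
sms = 108                     # streaming multiprocessors
clock_hz = 1.41e9             # clock rate
fmas_per_sm_per_clock = 1024  # FP16 Tensor Core FMAs per SM per cycle
flops_per_fma = 2             # one multiply + one add

peak_flops = sms * clock_hz * fmas_per_sm_per_clock * flops_per_fma
print(f"{peak_flops / 1e12:.0f} TFLOPS")  # 312 TFLOPS
```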
For an algorithm:
- Memory time: T_mem = (bytes accessed) / BW_mem (memory bandwidth)
- Math time: T_math = (ops) / BW_math (math bandwidth)
Latency only tends to be an issue if there's not sufficient parallelism.
We assume memory access time and math time can largely be overlapped.
Therefore, we are math limited if T_math > T_mem, i.e. if ops / bytes > BW_math / BW_mem, and otherwise memory limited.
Hence if an algorithm's arithmetic intensity (ops / bytes) is higher than the processor's ops:byte ratio (BW_math / BW_mem) then we are math limited, and otherwise memory limited.
Example FLOP/B calculation: "A V100 has a peak math rate of 125 FP16 Tensor TFLOPS, an on-chip L2 bandwidth of 3.1 TB/s, and an off-chip memory bandwidth of approx. 0.9 TB/s, giving it an ops:byte ratio between 40 and 139"
Most operations have low arithmetic intensity and so are memory bounded! The main exception is large linear algebra operations.
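A rough check of this for matmuls, using the V100 figures above (2 bytes per FP16 element; this ignores caching and re-use effects):

```python
def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """Ops/byte for an (m x k) @ (k x n) matmul: 2*m*n*k FLOPs
    (multiply + add per MAC) over reading A, B and writing C once."""
    ops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return ops / bytes_moved

ops_to_byte = 125e12 / 0.9e12  # V100: 125 TFLOPS / 0.9 TB/s ≈ 139

print(arithmetic_intensity(4096, 4096, 4096))  # ~1365: math limited
print(arithmetic_intensity(1, 4096, 4096))     # ~1.0:  memory limited
```

The second case (a matrix-vector product, as in batch-1 inference) is why small-batch workloads are so often bandwidth-bound.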
- Throughput significantly below theoretical peak if memory access is non-contiguous
- For best performance, instantiate large blocks of threads that share the same program - because of SIMD (within SMs?)
[FP32|FP64|Tensor] Cores / SM: 64 | 32 | 8 (Total: 5120 | 2560 | 640)
Peak [FP32|FP64|Tensor] TFLOPS: 15.7 | 7.8 | 125
Memory Size & Bandwidth: 16 GB at 900 GB/s (HBM2)
L2 Cache Size & Bandwidth: 6144 KB at 2 TB/s (est)
L1 Cache + Shared Mem (combined in V100) Size & Bandwidth: 128 KB at 150 TB/s (est)
Inter-chip Bandwidth: 32 GB/s (PCIe), or 300 GB/s (NVLINK)
Designed to satisfy:
- Irregular fine-grained computation → true MIMD
- Irregular data accesses
- High bandwidth, low latency memory access
Key design decisions:
- No penalty for different instructions
- No penalty for irregular memory access
- No shared memory, just v fast local scratchpad SRAM
Tiles:
- Like an SM: each has its own memory and an independent program
- Unlike an SM: there are far more of them, and each has a 1-1 correspondence with its core
IPU Cores: (1-per tile)
Each has 6 independent (different instructions?) program threads
Accumulating Matrix Product (AMP) units: (1 within each IPU core)
Similar idea to tensor cores
64 mixed-precision or 16 single-precision floating point operations per clock cycle
Size: (624KB per-tile, 900MB total; SRAM)
Local (i.e. non-shared)
For code and data
- 20x less total mem than GPU DRAM
- 150x more total mem than GPU L2 cache
- 5x more per-tile mem than GPU L1 cache
Bandwidth: (47.5 TB/s)
Far higher than GPU DRAM (50x), and even L2 cache (10x, though that's an estimate)
Inter-tile bandwidth: 8 TB/s IPU-Exchange, any communication pattern (10x GPU DRAM)
Inter-chip bandwidth: 320 GB/s IPU-Links (same as GPU NVLINK, 10x GPU PCIe)
Support for single and half-precision.
Can connect multiple IPUs and treat them as one large one (from programmer's perspective)
Designed around idea of BSP:
- all do local computation
- sync = wait for all to finish
- exchange across IPU-exchange
Poplar SDK describes computation as vertices, data exchange as (static) edges, and data as tensors ➡️ rest is done by compiler
(see Number Formats for more info on formats)
Benefits of sub-32-bit fp training:
- Less memory usage
- Less memory bandwidth required (local and network)
- Math faster
Mixed precision training: Identifies the steps that require 32-bit, and uses 16-bit elsewhere.
Steps to use:
- Porting the model to use the FP16 data type where appropriate.
- Adding loss scaling to preserve small gradient values.
Tensor Cores perform
D = A x B + C
A and B are half-precision 4x4 matrices
C can be either half or single precision ➡️ which determines the precision of the output
8x throughput compared with single-precision math pipelines
We can split a large matrix into blocks, and then do a matmul in the "standard way", with our blocks in the position of scalars. The complexity is unchanged, and we can now decompose into operations of the right size for e.g. tensor cores.
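A sketch of this blocked decomposition in numpy (the block size is arbitrary here; Tensor Cores consume 4x4 tiles):

```python
import numpy as np

def blocked_matmul(a, b, block=4):
    """Matmul by tiling into block x block sub-matrices.
    Dimensions are assumed to be multiples of `block` for simplicity."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % block == 0 and k % block == 0 and n % block == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, block):
        for j in range(0, n, block):
            for p in range(0, k, block):
                # Same triple loop as a scalar matmul, but each "multiply"
                # is itself a small matmul: the shape a Tensor Core consumes.
                c[i:i+block, j:j+block] += (
                    a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
                )
    return c

a = np.random.rand(8, 8).astype(np.float32)
b = np.random.rand(8, 8).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-5)
```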
(Missing image: histogram of gradient magnitudes.) The potential problem with half-precision training: many gradient values fall below FP16's representable range and underflow to zero.
Solution: loss scaling ⬇️ (all in FP16)
- Multiply the loss by a scale factor S
- Standard backprop (chain rule ensures the scaling propagates to all gradients)
- Multiply the weight gradient by 1/S and feed it to the optimiser
Exceptions to the all in FP16 rule:
- Some cases will require an FP32 copy of the weights to be maintained and updated, though fwd & back-prop still FP16.
- Large reductions (e.g. batch-norm mean & var statistics) should still be FP32.
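Both the scaling trick and the FP32-master-weights exception can be seen in a few lines of numpy (scale value illustrative):

```python
import numpy as np

S = 1024.0  # scale factor (illustrative; 8-32K is the typical range)

# A gradient this small underflows to zero in FP16...
grad = 1e-8
assert np.float16(grad) == 0.0

# ...but scaling the loss (and hence every gradient) first preserves it.
scaled = np.float16(grad * S)       # representable as an FP16 subnormal
recovered = np.float32(scaled) / S  # unscale before the optimiser step
assert abs(recovered - grad) / grad < 0.01

# Why keep FP32 master weights: a small update vanishes if applied in
# FP16 (FP16 spacing around 1.0 is ~1e-3), but accumulates in FP32.
assert np.float16(1.0) + np.float16(1e-4) == np.float16(1.0)
assert np.float32(1.0) + np.float32(1e-4) > np.float32(1.0)
```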
Choosing a scaling factor:
Fixed scaling factors can work well (e.g. 8-32K). If gradient statistics are available then a suitable factor can easily be computed; otherwise consider treating it as a hyperparameter.
For dynamic scaling factors, if no overflow occurs for a chosen number of iterations N then increase the scaling factor. If an overflow occurs, skip the weight update and decrease the scaling factor.
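A minimal sketch of that dynamic scheme (class and parameter names are illustrative; real implementations such as PyTorch's GradScaler differ in detail):

```python
class DynamicLossScaler:
    """Sketch of dynamic loss scaling: grow the scale after N clean
    iterations, back off and skip the update on overflow."""

    def __init__(self, init_scale=2.0**15, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval  # N overflow-free iterations
        self.factor = factor
        self._good_steps = 0

    def update(self, overflowed):
        """Call once per iteration; returns True if the weight update
        should be applied, False if it should be skipped."""
        if overflowed:
            self.scale /= self.factor  # back off
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.factor  # no overflow for N steps: grow
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
scaler.update(False); scaler.update(False)
print(scaler.scale)         # 16.0  : grew after 2 clean steps
print(scaler.update(True))  # False : overflow, skip the weight update
print(scaler.scale)         # 8.0
```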
Libraries that implement AMP do the following:
- Convert the model to use the float16 data type where possible.
- Keep float32 master weights to accumulate per-iteration weight updates.
- Use loss scaling to preserve small gradient values.
We can use even lower-precision formats like INT8 during inference.
One way to counter the performance loss here is quantisation-aware training, where we simulate lower precision in the forward pass and loss, but update full-precision weights.
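As a concrete example, symmetric per-tensor INT8 quantisation can be sketched as follows (helper names are illustrative):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantisation: x ~= scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(256).astype(np.float32)
q, s = quantize_int8(x)
# Round-trip error is bounded by half a quantisation step.
assert np.abs(dequantize(q, s) - x).max() <= s / 2 + 1e-6
```

Quantisation-aware training inserts this round-trip ("fake quantisation") into the forward pass so the loss sees INT8 effects, while gradients still update the FP32 weights.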