🖼️

Convolutional Neural Networks


Basic case

Dimensions: $x \in \mathbb{R}^{w}$, $\theta \in \mathbb{R}^{k}$, $y \in \mathbb{R}^{W}$ with $W = w/k$
(Note: throughout these notes the stride equals the kernel size, i.e. the windows don't overlap → this is what keeps the reshapes simple)
x = arange(w)            # input of size w
θ = arange(k)            # weights (kernel of size k)
_x = x.reshape(w//k,k)   # non-overlapping windows of size k
y = einsum('Wk,k->W', _x, θ)
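A runnable NumPy version of the sketch above (concrete sizes are arbitrary), checked against an explicit loop over output positions:

import numpy as np

w, k = 12, 3                        # arbitrary input length and kernel size
x = np.random.randn(w)
theta = np.random.randn(k)

_x = x.reshape(w//k, k)             # non-overlapping windows (stride = k)
y = np.einsum('Wk,k->W', _x, theta)

# same thing with an explicit loop over output positions
y_ref = np.array([x[i*k:(i+1)*k] @ theta for i in range(w//k)])
assert np.allclose(y, y_ref)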

2D convolutions

Basic

Dimensions: $x \in \mathbb{R}^{w \times h}$, $\theta \in \mathbb{R}^{k \times l}$, $y \in \mathbb{R}^{W \times H}$ with $W = w/k$, $H = h/l$
The $k \times l$ matrix $\theta$ is typically what we mean by a "kernel"
x = arange(w,h)
θ = arange(k,l)
_x = x.reshape(w//k,h//l,k,l)
y = einsum('WHkl,kl->WH', _x, θ)

With input channel

The $k \times l \times c$ tensor $\theta$ is typically what we mean by a "filter"
x = arange(w,h,c)
θ = arange(k,l,c)
_x = x.reshape(w//k,h//l,k,l,c)
y = einsum('WHklc,klc->WH', _x, θ)

With batch, input & output channel

Complexity: $O(b \cdot W \cdot H \cdot k \cdot l \cdot c \cdot C)$ (which here equals $b \cdot w \cdot h \cdot c \cdot C$, since $W = w/k$ and $H = h/l$)
This is the typical "convolution" operation
x = arange(b,w,h,c)
θ = arange(k,l,c,C)
_x = x.reshape(b,w//k,h//l,k,l,c)
y = einsum('bWHklc,klcC->bWHC', _x, θ)
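Treating the reshape above as schematic, here is a runnable NumPy sketch of the same contraction (sizes are arbitrary). Note that literally carving x into non-overlapping k×l patches needs an extra transpose to interleave the window axes; we check against an explicit loop:

import numpy as np

b, w, h, c, C, k, l = 2, 8, 6, 3, 4, 2, 3
x = np.random.randn(b, w, h, c)
theta = np.random.randn(k, l, c, C)

# extract non-overlapping k×l patches: (b, W, H, k, l, c)
_x = x.reshape(b, w//k, k, h//l, l, c).transpose(0, 1, 3, 2, 4, 5)
y = np.einsum('bWHklc,klcC->bWHC', _x, theta)

# check against an explicit loop over output positions
y_ref = np.zeros((b, w//k, h//l, C))
for W in range(w//k):
    for H in range(h//l):
        patch = x[:, W*k:(W+1)*k, H*l:(H+1)*l, :]          # (b, k, l, c)
        y_ref[:, W, H, :] = np.einsum('bklc,klcC->bC', patch, theta)
assert np.allclose(y, y_ref)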

3D convolution

x = arange(b,w,h,c)
θ = arange(k,l,m)                        # m is the kernel size along the channel dim
_x = x.reshape(b,w//k,h//l,c//m,k,l,m)
y = einsum('bWHCklm,klm->bWHC', _x, θ)   # note: here C = c//m (channels are also downsampled)

1x1 Convolution

Basic (really a $k=1$ conv)

(output is not just $x$, but a scalar multiple of it: $y = \theta \cdot x$)
x = arange(w)
θ = arange(1)    # k=1
_x = x.reshape(w,1)
y = einsum('wk,k->w', _x, θ)

Filter

$\theta \in \mathbb{R}^{c}$, or really: $\theta \in \mathbb{R}^{1 \times 1 \times c}$
💡
In a sense we always have a size-1 convolution applied over the batch dimension, and a max-size convolutional kernel over the (input) channel dimension → in a way every dimension has a kind of convolution applied!
x = arange(b,w,h,c)
θ = arange(c)              # or arange(1,1,c)
_x = x.reshape(b,w,h,c)    # or x.reshape(b,w,h,1,1,c)
y = einsum('bwhc,c->bwh', _x, θ)   # or 'bwhklc,klc->bwh'

Full 1x1 convolution

This is what is typically meant by a 1x1 conv: it amounts to basically just the standard kind of linear layer we see in a regular FFN, but treating W & H as "batch" insofar as we leave them untouched → this is what enables any-sized inputs to our convnet!
This is really useful for dimensionality reduction/change → the typical purpose of such a layer
⚠️
Be careful! Make sure you don't confuse convolutional kernel, filter, and layer. E.g. a 1x1 convolutional kernel and filter are quite different!
x = arange(b,w,h,c)
θ = arange(c,C)
_x = x.reshape(b,w,h,c)
y = einsum('bwhc,cC->bwhC', _x, θ)
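A quick runnable check (arbitrary sizes) that this really is just one linear layer applied at every (b, w, h) position:

import numpy as np

b, w, h, c, C = 2, 5, 7, 3, 4
x = np.random.randn(b, w, h, c)
theta = np.random.randn(c, C)

y = np.einsum('bwhc,cC->bwhC', x, theta)

# identical to a per-position matmul over the channel dim
assert np.allclose(y, x @ theta)
assert np.allclose(y, (x.reshape(-1, c) @ theta).reshape(b, w, h, C))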

Separable convolutions

Spatially separable convolution

Here we factor the kernel into a vertical and horizontal kernel. The below approach is equivalent to taking the outer product of these kernels and doing a standard 2D convolution.
Note that the intermediate output has to have both the input and output channel dimensions.
Complexity: $O(b \cdot W \cdot h \cdot k \cdot c \cdot C + b \cdot W \cdot H \cdot l \cdot c \cdot C)$ (one term per stage)
x = arange(b,w,h,c)
t1 = arange(k,c,C)   # 1D kernel over the w axis
t2 = arange(l,c,C)   # 1D kernel over the h axis
_x = x.reshape(b,w//k,h,k,c)
inter = einsum('bWhkc,kcC->bWhcC', _x, t1)
_inter = inter.reshape(b,w//k,h//l,l,c,C)
y = einsum('bWHlcC,lcC->bWHC', _inter, t2)
Ratio of complexity vs regular: $\frac{k+l}{k \cdot l}$ (counting a standard stride-1 convolution)
If $k = l$, this gives $\frac{2}{k}$.
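To make the outer-product equivalence above concrete, a runnable NumPy sketch (arbitrary sizes; the transposes do the patch extraction that the schematic reshapes gloss over):

import numpy as np

b, w, h, c, C, k, l = 2, 8, 6, 3, 4, 2, 3
x = np.random.randn(b, w, h, c)
t1 = np.random.randn(k, c, C)   # 1D kernel over the w axis
t2 = np.random.randn(l, c, C)   # 1D kernel over the h axis

# stage 1: convolve over w with the k-kernel (non-overlapping windows)
_x = x.reshape(b, w//k, k, h, c).transpose(0, 1, 3, 2, 4)    # (b, W, h, k, c)
inter = np.einsum('bWhkc,kcC->bWhcC', _x, t1)

# stage 2: convolve over h with the l-kernel
_inter = inter.reshape(b, w//k, h//l, l, c, C)               # (b, W, H, l, c, C)
y = np.einsum('bWHlcC,lcC->bWHC', _inter, t2)

# equivalent regular conv whose kernel is the outer product of t1 and t2 (per c, C)
theta = np.einsum('kcC,lcC->klcC', t1, t2)                   # (k, l, c, C)
_xp = x.reshape(b, w//k, k, h//l, l, c).transpose(0, 1, 3, 2, 4, 5)
y_ref = np.einsum('bWHklc,klcC->bWHC', _xp, theta)
assert np.allclose(y, y_ref)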

Depth-wise separable convolution

Here, instead of factoring the spatial dims $k \times l$, we do something similar (though not identical) on the channel dims $c \times C$.
We basically just split the spatial part and the output channel part of a regular 2D conv into separate steps:
This seems quite intuitive to me → each point in the convolutional filter doesn't appear to really need its own way of transforming the channel dimension, which is what this cuts out.
x = arange(b,w,h,c)
t1 = arange(k,l,c)   # depth-wise (spatial) stage
t2 = arange(c,C)     # point-wise (1x1) stage
_x = x.reshape(b,w//k,h//l,k,l,c)
inter = einsum('bWHklc,klc->bWHc', _x, t1)
y = einsum('bWHc,cC->bWHC', inter, t2)
Complexity: $O(b \cdot W \cdot H \cdot k \cdot l \cdot c + b \cdot W \cdot H \cdot c \cdot C)$ (one term per stage)
Ratio of complexity vs regular: $\frac{k \cdot l \cdot c + c \cdot C}{k \cdot l \cdot c \cdot C} = \frac{1}{C} + \frac{1}{k \cdot l}$
If $C \gg k \cdot l$, this gives $\approx \frac{1}{k \cdot l}$.
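A runnable sketch of the two stages (arbitrary sizes), plus the parameter count vs a regular conv with the same kernel size and channels:

import numpy as np

b, w, h, c, C, k, l = 2, 8, 6, 3, 16, 2, 3
x = np.random.randn(b, w, h, c)
t1 = np.random.randn(k, l, c)    # depth-wise stage: one k×l kernel per input channel
t2 = np.random.randn(c, C)       # point-wise (1x1) stage: mixes channels

_x = x.reshape(b, w//k, k, h//l, l, c).transpose(0, 1, 3, 2, 4, 5)   # (b, W, H, k, l, c)
inter = np.einsum('bWHklc,klc->bWHc', _x, t1)   # spatial filtering, channels kept separate
y = np.einsum('bWHc,cC->bWHC', inter, t2)       # channel mixing

# parameter count vs a regular conv of the same receptive field and channels
print(k*l*c + c*C, 'vs', k*l*c*C)               # here: 66 vs 288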

Flattened convolution

Like spatially separable convolution, but factoring the channel dimension too, and applying it first. Note that each kernel has an output channel dim which is maintained throughout:
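A sketch of one plausible way to write this (the staging and variable names are my own reading of the description above, not a fixed convention): the $k \times l \times c$ filter is factored into three 1D kernels, one over $c$ (applied first) and one over each spatial axis, each keeping the output-channel dim $C$. The result equals a regular conv whose kernel is the outer product of the three:

import numpy as np

b, w, h, c, C, k, l = 2, 8, 6, 3, 4, 2, 3
x = np.random.randn(b, w, h, c)
t_c = np.random.randn(c, C)    # channel kernel (applied first)
t_k = np.random.randn(k, C)    # kernel over the w axis
t_l = np.random.randn(l, C)    # kernel over the h axis

# 1) channel stage: like a 1x1 conv, introduces the output-channel dim up front
i1 = np.einsum('bwhc,cC->bwhC', x, t_c)
# 2) w stage: convolve over w, keeping C (each output channel has its own k-kernel)
_i1 = i1.reshape(b, w//k, k, h, C).transpose(0, 1, 3, 2, 4)      # (b, W, h, k, C)
i2 = np.einsum('bWhkC,kC->bWhC', _i1, t_k)
# 3) h stage: convolve over h, keeping C
_i2 = i2.reshape(b, w//k, h//l, l, C)                            # (b, W, H, l, C)
y = np.einsum('bWHlC,lC->bWHC', _i2, t_l)

# equivalent regular conv with the rank-1 kernel t_c ⊗ t_k ⊗ t_l (per output channel)
theta = np.einsum('cC,kC,lC->klcC', t_c, t_k, t_l)
_x = x.reshape(b, w//k, k, h//l, l, c).transpose(0, 1, 3, 2, 4, 5)
y_ref = np.einsum('bWHklc,klcC->bWHC', _x, theta)
assert np.allclose(y, y_ref)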

Grouped convolutions

Standard group conv

Here we simply split the channel dimension into $g$ "groups" and handle each individually. We then apply $g$ independent convolutional layers of out-dim $C_g$ and then reshape back to normal at the end:
(Note: the subscript $g$ indicates that the size of these dims is divided by $g$, e.g. $C_g = C/g$; in the einsum these show up as s and S)
x = arange(b,w,h,c)
θ = arange(k,l,g,c//g,C//g)
_x = x.reshape(b,w//k,h//l,k,l,g,c//g)
_y = einsum('bWHklgs,klgsS->bWHgS', _x, θ)   # s = c/g, S = C/g
y = _y.reshape(b,w//k,h//l,C)
Complexity: $O\left(\frac{b \cdot W \cdot H \cdot k \cdot l \cdot c \cdot C}{g}\right)$
Ratio of complexity vs regular: $\frac{1}{g}$
Vs model parallelism:
The only difference here is that we both split at the beginning and concat at the end.
We can think of the model-parallel (i.e. sharding) approach as splitting the tensor along one axis, whereas grouped operations are equivalent to a block-diagonal matrix.
Similarity to depth-wise separable convolution
If we were to set $g = c$ (with $C = c$), then group conv becomes the same as depthwise-separable conv's first stage. We could also add to group conv the second 1x1 stage of depthwise-separable conv if we wanted to.
We can think of group conv as a depthwise separable conv (first half) "wrapped around" smaller, independent regular convs. When those smaller convs have depth=1 we get pure depthwise.
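A runnable check of the "split at the beginning, concat at the end" view (arbitrary sizes): the grouped einsum gives the same result as running $g$ independent regular convs and concatenating along the channel dim:

import numpy as np

b, w, h, c, C, k, l, g = 2, 8, 6, 4, 8, 2, 3, 2
x = np.random.randn(b, w, h, c)
theta = np.random.randn(k, l, g, c//g, C//g)   # one independent kernel per group

_x = x.reshape(b, w//k, k, h//l, l, g, c//g).transpose(0, 1, 3, 2, 4, 5, 6)  # (b, W, H, k, l, g, c/g)
_y = np.einsum('bWHklgs,klgsS->bWHgS', _x, theta)
y = _y.reshape(b, w//k, h//l, C)

# same thing as g independent "regular" convs, concatenated along the channel dim
chunks = []
for i in range(g):
    xi = x[..., i*(c//g):(i+1)*(c//g)]          # this group's slice of input channels
    _xi = xi.reshape(b, w//k, k, h//l, l, c//g).transpose(0, 1, 3, 2, 4, 5)
    chunks.append(np.einsum('bWHkls,klsS->bWHS', _xi, theta[:, :, i]))
y_ref = np.concatenate(chunks, axis=-1)
assert np.allclose(y, y_ref)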

Shuffled group conv

Takes the output of a group conv layer and shuffles the channels between groups.
This enables the next group conv layer to leverage information across all previous groups, rather than just its corresponding group.
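A minimal sketch of the shuffle itself (the usual reshape → transpose → reshape trick; the function name is mine):

import numpy as np

def channel_shuffle(x, g):
    """Interleave channels across the g groups of a (b, w, h, c) tensor."""
    b, w, h, c = x.shape
    return x.reshape(b, w, h, g, c//g).transpose(0, 1, 2, 4, 3).reshape(b, w, h, c)

x = np.arange(2*4*4*6).reshape(2, 4, 4, 6)
print(x[0, 0, 0], channel_shuffle(x, 3)[0, 0, 0])   # [0 1 2 3 4 5] -> [0 2 4 1 3 5]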

Point-wise group conv

Applying grouping to 1x1 convolutional layers
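A sketch in the same spirit (arbitrary sizes): the channel-mixing matrix becomes block-diagonal, with one $c/g \times C/g$ block per group:

import numpy as np

b, w, h, c, C, g = 2, 5, 7, 4, 8, 2
x = np.random.randn(b, w, h, c)
theta = np.random.randn(g, c//g, C//g)       # one small mixing matrix per group

_x = x.reshape(b, w, h, g, c//g)
y = np.einsum('bwhgs,gsS->bwhgS', _x, theta).reshape(b, w, h, C)

# equivalent to a full 1x1 conv whose c×C weight matrix is block-diagonal
W = np.zeros((c, C))
for i in range(g):
    W[i*(c//g):(i+1)*(c//g), i*(C//g):(i+1)*(C//g)] = theta[i]
assert np.allclose(y, x @ W)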