### About

This page is based primarily on material from this fantastic blog post.

### 1D convolution

```python
x = arange(w)                  # input of size w
θ = arange(k)                  # weights (k divides w)
_x = x.reshape(w//k, k)        # split into w//k non-overlapping blocks of length k
y = einsum('Wk,k->W', _x, θ)   # W = w//k outputs, one per block
```
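
To make the reshape trick concrete, here's a quick runnable check (my own sketch, not from the source post; sizes chosen arbitrarily). Note that the reshape gives a stride-k, non-overlapping convolution rather than a sliding-window one:

```python
import numpy as np

w, k = 8, 2                          # input length, kernel length (k divides w)
x = np.arange(w, dtype=float)        # input of size w
theta = np.arange(k, dtype=float)    # weights

_x = x.reshape(w//k, k)              # w//k non-overlapping blocks of length k
y = np.einsum('Wk,k->W', _x, theta)  # one output per block

# Same computation as an explicit loop over the blocks
y_loop = np.array([x[i*k:(i+1)*k] @ theta for i in range(w//k)])
assert np.allclose(y, y_loop)
```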

### 2D convolutions

```python
# (pseudocode: arange(w, h) just stands for an array of shape (w, h))
x = arange(w, h)
θ = arange(k, l)
_x = x.reshape(w//k, k, h//l, l)     # carve the input into non-overlapping k x l tiles
y = einsum('WkHl,kl->WH', _x, θ)
```

```python
x = arange(w, h, c)
θ = arange(k, l, c)                  # the kernel spans all c input channels
_x = x.reshape(w//k, k, h//l, l, c)
y = einsum('WkHlc,klc->WH', _x, θ)   # summing over c → a single output channel
```

```python
x = arange(b, w, h, c)
θ = arange(k, l, c, C)               # C output channels (C filters)
_x = x.reshape(b, w//k, k, h//l, l, c)
y = einsum('bWkHlc,klcC->bWHC', _x, θ)
```
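
A quick runnable sanity check of the blocked 2D version (my own sketch, arbitrary sizes) → the blocked einsum matches an explicit loop over the non-overlapping k x l tiles:

```python
import numpy as np

b, w, h, c, C = 2, 4, 6, 3, 5
k, l = 2, 3
x = np.random.rand(b, w, h, c)
theta = np.random.rand(k, l, c, C)

_x = x.reshape(b, w//k, k, h//l, l, c)          # carve x into k x l tiles
y = np.einsum('bWkHlc,klcC->bWHC', _x, theta)

# Same computation, looping over the tiles explicitly
y_loop = np.zeros((b, w//k, h//l, C))
for W in range(w//k):
    for H in range(h//l):
        tile = x[:, W*k:(W+1)*k, H*l:(H+1)*l, :]              # (b, k, l, c)
        y_loop[:, W, H, :] = np.einsum('bklc,klcC->bC', tile, theta)
assert np.allclose(y, y_loop)
```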

```python
x = arange(b, w, h, c)
θ = arange(k, l, m)                  # the kernel now spans only m of the c channels
_x = x.reshape(b, w//k, k, h//l, l, c//m, m)
y = einsum('bWkHlCm,klm->bWHC', _x, θ)   # here C = c//m channel blocks
```

### 1x1 Convolution

```python
x = arange(w)
θ = arange(1)                # k = 1
_x = x.reshape(w, 1)
y = einsum('wk,k->w', _x, θ)
```

#### Filter

In a sense we always have a size-1 convolution applied over the batch dimension, and a max-size kernel over the output-channel dimension → in a way, every dimension has some kind of convolution applied!

Or really, a single 1x1 filter spanning all the input channels:

```python
x = arange(b, w, h, c)
θ = arange(c)                    # or arange(1, 1, c)
_x = x.reshape(b, w, h, c)       # or x.reshape(b, w, h, 1, 1, c)
y = einsum('bwhc,c->bwh', _x, θ) # or 'bwhklc,klc->bwh'
```

#### Full 1x1 convolution

This is what is typically meant by a 1x1 conv: it amounts to basically just the standard kind of linear layer we see in a regular FFN, but treating W & H as "batch" insofar as we leave them untouched → this is what enables any-sized inputs to our convnet!

This is really useful for dimensionality reduction/change → the typical purpose of such a layer.

Be careful! Make sure you don't confuse convolutional kernel, filter, and layer. E.g. a 1x1 convolutional kernel and filter are quite different!

```python
x = arange(b, w, h, c)
θ = arange(c, C)
_x = x.reshape(b, w, h, c)           # no-op reshape: w and h are left untouched
y = einsum('bwhc,cC->bwhC', _x, θ)
```
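
A quick check of the claim above (my own sketch, arbitrary sizes) → the 1x1 conv einsum is literally a plain linear layer with b, w, h all flattened into one batch axis:

```python
import numpy as np

b, w, h, c, C = 2, 4, 6, 3, 5
x = np.random.rand(b, w, h, c)
theta = np.random.rand(c, C)

y_conv = np.einsum('bwhc,cC->bwhC', x, theta)
y_linear = (x.reshape(b*w*h, c) @ theta).reshape(b, w, h, C)   # FFN-style matmul
assert np.allclose(y_conv, y_linear)
```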

### Separable convolutions

#### Spatially separable convolution

Here we factor the kernel into a vertical and horizontal kernel. The below approach is equivalent to taking the outer product of these kernels and doing a standard 2D convolution.

Note that the intermediate output has to have both the input and output channel dimensions.

**Complexity:** applying the length-k kernel costs kcC multiply-adds per output position and the length-l kernel lcC, vs klcC for the full k × l kernel (for a standard sliding-window conv).

```python
x = arange(b, w, h, c)
t1 = arange(k, c, C)                     # kernel along the w axis
t2 = arange(l, c, C)                     # kernel along the h axis
_x = x.reshape(b, w//k, k, h, c)
inter = einsum('bWkhc,kcC->bWhcC', _x, t1)       # intermediate keeps both c and C
_inter = inter.reshape(b, w//k, h//l, l, c, C)
y = einsum('bWHlcC,lcC->bWHC', _inter, t2)
```
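
A quick check of the outer-product claim above (my own sketch, arbitrary sizes):

```python
import numpy as np

b, w, h, c, C = 2, 4, 6, 3, 5
k, l = 2, 3
x = np.random.rand(b, w, h, c)
t1 = np.random.rand(k, c, C)
t2 = np.random.rand(l, c, C)

# Two-stage separable version, as above
inter = np.einsum('bWkhc,kcC->bWhcC', x.reshape(b, w//k, k, h, c), t1)
y_sep = np.einsum('bWHlcC,lcC->bWHC', inter.reshape(b, w//k, h//l, l, c, C), t2)

# Equivalent full kernel: the (k, l) outer product, taken per (c, C) pair
theta = np.einsum('kcC,lcC->klcC', t1, t2)
y_full = np.einsum('bWkHlc,klcC->bWHC', x.reshape(b, w//k, k, h//l, l, c), theta)
assert np.allclose(y_sep, y_full)
```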

**Ratio of complexity vs regular:**

(kcC + lcC) / klcC = 1/k + 1/l → if k = l, this gives 2/k.

#### Depth-wise separable convolution

Here, instead of factoring the kernel spatially (into k and l), we do something similar (though not identical) with the spatial and channel parts of the kernel.

We basically just split the spatial part and the output channel part of a regular 2D conv into separate steps:

This seems quite intuitive to me → each point in the convolutional filter doesn't really need its own way of transforming the channel dimension, which is what this cuts out.

```python
x = arange(b, w, h, c)
t1 = arange(k, l, c)                 # depth-wise stage: one k x l kernel per input channel
t2 = arange(c, C)                    # point-wise (1x1) stage
_x = x.reshape(b, w//k, k, h//l, l, c)
inter = einsum('bWkHlc,klc->bWHc', _x, t1)
y = einsum('bWHc,cC->bWHC', inter, t2)
```

**Complexity:** klc + cC multiply-adds per output spatial position (depth-wise stage + 1x1 stage), vs klcC for the regular conv.

**Ratio of complexity vs regular:**

(klc + cC) / klcC = 1/C + 1/(kl) → if C >> kl, this gives roughly 1/(kl).

#### Flattened convolution

Like spatially separable convolution, but factoring the channel dimension too, and applying it first. Note that each kernel has an output channel dim which is maintained throughout.
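
A sketch of what this might look like in the same blocked-einsum style (my own sketch, not from the source post; names and sizes are arbitrary) → the k·l·c kernel is factored into three 1D kernels, each keeping the output-channel dim C, applied over channels first, then along w, then along h:

```python
import numpy as np

b, w, h, c, C = 2, 4, 6, 3, 5
k, l = 2, 3
x = np.random.rand(b, w, h, c)

t_c = np.random.rand(c, C)     # 1 x 1 x c kernel (channel direction), applied first
t_k = np.random.rand(k, C)     # k x 1 kernel (w direction)
t_l = np.random.rand(l, C)     # 1 x l kernel (h direction)

s1 = np.einsum('bwhc,cC->bwhC', x, t_c)                                   # mix channels, keep C
s2 = np.einsum('bWkhC,kC->bWhC', s1.reshape(b, w//k, k, h, C), t_k)       # blocks along w
y  = np.einsum('bWHlC,lC->bWHC', s2.reshape(b, w//k, h//l, l, C), t_l)    # blocks along h
print(y.shape)                                                            # (2, 2, 2, 5) = (b, w//k, h//l, C)
```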

### Grouped convolutions

#### Standard group conv

Here we simply split the channel dimension into "groups" and handle each individually. We then apply g independent convolutional layers, each with out-dim C//g, and then reshape back to normal at the end:

(Note: in the einsum below, the lower-case s and upper-case S stand for the per-group channel dims, i.e. c and C divided by g.)

```python
x = arange(b, w, h, c)
θ = arange(k, l, g, c//g, C//g)          # one (k, l, c//g, C//g) kernel per group
_x = x.reshape(b, w//k, k, h//l, l, g, c//g)
_y = einsum('bWkHlgs,klgsS->bWHgS', _x, θ)
y = _y.reshape(b, w//k, h//l, C)         # merge (g, C//g) back into C
```

**Complexity:** klcC/g multiply-adds per output spatial position, vs klcC for the regular conv.

**Ratio of complexity vs regular:** 1/g (in both FLOPs and parameter count).

**Vs model parallelism:**

The only difference here is that we both split at the beginning *and* concat at the end. We can think of the model parallel (i.e. sharding) approach as splitting the tensor along one axis, whereas grouped operations are equivalent to a block-diagonal weight matrix.
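
A tiny illustration of the block-diagonal view (my own sketch, arbitrary sizes) → a grouped linear map over the channel dim is the same as multiplying by one big block-diagonal matrix:

```python
import numpy as np

g, c, C = 2, 6, 4                       # groups, in-channels, out-channels (both divisible by g)
x = np.random.rand(10, c)               # a batch of 10 "pixels", channels last
W_g = np.random.rand(g, c//g, C//g)     # one (c//g, C//g) weight block per group

# Grouped version: split the channels into g groups, transform each group independently
y_grouped = np.einsum('ngs,gsS->ngS', x.reshape(10, g, c//g), W_g).reshape(10, C)

# Block-diagonal version: one big c x C matrix with the group blocks on its diagonal
W_full = np.zeros((c, C))
for i in range(g):
    W_full[i*(c//g):(i+1)*(c//g), i*(C//g):(i+1)*(C//g)] = W_g[i]
assert np.allclose(y_grouped, x @ W_full)
```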

**Similarity to depth-wise separable convolution:**

If we were to set g = c (and C = c, so each group has a single input and output channel), then group conv becomes the same as depthwise-separable conv's *first* stage. We could also add to group conv the second 1x1 stage of depthwise-separable conv if we wanted to.

We can think of group conv as a depthwise separable conv (first half) "wrapped around" smaller, independent regular convs. When those smaller convs have depth = 1 we get pure depthwise.
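
A quick check of that reduction (my own sketch, arbitrary sizes) → with g = c and C = c, the grouped-conv einsum collapses to the depth-wise stage:

```python
import numpy as np

b, w, h, c = 2, 4, 6, 3
k, l = 2, 3
g, C = c, c                                   # one input and one output channel per group
x = np.random.rand(b, w, h, c)
theta = np.random.rand(k, l, g, 1, 1)         # grouped kernel with c//g = C//g = 1

_x = x.reshape(b, w//k, k, h//l, l, g, 1)
y_grouped = np.einsum('bWkHlgs,klgsS->bWHgS', _x, theta).reshape(b, w//k, h//l, C)

# The depth-wise stage with the same per-channel kernels
theta_dw = theta[:, :, :, 0, 0]               # shape (k, l, c)
y_dw = np.einsum('bWkHlc,klc->bWHc', x.reshape(b, w//k, k, h//l, l, c), theta_dw)
assert np.allclose(y_grouped, y_dw)
```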

#### Shuffled group conv

Takes the output of a group conv layer and shuffles the channels between groups.

This enables the next group conv layer to leverage information across *all* previous groups, rather than just its corresponding group.
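
One way to implement the shuffle (my own sketch, not from the source post; this is the channel shuffle used in e.g. ShuffleNet) → reshape the channel dim into (g, c//g), transpose, and flatten, so each new group gets one channel from every old group:

```python
import numpy as np

b, w, h, c, g = 2, 4, 4, 6, 3
y = np.random.rand(b, w, h, c)              # output of a group conv with g groups

shuffled = (y.reshape(b, w, h, g, c//g)     # split channels back into their groups
             .transpose(0, 1, 2, 4, 3)      # swap the group and within-group axes
             .reshape(b, w, h, c))          # flatten back to c channels
print(shuffled.shape)                       # (2, 4, 4, 6)
```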

#### Point-wise group conv

Applying grouping to 1x1 convolutional layers.
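
A sketch of what this looks like (my own, not from the source post; arbitrary sizes) → the grouped-conv einsum with k = l = 1, i.e. grouping applied only over the channel dim:

```python
import numpy as np

b, w, h, c, C, g = 2, 4, 4, 6, 8, 2
x = np.random.rand(b, w, h, c)
theta = np.random.rand(g, c//g, C//g)            # one 1x1 "linear" block per group

_x = x.reshape(b, w, h, g, c//g)                 # split channels into groups
y = np.einsum('bwhgs,gsS->bwhgS', _x, theta).reshape(b, w, h, C)
print(y.shape)                                   # (2, 4, 4, 8)
```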