### Contents

### Reading List

### Mathematical foundations

## Variance

## Variance of a continuous random variable

## Variance of a constant addition

## Variance of a constant multiplication

## Variance of a sum of random variables

## Covariance

## Why does independence mean no covariance?

Independence means $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$.

This makes covariance equal to $\mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0$.
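As a numerical sanity check on the claim in this section's title, here is a minimal sketch using only the Python standard library. The helper `sample_covariance`, the sample size, and the chosen distributions are illustrative assumptions, not part of the original notes.

```python
import random


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(0)  # fixed seed so the check is repeatable
n = 100_000

xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # drawn independently of xs
zs = [x + 0.5 * y for x, y in zip(xs, ys)]       # deliberately depends on xs

cov_independent = sample_covariance(xs, ys)  # should be close to 0
cov_dependent = sample_covariance(xs, zs)    # should be close to Var[x] = 1
```

The independent pair gives a covariance near zero, while the constructed dependent pair does not. Note that the converse fails in general: zero covariance does not imply independence.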

## Variance of a sum of uncorrelated random variables


## Variance of the mean of a random variable

## Variance of the product of uncorrelated random variables

## Expected product of uncorrelated random variables

## Magnitude of a vector

## Expression of magnitude, in terms of variance

## Expression of variance in terms of magnitude

## Variance of the product of two vectors, each sampled indep. from different distributions

## Expected product of two vectors, each sampled indep. from different distributions

## Variance of an output element of a matrix-vector product, each sampled indep. from different distributions

Same as the variance of the product of vectors:

## Central limit theorem

## What is the distribution of the sum of independent random samples?

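From the central limit theorem: for $n$ i.i.d. samples $x_i$ with mean $\mu$ and variance $\sigma^2$, the sum $\sum_i x_i$ is approximately $\mathcal{N}(n\mu, n\sigma^2)$ for large $n$.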

## What is the distribution of the product of two vectors, each sampled indep. from different distributions

## What is the mean *absolute* value of a unit normal distribution?
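The standard closed form is $\mathbb{E}|Z| = \sqrt{2/\pi} \approx 0.798$ for $Z \sim \mathcal{N}(0, 1)$. A minimal Monte Carlo sketch to check it, using only the standard library (the sample size and seed are arbitrary choices):

```python
import math
import random

rng = random.Random(0)  # fixed seed so the check is repeatable
n = 200_000

# Average |z| over many unit-normal draws.
estimate = sum(abs(rng.gauss(0.0, 1.0)) for _ in range(n)) / n

# Closed-form value of E|Z| for Z ~ N(0, 1).
exact = math.sqrt(2.0 / math.pi)
```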

## What is the distribution of the product of independent random samples?

## What does the PDF of a lognormal distribution look like?

## When is a variable log-normally distributed?

When its logarithm is normally distributed, i.e. $\ln X \sim \mathcal{N}(\mu, \sigma^2)$.

## If $X \sim \mathcal{N}(\mu, \sigma^2)$, what quantity is log-normally distributed?

## Proof sketch that the product of independent random samples is log-normal

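- Take the log of the product; we now have a sum of logs

- The sum $\sum_i \ln x_i$ is approximately normally distributed by the CLT

- We undo the original log by taking the exponential: $\exp\left(\sum_i \ln x_i\right) = \prod_i x_i$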

- By definition, this is log-normally distributed
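The proof sketch above can be illustrated numerically. The snippet below is a hedged sketch, assuming positive factors drawn from $U(0.5, 1.5)$ (an arbitrary choice made here so the log is defined). It compares the skewness of the products with the skewness of their logs. The `skewness` helper and all constants are illustrative.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: mean of the cubed standardized values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the check is repeatable
n_factors = 50          # number of independent samples multiplied together
n_trials = 20_000       # number of products to draw

products = []
for _ in range(n_trials):
    p = 1.0
    for _ in range(n_factors):
        p *= rng.uniform(0.5, 1.5)  # positive factors, so log(p) is defined
    products.append(p)

logs = [math.log(p) for p in products]

skew_products = skewness(products)  # strongly right-skewed, like a log-normal
skew_logs = skewness(logs)          # close to 0, like a normal
```

The logs look roughly symmetric, as the CLT step predicts, while the products themselves have a long right tail.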

### Paper Notes

#### Glorot Paper

## What is the Xavier/Glorot init?

## What limit should one use for a uniform distribution to make it have unit variance?
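A short derivation, assuming the distribution is symmetric about zero, $U(-a, a)$ (the symbol $a$ is introduced here for illustration). A uniform distribution of width $w$ has variance $w^2/12$, so:

$$
\mathrm{Var}\left[U(-a, a)\right] = \frac{(2a)^2}{12} = \frac{a^2}{3} = 1 \quad\Longrightarrow\quad a = \sqrt{3}
$$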

## What limit should one use for a uniform distribution under Xavier/Glorot init?
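In Glorot and Bengio's normalized initialization, weights are drawn from $U(-a, a)$ with $a = \sqrt{6 / (n_\text{in} + n_\text{out})}$, which gives variance $2 / (n_\text{in} + n_\text{out})$. The sketch below checks this empirically. The function name `glorot_uniform_limit` and the layer sizes are hypothetical, chosen only for illustration.

```python
import math
import random


def glorot_uniform_limit(fan_in, fan_out):
    """Limit a such that U(-a, a) has variance 2 / (fan_in + fan_out)."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)      # fixed seed so the check is repeatable
fan_in, fan_out = 256, 128  # illustrative layer sizes

limit = glorot_uniform_limit(fan_in, fan_out)
weights = [rng.uniform(-limit, limit) for _ in range(fan_in * fan_out)]

target_variance = 2.0 / (fan_in + fan_out)
empirical_variance = sample_variance(weights)  # should be close to the target
```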

#### Random walk paper

## What's my objection to the random walk paper?

- They model the magnitude at each layer using a $\chi^2$ distribution (to get a single value for a vector one has to sum over squares; we do this too)

- The heuristic for approximating the distribution by a Gaussian is

- Hence for hidden sizes we can just model this using a Gaussian

- If you do this, for their scaling factor just becomes ( for relu)
