
# Value scaling

### Mathematical foundations

Variance
Variance of a continuous random variable
Variance of a constant multiplication
Variance of a sum of random variables
Covariance
Why does independence mean no covariance?
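Independence means $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$.
This makes the covariance, $\operatorname{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$, equal to $0$.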
Variance of a sum of uncorrelated random variables

Variance of the mean of a random variable
Variance of the product of independent random variables
Expected product of uncorrelated random variables
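
The identities behind the variance and covariance headings above are standard. A compact reference follows, assuming $X$ and $Y$ are random variables with finite variance, $a$ is a constant, $f$ is a density with mean $\mu$, and $\bar{X}_n$ is the mean of $n$ uncorrelated samples with common variance $\sigma^2$. The notation is an assumption and is not taken from the original notes.

```latex
\begin{align*}
\operatorname{Var}(X) &= \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \\
\operatorname{Var}(X) &= \int (x - \mu)^2 f(x)\,dx && \text{(continuous case)} \\
\operatorname{Var}(aX) &= a^2 \operatorname{Var}(X) \\
\operatorname{Cov}(X, Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] \\
\operatorname{Var}(X + Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y) \\
\operatorname{Var}(X + Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) && \text{(uncorrelated)} \\
\operatorname{Var}(\bar{X}_n) &= \frac{\sigma^2}{n} && \text{(uncorrelated samples)} \\
\mathbb{E}[XY] &= \mathbb{E}[X]\,\mathbb{E}[Y] && \text{(uncorrelated)} \\
\operatorname{Var}(XY) &= \operatorname{Var}(X)\operatorname{Var}(Y)
  + \operatorname{Var}(X)\,\mathbb{E}[Y]^2
  + \operatorname{Var}(Y)\,\mathbb{E}[X]^2 && \text{(independent)}
\end{align*}
```

Note that the last identity needs independence rather than zero correlation alone. For zero-mean variables it reduces to $\operatorname{Var}(XY) = \operatorname{Var}(X)\operatorname{Var}(Y)$.
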
Magnitude of a vector
Expression of magnitude, in terms of variance
Expression of variance in terms of magnitude
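
The relation between magnitude and variance can be written out as follows. This is a sketch that assumes a vector $x \in \mathbb{R}^n$ whose elements are drawn independently with mean $\mu$ and variance $\sigma^2$; the symbols are assumptions, not the notes' own notation.

```latex
\begin{align*}
\lVert x \rVert &= \sqrt{\sum_{i=1}^{n} x_i^2} \\
\mathbb{E}\big[\lVert x \rVert^2\big] &= n\,(\sigma^2 + \mu^2) \\
\lVert x \rVert &\approx \sigma \sqrt{n} && \text{(zero mean, large } n\text{)} \\
\sigma^2 &\approx \frac{\lVert x \rVert^2}{n} && \text{(zero mean, large } n\text{)}
\end{align*}
```
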
Variance of the dot product of two vectors, each sampled independently from a different distribution
Expected dot product of two vectors, each sampled independently from a different distribution
Variance of an output element of a matrix-vector product, with the matrix and the vector each sampled independently from a different distribution
Same as the variance of the dot product of two vectors, since each output element is the dot product of one matrix row with the input vector.
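
As a numerical check of the dot-product case, the sketch below assumes zero-mean Gaussian elements, for which the variance of a length-`n` dot product is `n * sigma_w**2 * sigma_x**2` and its expected value is 0. The function name and the chosen sizes are hypothetical, picked only for illustration.

```python
import random
import statistics


def dot_product_samples(n, sigma_w, sigma_x, trials, seed=0):
    """Draw `trials` dot products w.x with w_i ~ N(0, sigma_w^2) and x_i ~ N(0, sigma_x^2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        total = 0.0
        for _ in range(n):
            total += rng.gauss(0.0, sigma_w) * rng.gauss(0.0, sigma_x)
        samples.append(total)
    return samples


n, sigma_w, sigma_x = 64, 0.5, 2.0
samples = dot_product_samples(n, sigma_w, sigma_x, trials=5000)

empirical_mean = statistics.fmean(samples)
empirical_var = statistics.pvariance(samples)
# Zero-mean assumption: Var(w.x) = n * Var(w_i) * Var(x_i)
predicted_var = n * sigma_w**2 * sigma_x**2

print(f"mean: {empirical_mean:.3f} (predicted 0)")
print(f"variance: {empirical_var:.2f} (predicted {predicted_var:.2f})")
```

Under the same assumptions, choosing `sigma_w = 1 / sqrt(n)` keeps the output variance equal to the input variance, which is the usual motivation for scaling weights by the fan-in.
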
Central limit theorem
What is the distribution of the sum of independent random samples?
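From the central limit theorem: for large $n$, the sum of $n$ i.i.d. samples with mean $\mu$ and finite variance $\sigma^2$ is approximately $\mathcal{N}(n\mu, n\sigma^2)$.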
What is the distribution of the dot product of two vectors, each sampled independently from a different distribution?
What is the mean absolute value of a unit normal distribution?
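
For a unit normal $Z \sim \mathcal{N}(0, 1)$ the mean absolute value is $\mathbb{E}|Z| = \sqrt{2/\pi}$. A minimal Monte Carlo sketch to confirm it (the sample count and seed are arbitrary choices):

```python
import math
import random

rng = random.Random(0)
n_samples = 200_000

# Average |z| over many draws from N(0, 1)
empirical = sum(abs(rng.gauss(0.0, 1.0)) for _ in range(n_samples)) / n_samples
analytic = math.sqrt(2.0 / math.pi)

print(f"empirical E|Z| = {empirical:.4f}, analytic sqrt(2/pi) = {analytic:.4f}")
```
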
What is the distribution of the product of independent random samples?
What does the PDF of a lognormal distribution look like?
When is a variable log-normally distributed?
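When its logarithm is normally distributed, i.e. $\ln X \sim \mathcal{N}(\mu, \sigma^2)$.
If $Y \sim \mathcal{N}(\mu, \sigma^2)$, what quantity is log-normally distributed? ($e^Y$.)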
Proof sketch that the product of independent random samples is log-normal
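1. Take the log of the product (assuming positive samples) → we now have a sum of logs
1. The sum of logs is approximately normally distributed by the CLT
1. We undo the original log by taking the exponential of the sum
1. By definition, the exponential of a normally distributed variable is log-normally distributed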
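
The proof sketch above can be illustrated numerically. The example assumes positive factors drawn from Uniform(0.5, 1.5), an arbitrary choice, and compares the skewness of the product with that of its log: the log of the product should look roughly symmetric (Gaussian-like), while the product itself should be strongly right-skewed.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: mean of the cubed standardized values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


rng = random.Random(0)
n_factors, trials = 50, 5000

log_products = []
for _ in range(trials):
    # Step 1: the log of a product is a sum of logs
    log_sum = sum(math.log(rng.uniform(0.5, 1.5)) for _ in range(n_factors))
    log_products.append(log_sum)

# Step 3: undo the log with the exponential to recover the product
products = [math.exp(s) for s in log_products]

skew_log = skewness(log_products)
skew_product = skewness(products)

print(f"skewness of log(product): {skew_log:.3f}")
print(f"skewness of product:      {skew_product:.3f}")
```
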

### Paper Notes

#### Glorot Paper

What is the Xavier/Glorot init?
What limit should one use for a uniform distribution to make it have unit variance?
What limit should one use for a uniform distribution under Xavier/Glorot init?
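
A sketch of the answers to the questions above, using two facts: Uniform(-a, a) has variance a²/3, and Glorot and Bengio's proposal targets a weight variance of 2 / (fan_in + fan_out). The function names are hypothetical helpers, not any library's API.

```python
import math
import random
import statistics


def uniform_limit_for_variance(variance):
    """Uniform(-a, a) has variance a^2 / 3, so a = sqrt(3 * variance)."""
    return math.sqrt(3.0 * variance)


def glorot_uniform_limit(fan_in, fan_out):
    """Limit giving Var(W) = 2 / (fan_in + fan_out), i.e. sqrt(6 / (fan_in + fan_out))."""
    return uniform_limit_for_variance(2.0 / (fan_in + fan_out))


def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_out x fan_in weight matrix as a list of rows."""
    rng = random.Random(seed)
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]


unit_variance_limit = uniform_limit_for_variance(1.0)  # sqrt(3)
limit = glorot_uniform_limit(300, 100)                 # sqrt(6 / 400)

weights = glorot_uniform(300, 100)
flat = [w for row in weights for w in row]
empirical_var = statistics.pvariance(flat)
target_var = 2.0 / (300 + 100)

print(f"unit-variance limit: {unit_variance_limit:.4f}")
print(f"Glorot limit for (300, 100): {limit:.4f}")
print(f"empirical variance: {empirical_var:.5f} (target {target_var:.5f})")
```
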

#### Random walk paper

What’s my objection to the random walk paper?
1. They model the magnitude at each layer using a chi-squared distribution (to get a single value for a vector one has to sum over squares - we do this too)
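1. The heuristic for approximating a chi-squared distribution with $k$ degrees of freedom by a Gaussian is that $k$ be large (a common rule of thumb is $k > 50$)
1. Hence for hidden sizes above that threshold we can just model this using a Gaussian
1. If you do this, for large hidden sizes their scaling factor just becomes approximately $1$ (approximately $\sqrt{2}$ for ReLU)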
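
The Gaussian approximation in the list above can be probed with a small experiment. This sketch assumes the distribution in question is chi-squared, i.e. the squared norm of a `k`-dimensional standard normal vector, which has mean `k`, variance `2k`, and skewness `sqrt(8 / k)`. The two values of `k` are arbitrary illustrations, not sizes taken from the paper.

```python
import random
import statistics


def squared_norm_samples(k, trials, rng):
    """Sample the squared norm of a k-dimensional standard normal vector."""
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(trials)]


def skewness(values):
    """Sample skewness: mean of the cubed standardized values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


rng = random.Random(0)
results = {}
for k in (2, 200):
    samples = squared_norm_samples(k, trials=2000, rng=rng)
    results[k] = {
        "mean": statistics.fmean(samples),          # expected: k
        "variance": statistics.pvariance(samples),  # expected: 2k
        "skewness": skewness(samples),              # expected: sqrt(8 / k)
    }
    print(k, results[k])
```

A Gaussian has zero skewness, so the shrinking skewness at larger `k` is one way to see why a wide hidden layer can be modelled with a Gaussian of mean `k` and variance `2k`.
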