Contents:
- Introduction
- Wavenet
- Higher Fidelity
- Parallel Wavenet
- Normalising flows
- Inverse-autoregressive flows
- Implementation
- Probability Density Distillation
- Additional loss terms
- Experiments
Introduction
Problem: WaveNet not well suited to production environments:
- "Extreme" form of auto-regression
- 24000 samples predicted per second
- Parallel training (the outputs being regressed on are all available in the training data), but sequential execution at inference time (the outputs being regressed on are the model's own previous outputs)
Solution:
- Probability Density Distillation
- Teacher ➡️ regular WaveNet
- Student ➡️ Inverse-autoregressive flows
Wavenet
See WaveNet
Higher Fidelity
8-bit quantisation replaced with direct modelling of the 16-bit audio signal.
The resulting massive softmax (65,536 classes) is avoided by using the "discretized mixture of logistics" distribution presented in Salimans et al., 2017 (see the sketch below).
Sampling rate up from 16kHz to 24kHz.
Larger convolution filter size.
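A minimal numpy sketch of what a discretized-mixture-of-logistics likelihood looks like for one 16-bit sample. The function name, argument layout and the handling of edge bins are my own simplifications, not the paper's implementation:

```python
import numpy as np

def discretized_mix_logistic_logprob(x, log_pi, mu, log_s, num_levels=65536):
    """Log-probability of one sample x in [-1, 1] under a mixture of discretized
    logistics (illustrative sketch; edge bins at +/-1 are omitted for brevity).

    x          : observed sample, scaled to [-1, 1] (16-bit value / 32768)
    log_pi     : (K,) unnormalised log mixture weights
    mu, log_s  : (K,) component means and log scales
    num_levels : number of quantisation levels (65536 for 16-bit audio)
    """
    half_bin = 1.0 / (num_levels - 1)                 # half of one quantisation bin
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    s = np.exp(log_s)
    # Probability mass each component assigns to the bin containing x.
    cdf_plus = sigmoid((x + half_bin - mu) / s)
    cdf_minus = sigmoid((x - half_bin - mu) / s)
    probs = np.clip(cdf_plus - cdf_minus, 1e-12, 1.0)
    # Mixture: log sum_k pi_k * P_k(bin of x), with normalised weights.
    log_pi = log_pi - np.logaddexp.reduce(log_pi)
    return np.logaddexp.reduce(log_pi + np.log(probs))
```

This replaces a 65,536-way softmax with a handful of mixture parameters per timestep.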
Parallel Wavenet
Normalising flows
Model data as $x = f(z)$, where $f$ is invertible and differentiable, and $z$ is a latent variable from some simple tractable distribution $p_Z(z)$.
By the change of variables formula, this gives us:
$$\log p_X(x) = \log p_Z(z) - \log \left| \det \frac{\partial f(z)}{\partial z} \right|$$
where $\frac{\partial f(z)}{\partial z}$ is the Jacobian of $f$.
Thus, as long as we can easily compute the RHS above, we can get a probability for $x$. If we make $f$ parametric, then we can adjust it to maximise the log likelihood of the data.
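A toy numpy check of the change-of-variables formula for a scalar affine flow $f(z) = az + b$ (the flow and all names here are illustrative, not from the paper):

```python
import numpy as np

# Toy change-of-variables check: f(z) = a * z + b is invertible and differentiable.
rng = np.random.default_rng(0)
a, b = 2.0, 0.5

z = rng.standard_normal(1000)                    # z ~ N(0, 1), the tractable base distribution
x = a * z + b                                    # x = f(z)

log_p_z = -0.5 * (z ** 2 + np.log(2 * np.pi))    # log p_Z(z)
log_det_jac = np.log(np.abs(a))                  # |det df/dz| = |a| for this scalar flow
log_p_x = log_p_z - log_det_jac                  # change of variables: log p_X(x)

# Sanity check against the analytic density of x ~ N(b, a^2).
analytic = -0.5 * (((x - b) / a) ** 2 + np.log(2 * np.pi)) - np.log(np.abs(a))
assert np.allclose(log_p_x, analytic)
```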
Inverse-autoregressive flows
We make our model of the output at a timestep depend on all previous latent variables: $x_t = f(z_{\leq t})$.
As $x_t$ depends only on $z_{\leq t}$, the Jacobian $\frac{\partial x}{\partial z}$ is triangular, so its determinant is just the product of the diagonal terms, giving the simple log-determinant:
$$\log \left| \det \frac{\partial f(z)}{\partial z} \right| = \sum_t \log \frac{\partial x_t}{\partial z_t}$$
Implementation
At each timestep we sample $z_t \sim \text{Logistic}(0, I)$.
The choice of $x_t$ is:
$$x_t = z_t \cdot s(z_{<t}, \theta) + \mu(z_{<t}, \theta)$$
where the functions $s(z_{<t}, \theta)$ and $\mu(z_{<t}, \theta)$ are autoregressive models, specifically with the same convolutional structure as the original WaveNet. The diagonal terms of the Jacobian are then just $s(z_{<t}, \theta)$, so the log-determinant above becomes $\sum_t \log s(z_{<t}, \theta)$.
Note the key feature of this! We can sample all the $z_t$ at once and run this in parallel for each output!
Four of these "flow iterations" are stacked, each with their own weights, with subsequent layers treating the output of the previous one as their $z$.
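A minimal numpy sketch of the parallel sampling idea. The shift_and_scale function below is a stand-in for the learned WaveNet-style $s$ and $\mu$ networks; everything named here is illustrative:

```python
import numpy as np

T = 8          # number of timesteps (tiny, for illustration)
NUM_FLOWS = 4  # the paper stacks four flow iterations

rng = np.random.default_rng(0)

def shift_and_scale(z_in, flow_idx):
    """Stand-in for the autoregressive s(z_<t) and mu(z_<t) networks.
    It only looks at the flow's input at earlier timesteps (causal), never at x."""
    context = np.concatenate([[0.0], np.cumsum(z_in)[:-1]])   # summary of z_<t
    s = np.exp(0.1 * np.tanh(context) + 0.01 * flow_idx)      # positive scales
    mu = 0.1 * np.tanh(context)
    return s, mu

# 1. Sample all latents at once from the tractable base distribution (logistic noise).
z = rng.logistic(loc=0.0, scale=1.0, size=T)

# 2. Apply the stacked flows: x_t = z_t * s(z_<t) + mu(z_<t), computed for all t at once,
#    since s and mu depend only on the flow's own input, which is fully known up front.
x = z
for k in range(NUM_FLOWS):
    s, mu = shift_and_scale(x, k)   # the next flow treats the previous output as its z
    x = x * s + mu

print(x)  # a whole waveform segment produced with no sequential sampling loop
```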
Training here is sequential. I'm not quite sure why, but this is my best guess:
We wish to train on $p_X(x)$ for real audio $x$, but what we have directly is $p_Z(z)$.
We have to use the inverse flow (i.e. the inversion of $f$), $z_t = (x_t - \mu(z_{<t}, \theta)) / s(z_{<t}, \theta)$, to calculate this probability.
Since $\mu$ and $s$ at step $t$ need the already-inverted $z_{<t}$, this must be done in a sequential manner.
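A small sketch of why the inverse pass is sequential, again with a stand-in for the $s$ and $\mu$ networks (all names illustrative):

```python
import numpy as np

def s_mu(z_inverted_so_far):
    """Stand-in for the autoregressive s(z_<t) and mu(z_<t) networks at one timestep,
    as a function of the latents inverted so far (illustrative only)."""
    context = np.sum(z_inverted_so_far)
    return np.exp(0.1 * np.tanh(context)), 0.1 * np.tanh(context)

def inverse_flow(x):
    """Recover z from x for one IAF layer: z_t = (x_t - mu(z_<t)) / s(z_<t).
    mu and s at step t need the *already inverted* z_<t, so this loop cannot be parallelised."""
    z = []
    for t in range(len(x)):
        s_t, mu_t = s_mu(np.array(z))        # depends on everything inverted so far
        z.append((x[t] - mu_t) / s_t)
    return np.array(z)

x = np.array([0.3, -0.1, 0.5, 0.2])
print(inverse_flow(x))
```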
Probability Density Distillation
Motivation: we want the training parallelism of causal convolution, but the inference parallelism of IAF.
Solution: Use a trained WaveNet as a 'teacher' for a parallel WaveNet 'student', where the student tries to match the probability distribution of the teacher.
We try to minimise the Probability Density Distillation loss:
$$D_{KL}(P_S \| P_T) = H(P_S, P_T) - H(P_S)$$
See the paper for full details. Essentially, the entropy $H(P_S)$ can be computed directly from the student's scales $s(z_{<t}, \theta)$, and the cross entropy $H(P_S, P_T)$ can be estimated by generating a sample from the student, feeding it to the teacher, and then comparing the probabilities.
This would appear to suffer from the same sequential inference problem as regular IAF... however
The key trick here is that:
1. We can generate the params for the IAF output distribution once, and then trivially take a batch of samples from it.
2. Recall that the cross entropy $H(P_S, P_T)$ involves evaluating the teacher probability over the distribution defined by the student model.
3. The batch of samples can be used to calculate a much lower-variance estimate of the cross entropy than using a single sample.
This means we can train an IAF model much faster than before!
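A rough numpy sketch of how this estimate could be assembled. The teacher_logprob function and the student parameters below are placeholders for the real networks, and conditioning on $x_{<t}$ is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
T, NUM_SAMPLES = 100, 16   # timesteps, samples per timestep (illustrative sizes)

# --- placeholders standing in for the real networks --------------------------------
# Student's per-timestep output distribution parameters (one parallel IAF pass over z).
student_mu = rng.normal(size=T)
student_s = np.exp(0.1 * rng.normal(size=T))      # scales of the student's logistics

def teacher_logprob(x_t, t):
    """Placeholder for the trained WaveNet teacher's log p_T(x_t | x_<t, c)."""
    return -0.5 * (x_t ** 2) - 0.5 * np.log(2 * np.pi)

# --- entropy term H(P_S): closed form from the student's scales --------------------
# For logistic outputs, the entropy per timestep is ln(s_t) + 2.
entropy = np.sum(np.log(student_s) + 2.0)

# --- cross-entropy term H(P_S, P_T): Monte Carlo over a batch of student samples ---
cross_entropy = 0.0
for t in range(T):
    # The output distribution parameters were generated once, so drawing a batch of
    # samples per timestep is cheap; averaging them lowers the variance of the estimate.
    x_t = rng.logistic(loc=student_mu[t], scale=student_s[t], size=NUM_SAMPLES)
    cross_entropy += -np.mean(teacher_logprob(x_t, t))

kl_estimate = cross_entropy - entropy    # D_KL(P_S || P_T) = H(P_S, P_T) - H(P_S)
print(kl_estimate)
```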
Additional loss terms
- Power loss: penalises a mismatch between the average power spectrum of the generated audio and that of real speech, which prevents whispering (see the sketch after this list)
- Perceptual loss: penalises poor pronunciation, using features from a phone classifier
- Contrastive loss: encourages a low teacher-student divergence when both are conditioned on the same features, and a high divergence when the conditioning features differ
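As a rough illustration of the power loss, a numpy sketch of a power-spectrum penalty (the frame length, hop and window are my guesses, not the paper's settings):

```python
import numpy as np

def avg_power_spectrum(x, frame_len=512, hop=256):
    """Average power spectrum over time: mean_t |STFT(x)|^2 (illustrative framing)."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=-1)) ** 2
    return power.mean(axis=0)        # average the per-frame power over time

def power_loss(generated, target):
    """Squared distance between the average power spectra of generated and real audio."""
    return np.sum((avg_power_spectrum(generated) - avg_power_spectrum(target)) ** 2)

# Toy usage: near-silent output (cf. 'whispering') incurs a large power loss.
rng = np.random.default_rng(0)
target = rng.normal(scale=0.5, size=24000)     # one second at 24 kHz
whisper = rng.normal(scale=0.01, size=24000)
print(power_loss(whisper, target) > power_loss(target, target))
```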
Experiments