
Parallel WaveNet

Title
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Authors
Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, Demis Hassabis
Date
2018
Venue
ICML
Keywords
generative
text-to-speech
distillation

Introduction

Problem: WaveNet not well suited to production environments:
  1. "Extreme" form of auto-regression
  1. 24000 samples predicted per second
  1. Training is parallel (the samples being conditioned on are all available in the training data), but generation is sequential (each sample is conditioned on previously generated samples)
Solution:
  1. Probability Density Distillation
  1. Teacher ➡️ regular WaveNet
  1. Student ➡️ Inverse-autoregressive flows

WaveNet

Higher Fidelity

8-bit quantisation replaced with direct modelling of 16-bit audio signal.
The massive softmax this would otherwise need is addressed by using the "discretized mixture of logistics" distribution presented in Salimans et al., 2017 (rough sketch after this list).
Sampling rate up from 16kHz to 24kHz.
Larger convolution filter size.
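
A rough numpy sketch of the discretized mixture of logistics likelihood as I understand it from Salimans et al. (2017); the function and variable names are mine and edge bins are ignored for brevity:

```python
import numpy as np

def discretized_logistic_mixture_logprob(x, pi, mu, s, num_levels=2 ** 16):
    """Log-likelihood of one quantised sample x in [-1, 1] under a mixture of
    logistics. Each component assigns to x's quantisation bin the mass
    sigmoid((x + h - mu) / s) - sigmoid((x - h - mu) / s), with h = half a bin.
    (The edge bins at -1 and +1 are ignored here for brevity.)"""
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    h = 1.0 / (num_levels - 1)                     # half the width of one 16-bit bin
    bin_mass = sigmoid((x + h - mu) / s) - sigmoid((x - h - mu) / s)
    return np.log(np.maximum(np.sum(pi * bin_mass), 1e-12))

print(discretized_logistic_mixture_logprob(
    x=0.1, pi=np.array([0.6, 0.4]), mu=np.array([0.0, 0.2]), s=np.array([0.1, 0.05])))
```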

Parallel WaveNet

Normalising flows

Model data as $x = f(z)$, where $f$ is invertible and differentiable, and $z$ is a latent variable from some simple tractable distribution $p_Z(z)$.
By the change of variables formula, $x = f(z)$ gives us:
$$p_X(x) = p_Z(z) \left| \det \frac{\partial f(z)}{\partial z} \right|^{-1}$$
where $\frac{\partial f(z)}{\partial z}$ is the Jacobian of $f$.
Thus as long as we can easily compute the above RHS, we can get a probability of $x$. If we make $f$ parametric then we can adjust it to minimise some target log likelihood.
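
A minimal numpy sketch of this change-of-variables computation for a toy elementwise affine flow (everything here, including the function name, is illustrative rather than from the paper):

```python
import numpy as np

def log_prob_x(x, scale, shift):
    """Log-density of x under a toy flow x = f(z) = scale * z + shift (elementwise),
    with z ~ N(0, 1).  Change of variables:
        log p_X(x) = log p_Z(f^{-1}(x)) - log|det df/dz|,
    and for this elementwise f the Jacobian is diagonal with entries `scale`."""
    z = (x - shift) / scale                       # invert the flow
    log_pz = -0.5 * (z ** 2 + np.log(2 * np.pi))  # standard normal log-density, per element
    log_det = np.log(np.abs(scale))               # log|df_i/dz_i| per element
    return np.sum(log_pz - log_det)

print(log_prob_x(np.array([0.3, -1.2, 0.7]),
                 scale=np.array([2.0, 0.5, 1.5]),
                 shift=np.zeros(3)))
```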

Inverse-autoregressive flows

We make our model of the output at timestep $t$ depend on all previous latent variables: $x_t = f(z_{\leq t})$.
As $x_t$ depends only on $z_{\leq t}$, the Jacobian is lower triangular, so its determinant is just the product of the diagonal entries, giving the simple log-determinant:
$$\ln \left| \det \frac{\partial f(z)}{\partial z} \right| = \sum_t \ln \left| \frac{\partial x_t}{\partial z_t} \right|$$
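Concretely, with the affine parameterisation used in the implementation below, $\frac{\partial x_t}{\partial z_t} = s(z_{<t}, \theta)$, so (my restatement) the log-likelihood becomes:
$$\ln p_X(x) = \sum_t \ln p_Z(z_t) - \sum_t \ln s(z_{<t}, \theta)$$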

Implementation

At each timestep we sample $z_t \sim \text{Logistic}(0, 1)$.
The choice of $x_t$ is:
$$x_t = z_t \cdot s(z_{<t}, \theta) + \mu(z_{<t}, \theta)$$
where the functions $s(\cdot)$ and $\mu(\cdot)$ are autoregressive models, specifically the same convolutional structure as the original WaveNet.
🔑
Note the key feature of this! We can sample all the $z_t$ at once and run this in parallel for each output! 😇
Four of these "flow iterations" are stacked, each with their own weights, with subsequent layers treating the output of previous ones as the $z$.
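
A toy numpy sketch of this parallel sampling path, with the WaveNet-style networks for $\mu$ and $s$ stubbed out as simple causal linear filters (the names, sizes and weights here are my own, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_stats(z, w_mu, w_s):
    """Stand-in for the WaveNet-like autoregressive nets: mu_t and log s_t are
    toy linear functions of the previous len(w_mu) samples z_{<t} only."""
    T, K = len(z), len(w_mu)
    mu, log_s = np.zeros(T), np.zeros(T)
    for k in range(K):
        past = np.concatenate([np.zeros(k + 1), z])[:T]   # z shifted right by k+1: only z_{<t} is visible
        mu += w_mu[k] * past
        log_s += w_s[k] * past
    return mu, np.exp(log_s)

def iaf_flow(z, w_mu, w_s):
    """One flow iteration: every x_t is computed in parallel from the full z."""
    mu, s = causal_stats(z, w_mu, w_s)
    return z * s + mu                                      # x_t = z_t * s(z_<t) + mu(z_<t)

T, K = 24000, 4
x = rng.logistic(0.0, 1.0, size=T)                         # draw every z_t up front
for _ in range(4):                                         # four stacked flows, each with its own weights
    x = iaf_flow(x, rng.normal(size=K) * 0.1, rng.normal(size=K) * 0.1)
```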
Training here is sequential; I'm not quite sure why, but this is my best guess:
We wish to train on $\ln p_X(x)$ for observed audio $x$, but the flow maps from $z$ to $x$.
We have to use the inverse flow (i.e. the inversion of $f$, recovering $z_t = (x_t - \mu(z_{<t}, \theta)) / s(z_{<t}, \theta)$) to calculate this probability.
Because $\mu_t$ and $s_t$ depend on the earlier $z_{<t}$, this must be done in a sequential manner.
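
For contrast, a sketch of that inverse pass (reusing the toy causal_stats above): recovering $z_t$ needs $\mu_t$ and $s_t$, which depend on $z_{<t}$, so the loop over timesteps cannot be parallelised. A real implementation would cache network state rather than recompute it each step; the point is only the data dependency.

```python
import numpy as np

def iaf_inverse(x, w_mu, w_s):
    """Invert one flow: recover z from observed audio x. Because mu_t and s_t
    depend on z_{<t}, each z_t can only be computed after all earlier ones."""
    T = len(x)
    z = np.zeros(T)
    for t in range(T):                        # unavoidably sequential
        mu, s = causal_stats(z, w_mu, w_s)    # entries < t of z are already correct, which is all mu[t], s[t] use
        z[t] = (x[t] - mu[t]) / s[t]
    return z
```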

Probability Density Distillation

Motivation: we want the training parallelism of causal convolution, but the inference parallelism of IAF.
Solution: Use a trained WaveNet as a 'teacher' for a parallel WaveNet 'student', where the student tries to match the probability distribution of the teacher.
We try to minimise the Probability Density Distillation loss, the KL divergence from student to teacher:
$$D_{KL}(P_S \| P_T) = H(P_S, P_T) - H(P_S)$$
See the paper for full details. Essentially, the student entropy $H(P_S)$ can be estimated directly from the student's scale outputs $s(z_{<t}, \theta)$, and the cross entropy $H(P_S, P_T)$ can be estimated by generating a sample from the student, feeding it to the teacher, and then comparing the probs.
This would appear to suffer from the same sequential inference problem as regular IAF... however
🔑
The key trick here is that:
  1. We can generate the params for the IAF output distribution once, and then trivially take a batch of samples from it.
  1. Recall that the cross entropy $H(P_S, P_T)$ involves evaluating the teacher probability over the distribution defined by the student model.
  1. The batch of samples can be used to calculate a much lower-variance estimate of the cross entropy than using a single sample.
This means we can train an IAF model much faster than before!
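
A rough numpy sketch of that cross-entropy estimate, assuming a hypothetical `teacher_logprob(x, t)` callable that returns the teacher's $\ln p_T(x_t \mid x_{<t})$ for a vector of candidate values (everything here is illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy_estimate(mu_s, s_s, teacher_logprob, num_draws=16):
    """Monte-Carlo estimate of H(P_S, P_T) = E_{x~P_S}[-ln p_T(x)].

    mu_s, s_s: per-timestep location/scale of the student's logistic output
        distribution, obtained in one parallel pass from a single z draw.
    teacher_logprob(x, t): hypothetical callable giving the teacher's
        ln p_T(x_t | x_{<t}) for a vector of candidate x_t values (the teacher
        is run once on the sampled waveform, so all its conditionals are known).
    Drawing several x_t per timestep from the student's known conditional and
    averaging gives a much lower-variance estimate than a single sample."""
    total = 0.0
    for t in range(len(mu_s)):
        x_t = rng.logistic(mu_s[t], s_s[t], size=num_draws)   # cheap once mu_s, s_s are known
        total -= np.mean(teacher_logprob(x_t, t))
    return total

# Toy usage: a "teacher" whose conditional at every t is a standard logistic.
toy_teacher = lambda x, t: -(x + 2.0 * np.log1p(np.exp(-x)))  # log-density of Logistic(0, 1)
print(cross_entropy_estimate(np.zeros(100), np.ones(100), toy_teacher))
```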

Additional loss terms

  1. Power loss: matches the average power spectrum of the generated audio to that of real speech, which stops the student collapsing to whispering (rough sketch after this list)
  1. Perceptual loss: feature-matching under a pre-trained phone classifier, penalising unrealistic phone pronunciation
  1. Contrastive loss: encourages low teacher-student divergence when both are conditioned on the same features, and high divergence when the conditioning features differ
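
My reading of the power loss as a sketch (the exact formulation, window sizes and weighting are in the paper; scipy's STFT and the parameters below are just placeholders):

```python
import numpy as np
from scipy.signal import stft

def power_loss(generated, reference, fs=24000, nperseg=512):
    """Rough sketch: compare the time-averaged STFT power of generated and
    reference audio per frequency band. If the student tried to 'whisper',
    its band powers would be far too low and this penalty would grow."""
    _, _, G = stft(generated, fs=fs, nperseg=nperseg)
    _, _, R = stft(reference, fs=fs, nperseg=nperseg)
    gen_power = np.mean(np.abs(G) ** 2, axis=-1)   # average power per band over time
    ref_power = np.mean(np.abs(R) ** 2, axis=-1)
    return np.sum((gen_power - ref_power) ** 2)
```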

Experiments
