Contents: Introduction, Contribution, WaveNet, Dilated Causal Convolutions, Softmax Distributions, Gated Activation Units, Conditional WaveNets, Experiments, Multi-speaker, Text-to-speech, Speech Recognition
Inspired by recent advances in autoregressive generative models for images and text.
- Much more natural TTS
- New architectures based on dilated causal convolution
- Conditioning on voice type
- Can be used for speech recognition too
Task: directly model the raw audio waveform:
- At each step, predict probability distribution
- Architecture: stack of convolutional layers, no pooling
- Output: categorical distribution over the 256 quantised sample values with a softmax layer
- Input = data, output = same data shifted by one timestep
- Max-likelihood objective used
- Each prediction is fed back to the model as the input for the next timestep
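The generation loop above can be sketched as follows. The uniform "model" is a toy stand-in just to exercise the loop; the real model would be the trained network.

```python
import numpy as np

# A minimal sketch of autoregressive generation: at each step the model
# emits a distribution over the next sample, one value is drawn from it,
# and that value is fed back as input for the next timestep.
def sample_autoregressive(predict_proba, seq_len, n_classes=256, seed=0):
    rng = np.random.default_rng(seed)
    seq = []
    for _ in range(seq_len):
        probs = predict_proba(seq)            # p(x_t | x_1, ..., x_{t-1})
        nxt = rng.choice(n_classes, p=probs)  # sample the next value
        seq.append(int(nxt))                  # feed it back as input
    return seq

# toy stand-in for the model: a uniform distribution over 256 values
uniform = lambda seq: np.full(256, 1 / 256)
generated = sample_autoregressive(uniform, seq_len=5)
```

This is why inference is sequential: each sample must be drawn before the next distribution can be computed.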
Regular Causal Convolution:
Output is restricted so it can't depend on future timesteps.
This is the equivalent of masked convolution for images.
- Training is parallel
- Trains faster than RNNs, especially for very long sequences
- Inference is sequential (one sample at a time)
- Requires many layers or large filters to achieve a large receptive field
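A causal convolution can be implemented by left-padding, as in this sketch (filter weights and inputs are illustrative):

```python
import numpy as np

# A sketch of a 1-D causal convolution: left-padding the input with
# (filter length - 1) zeros guarantees output[t] depends only on x[<= t].
def causal_conv1d(x, w):
    w = np.asarray(w, dtype=float)
    k = len(w)
    padded = np.concatenate([np.zeros(k - 1), np.asarray(x, dtype=float)])
    # out[t] = sum_i w[i] * x[t - i]
    return np.array([padded[t:t + k] @ w[::-1] for t in range(len(x))])

delayed = causal_conv1d([1, 2, 3, 4], [0, 1])  # acts as a one-step delay
```

Because only past samples are visible, the output at the first timestep sees nothing but padding, which is what makes teacher-forced training fully parallel.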
Dilated Causal Convolution:
"Dilate" filter with zeros to allow larger receptive field
Similar to pooling or strided convolutions (although maintains size).
The approach used in the paper is to double the dilation for every layer up to a limit and then repeat, e.g. 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512. This combines exponential receptive-field growth with the benefits of "stacking".
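The receptive-field arithmetic for that dilation schedule can be sketched as follows (filter size 2 is an assumption, matching the paper's figures):

```python
# Each dilated causal layer extends the receptive field by
# (filter_size - 1) * dilation samples.
def receptive_field(filter_size, dilations):
    return 1 + sum((filter_size - 1) * d for d in dilations)

stack = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
dilations = stack * 3                 # the stack repeated three times
rf = receptive_field(2, dilations)    # 1 + 3 * 1023 = 3070 samples
```

One stack of ten layers already covers 1024 samples; repeating it trades a linear increase in depth for a proportional increase in receptive field.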
Even though the data is implicitly continuous, prior work shows that a softmax over discrete values tends to work better than continuous alternatives such as mixture models.
Problem: raw audio is typically stored as a 16-bit integer (65,536 possible values per timestep).
Solution: a μ-law companding transformation (eqn in paper, looks like a tanh curve) is used to transform the data, after which it is quantised down to 256 possible values.
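A sketch of the companding-plus-quantisation step (input samples are assumed to be normalised to [-1, 1]):

```python
import numpy as np

# mu-law companding followed by quantisation to mu + 1 = 256 levels.
def mu_law_encode(x, mu=255):
    x = np.asarray(x, dtype=float)
    # compand: compresses large magnitudes, keeps resolution near zero
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map [-1, 1] onto integer codes 0..mu (the +0.5 rounds to nearest)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

codes = mu_law_encode([-1.0, 0.0, 1.0])  # -> [0, 128, 255]
```

The logarithmic curve spends most of the 256 codes near zero amplitude, where quantisation error is most audible.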
Replacing ReLU with the following gating scheme was very effective:

z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)

where ∗ is the convolution operator, ⊙ is element-wise multiplication, k is the layer index, and the W's are learnable convolutional filters.
Residual and skip connections are also used.
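One such layer can be sketched as below. Filter size 2 is an assumption, and skip connections are omitted for brevity:

```python
import numpy as np

# Filter-size-2 dilated causal conv: out[t] = w[0]*x[t] + w[1]*x[t - d],
# with zeros substituted before the start of the sequence.
def dilated_causal(x, w, d):
    padded = np.concatenate([np.zeros(d), x])
    return w[0] * padded[d:] + w[1] * padded[:len(x)]

# One WaveNet-style layer: the tanh "filter" half is multiplied by the
# sigmoid "gate" half, then the layer input is added back as a residual.
def gated_residual_layer(x, w_f, w_g, d):
    filt = np.tanh(dilated_causal(x, w_f, d))
    gate = 1 / (1 + np.exp(-dilated_causal(x, w_g, d)))
    return x + filt * gate

x = np.zeros(8)
out = gated_residual_layer(x, [0.5, 0.5], [0.5, 0.5], d=2)
```

The residual path keeps gradients flowing through the many stacked layers; skip connections would additionally collect each layer's gated output for the final softmax.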
If we condition each prediction on some additional input we can do TTS, voice types, etc.
We can either condition on some global input, or on time-specific local input.
For TTS it is conditioned locally on the fundamental frequency: a time-varying signal representing the "fundamental" pitch relating to the words/phonemes.
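Global conditioning can be sketched as follows: a learned linear projection of the conditioning vector h is added inside both halves of the gated activation. The shapes and names (V_f, V_g, x_f, x_g) are illustrative stand-ins, not the paper's exact tensors:

```python
import numpy as np

# x_f, x_g stand in for the filter/gate convolution outputs; h is the
# conditioning input (e.g. a one-hot speaker identity); V_f, V_g are
# learnable projections added inside the nonlinearities.
def conditioned_gate(x_f, x_g, h, V_f, V_g):
    sigmoid = lambda a: 1 / (1 + np.exp(-a))
    return np.tanh(x_f + V_f @ h) * sigmoid(x_g + V_g @ h)

h = np.eye(3)[1]        # one-hot encoding of "speaker 1"
V_f = np.zeros((1, 3))  # zero projections reduce this to the
V_g = np.zeros((1, 3))  # unconditioned gated activation
out = conditioned_gate(np.zeros(1), np.zeros(1), h, V_f, V_g)
```

For local conditioning the projection would instead be applied to a time-aligned conditioning sequence, so the added term varies per timestep.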
One-hot encoding of 109 speakers; the model could effectively condition on each one
~ a day of English and Mandarin speech data
Locally conditioned on linguistic features or logarithmic fundamental frequency
Receptive field size 240ms
- LSTM (parametric)
- Hidden Markov model (concatenative)
- Subjective paired comparison test: which do you prefer?
- Mean opinion score (MOS): rate each sample on a 5-point Likert scale
To make this work:
- Mean-pooling layer added to the output to coarsen frames to 10ms
- Followed by a few regular convolution layers
- Two losses: one to classify the frame and one to predict the next sample
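The mean-pooling step above can be sketched as follows (the 16 kHz sample rate is an assumption, not stated in these notes):

```python
import numpy as np

# Coarsen per-sample activations to 10 ms frames by mean-pooling over
# non-overlapping windows of hop = sample_rate * frame_ms / 1000 samples.
def mean_pool_frames(x, sample_rate=16000, frame_ms=10):
    hop = sample_rate * frame_ms // 1000       # 160 samples per frame
    n_frames = len(x) // hop
    return x[:n_frames * hop].reshape(n_frames, hop).mean(axis=1)

frames = mean_pool_frames(np.arange(320.0))    # two 10 ms frames
```

The regular convolution layers and the frame-classification loss then operate on this coarser sequence instead of on individual samples.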
Best results (18.8 PER on TIMIT) came from a model trained directly on raw audio.