WaveNet: A Generative Model for Raw Audio
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, Koray Kavukcuoglu



Inspired by recent advances in autoregressive generative models for images and text.


  1. Much more natural TTS
  1. New architectures based on dilated causal convolution
  1. Conditioning on voice type
  1. Can be used for speech recognition too


Problem: directly model waveform:
  1. At each step, predict probability distribution
  1. Architecture: stack of convolutional layers, no pooling
  1. Output: categorical distribution over the next sample $x_t$, produced by a softmax layer
  1. Input = data, output = same data shifted by one timestep
  1. Max-likelihood objective used
  1. At generation time, each sampled prediction is fed back to the model as the input for the next timestep
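The steps above amount to factorising the joint probability of a waveform as $p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$ and sampling one value at a time. A minimal sketch of that sampling loop follows. The `predict_probs` callable is a hypothetical stand-in for the trained network, and `uniform_model` is a toy placeholder so that the loop can run; neither comes from the paper.

```python
import numpy as np

def generate(predict_probs, n_steps, n_classes=256, seed=0):
    """Sample n_steps values autoregressively.

    predict_probs(history) is a hypothetical stand-in for the network:
    it takes the samples generated so far and returns a probability
    vector of length n_classes for the next sample.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_steps):
        probs = predict_probs(samples)                    # p(x_t | x_1..x_{t-1})
        samples.append(int(rng.choice(n_classes, p=probs)))  # fed back on the next step
    return samples

def uniform_model(history, n_classes=256):
    """Toy placeholder model: ignores the history entirely."""
    return np.full(n_classes, 1.0 / n_classes)

waveform_codes = generate(uniform_model, n_steps=5)
```

The loop is inherently sequential, which is why inference is slow even though training can be parallelised over timesteps.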

Dilated Causal Convolutions

Regular Causal Convolution:
This represents a "convolution" insofar as we have a size-2 filter over previous values. The filter can of course be made as large as we like.
Output is restricted so it can't depend on future timesteps.
This is the equivalent of masked convolution for images.
  1. Training parallel
  1. Train faster than RNNs, especially for very long sequences
  1. Inference sequential
  1. Require many layers or large filters to have large receptive field
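To make the causality constraint concrete, here is a small NumPy sketch of a single-channel causal convolution. It assumes zero padding on the left and a scalar signal; the function name `causal_conv1d` is illustrative and not taken from the paper.

```python
import numpy as np

def causal_conv1d(x, w):
    """Compute y[t] = sum_i w[i] * x[t - i], treating x[t] as 0 for t < 0.

    Each output only sees the current and earlier inputs, never later ones.
    """
    x = np.asarray(x, dtype=float)
    pad = len(w) - 1
    xp = np.concatenate([np.zeros(pad), x])   # pad on the left only
    y = np.zeros_like(x)
    for i, wi in enumerate(w):
        start = pad - i                       # tap i looks i steps into the past
        y += wi * xp[start:start + len(x)]
    return y

y = causal_conv1d([1.0, 2.0, 3.0, 4.0], w=[1.0, 1.0])  # size-2 filter
```

Because every output is computed from known inputs, all timesteps can be evaluated at once during training.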
Dilated Causal Convolution:
"Dilate" filter with zeros to allow larger receptive field
Dilation values here are: 1, 2, 4, 8.
Similar to pooling or strided convolutions (although maintains size).
The paper doubles the dilation at every layer up to a limit and then repeats the cycle, e.g. 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512. This combines exponential receptive-field growth with the benefits of stacking more layers.
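The receptive-field arithmetic can be checked with a short calculation. The sketch below assumes a filter size of 2 in every layer (as in the size-2 example above), so each layer extends the receptive field by its dilation; `receptive_field` is an illustrative helper name.

```python
def receptive_field(dilations, filter_size=2):
    """Number of input samples that a single output sample depends on."""
    return 1 + sum((filter_size - 1) * d for d in dilations)

block = [2 ** i for i in range(10)]          # 1, 2, 4, ..., 512

one_block = receptive_field(block)           # 1024 samples
three_blocks = receptive_field(block * 3)    # 3070 samples
undilated = receptive_field([1] * 30)        # 31 samples for the same 30 layers
```

Under these assumptions, 30 dilated layers cover about a hundred times more context than 30 undilated ones.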

Softmax Distributions

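Even though the data is implicitly continuous, earlier work on PixelRNN/PixelCNN (van den Oord et al., 2016) found that a softmax distribution tends to work better than alternatives, since a categorical distribution is more flexible and makes no assumptions about the shape of the distribution.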
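Problem: raw audio is typically stored as 16-bit integers ($2^{16} = 65{,}536$ possible values), which would require a very large softmax output.
Solution: a $\mu$-law companding transformation, $f(x_t) = \operatorname{sign}(x_t)\,\frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}$ with $-1 < x_t < 1$ and $\mu = 255$, is applied to the data (it looks like a tanh curve), after which it is quantised down to 256 possible values.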
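The companding step can be sketched in NumPy as follows. The companding formula and $\mu = 255$ follow the paper; mapping the companded value to 256 integer codes by uniform rounding is an assumption of this sketch, as are the function names.

```python
import numpy as np

MU = 255

def mu_law_encode(x, mu=MU):
    """Map samples in [-1, 1] to integer codes 0..mu."""
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # companded, in [-1, 1]
    # Assumed quantisation: rescale to [0, mu] and round to the nearest integer.
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=MU):
    """Approximate inverse: integer codes 0..mu back to samples in [-1, 1]."""
    y = 2 * np.asarray(q, dtype=float) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 1001)
codes = mu_law_encode(x)
x_hat = mu_law_decode(codes)
```

The quantisation is finer near zero and coarser near the extremes, so quiet passages keep more resolution than a linear 8-bit encoding would give them.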

Gated Activation Units

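Replacing ReLU with the following gating scheme was very effective:
$\mathbf{z} = \tanh(W_{f,k} * \mathbf{x}) \odot \sigma(W_{g,k} * \mathbf{x})$
where $*$ is the convolution operator, $\odot$ is element-wise multiplication, $\sigma$ is the sigmoid function, $k$ is the layer index, $f$ and $g$ denote filter and gate, and the $W$s are learnable convolutional filters.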
Residual and skip connections are also used.
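As a rough illustration, the gated unit with a residual connection can be written in NumPy as below. This is a single-channel simplification: the 1×1 convolutions that the paper applies before the residual and skip outputs are omitted, and all function names are illustrative.

```python
import numpy as np

def causal_conv(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - i * dilation], with zeros before the start."""
    x = np.asarray(x, dtype=float)
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x)
    for i, wi in enumerate(w):
        start = pad - i * dilation
        y += wi * xp[start:start + len(x)]
    return y

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_residual_layer(x, w_f, w_g, dilation):
    """Gated activation: tanh(filter conv) * sigmoid(gate conv), plus residual."""
    x = np.asarray(x, dtype=float)
    z = np.tanh(causal_conv(x, w_f, dilation)) * sigmoid(causal_conv(x, w_g, dilation))
    skip = z          # simplification: no 1x1 convolution on the skip path
    return x + z, skip

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
out, skip = gated_residual_layer(x, w_f=[0.5, -0.3], w_g=[0.2, 0.7], dilation=4)
```

In a full stack, `out` would feed the next layer while the `skip` outputs of all layers are summed before the output softmax.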

Conditional WaveNets

If we condition each prediction on some additional input we can do TTS, voice types, etc.
We can either condition on some global input, or on time-specific local input.
For TTS, the model is conditioned locally on the fundamental frequency: a time-varying signal representing some "fundamental" pitch relating to the words/phonemes.
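For reference, the two kinds of conditioning can be written as extensions of the paper's gated activation $\mathbf{z} = \tanh(W_{f,k} * \mathbf{x}) \odot \sigma(W_{g,k} * \mathbf{x})$. The equations below follow the forms given in the paper as best recalled; $\mathbf{h}$ is the conditioning input, the $V$s are additional learnable weights, and $\mathbf{y} = f(\mathbf{h})$ is the local conditioning signal upsampled to the audio rate.

```latex
% Conditional model: every prediction also depends on h
p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h})

% Global conditioning: one vector h broadcast over all timesteps
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{\top} \mathbf{h}\right)
       \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{\top} \mathbf{h}\right)

% Local conditioning: time series y = f(h), with V * y a 1x1 convolution
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k} * \mathbf{y}\right)
       \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k} * \mathbf{y}\right)
```

A speaker identity would be a global input, while linguistic features or fundamental frequency would be local inputs.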



One-hot encoding of 109 speakers: a single model could effectively be conditioned on each speaker.


~ a day of English and Mandarin speech data
Locally conditioned on linguistic features or logarithmic fundamental frequency
Receptive field size 240ms
  1. LSTM (parametric)
  1. Hidden Markov Model (concatenative)
  1. Subjective paired comparison test: which do you prefer?
  1. Mean opinion score: rate each on a 5-point Likert scale

Speech Recognition

To make this work:
  1. Mean-pooling layer added after the dilated convolutions to coarsen the activations into 10 ms frames
  1. Followed by a few regular convolution layers
  1. Two loss terms: one to classify the frame and one to predict the next sample
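The pooling step in the list above can be sketched as follows. It assumes 16 kHz audio, so that a 10 ms frame spans 160 samples, and drops any incomplete trailing frame; `mean_pool_frames` is an illustrative name, not the paper's.

```python
import numpy as np

def mean_pool_frames(activations, frame_len=160):
    """Average per-sample activations over non-overlapping frames.

    activations: array of shape (time,) or (time, channels).
    Returns an array of shape (time // frame_len, ...).
    """
    a = np.asarray(activations, dtype=float)
    n_frames = len(a) // frame_len
    trimmed = a[:n_frames * frame_len]            # drop the incomplete last frame
    return trimmed.reshape(n_frames, frame_len, *a.shape[1:]).mean(axis=1)

frames = mean_pool_frames(np.arange(320))         # two frames of 160 samples
multi = mean_pool_frames(np.ones((400, 4)))       # 2 frames x 4 channels
```

The frame-level outputs would then go through the regular convolution layers and the classification loss, while the sample-level outputs feed the next-sample loss.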
Best results from a model trained directly on raw audio.