WaveNet

Introduction

Inspired by recent advances in autoregressive generative models for images and text.

Contribution

Much more natural TTS

New architectures based on dilated causal convolution

Conditioning on voice type

Can be used for speech recognition too

WaveNet

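Problem: directly model the raw waveform $\mathbf{x} = \{x_1, \dots, x_T\}$.

Model: $p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$

At each step, predict the probability distribution of the next sample $x_t$ given all previous samples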

Architecture: stack of convolutional layers, no pooling

Output: categorical distribution over $x_t$, produced by a softmax layer

Training:

Input = data, output = same data shifted by one sample

Max-likelihood objective used
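A minimal sketch of the training setup described above, assuming the samples have already been quantised to class indices. The helper names and the uniform "model output" are made up for illustration; in practice the distributions would come from the network.

```python
import math

def make_training_pairs(samples):
    """Inputs are samples[:-1]; targets are the same sequence shifted by one."""
    return samples[:-1], samples[1:]

def negative_log_likelihood(predicted_dists, targets):
    """Mean negative log-probability assigned to the true next samples."""
    total = 0.0
    for dist, target in zip(predicted_dists, targets):
        total -= math.log(dist[target])
    return total / len(targets)

# Toy sequence of quantised sample values (class indices 0..3).
inputs, targets = make_training_pairs([3, 1, 2, 0])

# Hypothetical model output: a uniform distribution over 4 classes at each step.
uniform = [[0.25] * 4 for _ in targets]
loss = negative_log_likelihood(uniform, targets)
```

Minimising this loss is the same as maximising the likelihood of the data, and because every target is known up front, all timesteps can be scored in one pass.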

Inference:

Each prediction is fed back to the model as the input for the next timestep
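The feedback loop can be sketched as follows. `predict_next` is a hypothetical stand-in for the trained network (it takes the history and returns a distribution over the next value); the toy predictor below exists only to make the example runnable.

```python
import random

def generate(predict_next, seed, n_steps, rng=None):
    """Sample n_steps new values, feeding each one back in as input."""
    rng = rng or random.Random(0)
    samples = list(seed)
    for _ in range(n_steps):
        probs = predict_next(samples)            # distribution over the next value
        classes = list(range(len(probs)))
        next_value = rng.choices(classes, weights=probs, k=1)[0]
        samples.append(next_value)               # becomes input for the next step
    return samples

def toy_predict_next(history):
    """Hypothetical model: puts all probability on (last value + 1) mod 4."""
    probs = [0.0] * 4
    probs[(history[-1] + 1) % 4] = 1.0
    return probs

generated = generate(toy_predict_next, seed=[0], n_steps=5)
```

Each iteration depends on the previous one, which is why generation cannot be parallelised over time the way training can.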

Dilated Causal Convolutions

Regular Causal Convolution:

This represents a "convolution" insofar as we have a size-2 filter over previous values. The filter can of course be made as large as we like.

Output is restricted so it can't depend on future timesteps.

This is the equivalent of masked convolution for images.

Pros:

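Training is parallel: all timesteps of the ground-truth signal are known, so every output can be computed at once

Trains faster than RNNs, especially for very long sequences

Cons:

Inference is sequential

Requires many layers or large filters to get a large receptive field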
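A minimal single-channel sketch of a causal convolution, assuming zero padding on the left only. The function name and the averaging filter are illustrative, not taken from the paper.

```python
def causal_conv1d(x, weights):
    """y[t] = sum_i weights[i] * x[t - (k - 1) + i], with zeros before t = 0."""
    k = len(weights)
    padded = [0.0] * (k - 1) + list(x)   # pad on the left only: no future leakage
    return [
        sum(w * padded[t + i] for i, w in enumerate(weights))
        for t in range(len(x))
    ]

x = [1.0, 2.0, 3.0, 4.0]
# Size-2 filter: each output is the mean of the previous and current input.
y = causal_conv1d(x, weights=[0.5, 0.5])

# Changing a later input must not change earlier outputs.
y_changed = causal_conv1d([1.0, 2.0, 3.0, 99.0], weights=[0.5, 0.5])
```

With a size-2 filter each layer adds only one sample of context, which is the receptive-field limitation listed under the cons.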

Dilated Causal Convolution:

"Dilate" filter with zeros to allow larger receptive field

Similar to pooling or strided convolutions, although the output keeps the same size as the input.

Approach used in paper is to double dilation for every layer up to a limit and then repeat. e.g. 1,2,4,...,512,1,2,4,...,512,1,2,4,...,512. Combines exponential receptive field growth with "stacking" benefits.
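To make the receptive-field arithmetic concrete, here is a small sketch assuming a filter size of 2 in every layer (as in the size-2 example earlier in these notes). The function names are illustrative.

```python
def dilated_causal_conv1d(x, weights, dilation):
    """y[t] = sum_i weights[i] * x[t - (k - 1 - i) * dilation], zeros before t = 0."""
    k = len(weights)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(x)       # left padding keeps the layer causal
    return [
        sum(w * padded[t + i * dilation] for i, w in enumerate(weights))
        for t in range(len(x))
    ]

def receptive_field(dilations, filter_size=2):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((filter_size - 1) * d for d in dilations)

one_block = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
three_blocks = one_block * 3              # the schedule repeated three times

rf_one = receptive_field(one_block)
rf_three = receptive_field(three_blocks)

# With dilation 2, each output combines x[t - 2] and x[t].
y = dilated_causal_conv1d([1.0, 2.0, 3.0, 4.0, 5.0], weights=[1.0, 1.0], dilation=2)
```

Under these assumptions one 1, 2, 4, ..., 512 block covers 1024 samples with only ten layers, whereas ten undilated size-2 layers would cover 11.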

Softmax Distributions

Even though data is implicitly continuous, research exists showing softmax tends to work better than alternatives.

Problem: raw audio is typically stored as 16-bit integers ($2^{16} = 65{,}536$ possible values per sample), which would make the softmax output very large.

Solution: a $\mu$-law companding transformation is applied to the data (equation in the paper; the curve looks like a tanh), after which it is quantised down to 256 possible values
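A sketch of $\mu$-law companding followed by 256-level quantisation, using the standard formula $f(x) = \operatorname{sign}(x) \, \ln(1 + \mu |x|) / \ln(1 + \mu)$ with $\mu = 255$ and input scaled to $[-1, 1]$. The rounding scheme and function names are assumptions made for illustration; the notes do not say how the paper rounds.

```python
import math

MU = 255  # gives MU + 1 = 256 quantisation levels

def mu_law_encode(x, mu=MU):
    """Map an amplitude x in [-1, 1] to an integer class in [0, mu]."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest level.
    return int((compressed + 1) / 2 * mu + 0.5)

def mu_law_decode(index, mu=MU):
    """Map an integer class in [0, mu] back to an amplitude in [-1, 1]."""
    compressed = 2 * index / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)

quiet_step = mu_law_decode(129) - mu_law_decode(128)   # step size near zero
loud_step = mu_law_decode(255) - mu_law_decode(254)    # step size near full scale
```

The levels are spaced much more finely near zero than near full scale, so quiet detail survives the reduction from 65,536 to 256 values.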

Gated Activation Units

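Replacing ReLU with the following gating scheme was very effective:

$\mathbf{z} = \tanh(W_{f,k} * \mathbf{x}) \odot \sigma(W_{g,k} * \mathbf{x})$

where $*$ is the convolution operator, $\odot$ is element-wise multiplication, $\sigma$ is the sigmoid function, $k$ is the layer index, $f$ and $g$ denote filter and gate, and the $W$s are learnable convolutional filters.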

Residual and skip connections are also used.
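A minimal sketch of the gated activation in this section, applied to the outputs of two separate convolutions (a "filter" branch and a "gate" branch). The residual step is simplified: as far as I recall, the paper also applies 1x1 convolutions before the residual and skip outputs, which are omitted here. All names are illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_activation(filter_out, gate_out):
    """z = tanh(filter_out) * sigmoid(gate_out), element by element."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]

def residual_output(layer_input, z):
    """Simplified residual connection: add the gated output to the layer input."""
    return [xi + zi for xi, zi in zip(layer_input, z)]

filter_out = [0.0, 2.0, 2.0]    # hypothetical output of the filter convolution
gate_out = [5.0, 5.0, -5.0]     # hypothetical output of the gate convolution

z = gated_activation(filter_out, gate_out)
out = residual_output([1.0, 1.0, 1.0], z)
```

The sigmoid branch acts as a soft switch: with a strongly negative gate the tanh signal is almost entirely suppressed, and with a strongly positive gate it passes through nearly unchanged.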

Conditional WaveNets

If we condition each prediction on some additional input $\mathbf{h}$, we can control what is generated: the text to speak for TTS, the voice type, etc.

We can either condition on some global input, or on time-specific local input.

For TTS it is conditioned locally on linguistic features derived from the text, and optionally on the logarithmic fundamental frequency, a time-varying signal that tracks the pitch of the voice
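For reference, this is how the conditioning enters the model, as best these notes can reconstruct it from the paper. The symbols $V$ (learnable projection for the conditioning input) and $\mathbf{y}$ (the local conditioning signal upsampled to the audio rate) are not defined elsewhere in these notes, so treat the exact notation as an assumption.

```latex
% Conditional factorisation: every prediction also sees the extra input h
p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}, \mathbf{h})

% Global conditioning: h is a single vector (e.g. a speaker embedding), broadcast over time
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{\top} \mathbf{h}\right) \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{\top} \mathbf{h}\right)

% Local conditioning: y = f(h) is h upsampled to the audio rate; V * y is a 1x1 convolution
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k} * \mathbf{y}\right) \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k} * \mathbf{y}\right)
```

In both cases the conditioning term is simply added inside the tanh and sigmoid of the gated activation unit.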

Experiments

Multi-speaker

Conditioned on a one-hot encoding of the speaker ID (109 speakers); a single model could generate speech in each speaker's voice

Text-to-speech

Training:

Roughly a day of speech data for each of English and Mandarin

Locally conditioned on linguistic features, optionally together with the logarithmic fundamental frequency

Receptive field size: 240 ms

Baselines

LSTM (parametric)

Hidden Markov Model (concatenative)

Results

Subjective paired comparison test: which do you prefer?

Mean opinion score (MOS): rate each sample on a 5-point Likert scale

Speech Recognition

To make this work:

Mean-pooling layer added after the dilated convolutions to coarsen the activations into 10 ms frames

Followed by a few regular convolution layers

Two losses - one to classify frame and one to predict next sample
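The mean-pooling step in the list above can be sketched as follows, assuming 16 kHz audio so that a 10 ms frame spans 160 samples. The sampling rate, the toy signal, and the function name are assumptions for illustration.

```python
def mean_pool(activations, frame_size):
    """Average non-overlapping windows of frame_size values (any remainder is dropped)."""
    n_frames = len(activations) // frame_size
    return [
        sum(activations[i * frame_size:(i + 1) * frame_size]) / frame_size
        for i in range(n_frames)
    ]

SAMPLE_RATE_HZ = 16_000                          # assumed sampling rate
FRAME_MS = 10
frame_size = SAMPLE_RATE_HZ * FRAME_MS // 1000   # samples per 10 ms frame

# One second of toy per-sample activations alternating between 0 and 1.
one_second = [float(t % 2) for t in range(SAMPLE_RATE_HZ)]
frames = mean_pool(one_second, frame_size)
```

Pooling brings the sequence down to the frame rate at which the classification loss is applied, while the next-sample loss still operates at the full audio rate.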

Achieved 18.8 PER on the TIMIT test set, which the paper reports as the best result from a model trained directly on raw audio.
