Expressive Speech Synthesis with Tacotron
Tuesday, March 27, 2018
Posted by Yuxuan Wang, Research Scientist and RJ Skerry-Ryan, Software Engineer, on behalf of the Machine Perception, Google Brain and TTS Research teams
At Google, we're excited about the recent rapid progress of neural network-based text-to-speech (TTS) research. In particular, end-to-end architectures, such as the Tacotron systems we announced last year, can both simplify voice building pipelines and produce natural-sounding speech. This will help us build better human-computer interfaces, like conversational assistants, audiobook narration, news readers, or voice design software. To deliver a truly human-like voice, however, a TTS system must learn to model prosody, the collection of expressive factors of speech, such as intonation, stress, and rhythm. Most current end-to-end systems, including Tacotron, don't explicitly model prosody, meaning they can't control exactly how the generated speech should sound. This may lead to monotonous-sounding speech, even when models are trained on very expressive datasets like audiobooks, which often contain character voices with significant variation. Today, we are excited to share two new papers that address these problems.
Our first paper, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron”, introduces the concept of a prosody embedding. We augment the Tacotron architecture with an additional prosody encoder that computes a low-dimensional embedding from a clip of human speech (the reference audio).
We augment Tacotron with a prosody encoder. The lower half of the image is the original Tacotron sequence-to-sequence model. For technical details, please refer to the paper.
This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits — these are attributes like stress, intonation, and timing. At inference time, we can use this embedding to perform prosody transfer, generating speech in the voice of a completely different speaker, but exhibiting the prosody of the reference.
Text: *Is* that Utah travel agency?
Reference prosody (Australian)
Synthesized without prosody embedding (American)
Synthesized with prosody embedding (American)
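As a rough illustration of the conditioning idea described above, the sketch below collapses a variable-length reference clip into one fixed-length vector and appends that vector to every text-encoder state. It is a toy under stated assumptions, not the model from the paper: mean pooling followed by a linear projection and `tanh` stands in for the learned prosody encoder, and the function names (`prosody_embedding`, `condition_text_states`), the feature sizes, and the concatenation step are all hypothetical choices made for this example.

```python
import math

def prosody_embedding(frames, projection):
    """Collapse a variable-length list of acoustic frames into one fixed-length vector.

    Assumption: mean pooling + linear projection + tanh stands in for the
    learned prosody encoder, which this sketch does not attempt to reproduce.
    """
    n_feats = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / len(frames) for i in range(n_feats)]
    return [math.tanh(sum(w * x for w, x in zip(row, mean))) for row in projection]

def condition_text_states(text_states, embedding):
    """Append the same embedding to every text-encoder state (assumed conditioning)."""
    return [state + embedding for state in text_states]

# Hypothetical sizes: 4 features per frame, 2-dimensional embedding.
projection = [[0.5, -0.25, 0.0, 0.1],
              [0.0, 0.3, -0.2, 0.4]]
short_clip = [[0.1, 0.2, 0.3, 0.4]] * 5    # 5 frames of reference audio features
long_clip = [[0.4, 0.3, 0.2, 0.1]] * 50    # 50 frames of reference audio features

emb_short = prosody_embedding(short_clip, projection)
emb_long = prosody_embedding(long_clip, projection)

# Three text-encoder states for the phrase to be synthesized.
text_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
conditioned = condition_text_states(text_states, emb_short)
```

Both clips map to an embedding of the same size regardless of their duration, which is what lets a single vector condition the synthesis of a different speaker's voice.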
The embedding can also transfer fine time-aligned prosody from one phrase to a slightly different phrase, though this technique works best when the reference and target phrases are similar in length and structure.
Reference Text: For the first time in her life she had been danced tired.
Synthesized Text: For the last time in his life he had been handily embarrassed.
Reference prosody (American)
Synthesized without prosody embedding (American)
Synthesized with prosody embedding (American)
Excitingly, we observe prosody transfer even when the reference audio comes from a speaker whose voice is not in Tacotron's training data.
Text: I've Swallowed a Pollywog.
Reference prosody (Unseen American Speaker)
Synthesized without prosody embedding (British)
Synthesized with prosody embedding (British)
This is a promising result, as it paves the way for voice interaction designers to use their own voice to customize speech synthesis. You can listen to the full set of audio demos for “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron” on this web page.
Despite their ability to transfer prosody with high fidelity, the embeddings from the paper above don't completely disentangle prosody from the content of a reference audio clip. (This explains why they transfer prosody best to phrases of similar structure and length.) Furthermore, they require a clip of reference audio at inference time. A natural question then arises: can we develop a model of expressive speech that alleviates these problems?
In our second paper, “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis”, we do just that. Building upon the architecture in our first paper, we propose a new unsupervised method for modeling latent "factors" of speech. The key to this model is that, rather than learning fine time-aligned prosodic elements, it learns higher-level speaking style patterns that can be transferred across arbitrarily different phrases.
The model works by adding an extra attention mechanism to Tacotron, forcing it to represent the prosody embedding of any speech clip as the linear combination of a fixed set of basis embeddings. We call these embeddings Global Style Tokens (GSTs), and find that they learn text-independent variations in a speaker's style (soft, high-pitch, intense, etc.), without the need for explicit style labels.
Model architecture of Global Style Tokens. The prosody embedding is decomposed into “style tokens” to enable unsupervised style control and transfer. For technical details, please refer to the paper.
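The "linear combination of a fixed set of basis embeddings" can be sketched in a few lines. The example below is a minimal illustration, assuming single-head dot-product attention with softmax weights; the token values, the query vector, and the helper names (`gst_weights`, `style_embedding`) are hypothetical, and the paper describes the attention mechanism actually used.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gst_weights(query, tokens):
    """Attend over the tokens with a reference-derived query (assumed dot-product scoring)."""
    scores = [sum(q * t for q, t in zip(query, token)) for token in tokens]
    return softmax(scores)

def style_embedding(weights, tokens):
    """Weighted sum of the token embeddings: the linear combination itself."""
    dim = len(tokens[0])
    return [sum(w * token[i] for w, token in zip(weights, tokens)) for i in range(dim)]

# Hypothetical bank of 4 tokens, each 3-dimensional. In a trained model these
# would be learned parameters shared across all utterances.
tokens = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [0.5, 0.5, 0.0]]

query = [2.0, 0.0, -1.0]            # stand-in for an encoded reference clip
weights = gst_weights(query, tokens)
embedding = style_embedding(weights, tokens)
```

Because every clip is forced through the same small bank of tokens, the only per-utterance information that survives is the weight vector, which is one way to see why the resulting representation tends to be independent of the text.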
At inference time, we can select or modify the combination weights for the tokens, allowing us to force Tacotron to use a specific speaking style without needing a reference audio clip. Using GSTs, for example, we can make different sentences of varying lengths sound more "lively", "angry", "lamenting", etc.:
Text: United Airlines five six three from Los Angeles to New Orleans has Landed.
Style 1
Style 2
Style 3
Style 4
Style 5
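Choosing the combination weights by hand, with no reference audio involved, might look like the sketch below. The token bank, the `one_hot` helper, and the idea of scaling a weight above 1 to exaggerate a style are assumptions made for illustration; which token corresponds to which audible style can only be discovered by listening to a trained model.

```python
def style_from_weights(weights, tokens):
    """Build a style embedding as a weighted sum of token embeddings."""
    dim = len(tokens[0])
    return [sum(w * token[i] for w, token in zip(weights, tokens)) for i in range(dim)]

def one_hot(index, size):
    """Weight vector that selects exactly one token."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical bank of 4 two-dimensional tokens (learned in a real model).
tokens = [[1.0, 0.0],
          [0.0, 1.0],
          [-1.0, 0.0],
          [0.0, -1.0]]

# Select a single token as the speaking style.
single = style_from_weights(one_hot(2, len(tokens)), tokens)

# Blend two tokens with hand-picked weights.
blend = style_from_weights([0.7, 0.0, 0.3, 0.0], tokens)

# Scale one token's weight to push the style further (an illustrative choice).
exaggerated = style_from_weights([1.5 * w for w in one_hot(2, len(tokens))], tokens)
```

The same hand-built embedding can then condition any sentence, since nothing in it depends on the text being spoken.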
The text-independent nature of GSTs makes them ideal for style transfer, which takes a reference audio clip spoken in a specific style and transfers its style to any target phrase we choose. To achieve this, we first run inference to predict the GST combination weights for an utterance whose style we want to imitate. We can then feed those combination weights to the model to synthesize completely different phrases (even those with very different lengths and structure) in the same style.
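The two-step procedure above (predict the weights once, then reuse them) can be sketched as follows. This is a simplified illustration: the reference clip is assumed to have been summarized already as a fixed-length vector, the weights are assumed to come from softmax dot-product attention, and `predict_weights`, `apply_style`, and all numeric values are hypothetical.

```python
import math

def predict_weights(reference_summary, tokens):
    """Step 1: infer combination weights from the reference (assumed softmax attention)."""
    scores = [sum(r * t for r, t in zip(reference_summary, token)) for token in tokens]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_style(text_states, weights, tokens):
    """Step 2: build the style vector from the weights and attach it to each text state."""
    dim = len(tokens[0])
    style = [sum(w * token[i] for w, token in zip(weights, tokens)) for i in range(dim)]
    return [state + style for state in text_states]

# Hypothetical token bank and reference summary.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
reference_summary = [0.2, 1.5]

weights = predict_weights(reference_summary, tokens)

# Two target phrases of very different lengths (3 and 12 encoder states).
short_phrase = [[0.1, 0.2]] * 3
long_phrase = [[0.3, 0.4]] * 12

styled_short = apply_style(short_phrase, weights, tokens)
styled_long = apply_style(long_phrase, weights, tokens)
```

The weights are computed once from the reference and never see the target text, so the same style vector ends up attached to both phrases regardless of their length.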
Finally, our paper shows that Global Style Tokens can model more than just speaking style. When trained on noisy YouTube audio from unlabeled speakers, a GST-enabled Tacotron learns to represent noise sources and distinct speakers as separate tokens. This means that by selecting the GSTs we use in inference, we can synthesize speech free of background noise, or speech in the voice of a specific unlabeled speaker from the dataset. This exciting result provides a path towards highly scalable but robust speech synthesis. You can listen to the full set of demos for “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis” on this web page.
We are excited about the potential applications and opportunities that these two bodies of research enable. In the meantime, there are new important research problems to be addressed. We'd like to extend the techniques of the first paper to support prosody transfer in the natural pitch range of the target speaker. We'd also like to develop techniques to select appropriate prosody or speaking style automatically from context, using, for example, the integration of natural language understanding with TTS. Finally, while our first paper proposes an initial set of objective and subjective metrics for prosody transfer, we'd like to develop these further to help establish generally-accepted methods for prosodic evaluation.
Acknowledgements
These projects were done jointly between multiple Google teams. Contributors include RJ Skerry-Ryan, Yuxuan Wang, Daisy Stanton, Eric Battenberg, Ying Xiao, Joel Shor, Rif A. Saurous, Yu Zhang, Ron J. Weiss, Rob Clark, Fei Ren and Ye Jia.