SpecAugment: A New Data Augmentation Method for Automatic Speech Recognition
Monday, April 22, 2019
Posted by Daniel S. Park, AI Resident and William Chan, Research Scientist
Automatic Speech Recognition (ASR), the process of taking an audio input and transcribing it to text, has benefited greatly from the ongoing development of deep neural networks. As a result, ASR has become ubiquitous in many modern devices and products, such as Google Assistant, Google Home and YouTube. Nevertheless, there remain many important challenges in developing deep learning-based ASR systems. One such challenge is that ASR models, which have many parameters, tend to overfit the training data and have a hard time generalizing to unseen data when the training set is not extensive enough.
In the absence of an adequate volume of training data, it is possible to increase the effective size of existing data through the process of data augmentation, which has contributed to significantly improving the performance of deep networks in the domain of image classification. In the case of speech recognition, augmentation traditionally involves deforming the audio waveform used for training in some fashion (e.g., by speeding it up or slowing it down), or adding background noise. This has the effect of making the dataset effectively larger, as multiple augmented versions of a single input are fed into the network over the course of training, and also helps the network become robust by forcing it to learn relevant features. However, existing conventional methods of augmenting audio input introduce additional computational cost and sometimes require additional data.
In our recent paper, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”, we take a new approach to augmenting audio data, treating it as a visual problem rather than an audio one. Instead of augmenting the input audio waveform as is traditionally done, SpecAugment applies an augmentation policy directly to the audio spectrogram (i.e., an image representation of the waveform). This method is simple, computationally cheap to apply, and does not require additional data. It is also surprisingly effective in improving the performance of ASR networks, demonstrating state-of-the-art performance on the ASR tasks LibriSpeech 960h and Switchboard 300h.
SpecAugment
In traditional ASR, the audio waveform is typically encoded as a visual representation, such as a spectrogram, before being used as training input for the network. Augmentation of training data is normally applied to the waveform audio before it is converted into the spectrogram, such that after every iteration, new spectrograms must be generated. In our approach, we instead augment the spectrogram itself rather than the waveform data. Since the augmentation is applied directly to the input features of the network, it can be run online during training without significantly impacting training speed.
A waveform is typically converted into a visual representation (in our case, a log mel spectrogram; steps 1 through 3 of this article) before being fed into a network.
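For readers who want to reproduce this conversion step, the following sketch computes a log mel spectrogram with the librosa library. The frame and filter-bank parameters shown are common illustrative choices and may differ from the exact front-end used in the paper.

```python
import librosa
import numpy as np

# Load audio at 16 kHz and compute a log mel spectrogram
# (illustrative parameters, not necessarily the paper's settings).
waveform, sr = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = np.log(mel + 1e-6)  # shape: (n_mels, time_steps)
```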
SpecAugment modifies the spectrogram by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of utterances in time. These augmentations were chosen to help the network be robust against deformations in the time direction, partial loss of frequency information, and partial loss of small segments of speech in the input. An example of such an augmentation policy is displayed below.
The log mel spectrogram is augmented by warping in the time direction, and masking (multiple) blocks of consecutive time steps (vertical masks) and mel frequency channels (horizontal masks). The masked portion of the spectrogram is displayed in purple for emphasis.
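The masking operations are simple to express directly on the spectrogram array. Below is a minimal numpy sketch of frequency and time masking (time warping, which requires a sparse image warp, is omitted for brevity); the mask counts and widths are illustrative hyperparameters, and masked regions are zeroed out here for simplicity.

```python
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, max_freq_width=27,
                 num_time_masks=2, max_time_width=100):
    """Apply frequency and time masking to a (n_mels, n_steps) log mel
    spectrogram. Mask sizes and positions are drawn at random."""
    spec = log_mel.copy()
    n_mels, n_steps = spec.shape
    for _ in range(num_freq_masks):  # horizontal masks over mel channels
        f = np.random.randint(0, min(max_freq_width, n_mels) + 1)
        f0 = np.random.randint(0, n_mels - f + 1)
        spec[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):  # vertical masks over time steps
        t = np.random.randint(0, min(max_time_width, n_steps) + 1)
        t0 = np.random.randint(0, n_steps - t + 1)
        spec[:, t0:t0 + t] = 0.0
    return spec

# Fresh masks are drawn each call, so the same utterance yields a new
# augmented spectrogram on every training pass.
augmented = spec_augment(np.random.randn(80, 500))
```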
To test SpecAugment, we performed some experiments with the LibriSpeech dataset, where we took three Listen, Attend and Spell (LAS) networks, end-to-end networks commonly used for speech recognition, and compared the test performance between networks trained with and without augmentation. The performance of an ASR network is measured by the Word Error Rate (WER) of the transcript produced by the network against the target transcript. Here, all hyperparameters were kept the same, and only the data fed into the network was altered. We found that SpecAugment improves network performance without any additional adjustments to the network or training parameters.
Performance of networks on the test sets of LibriSpeech with and without augmentation. The LibriSpeech test set is divided into two portions, test-clean and test-other, the latter of which contains noisier audio data.
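The WER itself is just a normalized word-level edit distance. As a reference point, here is a self-contained sketch of the standard dynamic-programming computation; evaluation toolkits add details such as text normalization that are omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # 0.333...
```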
More importantly, SpecAugment prevents the network from overfitting by giving it deliberately corrupted data. As an example of this, below we show how the WER for the training set and the development (or dev) set evolves through training with and without augmentation. We see that without augmentation, the network achieves near-perfect performance on the training set, while grossly under-performing on both the clean and noisy dev sets. On the other hand, with augmentation, the network struggles to perform as well on the training set, but actually shows better performance on the clean dev set, and comparable performance on the noisy dev set. This suggests that the network is no longer overfitting the training data, and that improving training performance would lead to better test performance.
Training, clean (dev-clean) and noisy (dev-other) development set performance with and without augmentation.
State-of-the-Art Results
We can now focus on improving training performance, which can be done by adding more capacity to the networks by making them larger. By doing this in conjunction with increasing training time, we were able to get state-of-the-art (SOTA) results on the LibriSpeech 960h and Switchboard 300h tasks.
Word error rates (%) for state-of-the-art results on the LibriSpeech 960h and Switchboard 300h tasks. The test sets for both tasks have a clean (clean/Switchboard) and a noisy (other/CallHome) subset. Previous SOTA results taken from Li et al. (2019), Yang et al. (2018) and Zeyer et al. (2018).
The simple augmentation scheme we have used is surprisingly powerful: we are able to improve the performance of the end-to-end LAS networks so much that it surpasses that of classical ASR models, which traditionally did much better on smaller academic datasets such as LibriSpeech or Switchboard.
Performance of various classes of networks on LibriSpeech and Switchboard tasks. The performance of LAS models is compared to classical (e.g., HMM) and other end-to-end models (e.g., CTC/ASG) over time.
Language Models
Language models (LMs), which are trained on a larger corpus of text-only data, have played a significant role in improving the performance of ASR networks by leveraging information learned from text. However, LMs typically need to be trained separately from the ASR network and can take up a lot of memory, making them hard to fit on a small device, such as a phone. An unexpected outcome of our research was that models trained with SpecAugment outperformed all prior methods even without the aid of a language model. While our networks still benefit from adding an LM, our results are encouraging in that they suggest the possibility of training networks that can be used for practical purposes without the aid of an LM.
Word error rates for LibriSpeech and Switchboard tasks with and without LMs. SpecAugment outperforms previous state-of-the-art even before the inclusion of a language model.
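The post does not detail how an LM is combined with the ASR network; one common technique for end-to-end models is shallow fusion, where a weighted LM log-probability is added to the ASR score when ranking hypotheses during beam search. The sketch below illustrates the idea on precomputed hypothesis scores; the candidate hypotheses and weight are hypothetical.

```python
def fused_score(asr_log_prob, lm_log_prob, lm_weight=0.3):
    """Shallow fusion: add a weighted LM log-probability to the ASR
    score when ranking hypotheses. The weight is tuned on a dev set."""
    return asr_log_prob + lm_weight * lm_log_prob

# Hypothetical beam-search candidates: (text, ASR log-prob, LM log-prob).
beams = [("the cat sat", -1.2, -3.0),
         ("the cat sad", -1.1, -6.5)]
best = max(beams, key=lambda c: fused_score(c[1], c[2]))
print(best[0])  # "the cat sat": the LM favors the fluent hypothesis
```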
Most of the work on ASR in the past has been focused on looking for better networks to train. Our work demonstrates that looking for better ways to train networks is a promising alternative direction of research.
Acknowledgements
We would like to thank the co-authors of our paper Chung-Cheng Chiu, Ekin Dogus Cubuk, Quoc Le, Yu Zhang and Barret Zoph. We also thank Yuan Cao, Ciprian Chelba, Kazuki Irie, Ye Jia, Anjuli Kannan, Patrick Nguyen, Vijay Peddinti, Rohit Prabhavalkar, Yonghui Wu and Shuyuan Zhang for useful discussions.