Measuring Compositional Generalization
Friday, March 6, 2020
Posted by Marc van Zee, Software Engineer, Google Research
People are capable of learning the meaning of a new word and then applying it to other language contexts. As Lake and Baroni put it, “Once a person learns the meaning of a new verb ‘dax’, he or she can immediately understand the meaning of ‘dax twice’ and ‘sing and dax’.” Similarly, one can learn a new object shape and then recognize it with different compositions of previously learned colors or materials (e.g., in the CLEVR dataset). This is because people exhibit the capacity to understand and produce a potentially infinite number of novel combinations of known components, or, as Chomsky said, to make “infinite use of finite means.” In the context of a machine learning model learning from a set of training examples, this skill is called compositional generalization.
A common approach for measuring compositional generalization in machine learning (ML) systems is to split the training and testing data based on properties that intuitively correlate with compositional structure. For instance, one approach is to split the data based on sequence length: the training set consists of short examples, while the test set consists of longer examples. Another approach uses sequence patterns, meaning the split is based on randomly assigning clusters of examples sharing the same pattern to either the train or test set. For instance, the questions "Who directed Movie1" and "Who directed Movie2" both fall into the pattern "Who directed <MOVIE>", so they would be grouped together. Yet another method uses held-out primitives: some linguistic primitives are shown very rarely during training (e.g., the verb "jump") but are very prominent in testing. While each of these experiments is useful, it is not immediately clear which is a "better" measure of compositionality. Is it possible to systematically design an "optimal" compositional generalization experiment?
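To make the first two splitting strategies concrete, here is a minimal sketch in Python. The example data, the `pattern_of` normalizer, and its movie list are hypothetical simplifications for illustration, not part of any released benchmark.

```python
import random
from collections import defaultdict

def length_split(examples, max_train_len):
    """Length split: short examples go to train, longer ones to test."""
    train = [ex for ex in examples if len(ex.split()) <= max_train_len]
    test = [ex for ex in examples if len(ex.split()) > max_train_len]
    return train, test

def pattern_of(question):
    """Hypothetical normalizer: replace movie titles with a <MOVIE> placeholder."""
    movies = {"Inception", "Goldfinger"}  # toy entity list
    return " ".join("<MOVIE>" if w in movies else w for w in question.split())

def pattern_split(examples, train_fraction=0.8, seed=0):
    """Pattern split: whole clusters sharing a pattern go to train or test."""
    clusters = defaultdict(list)
    for ex in examples:
        clusters[pattern_of(ex)].append(ex)
    keys = sorted(clusters)
    random.Random(seed).shuffle(keys)
    cut = int(train_fraction * len(keys))
    train = [ex for k in keys[:cut] for ex in clusters[k]]
    test = [ex for k in keys[cut:] for ex in clusters[k]]
    return train, test
```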
In “Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, we attempt to address this question by introducing the largest and most comprehensive benchmark for compositional generalization using realistic natural language understanding tasks, specifically semantic parsing and question answering. In this work, we propose a metric, compound divergence, that allows one to quantitatively assess how much a train-test split measures the compositional generalization ability of an ML system. We analyze the compositional generalization ability of three sequence-to-sequence ML architectures and find that they fail to generalize compositionally. We are also releasing the Compositional Freebase Questions dataset used in the work as a resource for researchers wishing to improve upon these results.
Measuring Compositionality
In order to measure the compositional generalization ability of a system, we start with the assumption that we understand the underlying principles of how examples are generated. For instance, we begin with the grammar rules to which we must adhere when generating questions and answers. We then draw a distinction between atoms and compounds. Atoms are the building blocks that are used to generate examples, and compounds are concrete (potentially partial) compositions of these atoms. For example, in the figure below, every box is an atom (e.g., Shane Steel, brother, <entity>'s <entity>, produce, etc.), and these atoms fit together to form compounds, such as produce and <verb>, Shane Steel’s brother, and Did Shane Steel’s brother produce and direct Revenge of the Spy?
Building compositional sentences (compounds) from building blocks (atoms).
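As a rough illustration of the atom/compound distinction, the sketch below represents atoms as strings and compounds as nested tuples of atoms. This representation is our own simplification; the paper derives atoms and compounds from the rule trees used to generate each example.

```python
# Atoms: indivisible building blocks used by the generation grammar.
atoms = {"Shane Steel", "brother", "<entity>'s <entity>", "produce", "direct"}

# Compounds: concrete (possibly partial) compositions of atoms,
# represented here as nested tuples.
possessive = ("<entity>'s <entity>", "Shane Steel", "brother")  # Shane Steel's brother
conjunction = ("<verb> and <verb>", "produce", "direct")        # produce and direct

def atoms_of(compound):
    """Recursively collect the atoms a compound is built from."""
    if isinstance(compound, str):
        return {compound}
    found = set()
    for part in compound:
        found |= atoms_of(part)
    return found

print(atoms_of(possessive))  # {"<entity>'s <entity>", 'Shane Steel', 'brother'}
```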
An ideal compositionality experiment, then, should have a similar atom distribution, i.e., the distribution of words and sub-phrases in the training set is as similar as possible to their distribution in the test set, but a different compound distribution. To measure compositional generalization on a question answering task about a movie domain, one might, for instance, have the following questions in train and test:
While atoms such as “directed”, “Inception”, and “who <predicate> <entity>” appear in both the train and test sets, the compounds are different.
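One crude way to observe this property is to compare surface n-gram distributions across the split, using unigrams as a stand-in for atoms and bigrams as a stand-in for compounds. The benchmark itself derives both from the generation rules rather than from surface n-grams, and the toy questions below are hypothetical.

```python
from collections import Counter

def ngram_dist(sentences, n):
    """Normalized frequency distribution of word n-grams."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

train = ["Who directed Inception", "Who produced Goldfinger"]
test = ["Who directed Goldfinger", "Who produced Inception"]

# Unigrams (atom proxy) are identical across the split;
# bigrams (compound proxy) only partially overlap.
print(ngram_dist(train, 1).keys() & ngram_dist(test, 1).keys())
print(ngram_dist(train, 2).keys() & ngram_dist(test, 2).keys())
```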
The Compositional Freebase Questions dataset
In order to conduct an accurate compositionality experiment, we created the Compositional Freebase Questions (CFQ) dataset, a simple yet realistic, large dataset of natural language questions and answers generated from the public Freebase knowledge base. CFQ can be used for text-in / text-out tasks, as well as semantic parsing. In our experiments, we focus on semantic parsing, where the input is a natural language question and the output is a query, which, when executed against Freebase, produces the correct outcome. CFQ contains around 240k examples and almost 35k query patterns, making it significantly larger and more complex than comparable datasets: about 4 times the size of WikiSQL, with about 17x more query patterns than Complex Web Questions. Special care has been taken to ensure that the questions and answers are natural. We also quantify the complexity of the syntax in each example using the "complexity level" metric (L), which corresponds roughly to the depth of the parse tree, examples of which are shown below.
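For a flavor of the semantic parsing task, here is a hypothetical CFQ-style example pair in which the target is a SPARQL-like query over Freebase. The relation identifiers and entity names below are illustrative inventions, not copied from the dataset.

```python
# A hypothetical CFQ-style example: natural language in, SPARQL out.
# The ns: identifiers below are invented Freebase-style names.
example = {
    "question": "Did Shane Steel's brother produce and direct Revenge of the Spy?",
    "query": (
        "SELECT count(*) WHERE { "
        "ns:m.shane_steel ns:people.person.sibling ?x0 . "
        "?x0 ns:film.producer.film ns:m.revenge_of_the_spy . "
        "?x0 ns:film.director.film ns:m.revenge_of_the_spy }"
    ),
}
```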
Compositional Generalization Experiments on CFQ
For a given train-test split, if the compound distributions of the train and test sets are very similar, then their compound divergence is close to 0, indicating that the split is not a difficult test for compositional generalization. A compound divergence close to 1 means that the train and test sets contain many different compounds, which makes the split a good test for compositional generalization. Compound divergence thus captures the notion of "different compound distribution", as desired.
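Here is a minimal sketch of how such a divergence can be computed, assuming the weighted Chernoff coefficient described in the paper (with a weight of roughly 0.5 for atom divergence and 0.1 for compound divergence). The inputs are plain frequency dictionaries like those produced by the n-gram sketch above.

```python
def divergence(p, q, alpha):
    """1 minus the Chernoff coefficient between two frequency distributions.

    p, q: dicts mapping an atom/compound to its relative frequency.
    alpha: weight of the train-side distribution (per the paper, roughly
    0.5 for atom divergence and 0.1 for compound divergence).
    Keys missing from either distribution contribute zero to the sum.
    """
    coeff = sum(p[k] ** alpha * q[k] ** (1 - alpha) for k in p.keys() & q.keys())
    return 1.0 - coeff

# Identical distributions give divergence 0; disjoint ones give 1.
print(divergence({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}, alpha=0.1))  # 0.0
print(divergence({"a": 1.0}, {"b": 1.0}, alpha=0.1))                      # 1.0
```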
We algorithmically generate train-test splits using the CFQ dataset that have a compound divergence ranging from 0 to 0.7 (the maximum that we were able to achieve), while fixing the atom divergence to be very small. Then, for each split, we measure the performance of three standard ML architectures: LSTM+attention, Transformer, and Universal Transformer. The results are shown in the graph below.
Compound divergence vs accuracy for three ML architectures. There is a surprisingly strong negative correlation between compound divergence and accuracy.
We measure the performance of a model by comparing the correct answers with the output string given by the model. All models achieve an accuracy greater than 95% when the compound divergence is very low. The mean accuracy on the split with highest compound divergence is below 20% for all architectures, which means that even a large training set with a similar atom distribution between train and test is not sufficient for the architectures to generalize well. For all architectures, there is a strong negative correlation between the compound divergence and the accuracy. This seems to indicate that compound divergence successfully captures the core difficulty for these ML architectures to generalize compositionally.
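Because a prediction counts as correct only when the model's output string matches the reference exactly, accuracy reduces to a simple exact-match computation. A minimal sketch, with hypothetical parser outputs:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of model output strings that exactly match the reference."""
    assert len(predictions) == len(references) and references
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Hypothetical outputs from a semantic parser vs. gold queries.
preds = ["SELECT count(*) WHERE { ?x a ns:film.film }", "SELECT ?x WHERE ..."]
golds = ["SELECT count(*) WHERE { ?x a ns:film.film }", "SELECT DISTINCT ?x ..."]
print(exact_match_accuracy(preds, golds))  # 0.5
```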
Potentially promising directions for future work might be to apply unsupervised pre-training on input language or output queries, or to use more diverse or more targeted learning architectures, such as syntactic attention. It would also be interesting to apply this approach to other domains, such as visual reasoning (e.g., based on CLEVR), or to extend our approach to broader subsets of language understanding, including the use of ambiguous constructs, negations, quantification, comparatives, additional languages, and other vertical domains. We hope that this work will inspire others to use this benchmark to advance the compositional generalization capabilities of learning systems.