Blog
The latest news from Google AI
Languages of the World (Wide Web)
Thursday, July 7, 2011
Posted by Daniel Ford and Josh Batson
The web is vast and infinite. Its pages link together in a complex network, containing remarkable structures and patterns. Some of the clearest patterns relate to language.
Most web pages link to other pages on the same web site, and the few off-site links they have are almost always to other pages in the same language. It's as if each language has its own web which is loosely linked to the webs of other languages. However, there are a small but significant number of off-site links between languages. These give tantalizing hints of the world beyond the virtual.
To see the connections between languages, start by taking the several billion most important pages on the web in 2008, including all pages in smaller languages, and look at the off-site links between these pages. The particular choice of pages in our corpus here reflects decisions about what is `important'. For example, in a language with few pages every page is considered important, while for languages with more pages some selection method is required, based on pagerank for example.
We can use our corpus to draw a very simple graph of the web, with a node for each language and an edge between two languages if more than one percent of the offsite links in the first language land on pages in the second. To make things a little clearer, we only show the languages which have at least a hundred thousand pages and have a strong link with another language, meaning at least 1% of off-site links go to that language. We also leave out English, which we'll discuss more in a moment. (Figure 1)
Looking at the language web in 2008, we see a surprisingly clear map of Europe and Asia.
The language linkages invite explanations around geopolitics, linguistics, and historical associations.
Figure 1: Language links on the web.
The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, which suggest geographic rather than purely linguistic associations.
Examining links between other languages, it seems that many are explained by people and communities which speak both languages.
The language webs of many former Soviet republics link back to the Russian web, with the strongest link from Ukrainian. While Russia is the major importer of Ukrainian products, the bilingual nature of Ukraine is a more plausible explanation. Most Ukrainians speak both languages, and Russian is even the dominant language in large parts of the country.
The link from Arabic to French speaks to the long connection between France and its former colonies. In many of these countries Arabic and French are now commonly spoken together, and there has been significant emigration from these countries to France.
Another strong link is between the Malay/Malaysian and Indonesian webs. Malaysia and Indonesia share a border, but more importantly the languages are nearly eighty percent cognate, meaning speakers of one can easily understand the other.
What about the sizes of each language web? Both the number of sites in each language and the number of urls seen by Google's crawler follow an exponential distribution, although the ordering for each is slightly different (Figure 2). The exact number of pages in each language in 2008 is unknown, since multiple urls may point to the same page and some pages may not have been seen at all. However, the language of an un-crawled url can be guessed by the dominant language of its site. In fact, calendar pages and other infinite spaces mean that there really are an unlimited number of pages on the web, though some are more useful than others.
Figure 2: The number of sites and seen urls per language are roughly exponentially distributed.
The largest language on the web, in terms of size and centrality, has always been English, but where is it on our map?
Every language on the web has strong links to English, usually with around twenty percent of offsite links and occasionally over forty five percent, such as from Tagalog/Filipino, spoken in the Philippines, and Urdu, principally spoken in Pakistan (Figure 3). Both the Philippines and Pakistan are former British colonies where English is one of the two official languages.
Figure 3: Language links to and from English
You might wonder whether off-site links landing on English pages can be explained simply by the number of English pages available to be linked to. The webs of other languages in our corpus typically have sixty to eighty percent of their out-language links to English pages. However, only 38 percent of the pages and 42 percent of sites in our set are English, while it attracts 79 percent of all out-language links from other languages.
Chinese and Japanese also seem unusual because there are relatively few links from pages in these languages to pages in English. This is despite the fact that Japanese and Chinese sites are the most popular non-English sites for English sites to link to. However, the number of sites in a language is a strong predictor of its `introversion', or fraction of off-site links to pages in the same language. Taking this into account shows that Chinese and Japanese webs are not unusually introverted given their size. In general, language webs with more sites are more introverted, perhaps due to better availability of content. (Figure 4)
Figure 4: Language size vs introversion.
There is a roughly linear relationship between the (log) number of sites in a language and the fraction of off-site links which point to pages in the same language, with a correlation of 0.9 if English is removed. However, only 45 percent of off-site links from English pages are to other English pages, making English the most extroverted web language given its size. Other notable outliers are the Hindi web, which is unusually introverted, and the Tagalog and Malay webs which are unusually extroverted.
We can generate another map by connecting languages if the number of links from one to the other is 50 times greater than expected given the number of out-of-language links and the size of the language linked to (Figure 5). This time, the native languages of India show up clearly. Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish.
Figure 5: Unexpected connections, given the size of each language.
What's happened since 2008? The languages of the web have become more densely connected. There is now significant content in even more languages, and these languages are more closely linked. We hope that tools like Google page translation, voice translation, and other services will accelerate this process and bring more people in the world closer together, whichever languages they speak.
UPDATE
9 July 2011:
As has been pointed out in the comments, in
both the Philippines and Pakistan, English is one of the two
official languages; however, the Philippines was not a British colony.
Labels
accessibility
ACL
ACM
Acoustic Modeling
Adaptive Data Analysis
ads
adsense
adwords
Africa
AI
AI for Social Good
Algorithms
Android
Android Wear
API
App Engine
App Inventor
April Fools
Art
Audio
Augmented Reality
Australia
Automatic Speech Recognition
AutoML
Awards
BigQuery
Cantonese
Chemistry
China
Chrome
Cloud Computing
Collaboration
Compression
Computational Imaging
Computational Photography
Computer Science
Computer Vision
conference
conferences
Conservation
correlate
Course Builder
crowd-sourcing
CVPR
Data Center
Data Discovery
data science
datasets
Deep Learning
DeepDream
DeepMind
distributed systems
Diversity
Earth Engine
economics
Education
Electronic Commerce and Algorithms
electronics
EMEA
EMNLP
Encryption
entities
Entity Salience
Environment
Europe
Exacycle
Expander
Faculty Institute
Faculty Summit
Flu Trends
Fusion Tables
gamification
Gboard
Gmail
Google Accelerated Science
Google Books
Google Brain
Google Cloud Platform
Google Docs
Google Drive
Google Genomics
Google Maps
Google Photos
Google Play Apps
Google Science Fair
Google Sheets
Google Translate
Google Trips
Google Voice Search
Google+
Government
grants
Graph
Graph Mining
Hardware
HCI
Health
High Dynamic Range Imaging
ICCV
ICLR
ICML
ICSE
Image Annotation
Image Classification
Image Processing
Inbox
India
Information Retrieval
internationalization
Internet of Things
Interspeech
IPython
Journalism
jsm
jsm2011
K-12
Kaggle
KDD
Keyboard Input
Klingon
Korean
Labs
Linear Optimization
localization
Low-Light Photography
Machine Hearing
Machine Intelligence
Machine Learning
Machine Perception
Machine Translation
Magenta
MapReduce
market algorithms
Market Research
Mixed Reality
ML
ML Fairness
MOOC
Moore's Law
Multimodal Learning
NAACL
Natural Language Processing
Natural Language Understanding
Network Management
Networks
Neural Networks
NeurIPS
Nexus
Ngram
NIPS
NLP
On-device Learning
open source
operating systems
Optical Character Recognition
optimization
osdi
osdi10
patents
Peer Review
ph.d. fellowship
PhD Fellowship
PhotoScan
Physics
PiLab
Pixel
Policy
Professional Development
Proposals
Public Data Explorer
publication
Publications
Quantum AI
Quantum Computing
Recommender Systems
Reinforcement Learning
renewable energy
Research
Research Awards
resource optimization
Robotics
schema.org
Search
search ads
Security and Privacy
Self-Supervised Learning
Semantic Models
Semi-supervised Learning
SIGCOMM
SIGMOD
Site Reliability Engineering
Social Networks
Software
Sound Search
Speech
Speech Recognition
statistics
Structured Data
Style Transfer
Supervised Learning
Systems
TensorBoard
TensorFlow
TPU
Translate
trends
TTS
TV
UI
University Relations
UNIX
Unsupervised Learning
User Experience
video
Video Analysis
Virtual Reality
Vision Research
Visiting Faculty
Visualization
VLDB
Voice Search
Wiki
wikipedia
WWW
Year in Review
YouTube
Archive
2021
Feb
Jan
2020
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2019
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2018
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2017
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2016
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2015
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2014
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2013
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2012
Dec
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2011
Dec
Nov
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2010
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2009
Dec
Nov
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2008
Dec
Nov
Oct
Sep
Jul
May
Apr
Mar
Feb
2007
Oct
Sep
Aug
Jul
Jun
Feb
2006
Dec
Nov
Sep
Aug
Jul
Jun
Apr
Mar
Feb
Feed
Follow @googleai
Give us feedback in our
Product Forums
.