Blog
The latest from Google Research
Slicing and dicing data for interactive visualization
Monday, February 28, 2011
Posted by Benjamin Yolken, Google Public Data Product Manager
A year ago, we introduced the Google Public Data Explorer, a tool that allows users to interactively explore public-interest datasets from a variety of influential sources like the World Bank, IMF, Eurostat, and the US Census Bureau. Today, users can visualize over 300 metrics across 31 datasets, including everything from labor productivity (OECD) to Internet speed (Ookla) to gender balance in parliaments (UNECE) to government debt levels (IMF) to population density by municipality (Statistics Catalonia), with more data being added every week.
Last week, as part of the launch of our dataset upload interface, we released one of the key pieces of technology behind the product: the Dataset Publishing Language (DSPL). We created this format to address a key problem in the Public Data Explorer and other, similar tools, namely, that existing data formats don’t provide enough information to support easy yet powerful data exploration by non-technical users.
DSPL addresses this by adding a layer of metadata on top of the raw, tabular data in a dataset. This metadata, expressed in XML, describes the concepts in the dataset, for instance “country”, “gender”, “population”, and “unemployment”, giving descriptions, URLs, formatting properties, etc. for each. These concepts are then referenced in slices, which partition them into dimensions (i.e., categories) and metrics (i.e., quantitative values) and link them with the underlying data tables (provided in CSV format). This structure, along with some additional metadata, is what allows us to provide rich, interactive dataset visualizations in the Public Data Explorer.
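The relationship between concepts, slices, and data tables can be sketched in a few lines of Python. This is an illustrative in-memory model, not the DSPL XML schema: the class names (`Concept`, `Slice`), the field names, the helper functions, and the sample CSV are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A named idea in the dataset, e.g. "country" or "population"."""
    id: str
    description: str


@dataclass(frozen=True)
class Slice:
    """Splits concepts into dimensions and metrics and links them to a table."""
    id: str
    dimensions: tuple  # concept ids used as categories
    metrics: tuple     # concept ids used as quantitative values
    table_csv: str     # raw CSV text standing in for a data file

    def rows(self):
        """Yield (dimension values, metric values) pairs parsed from the CSV."""
        for record in csv.DictReader(io.StringIO(self.table_csv)):
            key = tuple(record[d] for d in self.dimensions)
            values = {m: float(record[m]) for m in self.metrics}
            yield key, values


# Hypothetical sample data, invented purely for illustration.
CONCEPTS = {
    c.id: c
    for c in (
        Concept("country", "Country name"),
        Concept("year", "Calendar year"),
        Concept("population", "Number of inhabitants"),
    )
}

SAMPLE_CSV = "country,year,population\nA,2009,100\nA,2010,110\nB,2010,50\n"

population_slice = Slice(
    id="population_by_country_year",
    dimensions=("country", "year"),
    metrics=("population",),
    table_csv=SAMPLE_CSV,
)


def validate(slice_, concepts):
    """Check that every dimension and metric refers to a declared concept."""
    missing = [c for c in slice_.dimensions + slice_.metrics if c not in concepts]
    return not missing


def series(slice_, metric, fixed):
    """Return (dimensions, value) pairs for rows whose dimensions match `fixed`."""
    out = []
    for key, values in slice_.rows():
        dims = dict(zip(slice_.dimensions, key))
        if all(dims[k] == v for k, v in fixed.items()):
            out.append((dims, values[metric]))
    return out
```

Calling `series(population_slice, "population", {"country": "A"})` selects one metric while holding one dimension fixed, which is roughly the kind of query an interactive chart needs to answer. Keeping the dimension/metric split in metadata, rather than inferring it from column types, is what lets a tool make that choice without the user describing the table by hand.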
With the release of DSPL, we hope to accelerate the process of making the world’s datasets searchable, visualizable, and understandable, without requiring a PhD in statistics. We encourage you to read more about the format and try it yourself, both in the Public Data Explorer and in your own software. Stay tuned for more DSPL extensions and applications in the future!
Where does my data live?
Friday, February 25, 2011
Posted by Daniel Ford, Senior Mathematician
Have you ever wondered what happens when you upload a photo to Picasa, or where all your Gmail messages or YouTube videos are stored? How is it that you can read or watch them from anywhere at any time?
If you stored your data on a single hard disk, like the one in your personal computer, then the disk would eventually fail and your data would be lost forever. If you want to protect your data from the possibility of such a failure, you can store copies across many different disks so that if any one fails then you just access the data from another.
However, once storage systems get large enough, anything and everything can and does go wrong. You have to plan not just for disk failures but for server, network, and entire datacenter failures. Add to this software bugs and maintenance operations and you have a whole lot more failures.
Using measurements from dozens of Google data centers, we found that almost-simultaneous failure of many servers in a data center has the greatest impact on availability. On the other hand, disk failures have relatively little impact because our systems are specifically designed to cope with these failures.
Once you have a model of failures, you can also look at the impact of various design choices. Where exactly should you place your data replicas? How fast do you need to recover from losing a disk or server? What encoding scheme or number of replicas of the data is enough, given a desired level of availability? For example, we found that storing data across multiple data centers reduces data unavailability by many orders of magnitude compared to having the same number of replicas in a single data center. The added complexity and potential for slower recovery times are worth it to get better availability, or use less storage space, or even both at the same time.
As you can see, something as simple as storing your photos, mail, or videos becomes a lot more involved when you want to be sure it's always available.
In our paper, Availability in Globally Distributed Storage Systems, we characterize the availability of cloud storage systems, based on extensive monitoring of Google's main storage infrastructure, and the sources of failure which affect availability. We also present statistical models for reasoning about the impact of design choices such as data placement, recovery speed, and replication strategies, including replication across multiple data centers.
A Runtime Solution for Online Contention Detection and Response
Friday, February 25, 2011
Posted by Jason Mars, Software Engineering Intern
In our recent paper, Contention Aware Execution: Online Contention Detection and Response, we have made a big step forward in addressing an important and pressing problem in the field of Computer Science today. This work appears in the 2010 Proceedings of the International Symposium on Code Generation and Optimization (CGO) and was awarded the CGO 2010 Best Presentation Award at the conference.
One of the greatest challenges when using multicore processors arises when critical resources, such as the on-chip caches, are shared by multiple executing programs. If these programs simultaneously place heavy demands on shared resources, they may be forced to "take turns," and as a result, unpredictable and abrupt slowdowns may occur. This unexpected "cross-core interference" is especially problematic for the latency-sensitive applications found in Google's datacenters, such as web search. The commonly used solution is to dedicate separate machines to each application, but this leaves the processing capabilities of multicore processors underutilized. In our work, we present the Contention Aware Execution Runtime (CAER) environment, a lightweight runtime solution that minimizes cross-core interference while maximizing utilization. CAER leverages the ubiquitous performance monitoring capabilities present in current state-of-the-art multicore processors to infer and respond to cross-core interference, and requires no added hardware support. Our experiments show that when using our CAER system, we are able to increase the utilization of the multicore CPU by 58% on average. Meanwhile, CAER brings the performance penalty due to allowing co-location from 17% down to just 4% on average.
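The detect-and-respond idea can be illustrated with a small simulation. This is a hypothetical sketch, not the CAER heuristics from the paper: the miss-count samples, the `threshold` parameter, and the function name `plan_batch_schedule` are all invented for the example, and a real runtime would read hardware performance counters rather than Python lists.

```python
def plan_batch_schedule(sensitive_misses, batch_misses, threshold):
    """Decide, period by period, whether a co-located batch job may run.

    Each list holds one cache-miss count per sampling period. Contention is
    assumed when both programs miss heavily in the same period; the response
    is to pause the batch job for the following period.
    """
    schedule = []
    run_next = True  # the batch job starts out running
    for sensitive, batch in zip(sensitive_misses, batch_misses):
        schedule.append(run_next)
        if run_next:
            # Detection: both programs are pressing on the shared cache.
            contended = sensitive > threshold and batch > threshold
            run_next = not contended
        else:
            # The batch job sat out this period, so its sample is ignored;
            # let it resume and observe again.
            run_next = True
    return schedule


if __name__ == "__main__":
    # Made-up samples for a latency-sensitive job and a batch job.
    sensitive = [10, 90, 95, 20, 15]
    batch = [80, 85, 90, 88, 10]
    schedule = plan_batch_schedule(sensitive, batch, threshold=50)
    utilization = sum(schedule) / len(schedule)
    print(schedule, f"batch job ran {utilization:.0%} of periods")
```

The trade-off the sketch exposes is the same one described above: pausing the batch job more readily protects the latency-sensitive program but lowers utilization, while pausing it less readily does the opposite.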
Congratulations to Ken Thompson
Tuesday, February 22, 2011
Posted by Bill Coughran, Senior Vice President of Engineering
I’m happy to share that Ken Thompson has been chosen as the recipient of the prestigious Japan Prize. The Japan Prize is bestowed for achievements in science and technology that promote the peace and prosperity of mankind.
Ken was awarded the prize along with Dennis Ritchie for their development of the UNIX operating system in 1969 while at Bell Labs. UNIX changed the direction of computing as a whole and paved the way for the development of personal computers and the server systems that power the Internet.
It’s an enormous source of pride for us to have such amazing talent working here and Ken continues to serve as an inspiration to the rest of us. We’re excited to see what Ken will come up with next.
You can read the full press release here.
Query Language Modeling for Voice Search
Thursday, February 17, 2011
Posted by Ciprian Chelba, Research Scientist
About three years ago we set a goal to enable speaking to the Google Search engine on smartphones. On the language modeling side, the motivation was that we had access to large amounts of typed text data from our users. At the same time, that meant that the users also had a clear expectation for how they would interact with a speech-enabled version of the Google Search application.
The challenge lay in the scale of the problem and the perceived sparsity of the query data. Our paper, Query Language Modeling for Voice Search, describes the approach we took and the empirical findings along the way.
Besides data availability, the project succeeded due to our excellent computational platform, the culture built around teams that wholeheartedly tackle such challenges with the conviction that they will set a new bar, and a collaborative mindset that leverages resources across the company. In this case we used training data made available by colleagues working in query spelling correction, query stream sampling procedures devised for search quality evaluation, the open finite state tools, and distributed language modeling infrastructure built for machine translation.
Perhaps the most satisfying part of this research project was its impact on the end-user: when presenting the poster at SLT 2010 in Berkeley I offered to demo Google Voice Search, and often got the answer “Thanks, I already use it!”.