A special report on managing information | February 27th 2010
A vast amount of that information is
shared. By 2013 the amount of trafﬁc ﬂow-
ing over the internet annually will reach
667 exabytes, according to Cisco, a maker of
communications gear. And the quantity of
data continues to grow faster than the abil-
ity of the network to carry it all.
People have long groused that they were
swamped by information. Back in 1917 the
manager of a Connecticut manufacturing
ﬁrm complained about the effects of the
telephone: “Time is lost, confusion results
and money is spent.” Yet what is happening
now goes way beyond incremental growth.
The quantitative change has begun to make
a qualitative difference.
This shift from information scarcity to sur-
feit has broad effects. “What we are seeing is
the ability to have economies form around
the data—and that to me is the big change at
a societal and even macroeconomic level,”
says Craig Mundie, head of research and
strategy at Microsoft. Data are becoming the
new raw material of business: an economic
input almost on a par with capital and labour.
“Every day I wake up and ask, ‘how
can I flow data better, manage data better,
analyse data better?’” says Rollin Ford, the
CIO of Wal-Mart.
Sophisticated quantitative analysis is be-
ing applied to many aspects of life, not just
missile trajectories or ﬁnancial hedging
strategies, as in the past. For example, Fare-
cast, a part of Microsoft’s search engine Bing,
can advise customers whether to buy an air-
line ticket now or wait for the price to come
down by examining 225 billion ﬂight and
price records. The same idea is being extend-
ed to hotel rooms, cars and similar items.
Personal-ﬁnance websites and banks are
aggregating their customer data to show up
macroeconomic trends, which may develop
into ancillary businesses in their own right.
Number-crunchers have even uncovered
match-ﬁxing in Japanese sumo wrestling.
Dross into gold
“Data exhaust”—the trail of clicks that in-
ternet users leave behind from which value
can be extracted—is becoming a mainstay
of the internet economy. One example is
Google’s search engine, which is partly
guided by the number of clicks on an item
to help determine its relevance to a search
query. If the eighth listing for a search term
is the one most people go to, the algorithm
puts it higher up.
As the world is becoming increasingly
digital, aggregating and analysing data is
likely to bring huge beneﬁts in other ﬁelds
as well. For example, Mr Mundie of Micro-
soft and Eric Schmidt, the boss of Google,
sit on a presidential task force to reform
American health care. “Early on in this
process Eric and I both said: ‘Look, if you
really want to transform health care, you
basically build a sort of health-care econo-
my around the data that relate to people’,”
Mr Mundie explains. “You would not just
think of data as the ‘exhaust’ of providing
health services, but rather they become a
central asset in trying to ﬁgure out how you
would improve every aspect of health care.
It’s a bit of an inversion.”
To be sure, digital records should make
life easier for doctors, bring down costs for
providers and patients and improve the
quality of care. But in aggregate the data
can also be mined to spot unwanted drug
interactions, identify the most effective treat-
ments and predict the onset of disease be-
fore symptoms emerge. Computers already
attempt to do these things, but need to be
explicitly programmed for them. In a world
of big data the correlations surface almost
Sometimes those data reveal more than
was intended. For example, the city of Oak-
land, California, releases information on
where and when arrests were made, which
is put out on a private website, Oakland
Crimespotting. At one point a few clicks
revealed that police swept the whole of a
busy street for prostitution every evening ex-
cept on Wednesdays, a tactic they probably
meant to keep to themselves.
But big data can have far more serious
consequences than that. During the recent ﬁ-
nancial crisis it became clear that banks and
rating agencies had been relying on mod-
els which, although they required a vast
amount of information to be fed in, failed
to reﬂect ﬁnancial risk in the real world. This
was the ﬁrst crisis to be sparked by big data—
and there will be more.
The way that information is managed
touches all areas of life. At the turn of the
20th century new ﬂows of information
through channels such as the telegraph
and telephone supported mass production.
Today the availability of abundant data
enables companies to cater to small niche
markets anywhere in the world. Economic
production used to be based in the factory,
where managers pored over every machine
and process to make it more efﬁcient. Now
statisticians mine the information output of
the business for new ideas.
“The data-centred economy is just na-
scent,” admits Mr Mundie of Microsoft. “You
can see the outlines of it, but the technical,
infrastructural and even business-model
implications are not well understood right
now.” This special report will point to where
it is beginning to surface.
[Chart: Global information created and available storage]
All too much
Monstrous amounts of data
QUANTIFYING the amount of information
that exists in the world is hard. What is
clear is that there is an awful lot of it, and it is
growing at a terriﬁc rate (a compound an-
nual 60%) that is speeding up all the time. The
ﬂood of data from sensors, computers, research
labs, cameras, phones and the like surpassed
the capacity of storage technologies in 2007.
Experiments at the Large Hadron Collider at
CERN, Europe’s particle-physics laboratory near
Geneva, generate 40 terabytes every second—
orders of magnitude more than can be stored
or analysed. So scientists collect what they can
and let the rest dissipate into the ether.
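To put the 60% figure in perspective, a quantity growing at a compound annual rate r doubles every ln 2 / ln(1 + r) years. The short sketch below works through that arithmetic. It assumes a constant rate, which is a simplification given that the rate is said to be speeding up, and the function names are illustrative rather than taken from any study cited here.

```python
import math


def doubling_time(annual_rate: float) -> float:
    """Years needed for a quantity to double at a compound annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


def growth_factor(annual_rate: float, years: float) -> float:
    """Total multiplication of the quantity after the given number of years."""
    return (1 + annual_rate) ** years


# At 60% a year the stock of data doubles roughly every year and a half
print(round(doubling_time(0.60), 2))     # about 1.47
# and grows a little more than tenfold in five years.
print(round(growth_factor(0.60, 5), 1))  # about 10.5
```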
According to a 2008 study by Internation-
al Data Corp (IDC), a market-research ﬁrm,
around 1,200 exabytes of digital data will be
generated this year. Other studies measure
slightly different things. Hal Varian and the late
Peter Lyman of the University of California in
Berkeley, who pioneered the idea of counting
the world’s bits, came up with a far smaller
amount, around 5 exabytes in 2002, because
they counted only the stock of original content.
What about the information that is actually
consumed? Researchers at the University of Cali-
fornia in San Diego (UCSD) examined the ﬂow
of data to American households. They found
that in 2008 such households were bombarded
with 3.6 zettabytes of information (or 34 giga-
bytes per person per day). The biggest data hogs
were video games and television. In terms of
bytes, written words are insigniﬁcant, amount-
IN 1879 James Ritty, a saloon-keeper in Dayton, Ohio, received a
patent for a wooden contraption that he dubbed the “incorrupt-
ible cashier”. With a set of buttons and a loud bell, the device, sold
by National Cash Register (NCR), was little more than a simple add-
ing machine. Yet as an early form of managing information ﬂows
in American business the cash register had a huge impact. It not
only reduced pilferage by alerting the shopkeeper when the till was
opened; by recording every transaction, it also provided an instant
overview of what was happening in the business.
Sales data remain one of a company’s most important assets. In
2004 Wal-Mart peered into its mammoth databases and noticed that
before a hurricane struck, there was a run on ﬂashlights and batter-
ies, as might be expected; but also on Pop-Tarts, a sugary American
breakfast snack. On reﬂection it is clear that the snack would be a
handy thing to eat in a blackout, but the retailer would not have
thought to stock up on it before a storm. The company whose sys-
tem crunched Wal-Mart’s numbers was none other than NCR and
its data-warehousing unit, Teradata, now an independent ﬁrm.
A few years ago such technologies, called “business intelligence”,
were available only to the world’s biggest companies. But as the price
of computing and storage has fallen and the software systems have got
better and cheaper, the technology has moved into the mainstream.
Companies are collecting more data than ever before. In the past they
were kept in different systems that were unable to talk to each other,
such as ﬁnance, human resources or customer management. Now
the systems are being linked, and companies are using data-mining
techniques to get a complete picture of their operations—“a single ver-
sion of the truth”, as the industry likes to call it. That allows ﬁrms to
operate more efﬁciently, pick out trends and improve their forecasting.
Consider Cablecom, a Swiss telecoms operator. It has reduced cus-
tomer defections from one-ﬁfth of subscribers a year to under 5% by
crunching its numbers. Its software spotted that although custom-
er defections peaked in the 13th month, the decision to leave was
made much earlier, around the ninth month (as indicated by things
like the number of calls to customer support services). So Cablecom
offered certain customers special deals seven months into their sub-
scription and reaped the rewards.
Agony and torture
Such data-mining has a dubious reputation. “Torture the data long
enough and they will confess to anything,” statisticians quip. But
it has become far more effective as more companies have started
to use the technology. Best Buy, a retailer, found that 7% of its cus-
tomers accounted for 43% of its sales, so it reorganised its stores to
concentrate on those customers’ needs. Airline yield management
improved because analytical techniques uncovered the best predic-
tor that a passenger would actually catch a ﬂight he had booked:
that he had ordered a vegetarian meal.
The IT industry is piling into business intelligence, seeing it as a
natural successor of services such as accountancy and computing
in the ﬁrst and second half of the 20th century respectively. Accen-
ture, PricewaterhouseCoopers, IBM and SAP are investing heavily
in their consulting practices. Technology vendors such as Oracle, In-
formatica, TIBCO, SAS and EMC have beneﬁted. IBM believes busi-
ness intelligence will be a pillar of its growth as sensors are used
to manage things from a city’s trafﬁc ﬂow to a patient’s blood ﬂow.
It has invested $12 billion in the past four years and is opening six
analytics centres with 4,000 employees worldwide.
A different game
Information is transforming traditional business
ing to less than 0.1% of the total. However, the
amount of reading people do, previously in de-
cline because of television, has almost tripled
since 1980, thanks to all that text on the internet.
In the past information consumption was largely
passive, leaving aside the telephone. Today half
of all bytes are received interactively, according
to the UCSD. Future studies will extend beyond
American households to quantify consumption
globally and include business use as well.
March of the machines
Signiﬁcantly, “information created by ma-
chines and used by other machines will prob-
ably grow faster than anything else,” explains
Roger Bohn of the UCSD, one of the authors of
the study on American households. “This is pri-
marily ‘database to database’ information—peo-
ple are only tangentially involved in most of it.”
Only 5% of the information that is created is
“structured”, meaning it comes in a standard
format of words or numbers that can be read
by computers. The rest are things like photos
and phone calls which are less easily retriev-
able and usable. But this is changing as content
on the web is increasingly “tagged”, and facial-
recognition and voice-recognition software can
identify people and words in digital ﬁles.
“It is a very sad thing that nowadays there
is so little useless information,” quipped Oscar
Wilde in 1894. He did not know the half of it.
Unit | Size | What it means
Bit (b) | 1 or 0 | Short for “binary digit”, after the binary code (1 or 0) computers use to store and process data
Byte (B) | 8 bits | Enough information to create an English letter or number in computer code. It is the basic unit of computing
Kilobyte (KB) | 1,000, or 2^10, bytes | From “thousand” in Greek. One page of typed text is 2KB
Megabyte (MB) | 1,000KB; 2^20 bytes | From “large” in Greek. The complete works of Shakespeare total 5MB. A typical pop song is about 4MB
Gigabyte (GB) | 1,000MB; 2^30 bytes | From “giant” in Greek. A two-hour film can be compressed into 1-2GB
Terabyte (TB) | 1,000GB; 2^40 bytes | From “monster” in Greek. All the catalogued books in America’s Library of Congress total 15TB
Petabyte (PB) | 1,000TB; 2^50 bytes | All letters delivered by America’s postal service this year will amount to around 5PB. Google processes around 1PB every hour
Exabyte (EB) | 1,000PB; 2^60 bytes | Equivalent to 10 billion copies of The Economist
Zettabyte (ZB) | 1,000EB; 2^70 bytes | The total amount of information in existence this year is forecast to be around 1.2ZB
Yottabyte (YB) | 1,000ZB; 2^80 bytes | Currently too big to imagine
The prefixes are set by an intergovernmental group, the International Bureau of Weights and Measures.
Yotta and Zetta were added in 1991; terms for larger amounts have yet to be established.
Source: The Economist
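The size column pairs each decimal multiple (1,000 times the previous unit) with a nearby power of two. The two are close but not equal, and the gap widens with every step. The sketch below simply computes both; the helper names are illustrative.

```python
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def decimal_bytes(step: int) -> int:
    """Bytes in the unit when each step is a factor of 1,000."""
    return 1000 ** step


def binary_bytes(step: int) -> int:
    """Bytes in the unit when each step is a factor of 2^10 (1,024)."""
    return 2 ** (10 * step)


for step, name in enumerate(PREFIXES, start=1):
    # Relative amount by which the power of two exceeds the decimal figure
    gap = binary_bytes(step) / decimal_bytes(step) - 1
    print(f"{name}byte: 10^{3 * step} vs 2^{10 * step} bytes, gap {gap:.1%}")
```

The gap is 2.4% for a kilobyte and grows to about 21% for a yottabyte.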
Analytics—performing statistical operations for forecasting or un-
covering correlations such as between Pop-Tarts and hurricanes—can
have a big pay-off. In Britain the Royal Shakespeare Company (RSC)
sifted through seven years of sales data for a marketing campaign
that increased regular visitors by 70%. By examining more than 2m
transaction records, the RSC discovered a lot more about its best
customers: not just income, but things like occupation and family
status, which allowed it to target its marketing more precisely. That
was of crucial importance, says the RSC’s Mary Butlin, because it
substantially boosted membership as well as fund-raising revenue.
Yet making the most of data is not easy. The ﬁrst step is to improve
the accuracy of the information. Nestlé, for example, sells more than
100,000 products in 200 countries, using 550,000 suppliers, but it
was not using its huge buying power effectively because its data-
bases were a mess. On examination, it found that of its 9m records
of vendors, customers and materials around half were obsolete or
duplicated, and of the remainder about one-third were inaccurate
or incomplete. The name of a vendor might be abbreviated in one
record but spelled out in another, leading to double-counting.
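The double-counting problem can be illustrated with a minimal sketch: reduce each vendor name to a canonical form, then group records that collapse to the same form. This is not Nestlé's or SAP's method; the abbreviation map and the vendor names are invented for illustration, and real data cleaning needs far more than string normalisation.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation map; a real one would be built from the data.
ABBREVIATIONS = {
    "co": "company",
    "corp": "corporation",
    "intl": "international",
    "ltd": "limited",
}


def normalise(name: str) -> str:
    """Lower-case the name, strip punctuation and expand known abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def find_duplicates(names):
    """Group names that share a normalised form; return groups of two or more."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Invented vendor records: the first two are the same supplier.
vendors = ["Acme Vanilla Co.", "ACME Vanilla Company", "Bourbon Spice Ltd", "Island Extracts"]
print(find_duplicates(vendors))
```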
Over the past ten years Nestlé has been overhauling its IT system,
using SAP software, and improving the quality of its data. This en-
abled the ﬁrm to become more efﬁcient, says Chris Johnson, who
led the initiative. For just one ingredient, vanilla, its American opera-
tion was able to reduce the number of speciﬁcations and use fewer
suppliers, saving $30m a year. Overall, such operational improve-
ments save more than $1 billion annually.
Nestlé is not alone in having problems with its database. Most
CIOs admit that their data are of poor quality. In a study by IBM half
the managers quizzed did not trust the information on which they
had to base decisions. Many say that the technology meant to make
sense of it often just produces more data. Instead of ﬁnding a needle
in the haystack, they are making more hay.
Still, as analytical techniques become more widespread, business
decisions will increasingly be made, or at least corroborated, on the
basis of computer algorithms rather than individual hunches. This
creates a need for managers who are comfortable with data, but
statistics courses in business schools are not popular.
Many new business insights come from “dead data”: stored infor-
mation about past transactions that are examined to reveal hidden
correlations. But now companies are increasingly moving to analys-
ing real-time information ﬂows.
Wal-Mart is a good example. The retailer operates
8,400 stores worldwide, has more than 2m
employees and handles over 200m customer
transactions each week. Its revenue last year,
around $400 billion, is more than the GDP of
many entire countries. The sheer scale of the
data is a challenge, admits Rollin Ford, the CIO
at Wal-Mart’s headquarters in Bentonville, Ar-
kansas. “We keep a healthy paranoia.”
Not a sparrow falls
Wal-Mart’s inventory-management system,
called Retail Link, enables suppliers to see the ex-
act number of their products on every shelf of every
store at that precise moment. The system shows the
rate of sales by the hour, by the day, over the past
year and more. Begun in the 1990s, Retail Link gives
suppliers a complete overview of when and how
their products are selling, and with what other prod-
ucts in the shopping cart. This lets suppliers manage
their stocks better.
The technology enabled Wal-Mart to change the business model
of retailing. In some cases it leaves stock management in the hands
of its suppliers and does not take ownership of the products until
the moment they are sold. This allows it to shed inventory risk and
reduce its costs. In essence, the shelves in its shops are a highly ef-
ﬁciently managed depot.
Another company that capitalises on real-time information ﬂows
is Li & Fung, one of the world’s biggest supply-chain operators.
Founded in Guangzhou in southern China a century ago, it does
not own any factories or equipment but orchestrates a network of
12,000 suppliers in 40 countries, sourcing goods for brands ranging
from Kate Spade to Walt Disney. Its turnover in 2008 was $14 billion.
Li & Fung used to deal with its clients mostly by phone and
fax, with e-mail counting as high technology. But thanks to a new
web-services platform, its processes have speeded up. Orders ﬂow
through a web portal and bids can be solicited from pre-qualiﬁed
suppliers. Agents now audit factories in real time with hand-held
computers. Clients are able to monitor the details of every stage of
an order, from the initial production run to shipping.
One of the most important technologies has turned out to be
videoconferencing. It allows buyers and manufacturers to examine
the colour of a material or the stitching on a garment. “Before, we
weren’t able to send a 500MB image—we’d post a DVD. Now we
can stream it to show vendors in our ofﬁces. With real-time images
we can make changes quicker,” says Manuel Fernandez, Li & Fung’s
chief technology ofﬁcer. Data ﬂowing through its network soared
from 100 gigabytes a day only 18 months ago to 1 terabyte.
The information system also allows Li & Fung to look across its
operations to identify trends. In southern China, for instance, a short-
age of workers and new legislation raised labour costs, so production
moved north. “We saw that before it actually happened,” says Mr
Fernandez. The company also got advance warning of the economic
crisis, and later the recovery, from retailers’ orders before these trends
became apparent. Investment analysts use country information pro-
vided by Li & Fung to gain insights into macroeconomic patterns.
Now that they are able to process information ﬂows in real time,
organisations are collecting more data than ever. One use for such
information is to forecast when machines will break down. This
hardly ever happens out of the blue: there are usually warning signs
such as noise, vibration or heat. Capturing such data enables ﬁrms
to act before a breakdown.
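One simple way to turn such warning signs into an alert is to compare each new sensor reading with the recent past and flag large deviations. The sketch below does this with a trailing window and a standard-deviation threshold. The window size, the threshold and the vibration readings are assumptions chosen for illustration, not figures from any firm mentioned here.

```python
from statistics import mean, stdev


def warning_signs(readings, window=5, threshold=3.0):
    """Return indices of readings that deviate from the trailing window's
    mean by more than `threshold` standard deviations."""
    alerts = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat histories, where any deviation would be infinite
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            alerts.append(i)
    return alerts


# Invented vibration readings with one obvious spike at index 8.
vibration = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 0.9, 1.0, 2.5, 1.0]
print(warning_signs(vibration))  # [8]
```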
Similarly, the use of “predictive analytics” on the basis of large
data sets may transform health care. Dr Carolyn McGregor of the
University of Ontario, working with IBM, conducts
research to spot potentially fatal infections in
premature babies. The system monitors subtle
changes in seven streams of real-time data, such
as respiration, heart rate and blood pressure.
The electrocardiogram alone generates 1,000
readings per second.
This kind of information is turned out by all
medical equipment, but it used to be recorded
on paper and examined perhaps once an hour.
By feeding the data into a computer, Dr McGregor
has been able to detect the onset of an infection
before obvious symptoms emerge. “You can’t see it
with the naked eye, but a computer can,” she says.
Two technology trends are helping to fuel these
new uses of data: cloud computing and open-source
software. Cloud computing—in which the internet is
used as a platform to collect, store and process data—
allows businesses to lease computing power as and
when they need it, rather than having to buy expensive equipment.
Amazon, Google and Microsoft are the most prominent ﬁrms to
make their massive computing infrastructure available to clients.
As more corporate functions, such as human resources or sales, are
managed over a network, companies can see patterns across the
whole of the business and share their information more easily.
A free programming language called R lets companies examine
and present big data sets, and free software called Hadoop now al-
lows ordinary PCs to analyse huge quantities of data that previously
required a supercomputer. It does this by parcelling out the tasks
to numerous computers at once. This saves time and money. For
example, the New York Times a few years ago used cloud comput-
ing and Hadoop to convert over 400,000 scanned images from its
archives, from 1851 to 1922. By harnessing the power of hundreds of
computers, it was able to do the job in 36 hours.
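The idea of parcelling out tasks can be sketched in a few lines: split the input into chunks, process each chunk independently (the "map" step), then merge the partial results (the "reduce" step). The example below is not Hadoop's API. It uses threads on one machine as a stand-in for many computers, and the function names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, workers=4):
    """Split the input across workers, then merge their partial counts."""
    chunks = [lines[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Reduce step: add up the partial counts as they come back
        for partial in pool.map(map_chunk, chunks):
            total += partial
    return total


docs = ["big data big ideas", "data exhaust", "big savings"]
print(word_count(docs).most_common(2))  # [('big', 3), ('data', 2)]
```

Because each chunk is handled independently, the result does not depend on how many workers share the job; only the elapsed time does.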
Visa, a credit-card company, in a recent trial with Hadoop crunched
two years of test records, or 73 billion transactions, amounting to 36
terabytes of data. The processing time fell from one month with
traditional methods to a mere 13 minutes. It is a striking successor of
Ritty’s incorruptible cashier for a data-driven age.
Clicking for gold
How internet companies proﬁt from data on the web
PSST! Amazon.com does not want you to know what it knows about
you. It not only tracks the books you purchase, but also keeps a
record of the ones you browse but do not buy to help it recommend
other books to you. Information from its e-book, the Kindle, is prob-
ably even richer: how long a user spends reading each page, whether
he takes notes and so on. But Amazon refuses to disclose what data it
collects or how it uses them.
It is not alone. Across the internet economy, companies are compiling
masses of data on people, their activities, their likes and dislikes, their
relationships with others and even where they are at any particular
moment—and keeping mum. For example, Facebook, a social-network-
ing site, tracks the activities of its 400m users, half of whom spend an
average of almost an hour on the site every day, but does not talk about
what it ﬁnds. Google reveals a little but holds back a lot. Even eBay, the
online auctioneer, keeps quiet.
“They are uncomfortable bringing so much attention to this because it
is at the heart of their competitive advantage,” says Tim O’Reilly, a tech-
nology insider and publisher. “Data are the coin of the realm. They have
a big lead over other companies that do not ‘get’ this.” As the communi-
cations director of one of the web’s biggest sites admits, “we’re not in a
position to have an in-depth conversation. It has less to do with sensitive
considerations like privacy. Instead, we’re just not ready to tip our hand.”
In other words, the ﬁrm does not want to reveal valuable trade secrets.
The reticence partly reﬂects fears about consumer unease and unwel-
come attention from regulators. But this is short-sighted, for two rea-
sons. First, politicians and the public are already anxious. The chairman
of America’s Federal Trade Commission, Jon Leibowitz, has publicly
grumbled that the industry has not been sufﬁciently forthcoming. Sec-
ond, if users knew how the data were used, they would probably be
more impressed than alarmed.
Where traditional businesses generally collect information about
customers from their purchases or from surveys, internet companies
have the luxury of being able to gather data from everything that hap-
pens on their sites. The biggest websites have long recognised that in-
formation itself is their biggest treasure. And it can immediately be put
to use in a way that traditional ﬁrms cannot match.
Some of the techniques have become widespread. Before deploying
a new feature, big sites run controlled experiments to see what works
best. Amazon and Netﬂix, a site that offers ﬁlms for hire, use a statisti-
cal technique called collaborative ﬁltering to make recommendations
to users based on what other users like. The technique they came up
with has produced millions of dollars of additional sales. Nearly
two-thirds of the film selections by Netflix’s customers come from the
referrals made by computer.
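In its simplest form, collaborative filtering recommends items chosen by people whose tastes overlap with yours. The sketch below scores each unseen item by how much its fans have in common with the target user. It is a toy illustration, not Amazon's or Netflix's actual method, and the users and film lists are invented.

```python
from collections import Counter


def recommend(target, baskets, top_n=3):
    """Suggest items liked by users who share items with `target`,
    weighting each suggestion by the size of the overlap."""
    seen = baskets[target]
    scores = Counter()
    for user, items in baskets.items():
        if user == target:
            continue
        overlap = len(seen & items)
        if overlap == 0:
            continue  # nothing in common, so this user's taste tells us nothing
        for item in items - seen:
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


# Invented viewing histories.
ratings = {
    "ann": {"Alien", "Blade Runner", "Solaris"},
    "bob": {"Alien", "Blade Runner", "Brazil"},
    "cat": {"Alien", "Brazil", "Metropolis"},
    "dan": {"Amelie"},
}
print(recommend("ann", ratings))  # ['Brazil', 'Metropolis']
```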
EBay, which at ﬁrst sight looks like nothing more than a neutral plat-
form for commercial exchanges, makes myriad adjustments based
on information culled from listing activity, bidding behaviour, pricing
trends, search terms and the length of time users look at a page. Every
product category is treated as a micro-economy that is actively man-
aged. Lots of searches but few sales for an expensive item may signal
unmet demand, so eBay will ﬁnd a partner to offer sellers insurance to
The company that gets the most out of its data is Google. Creating
new economic value from unthinkably large amounts of information
is its lifeblood. That helps explain why, on inspection, the market capi-
talisation of the 11-year-old ﬁrm, of around $170 billion, is not so out-
landish. Google exploits information that is a by-product of user inter-
actions, or data exhaust, which is automatically recycled to improve the
service or create an entirely new product.
Vote with your mouse
Until 1998, when Larry Page, one of Google’s founders, devised the
PageRank algorithm for search, search engines counted the number of
times that a word appeared on a web page to determine its relevance—
a system wide open to manipulation. Google’s innovation was to
count the number of inbound links from other web pages. Such links
act as “votes” on what internet users at large believe to be good content.
More links suggest a webpage is more useful, just as more citations of a
book suggest it is better.
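The "votes" idea can be sketched as a textbook power iteration, in which each page repeatedly passes a share of its score to the pages it links to, so that a link from a well-regarded page counts for more than one from an obscure page. This is a simplified illustration, not Google's production algorithm. The damping factor of 0.85 is the conventional textbook value, and the four-page web is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by repeatedly sharing each page's rank among its outbound links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Invented web: pages a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # c
```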
But although Google’s system was an improvement, it too was open
to abuse from “link spam”, created only to dupe the system. The ﬁrm’s
engineers realised that the solution was staring them in the face: the
search results on which users actually clicked and stayed. A Google
search might yield 2m pages of results in a quarter of a second, but
users often want just one page, and by choosing it they “tell” Google
what they are looking for. So the algorithm was rejigged to feed that
information back into the service automatically.
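A minimal sketch of such a feedback loop: keep a tally of which results users click, and let the tally reorder the list the next time it is served. The class and the result names are hypothetical; a real search engine combines clicks with many other signals.

```python
from collections import Counter


class ClickRanker:
    """Reorders a fixed list of results according to recorded clicks."""

    def __init__(self, results):
        self.results = list(results)
        self.clicks = Counter()

    def record_click(self, url):
        """Feed one user's choice back into the ranking."""
        self.clicks[url] += 1

    def ranked(self):
        """Most-clicked first; the sort is stable, so ties keep their original order."""
        return sorted(self.results, key=lambda url: -self.clicks[url])


ranker = ClickRanker(f"result-{i}" for i in range(1, 9))
for _ in range(3):
    ranker.record_click("result-8")  # most users go to the eighth listing
ranker.record_click("result-2")
print(ranker.ranked()[:3])  # ['result-8', 'result-2', 'result-1']
```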
From then on Google realised it was in the data-mining business. To
put the model in simple economic terms, its search results give away,
say, $1 in value, and in return (thanks to the user’s clicks) it gets 1 cent
back. When the next user visits, he gets $1.01 of value, and so on. As
one employee puts it: “We like learning from large, ‘noisy’ data sets.”
Making improvements on the back of a big data set is not a Google
monopoly, nor is the technique new. One of the most striking exam-
ples dates from the mid-1800s, when Matthew Fontaine Maury of the
American navy had the idea of aggregating nautical logs from ships
crossing the Paciﬁc to ﬁnd the routes that offered the best winds and
currents. He created an early variant of a “viral” social network, reward-
ing captains who submitted their logbooks with a copy of his maps.
But the process was slow and laborious.
Google applies this principle of recursively learning from the data
to many of its services, including the humble spell-check, for which
it used a pioneering method that produced perhaps the world’s best
spell-checker in almost every language. Microsoft says it spent sever-
al million dollars over 20 years to develop a robust spell-checker for
its word-processing program. But Google got its raw material free: its
program is based on all the misspellings that users type into a search
window and then “correct” by clicking on the right result. With almost
3 billion queries a day, those results soon mount up. Other search en-
gines in the 1990s had the chance to do the
same, but did not pursue it. Around 2000
Yahoo! saw the potential, but nothing came
of the idea. It was Google that recognised the
gold dust in the detritus of its interactions with
its users and took the trouble to collect it up.
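The mechanism can be sketched as a log of (typed query, clicked result) pairs: whenever enough users who typed one string end up choosing another, the second becomes the suggested correction. The class name, the vote threshold and the sample words are assumptions for illustration, not a description of Google's system.

```python
from collections import Counter, defaultdict


class CorrectionLog:
    """Learns spelling corrections from what users type and then click."""

    def __init__(self):
        self.corrections = defaultdict(Counter)

    def record(self, typed, clicked):
        """Store one interaction; identical strings carry no correction."""
        if typed != clicked:
            self.corrections[typed][clicked] += 1

    def suggest(self, typed, min_votes=2):
        """Return the most-chosen correction, if enough users agreed on it."""
        candidates = self.corrections.get(typed)
        if not candidates:
            return None
        word, votes = candidates.most_common(1)[0]
        return word if votes >= min_votes else None


log = CorrectionLog()
for _ in range(3):
    log.record("recieve", "receive")
log.record("recieve", "relieve")
print(log.suggest("recieve"))  # receive
```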
Two newer Google services take the same
approach: translation and voice recognition.
Both have been big stumbling blocks for
computer scientists working on artiﬁcial in-
telligence. For over four decades the bofﬁns
tried to program computers to “understand”
the structure and phonetics of language. This
meant deﬁning rules such as where nouns
and verbs go in a sentence, which are the cor-
rect tenses and so on. All the exceptions to
the rules needed to be programmed in too.
Google, by contrast, saw it as a big maths prob-
lem that could be solved with a lot of data and
processing power—and came up with something very useful.
For translation, the company was able to draw on its other services.
Its search system had copies of European Commission documents,
which are translated into around 20 languages. Its book-scanning proj-
ect has thousands of titles that have been translated into many lan-
guages. All these translations are very good, done by experts to exacting
standards. So instead of trying to teach its computers the rules of a lan-
guage, Google turned them loose on the texts to make statistical infer-
ences. Google Translate now covers more than 50 languages, according
to Franz Och, one of the company’s engineers. The system identiﬁes
which word or phrase in one language is the most likely equivalent in
a second language. If direct translations are not available (say, Hindi to
Catalan), then English is used as a bridge.
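The bridging step can be sketched with two small probability tables: pick the likeliest English equivalent of the source word, then the likeliest target-language equivalent of that English word. The tables, the probabilities and the function names below are invented for illustration; a real statistical system works on phrases and weighs many candidates at once rather than taking the single best at each hop.

```python
def best(table, word):
    """Return the likeliest (translation, probability) pair, or None if unknown."""
    options = table.get(word)
    if not options:
        return None
    return max(options.items(), key=lambda pair: pair[1])


def translate_via_bridge(word, source_to_bridge, bridge_to_target):
    """Translate in two hops through a bridge language, multiplying probabilities."""
    first = best(source_to_bridge, word)
    if first is None:
        return None
    bridge_word, p_first = first
    second = best(bridge_to_target, bridge_word)
    if second is None:
        return None
    target_word, p_second = second
    return target_word, p_first * p_second


# Invented probabilities, as if estimated from aligned texts.
hindi_to_english = {"paani": {"water": 0.90, "rain": 0.10}}
english_to_catalan = {"water": {"aigua": 0.95, "pluja": 0.05}}
print(translate_via_bridge("paani", hindi_to_english, english_to_catalan))
```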
Google was not the ﬁrst to try this method. In the early 1990s IBM
tried to build a French-English program using translations from Can-
ada’s Parliament. But the system did not work well and the project
was abandoned. IBM had only a few million documents at its dis-
posal, says Mr Och dismissively. Google has billions. The system was
ﬁrst developed by processing almost 2 trillion words. But although it
learns from a big body of data, it lacks the recursive qualities of spell-
check and search.
The design of the feedback loop is critical. Google asks users for their
opinions, but not much else. A translation start-up in Germany called
Linguee is trying something different: it presents users with snippets of
possible translations and asks them to click on the best. That provides
feedback on which version is the most accurate.
Voice recognition highlights the importance of making use of data
exhaust. To use Google’s telephone directory or audio car navigation
service, customers dial the relevant number and say what they are
looking for. The system repeats the information; when the customer
conﬁrms it, or repeats the query, the system develops a record of the
different ways the target word can be spoken. It does not learn to un-
derstand voice; it computes probabilities.
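A rough sketch of that bookkeeping, in Python, might look like the following. It is an assumption-laden illustration rather than a description of Google's software: the class name, the method names and the idea of keying on a text form of what was heard are all hypothetical. It simply counts which target word callers confirmed for each heard form and turns the counts into probabilities.

```python
from collections import Counter
from typing import Dict, Optional


class ConfirmationLog:
    """Counts which target word callers confirmed for each heard form."""

    def __init__(self) -> None:
        self._counts: Dict[str, Counter] = {}

    def record(self, heard: str, target: str, confirmed: bool) -> None:
        # Only confirmed guesses count as evidence. A repeated query is
        # treated here as a rejection and is simply ignored.
        if confirmed:
            self._counts.setdefault(heard, Counter())[target] += 1

    def probability(self, heard: str, target: str) -> float:
        """Share of confirmations for `heard` that pointed to `target`."""
        seen = self._counts.get(heard)
        if not seen:
            return 0.0
        return seen[target] / sum(seen.values())

    def most_likely(self, heard: str) -> Optional[str]:
        """The target confirmed most often for `heard`, or None if unseen."""
        seen = self._counts.get(heard)
        if not seen:
            return None
        return seen.most_common(1)[0][0]
```

Nothing in the sketch models speech itself; every answer comes from tallies of what earlier callers confirmed.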
To launch the service Google needed an existing voice-recognition
system, so it licensed software from Nuance, a leader in the ﬁeld. But
Google itself keeps the data from voice queries, and its voice-recogni-
tion system may end up performing better than Nuance’s—which is
now trying to get access to lots more data by partnering with every-
one in sight.
Re-using data represents a new model for how computing is done,
says Edward Felten of Princeton University. “Looking at large data sets
and making inferences about what goes together is advancing more
rapidly than expected. ‘Understanding’ turns out to be overrated, and
statistical analysis goes a lot of the way.” Many internet companies now
see things the same way. Facebook regularly examines its huge databas-
es to boost usage. It found that the best single
predictor of whether members would contrib-
ute to the site was seeing that their friends had
been active on it, so it took to sending mem-
bers information about what their friends had
been up to online. Zynga, an online games
company, tracks its 100m unique players each
month to improve its games.
“If there are user-generated data to be had,
then we can build much better systems than
just trying to improve the algorithms,” says
Andreas Weigend, a former chief scientist at
Amazon who is now at Stanford University.
Marc Andreessen, a venture capitalist who
sits on numerous boards and was one of the
founders of Netscape, the web’s ﬁrst commer-
cial browser, thinks that “these new compa-
nies have built a culture, and the processes and
the technology to deal with large amounts of
data, that traditional companies simply don’t have.”
Recycling data exhaust is a common theme in the myriad projects
going on in Google’s empire and helps explain why almost all of them
are labelled as a “beta” or early test version: they truly are in continu-
ous development. A service that lets Google users store medical records
might also allow the company to spot valuable patterns about diseases
and treatments. A service where users can monitor their use of electric-
ity, device by device, provides rich information on energy consump-
tion. It could become the world’s best database of household appli-
ances and consumer electronics—and even foresee breakdowns. The
aggregated search queries, which the company makes available free,
are used as remarkably accurate predictors for everything from retail
sales to ﬂu outbreaks.
Together, all this is in line with the company’s audacious mission to
“organise the world’s information”. Yet the words are carefully chosen:
Google does not need to own the data. Usually all it wants is to have
access to them (and see that its rivals do not). In an initiative called
“Data Liberation Front” that quietly began last September, Google is
planning to rejig all its services so that users can discontinue them very
easily and take their data with them. In an industry built on locking
in the customer, the company says it wants to reduce the “barriers to
exit”. That should help save its engineers from complacency, the curse
of many a tech champion. The project might stall if it started to hurt the
business. But perhaps Google reckons that users will be more inclined
to share their information with it if they know that they can easily take
The open society
Governments are letting in the light
FROM antiquity to modern times, the nation has always been a
product of information management. The ability to impose taxes,
promulgate laws, count citizens and raise an army lies at the heart of
statehood. Yet something new is afoot. These days democratic open-
ness means more than that citizens can vote at regular intervals in free
and fair elections. They also expect to have access to government data.
The state has long been the biggest generator, collector and user of
data. It keeps records on every birth, marriage and death, compiles ﬁg-
ures on all aspects of the economy and keeps statistics on licences, laws
and the weather. Yet until recently all these data have been locked tight.
Even when publicly accessible they were hard to ﬁnd, and aggregating
lots of printed information is notoriously difﬁcult.
But now citizens and non-governmental organisations the world
over are pressing to get access to public data at the national, state and
municipal level—and sometimes government ofﬁcials enthusiastically
support them. “Government information is a form of infrastructure,
no less important to our modern life than our roads, electrical grid or
water systems,” says Carl Malamud, the boss of a group called
Public.Resource.Org that puts government data online. He was responsible for
making the databases of America’s Securities and Exchange Commis-
sion available on the web in the early 1990s.
America is in the lead on data access. On his ﬁrst full day in ofﬁce
Barack Obama issued a presidential memorandum ordering the heads
of federal agencies to make available as much information as possible,
urging them to act “with a clear presumption: in the face of doubt,
openness prevails”. This was all the more remarkable since the Bush
administration had explicitly instructed agencies to do the opposite.
Mr Obama’s directive caused a ﬂurry of activity. It is now possible
to obtain ﬁgures on job-related deaths that name employers, and to get
annual data on migration free. Some information that was previously
available but hard to get at, such as the Federal Register, a record of gov-
ernment notices, now comes in a computer-readable format. It is all on
a public website, data.gov. And more information is being released all
the time. Within 48 hours of data on ﬂight delays being made public, a
website had sprung up to disseminate them.
Providing access to data “creates a culture of accountability”, says Vi-
vek Kundra, the federal government’s CIO. One of the ﬁrst things he
did after taking ofﬁce was to create an online “dashboard” detailing the
government’s own $70 billion technology spending. Now that the in-
formation is freely available, Congress and the public can ask questions
or offer suggestions. The model will be applied to other areas, perhaps
including health-care data, says Mr Kundra—provided that looming pri-
vacy issues can be resolved.
All this has made a big difference. “There is a cultural change in what
people expect from government, fuelled by the experience of shopping
on the internet and having real-time access to ﬁnancial information,”
says John Wonderlich of the Sunlight Foundation, which promotes
open government. The economic crisis has speeded up that change,
particularly in state and city governments.
“The city is facing its eighth budget shortfall. We’re looking at a 50%
reduction in operating funds,” says Chris Vein, San Francisco’s CIO.
“We must ﬁgure out how we change our operations.” He insists that
providing more information can make government more efﬁcient. Cal-
ifornia’s generous “sunshine laws” provide the necessary legal back-
ing. Among the ﬁrst users of the newly available data was a site called
“San Francisco Crimespotting” by Stamen Design that layers historical
crime ﬁgures on top of map information. It allows users to play around
with the data and spot hidden trends. People now often come to public
meetings armed with crime maps to demand police patrols in their
Anyone can play
Other cities, including New York, Chicago and Washington, DC, are
racing ahead as well. Now that citizens’ groups and companies have
the raw data, they can use them to improve city services in ways that
cash-strapped local governments cannot. For instance, cleanscores.com
puts restaurants’ health-inspection scores online; other sites list chil-
dren’s activities or help people ﬁnd parking spaces. In the past gov-
ernment would have been pressed to provide these services; now it
simply supplies the data. Mr Vein concedes, however, that “we don’t
know what is useful or not. This is a grand experiment.”
Other parts of the world are also beginning to move to greater open-
ness. A European Commission directive in 2005 called for making
public-sector information more accessible (but it has no bite). Europe’s
digital activists use the web to track politicians and to try to improve
public services. In Britain FixMyStreet.com gives citizens the opportu-
nity to ﬂag up local problems. That allows local authorities to ﬁnd out
about people’s concerns; and once the problem has been publicly aired
it becomes more difﬁcult to ignore.
One obstacle is that most countries lack America’s open-government
ethos, nurtured over decades by laws on ethics in government, trans-
parency rules and the Freedom of Information Act, which acquired
teeth after the Nixon years.
An obstacle of a different sort is Crown copyright, which means
that most government data in Britain and the Commonwealth coun-
tries are the state’s property, constraining their use. In Britain post-
codes and Ordnance Survey map data at present cannot be freely
used for commercial purposes—a source of loud complaints from
businesses and activists. But from later this year access to some parts
of both data sets will be free, thanks to an initiative to bring more
government services online.
But even in America access to some government information is re-
stricted by ﬁnancial barriers. Remarkably, this applies to court docu-
ments, which in a democracy should surely be free. Legal records are
public and available online from the Administrative Ofﬁce of the US
Courts (AOUSC), but at a costly eight cents per page. Even the federal
government has to pay: between 2000 and 2008 it spent $30m to get
access to its own records. Yet the AOUSC is currently paying $156m over
ten years to two companies, WestLaw and LexisNexis, to publish the
material online (albeit organised and searchable with the firms’ technologies).
Those companies, for their part, earn an estimated $2 billion
annually from selling American court rulings and extra content such
as case reference guides. “The law is locked up behind a cash register,”
says Mr Malamud.
The two ﬁrms say they welcome competition, pointing to their strong
search technology and the additional services they provide, such as
case summaries and useful precedents. It seems unlikely that they will
keep their grip for long. One administration ofﬁcial privately calls free-
ing the information a “no-brainer”. Even Google has begun to provide
some legal documents online.
The point of open information is not merely to expose the world
but to change it. In recent years moves towards more transparency in
government have become one of the most vibrant and promising areas
of public policy. Sometimes information disclosure can achieve policy
aims more effectively and at far lower cost than traditional regulation.
In an important shift, new transparency requirements are now be-
ing used by government—and by the public—to hold the private sector
to account. For example, it had proved extremely difﬁcult to persuade
American businesses to cut down on the use of harmful chemicals and
their release into the environment. An add-on to a 1986 law required
ﬁrms simply to disclose what they release, including “by computer
telecommunications”. Even to supporters it seemed like a fudge, but it
turned out to be a resounding success. By 2000 American businesses
had reduced their emissions of the chemicals covered under the law by
40%, and over time the rules were actually tightened. Public scrutiny
achieved what legislation could not.
There have been many other such successes in areas as diverse
as restaurant sanitation, car safety, nutrition, home loans for minori-
ties and educational performance, note Archon Fung, Mary Graham
and David Weil of the Transparency Policy Project at Harvard’s Ken-
nedy School of Government in their book “Full Disclosure”. But
transparency alone is not enough. There has to be a community
to champion the information. Providers need an incentive to sup-
ply the data as well as penalties for withholding them. And web
developers have to ﬁnd ways of ensuring that the public data being
released are used effectively.
Mr Fung thinks that as governments release more and more infor-
mation about the things they do, the data will be used to show the
public sector’s shortcomings rather than to highlight its achievements.
Another concern is that the accuracy and quality of the data will be
found wanting (which is a problem for business as well as for the pub-
lic sector). There is also a debate over whether governments should
merely supply the raw data or get involved in processing and display-
ing them too. The concern is that they might manipulate them—but
then so might anyone else.
Public access to government ﬁgures is certain to release economic
value and encourage entrepreneurship. That has already happened
with weather data and with America’s GPS satellite-navigation system
that was opened for full commercial use a decade ago. And many ﬁrms
make a good living out of searching for or repackaging patent ﬁlings.
Moreover, providing information opens up new forms of collabora-
tion between the public and the private sectors. Beth Noveck, one of
the Obama administration’s recruits, who is a law professor and author
of a book entitled “Wiki Government”, has spearheaded an initiative
called peer-to-patent that has opened up some of America’s patent ﬁl-
ings for public inspection.
John Stuart Mill in 1861 called for “the widest participation in the
details of judicial and administrative business…above all by the utmost
possible publicity.” These days, that includes the greatest possible dis-
closure of data by electronic means.
New ways of visualising data
IN 1998 Martin Wattenberg, then a graphic designer at the magazine
SmartMoney in New York, had a problem. He wanted to depict the
daily movements in the stockmarket, but the customary way, as a line
showing the performance of an index over time, provided only a very
broad overall picture. Every day hundreds of individual companies
may rise or fall by a little or a lot. The same is true for whole sectors. Be-
ing able to see all this information at once could be useful to investors.
But how to make it visually accessible?
Mr Wattenberg’s brilliant idea was to adapt an existing technique to
create a “Map of the Market” in the form of a grid. It used the day’s clos-
ing share price to show more than 500 companies arranged by sector.
Shades of green or red indicated whether a share had risen or fallen
and by how much, showing the activity in every sector of the market.
It was an instant hit—and brought the nascent ﬁeld of data visualisation
to a mainstream audience.
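The colouring rule behind such a grid is simple enough to sketch in Python. The version below is a guess at the general approach, not the SmartMoney code: the function names, the 5% cut-off for the brightest shade and the sample tickers in the usage note are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]


def shade(change_pct: float, full_at: float = 5.0) -> RGB:
    """Green for a rise, red for a fall; brighter means a bigger move.

    `full_at` (an assumed cut-off) is the move, in percent, that gets the
    brightest colour. Larger moves are clamped to it.
    """
    strength = min(abs(change_pct) / full_at, 1.0)
    level = int(255 * strength)
    if change_pct > 0:
        return (0, level, 0)
    if change_pct < 0:
        return (level, 0, 0)
    return (0, 0, 0)


def by_sector(quotes: List[Tuple[str, str, float]]) -> Dict[str, List[Tuple[str, RGB]]]:
    """Group (ticker, sector, change_pct) rows into sector -> coloured cells."""
    grid: Dict[str, List[Tuple[str, RGB]]] = defaultdict(list)
    for ticker, sector, change in quotes:
        grid[sector].append((ticker, shade(change)))
    return dict(grid)
```

Given rows such as ("AAA", "tech", 5.0) and ("BBB", "tech", -10.0), by_sector returns one list of coloured cells per sector, ready to be drawn as blocks of a grid.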
In recent years there have been big advances in displaying massive
amounts of data to make them easily accessible. This is emerging as a
vibrant and creative ﬁeld melding the skills of computer science, statis-
tics, artistic design and storytelling.
“Every ﬁeld has some central tension it is trying to resolve. Visualisa-
tion deals with the inhuman scale of the information and the need to
present it at the very human scale of what the eye can see,” says Mr
Wattenberg, who has since moved to IBM and now spearheads a new
generation of data-visualisation specialists.
Market information may be hard to display, but at least the data are
numerical. Words are even more difﬁcult. One way of depicting them
is to count them and present them in clusters, with more common
ones shown in a proportionately larger font. Called a “word cloud”,
this method is popular across the web. It gives a rough indication of
what a body of text is about.
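The counting step is easy to sketch. The Python below is a minimal illustration under stated assumptions: the stop-word list, the point sizes and the function name are invented for the example, and real word-cloud tools also handle layout, which is left out here.

```python
import re
from collections import Counter
from typing import Dict, Set

# A tiny, assumed stop-word list; real tools use much longer ones.
STOPWORDS: Set[str] = {"the", "and", "of", "to", "a", "in", "is", "that", "we", "our"}


def font_sizes(text: str, top: int = 3, min_pt: int = 12, max_pt: int = 48) -> Dict[str, int]:
    """Map the `top` most common words to font sizes proportional to their counts."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words).most_common(top)
    if not counts:
        return {}
    biggest = counts[0][1]
    # The most common word gets max_pt; the rest scale down, never below min_pt.
    return {word: max(min_pt, round(max_pt * n / biggest)) for word, n in counts}
```

A word that appears twice as often as another is given roughly twice the point size, which is all that "proportionately larger" requires.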
Soon after President Obama’s inauguration a word cloud with a
graphical-semiotic representation of his 21-minute speech appeared
on the web. The three most common words were nation, America
and people. His predecessor’s had been freedom, America and liberty.
Abraham Lincoln had majored on war, God and offence. The tech-
nique has a utility beyond identifying themes. Social-networking sites
let users “tag” pages and images with words describing the content.
The terms displayed in a “tag cloud” are links that will bring up a list
of the related content.
Another way to present text, devised by Mr Wattenberg and a col-
league at IBM, Fernanda Viégas, is a chart of edits made on Wikipedia.
The online encyclopedia is written entirely by volunteers. The software
creates a permanent record of every edit to show exactly who changed
what, and when. That amounts to a lot of data over time.
One way to map the process is to assign dif-
ferent colours to different users and show how
much of their contribution remains by the
thickness of the line that represents it. The en-
try for “chocolate”, for instance, looks smooth
until a series of ragged zigzags reveals an item
of text being repeatedly removed and restored
as an arcane debate rages. Another visualisa-
tion looks at changes to Wikipedia entries by
software designed to improve the way articles
are categorised, showing the modiﬁcations as
a sea of colour. (These and other images are
Is it art? Is it information? Some data-visual
works have been exhibited in places such as
the Whitney and the Museum of Modern Art
in New York. Others have been turned into
books, such as the web project “We Feel Fine”
by Jonathan Harris and Sep Kamvar, which
captures every instance of the words “feel” or
“feeling” on Twitter, a social-networking site,
and matches it to time, location, age, sex and
even the weather.
For the purposes of data visualisation as
many things as possible are reduced to raw
data that can be presented visually, sometimes
in unexpected ways. For instance, a represen-
tation of the sources cited in the journal Na-
ture gives each source publication a line and
identiﬁes different scientiﬁc ﬁelds in different
colours. This makes it easy to see that biology
sources are most heavily cited, which is un-
surprising. But it also shows, more unexpect-
edly, that the publications most heavily cited
include the Physical Review Letters and Astro-
The art of the visible
Resembling a splendid orchid, the Nature
chart can be criticised for being more pictur-
esque than informative; but whether it is more
art or more information, it offers a new way
to look at the world at a time when almost ev-
erything generates huge swathes of data that
are hard to understand. If a picture is worth a
thousand words, an infographic is worth an
awful lot of data points.
Visualisation is a relatively new discipline.
The time series, the most common form of
chart, did not start to appear in scientiﬁc
writings until the late 18th century, notes
Edward Tufte in his classic “The Visual Dis-
play of Quantitative Information”, the bible
of the business. Today’s infographics experts
are pioneering a new medium that presents
meaty information in a compelling narrative:
“Something in-between the textbook and the
novel”, writes Nathan Yau of UCLA in a re-
cent book, “Beautiful Data”.
It’s only natural
The brain ﬁnds it easier to process infor-
mation if it is presented as an image rather
than as words or numbers. The right hemi-
sphere recognises shapes and colours. The
left side of the brain processes information
in an analytical and sequential way and is
more active when people read text or look at
a spreadsheet. Looking through a numerical
table takes a lot of mental effort, but infor-
mation presented visually can be grasped in
a few seconds. The brain identiﬁes patterns,
proportions and relationships to make in-
stant subliminal comparisons. Businesses
care about such things. Farecast, the online
price-prediction service, hired applied psy-
chologists to design the site’s charts and co-
These graphics are often based on immense
quantities of data. Jeffrey Heer of Stanford
University helped develop sense.us, a website
that gives people access to American census
data going back more than a century. Ben Fry,
an independent designer, created a map of the
26m roads in the continental United States.
The dense communities of the north-east
form a powerful contrast to the desolate far
west. Aaron Koblin of Google plotted a map
of every commercial ﬂight in America over 24
hours, with brighter lines identifying routes
with heavier trafﬁc.
Such techniques are moving into the busi-
ness world. Mr Fry designed interactive charts
for GE’s health-care division that show the
costs borne by patients and insurers, respec-
tively, for common diseases throughout peo-
ple’s lives. Among media companies the New
York Times and the Guardian in Britain have
been the most ambitious, producing data-rich,
interactive graphics that are strong enough to
stand on their own.
The tools are becoming more accessible.
For example, Tableau Software, co-founded in
2003 by Pat Hanrahan of Stanford University,
does for visualising data what word-process-
ing did for text, allowing anyone to manipu-
late information creatively. Tableau offers both
free and paid-for products, as does a website
called Swivel.com. Some sites are entirely free.
Google and an IBM website called Many Eyes
let people upload their data to display in novel
ways and share with others.
Some data sets are best represented as a
moving image. As print publications move
to e-readers, animated infographics will
eventually become standard. The software
Gapminder elegantly displays four dynamic
variables at once.
Displaying information can make a dif-
ference by enabling people to understand
complex matters and ﬁnd creative solutions.
Valdis Krebs, a specialist in mapping social
interactions, recalls being called in to help
with a corporate project that was vastly over
budget and behind schedule. He drew up an
intricate network map of e-mail trafﬁc that
showed distinct clusters, revealing that the
teams involved were not talking directly to
each other but passing messages via manag-
ers. So the company changed its ofﬁce lay-
out and its work processes—and the project
quickly got back on track.
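The kind of check Mr Krebs describes can be approximated with a small graph routine. The Python sketch below is an assumption about how one might do it, not his method: the function name and the sample names in the usage note are hypothetical. It groups people into clusters linked by e-mail, optionally leaving some people out, so that one can see whether teams stay connected once the managers are removed.

```python
from collections import defaultdict
from typing import AbstractSet, Dict, Iterable, List, Set, Tuple


def clusters(emails: Iterable[Tuple[str, str]],
             exclude: AbstractSet[str] = frozenset()) -> List[Set[str]]:
    """Connected groups of people linked by e-mail, ignoring anyone in `exclude`."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for sender, recipient in emails:
        if sender in exclude or recipient in exclude:
            continue
        graph[sender].add(recipient)
        graph[recipient].add(sender)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for person in list(graph):
        if person in seen:
            continue
        # Depth-first walk collecting everyone reachable from `person`.
        group: Set[str] = set()
        stack = [person]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(graph[current] - group)
        seen |= group
        groups.append(group)
    return groups
```

If the full e-mail log forms one cluster but clusters(log, exclude=managers) splits into several, the teams are talking only through their managers.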
Needle in a haystack
The uses of information about information
AS DATA become more abundant, the
main problem is no longer ﬁnding the
information as such but laying one’s hands
on the relevant bits easily and quickly.
What is needed is information about infor-
mation. Librarians and computer scientists
call it metadata.
Information management has a long his-
tory. In Assyria around three millennia ago
clay tablets had small clay labels attached to
them to make them easier to tell apart when
they were ﬁled in baskets or on shelves. The
idea survived into the 20th century in the
shape of the little catalogue cards librarians
used to note down a book’s title, author,
subject and so on before the records were
moved onto computers. The actual books
constituted the data, the catalogue cards the
metadata. Other examples range from package
labels to the 5 billion bar codes that are
scanned throughout the world every day.
These days metadata are undergoing a
virtual renaissance. In order to be useful, the
cornucopia of information provided by the
internet has to be organised. That is what
Google does so well. The raw material for
its search engines comes free: web pages on
the public internet. Where it adds value (and