Book Reviews

(Photo: Joe Baker)

Through a Data Set, Darkly

• January 08, 2014 • 6:00 AM

Is quantitative analysis the secret to understanding culture?

What to do with all this data? There’s now so much digital information being produced (about a terabyte per capita per year worldwide) that a major industry has arisen in response, promising to mine it for insight. A widely cited paper published in 2011 by the World Economic Forum and Bain & Company called personal data “the new oil.” According to its many boosters, Big Data—the neologism refers both to the bounty of information and to its analysis—is on the cusp of revolutionizing health care, scientific research, education, insurance, public health, energy policy, and national security.

The humanities are also jumping on the Big Data train. With museum archives, ancient manuscripts, and whole libraries being digitized, some researchers argue that data analysis will let studies of culture finally claim some of the empirical certainty traditionally associated with “hard” sciences like chemistry and physics. In their new book, Uncharted: Big Data as a Lens on Human Culture, Erez Aiden, a mathematician and geneticist, and Jean-Baptiste Michel, a data scientist, echo that argument, expressing total confidence that “Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower.” Access to unprecedented amounts of information, they contend, will give rise to “culturomics,” a new form of cultural history that wrings its insights from sophisticated data analysis.

The main culturomic tool Aiden and Michel discuss is one they invented themselves: the Google Ngram Viewer, which lets users search a selection of the enormous Google Books archive for incidences of words over time. Type in a word and you get a handsome chart ostensibly showing how that word rose or fell in popularity through the years. When Ngram launched in 2010, it received millions of hits in its first 24 hours online, was covered on the front page of The New York Times, and was heralded as a fun—and potentially revolutionary—tool for exploring lexicographic history.

Aiden and Michel are enamored of their creation; just as the telescope helped us unlock the secrets of the night sky, they write, Ngram will help us unlock the secrets of culture. In their enthusiasm, however, they gloss over some of the program’s practical limitations. Early in Uncharted, they admit that Ngram’s data set, drawn from more than five million books published in seven languages between 1500 and 2000, is far from perfect. For starters, it uses only a small fraction of the Google Books reservoir of 30 million titles; major swathes remain inaccessible because of copyright concerns. (Plus, about 100 million books remain undigitized, according to Google’s estimate.) But after some early mentions, these shortcomings hardly come up again, as if merely acknowledging their existence solved them for good.

Attempting to showcase the exciting potential of culturomics, the authors use Ngram to detect when certain words and phrases replaced others (when, for example, World War I became more popular than the Great War); when certain irregular verbs (burnt) became regularized (burned); and when notable personalities’ fame waxed and waned. Combining Ngram charts with a potted history of Nazi censorship, they try to show how Jewish artists like Marc Chagall were briefly erased from history.

These are interesting observations, but too often, even the findings cherry-picked by Ngram’s creators are inconclusive or merely confirm things we already know. Dictators get written about a lot. Hollywood fame is typically fleeting. Charles Dickens helped popularize the greeting Merry Christmas. And some of the observations are just meaningless: At one point, we learn, the term Bill Clinton appeared almost as frequently as lettuce.

Aiden and Michel also fail to ask some basic questions about their own methods. Ngram counts the number of books in which a term was used, but it doesn’t consider the context in which the term was used within an individual book, nor how widely a book was read. And what about the influence of other media—television, film, radio, newspapers—in word usage?

Perhaps the most serious problem with Uncharted is that, like many Big Data enterprises, it introduces numerous correlations without illuminating much about their causes. Aiden and Michel come up with some evocative results, but then barely dig beneath the topsoil for explanations. When a causal mechanism isn’t immediately apparent, they’re far too willing to chalk it up to the mysteries of a still-developing field—something to be solved by Ngram’s descendants—rather than their own limited curiosity or unwillingness to engage with “traditional” history.

A prime example can be found in their treatment of a finding about the persistence of calendar years in public consciousness. 1872, they tell us, was a big deal in 1872, but steadily declined in mentions thereafter. Same for 1873, and so on. Hopping forward a century, Aiden and Michel notice a change. Starting in 1973, it appears that “the half-life of collective forgetting” began to decrease. Years, and their histories, seemed to start disappearing faster from public memory.

“What caused that change?” Aiden and Michel ask. It’s a potentially fascinating question, and one that probably couldn’t have been posed without Big Data’s help. But in addition to ignoring the possible flaws in their sample, Aiden and Michel ignore history itself. The fact is, 1973 was a momentous year: the Yom Kippur War started (and ended), the United States began its pullout from Vietnam, Salvador Allende was overthrown, and the Watergate investigation was plastered across newspapers’ front pages. But it also came early in the era of satellite news, cable TV, and personal computing. Surely these factors had some effect on how history was written and published; perhaps the flowering of real-time, televised reporting came at the expense of longer, book-length treatments of history. But Aiden and Michel mention none of these events—nor anything else from 1973. “We don’t know,” they write. “For now, all we have are the naked correlations. … It may be some time before we figure out the underlying mechanisms.”

Uncharted ends on a note of intense optimism, with its authors looking beyond Google Books to the “tidal wave of data that will soon break over the social sciences.” After all, we are all data producers now—you and I and everyone we know. Aiden and Michel suggest that culturomics—armed with information from our tweets, Facebook updates, and perhaps devices that will record “every sensory experience, every beat of our heart, every rumble in our stomach, and even every thought that crosses our mind”—could make it possible to foresee the future of civilization.

That’s a lofty endnote for a thin book whose specific conclusions are limited and not particularly surprising. But this is typical of arguments on behalf of Big Data, which are prone to overstatement. The media scholars Danah Boyd and Kate Crawford convincingly argued in a 2011 paper that Big Data “encourages the practice of apophenia: seeing patterns where none actually exist.” The lure of deeper patterns leads almost invariably to the collection of more data. Or, as General Keith B. Alexander, director of the National Security Agency, put it recently, regarding widespread government surveillance of citizens: “You need the haystack to find the needle.”

In academia, overenthusiasm for suggestive correlations makes for shoddy scholarship. Out in the wider world, it affects lives, as when the government puts you on a no-fly list because the latest algorithm says your digital profile resembles that of a terrorist, or when a mortgage company denies you a loan based on patterns in your social media presence. Uncharted is, unwittingly, a cautionary tale for Big Data enthusiasts of all stripes. In their rush to inaugurate a powerful new discipline, Aiden and Michel overlook the fact that digital data and computer code are human inventions, incorporating human error, biases, and modes of thought—the stuff of history.


This post originally appeared in the January/February 2014 issue of Pacific Standard as “Through a Data Set, Darkly.”

Jacob Silverman
Jacob Silverman is writing a book about social media and digital culture, which will be published by HarperCollins later this year.

Copyright © 2014 by Pacific Standard and The Miller-McCune Center for Research, Media, and Public Policy. All Rights Reserved.