Garbage In, Gospel Out: When Big Data Leads to Bad Conclusions

Joan Westenberg
11 min read · Aug 20, 2023


In 2012, an angry father stormed into a Target store, demanding to speak to the manager. He was waving coupons for cribs and baby clothes the store had mailed to his teenage daughter. “She’s still in high school, and you’re sending coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” he demanded.

The manager was bewildered until he examined the mailer and realized the father was correct — Target’s data mining had identified his daughter as potentially pregnant before the father knew. A few days later, the abashed father called to apologize. “I talked with my daughter,” he admitted. “There’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

This real-world story, reported by Charles Duhigg in the New York Times Magazine, highlights the promise and perils of big data analytics. Drawing on the statistical work of analyst Andrew Pole, Target had broken new ground in customer tracking. By identifying combinations of products that indicated a woman was pregnant, sometimes before her own family knew, Target could reach moms-to-be with coupons at a crucial and expensive life stage. The data promised valuable insights but also raised worrying privacy implications.
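The mechanics Duhigg describes are conceptually simple: weight a handful of telltale purchases (unscented lotion, supplements, and large bags of cotton balls all featured in his reporting) and flag any shopper whose combined score crosses a threshold. Here is a minimal sketch in Python, with invented weights and an invented threshold rather than Target's actual model:

```python
# Toy pregnancy-prediction scorer in the spirit of Duhigg's account.
# The products echo those reported in the story; the weights and the
# threshold are invented for illustration, not Target's real model.
PREGNANCY_SIGNALS = {
    "unscented lotion": 0.30,
    "prenatal vitamins": 0.45,
    "zinc supplements": 0.20,
    "cotton balls (large bag)": 0.15,
}

def pregnancy_score(basket: list[str]) -> float:
    """Sum the weights of known signal products in a shopper's basket."""
    return sum(PREGNANCY_SIGNALS.get(item, 0.0) for item in basket)

basket = ["unscented lotion", "prenatal vitamins", "bread"]
if pregnancy_score(basket) > 0.6:  # 0.30 + 0.45 = 0.75 crosses the bar
    print("mail baby-product coupons")
```

Even a scorer this crude makes the privacy stakes concrete: the shopper never declared anything, yet the inference arrives in her mailbox.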

Big data offers tantalizing promises — from optimized operations to breakthroughs in healthcare. Machine learning algorithms can detect patterns and make predictions that would elude humans. However, in our rush to embrace data-driven decisions, we risk falling victim to biased, incomplete, or just plain wrong conclusions.

Big data is only as good as the input data. Biased or unrepresentative data leads to blind spots and distortions. Just because an algorithm spits out a prediction doesn’t mean we should accept it uncritically. Algorithms can perpetuate biases that exist in society. Criminal justice systems have used sentencing algorithms later found to embed racial discrimination. Election predictions consistently fail to account for uncertainties like last-minute surprises. Targeted advertising can lead to exploitation based on what the data says we want.

The issue often comes down to mistaking correlation for causation. The numbers may reveal intriguing correlations, but the reasons behind those correlations matter. We also suffer from cognitive biases that influence how we interpret data. Our eagerness to find patterns that confirm our beliefs leads us down rabbit holes of truthiness.

So where does this leave us? Despite advances in AI, human oversight remains essential. We need intellectual humility to be open to being wrong, even when the data says otherwise. Critical thinking allows us to question the data and investigate inconsistencies rigorously. Examples abound where human discretion corrected bad decisions born of blind trust in data.

Rather than worshipping at the altar of big data, we should view it as one input to inform human judgment, not replace it. By using data analytics responsibly while retaining space for compassion and ethics, we can maximize its benefits while avoiding the pitfalls. The data revolution empowers us to uncover new truths about the world — but only if we stay grounded in our humanity.

The promise and pitfalls of big data

The lure of big data is irresistible. The promise of gleaning insights from vast troves of information is the holy grail of the information age — the magic key to unlocking transformative knowledge from the overwhelming ocean of bits and bytes surrounding us. Yet we would do well to remember that for all its glimmering potential, big data is not without its pitfalls.

In the mid-2000s, excited murmurs about the coming big data revolution began spreading through executive suites and research labs. With storage, memory, and processing power expanding exponentially, gathering, storing, and analyzing data on a massive scale was suddenly possible. Clickstreams, surveillance videos, genomic databases, social media posts — the very ether of the digital universe could be mined for precious insights.

The potential use cases were astounding. Retailers could perform real-time personalized marketing at a level never before possible. Healthcare providers could uncover lifestyle and genetic predictors of disease buried deep in patients’ records. Governments could optimize infrastructure planning based on fine-grained patterns of transit usage. The list went on.

Of course, the promise of big data did not materialize out of thin air. It was the product of skillful rhetorical framing and more than a little overzealous marketing from technology companies eager to sell the next generation of hardware and software. Big data was that beautiful, pesky idea that thrills executives but perplexes engineers. It sounded magnificent — who wouldn’t want to derive value from data at scale? — but was maddeningly vague on specifics.

This ambiguity in definition allowed the hype around big data to swell to unrealistic proportions. By the early 2010s, it had taken on the feel of a mania, touted as a cure-all for institutional challenges of every stripe. Frenzied think pieces heralded it as a seminal technological shift on the order of the internet or electricity.

But as is often the case with new technologies, the reality did not match the soaring rhetoric. In implementation, big data proved more complicated than its boosters had let on. The challenge was not just acquiring and storing massive datasets but deriving usable insights from their depths — a task that data scientists found far from straightforward.

For starters, real-world data tends to be messy, rife with gaps, anomalies, and biases. Critical context is missing. Correlations pop up by the thousands, but determining which ones indicate meaningful relationships requires careful statistical and analytical work — not to mention a healthy dose of human judgment.

Ironically, having too much data can impair decision-making. With such a dizzying array of patterns and connections to analyze, it becomes dangerously easy to lose perspective. Statistical flukes can look like — or be p-hacked into — causal relationships. Spurious correlations seem significant. There are simply too many trails to follow and rabbit holes to disappear down.
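The multiple-comparisons trap behind many of these flukes is easy to demonstrate. In the sketch below, which uses pure noise rather than real data, one random "outcome" is correlated against a thousand random "features"; the best match routinely looks significant even though every relationship here is a fluke by construction:

```python
import numpy as np

# A minimal sketch of the multiple-comparisons trap: with enough
# candidate features, pure noise produces "impressive" correlations.
rng = np.random.default_rng(seed=42)

n_samples, n_features = 100, 1000
outcome = rng.normal(size=n_samples)                  # random "metric"
features = rng.normal(size=(n_samples, n_features))   # random "signals"

# Pearson correlation of each feature with the outcome.
corrs = np.array([np.corrcoef(features[:, i], outcome)[0, 1]
                  for i in range(n_features)])

best = np.argmax(np.abs(corrs))
print(f"Best of {n_features} random features: r = {corrs[best]:.2f}")
# Typically prints r around 0.3-0.4 -- "significant" at p < 0.01 if
# tested alone, yet guaranteed to be a statistical fluke here.
```

Corrections such as the Bonferroni adjustment exist precisely because of this effect, but they only help when analysts remember how many comparisons they actually ran.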

Proponents counter that these are merely growing pains — problems that better tools and techniques will soon overcome. This may be true. But there are also deeper reasons to be wary of wholeheartedly embracing big data.

Namely, certain biases and assumptions are embedded in data collection — human biases encoded into the methodologies of recording information. Data is never truly raw or objective; choices are always involved in what is measured and what is left out. Those choices shape what questions can be asked and what answers might be found.

For example, facial recognition algorithms trained chiefly on photos of white men do not perform as well with women and people of color. Predictive policing systems trained on records of arrests — which disproportionately target minorities — inherit and perpetuate those same biases. The blind spots of data collection become blind spots of analysis.

There are also profound ethical considerations in relying on data for consequential decisions. Resting life-altering judgments — prison sentences, insurance rates, access to healthcare — solely on algorithmic calculations strips out human context and discretion. It reduces multifaceted people to data points and probabilities.

How big data can distort the truth

Big data promises an intoxicating vision: make sense of the incomprehensible through technology. Tame the wilderness of information overload and extract order from the chaos. Reveal insights hidden below the surface.

But there is a wizard behind the curtain — human hands design and interpret big data analytics. However sophisticated the system, the software encodes the assumptions and beliefs of its creators. This means pernicious biases can subtly creep in, distorting the truths found. Big data has no inherent resistance to human error.

Several high-profile big data failures in recent years highlight the pitfalls of treating its insights as absolute truth. Let’s examine a few telling examples across different domains.

During the 2016 U.S. presidential election, Nate Silver’s polling aggregation site FiveThirtyEight predicted a likely victory for Hillary Clinton, and various other quantitative election models agreed. Of course, the actual outcome defied these forecasts, suggesting the models overlooked uncertainties in the electorate.

In targeted digital advertising, big data techniques allow pinpointing users with military-grade precision. This enables advertisers to exploit vulnerabilities and finely tune persuasive messaging to maximize impact — even if doing so works against individual and societal well-being. Here, more data facilitates more effective manipulation.

Flaws also lurk in the criminal justice system’s growing reliance on recidivism risk assessment algorithms. Though such tools are touted as more objective than human judges, ProPublica found that one widely used system was heavily biased against black defendants, incorrectly labeling them high risk at almost twice the rate of white defendants. Yet its probabilistic output lent it the illusion of scientific impartiality.
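The heart of the audit ProPublica ran can be sketched in a few lines. The numbers below are invented for illustration (the real analysis drew on thousands of COMPAS records): for each group, compute the false positive rate, meaning the share of people labeled high risk among those who did not go on to reoffend:

```python
import pandas as pd

# A sketch of a disparate-error-rate audit with invented example data
# (not the actual COMPAS records ProPublica analyzed).
df = pd.DataFrame({
    "group":      ["black"] * 6 + ["white"] * 6,
    "high_risk":  [1, 1, 1, 0, 1, 0,  1, 1, 0, 1, 0, 0],  # tool's label
    "reoffended": [1, 0, 0, 0, 1, 0,  1, 0, 0, 1, 0, 0],  # observed outcome
})

# False positive rate: labeled high risk among those who did NOT reoffend.
for group, g in df.groupby("group"):
    no_reoffense = g[g["reoffended"] == 0]
    fpr = no_reoffense["high_risk"].mean()
    print(f"{group}: FPR = {fpr:.0%}")
# With these invented rows: black 50%, white 25% -- a 2x disparity.
```

A tool can show this kind of gap while appearing equally accurate overall, which is why audits need to break error rates out per group rather than report a single aggregate number.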

MEDLINE, the vast database of medical research papers, has been indispensable for discoveries over the past decades. Yet a recent study found that most subjects in these papers are drawn from WEIRD populations — Western, Educated, Industrialized, Rich, and Democratic societies. This skews results and hampers extrapolating knowledge to other groups. The data has intrinsic demographic limitations.

What connects these examples? In each case, human cognitive biases corrode the objectivity of the data analysis. Rather than being neutral tools, these systems refract existing prejudices and reflect reality only partially. Their mistakes pass as truths when taken at face value.

Data analysts — being human — unconsciously privilege confirming evidence over contradicting data in shaping conclusions. When results validate preconceived narratives, inconsistencies get overlooked. This confirmation bias distorts interpretation.

Similarly, the temptation often arises to cherry-pick data points that fit a desired story while ignoring those that complicate it: emphasizing polls that favor a preferred candidate while dismissing contradictory surveys, or optimizing algorithms to maximize ad clicks while disregarding negative emotional consequences. In these ways, analysis gravitates towards comforting falsehoods over inconvenient truths.

Unquestioning trust in big data insights tilts toward truthiness — the seductive veneer of statistics and models making something feel true, although it misrepresents reality. Big data lends a misleading air of precision and objectivity to inaccurate or biased claims — garbage in, gospel out.

Does this render big data hopelessly hazardous? Not at all. Its risks are manageable when we approach analysis responsibly:

First, rigorously audit systems for embedded biases, fill knowledge gaps, and stress-test assumptions. Strive for diverse training data, and be alert for algorithmic prejudices. Make transparency and accountability top priorities.

Second, communicate insights with humility and nuance. Data reveals shades of probability, not absolute certainties. Beware of extrapolating from limited data pools. A holistic human perspective is indispensable.

Finally, recall big data’s role as a decision aid, not the decision maker. It should inform human judgment, not replace it. Incorporate ethics and social responsibility at every step of the process.

With thoughtful oversight, big data can enlarge understanding and enhance lives. But we must remember its flaws reflect our own. Its sheen of truth comes from those who shape it. Real wisdom means recognizing how bias distorts reality in data and beyond.

The importance of the human element in interpreting data responsibly

The exponential growth of data and artificial intelligence brings remarkable possibilities. Yet it also surfaces an uncomfortable truth: machines do not intrinsically possess human virtues like discretion, critical thinking, or humility. For all their analytical power, computers ultimately reflect the priorities of their programmers. This makes responsible human oversight of data essential.

Recent years have seen stumbles in algorithmic systems entrusted with significant public responsibilities. In the criminal justice arena, some jurisdictions implemented recidivism risk assessment tools to guide sentencing. However well-intentioned, problems emerged. The algorithms frequently demonstrated racial bias, issuing harsher risk scores for minority defendants.

In one state, authorities investigated discrepancies in the system’s results. Persistent human questioning uncovered that the tool’s training data was derived entirely from incarceration records rather than actual recidivism rates. This led to black defendants being classified as higher risk due to structural over-policing of minority communities. Critically investigating inconsistencies revealed flaws that pure data analysis had obscured. Once identified, authorities enacted reforms to address the biased outcomes.

A similar awakening emerged in the world of social media. Platforms initially saw themselves as neutral conduits for all perspectives. Moderation was light, driven by non-judgmental popularity metrics. Over time, mounting societal damage became apparent. Digital tools designed for connection and information were co-opted for disinformation and extremism.

Once again, conscientious human re-assessment of values was required. New moderation policies introduced standards for truthfulness, compassion, and diversity — ethical priorities the technology itself lacked. Imparting social responsibility and sound values required deliberate human guidance.

Intellectual humility is often missing from data-centric decision-making: the recognition that knowledge derived from data has boundaries and blind spots. Data provides a powerful lens but captures only a partial view. Pure information lacks unwritten context.

This is why critical thinking must partner with data analysis to catch oversights — rigorous self-questioning counters the tendency to see only what confirms pre-existing beliefs. It also uncovers faulty assumptions, sample biases, and logical fallacies lurking beneath the numbers.

Data insights flatter our love of certainty. But the truth is often complex, multilayered, and nuanced — not easily reducible to models. However sophisticated algorithms become, human judgment remains vital to weigh contradictions and discern deeper relationships from correlations.

When powered by human wisdom, data can guide incredible progress. But if not tempered by ethics, critical thought, and humility, it risks magnifying our worst impulses. Our shared moral code and sense of social responsibility are what anchor technology to the common good.

Finding the proper equilibrium between data-driven optimization and human discretion remains an ongoing challenge across many fields. In medicine, research insights must be weighed against individual patient needs and values; data reveals probabilities, not certainties.

For self-driving cars, split-second navigation decisions require fusing sensor inputs with social ethics, combining fluid machine proficiency with human notions of safety for the greater good.

Upholding social goods — compassion, justice, truth — transcends calculation. Wisdom means appreciating data’s powers and limits, then guiding it with moral purpose.

The big data revolution brings immense opportunities to extract insights from information at an unprecedented scale. Yet with these powers come responsibilities. Big data holds no special immunity to human bias or blindness. AI and data worship risk distorting the truth as much as revealing it.

As the examples show, bias readily creeps into data collection and analysis unless vigilantly guarded against — cognitive errors like confirmation bias tilt interpretation, privileging comforting falsehoods over inconvenient truths. Truthiness — the veneer of statistics making something feel accurate despite distortions — pervades predictions and models unless questioned through rigorous critical thinking.

This is not to deny big data’s tremendous potential but to recognize its inherent limitations. Big data expands understanding but does not innately impart wisdom. It provides a powerful lens but not a complete picture. Even with sophisticated AI, human oversight and discretion remain essential to employ it ethically.

There is a complex art in distilling truth from vast data flows whirling around us. As the pioneering analyst John Tukey wrote, “Far better an approximate answer to the right question … than an exact answer to the wrong question…” In this spirit, we have to insist that big data inform and uplift human understanding, never obscure or demean it. Handled responsibly, the data deluge can reveal profound truths about ourselves and the world at a scale never before dreamed. Handled irresponsibly, it could mean the death of our shared human judgment and experience.

I’m Joan. Transgender. Solopreneur. Tech writer. Founded studio self, a marketing agency, community, & product lab. We publish The Index, an indie tech publication & more.

https://linktr.ee/joanwestenberg
