The Data Driving Democracy

Key Insights: What Can These Data Tell Us?

Christina Couch
Commission on the Practice of Democratic Citizenship

Interviews for this report included one extraordinarily broad and borderline unfair question: From your perspective, what questions can the available data and research methodologies answer, and what questions can’t be answered in this field? This section details a handful of key insights that highlight how experts view their field of study. The insights showcased here do not comprehensively cover digital political participation research, but are instead intended to give readers an introduction to what the landscape of online civic engagement research looks like to those on the inside.

1. Current data resources, tools, and methodologies can efficiently track how specific messages spread across individual platforms and channels, who they target, and how they change over time, but understanding how messages ripple through the larger media ecosystem is still an open question. Understanding their impact on beliefs and behaviors is also an open question.

In 2012, a local Florida news station aired a small story about the shooting of an unarmed teenager named Trayvon Martin. The Orlando Sentinel and the Miami Herald newspapers also ran articles on Martin’s death, but the story nearly stopped there. It wasn’t until ten days later that the national news media broke the story to a broader audience. The shooting was brought to national attention largely thanks to Benjamin Crump, a civil rights attorney who took on Martin’s case pro bono, and to publicist Ryan Julison. Martin’s story didn’t just spread from mainstream news media to race-based media outlets, activist sites, and a petition backed by celebrity voices; it pivoted along the way, from a story framed as an altercation between two people to one centered around an unarmed Black teenager dying at the hands of a neighborhood watch vigilante who wasn’t held accountable.

Readers can follow the exact path that Martin’s story took to get from that first Fox 35 Orlando piece to national protests and remarks from the President,32 as well as how the story changed along the way, thanks to media mapping research from Erhardt Graeff, Matt Stempeck, and Ethan Zuckerman.33 To map Martin’s story, the team used quantitative data from eight different sources (RSS feeds collected with the Media Cloud media analysis tool, front page national newspaper stories, broadcast television news mentions, Google searches for “Trayvon Martin,” Google searches for “George Zimmerman,” tweets, petition signatures, and clicks on bitly links within Media Cloud stories), combined with firsthand interviews.

Ethan Zuckerman, director of the Center for Civic Media and associate professor of the practice in media arts and sciences at the Massachusetts Institute of Technology, was the principal investigator on the paper. He recounts how the project was perceived by others in the field: “Someone cited Erhardt’s Trayvon Martin paper and basically said, ‘The researchers did this using eight different data sources. Obviously, that is insane,’” Zuckerman said, laughing. “My response was I get it, but it’s not insane. Actually we felt bad that we didn’t get certain other data sources into that study.”34

Media ecosystem studies that follow messages as they move between platforms and throughout the larger digital universe are far more rare than studies that track how messages spread on one specific platform. Experts largely agreed that current data and research methodologies are effective at identifying influential voices, issues users are talking about, accounts that are exhibiting problematic behaviors, and how messages spread within a specific platform. Many experts pointed to the entire subfield of research on the mechanisms by which fake news and disinformation spread—we have several examples35 —as crucial additions to digital civic engagement literature and as proof that valuable, useful conclusions can be drawn from single platform studies.

“You can answer lots of questions that look at how different segments of society use specific social media platforms,” said Deen Freelon, associate professor in the Hussman School of Journalism and Media at the University of North Carolina at Chapel Hill.36 “You can get a sense of how information circulates within the platforms. There’s some information about the effects that it has, who ends up engaging with it, how information flows between the people that originate it if they’re not already media professionals who are well known, and how it originates, how it flows towards folks that have more visibility and reach. Those are the kinds of questions that I think are well-answered.”

Experts also agreed that single platform studies alone can’t provide a comprehensive view of how influence moves and gets amplified throughout the Internet or of whether these messages actually impact beliefs and behaviors. Since both political outreach and disinformation campaigns are often designed to push users to further engagement across platforms (think Twitter posts that link to YouTube videos that link to blogs), single platform studies only offer a small piece of a much larger picture.

“As great as it is to have papers that are about how a certain thing travels on Facebook or on Twitter with regards to Myanmar or Mexican elections or Indian elections, none of us only exist in that one space and there are other spaces that are more important in different places in the world,” said Alondra Nelson, president of the Social Science Research Council and Harold F. Linder Chair in the School of Social Science at the Institute for Advanced Study.37 “It’s all well and good to pay a research team to do some research on the polarization situation on Twitter vis-à-vis a certain issue, but how are we going to understand how and when that moves from WhatsApp to Instagram? How different generations are using different apps to [stoke] virality? The role of bots? It’s such a complicated thing.”

Experts were quick to point to what they viewed as seminal ecosystems research, including Yochai Benkler, Robert Faris, and Hal Roberts’ 2018 book Network Propaganda: Manipulation, Disinformation, and Radicalization in American Politics,38 which presents a map of the American political and media landscape during the 2016 presidential election (a 2017 study39 by the same authors plus several additional ones preceded the book), and Zeynep Tufekci’s 2017 book Twitter and Tear Gas: The Ecstatic, Fragile Politics of Networked Protest in the 21st Century,40 which examines the role of the Internet in modern protest movements. They were also quick to note the barriers that prevent researchers from doing more multi-platform ecosystem studies, which range from a lack of technical tools to make data collection and analysis easier across platforms to research infrastructures that inhibit interdisciplinary work to lack of coordination between stakeholders within academia and the tech industry.

An ecosystems approach “might mean that we need to have journalists working with Facebook and have [platforms offer] clear understandings of the technologies for journalists,” said Joan Donovan, director of the Technology and Social Change Research Project at Harvard Kennedy School’s Shorenstein Center on Media, Politics and Public Policy.41 “We need to have civic society organizations be able to report to platform companies quickly that there’s something happening within their communities that is suspect or that there is some kind of manipulation campaign. We also need university researchers to be able to access data and to be able to audit platforms so that their research isn’t so patchwork.” Donovan clarified that the data she is referring to are related to platforms’ revenue, advertising, and manipulation campaigns; not data about individual users.

2. Current data resources, tools, and methodologies can offer valuable insights into specific demographic pockets in ways that were not possible before the digital age, but these online data are not representative samples of the general population. Insights gained come with limitations on how far they can be extrapolated.

When discussing how the Internet has changed the way researchers measure civic engagement, many experts brought up demographic granularity. Existing data resources and tools provide new ways (and more streamlined ways) to identify and study highly specific demographics and subpopulations as well as smaller, community-specific websites, blogs, social media, and other communication networks. Just as digital platforms and “big data” resources have given rise to microtargeted political ads and outreach strategies, they’ve also created new mechanisms for answering detailed, community-specific questions like who speaks up about housing development at planning and zoning board meetings in eastern Massachusetts42 and how do local governments leverage social media in crises.43

Experts noted that digital data also provide new ways for researchers to study political attitudes and ideologies that are commonly considered socially sensitive or outright unacceptable. One example a few experts cited is Seth Stephens-Davidowitz’s study that looked at whether the percentage of “racially charged” Google searches made in specific geographic areas during the 2008 and 2012 elections could predict Barack Obama’s vote share in those places, controlling for vote shares of the previous Democratic presidential candidate.44 (Spoiler: They did.) In the case of that particular study, Google searches provided a way to collect aggregate data from a large number of people, a method of pegging search terms to geographic locations, and a window into racial perceptions and ideologies that are often difficult to study using traditional survey methods.

Many experts issued a word of warning with regard to studies that analyze language, keywords, and political speech: Current quantitative methods often have difficulty factoring in tone or context, even when using sentiment analysis tools. That means that it’s often difficult using quantitative methods alone to figure out the intention behind some messages and whether they’re true expressions of how a user actually feels. This obstacle is especially limiting when tracking words like “climate” or “immigration” that have political and nonpolitical connotations, and when studying communities that have their own language and slang conventions. Getting the lingo wrong muddies research results. This points to another challenge of researching political and civic participation online across a range of platforms that are designed to present text alongside video alongside photographs. The variety of content types presented online makes conducting comprehensive research extremely difficult. Researchers’ efforts have been based primarily on text-based online material; we do not yet have the tools to work meaningfully with audio, video, and image content, which represents a large percentage of the content consumed online.

Jonathan Nagler, co-director of the New York University Social Media and Political Participation (SMaPP) Lab and director of the NYU Politics Data Center, said that machine learning text analysis programs are only as good as the datasets they’re trained on.45 Creating high-quality datasets, ones that account for nuances in language, requires careful design, human coders, and quality control measures, all of which demand time and financial investment. Even with ample fiscal and human investment, research methodologies will still struggle to keep up with speech ambiguities and with breaking stories and events that introduce new buzzwords and phrases into the political vernacular. “The constraint is going to be can we as a social science community have the resources to build quality training datasets for the task at hand,” Nagler said.46

While online data and data tools have made it easier to pinpoint some specific demographics and how some individual messages travel within platforms, experts reported that they still struggle to find digital data samples that are representative of the general population. That’s partially because despite having large numbers of users, platforms, search engines, and datasets come with biases that make it difficult to know if inferences drawn from those data can be extrapolated to the broader public.

About 70 percent of Americans use social media,47 but the use is wildly uneven. Twitter, for example, skews more heavily toward younger, liberal, and higher-income users. Nearly one-quarter of all Americans use the platform, but just 10 percent of Twitter users create 80 percent of the site’s content48 and the platform is flooded with content-producing bots that sound increasingly like real people.49 That means that tools like Twitter’s Sample Tweets API,50 which offers a free random sample of all public tweets in real time, pull the majority of content from a small, disproportionately vocal population.51 YouTube and Facebook are, by far, the most widely used platforms; both are heavily used across every age group except those over age sixty-five. In some cases, platforms also skew along racial and ethnic lines. Instagram and WhatsApp, for example, are much more popular among Hispanic users compared to other ethnic demographics.52
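The skew in a random tweet sample can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a uniform random sample of tweets and takes the 10 percent and 80 percent figures cited above at face value. The function name `per_user_weight` is hypothetical and is not part of any Twitter tool.

```python
from fractions import Fraction


def per_user_weight(content_share, user_share):
    """Relative chance that a randomly sampled tweet was written by a given
    member of a group, compared with a world where everyone posts equally.

    Assumption: tweets are sampled uniformly at random, so a group's share
    of the sample equals its share of all content.
    """
    return Fraction(content_share) / Fraction(user_share)


# Figures taken from the text: 10 percent of users create 80 percent of content.
heavy = per_user_weight(Fraction(80, 100), Fraction(10, 100))
# The remaining 90 percent of users therefore create 20 percent of content.
light = per_user_weight(Fraction(20, 100), Fraction(90, 100))

# How many times more likely a heavy user is to appear in the sample
# than a user from the other 90 percent.
ratio = heavy / light

print(f"heavy-user weight: {heavy}")
print(f"light-user weight: {light}")
print(f"heavy vs. light:   {ratio}x")
```

Under these assumptions, a sample of tweets is not a sample of users: any inference about "Twitter users" drawn from such a sample would need to be reweighted toward the less vocal majority.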

Biases inherently mean that some demographics are left out, oftentimes low-income groups and those low in socioeconomic status (more on that in the next section).53 Additionally, researchers are not able to study what’s said privately on these platforms. Even if they have access to all public data, that’s still only a slice of all platform activity and, as one expert said, it’s often “not the most interesting slice.” Perhaps due to privacy concerns, many Americans are now retreating to closed communication forums that offer higher levels of encryption. An unknown number of Americans also use social media platforms that are based outside the United States, several of which (China’s WeChat is one example) seamlessly integrate multiple functions into a single platform.

“I worry that we’re missing large numbers of people,” Henry Brady said. “When we go and get a whole lot of Facebook posts, we just simply don’t know what that’s representative of . . . it’s worrisome that we don’t actually have a good notion of what universe they represent. That’s, I think, the biggest problem with Internet data—you just don’t know what the universe is.”

Experts said that data access problems further complicate this issue by preventing researchers from clearly understanding the biases and limitations of their samples and by inhibiting research replication. Experts also noted that this sampling problem is one that political and social science grappled with in various forms long before the digital age. However, they added, big data digital resources create the illusion of representation and a certain objectivity that’s impervious to human biases (a phenomenon Microsoft principal researcher Kate Crawford calls “data fundamentalism”54), while data access and other problems prevent researchers from fully understanding exactly who and what they’re studying.

3. Platforms can be crucial tools for galvanizing grassroots social movements and elevating marginalized voices, but these spaces and tools largely reinforce existing power structures and biases along race, gender, and socioeconomic lines. Researchers are limited in the ways in which they can interrogate how those power dynamics are established and their ramifications.

Jen Schradie, a sociologist at the Observatoire Sociologique du Changement (OSC) at Sciences Po in Paris, is part of a pool of researchers whose work shows that the problems of representation inherent in digital civic engagement data and exacerbated by platform design often translate to the powerful becoming more powerful. In the late 2000s, Schradie was intrigued by the Internet’s promise to democratize politics. Online, anyone could have a political voice, but Schradie wondered who was actually producing the content that would drive these voices.

Using survey data from roughly forty-one thousand American adults, Schradie analyzed ten ways of creating digital content, ranging from chatroom participation to blog production, and found that, just like in traditional media, digital content production varied along socioeconomic lines. Across all ten production activities, users who had higher levels of formal education were more likely to be content producers. They also produced significantly more content than those lower on the ladder,55 indicating that instead of disrupting existing power structures, the Internet had strengthened voices that were already louder than others.

Schradie and a handful of other researchers have repeatedly shown that online civic engagement spaces, platforms, and tools have evolved, but the digital divide along class and other lines remains. Schradie’s subsequent work has broadly shown class and education gaps in who produces blogs56 and levels of digital activism within grassroots groups.57 In one study, she conducted in-depth interviews and ethnographic research, and examined social media posts from thirty-four activist groups all organized around the issue of collective bargaining and unionization rights for public employees in North Carolina. Out of sixty thousand total tweets posted within the study’s timeframe, all but one had come from middle- and upper-class groups.58

This skew toward wealthier users is largely because content production and digital messaging take time, labor, and resources, all of which tend to be far more available to those higher on the socioeconomic chain, Schradie said. It’s also worth noting that the classic measures of civic and political participation have been rooted in structures of power and privilege that shape decisions about what to measure in the first place. Other experts noted that marginalization is further enforced by bots, social media manipulation, and computational propaganda campaigns that send armies of automated cybertroops to amplify certain messages thousands of times per day and quash others, as well as by algorithms that incentivize sensationalism. Researchers added that these automated programs can outmatch marginalized voices in terms of the volume of posts they make, but it is not clear how much messages from bots drive (or don’t drive) behavior.

Researchers are still struggling to quantify the extent of influence that bots have over online political discourse and offline behavior, but there’s no doubt that bots are extensive, prolific, and sometimes dominant voices around politically charged conversations, in some cases driving more than half of social media conversations around specific topics.59 Platforms have taken steps over the past few years to crack down on malevolent bots,60 and some research suggests that certain steps may help to curtail the problem.

Earlier this year, for example, WhatsApp put harsher restrictions on the number of times a user can forward a specific message,61 reducing the limit from a maximum of twenty groups down to five. In a preprint paper, researchers from Universidade Federal de Minas Gerais in Brazil and the Massachusetts Institute of Technology analyzed public data gathered from WhatsApp in Brazil, India, and Indonesia and found that the new limits didn’t eradicate propaganda, but they did slow the spread of misinformation by about one order of magnitude.62
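To see why a lower forwarding cap can matter so much, consider a deliberately simple upper bound. This is a toy model, not the method used in the preprint: it assumes that every group that receives a message forwards it to the maximum allowed number of new groups at each hop, with no overlap between groups. The function name `max_groups_reached` is hypothetical.

```python
def max_groups_reached(limit, hops):
    """Upper bound on groups reached after `hops` rounds of forwarding.

    Toy assumptions (not taken from the study): every receiving group
    forwards the message to `limit` brand-new groups on the next hop,
    so hop i adds limit**i groups.
    """
    return sum(limit ** i for i in range(1, hops + 1))


# Compare the old cap (twenty groups) with the new cap (five groups).
for hops in (1, 2, 3):
    old_cap = max_groups_reached(20, hops)
    new_cap = max_groups_reached(5, hops)
    print(f"hops={hops}: cap 20 -> {old_cap} groups, cap 5 -> {new_cap} groups")
```

Even in this idealized setting the gap widens with every hop, which is consistent in spirit with the finding that the cap slows spread without stopping it. Real forwarding behavior is far sparser than this bound, so the numbers should not be read as estimates.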

Despite the crackdowns, studies show that organized disinformation campaigns are still growing. The Computational Propaganda Research Project at the Oxford Internet Institute documented social media manipulation campaigns in seventy countries in 2019—that’s up from forty-eight the year before—with Facebook being the favored platform among bad actors in fifty-six nations.63 Experts interviewed for this report credited Alice Marwick and Rebecca Lewis’s 2017 Data & Society report on media manipulation campaigns as a crucial text for understanding how disinformation propagates.64

“It really comes down to money, power, and resources,” says Schradie.65 “Whether you are an individual who is just kind of creating bots on your own, you have to have time to do that. More often, it’s an organization, an institution, or a state government.”

Schradie’s conclusions that digital participation is dominated by the elite and that online spaces and platforms more often than not bolster those who already have political power are echoed by a body of scientific literature and activist work from across the globe.66 Experts who were involved in this area of research pointed to a broad array of projects that document this phenomenon, from Zeynep Tufekci’s work on the ways in which social media makes protest groups vulnerable67 to Cass Sunstein’s research on echo chambers and political polarization68 to David Karpf’s projects around analytic activism69 to Cathy O’Neil’s work chronicling how big data reinforce discrimination along race, sex, and economic lines while appearing neutral.70

Experts also gave a nod to the breadth of research centered around how the broader tech sector unintentionally reinforces racial, gender, and economic inequality, both in and outside of political spheres. Safiya Umoja Noble’s research documenting how search engine algorithms reinforce racism71 and Ruha Benjamin’s work on discriminatory design72 are directly applicable to questions around how the Internet influences civic engagement.

Experts were also quick to say that data access problems and algorithmic opacity prevent researchers from understanding the full extent of these dynamics and from creating solutions (more on these issues in the next section). Many said that they were disturbed and concerned by what we know about biases and political power structures, but they were more concerned about the hidden issues that can’t be uncovered yet because the tools to do so don’t exist and because the data (if they exist at all) are not available.

4. Search engines and social media platforms play an increasingly powerful role in political speech, voter knowledge, and democratic participation, but there is not enough transparency around curatorial and ranking algorithms or around how policies within tech companies are crafted and executed.

There’s clear evidence on how heavily social media, search engines, and online communities influence some civic engagement metrics like galvanization of protest and social movements,73 spread of political messages,74 voter turnout,75 and ability to reach younger voters.76 There’s conflicting evidence about how much these platforms influence other metrics like political perceptions77 and whether political messages translate to changes in behavior or belief (a question that has plagued the study of media forms long before the rise of the Internet).

Experts agreed that social media and search engines, in particular, are becoming increasingly significant to how political messages spread from both the politician and citizen sides, but the exact ways that tech companies influence what users see are unclear and users themselves are frequently unaware. Research shows that users often don’t know that search engine results and social media feeds are curated at all.78 Experts reported that this curatorial opacity prevents researchers from better understanding the landscape and effects of political messaging and misinformation.

Aaron Smith, director of Data Labs at the Pew Research Center, said that current research methodologies can allow researchers to follow how many users are sharing or viewing messages in the aggregate within most platforms, but algorithmic opacity makes it hard to understand the backdrop and context within which individuals encounter those messages.79

On most major social media platforms, “you can’t just look at the people they’re following and know [with certainty] that they encountered a particular tweet or engaged with a particular type of content,” Smith said. “Drawing links between an individual person and the actual content that they’re seeing and being exposed to and engaging with on digital platforms that they use and how that bleeds into things like knowledge of elections or support of candidates or support for conspiracy theories, any question of choice, that linkage is very difficult. . . . That’s kind of the Holy Grail for what we’re trying to figure out.”

Experts said that uncovering those levels of exposure is especially relevant when tracing harmful behaviors like hate speech, harassment, and extremism, especially on networks that monetarily incentivize virality. A few people interviewed for this report praised Rebecca Lewis’s 2018 report, “Alternative Influence: Broadcasting the Reactionary Right on YouTube,”80 as a crucial text for understanding how audiences move from mainstream to extremist content and how extremist messages are perpetuated. They added that algorithmic transparency from platforms like YouTube could greatly enhance these types of projects and provide a clearer picture of the choices users have when selecting which content to engage with. Experts said that algorithmic opacity also prevents researchers from uncovering ways that algorithms reinforce biases, especially along gender and racial lines, and it gives rise to accusations of censorship.

“Just having that transparency from the corporate level would help appease a lot of the conspiracy theories that I think are thriving right now,” said Francesca Tripodi, assistant professor of sociology at James Madison University and affiliated researcher with the Data & Society Research Institute, who studies media manipulation.81 This opacity is primarily in place to protect intellectual property and corporate financial interests, Tripodi added, but it’s also because curatorial algorithms often aren’t understood even by those creating them.

A few experts pointed to Safiya Umoja Noble’s book, Algorithms of Oppression: How Search Engines Reinforce Racism,82 which details the myriad ways that search engines reinforce privilege and discriminate against people of color, especially women of color, as an example of work that addresses how search biases influence information consumption. While many tech companies have tweaked existing policies to increase transparency—for example, last year Facebook modified procedures for informing page managers when content was removed83 and published information on how news feeds are personalized84 —experts believed that these efforts did little to meaningfully illuminate automated processes and remove barriers for study.

Some experts mentioned that better protocols for algorithmic auditing could help researchers in this field find and eliminate bias issues. While some data scientists like Cathy O’Neil have created their own algorithmic auditing consultancy groups, either independently or within established organizations, others have pushed for “right to explanation” provisions similar to those embedded in the European Union’s General Data Protection Regulation (GDPR) that would require those deploying decision-making algorithms to provide an explanation to affected users of why the algorithm made the choice it did.85

In addition to calling for greater algorithmic transparency, experts also wanted greater transparency around how tech companies create and enforce content and speech policies. Kate Klonick, assistant professor of law at St. John’s University and affiliate fellow at the Information Society Project at Yale Law School and New America, authored one of the first analyses on how platforms moderate online speech through policy and the procedural systems they use to develop those policies.86 In an interview, she described the lack of knowledge around free speech and comment moderation policies as “kind of like if we didn’t have the story of the Constitution and the Founding Fathers and the American Revolution and were asked to just understand and buy into a system that surrounded us without having any idea how it got put into place.”87

Klonick added that the need for transparency around speech policies is especially important as platforms continue to apply U.S.-centric policies to an ever-widening base of international users.

5. Despite the challenges, many experts believe in the Internet’s potential to promote democracy and strengthen civic engagement, though everyone who spoke on this topic said that online spaces and the tech sector at large have a long way to go to achieve these positive results.

Much of the research around how the Internet has transformed the political playing field examines how online spaces, tools, and overall designs have been leveraged intentionally and unintentionally to hinder democratic practice. Some (but definitely not all) researchers interviewed for this report also highlighted the opposite—the potential ways that online spaces could be optimized to amplify voices of positive change, disseminate accurate information quickly, debunk disinformation, and support healthy civic engagement in ways that reduce inequalities.

These experts pointed to several positive ways that online spaces have influenced civic engagement, including making political issues more accessible to younger audiences, leveraging personal networks to increase voter turnout, launching pro-democracy protest movements, expanding accessibility for those who cannot participate politically in-person, and broadening the definition of civic engagement. They also spoke extensively about what could be—specifically, how online spaces and tools could be optimized for political inclusion, help build civic-minded communities, fact check political figures, support positive collective action, promote accurate news, and increase faith in trustworthy voices and institutions. It is worth noting that no expert who spoke about the Internet’s ability to strengthen democracy said that the current landscape is adequately fulfilling this potential or that they believed that this potential would be fulfilled any time in the near future.

“We don’t really know the pro-democracy uses of social media because we haven’t tried,” said Ethan Zuckerman, who has written extensively on this topic.88 “We don’t see a lot of thoughtful, meaningful deliberation on social media, but we haven’t optimized platforms for that.”

Zuckerman said that there are pockets of the web that either currently serve or have served as examples of spaces that are optimized to promote democracy and reason-based political discourse. He mentioned moderated sites like Parlio, a discussion platform that was dedicated to promoting civil opinion-sharing and debate, as one example of an online space designed to encourage fact-based ideological diversity. Parlio threads are still available online, but the site has not been updated in several years.

Zuckerman added that social media platforms are currently designed to connect users to people with whom they share things in common, but they could be optimized to counteract political polarization and break echo chambers by also connecting users with voices they might not otherwise encounter. There are several tools designed to do this, including the Center for Civic Media’s Gobo tool, which allows users to filter their social media feeds in several ways, one of them being diversification of political perspectives.89

Some experts spoke of the inherent conflict between the goals of advertisement-driven platforms and spaces that serve public, rather than corporate, interests. A few experts said that creating pro-democracy platforms might require public funding, a conscious move away from for-profit business models, and stricter rules on appropriate advertising within those spaces. It would also require thinking critically about how to measure healthy civic engagement, inclusivity, and empowerment within these spaces.