The Data Driving Democracy

Data for Digital Age Democracy

Christina Couch
Commission on the Practice of Democratic Citizenship

An article published in July of 2019 by academics from the University of California, San Diego, and the University of Massachusetts, Amherst, referred to the large quantity of new data available to leaders in the civic engagement space as the “Civic Data Deluge.”7 It’s an imperfect term that captures the vast amount of “big data” that is now available to researchers, but leaves out the tsunami of information that isn’t accessible because of technical, legal, and proprietary ownership barriers.

To measure how civic engagement plays out on- and offline, researchers use an enormous array of data resources, each of which comes with its own content limitations and, in many cases, accessibility challenges. Cataloging every individual data resource would be exhausting to read, not to mention impossible, as these resources are constantly changing. This section briefly reviews the types of data that drive research in this arena and how academics obtain them.

Much of the research in this field, especially studies of political messaging, fake news and disinformation, online campaign reach, and media manipulation, centers on social media data from widely popular channels like Facebook, Twitter, and YouTube, and to a lesser extent community-specific channels like Gab,8 a free speech–focused social media platform known for attracting racists and far-right extremists. Experts interviewed for this report noted that social media is particularly valuable to civic engagement research because it offers immediate ways to observe how online movements catalyze and mobilize,9 measure the reach of media outlets,10 draw inferences based on how users casually talk about politics,11 and follow how political messages move and morph throughout the digital ecosphere.12

Henry E. Brady, dean of the Goldman School of Public Policy at the University of California, Berkeley, gave an example of how social media can streamline research.13 For two of his books, Brady and his coauthors tracked messaging from a broad spectrum of political lobbying groups.

“In the past, if you wanted to study that, what you would have had to have done is somehow figure out what the lobbying groups were, subscribe as a member, and then maybe get the information they sent you via mail and then content analyze that,” he explained, later adding that using this method, it was not possible to single out messages from individual lobbyists working within an organization. “Instead, we just got all the Twitter feeds from these organizations and then we content analyzed them using data science methods and came to some conclusions about what was going on. I mean, that’s an enormously useful thing to know.”

Social media data often include data from the offline world as well. Facebook, for example, has location tracking capabilities that are on even when users aren’t using the app.14 Google came under fire in 2018 for storing location data from Android and iPhone devices even when users opted out of sharing location information.15 Instagram offers shopping checkout that gathers data on purchases made through partner merchants.16 Experts described the breadth of digital trace data gathered by social media and search engine platforms as “enormous” and “mind-boggling,” noting that it includes information on users’ physical activities, consumer behaviors, and information consumption.

The exact data that researchers can access from social media vary dramatically from platform to platform, which is a primary reason why these types of civic data studies heavily skew toward more public and accessible platforms and away from more restrictive ones. Academics typically obtain social media data in one of two ways: 1) through an API (Application Programming Interface), which is a tool companies offer that allows third parties to access a limited set of data curated by the company, and 2) through scraping, a process wherein extraction tools are used to go through sites and pull unstructured data specified by the researcher.

Both methods, alone and in combination with collection and sentiment analysis tools (a partial list of about one hundred of these tools is available in a report by Lily Davies17), are incredibly powerful, and both come with challenges. APIs are generally designed for third-party developers rather than academics, and they vary in the types of data that can be accessed. Since APIs are designed and controlled by tech companies, they can (and do) change,18 which can disrupt ongoing research. APIs also frequently come with hefty fees and “rate limits”19 that restrict the number of data requests parties can make within a certain timeframe. Data scraping provides an alternative route to data access. It is often technically challenging and time-consuming for researchers, but the primary barrier is legality. Many platforms prohibit automated data scraping20 and include language in their terms of service that severely limits how researchers can use public data that have been pulled from the site without automated software. In some cases, researchers can also purchase some consumer spending data through third-party aggregation firms; however, in these instances, researchers do not have access to the raw data and must work with information that has already been manipulated by an outside entity. A further discussion of these issues is in the Barriers and Challenges section of this report.
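To make the idea of a rate limit concrete, the following is a minimal Python sketch of how a research script might pace its own requests so that it stays under a platform's quota. It is an illustration only: the class name `RateLimitedClient`, the quota values, and the `fetch` callable are hypothetical and do not correspond to any particular platform's API, and real APIs publish their own limits and enforce them in their own ways.

```python
import time
from collections import deque
from typing import Callable, Deque


class RateLimitedClient:
    """Client-side guard that keeps requests under a sliding-window quota.

    Hypothetical helper: max_requests and window_seconds stand in for
    whatever limits a given platform documents for its API.
    """

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent: Deque[float] = deque()  # timestamps of recent requests

    def wait_time(self) -> float:
        """Return how many seconds to wait before the next request is allowed."""
        now = self._clock()
        # Forget requests that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        # The quota is full: wait until the oldest request leaves the window.
        return self.window_seconds - (now - self._sent[0])

    def request(self, fetch: Callable[[], object]) -> object:
        """Run fetch() once the quota allows it, then record the call."""
        delay = self.wait_time()
        if delay > 0:
            self._sleep(delay)
        self._sent.append(self._clock())
        return fetch()
```

With a quota of, say, two requests per ten seconds, a third request issued four seconds after the first would pause for six seconds before running. The clock and sleep functions are passed in as parameters so the pacing logic can be checked without actually waiting. Pacing of this kind only keeps a script within a quota; it does not address the fees, API changes, or terms-of-service restrictions described above.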

Studies that revolve around the Internet’s impact on democracy also mine data from myriad other online networks, including search engines, comment boards, wikis, blogs, chat apps, location-based programs, mainstream media and government sites, online petition sites, and large datasets provided by government agencies, nonprofits, nongovernmental organizations, and private companies. Though some organizations like Google21 and Democracy Works22 offer their own civic information-specific APIs, many of the same access challenges and limitations apply to these resources as well. Experts said that data structure also plays a role here. Because platforms vary in terms of what data are available and how they are organized, researchers have an incentive to focus on some platforms over others.

“If you look at Twitter, tweets have a nice structure to them and there are certain kinds of metadata associated with a tweet, like when was it tweeted and who tweeted it. A tweet is a fairly neat object to work with and that scales up when you have lots of tweets,” said David Lazer, co-director of Northeastern University’s NULab for Texts, Maps, and Networks and professor of political science and computer and information science.23 “Reddit has threads and there are lots of reddit [channels] so it’s just a more complicated data structure. Reddit is this sort of very threaded thing that’s much tougher to work with.”
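The structural difference Lazer describes can be sketched in a few lines of Python. The two record types below are simplified, hypothetical stand-ins (the field names are assumptions, not either platform's actual schema): a tweet-like post is a single flat record, while a threaded comment contains replies that may themselves contain replies, so it has to be walked recursively before it fits into the kind of table most analysis tools expect.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class FlatPost:
    """A tweet-like record: one object with a handful of metadata fields."""
    post_id: str
    author: str
    created_at: str  # ISO 8601 timestamp
    text: str


@dataclass
class ThreadedComment:
    """A comment that can hold replies, which can hold replies in turn."""
    comment_id: str
    author: str
    created_at: str  # ISO 8601 timestamp
    text: str
    replies: List["ThreadedComment"] = field(default_factory=list)


def flatten_thread(
    root: ThreadedComment,
    parent_id: Optional[str] = None,
    depth: int = 0,
) -> Iterator[Tuple[str, Optional[str], int]]:
    """Yield (comment_id, parent_id, depth) rows in depth-first order."""
    yield (root.comment_id, parent_id, depth)
    for reply in root.replies:
        yield from flatten_thread(reply, root.comment_id, depth + 1)
```

A list of `FlatPost` objects is already one row per item. A thread needs the extra `parent_id` and `depth` columns produced by `flatten_thread` to preserve who replied to whom, which is one reason threaded platforms take more work to analyze at scale.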

The field isn’t entirely reliant on digital data and quantitative analysis. Journalistic research, like Kate Klonick’s analysis of how platforms develop procedural systems for governing online speech,24 and ethnographic research, such as Francesca Tripodi’s report on media messaging and interpretation within conservative Christian communities,25 offer nuanced answers to questions that quantitative methods can’t fully address. Some work, such as Jen Schradie’s research on class divides within digital political participation,26 uses a mixed methods approach. Researchers noted that there are fewer qualitative than quantitative studies in this field, in part because of the time and cost of conducting good qualitative studies and in part because of the general trend in social science research toward quantitative analysis. Many experts interviewed for this report spoke of a need for more qualitative research in this field. This report focuses primarily on research that relies on quantitative data, and thus the many fine qualitative studies that do exist are not discussed as much as they might be in a longer report.

Both quantitative and qualitative projects often use digital data in tandem with resources relied upon before the digital age, such as voter registration data, phone surveys, polls, public records, judicial opinions, campaign contribution data, census research, and so on. The Internet has also transformed how these traditional data resources are used and what they mean. For example, response rates for telephone public opinion polls fell from 37 percent in the late 1990s to just 6 percent as of 2018.27 Low response rates translate to higher survey costs and reduced sample sizes as well as a bias toward older demographics. For these reasons, major polling organizations such as the Pew Research Center28 now conduct the majority of U.S. polling online, but data from online surveys aren’t entirely comparable to phone polling data.29 In opinion polls that involve speaking with a live interviewer, respondents are significantly more likely to give answers that sidestep awkward interactions and that frame themselves and their communities in positive ways, which skews research centered on socially undesirable attitudes and behaviors.30 Respondents also answer complex questions differently when they hear the question being read to them versus reading it themselves.31 This is to say that the work of this field involves understanding not only what the available digital data can reveal about how citizens engage with politics, but also how new modes of political participation have forced research methodologies to evolve. The reliance on research models that existed before the rise of the Internet presents another challenge: that of understanding what it is that we want to measure in the first place. A focus on getting access to data, whether from phone surveys or from online social media platforms, privileges the questions that these sorts of data can answer. Interrogating the foundations of how we think about civic and political participation is necessary if researchers are to better understand the health of our civic life.