As the predictions for the 2016 presidential election remind us, polling the electorate is an imperfect science. Most polls claimed that Hillary Clinton would be our next president—it seemed a foregone conclusion—and most polls were wrong, although many forecasts for the popular vote were very close, off by less than one percentage point. Election polling has always been inexact. It has also been time-consuming, expensive, and unable to measure the influence of short-lived events, like a candidate’s speech, or to read the electorate of small geographic areas.
Now, two Boston University professors believe they have found an alternative, one that is not only similarly accurate but has the potential to be faster and less expensive, can target areas as small as towns, and can measure the people’s response to specific issues and events. The methodology, which correlates web browsing patterns with public opinion from polls, was developed by two professors at the College of Arts & Sciences: computer scientist Mark Crovella and political scientist Dino Christenson.
The pair worked with Giovanni Comarela of the University of Viçosa (formerly a PhD student at BU under Crovella), Ramakrishnan Durairajan at the University of Oregon, and Paul Barford at the University of Wisconsin–Madison. Barford, who also works for comScore, Inc., a kind of Nielsen ratings for the internet, negotiated an arrangement by which comScore provided the researchers with the web browsing histories of more than 100,000 US residents over the 56-day period preceding the 2016 election.
All the data the researchers used was specifically authorized and released for this kind of research by the users who generated it. The researchers’ analysis of that data—two terabytes’ worth, spanning 70 million websites—showed exactly when and where voters made decisions that led to the election of Donald Trump.
It also suggested that, contrary to popular and expert opinion, a last-minute dip in support for Hillary Clinton was not precipitated by a letter to Congress in which FBI Director James Comey revealed that the FBI had found a new batch of relevant emails on Clinton’s server. Crovella and Christenson’s analysis clearly indicated that support for Clinton began to decline on October 25, 2016, three days before the letter was sent. That doesn’t mean, says Christenson, that the letter had no impact on support for the Democratic candidate. “The previous slippage could have just been a coincidence,” he says. “It may have been a small dip that would have rebounded had it not been for the letter…but the findings certainly cast doubt on the Comey letter as the first mover.”
For Crovella and Christenson, the importance of that finding is its proof that their methodology can measure the influence of single, brief events, such as a particular campaign stop, a Supreme Court decision, or a scandalous news report—a potentially valuable capability for candidates and pollsters.
“Let’s say a candidate flies into a city, makes a speech, and flies out,” says Crovella. “How much of an effect does that have? A typical political poll is too coarse an instrument to measure that. A poll, even one that’s well done, takes three or four days to get a large enough response to be statistically significant. You can’t measure something that had an effect that lasted two days. That’s washed out of the measurement process.”
Similarly, says Crovella, the large numbers needed to give a traditional poll statistical significance prevent it from drilling down on small populations. “Because there are a lot of people participating in our data, we can look at the political leanings of different populations on a localized geographical basis,” says Crovella. “We can do this in a fairly fine-grained way in space and time, because we’ve got records of their browsing behavior, their websites, on a minute-by-minute, hour-by-hour, day-by-day basis.”
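In practice, this kind of fine-grained slicing amounts to bucketing per-user records by place and time. As a purely illustrative sketch (the record format, region names, and leaning scores below are invented for this example, not the study’s actual data), aggregating predicted support by region and day might look like:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical browsing-derived records: (user_id, region, timestamp, lean),
# where lean is an imagined classifier score for Democratic support in [0, 1].
records = [
    ("u1", "Detroit, MI", datetime(2016, 10, 25, 9), 0.82),
    ("u2", "Detroit, MI", datetime(2016, 10, 25, 14), 0.31),
    ("u3", "Detroit, MI", datetime(2016, 10, 26, 10), 0.77),
    ("u4", "Madison, WI", datetime(2016, 10, 25, 11), 0.64),
]

def daily_support(records):
    """Average predicted support per (region, day) bucket."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, region, ts, lean in records:
        key = (region, ts.date())
        sums[key] += lean
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

for (region, day), avg in sorted(daily_support(records).items(), key=str):
    print(region, day, round(avg, 3))
```

Because every browsing event carries its own timestamp and can be tied to a locale, the same records can be re-bucketed by hour or by town without collecting any new data, which is the flexibility Crovella describes.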
Crovella and Christenson also say that their method can gauge big-picture support more accurately than current polling methods do. Their research, “Assessing Candidate Preference through Web Browsing History,” coauthored with Comarela, Durairajan, and Barford, was published in the Proceedings of ACM KDD 2018 in London, UK.
Ultimately, says Crovella, the polling system needs two things: “It needs the records of web browsing, and it needs some kind of initial poll to calibrate the machine-learning component to learn what it’s looking for.”
Calibration was the hard part, as well as the reason that massive computing power was brought to bear. How exactly does one translate website visits into reliable indicators of political leanings? Some websites are clearly biased toward one candidate or party, but many are not. And a visit to a particular site may not necessarily mean that the visitor shares the site’s opinion.
Step one was finding a credible way to determine “ground truth,” the real-world evidence used to train a machine-learning algorithm. Crovella worked backwards, starting, somewhat ironically, with the results of traditional opinion polls.
“Let’s say you have a poll from September 1, and it shows that on this day 60 percent of the people in Michigan are leaning toward the Democratic party. You use that to train a machine-learning algorithm to look at all of the individuals in your data set and decide which of them must make up that 60 percent. Then you have an idea of what a Democratic voter looks like in terms of their website visits. You carry that forward, looking at subsequent visits and asking how the data set is changing. This method was not previously well developed, and we had to find a new way to apply it to data that was as large as what we were studying.”
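The procedure Crovella describes resembles what machine-learning researchers call learning from label proportions: no individual is labeled, but an aggregate (the poll’s 60 percent) constrains the labeling. The toy sketch below is not the paper’s algorithm; it uses invented synthetic features and a simple centroid-based scorer to show the basic loop of pinning the pseudo-labeled fraction to the poll share, refitting, and rescoring:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for browsing features: each row is one user's visit counts
# across a handful of hypothetical website categories.
n_sites = 5
dem_users = rng.poisson(lam=[4, 1, 3, 1, 2], size=(120, n_sites))
rep_users = rng.poisson(lam=[1, 4, 1, 3, 2], size=(80, n_sites))
X = np.vstack([dem_users, rep_users]).astype(float)

poll_share = 0.6  # "ground truth" from a traditional poll: 60% lean Democratic

def fit_from_proportion(X, share, n_iter=10):
    """Iteratively pseudo-label the top `share` fraction of users as
    Democratic-leaning, refit per-class centroids, and rescore everyone.
    A toy learning-from-label-proportions loop, not the study's method."""
    scores = X[:, 0] - X[:, 1]          # crude initial guess
    k = int(round(share * len(X)))
    for _ in range(n_iter):
        order = np.argsort(-scores)
        labels = np.zeros(len(X), dtype=bool)
        labels[order[:k]] = True        # top-scoring k users -> Democratic
        dem_c, rep_c = X[labels].mean(0), X[~labels].mean(0)
        # new score: how much closer a user is to the Democratic centroid
        scores = (np.linalg.norm(X - rep_c, axis=1)
                  - np.linalg.norm(X - dem_c, axis=1))
    return labels

labels = fit_from_proportion(X, poll_share)
print(labels.mean())  # fraction pseudo-labeled Democratic, pinned to 0.6
```

Once the loop stabilizes, the centroids give “an idea of what a Democratic voter looks like in terms of their website visits,” and the same model can be carried forward to score later browsing data.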
Crovella and Christenson point out that, having developed their approach with donated data, they are now devising methods that accomplish the same ends while operating on encrypted data. This will improve user privacy, because no computer other than the user’s own will be able to see a user’s web browsing data.
Unsurprisingly, Crovella and Christenson’s initial analysis taught them a few things about their methodology, as well as the sentiments of voters. They learned, for example, which browsing habits were the best indicators of political leanings. “We found that referrals from social media are very informative,” says Crovella. “We found that if you simply type a search into a browser and click on that link, it’s not as likely to tell us something about your political leanings. But if you follow a link that was referred to you by a friend, that’s likely indicative of your political leanings.”
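One simple way to make “informative” concrete is to ask how far a visit type’s split between Democratic- and Republican-leaning users departs from 50/50. The counts below are invented for illustration, and the measure (a KL divergence from an even split, in bits) is a standard textbook quantity, not the study’s own metric:

```python
from math import log2

# Hypothetical contingency counts: how often each visit type occurs
# among Democratic- vs Republican-leaning users.
#                  (dem_visits, rep_visits)
visit_types = {
    "search_click":    (500, 480),   # near 50/50 -> weakly informative
    "social_referral": (700, 250),   # skewed     -> strongly informative
}

def informativeness(dem, rep):
    """Bits of information one visit of this type carries about leaning:
    KL divergence of the observed dem/rep split from a 50/50 split."""
    p = dem / (dem + rep)
    return sum(q * log2(q / 0.5) for q in (p, 1 - p) if q > 0)

for name, (d, r) in visit_types.items():
    print(name, round(informativeness(d, r), 4))
```

Under this toy measure, the skewed social-referral counts score far higher than the near-even search clicks, mirroring the pattern Crovella describes.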
What’s next? Crovella and Christenson plan to build a web-based tool that will make their technology and methodology available to other social scientists and public opinion researchers. Crovella says they would like to build a system that social scientists can use to answer questions “like if someone goes to Chicago and gives a speech, how much does it move the needle and how long does it stay moved?”
“I would like to have a web API where any academic researcher could go on any day to query public opinion,” says Christenson. “One could type in their outcome of interest as well as the geographic area of the country and period of time, and in return get estimates of the related public opinion dynamics in real time. The applications are potentially quite broad. You could look at the public’s position on candidates, representatives, policy issues, even local events, like campaign stops or school board elections, assuming there is an underlying partisan or ideological dimension, and you wouldn’t have to spend tens of thousands of dollars on a poll or even have a poll in the field for the time period or region of interest.”
Perhaps because he is a longtime observer of political polls and a trained survey researcher, Christenson is sympathetic to the shortcomings of traditional polls.
“There is going to be error whenever you try to generalize,” he says. “And when there’s an electorate that’s as divided as the United States, it’s not surprising that polls would be off, especially by small margins in locales where we don’t have a great deal of data collection.” Still, he suggests, public opinion is too important to be marred by the limitations and costs of polls, at least if there is a way to improve upon them. And now there just might be.