QnA with AP Bestavros: Part II
*Note that this is part 2 of a series of interviews with Associate Provost Bestavros. Stay tuned for the next installment. Part 1 is available here.
Last time we spoke with Azer Bestavros about the power of data and how it can be helpful in evaluating policies or policymaking. Today, we will dive into the differences between data science and data collection/visualization and how the former can be used to uncover evidence about racial disparities (among other things).
Q: A major thrust of CDS’ impact seems to be on racial equity, given the establishment of the Justice Media co-Lab in collaboration with COM, and the Racial Data co-Lab in collaboration with CAR. Can you give examples of how data science – as opposed to data collection and visualization – can be used to uncover evidence about racial disparities?
Azer Bestavros: The analysis of data collected through various processes (such as the laws requiring police to report on the demographics of the people to whom they issue citations during vehicle or pedestrian stops) could be used to uncover racial disparities (and a lot more), for example. First, notice that the data collected from a single process may not be enough to uncover patterns; rather, it is the linkage of this data with (and the analysis of this data in the context of) other data sets that reveals the patterns. And this is where data science comes into the picture. To do the linkage or analysis correctly, one needs to deploy an arsenal of fairly elaborate computational techniques. For example, the data may have different modalities (e.g., databases, free text, audio, images, video), the data may be noisy or incomplete, the data may be subject to significant privacy constraints, and so on. This is where techniques from a variety of computing and data science fields come into play.
Q: Can you elaborate on why it is important to consult and analyze data from multiple sources? What’s wrong with looking at data that, say, a single hospital has about its patients? Why go beyond the data we have?
To be clear, there is nothing “wrong” about looking at data from a single source. The question is whether you would be getting the full picture, or for that matter an accurate (as opposed to a biased or incomplete) picture.
That said, diversifying the sources of data is important because it is seldom the case that one can come up with insights from looking at data coming from a single source. It is quite the opposite. It is only when you look at data from many sources that you can start to see the patterns that are hidden in the various pieces. For example, if you look at individual hospital data, you may not pay too much attention to a handful of cases that look like the flu. But when you aggregate data from lots of hospitals, what may appear as a little blip in one hospital may be a pandemic in the making. This is a trivial example, but in general, this is true. It’s only when you combine data from health insurers, hospitals, education, transportation, social media, and more that you can come up with insights into how to approach the health inequities that have become so clear with COVID.
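The hospital example can be made concrete with a minimal sketch in Python. The hospital names and weekly counts below are invented purely for illustration and are not real data; the helper name `aggregate_weekly` is likewise hypothetical. The point is only that a rise that looks like noise at any one site becomes visible once the sites are summed.

```python
# Hypothetical weekly counts of flu-like cases at three hospitals.
# These numbers are made up for illustration only.
hospital_counts = {
    "hospital_a": [2, 3, 2, 4],
    "hospital_b": [1, 2, 3, 3],
    "hospital_c": [2, 2, 3, 5],
}


def aggregate_weekly(counts_by_hospital):
    """Sum the week-by-week counts across all hospitals."""
    # zip(*...) groups the week-1 counts together, then week-2, and so on.
    return [sum(week) for week in zip(*counts_by_hospital.values())]


totals = aggregate_weekly(hospital_counts)

# No single hospital changes by more than a few cases from week to week,
# but the combined series more than doubles over the four weeks.
for week, total in enumerate(totals, start=1):
    print(f"week {week}: {total} cases across all hospitals")
```

Real surveillance work would of course also have to reconcile differing case definitions and reporting delays across sites, which this sketch ignores.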
Q: You seem to be implying that diversifying data sources may protect us from reaching incorrect conclusions. Can you please elaborate?
The diversity of data sources is actually one important attribute of the “Big Data” buzzword. People often think that what makes “Big Data” is the amount of data we have. That is only one attribute. In addition to the “volume” of data, the three other important attributes to consider are the “velocity”, “variety”, and “veracity” of the data. Collectively, these attributes are called the 4 Vs of Big Data. Velocity refers to how fast the data is being generated and accumulated. Traditional data analysis involves static data stored in some database, whereas data scientists deal with real-time data – like drinking from the firehose! Variety refers to the different sources of data and the different modalities of the data. Traditional data analysis assumes that the data is well-curated and nicely consistent, whereas today’s data scientists have to deal with heterogeneous data that use different naming standards, granularities, and taxonomies. Veracity is about establishing confidence in data that is often noisy, incomplete, biased, and even polluted by adversaries!
Here is the good news: the only way to boost veracity is through increased variety of data (i.e., diversity of sources). Let me put it this way: variety is to data science what random sampling is to statistics.
Q: What are examples of projects pursued by CDS faculty and students that exemplify going beyond simple trackers?
BU Spark! has many student projects that are great examples of this. These projects are done either independently by students (with BU Spark! supervision) or as part of a faculty member’s larger project.
This project, done during Spark!’s Resiliency Challenge, tracks racism on Twitter. Specifically, the team of students looked for trends in sinophobic rhetoric in the context of the COVID-19 pandemic. This work stemmed from original research by Gianluca Stringhini (a faculty member in ECE/ENG and a CDS-affiliated faculty member), with whom the students worked closely. You can learn more in this BU Today story or head to their website.
Another favorite student project of mine used various approaches to build a data set that shed light on the corrupting influence of political campaign contributions on policing practices. This project produced evidence that was cited in Beacon Hill hearings on police reforms.
Q: Going back to the question of mining data for evidence, you indicated that one may need multiple sets of data to uncover patterns, in the sense that it is only when you bring all the data together that the picture emerges. Is it important to put data together in one place for the evidence to be unearthed? Do you mean literally one place – as in a database stored in a large data center at BU?
No, not necessarily in one physical place. What I meant is that to solve problems, such as those to be considered by the Racial Data co-Lab or the Justice Media co-Lab, one often needs access to multiple pieces of data in order for the computational and data analysis to make sense of it all. Whether all the data is in one place (physically speaking) is not important/relevant. What is important and relevant is the ability to have access to the data so that it can be combined/linked/analyzed for insights/evidence, etc.
As a matter of fact, one of the most interesting aspects of data science is its ability to “distribute” the computation so that you don’t have to put all the data in the same place, which may be infeasible (e.g., due to the sheer size of the data – think about census data) or illegal (e.g., due to regulatory constraints – think of HIPAA and FERPA). This is where technologies and platforms such as cloud computing come to the fore!
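One very simple way to picture “distributing” a computation is sketched below in Python. It is an illustration under stated assumptions, not a description of any CDS system: the site names and values are invented, and the functions `local_summary` and `combine` are hypothetical. Each site computes a small summary of its own records, and only those summaries travel to a coordinator, which combines them into an overall average.

```python
def local_summary(values):
    """Runs at each site: return only (sum, count), never the raw records."""
    return (sum(values), len(values))


def combine(summaries):
    """Runs at the coordinator: merge per-site summaries into a global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    if count == 0:
        raise ValueError("no records at any site")
    return total / count


# Made-up measurements held separately at three sites (illustrative only).
site_a = [4.0, 6.0]
site_b = [10.0]
site_c = [2.0, 3.0, 5.0]

summaries = [local_summary(v) for v in (site_a, site_b, site_c)]
global_mean = combine(summaries)

# The result matches what pooling every record in one place would give.
pooled = site_a + site_b + site_c
print(global_mean, sum(pooled) / len(pooled))
```

Sharing sums and counts is not by itself a privacy guarantee, since small groups can still be identifiable; meeting constraints like HIPAA or FERPA generally calls for stronger techniques than this sketch shows.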
Q: Looking at the COVID racial data tracker as an example of a game changer in terms of our understanding of the virus and who it was impacting, a lot of people may think this is fairly straightforward: you get the national stats on the race/ethnicity/geographic location of those who are getting infected and dying and you publish them. Why is that so complicated? Why is data science needed for this?
I agree! It is not complicated, and you do not need “data science” to have dashboards that highlight stats from a spreadsheet. To me, as a computing and data scientist, it is about how much *more* you can do beyond simply tracking a statistic. The value of the COVID Racial Data Tracker (and any other trackers we may want to develop for other aspects of society) is simply to motivate us to look deeper and to deploy the arsenal of tools and techniques at our disposal (and develop new ones if necessary) to get to the root causes of what the tracker reveals.
Q: Can you elaborate then on what is needed in order to “look deeper” in order to “get to the root causes”? What critical data science skills do social scientists look to CDS for?
Beyond raising awareness about inequities with simple visualization and dashboards, if we have any hope of moving the needle on racial injustices, we must be able to connect these inequities to their root causes in policies and practices by providing convincing evidence anchored in data. Success in this endeavor depends on computational and data-driven capacities from collection, curation, and secure warehousing of individual data sets, to linking, integration, and visualization of multiple data sets, to processing, mining, and analysis of various data pipelines, to predictive modeling, hypothesis testing, and simulation of underlying socio-economic processes, among others.
Unfortunately, it is not enough for social scientists to simply hire a practitioner, use some off-the-shelf software platforms such as R or Tableau, or use your favorite “data science for dummies” tools. What social science experts need is well beyond “skills” and this is where collaboration with CDS constituents is important. CDS experts (faculty and students) have an arsenal of approaches and mechanisms that they can put together in very creative and innovative ways. Beyond the popular dimensions of data science that everybody talks about, like machine learning and predictive modeling, other approaches and mechanisms are equally if not more important, such as software and data engineering, information system security, data privacy, natural language processing, video and image analysis, etc.
Q: What you said about collecting data from various modalities and sources is interesting. Can you give an example of how data collected for one purpose may help answer questions that go well beyond the original reason for collecting the data?
Imagine how one might go about scoring different neighborhoods in terms of how walkable they are, how well-maintained their streets are, how much tree cover they have, what kinds of cars are parked there, what types of businesses are nearby, or the state of repair of their buildings. To do this, we can use Google Street data. We have 15 years of Google Street data with multiple versions of the images for the same street per year. We can use approaches from computer vision and machine learning to analyze the images and produce scores for “walkability” or “tree cover” or “quality of pavement” over time. We can then correlate these scores with health outcomes or educational attainment numbers or real-estate values, for example, and then compare all these metrics for different neighborhoods (wealthy vs. poor) over time to see if things are improving or not. Or maybe we use all that in order to study the impact of gentrification. But guess what? Google Street data was not collected to answer these questions! It was collected to help people with navigation using Google Maps. Now, through creative use of computational tools and techniques, this same data set can contribute as one of many data modalities that begin to answer questions about gentrification or the correlation between public health and quality of streets.
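The “correlate these scores with health outcomes” step can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tree-cover scores and health index values are invented, one pair per hypothetical neighborhood, and `pearson` is a plain textbook Pearson correlation rather than any particular method used at CDS. Producing the scores from street images in the first place is the hard computer-vision part and is not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values for four hypothetical neighborhoods (illustrative only).
tree_cover_score = [0.1, 0.3, 0.5, 0.7]   # e.g. fraction of image pixels that are trees
health_index = [52.0, 58.0, 61.0, 70.0]   # e.g. some composite health outcome

r = pearson(tree_cover_score, health_index)
print(f"correlation between tree cover and health index: {r:.2f}")
```

A strong correlation like this one would be a prompt for further questions, not evidence of a cause, which is why the domain experts' judgment about what to ask next matters.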
Q: But who comes up with these ideas? How do you decide what questions to ask?
While we (CDS faculty and students) can think creatively about interesting analyses which might lead to intriguing questions worth asking – and there are literally dozens if not hundreds of analyses that we can do for a given data set – we also rely on colleagues and collaborators whose domain expertise zooms in on the questions that are worth asking. Answering these questions will allow these experts to gain insights, which may get them to ask even more questions, and the cycle continues. If we (CDS) work by ourselves, we may not know what questions are worth asking and we may even pose the wrong questions altogether. If experts work on their own, they will not be able to go beyond the surface of showcasing some simple trackers or heat maps of various neighborhoods.
Take the example I mentioned earlier about correlating demographics collected as part of traffic stops by police. If doing analysis on racial dimensions of traffic stops is important to antiracist researchers, they will point us in that direction and we will get to work on that. But they can equally direct us to look at disparities in online capacities for public schools in different communities, or at the correlation between noise pollution and education outcomes for different geographic locales. And there is a lot more…
Q: In announcing the Racial Data co-Lab, Dr. Ibram Kendi referred to “racial data science” as a new and emerging field and said he wants to make BU the premier university center of the field. Do you agree?
I am not sure I would call it a “new field” or a “new discipline” since it is really about two (and more) research areas coming together – it is an interdisciplinary/multidisciplinary field. That said, I agree that the combination of CDS and CAR gives BU a unique chance to be a national center of gravity when it comes to the research and education at the nexus of Data Science and Race.
— Join us for the final installment of this series to learn more about the collaboration with CAR!