Are Computer-Aided Decisions Fair?
Removing bias from the computer algorithms shaping our major life choices
Algorithms are used to determine credit scores, which can mean the difference between owning a home and renting one. They are used in predictive policing, which suggests the likelihood that a crime will be committed, and in scoring how likely a criminal is to reoffend, a score that can influence the severity of sentencing. As more decisions of greater importance are made by computer programs, the potential for harm grows. And the design of many algorithms is far from transparent, says Adam Smith, a professor of computer science.
“A lot of these systems are designed by private companies and their details are proprietary,” says Smith, who is also a data science faculty fellow at the Rafik B. Hariri Institute for Computing and Computational Science & Engineering. “It’s hard to know what they are doing and who is responsible for the decisions they make. They are increasingly complex, and they are often hard to understand for laypeople and for the people about whom decisions are being made.”
Recently, Smith and a joint team of BU-MIT computer scientists examined this problem, hoping to learn what, if anything, can be done to understand and minimize bias in decision-making systems that depend on computer programs.
The BU researchers—Smith, Ran Canetti, a professor of computer science and director of the Hariri Institute’s Center for Reliable Information Systems & Cyber Security, and Sarah Scheffler (GRS’23), a computer science doctoral candidate—worked with MIT PhD students to design systems whose decisions about all subsets of the population are equally accurate. They presented their work at the 2019 Association for Computing Machinery Conference on Fairness, Accountability, and Transparency.
The researchers believe that a system that discriminates against people who have had a hard time establishing a credit history will perpetuate that difficulty, limiting opportunity for a subset of the population and preserving existing inequalities. What that means, they say, is that automated ranking systems can easily become self-fulfilling prophecies, whether they are ranking the likelihood of default on a mortgage or the quality of a university education.
“Once you’ve got the same computer program making lots of decisions, any biases that exist are reproduced many times over on a larger scale,” Smith says. That problem, the researchers say, will get worse as future algorithms use more outputs from past algorithms as inputs.
“The interaction between the algorithm and human behavior is such that if you create an algorithm and let it run, it can create a different society because humans interact with it,” says Canetti. “So you have to be very careful how you design the algorithm.”
But how exactly can an algorithm, which is basically a mathematical function, be biased?
Scheffler suggests two ways: “One way is with biased data. If your algorithm is based on historical data, it will soon learn that a particular institution prefers to accept men over women. Another way is that there are different accuracies on different parts of the population, so maybe an algorithm is really good at figuring out if white people deserve a loan, but it could have a high error rate for people who are not white. It could have 90 percent accuracy on one set of the population and 50 percent on another set.”
“That’s what we are looking at,” says Smith. “We’re asking, ‘How is the system making mistakes?’ and ‘How are these mistakes spread across different parts of the population?’”
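To make the kind of gap Scheffler describes concrete, here is a minimal sketch, in Python, of a per-group accuracy check. The groups, predictions, and outcomes below are invented for illustration; they are not drawn from Compas or any deployed system.

```python
# Illustrative only: the records below are invented, not from any real system.
# Each record is (group, predicted_label, true_outcome).
from collections import defaultdict

records = [
    ("group_one", 1, 1), ("group_one", 0, 0), ("group_one", 1, 1), ("group_one", 0, 0),
    ("group_two", 1, 0), ("group_two", 0, 1), ("group_two", 1, 1), ("group_two", 0, 0),
]

correct = defaultdict(int)
total = defaultdict(int)
for group, predicted, actual in records:
    total[group] += 1
    correct[group] += int(predicted == actual)

for group in sorted(total):
    print(f"{group}: accuracy = {correct[group] / total[group]:.0%}")
# A single overall accuracy number can hide gaps like this: here the
# classifier is 100% accurate on one group and only 50% accurate on the other.
```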
In May 2016, reporters from ProPublica, a nonprofit investigative newsroom, examined the accuracy of Compas, one of several algorithmic tools used by court systems to predict recidivism, or the likelihood that a criminal defendant will commit another crime. Compas gives people a risk score from 1 to 10, and those scores can be translated into probabilities: a score of 8 is the equivalent of a 70 percent chance of recidivism. Those with a risk score of 8 to 10 were considered “high risk.”
When ProPublica researchers compared the tool’s predicted risk of recidivism with actual recidivism rates over the following two years, they found that, in general, Compas got things right 61 percent of the time. They also found that predictions of violent recidivism were correct only 20 percent of the time.
More troubling, they found that black defendants were far more likely than white defendants to be incorrectly flagged as likely to reoffend, while white defendants were more likely than black defendants to be incorrectly labeled a low risk to recidivate. According to ProPublica’s article, this was a clear demonstration of bias by the algorithm.
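The quantity ProPublica compared is often called error-rate balance: among people who did not reoffend, how often were they wrongly flagged as high risk (the false positive rate), and among people who did reoffend, how often were they wrongly cleared (the false negative rate)? Below is a hedged sketch of that check, using the 8-to-10 “high risk” cutoff described above and invented scores rather than real Compas data.

```python
# Hypothetical records: (group, risk_score, reoffended_within_two_years).
# Invented for illustration; not actual Compas or ProPublica data.
HIGH_RISK_CUTOFF = 8  # scores of 8 to 10 are treated as "high risk"

records = [
    ("white", 3, False), ("white", 9, True), ("white", 4, True), ("white", 2, False),
    ("black", 8, False), ("black", 9, True), ("black", 6, True), ("black", 8, False),
]

def error_rates(rows):
    """Return (false positive rate, false negative rate) for one group."""
    flagged = lambda score: score >= HIGH_RISK_CUTOFF
    fp = sum(1 for _, s, y in rows if flagged(s) and not y)
    tn = sum(1 for _, s, y in rows if not flagged(s) and not y)
    fn = sum(1 for _, s, y in rows if not flagged(s) and y)
    tp = sum(1 for _, s, y in rows if flagged(s) and y)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr

for group in ("white", "black"):
    fpr, fnr = error_rates([r for r in records if r[0] == group])
    print(f"{group}: false positive rate {fpr:.0%}, false negative rate {fnr:.0%}")
```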
In response, Northpointe Inc., the creator of Compas, published its own study arguing that the Compas algorithm is in fact fair under a different statistical measure. Northpointe’s software is widely used, and like many algorithmic tools, its calculations are proprietary, but the company did tell ProPublica that its formula for predicting who will recidivate is derived from answers to 137 questions, drawn either from defendants themselves or from criminal records.
Northpointe’s study found that, at any given risk score, white and black defendants recidivated at roughly the same rate: the fraction of white defendants labeled high risk who went on to reoffend was about the same as the fraction of black defendants labeled high risk who did.
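Northpointe’s measure is different: among defendants labeled high risk, what fraction actually reoffended? In the fairness literature this is the positive predictive value, and equalizing it across groups is often called predictive parity or calibration. A sketch of that check on invented data, in the same hypothetical (group, score, reoffended) format as above:

```python
# Hypothetical records: (group, risk_score, reoffended_within_two_years).
# Invented for illustration; not actual Compas or Northpointe data.
HIGH_RISK_CUTOFF = 8

records = [
    ("white", 9, True), ("white", 8, False), ("white", 3, False), ("white", 4, True),
    ("black", 9, True), ("black", 8, False), ("black", 8, True),  ("black", 8, False),
]

def positive_predictive_value(rows):
    """Among people labeled high risk, the fraction who reoffended."""
    flagged = [(s, y) for _, s, y in rows if s >= HIGH_RISK_CUTOFF]
    return sum(1 for _, y in flagged if y) / len(flagged) if flagged else float("nan")

for group in ("white", "black"):
    ppv = positive_predictive_value([r for r in records if r[0] == group])
    print(f"{group}: P(reoffended | labeled high risk) = {ppv:.0%}")
# Two groups can score equally well on this check while their false positive
# and false negative rates, checked above, still differ sharply.
```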
“ProPublica and Northpointe came to different conclusions in their analyses of Compas’ fairness. However, both of their methods were mathematically sound—the disagreement lay in their different definitions of fairness,” Scheffler says.
The bottom line is that when two groups reoffend at different underlying rates, any imperfect prediction mechanism (algorithmic or human) will be judged unfair by at least one of the two approaches: the error-balancing approach used by ProPublica or the calibration approach favored by Northpointe.
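That tension is not specific to Compas. A well-known result in the algorithmic fairness literature, often credited to Chouldechova and, independently, to Kleinberg, Mullainathan, and Raghavan, ties the two notions together through each group’s base rate of reoffending, $p$. One standard way to write it, where $\mathrm{PPV}$ is the calibration quantity Northpointe equalized and $\mathrm{FPR}$ and $\mathrm{FNR}$ are the error rates ProPublica compared:

$$\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)$$

If two groups have different base rates $p$ but the tool gives them the same $\mathrm{PPV}$, their false positive and false negative rates cannot both match, so an imperfect predictor is forced to violate one fairness notion or the other.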
To address the problem of algorithmic bias, the BU-MIT research team created a method for identifying the subset of the population that a risk assessment system fails to judge fairly and referring those cases to a different system that is less likely to be biased. The goal is for the method to achieve the same accuracy on different subgroups of the population, but there is no guarantee that the two systems will be fair when working together.
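The paper’s construction is more involved than a short article can convey, but the flavor of the idea can be sketched as a simple “defer” wrapper: measure the primary system’s accuracy on each subgroup using held-out data, and route anyone from a subgroup it serves poorly to a second decision-maker. Everything below, including the function names, the 0.8 threshold, and the toy data, is a hypothetical illustration, not the BU-MIT team’s actual algorithm.

```python
# Hypothetical sketch of a "defer to another system" wrapper -- not the
# BU-MIT construction, just an illustration of identifying poorly served
# subgroups and sending their cases elsewhere.

def per_group_accuracy(validation):
    """validation: list of (group, prediction, outcome) on held-out data."""
    totals, correct = {}, {}
    for group, pred, outcome in validation:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(pred == outcome)
    return {g: correct[g] / totals[g] for g in totals}

def route(person_group, accuracies, min_accuracy=0.8):
    """Use the primary model only if its measured accuracy on this person's
    group clears the (hypothetical) bar; otherwise defer."""
    if accuracies.get(person_group, 0.0) >= min_accuracy:
        return "primary model"
    return "secondary system"  # e.g. a human reviewer or a second model

# Hypothetical held-out results for two groups.
validation = [("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
              ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0)]
acc = per_group_accuracy(validation)
print(acc)              # {'A': 1.0, 'B': 0.5}
print(route("B", acc))  # -> "secondary system"
```

As Scheffler notes below, whether such a two-system arrangement is actually fair depends on which notion of fairness the combined pipeline is asked to satisfy.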
“There are many different measures of fairness,” says Scheffler, “and there are trade-offs between them. So to what extent are the two systems compatible with the notion of fairness we want to achieve?”
Still, says Canetti, their research points to a possible way out of the statistical bias conundrum, one that could enable the design of algorithms that minimize the bias.
This work was supported by multiple National Science Foundation awards, a Sloan Foundation Research Award, and a Clare Boothe Luce Graduate Fellowship.