There is a lot of controversy in business circles over whether companies are using artificial intelligence (AI) technology for unethical purposes, knowingly or unknowingly. This post isn’t about that; it’s about what ethical AI means for model development.
One of the most common misperceptions I hear about bias is, “If I don’t use age, gender or race, or similar factors in my model, it’s not biased.” Unfortunately, that’s not true.
From a data scientist’s point of view, ethical AI is achieved by taking precautions to expose what the underlying machine learning (ML) model has learned, and whether it could impute bias. At first glance, the precautions often taken to isolate the input data fields used by models may seem sufficient. However, the model’s latent features, which combine the inputs, are difficult to interpret, so it is hard to tell whether they inject bias. Upon deeper inspection, the model often produces outcomes that are biased toward a particular class. (Here I am referring to data class, not socioeconomic class.)
Bias and Confounding Variables
Machine learning learns relationships in data to fit a particular objective function (or goal). It will often form proxies for avoided inputs, and these proxies show bias. Bias is exposed when “confounding variables” cause these proxies to be more activated by one data class than by another, driving the model to produce biased results.
For example, if a model includes the brand and version of an individual’s mobile phone, that data can be related to the ability to afford an expensive cell phone, a characteristic that can impute income. If income is not a desirable factor to use directly in the decision, imputing that information from data such as the type of phone, or the value of purchases the individual makes, introduces bias into the model. This is because, on average, affluent customers can afford higher-end, more expensive phones than non-affluent customers.
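The proxy effect can be sketched with a small simulation. Everything here is an illustrative assumption rather than real data or a real model: the synthetic customers, the price distributions, the 650 cutoff, and the hypothetical `model_approves` rule. The point is only that a model which never sees income can still split outcomes along income lines.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def make_customer():
    """Synthetic customer; all numbers here are illustrative assumptions."""
    affluent = random.random() < 0.5
    # Assumed relationship: affluent customers tend to own pricier phones.
    phone_price = random.gauss(900 if affluent else 400, 150)
    return {"affluent": affluent, "phone_price": phone_price}

def model_approves(customer):
    """Hypothetical model: it sees phone price only, never income."""
    return customer["phone_price"] > 650

def approval_rate(group):
    """Share of a group that the hypothetical model approves."""
    return sum(model_approves(c) for c in group) / len(group)

customers = [make_customer() for _ in range(10_000)]
affluent = [c for c in customers if c["affluent"]]
non_affluent = [c for c in customers if not c["affluent"]]

rate_affluent = approval_rate(affluent)
rate_non_affluent = approval_rate(non_affluent)
print(f"approval rate, affluent:     {rate_affluent:.2f}")
print(f"approval rate, non-affluent: {rate_non_affluent:.2f}")
```

Although affluence is never an input, the two approval rates come out far apart, because phone price stands in for it.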
Research into the effects of smoking provides another example of confounding variables. In decades past, research was produced that essentially drew the reassuring conclusion, “If you smoke, your probability of dying in the next four years is fairly low. Therefore, smoking is OK.” The confounding variable in this reasoning was the age distribution of smokers; in the past, the smoking population contained many younger smokers whose cancer would develop later in life. Many older smokers were already deceased, so their contribution to the finding was minimized. The analytic models behind the “smoking is OK” conclusion were therefore overwhelmingly biased by a higher density of younger smokers, creating a false perception about the safety of smoking.
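A toy calculation shows how an age skew can produce this kind of misleading result. The counts below are invented for illustration and are not real epidemiological data; the `cohort` table and `death_rate` helper are hypothetical names for this sketch.

```python
# Illustrative counts only, not real epidemiological data.
# Key: (age band, is smoker) -> (people observed, deaths within four years)
cohort = {
    ("young", True): (900, 9),
    ("young", False): (400, 2),
    ("old", True): (100, 20),
    ("old", False): (600, 60),
}

def death_rate(smoker, age_band=None):
    """Death rate for smokers or non-smokers, optionally within one age band."""
    people = deaths = 0
    for (band, is_smoker), (n, d) in cohort.items():
        if is_smoker == smoker and age_band in (None, band):
            people += n
            deaths += d
    return deaths / people

# Crude comparison that ignores age.
crude_smokers = death_rate(True)
crude_non_smokers = death_rate(False)
print(f"all ages: smokers {crude_smokers:.1%}, non-smokers {crude_non_smokers:.1%}")

# Same comparison, stratified by the confounding variable (age).
for band in ("young", "old"):
    print(f"{band}: smokers {death_rate(True, band):.1%}, "
          f"non-smokers {death_rate(False, band):.1%}")
```

In these made-up numbers the crude rate for smokers (2.9%) is lower than for non-smokers (6.2%), yet within each age band smokers die at twice the rate. The crude figure mostly reflects the fact that the smokers in the table are young.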
Today, similar bias could be produced by a model concluding that, since far fewer young people smoke cigarettes than 50 years ago, nicotine addiction levels are down, too. However, youth use of e-cigarettes jumped 78% between 2017 and 2018 — to one out of every five high-school students. E-cigarettes are potent nicotine delivery devices, fostering rapid nicotine addiction and simply diverting nicotine use to a new delivery vehicle. Without reflecting this nicotine delivery method, we would have an errant view of nicotine addiction among youth.
Finding Hidden Bias
Delivering truly ethical AI requires closely examining each data class separately, with respect to the relationships in the data that drive outcomes: the latent features. As data scientists, we must search for confounding variables and demonstrate to ourselves, and the world, that AI and machine learning technologies are not subjecting specific populations to bias. To reach that goal, the relationships learned need to be exposed using explainable latent feature technologies rather than buried in complex webs of interconnected variables. The latter contain relationships that need to be tested but can’t be extracted from the machine learning models.
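One simple way to examine a latent feature class by class is to compare its average activation across data classes. The sketch below is a minimal illustration under stated assumptions, not the specific explainable latent feature technology referred to above. The `latent_feature` weights, the synthetic classes "A" and "B", their input distributions, and the 0.5 review threshold are all hypothetical.

```python
import math
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed so the sketch is repeatable

def latent_feature(record):
    """Hypothetical hidden-node activation that combines two inputs."""
    z = 0.004 * record["phone_price"] + 0.01 * record["avg_purchase"] - 3.0
    return math.tanh(z)

def make_record(data_class):
    """Synthetic record; class-dependent distributions are assumptions."""
    shift = 1.0 if data_class == "A" else 0.0
    return {
        "data_class": data_class,
        "phone_price": random.gauss(500 + 300 * shift, 150),
        "avg_purchase": random.gauss(60 + 40 * shift, 20),
    }

def activation_by_class(records):
    """Mean latent activation for each data class."""
    by_class = {}
    for r in records:
        by_class.setdefault(r["data_class"], []).append(latent_feature(r))
    return {cls: mean(vals) for cls, vals in by_class.items()}

def standardized_gap(records, class_a, class_b):
    """Difference in mean activation, in units of the overall std deviation."""
    means = activation_by_class(records)
    spread = pstdev([latent_feature(r) for r in records])
    return (means[class_a] - means[class_b]) / spread

records = [make_record(random.choice("AB")) for _ in range(5_000)]
means = activation_by_class(records)
gap = standardized_gap(records, "A", "B")

REVIEW_THRESHOLD = 0.5  # assumed cutoff for flagging a feature for review
print(f"mean activation by class: {means}")
print(f"standardized gap: {gap:.2f}, needs review: {abs(gap) > REVIEW_THRESHOLD}")
```

A large gap does not prove the feature is unfair, but it marks a learned relationship that fires very differently for one data class and is worth inspecting for a confounding variable.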
There are two more blogs in my AI explainer series on the three Es of AI: explainable AI and efficient AI. Follow me on Twitter @ScottZoldi to stay in touch.
Note: A version of this post was published in IOT Agenda.