Data Science for Social Good: AI Fairness, Causality, and Equitable Crowdsourcing
Margarita Boyarskaya is a PhD student in the Technology, Operations, and Statistics group at NYU Stern (2021-2022), working on causal models for algorithmic fairness. She holds a B.Sc. and an M.Sc. in theoretical mathematics from Moscow State University. Previously, Margarita was an intern with the FATE group at Microsoft Research.
My work centers on building socially equitable AI systems. In particular, I focus on achieving fair (anti-discriminatory) decision-making and on ensuring data quality.
I focus on the detection, diagnosis, and mitigation of bias in statistical decision-making, primarily using causal inference methods. One particular problem that I work on is unfairness due to selection bias. An agent (e.g., a bank issuing consumer credit) may observe an outcome at different rates for two or more demographic subgroups (e.g., genders or races) in the training data, but this disparity may be an artifact of biased sampling and may not reflect true group differences in the population distribution. I am developing methods for diagnosing such cases and making accurate and fair predictions. I supplement my technical work with insights from law and social theory. For example, the legal distinction between ‘business necessity’ and ‘animus’ in using protected categories for prediction suggests a way to group and categorize covariates in prediction models, while differing theoretical views on race in the social sciences may either prohibit or permit modeling other variables as causes of race.
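To make the selection-bias concern concrete, below is a minimal simulation sketch in Python. The group labels, base rates, and selection rule are purely illustrative assumptions rather than quantities from my models: both groups repay at the same rate in the population, yet biased selection of labelled cases makes one group look far riskier in the training sample.

# Illustrative only: population parameters and the selection rule are assumed.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
group = rng.integers(0, 2, size=n)      # two demographic subgroups, 0 and 1
repaid = rng.random(n) < 0.8            # identical 80% repayment rate in both groups

# Biased selection into the training data: defaulters from group 1 are
# over-represented among the cases whose outcomes get recorded.
p_select = np.where((group == 1) & ~repaid, 0.9, 0.3)
observed = rng.random(n) < p_select

for g in (0, 1):
    pop = repaid[group == g].mean()
    sample = repaid[observed & (group == g)].mean()
    print(f"group {g}: population repayment {pop:.2f}, training-sample repayment {sample:.2f}")

The printed rates for group 1 diverge (about 0.80 in the population versus roughly 0.57 in the sample), illustrating how a disparity observed by the agent can be an artifact of sampling rather than a true population difference.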
The unifying theme of my work is an emphasis on the early and late stages of ML-enabled decision pipelines. In particular, I work on high-quality data collection, fair sampling, and equitable decision-making after the prediction stage. The three parts of my dissertation cover the following topics:
• Addressing the problem of data quality, my research provides analytical expressions for the expected accuracy and cost of collecting data labels on crowdsourcing platforms using a strong version of majority voting (the first sketch after this list illustrates these quantities for plain majority voting).
• Focusing on the issue of sampling, my research has found that, in the presence of biased selection of data into the training set, intentionally correcting machine learning models to enforce fairness in the sample may result in violations of fairness in the population. Our work proposes an instrumental variable approach for disproving the assumption of “no selection bias” and accurately estimating the coefficients of the true data-generating model (the second sketch below illustrates instrumental-variable estimation in general). We develop a data symmetrization method that allows for counterfactually fair predictions.
• Addressing the issues of data quality and variable inclusion, my work brings much-needed clarity to the long-standing problem of unfairness via proxy variables. Together with my coauthors, we put forth a framework for reasoning about proxy variable inclusion, providing common theoretical ground for differing treatments of proxy variables. This helps explain why some proxy variables are commonly accepted as permissible to use, while others are unanimously regarded as problematic. Our framework utilizes directed acyclic graphs (DAGs) to explain and reconcile agents’ seemingly inconsistent modeling choices (the third sketch below gives a toy DAG example).
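For the first bullet above, the following toy sketch shows the kind of quantities involved: the probability that a plain majority vote over n independent workers is correct when each worker is correct with probability p, together with the cost of buying those votes at a fixed price. This is the textbook binomial expression for simple majority voting, shown only to fix ideas; it is not the analytical result for the stronger voting scheme studied in the dissertation.

from math import comb

def majority_vote_accuracy(n_workers, p_correct):
    # P(majority of n independent votes is correct); ties are broken by a fair coin.
    acc = 0.0
    for k in range(n_workers + 1):
        prob_k = comb(n_workers, k) * p_correct**k * (1 - p_correct)**(n_workers - k)
        if 2 * k > n_workers:
            acc += prob_k
        elif 2 * k == n_workers:
            acc += 0.5 * prob_k
    return acc

def label_cost(n_workers, price_per_vote):
    # Cost of one aggregated label collected from n_workers paid votes.
    return n_workers * price_per_vote

print(majority_vote_accuracy(5, 0.7))   # ~0.837: five 70%-accurate workers
print(label_cost(5, 0.05))              # 0.25: at an assumed $0.05 per vote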
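For the second bullet, here is a bare-bones two-stage least squares sketch of instrumental-variable estimation under an assumed linear model with a single instrument. It illustrates only the general mechanics of using an instrument to recover a causal coefficient when a regressor is endogenous; it is not the selection-bias test or the data symmetrization method developed in our work.

# Hypothetical data-generating process: y = 1 + 2*x + u, where the regressor x
# is correlated with the error u, and z is a valid instrument (z affects x but
# enters y only through x).
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + 0.6 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

# Naive OLS of y on x is biased because x and u are correlated.
X = np.column_stack([np.ones(n), x])
ols_slope = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Two-stage least squares: regress x on z, then y on the fitted values of x.
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
tsls_slope = np.linalg.lstsq(np.column_stack([np.ones(n), x_hat]), y, rcond=None)[0][1]

print("OLS slope :", round(ols_slope, 3))   # drifts toward ~2.3, away from the true 2.0
print("2SLS slope:", round(tsls_slope, 3))  # close to the true 2.0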
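For the third bullet, the toy sketch below encodes a hypothetical DAG as an edge list and asks, for each candidate proxy, whether it has a directed path to the outcome of its own or merely stands in for the protected attribute. The variable names, edges, and path criterion are illustrative assumptions, not the definitions used in our framework; they are meant only to show how a DAG makes such distinctions explicit.

# Hypothetical causal DAG for a lending example; edges are assumptions for illustration.
from collections import defaultdict, deque

edges = [
    ("race", "zip_code"),             # residential segregation
    ("race", "name_on_application"),  # name carries information about race only
    ("zip_code", "commute_time"),
    ("commute_time", "job_retention"),
    ("job_retention", "repayment"),
]

children = defaultdict(list)
for parent, child in edges:
    children[parent].append(child)

def has_directed_path(src, dst):
    # Breadth-first search for a directed path from src to dst.
    queue, seen = deque([src]), {src}
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# zip_code descends from race but also has its own causal route to the outcome,
# the kind of structure a 'business necessity' argument appeals to.
print(has_directed_path("zip_code", "repayment"))             # True
# name_on_application has no causal route to the outcome: a pure proxy for race.
print(has_directed_path("name_on_application", "repayment"))  # False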