Predictability and other Predicaments in Machine Learning Applications
Abstract:
In the context of building predictive models, predictability is usually considered a blessing. After all, that is the goal: build the model with the highest predictive performance. The rise of 'big data' has in fact vastly improved our ability to predict human behavior, thanks to the introduction of much more informative features. In practice, however, things are more nuanced than that. For many applications, the relevant outcome is observed for very different reasons: one customer might churn because of the cost of the service, another because they are moving out of coverage. In such mixed scenarios, the model will automatically gravitate toward the scenario that is easiest to predict, at the expense of the others. This holds even if the predictable scenario is far less common or relevant. We present a number of applications where this happens: clicks on ads being performed 'intentionally' vs. 'accidentally', consumers visiting store locations vs. their phones pretending to be there, and finally customers filling out online forms vs. bots defrauding the advertising industry. The implications of this effect are significant: the introduction of highly informative features can have a significantly negative impact on the usefulness of predictive modeling and potentially create second-order biases in the predictions.
Bio:
Claudia Perlich started her career in Data Science at the IBM T.J. Watson Research Center, concentrating on research in data analytics and machine learning for complex real-world domains and applications. She tends to be domain agnostic, having worked on almost anything, including Twitter, DNA, server logs, CRM data, web usage, breast cancer, movie ratings, and more. More recently she acted as the Chief Scientist at Dstillery, where she designed, developed, analyzed, and optimized machine learning that drives digital advertising to prospective customers of brands. Claudia continues to be an active public speaker and has authored over 50 scientific publications as well as a few patents in the area of machine learning. She has won many data mining competitions and awards at Knowledge Discovery and Data Mining (KDD) conferences, and served as the conference's General Chair in 2014. Claudia is a past winner of the Advertising Research Foundation's (ARF) Grand Innovation Award and has been selected for Crain's New York's 40 Under 40 list, Wired Magazine's Smart List, and Fast Company's 100 Most Creative People. She received her PhD in Information Systems from the NYU Stern School of Business, where she still teaches as an adjunct professor.