Tutorials
The tutorials at IEEE DSAA 2018 relate to the statistical side of data science, the computer science side, or both.
(T1) NDlib: Modelling and Analyzing Diffusion Processes over Complex Networks
Nowadays, the analysis of dynamics of and on networks is a hot topic in social network analysis. To support students, teachers, developers, and researchers, we developed a novel framework, named NDlib, an environment designed to describe and run diffusion simulations. NDlib is designed as a multi-level ecosystem that can be fruitfully used by different user segments.
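As a flavor of what the framework looks like in practice, here is a minimal sketch of an SIR diffusion simulation following NDlib's documented usage (the graph and parameter values are illustrative):

```python
import networkx as nx
import ndlib.models.ModelConfig as mc
import ndlib.models.epidemics as ep

# Illustrative contact network
g = nx.erdos_renyi_graph(1000, 0.01)

# Attach the SIR diffusion model to the graph
model = ep.SIRModel(g)

# Configure infection/recovery rates and the initial seed fraction
cfg = mc.Configuration()
cfg.add_model_parameter('beta', 0.01)               # infection probability
cfg.add_model_parameter('gamma', 0.005)             # recovery probability
cfg.add_model_parameter('fraction_infected', 0.05)  # initial infected nodes
model.set_initial_status(cfg)

# Run 200 simulation steps and inspect the final status counts
iterations = model.iteration_bunch(200)
print(iterations[-1]['node_count'])
```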
(T2) Data Sources and Techniques to Mine Human Mobility Patterns
Understanding human mobility is key for a wide range of applications, such as urban planning, traffic forecasting, activity-based modeling of transportation networks, and epidemic modeling, to name a few. The huge amount of geo-spatial data now available creates new challenges and opportunities to satisfy this thirst for knowledge. In this tutorial, we introduce the datasets, concepts, knowledge, and methods used in human mobility mining.
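As a taste of the methods involved, a standard measure in mobility mining is the radius of gyration, which quantifies how far an individual typically travels from their trajectory's center of mass. A minimal sketch (the visit data and the planar-coordinates assumption are illustrative; real GPS traces would need projection or haversine distances):

```python
import numpy as np

def radius_of_gyration(points):
    """Radius of gyration of a set of (x, y) visit locations.

    Assumes coordinates are already projected to a planar system
    (e.g., meters), not raw latitude/longitude.
    """
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)  # center of mass of the visits
    return np.sqrt(((pts - center) ** 2).sum(axis=1).mean())

# Hypothetical trajectory: home, work, and an errand (meters)
visits = [(0, 0), (0, 0), (5000, 2000), (5000, 2000), (1000, 8000)]
print(radius_of_gyration(visits))  # ~ a few kilometers
```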
(T3) Reproducible research using lessons learned from software development
The tutorial will cover the use of Markdown and the ‘knitr’ package for combining code, results, and description (literate programming); ‘make’ for organizing and automating complex build processes in a data analysis; Git for version control and collaboration; and container technology (Docker) for isolating an entire data analysis, including the underlying operating system. The tutorial will introduce each of these technologies and its particular role in making a data analysis more reproducible, and is thus geared towards an audience with little or no experience in the required techniques.
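To make the role of ‘make’ concrete, here is a minimal sketch of a Makefile that automates a small analysis pipeline (all file names are hypothetical, and wiring the report step through rmarkdown is just one possible choice):

```make
# Hypothetical pipeline: raw data -> cleaned data -> figure -> report
analysis.html: analysis.Rmd figures/fig1.png
	Rscript -e "rmarkdown::render('analysis.Rmd')"

figures/fig1.png: clean_data.csv plot.R
	Rscript plot.R

clean_data.csv: raw_data.csv clean.R
	Rscript clean.R

.PHONY: clean
clean:
	rm -f analysis.html figures/fig1.png clean_data.csv
```

Because each target lists its inputs as dependencies, `make` rebuilds only what is out of date, which is exactly the build-automation discipline the tutorial borrows from software development.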
(T4) Deep Learning for Computer Vision: A practitioner’s viewpoint
This tutorial will focus on deep learning for image classification, adopting a pragmatic perspective and dealing with data scarcity, a scenario where training models from scratch leads to overfitting. We are going to tackle these problems by learning from practical examples. Using Jupyter notebooks, we will show in code examples how to handle model selection on an example dataset. We will show how the approximation-generalization trade-off works in practice, by producing and interpreting learning curves for different models and by estimating the amount of data necessary to obtain a given performance. Finally, we will introduce the transfer learning technique and show how it makes it possible to obtain better performance with less data and limited resources.
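As a preview of the transfer learning part, a minimal Keras sketch that reuses a pretrained convolutional base as a frozen feature extractor (the backbone, head sizes, and class count are illustrative choices, not necessarily the tutorial's exact setup):

```python
from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

# Pretrained ImageNet convolutional base, without its classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False  # freeze the base: only the new head is trained

# Small task-specific head on top of the frozen features
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(10, activation='softmax')(x)  # e.g., 10 target classes

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

With the base frozen, only the small head's weights are fitted, which is what makes the approach viable when labeled data are scarce.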
(T5) Project Management for Data Science
This short tutorial introduces the fundamental principles and best practices of project management; it specifically targets data science projects, including those that involve supervised machine learning (ML) and natural language processing (NLP) elements. The view presented is overall consistent with traditional project management methodology as taught by the Project Management Institute (PMI), but adapts and extends it where necessary and beneficial for the data science context, which is more experimental and data-driven (and thus open-ended and uncertain) than, say, a bridge construction project. By including practical aspects such as the interface between technologists and business stakeholders in industry projects, and by integrating ethics and privacy considerations, we hope to provide a holistic, useful, and practically applicable account.
(T6) Data Science Workflows Using R and Spark
This tutorial covers the data science process using R as a programming language and Spark as a big-data platform. Powerful workflows are developed for data extraction, data transformation and tidying, data modeling, and data visualization. During the course, R-based illustrations show how data is transported using REST APIs, sockets, etc. into persistent data stores such as the Hadoop Distributed File System (HDFS) and relational databases, and in some cases sent directly to Spark's real-time compute engine. Workflows using dplyr verbs are used for data manipulation within R, within relational databases (PostgreSQL), and within Spark using sparklyr. These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization. The machine learning algorithms taught in this tutorial include supervised techniques such as linear regression, logistic regression, decision trees, gradient-boosted trees, and random forests. Feature selection is done primarily by regularization, and models are evaluated using various metrics. Unsupervised techniques include k-means clustering and dimension reduction. Big-data architectures are discussed, including the Docker containers used to build the tutorial infrastructure, called rspark.
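For instance, the regularized objectives used for feature selection take the familiar elastic-net form (general notation, not specific to any one package):

```latex
\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2
  + \lambda\left[\alpha\,\lVert\beta\rVert_1
  + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
```

Setting \alpha = 1 gives the lasso, whose L1 penalty drives some coefficients exactly to zero and thereby selects features; \alpha = 0 gives ridge regression.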
(T7) Kernel Methods for Machine Learning and Data Science
This tutorial is presented at an intermediate level and explores the variety of ways in which modern kernel methods use similarity measures in Reproducing Kernel Hilbert Spaces (RKHS) to extend, strengthen, and enrich existing learning machines while creating brand new ones. Techniques such as kernel PCA, which extends the ubiquitous method of principal component analysis, are presented, as well as spectral clustering, kernel k-means, and the whole suite of kernel regression techniques, from radial basis function regression to relevance vector machine regression, support vector regression, and Gaussian process regression, just to name a few. Examples involving small, medium, and extremely large (big data) datasets are used to illustrate the methods. The software environment used is RStudio.
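As a glimpse of the underlying machinery, kernel methods replace inner products with kernel evaluations in an RKHS. For example, with the Gaussian (RBF) kernel, kernel ridge regression predicts through the representer theorem:

```latex
k(x, x') = \exp\!\left(-\frac{\lVert x - x'\rVert^2}{2\sigma^2}\right),
\qquad
f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x),
\qquad
\alpha = (K + \lambda I)^{-1} y
```

where K is the n-by-n Gram matrix with entries K_{ij} = k(x_i, x_j). The same substitution of kernels for inner products is what turns PCA into kernel PCA and k-means into kernel k-means.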
(T8) Recurrent Neural Nets with Applications to Language Modeling
In this tutorial we will explore how to use Recurrent Neural Networks to model and forecast series of events, using language as an intuitive example. Their advantages and disadvantages with respect to more traditional approaches will be highlighted, and simple implementations using the Keras Python library will be demonstrated. You will implement deep neural networks that are capable of guessing what the next letter in a word is, or even what the next word in a sentence might be. Code and slides will be made available on GitHub.
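As an indication of how compact such a model can be, a minimal character-level Keras sketch for next-letter prediction (vocabulary size, context length, and layer sizes are illustrative):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 60  # e.g., letters, digits, and punctuation
seq_len = 40     # characters of context used to predict the next one

model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),  # char id -> dense vector
    LSTM(128),                                        # summarize the sequence
    Dense(vocab_size, activation='softmax')           # distribution over next char
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Training uses integer-encoded text windows:
# X has shape (n_samples, seq_len), y holds each window's next-character id
# model.fit(X, y, batch_size=128, epochs=20)
```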
(T9) Managing Data Science Product Development
In this short tutorial, we will cover how the basic components of agile software development can be adapted to data science efforts. We will discuss how to define “stories” and “epics” for data science; how to manage and prioritize backlogs, stand-ups, and sprint reviews; and how to communicate with stakeholders. The methods covered have, in our application of them, resulted in better-managed stakeholder expectations, kept teams focused on delivering a business capability, and helped ensure that we deliver the intended impact to the business.