Fall 2020 PHS 597 Data Mining with Applications to Genomic Data

Basic Course Information

Welcome to our Machine Learning/Deep Learning course for the 2020-2021 school year. In the Fall semester, we cover classical machine learning, and in the Spring semester, we cover deep learning (primarily based on neural networks). The current course on data mining was originally developed in 2016, and this is its 4th edition.

Key References

This course covers the basics of statistical learning, with an emphasis on applications to genomic data. As a first course in statistical learning, it will largely follow the flow of the first two books below. Both are free to download, and the second book (ISL) is a more introductory version of the first (ESL):
1. The Elements of Statistical Learning (ESL). Authors: Hastie, T., Tibshirani, R., and Friedman, J.
2. An Introduction to Statistical Learning with Applications in R (ISL). Authors: James, G., Witten, D., Hastie, T., and Tibshirani, R.

The coverage of a few topics in books 1 and 2 appears to be dated (e.g., boosting algorithms and support vector machines), so we will supplement it with more up-to-date material from other books or original papers.

A few other references are helpful to have, including:
3. Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (the first part of the book covers the mathematical background as well as classical machine learning).
4. Deep Medicine by Eric Topol (a non-technical book that discusses numerous cool applications of AI to medicine).
5. Statistical Learning with Sparsity by Trevor Hastie, Robert Tibshirani, and Martin Wainwright (a technical book covering the theoretical aspects of LASSO and its extensions).

Topics

Tentative topics include classification, resampling methods, linear models with regularization (e.g., LASSO), Bayesian variable selection, additive models, classification and regression trees, random forests, support vector machines, basic neural networks, and the basics of unsupervised learning, such as k-means clustering, hierarchical clustering, association rules, factor analysis, and nonnegative matrix factorization.

We will discuss the application of machine learning methods to genomics, including (but not limited to) variant annotation, genetic association analysis, fine mapping, risk prediction, and variant calling and filtering from next-generation sequencing data.

We will illustrate examples with Python (scikit-learn). Classical machine learning methods are also available in R through the mlr package, and students are encouraged to learn R as well.
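
To give a flavor of the Python workflow, here is a minimal sketch of fitting a LASSO model with scikit-learn. It is not taken from the course materials: the data are synthetic, and the genotype-like 0/1/2 coding, the sample sizes, and the effect sizes are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, genotype-like design matrix (an assumption for illustration):
# 200 samples, 50 features, each coded 0, 1, or 2.
n_samples, n_features = 200, 50
X = rng.integers(0, 3, size=(n_samples, n_features)).astype(float)

# Only the first three features truly affect the outcome.
true_coef = np.zeros(n_features)
true_coef[:3] = [1.5, -2.0, 1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Hold out 25% of the samples to check out-of-sample performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# LassoCV chooses the regularization strength by 5-fold cross-validation.
model = LassoCV(cv=5, random_state=0).fit(X_train, y_train)

# Features with nonzero coefficients are the ones LASSO kept.
selected = np.flatnonzero(model.coef_)
test_r2 = model.score(X_test, y_test)

print("chosen alpha:", model.alpha_)
print("number of selected features:", selected.size)
print("test R^2:", round(test_r2, 3))
```

The L1 penalty shrinks most coefficients to exactly zero, so the nonzero entries of `model.coef_` act as a built-in variable selection step. This is the reason LASSO is popular when the number of features is large relative to the number of samples.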

Lecture Slides

Lecture 1: Introduction to machine learning (ppt)
Lecture 2: Linear Models (ppt) (notes: 08/31 09/02 09/07 09/09)
Lecture 3: Introduction to Statistical Genetics, GWAS and Population Structure (ppt) (notes: 09/09 09/14)
Lecture 4: Linear Discriminant Analysis (ppt) (notes: 09/16 09/21)
Lecture 5: Leverage GWAS to Gain Biological Insights (ppt)
Lecture 6: Leverage GWAS to Facilitate Clinical Translation (ppt)
Lecture 7: Support Vector Machine (ppt) (notes: 10/05 10/07 10/19 10/21)
Lecture 8: Model Validation, Bootstrap (ppt) (notes: 10/21 10/26 11/02 11/06 11/09)
Lecture 9: Basis Expansion (ppt) (notes: 11/09 11/11)
Lecture 10: Applications of SVM to Genomics (ppt)
Lecture 11: Classification and Regression Tree (ppt)

Code Snippets

Introduction to Python (code)
Introduction to NumPy (code)
Introduction to Lasso (code)