Fall 2019 – PHS 597: Data Mining with Application to Genomics

This course covers the basics of statistical learning, with an emphasis on applications to genomic data. As a first course in statistical learning, we will largely follow the flow of two books, both of which are free to download (the second book, ISL, is a gentler introductory version of the first, ESL):
1. The Elements of Statistical Learning. (ESL) Authors: Hastie T., Tibshirani R. and Friedman J.
2. An Introduction to Statistical Learning with Applications in R. (ISL) Authors: James G., Witten D., Hastie T. and Tibshirani R.
A few other references are helpful to have, including:
3. Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville (the first part of the book covers the mathematical background as well as classical machine learning)
4. Deep Medicine by Eric Topol (a non-technical book that discusses numerous applications of AI to medicine)
5. Statistical Learning with Sparsity by Hastie T., Tibshirani R. and Wainwright M. (a technical book covering the theoretical aspects of the LASSO and its extensions)

Tentative topics include classification, resampling methods, linear models with regularization (e.g. LASSO), additive models, classification and regression trees, random forests, support vector machines, and the basics of unsupervised learning, such as k-means clustering. If time permits, we will also discuss modern topics such as neural networks.

We will discuss applications of machine learning methods to genomics, including (but not limited to) variant annotation, genetic association analysis, and variant calling and filtering from next-generation sequencing data.

Compared to earlier years’ courses, we will add substantially more examples using R (mlr) or Python (scikit-learn).

Lecture 1 Introduction (PPT)
Lecture 2 Linear Models (PPT)
Lecture 3 Linear Discriminant Analysis (PPT)
Lecture 4 Support Vector Machine (PPT)
Lecture 5 Support Vector Machine Application to Variant Calling and Functionality Inference (PPT)
Lecture 6 Basis Expansion (PPT)
Lecture 7 Model Assessment (PPT)
Lecture 8 Classification and Regression Tree (PPT)
Lecture 9 Boosting (PPT)
Lecture 10 Random Forest (PPT)
Lecture 11 Neural Networks (PPT)
Lecture 12 Unsupervised Learning (PPT)


1. ISL Chapter 3: problems 4, 5, 6, 7; ESL: problems 3.1, 3.2, 3.3, 3.4, 3.5