Fall 2018 PHS 597 – Data Mining with Application to Genomics

This course covers the basics of statistical learning, with an emphasis on applications to genomic data. As a first course in statistical learning, it will largely follow two books, both of which are free to download (the second book, ISL, is a simplified version of the first, ESL):
1. The Elements of Statistical Learning (ESL). Authors: Hastie, T., Tibshirani, R., and Friedman, J.
2. An Introduction to Statistical Learning with Applications in R (ISL). Authors: James, G., Witten, D., Hastie, T., and Tibshirani, R.

Tentative topics include classification, resampling methods, linear models with regularization (e.g., LASSO), additive models, classification and regression trees, random forests, support vector machines, and the basics of unsupervised learning. More modern and advanced topics will be covered if time permits.

The course emphasizes applications of these methods to cutting-edge genetics and genomics problems. Areas of genomic application will include (but are not limited to) variant annotation, genetic association analysis, and variant calling and filtering from next-generation sequencing data.

The class meets Monday and Wednesday from 2:00-3:30pm. The classroom is HCAR 1101. The first class meets on Aug 20.

Lecture Notes:
1. Introduction
2. Linear Models
3. Introduction to Linear Models Used in Statistical Genetics
4. Linear Discriminant Analysis
5. Support Vector Machine
6. Application of SVM to Variant Filtering and Functionality Prediction
7. Basis Expansion
8. Model Assessment
9. Classification and Regression Tree
10. Random Forest
11. Boosting
12. Neural Networks
13. Unsupervised Learning


Homework:
1. ISL book, Chapter 3: problems 4, 5, 6, 7 (due Oct 3rd). Optional: ESL book, problems 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.12, 3.13.
2. ISL book, Chapter 4: problems 1, 2, 3, 4; and Chapter 9: problems 1, 2.
3. ISL book, Chapter 9: problems 3, 6, 7.