Fall 2017 PHS597 Data Mining with Application to Genomic Data

This course covers the basics of statistical learning. We will largely follow two books, both of which are free to download (the 2nd book is a baby version of the 1st):
1. The Elements of Statistical Learning. (ESL) Authors: Hastie T., Tibshirani R. and Friedman J.
2. An Introduction to Statistical Learning with Applications in R. (ISL) Authors: James G., Witten D., Hastie T. and Tibshirani R.

Tentative topics include classification, resampling methods, linear models with regularization (e.g. LASSO), additive models, classification and regression trees, random forests, support vector machines and basics of unsupervised learning.

The course emphasizes applications of these methods to cutting-edge genetics and genomics problems. Areas of genomic application will include (but are not limited to) variant annotation, genetic association analysis, and variant calling and filtering from next-generation sequencing data.

The course was first taught in Fall 2016. The course material will be extensively revised in Fall 2017. To get a sense of what was taught, you may refer to the old course webpage here.

Lecture Notes
1. Introduction (PPT)
2. Linear Models (PPT)
3. Discriminant Analysis (PPT)
4. Basis Expansion (PPT)
5. Model Assessment (PPT)
6. Boosting (PPT)
7. Trees (PPT)
8. Random Forest (PPT)
9. Support Vector Machine (PPT)

1. Pick 3 problems from 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.12, 3.13 in the “Elements” (ESL) book, due Oct 5th.
2. Pick 3 problems from 4.2, 4.3, 4.6, 4.7, 4.10, 4.13 in the ISL book, due Oct 23rd.

We will explore the use of machine learning methods to model gene expression levels using genetic data as predictors. The datasets used for this project are described here.
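
To give a feel for what such a model can look like, here is a minimal sketch in Python that fits a LASSO regression (one of the topics listed above) to simulated data. Everything in it is an assumption made for illustration: the genotypes are simulated 0/1/2 allele counts, the two "causal" SNP indices, their effect sizes, the noise level and the penalty value `lam` are all hypothetical, and the hand-written coordinate descent routine is not the course's code and does not use the project datasets.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - b0 - X b||^2 + lam * ||b||_1.

    X is a list of rows (individuals) and y a list of responses.
    Columns are centred internally so the intercept can be recovered
    in closed form at the end.
    """
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    resid = [yi - y_mean for yi in y]          # residual with all betas at 0
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p

    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:               # monomorphic SNP, nothing to fit
                continue
            # Covariance of column j with the partial residual
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new_b

    intercept = y_mean - sum(b * m for b, m in zip(beta, col_mean))
    return intercept, beta


# --- Simulated example (all values below are hypothetical) ---
rng = random.Random(0)
n_individuals, n_snps, maf = 200, 30, 0.3

# Genotype = number of minor alleles (0, 1 or 2) at each SNP
genotypes = [
    [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
    for _ in range(n_individuals)
]

# Expression depends on SNP 0 and SNP 5 only, plus Gaussian noise
expression = [
    1.0 + 0.8 * g[0] - 0.6 * g[5] + rng.gauss(0.0, 0.5) for g in genotypes
]

intercept, beta = lasso_coordinate_descent(genotypes, expression, lam=0.05)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("intercept:", round(intercept, 3))
print("SNPs with non-zero coefficients:", selected)
```

In this sketch the penalty shrinks most coefficients of the non-causal SNPs to exactly zero, which is why the LASSO is often a natural first choice when the number of genetic predictors is large. In practice `lam` would be chosen by cross-validation rather than fixed by hand.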