PHS 597 Data Mining with Application to Genomic Data

This course covers the basics of statistical learning, with an emphasis on its application to genomic data. As a first course in statistical learning, it will largely follow the flow of two books (both are free to download; the 2nd book is a baby version of the 1st):
1. The Elements of Statistical Learning (ESL). Authors: Hastie, T., Tibshirani, R. and Friedman, J.
2. An Introduction to Statistical Learning with Applications in R (ISL). Authors: James, G., Witten, D., Hastie, T. and Tibshirani, R.

Tentative topics include classification, resampling methods, linear models with regularization (e.g. LASSO), additive models, classification and regression trees, random forests, support vector machines and the basics of unsupervised learning.

The emphasis will be on applications of these methods to cutting-edge genetics and genomics problems. Areas of genomic application will include (but are not limited to) variant annotation, genetic association analysis, and variant calling and filtering from next-generation sequencing.

The class meets MW 2-3:30pm @ASB 3400M, starting on Aug 22. Starting from Oct 17, the class meets three times a week: MW 2-3:30pm and F 10-11:30am @ASB 3400M.

Homework:
Set 1: Aug 24: Pick 3 problems from 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.12, 3.13 of the “Elements” book. Read the sections that we covered, including 3.1, 3.2 and 3.4.1-3.4.3 (we did not cover the Gram-Schmidt process, but please read it as well). Due in two weeks.
Set 2: Sept 20: Choose >2 problems from the following: 4.2, 4.3, 4.4, 4.5, 4.6. Write up the solution to problem 4.1 4.6, which I pretty much already solved in class. You are also encouraged to complete problem 4.9 and learn to code a QDA classifier.
Set 3: Pick three problems from 7.1, 7.2, 7.3, 7.4, 7.5, 7.6.
Coding Exercise: Code boosted trees for a classification tree with two terminal nodes and for a general tree with two terminal nodes.
Set 4: Do 5.1, 5.2 and 5.7. The three problems cover important aspects of cubic splines, B-splines, and smoothing splines.
Set 5 (Optional) – boosting: Exercises 10.1, 10.2, 10.5, 10.6, 10.7
Set 6 (Optional) – random forest: Exercises 15.1, 15.2, 15.4, 15.5
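
As a possible starting point for the boosting coding exercise above, here is a minimal sketch of boosting with two-terminal-node classification trees (stumps). It follows the general shape of AdaBoost.M1 (Algorithm 10.1 in ESL), with labels coded as +1/-1. It is written in plain Python rather than R, which is an assumption made for illustration; the function names (stump_predict, fit_stump, adaboost, predict) are hypothetical and do not come from the course scripts. The exhaustive threshold search and the per-round weight normalization are simplifications, not a tuned implementation.

```python
import math


def stump_predict(row, j, t, sign):
    """Two-terminal-node tree: predict `sign` if row[j] > t, else `-sign`."""
    return sign if row[j] > t else -sign


def fit_stump(X, y, w):
    """Exhaustively search for the stump with the smallest weighted error.

    Returns (weighted_error, feature_index, threshold, sign).
    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in ((a + b) / 2 for a, b in zip(values, values[1:])):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best


def adaboost(X, y, n_rounds=10):
    """Fit a weighted ensemble of stumps; y must contain only +1 and -1."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform observation weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, t, sign = fit_stump(X, y, w)
        # Clip the error so the log below stays finite for a perfect stump.
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = math.log((1 - err) / err)
        ensemble.append((alpha, j, t, sign))
        # Up-weight the misclassified observations, then renormalize.
        w = [wi * math.exp(alpha) if stump_predict(row, j, t, sign) != yi else wi
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Classify by the sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, j, t, sign)
                for alpha, j, t, sign in ensemble)
    return 1 if score >= 0 else -1
```

For example, model = adaboost(X, y, n_rounds=20) fits the ensemble on a list of feature rows X and a list of +1/-1 labels y, and predict(model, row) classifies a new row. Replacing fit_stump with a fitter for larger trees would give boosting with a general base tree, at the cost of a more involved search.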
Course Material
Course Syllabus PDF
A list of reading materials for genomic applications DOC (this is being continuously updated)

Lecture Slides
1. Introduction to Machine Learning Course PDF   (class August 22)
2. Linear Models PDF  (class August 22, 24, 29) (Chapter 3 of ESL book)
3. Introduction to GWAS and Ancestry Inference using PCA PDF (class August 31)
4. Linear Discriminant Analysis PDF (class Sept. 7, 14, 19, 26) (Chapter 4 of ESL book)
5. Model Assessment, Inference and Averaging PDF (class Sept 26, 28, Oct 10, 12, 14, 17) (Chapters 7 and 8 of ESL book)
6. Tree-based Methods PDF (class Oct 21, 23 and 25) (Chapters 9 and 10 of ESL book)
7. Basis Extension and Splines (Chapter 5 of ESL book) PDF (class Oct 31, Nov 1)
8. Boosting (Chapter 10 of ESL book) PDF (class Nov 3)
Rscript for boosting a linear classifier, boosting a classification tree
9. Random Forest (Chapter 15 of ESL book) PDF
10. Neural Networks (Chapter 11 of ESL book) PPT
11. Support Vector Machine (Chapter 12 of ESL book) PPT
12. Unsupervised Learning (Chapter 14 of ESL book) PPT