PHS 597 Deep Learning with Applications to Genomics and Health Informatics

Deep learning has been called the “new electricity” for modern science and technology. A good understanding of modern deep learning methods is critical for statisticians who are interested in “big data” research.

This course covers basic theory and applications of modern deep learning methods in genomics and health informatics. We will survey basic theory for deep neural networks, convolutional neural networks, sequence models, generative adversarial networks, etc.

We will not follow any single textbook, but most of the fundamental deep learning material will be taken from the following sources:
1. Deep Learning book by Ian Goodfellow and Yoshua Bengio and Aaron Courville (link)
2. Deep Learning with Python book by Francois Chollet (link)
3. Deep Learning with R book also by Francois Chollet (link)
These books can be read for free online. In addition, many new topics will be drawn directly from research papers.

The programming language used in this course will be Python or R. I am more experienced with deep learning modeling in Python, although much progress has recently been made in adapting Torch and TensorFlow to R. This semester, we will still stick with Python for most of the illustrations.

This is the third time we have taught this course. The course material is still very much under development, so you should anticipate major changes from previous years.

An Efficient R Package to Annotate/Query Sequence Datasets

Together with Xiaowei Zhan, we recently had our software article accepted at Genetic Epidemiology. Our article describes a new R package SEQMINER (in our view) for annotating and querying files of sequence variants (i.e. VCF/BCF files), files of summary association statistics (i.e. RAREMETAL/METAL files), as well as generic files that contain columns of genomic positions. This package provides a way to use R to process large-scale datasets arising from statistical genetics studies. The software package can be found in

The software package does two things: annotating and querying sequence variants and summary association statistics. It implements a fully functional variant annotator. It supports both region-based annotation (e.g. whether a given variant overlaps regions of biological interest, say a transcription factor binding site) and gene-based annotation (e.g. what kind of amino acid change the mutation induces).
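To make the idea of region-based annotation concrete, here is a minimal Python sketch. It is not SEQMINER's implementation or API. The region table, the labels and the function name `annotate_position` are hypothetical, and the sketch assumes a single chromosome with sorted, non-overlapping regions given in 1-based inclusive coordinates.

```python
from bisect import bisect_right

# Hypothetical regions of interest on one chromosome: (start, end, label).
# Assumed 1-based inclusive, sorted by start, and non-overlapping.
REGIONS = [
    (100, 200, "TFBS_A"),
    (350, 400, "TFBS_B"),
    (900, 1200, "enhancer_1"),
]
STARTS = [start for start, _, _ in REGIONS]


def annotate_position(pos):
    """Return the label of the region containing pos, or None if there is none."""
    # Index of the last region whose start is at or before pos.
    i = bisect_right(STARTS, pos) - 1
    if i >= 0 and pos <= REGIONS[i][1]:
        return REGIONS[i][2]
    return None


# Hypothetical variant positions to annotate.
variants = [50, 150, 200, 375, 401, 1000]
annotations = {pos: annotate_position(pos) for pos in variants}
```

Because the region starts are sorted, each lookup is a binary search rather than a scan over all regions. A real annotator would also have to handle multiple chromosomes and overlapping regions, which this sketch deliberately ignores.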

Another key feature of the package is that it enables random access to large-scale datasets that are indexed. This is analogous to looking up a word in a dictionary. If you want to look up the meaning of “zoo”, you would not start from “a”. You would use the index and search from “z”, “zo”, etc. SEQMINER relies on the tabix library to randomly retrieve genomic regions of interest from large-scale datasets according to a pre-computed index. It does more than what tabix does: for datasets with complex structures, it parses the retrieved data, stores them in standard R objects and makes them ready for downstream analysis.
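The dictionary analogy can be sketched in a few lines of Python. This is a toy illustration of index-based random access in general, not the tabix index format and not SEQMINER's code. The file layout (position, tab, value), the function names and the choice to index every 100th line are all assumptions made for the example.

```python
import os
import tempfile
from bisect import bisect_right


def write_demo_file(path, n=1000):
    """Write a sorted, tab-delimited demo file: position <TAB> value."""
    with open(path, "wb") as fh:
        for i in range(n):
            pos = 10 * (i + 1)
            fh.write(f"{pos}\tvalue_{pos}\n".encode("ascii"))


def build_index(path, every=100):
    """Record (position, byte offset) for every `every`-th line of the file."""
    index = []
    offset = 0
    with open(path, "rb") as fh:
        for line_no, line in enumerate(fh):
            if line_no % every == 0:
                index.append((int(line.split(b"\t")[0]), offset))
            offset += len(line)
    return index


def query(path, index, start, end):
    """Return rows with start <= position <= end without scanning from the top."""
    keys = [pos for pos, _ in index]
    # Last indexed line whose position is at or before `start`.
    i = max(bisect_right(keys, start) - 1, 0)
    rows = []
    with open(path, "rb") as fh:
        fh.seek(index[i][1])  # jump straight to the nearby block
        for line in fh:
            pos_field, value = line.rstrip(b"\n").split(b"\t")
            pos = int(pos_field)
            if pos > end:
                break  # the file is sorted, so nothing later can match
            if pos >= start:
                rows.append((pos, value.decode("ascii")))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv")
    write_demo_file(demo_path)
    demo_index = build_index(demo_path)
    hits = query(demo_path, demo_index, 5005, 5030)
```

The index is built once and is much smaller than the data. Each query then reads at most one block of lines before reaching the requested range, instead of reading the file from the beginning.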

The package is very efficient and can handle the annotation of very large datasets on a single desktop machine with a standard configuration. If you have genomic datasets that you want to analyze using R, and it is cumbersome (or even impractical) to load the entire dataset into R, SEQMINER can be a good place to start.