Together with Xiaowei Zhan, we recently had our software article accepted at Genetic Epidemiology. The article describes SEQMINER, a new R package for annotating and querying files of sequence variants (VCF/BCF files), files of summary association statistics (RAREMETAL/METAL files), and generic files that contain columns of genomic positions. The package provides a way to use R to process the large-scale datasets that arise in statistical genetics studies. It is available at https://cran.r-project.org/web/packages/seqminer/index.html.
The software package does two things: it annotates and it queries sequence variants and summary association statistics. For annotation, it implements a fully functional variant annotator that supports both region-based annotation (e.g. whether a given variant overlaps a region of biological interest, say a transcription factor binding site) and gene-based annotation (e.g. what kind of amino acid change the mutation induces).
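To make the idea of region-based annotation concrete, here is a minimal Python sketch of the overlap test itself. It is not SEQMINER's implementation or API: the `Region` type, the `annotate_by_region` function, and the toy coordinates are all hypothetical, and the sketch assumes 1-based, inclusive coordinates.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """A labeled genomic interval (hypothetical type, 1-based inclusive)."""
    chrom: str
    start: int
    end: int
    label: str


def annotate_by_region(chrom: str, pos: int, regions: List[Region]) -> List[str]:
    """Return the labels of all regions that contain the variant position."""
    return [
        r.label
        for r in regions
        if r.chrom == chrom and r.start <= pos <= r.end
    ]


# Toy regions, invented for illustration only.
regions = [
    Region("1", 1000, 2000, "TFBS_A"),
    Region("1", 1500, 3000, "enhancer_B"),
    Region("2", 1000, 2000, "TFBS_C"),
]

print(annotate_by_region("1", 1600, regions))  # ['TFBS_A', 'enhancer_B']
```

A linear scan like this is fine for a handful of regions; a real annotator would typically use an interval index so that each lookup does not touch every region.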
Another key feature of the package is that it enables random access to large-scale datasets that are indexed. This is analogous to looking up a word in a dictionary. If you want to look up the meaning of “zoo”, you would not start from “a”. You would go to the index and search from “z”, “zo”, and so on. SEQMINER relies on the tabix library to retrieve genomic regions of interest from large-scale datasets according to a pre-computed index. It does more than tabix does: for datasets with complex structures, it parses the retrieved data, stores them in standard R objects, and makes them ready for downstream analysis.
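The dictionary analogy can be sketched in a few lines of Python. This is only an illustration of the principle, under the assumption that records are sorted by position on a single chromosome. It is not how tabix is implemented (tabix builds a binning and linear index over block-compressed files), and the `records` data and `fetch_range` function are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Toy data, invented for illustration: (position, record) pairs
# sorted by position on one chromosome.
records = [(100, "rs1"), (250, "rs2"), (400, "rs3"), (900, "rs4"), (1500, "rs5")]

# The "pre-computed index": here simply the sorted list of positions.
positions = [pos for pos, _ in records]


def fetch_range(start: int, end: int):
    """Return records with start <= position <= end.

    Binary search on the index finds the slice boundaries directly,
    so the lookup never scans the data from the beginning.
    """
    lo = bisect_left(positions, start)   # first position >= start
    hi = bisect_right(positions, end)    # one past the last position <= end
    return [rec for _, rec in records[lo:hi]]


print(fetch_range(200, 900))  # ['rs2', 'rs3', 'rs4']
```

The point is the same as with the dictionary: the cost of a query depends on the size of the region requested rather than on the size of the whole file.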
The package is very efficient and can handle the annotation of very large datasets on a single desktop machine with a standard configuration. If you have genomic datasets that you want to analyze in R, and loading the entire dataset into R is cumbersome (or even impractical), SEQMINER can be a good place to start.