Multiple testing correction in linear mixed models

Our group recently published a paper on multiple testing correction for genetic studies with population structure. The project was led by Jong Wha (Joanne) Joo, also involved Farhad Hormozdiari, and was a joint effort with Buhm Han’s group. The approach builds on Buhm Han’s previous work, SLIDE (Han et al. 2009; Han and Eskin 2012).
Genome-wide association studies (GWAS) have discovered many variants in the human genome that are associated with complex traits. In a GWAS, researchers collect both phenotypic information and genetic information on variants spread through the genome from a population. To identify the set of variants associated with a trait of interest, we assess the correlation between the phenotype and the genetic information at each variant, which we call the genotype. GWAS are now routinely performed on tens of thousands of individuals and millions of genetic variants.
GWAS methodology must address specific problems tied to this exceptionally large scale of analysis. One major challenge is multiple hypothesis testing. In a routine analysis, the significance of each test is assessed by comparing its p value to a per-marker threshold. However, a GWAS involves up to millions of statistical tests in a single study. With this many tests, a conventional threshold produces false positives, or spurious associations, so the p value threshold for significance must be adjusted to control the overall false positive rate.
Several approaches are used to make this adjustment, including the Bonferroni correction and the permutation test.
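
As a small illustration of the simplest of these adjustments, the Bonferroni correction divides the desired overall false positive rate by the number of tests. The sketch below is only an example: the function name and the study size of one million variants are assumptions chosen for illustration, not values from our paper.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-marker p value threshold that keeps the overall
    (family-wise) false positive rate at or below alpha."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return alpha / num_tests


# Hypothetical study: one million tested variants, 5% overall error rate.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"per-marker threshold: {threshold:.1e}")  # prints 5.0e-08
```

Bonferroni treats every test as if it were independent, so it tends to be conservative when nearby variants are correlated. Resampling approaches such as the permutation test estimate the threshold from the data's actual correlation structure instead.
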
Recently, the linear mixed model (LMM) has become standard practice for performing GWAS. The LMM addresses two important challenges in GWAS: population structure and insufficient power. Population structure refers to the complex relatedness among individuals, which can produce false positive associations when it is not modeled. In many cases, LMM approaches increase statistical power and avoid false positives by explicitly modeling the genetic relationships among individuals. Nonetheless, multiple hypothesis testing with LMM approaches can still generate false positive associations. Unfortunately, the current approaches for multiple hypothesis testing correction cannot be applied to LMM, because population structure affects the correlation structure of the statistics, as we show in the paper.
To address this issue, we developed the first gold standard approach for multiple hypothesis testing correction in LMM. This method, called multiple testing in transformed space (MultiTrans), efficiently corrects for multiple testing in LMM approaches. MultiTrans is a parametric bootstrap resampling approach that serves as the LMM equivalent of the permutation test. Specifically, our approach samples randomized null phenotypes from the distribution fitted by the LMM.
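
For readers who want the sampling step written out, the standard LMM formulation gives the null distribution below. The notation is an assumption chosen for illustration and is not copied from the paper: y is the vector of phenotypes, X holds the covariates (not the tested variant), K is the kinship matrix describing genetic relatedness, and the two variance components are the genetic and environmental variances. Hats denote values fitted by the LMM.

```latex
\begin{aligned}
y &= X\beta + u + e,
  \qquad u \sim \mathcal{N}(0,\, \sigma_g^2 K),
  \qquad e \sim \mathcal{N}(0,\, \sigma_e^2 I) \\
\tilde{y} &\sim \mathcal{N}\!\left(X\hat{\beta},\;
  \hat{\sigma}_g^2 K + \hat{\sigma}_e^2 I\right)
\end{aligned}
```

Each draw of the second line is a null phenotype: it carries the fitted relatedness structure but no effect of any tested variant.
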
Straightforward parametric bootstrapping, in which phenotypes are sampled, is prohibitively expensive computationally. MultiTrans instead uses a multivariate normal distribution to sample the association statistics directly. The figure shows an overview of our methodology.
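
The general idea of sampling statistics rather than phenotypes can be sketched in a few lines. This is a toy illustration under stated assumptions, not the MultiTrans implementation: it takes the correlation matrix of the statistics as given (in MultiTrans that correlation is derived from transformed genotypes, which is not shown here), and the function name, the AR(1)-style toy matrix, and all parameter values are hypothetical.

```python
from math import erfc, sqrt

import numpy as np


def per_marker_threshold(corr, alpha=0.05, num_samples=10_000, seed=0):
    """Estimate a per-marker p value threshold by sampling null
    association statistics from N(0, corr)."""
    rng = np.random.default_rng(seed)
    # Each row is one null draw of the statistics for all markers.
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=num_samples)
    # The largest |statistic| in each draw decides whether that
    # draw would contain at least one false positive.
    max_abs = np.abs(z).max(axis=1)
    # Critical value exceeded by the maximum in a fraction alpha of draws.
    critical = np.quantile(max_abs, 1 - alpha)
    # Two-sided p value of the critical value under a standard normal.
    return erfc(critical / sqrt(2))


# Toy correlation (an assumption): 50 markers, correlation decaying
# with distance along the genome.
m, rho = 50, 0.9
idx = np.arange(m)
corr = rho ** np.abs(idx[:, None] - idx[None, :])

threshold = per_marker_threshold(corr)
bonferroni = 0.05 / m
print(f"sampled threshold: {threshold:.2e}, Bonferroni: {bonferroni:.2e}")
```

Because the toy markers are positively correlated, the sampled threshold should typically come out larger (less stringent) than the Bonferroni threshold for the same number of markers.
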
The full citation to our paper is:

Joo, Jong Wha J; Hormozdiari, Farhad; Han, Buhm; Eskin, Eleazar

Multiple testing correction in linear mixed models.

In: Genome Biol, 17 (1), pp. 62, 2016, ISSN: 1474-760X.

Multiple hypothesis testing is an essential step in GWAS analysis. The correct per-marker threshold differs as a function of species, marker densities, genetic relatedness, and trait heritability—and no previous multiple testing correction methods can comprehensively account for these factors. The method we developed to address this issue, MultiTrans, is an efficient and accurate multiple testing correction approach for LMM. Our method (a) performs a unique transformation of genotype data to account for actual genetic relatedness and heritability under LMM approaches, and (b) efficiently utilizes the multivariate normal distribution. Using MultiTrans, we accurately estimated per-marker thresholds in mouse, yeast, and human datasets—while reducing computation time from months to hours.

Mixed Models and Confounding Factors Talk @ Simons Institute

I recently gave a talk on mixed models and confounding factors, a long-standing interest of our research group, at a workshop that was part of the Evolutionary Biology and the Theory of Computing program held at the Simons Institute on the UC Berkeley campus. The talk was held on February 21st. It spans many years of work in our group, including work by Hyun Min Kang (now at Michigan), Noah Zaitlen (now at UCSF), and Jimmie Ye (now at Harvard), as well as a sneak peek at very recent work by Joanne Joo, Jae-Hoon Sul and Buhm Han.

The video of the talk is available here and is also on our YouTube Channel ZarlabUCLA.

The papers covered in the talk include the EMMA and ICE papers published in 2008 and the EMMAX paper published in 2010, as well as a very new paper that should be coming out soon. The key papers from the talk are:

Kang, Hyun Min; Sul, Jae Hoon ; Service, Susan K; Zaitlen, Noah A; Kong, Sit-Yee Y; Freimer, Nelson B; Sabatti, Chiara ; Eskin, Eleazar

Variance component model to account for sample structure in genome-wide association studies.

In: Nat Genet, 42 (4), pp. 348-54, 2010, ISSN: 1546-1718.

Kang, Hyun Min; Zaitlen, Noah A; Wade, Claire M; Kirby, Andrew ; Heckerman, David ; Daly, Mark J; Eskin, Eleazar

Efficient control of population structure in model organism association mapping.

In: Genetics, 178 (3), pp. 1709-23, 2008, ISSN: 0016-6731.

Kang, Hyun Min; Ye, Chun ; Eskin, Eleazar

Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots.

In: Genetics, 180 (4), pp. 1909-25, 2008, ISSN: 0016-6731.

Emrah Kostem’s talk about his research

Emrah Kostem, who graduated this year and is now at Illumina, gave a talk at our retreat this summer about the research he completed in the lab. It is available here and gives a good overview of the goals of our group and some details of the projects that Emrah completed in the lab.

One of the topics he discusses is his recently published work on estimating heritability, that is, quantifying how much of the variance of a trait is accounted for by genetics. He discusses how to partition heritability into the contributions of genomic regions (10.1016/j.ajhg.2013.03.010).
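
Under a variance component model, this definition is commonly written as follows. The symbols are assumptions chosen for illustration rather than notation taken from the paper: the genetic variance is split across R genomic regions, and the remaining variance is environmental.

```latex
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2},
\qquad
\sigma_g^2 = \sum_{r=1}^{R} \sigma_r^2,
\qquad
h_r^2 = \frac{\sigma_r^2}{\sigma_g^2 + \sigma_e^2}
```

With these definitions, the regional contributions sum to the total heritability.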

He also talks about work that takes advantage of the insight that association statistics follow a multivariate normal distribution and applies it to two problems. The first is selecting follow-up SNPs using the results of an association study (10.1534/genetics.111.128595). The second is speeding up eQTL studies with a two-stage approach in which only a fraction of the association tests are performed but virtually all of the significant associations are still discovered (10.1089/cmb.2013.0087).

Details of what he talked about are in his papers:

Kostem, Emrah; Eskin, Eleazar

Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions.

In: Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605.

Kostem, Emrah; Eskin, Eleazar

Efficiently Identifying Significant Associations in Genome-wide Association Studies.

In: J Comput Biol, 20 (10), pp. 817-30, 2013, ISSN: 1557-8666.

Kostem, Emrah; Lozano, Jose A; Eskin, Eleazar

Increasing Power of Genome-wide Association Studies by Collecting Additional SNPs.

In: Genetics, 2011, ISSN: 1943-2631.
