The Multivariate Normal Distribution Framework for Analyzing Association Studies: Overview

The multivariate normal (MVN) model has been a powerful tool in our group's research and has been used in many of our papers. Jose Lozano (University of the Basque Country, San Sebastian, Spain), along with Eleazar Eskin and three ZarLab alumni, Farhad Hormozdiari (postdoc at Harvard), Jong Wha (Joanne) Joo (faculty at Dongguk University in Seoul), and Buhm Han (faculty at University of Ulsan College of Medicine in Seoul), recently published a review of the MVN distribution framework in genome-wide association studies (GWAS).

Genome-wide association studies (GWAS) have discovered thousands of variants involved in common human diseases. In these studies, frequencies of genetic variants are compared between a population of individuals with a disease (cases) and a population of healthy individuals (controls). Any variant that has a significantly different frequency between the two populations is considered an associated variant.
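The case/control comparison described above can be sketched as a two-proportion z-test on allele frequencies. This is only an illustration under simple assumptions (one variant, independent alleles, no covariates); the function name and the allele counts are hypothetical and do not come from the paper.

```python
import math

def allelic_association_test(case_alt, case_total, ctrl_alt, ctrl_total):
    """Two-proportion z-test comparing allele frequency in cases vs. controls.

    Counts are allele counts (two per individual). Returns (z, two-sided p).
    """
    p_case = case_alt / case_total
    p_ctrl = ctrl_alt / ctrl_total
    # Pooled frequency under the null hypothesis of no association.
    p_pool = (case_alt + ctrl_alt) / (case_total + ctrl_total)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / case_total + 1 / ctrl_total))
    z = (p_case - p_ctrl) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 1000 cases and 1000 controls (2000 alleles each).
z, p = allelic_association_test(case_alt=460, case_total=2000,
                                ctrl_alt=400, ctrl_total=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

In a real GWAS the resulting p-value would be compared against a threshold that accounts for the very large number of variants tested.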

A major challenge in the analysis of GWAS is that human population history causes nearby genetic variants in the genome to be correlated with each other. In this review, we demonstrate how to use the MVN distribution to explicitly take into account the correlation between genetic variants and provide a comprehensive framework for the analysis of GWAS.

In this paper, we show how the MVN framework can be applied to perform association testing, correct for multiple hypothesis testing, estimate statistical power, and perform fine mapping and imputation. In future blog posts, we will highlight different ways the MVN framework can be used in association studies.
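As one illustration of the multiple-testing use case, the sketch below assumes that, under the null hypothesis, the association statistics of nearby variants jointly follow an MVN with mean zero and covariance equal to their correlation (LD) matrix. It samples from that distribution to estimate a family-wise significance threshold. The LD matrix, the function names, and the sampling settings are hypothetical choices for this example, not the procedure from the paper.

```python
import math
import random

def cholesky(a):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def familywise_threshold(ld, alpha=0.05, n_samples=20000, seed=0):
    """Estimate t such that P(max_i |z_i| > t) is about alpha when z ~ N(0, ld)."""
    rng = random.Random(seed)
    chol = cholesky(ld)
    m = len(ld)
    maxima = []
    for _ in range(n_samples):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # z = L e has covariance L L^T = ld.
        z = [sum(chol[i][k] * e[k] for k in range(i + 1)) for i in range(m)]
        maxima.append(max(abs(v) for v in z))
    maxima.sort()
    return maxima[int((1 - alpha) * n_samples)]

# Hypothetical region of 10 variants whose correlation decays with distance.
m, r = 10, 0.9
ld = [[r ** abs(i - j) for j in range(m)] for i in range(m)]
identity = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

t_ld = familywise_threshold(ld)
t_indep = familywise_threshold(identity)
print(f"threshold with LD: {t_ld:.2f}, assuming independence: {t_indep:.2f}")
```

Because correlated variants behave like fewer independent tests, the threshold estimated with LD is lower than the one obtained by treating the variants as independent, which is where the gain over a Bonferroni-style correction comes from.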

An illustration of the multivariate normal model (a) Type I Error (b) Power.

Many of the authors are alumni of the group who pioneered the use of the MVN in various problems in association studies. Here is a list of papers that our group published using the MVN framework:

  • Farhad Hormozdiari, Anthony Zhu, Gleb Kichaev, Chelsea J.-T. Ju, Ayellet V. Segre, Jong Wha J. Joo, Hyejung Won, Sriram Sankararaman, Bogdan Pasaniuc, Sagiv Shifman, and Eleazar Eskin. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics, 100(5):789-802, May 2017.
  • Yue Wu, Farhad Hormozdiari, Jong Wha J. Joo, and Eleazar Eskin. Improving imputation accuracy by inferring causal variants in genetic studies. In Lecture Notes in Computer Science, pages 303-317. Springer International Publishing, 2017.

The paper was written by Jose A. Lozano, Farhad Hormozdiari, Jong Wha (Joanne) Joo, Buhm Han, and Eleazar Eskin, and it is available at:

The full citation to our paper is:

Jose A. Lozano, Farhad Hormozdiari, Jong Wha (Joanne) Joo, Buhm Han, Eleazar Eskin. 2017. The Multivariate Normal Distribution Framework for Analyzing Association Studies. bioRxiv doi:

Discovering SNPs Regulating Human Gene Expression Using Allele Specific Expression from RNA-Seq data

Analyses of expression quantitative trait loci (eQTL), genomic loci that contribute to variation in gene expression levels, are essential to understanding the mechanisms of human disease. These studies identify regulators of gene expression as either cis-acting factors that regulate nearby genes, or trans-acting factors that affect unlinked genes through various functions. Traditional eQTL studies treat expression as a quantitative trait and associate it with genetic variation. This approach has identified many loci involved in the genetic regulation of common, complex diseases.

Standard eQTL methods are limited in power and accuracy by several phenomena common to genomic datasets. First, the correlation structure of genetic variation in the genome, known as linkage disequilibrium (LD), limits the ability of these methods to differentiate between the regulatory variant and neighboring variants that are in LD. Second, like other quantitative traits, the total expression of a gene is influenced by multiple genetic and environmental factors. The effect size for any given variant is therefore small, and standard methods require a large sample size to identify the effect.


ASE example and corresponding mathematical representation of three individuals (1, 2, 3). We assume that the third SNP is the causal SNP site affecting the differential gene expression level (Allele A/ Allele T).

Our forthcoming paper in Genetics presents a new method that improves the accuracy and computational power of eQTL mapping by incorporating allele specific expression (ASE) analysis. Our novel method uses genome sequencing, alongside measurements of ASE from RNA-seq data, to identify cis-acting regulatory variants.

In standard eQTL studies, the analysis of ASE is influenced by LD structure and the amount of allelic heterogeneity present in the genome. Individual effects appear weak because the effect of a variant is modest compared to the variance of total expression. In our approach, the genotypes of each individual with ASE provide information useful for determining which variants are causal for the observed ASE. Our approach leverages the relationship between LD and variant identification to map the variants affecting expression. Thus, analysis of ASE is advantageous over analysis of total expression levels, the standard approach to eQTL mapping.

We demonstrate the utility of our method by analyzing RNA-seq data from 77 unrelated northern and western European individuals (CEU). To map each gene, we simultaneously compare ASE measurements across a set of sequenced individuals. We then identify genetic variants that are in proximity to the gene and capable of explaining the observed patterns of ASE. We characterize the efficacy of the method with the “reduction rate”: the fraction of SNPs in the proximal region of the gene that are ruled out as candidate regulatory SNPs, that is, one minus the ratio of the number of candidate regulatory SNPs to the total number of SNPs in the region.

When applied to the CEU dataset, our method reduced the set of candidate SNPs from ten to two (a reduction rate of 80%). Allowing for one error increases the number of candidate SNPs to five and decreases the reduction rate to 50%. We also observe that the relationship between LD and variant identification has a different quality in ASE mapping when compared to eQTL studies, and produces different types of information useful to eQTL mapping studies.
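The filtering idea and the reduction-rate arithmetic can be sketched as follows. The sketch rests on an assumption that is not spelled out in this post: a SNP can explain a gene's ASE pattern if an individual is heterozygous at that SNP exactly when the individual shows ASE, with `max_errors` individuals allowed to break the rule. The function names and the toy genotypes are hypothetical.

```python
def candidate_snps(het, ase, max_errors=0):
    """Return indices of SNPs consistent with the observed ASE pattern.

    het[i][j] is True if individual i is heterozygous at SNP j.
    ase[i] is True if individual i shows allele specific expression.
    A SNP is kept if at most max_errors individuals disagree with the
    assumed rule "heterozygous if and only if ASE is observed".
    """
    n_snps = len(het[0])
    keep = []
    for j in range(n_snps):
        mismatches = sum(1 for i in range(len(ase)) if het[i][j] != ase[i])
        if mismatches <= max_errors:
            keep.append(j)
    return keep

def reduction_rate(n_candidates, n_total):
    """Fraction of SNPs in the region ruled out as candidates."""
    return 1 - n_candidates / n_total

# Toy data: 4 individuals, 5 SNPs near one gene.
ase = [True, False, True, False]
het = [
    [True,  True,  True,  False, True],
    [False, True,  False, False, True],
    [True,  False, True,  True,  False],
    [False, False, False, False, False],
]

strict = candidate_snps(het, ase)
relaxed = candidate_snps(het, ase, max_errors=1)
print(strict, reduction_rate(len(strict), 5))
print(relaxed, reduction_rate(len(relaxed), 5))
```

With the figures quoted above, `reduction_rate(2, 10)` gives 80% and `reduction_rate(5, 10)` gives 50%. In the toy data, SNPs 0 and 2 have identical genotypes across individuals, so the filter cannot separate them; that is how LD limits resolution even in ASE mapping.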

ASE studies are a powerful approach to identifying associations between genetic variation and gene expression. Accurate measurement of ASE can identify cis-acting regulatory variants associated with common diseases. Our novel method for ASE mapping is based on a robust and computationally efficient non-parametric approach, and we hope it advances our understanding of functional risk alleles and facilitates development of new hypotheses for the causes and treatment of common diseases.

This project used software developed by Jennifer Zou, which is available for download at:

This project was led by Eun Yong Kang and involved Serghei Mangul, Buhm Han, and Sagiv Shifman. The article is available at:

The full citation to our paper is:


A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping

Meta-analyses of genome-wide association studies (GWASs) have become essential to identifying new loci associated with human diseases. We recently developed a novel framework that improves the accuracy and power of meta-analyses, which we describe in our recent Human Molecular Genetics paper. This framework can be applied to the fixed effects (FE) model, which assumes that effect sizes of genetic variants are constant across studies, and the random effects (RE) model, which assumes that effect sizes can be different among studies.

Almost all GWAS publications today employ meta-analysis methodologies, the majority of which assume that component studies are independent and that individuals among studies are unrelated. Yet many studies today use shared controls to reduce genotyping or sequencing cost. These “shared control” individuals can inadvertently overlap between multiple studies and, if not accounted for in the methodology, induce false associations in GWAS results. Most meta-analysis methods, including the RE model, cannot account for these overlapping subjects.

In our paper, we propose a general framework for adjusting association statistics to account for overlapping subjects within a meta-analysis. The key idea of our method is to transform the covariance structure of the data so it can be used in methods that strictly assume independence between studies. Specifically, our method decouples dependent studies into independent studies and adjusts association statistics to account for uncertainties in dependent studies. As a result, our approach enables general meta-analysis methods, including the FE and RE models, to account for overlapping subjects. Existing pipelines implementing these models can be reused for dependent studies if our framework is applied at the front end of the analysis procedure.
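A minimal sketch of the decoupling idea follows. It assumes that decoupling replaces the covariance matrix Ω of the study statistics with a diagonal matrix whose inverse-variance weights are the row sums of the inverse of Ω. The post does not state this formula, so treat it as an assumption. The effect sizes, standard errors, and the correlation between studies A and B are hypothetical, and the helper names are introduced only for this example.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def decouple(omega):
    """Assumed decoupling: per-study variances 1 / (row sums of omega^-1).

    Extreme correlations could make a row sum non-positive; this sketch
    does not handle that case.
    """
    weights = solve(omega, [1.0] * len(omega))
    return [1.0 / w for w in weights]

def fixed_effects(betas, variances):
    """Standard inverse-variance FE meta-analysis: (estimate, SE, z)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se

# Hypothetical studies A, B, C; A and B share controls (correlation 0.4).
betas = [0.12, 0.10, 0.08]
ses = [0.04, 0.05, 0.06]
corr = [[1.0, 0.4, 0.0], [0.4, 1.0, 0.0], [0.0, 0.0, 1.0]]
omega = [[corr[i][j] * ses[i] * ses[j] for j in range(3)] for i in range(3)]

naive = fixed_effects(betas, [s * s for s in ses])  # ignores the overlap
decoupled_var = decouple(omega)
adjusted = fixed_effects(betas, decoupled_var)      # unchanged FE code
print("naive:", naive)
print("decoupled:", adjusted)
```

The same `fixed_effects` function is applied before and after decoupling; only its variance inputs change, which mirrors the point that existing pipelines can be reused. In this example the overlapping studies A and B get larger variances after decoupling, while the independent study C is unchanged.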


A simple example of our decoupling approach. Ω and ΩDecoupled are the covariance matrices of the statistics of three studies A, B and C before and after decoupling, respectively. The thickness of the edges denotes the amount of correlation between the studies. After decoupling, the size of the nodes reflects the information that the studies contain in terms of the inverse variance.

We tested our framework for accuracy and power with five simulated datasets, each containing 1000 to 5000 individuals and 10,000 shared controls. A standard approach produced an inflated number of false positives. Our decoupling method, which systematically accounts for overlapping individuals in meta-analysis, and a standard splitting method, which splits controls into individual studies, both correctly controlled type I errors. The advantage of our framework is apparent when assessing power; in one scenario, the decoupling method gained 25% power over the splitting method when accounting for overlapping subjects.

Next, we assessed the potential of our framework in identifying causal loci shared by multiple diseases and leveraging information from multiple tissues to increase power for eQTL identification. The decoupling and splitting methods controlled false-positive rates and produced significant p-values at several previously identified candidate shared loci among the three autoimmune conditions present in the Wellcome Trust Case Control Consortium (WTCCC) data. In comparison to the splitting method, our decoupling framework increased the significance of p-values in the shared loci test and increased the number of discovered eQTLs by 19%.

Our approach is flexible and allows many meta-analysis methods, such as the RE model, to account for dependency between studies and overlapping subjects. We developed this approach to complement standard software packages in the meta-analysis of GWAS. This project was led by Buhm Han and involved Dat Duong and Jae Hoon Sul. The article is available at:

The full citation to our paper is:
