Transfering Odds Ratios to another SNP with high LD?

Transfering Odds Ratios to another SNP with high LD?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I have the Odds Ratios (OR) for a particular risk allele in a SNP (I'll call it SNP1). That SNP was unfortunately not genotyped in my data, but I don't want to throw it away. I have looked up the SNP on LDLink and another SNP (I'll call it SNP2) is in high linkage disequilibrium with SNP1 (R2=0.94).

How do I transfer the OR from the SNP1 allele to the corresponding SNP2 allele?

I know it is something similar to log(OR)*sqrt(R2) but I'm not entirely sure. Could somone please clarify if this is the correct formula?

The genomic signature of trait-associated variants

Genome-wide association studies have identified thousands of SNP variants associated with hundreds of phenotypes. For most associations the causal variants and the molecular mechanisms underlying pathogenesis remain unknown. Exploration of the underlying functional annotations of trait-associated loci has thrown some light on their potential roles in pathogenesis. However, there are some shortcomings of the methods used to date, which may undermine efforts to prioritize variants for further analyses. Here, we introduce and apply novel methods to rigorously identify annotation classes showing enrichment or depletion of trait-associated variants taking into account the underlying associations due to co-location of different functional annotations and linkage disequilibrium.


We assessed enrichment and depletion of variants in publicly available annotation classes such as genic regions, regulatory features, measures of conservation, and patterns of histone modifications. We used logistic regression to build a multivariate model that identified the most influential functional annotations for trait-association status of genome-wide significant variants. SNPs associated with all of the enriched annotations were 8 times more likely to be trait-associated variants than SNPs annotated with none of them. Annotations associated with chromatin state together with prior knowledge of the existence of a local expression QTL (eQTL) were the most important factors in the final logistic regression model. Surprisingly, despite the widespread use of evolutionary conservation to prioritize variants for study we find only modest enrichment of trait-associated SNPs in conserved regions.


We established odds ratios of functional annotations that are more likely to contain significantly trait-associated SNPs, for the purpose of prioritizing GWAS hits for further studies. Additionally, we estimated the relative and combined influence of the different genomic annotations, which may facilitate future prioritization methods by adding substantial information.


Breast cancer is a partially heritable disease. Mutations in several high-penetrance genes including BRCA1 [1, 2], BRCA2 [3], and others [4] are associated with high risk of breast cancer among carriers and explain a fraction of the heritability. Genome-wide association studies (GWAS) have identified over 180 common single nucleotide polymorphisms (SNPs) associated with risk of breast cancer [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. The majority of these SNPs were identified in European ancestry and East Asian ancestry populations, although some unique SNPs have been identified in African American populations [21] and in Latina populations [22, 23].

Several GWAS studies have identified SNPs at 6q25 that are associated with breast cancer risk [13, 18, 20, 23,24,25,26,27] and mammographic density [23, 27,28,29,30]. The initial report identified a SNP in the intergenic region between ESR1 and CDCC170 in an East Asian population [24]. The locus was then confirmed in other populations and several additional variants were identified [11, 18, 25, 26, 31]. More recently, a fine-mapping and functional approach at this locus identified five distinct common variants associated with risk of different subtypes of breast cancer [27].

Hispanic/Latino populations are the second largest ethnic group in the USA [32] and yet have been understudied in GWAS [33]. Latinos are a population of mixed ancestry with European, indigenous American and African ancestral contributions [34,35,36,37]. Since there are no large studies of breast cancer in indigenous American populations, studies in Latinos may identify novel variants associated with breast cancer that are unique to or substantially more common in this population. We have previously used an admixture mapping approach to search for breast cancer susceptibility loci in Latinas and identified a large region at 6q25 where indigenous American ancestry was associated with decreased risk of breast cancer [22]. Subsequently, we identified a SNP (rs140068132) that was common (minor allele frequency

0.1) only in Latinas with indigenous American ancestry and was associated with substantially lower risk of breast cancer, particularly estrogen receptor (ER)-negative breast cancer, and with lower mammographic density [23]. However, the variant we identified did not completely explain the risk associated with locus-specific ancestry at 6q25 in Latinas, suggesting that other variants may account for this risk. We set out to fine-map and identify additional variants at 6q25 associated with breast cancer risk among Latinas.


Overview of the methods

Let y be the liability of a disease on the logit scale, x be a risk factor in standard deviation (SD) units and z be the genotype of a SNP (coded as 0, 1, or 2). The MR estimate of the causal effect of risk factor on disease 9 is (hat b_ = hat b_/hat b_) , where b zy is the effect of z on y on the logit scale (logarithm of odds ratio, logOR), b zx is the effect of z on x, and b xy is the effect of x on y free of confounding from non-genetic factors (note that b xy can be approximately interpreted as logOR see below). SMR is a flexible and powerful MR approach that is able to estimate and test the significance of b xy using the estimates of b zx and b zy from independent samples 17 . If there are multiple independent (or nearly independent) SNPs associated with x and the effect of x on y is causal, then all the x-associated SNPs will have an effect on y through x (Fig. 1a). In this case, b xy at any of the x-associated SNPs is expected to be identical in the absence of pleiotropy 13,16,22 as all the SNP effects on y are mediated by x (Fig. 1b). Therefore, increased statistical power can be achieved by integrating the estimates of b xy from all the x-associated SNPs using a generalized least squares (GLS) approach (Methods). The GSMR method essentially implements SMR analysis for each SNP instrument individually, and then integrates the b xy estimates of all the SNP instruments by GLS, accounting for the sampling variance in both (hat b_) and (hat b_) for each SNP and the LD among SNPs. It is important to note that in accordance with one of the basic assumptions for MR 9 , only the SNPs that are strongly associated with the risk factor should be used as the instruments for MR analyses including GSMR. We demonstrate using simulations (Supplementary Note 1) that if we use independent SNPs that are associated with the exposure at P < 5×10 −8 , there is no inflation in the GSMR test-statistics under the null hypothesis that b xy = 0 (Supplementary Fig. 1a), that the estimate of b xy by GSMR is unbiased under the alternative hypothesis that b xy≠0 (Supplementary Table 1), and that b xy approximately equals to logOR (where OR is the effect of risk factor on disease in observational study without confounding) (Supplementary Fig. 2). GSMR accounts for LD if the SNP instruments are not fully independent. This is demonstrated by the simulation that in the presence of LD the test-statistic is well calibrated under the null (Supplementary Fig. 1b) and that the estimate of b xy is unbiased under the alternative (Supplementary Table 1). In comparison with the existing methods that use summary data to make causal inference 12,13,16,18 , GSMR is more powerful as demonstrated by simulation (Supplementary Fig. 3) because GSMR accounts for the sampling variance in both (hat b_) and (hat b_) while the other approaches assume that b zx is estimated without error.

Leveraging multiple independent genetic instruments (z) to test for causality. Shown in panel a is a schematic example that if an exposure (x) has an effect on an outcome (y), any instruments (SNPs) causally associated with x will have an effect on y, and the effect of x on y (b xy) at any of the SNPs is expected to be identical. This is further illustrated in a toy example in panel b that under a causal model, for the SNPs associated with x, the estimated effect of z on y ( (hat b_) ) should be linearly proportional to the estimated effect of z on x ( (hat b_) ) and the ratio between the two is an estimate of the mediation effect of x on y, i.e., (hat b_ = hat b_/hat b_)

Pleiotropy is an important potential confounding factor that could bias the estimate and often results in an inflated test-statistic in a MR analysis 9,10,13,19 . We propose a method (called HEIDI-outlier) to detect pleiotropic SNPs at which the estimates of b xy are significantly different from expected under a causal model, and remove them from the GSMR analysis (Methods). The power of detecting a pleiotropic SNP depends on the sample sizes of the GWAS data sets and the deviation of (hat b_) estimated at the pleiotropic SNP from the causal model. We have demonstrated by simulation based on a causal model with pleiotropy that the power of HEIDI-outlier is high especially when the pleiotropic effects are large (Supplementary Fig. 4a). There are certainly pleiotropic outliers (e.g., those with very small effects) not detected by HEIDI-outlier. Nevertheless, these undetected pleiotropic effects do not seem to bias the GSMR estimate (Supplementary Fig. 4b), in contrast to a small bias in the estimate from Egger regression (MR-Egger) which is thought to be free of confounding from pleiotropy 13 . Our simulation results also show that the GSMR estimate of b xy is not significantly different from zero under a pleiotropic model without causal effect in the presence or absence of LD (Supplementary Table 2).

We further develop an approximate method (called mtCOJO URLs) that only requires summary data to conduct a GWAS analysis for a phenotype conditional on multiple covariate phenotypes (Methods). The purpose of developing this method is to estimate the effect of a risk factor on disease adjusting for other risk factors (Methods Supplementary Note 2 Supplementary Fig. 5), which helps to infer whether the marginal effect of the risk factor on disease depends on other risk factors, and to predict the joint effect of multiple risk factors on disease. It is of note that mtCOJO is free of bias due to shared environmental or genetic effect between the phenotype and covariate as described in Aschard et al. 23 (Supplementary Fig. 6).

The effects of seven health risk factors on common diseases

We applied the methods to test for causal associations between seven health risk factors and common diseases using data from multiple large studies. The risk factors are BMI, waist-to-hip ratio adjusted for BMI (WHRadjBMI), HDL cholesterol (HDL-c), LDL-c, triglycerides (TG), systolic blood pressure (SBP), and diastolic blood pressure (DBP). We chose these risk factors because of the availability of summary-level GWAS data from large samples (n = 108,039–322,154) (Supplementary Table 3). We accessed data for BMI, WHRadjBMI, HDL-c, LDL-c and TG from published GWAS 24,25,26 , and data for SBP and DBP from the subgroup of UK Biobank (UKB) 27 with genotyped data released in 2015. We selected SNPs at a genome-wide significance level (PGWAS < 5 × 10 –8 ) using the clumping algorithm (r 2 threshold = 0.05 and window size = 1 Mb) implemented in PLINK 28 (Methods). Note that the GSMR method accounts for the remaining LD not removed by the clumping analysis. There were m = 84, 43, 159, 141, 101, 28, and 29 SNPs for BMI, WHRadjBMI, HDL-c, LDL-c, TG, SBP and DBP, respectively, after clumping. These SNP instruments are nearly independent as demonstrated by the distribution of LD scores computed from the instruments for each trait (Supplementary Fig. 7). We only included in the analysis the near-independent SNPs for the ease of directly comparing the results from GSMR with those from other methods that do not account for LD (e.g., MR-Egger). Our simulation result suggests that the gain of power by including SNPs in LD is limited (Supplementary Fig. 8). Moreover, although the GSMR approach accounts for LD, including many SNPs in moderate to high LD often results in the V matrix being non-invertible (Methods).

The summary-level GWAS data for the diseases were computed from two independent community-based studies with individual-level SNP genotypes, i.e., the Genetic Epidemiology Research on Adult Health and Aging 29 (GERA) (n = 53,991) and the subgroup of UKB 27 (n = 108,039). We included in the analysis 22 common diseases as defined in the GERA data, and added an additional phenotype related to comorbidity by counting the number of diseases affecting each individual (i.e., disease count) as a crude index to measure the general health status of an individual (Supplementary Table 4). We performed genome-wide association analyses of the 23 disease phenotypes in GERA and UKB separately (Methods). We assessed the genetic heterogeneity of a disease between the two cohorts by a genetic correlation (rg) analysis using the bivariate LD score regression (LDSC) approach 30 . The estimates of rg across all diseases varied from 0.75 to 0.99 with a mean of 0.91 (Supplementary Table 4), suggesting strong genetic overlaps for the diseases between the two cohorts. We therefore meta-analyzed the data of the two cohorts to maximize power using the inverse-variance meta-analysis approach 31 . Because OR is free of the ascertainment bias in a case–control study, the effect size (logOR) of a SNP on disease in the general population can be approximated by that from a case–control study assuming that disease in the case–control study is defined similarly as that in the general population. Therefore, GSMR can be applied to data with SNP effects on the risk factor from a population-based study and SNP effects on the disease from an ascertained case–control study, and the estimated causative effect of risk factor on disease should be interpreted as that in the general population. We therefore included in the analysis summary data for 11 diseases from published case–control studies (n = 18,759–184,305) (Supplementary Table 5). The estimated SNP effects and standard errors (SE) for age-related macular degeneration (AMD) were not available in the summary data 32 , which were estimated from z-statistics using an approximate approach (Supplementary Note 3).

We applied the HEIDI-outlier approach to remove SNPs that showed pleiotropic effects on both risk factor and disease, significantly deviated from a causal model (Methods). The LD correlations between pairwise SNPs were estimated from the Atherosclerosis Risk in Communities (ARIC) data 33 (n = 7703 unrelated individuals) imputed to 1000 Genomes (1000G) 34 . Using the large data sets described above, we identified from GSMR analyses 45 significant causative associations between risk factors and diseases (Supplementary Data 1 Fig. 2). We controlled the family-wise error rate (FWER) at 0.05 by Bonferroni correction for 231 tests (PGSMR threshold = 2.2 × 10 −4 ). For method comparison, we have also performed the analyses with MR-Egger 13 and the methods in Pickrell et al. 16 (Supplementary Data 2).

Putative causal associations between seven modifiable risk factors and common diseases. Shown are the results from GSMR analyses with disease data a from a meta-analysis of two community-based studies (GERA and UKB) and b from published independent case–control studies. Colors represent the effect sizes (as measured by odds ratios, ORs) of risk factors on diseases, red for risk effects and blue for protective effects. The significant effects after correcting for 231 tests (PGSMR < 2.2 × 10 −4 ) are labeled with ORs (P-values). The nominally significant effects (PGSMR < 0.05) are labeled with “*”

Obesity and common diseases

Results from analyses of the community-based data showed that BMI had risk effects on T2D (odds ratio, OR = 3.29), hypertensive disease (OR = 1.85), dermatophytosis (i.e., tinea) (OR = 1.67), peripheral vascular diseases (PVD) (OR = 1.59), osteoarthritis (OR = 1.50), dyslipidemia (OR = 1.37), asthma (OR = 1.35), and CVD (OR = 1.30). The risk effects of BMI on T2D, CVD, and hypertensive disease have been confirmed by RCT 35 (Supplementary Data 1), providing proof-of-principle validation. The interpretation of OR(BMI→T2D) = 3.29 is that people whose BMI are 1 SD (SD = 3.98 for BMI in European men corresponding to

12 kg of weight for men of 175 cm stature see Supplementary Table 6 for the SD of the risk factors) above the population mean will have 3.29 times increase in risk to T2D compared with the population prevalence (

8% in the US). It is interesting to note that the estimate of b xy at the TCF7L2 locus strongly deviated from those at the other loci (Fig. 3), suggesting that the TCF7L2 SNP has pleotropic effects on BMI and T2D. The TCF7L2 SNP was detected as an outlier by the HEIDI-outlier method and removed from the GSMR analysis. In addition, the risk effect of BMI on asthma is in line with the result from a recent MR study (using a weighted genetic allele score as the instrument) that higher BMI increases the risk of childhood asthma 36 . Moreover, we identified a protective effect of BMI against osteoporosis (OR = 0.68), consistent with the observed associations in previous studies 37,38 . The estimated risk effect of BMI on T2D in the community data (OR = 3.29) was similar to that in the case–control data (OR = 3.12, Fig. 2b and Supplementary Data 1). We also observed a strong risk effect of BMI on coronary artery disease (CAD) in the case–control data (OR = 1.70), in line with the risk effect of BMI on CVD (OR = 1.30) in the community data.

GSMR analysis to test for the effect of BMI on T2D with and without filtering the pleiotropic outliers. Shown in a and b are the plots of effect sizes and association P-values of all the genetic instruments from GWAS for BMI vs. those for T2D. Shown in c is the plot of b xy vs. GWAS P-value of BMI at each genetic variant. Shown in d, e, and f are the plots for the instruments after the pleiotropic outliers being removed by the HEIDI-outlier approach (see Methods for details of the HEIDI-outlier approach). Error bars in a and d represent the standard errors. The dashed lines in b and e represent the GWAS threshold P-value of 5 × 10 −8 . The coordinates in b, c, e, and f are truncated at 50 for better graphic presentation

Being overweight is a risk factor for general health outcomes as indicated by its risk effect on disease count ( (hat b_ = 0.41) ) in the community data. The question is then how b xy for disease count should be interpreted. We have shown in Supplementary Fig. 9 that the estimate of b xy for disease status (a dichotomous phenotype to indicate whether an individual is affected by any of the 22 diseases) was very similar to that for disease count. Although disease status and disease count are two distinct phenotypes and the analysis of disease count is more powerful, for the ease of interpretation, b xy for disease count can be approximately interpreted as logOR for disease status. Hence, (hat b_ = 0.41) for disease count is approximately equivalent to OR = 1.51 for disease status, meaning an increase of BMI by 1 SD will increase the probability of being affected by any of the 22 diseases by a factor of

1.5. In addition, we found that the effects of WHRadjBMI and BMI on disease were largely concordant (Supplementary Fig. 10a Supplementary Note 4).

Serum cholesterol levels and common diseases

LDL-c is a known causative risk factor for CAD as confirmed by RCTs 6,7 . We found that LDL-c had a significant risk effect on dyslipidemia (OR = 3.36) and CVD (OR = 1.22) in the community data, and CAD (OR = 1.50) in the case–control data (Fig. 2). TG had a significant risk effect on dyslipidemia (OR = 2.09), hypertensive disease (OR = 1.24) and CVD (OR = 1.14) in the community data, and CAD (OR = 1.33) in the case–control data (Fig. 2). The effects of TG on diseases were largely consistent with those for LDL-c (Supplementary Fig. 10b), despite the modest phenotypic correlation between the two traits (r p = 0.19 in the ARIC data). Both LDL and TG had significant risk effects on disease count in the community data (Fig. 2).

There was another example where the HEIDI-outlier approach detected strong effects due to pleiotropy. The effect of LDL-c on Alzheimer’s disease (AD) was highly significant without HEIDI-outlier filtering (OR = 1.35 and PGSMR = 7.8 × 10 −16 ) (Fig. 4). The HEIDI-outlier analysis flagged 16 SNPs, 12 of which are located in the APOE gene region (LD r 2 among these SNPs < 0.05) and all of which had highly significant effects on both LDL-c and AD. Excluding these SNPs makes a more conservative GSMR test because if there is a true causal relationship of increased LDL-c with AD, then the GSMR test should remain significant based on evidence from other LDL-c associated SNPs. In fact, after removing the 16 pleiotropic SNPs, the estimated effect of LDL-c on AD was not significant (OR = 1.03, PGSMR = 0.47). Nevertheless, the multiple pleiotropic signals clustered at the APOE locus are worth further investigation (Supplementary Fig. 11).

GSMR analysis to test for the effect of LDL-c on Alzheimer’s disease (AD) with and without pleiotropic outliers. Shown in a and b are the plots of effect sizes and association P-values of the original set of instruments from GWAS for LDL-c vs. those for AD. Shown in c is the plot of b xy vs. GWAS P-value of LDL-c at each genetic variant. Shown in d, e, and f are the plots for the instruments after the pleiotropic outliers being removed by the HEIDI-outlier approach (see Methods for details of the HEIDI-outlier approach). Error bars in a and d represent the standard errors. The dashed lines in b and e represent the GWAS threshold P-value of 5 × 10 −8 . The coordinates in b, c, e, and f are truncated at 50 for better graphic presentation

We identified a significant protective effect of LDL-c against T2D (OR = 0.84, PGSMR = 1.1 × 10 −4 ) in the case–control data, which might explain the observation from a previous study that lowering LDL-c using statin therapy is associated with a slightly increased risk of T2D 39 . The estimate was not significant in the community data (likely due to the lack of power) but in a consistent direction (OR = 0.95, PGSMR = 0.08). Given the strong genetic correlation between the two T2D data sets (rg = 0.98, SE = 0.062) as estimated by the bivariate LDSC analysis 30 , we meta-analyzed the two data sets using the inverse-variance approach, and performed the GSMR analysis to re-estimate the effect of LDL-c on T2D using the T2D meta-analysis data. The effect size was highly significant (OR = 0.88, PGSMR = 3.0 × 10 −7 ).

The consequences of HDL-c on health outcomes are controversial 40 . Observational studies suggest that HDL-c is associated with a reduced risk to CAD 41 , whereas genetic studies show that the effect of HDL-c on CAD is not significant conditional on LDL-c and TG 20,21 . We found that HDL-c had protective effects against T2D (OR = 0.83), hypertensive disease (OR = 0.88), CVD (OR = 0.88) and disease count (OR = 0.94) in the community data, and T2D (OR = 0.81) and CAD (OR = 0.84) in the case–control data. However, none of these effects remained significant conditioning on the other risk factors, suggesting that the marginal effects of HDL-c on diseases are dependent of the other risk factors (see below for details of the results from conditional analyses). The effect of HDL-c on dyslipidemia is negative ( (hat b_ = - 0.21) and OR = 0.81), which is obvious because one of the diagnostic criteria for dyslipidemia is an abnormally low level of HDL-c. In addition, there was a highly significant risk effect (OR = 1.36) of HDL-c on age-related macular degeneration (AMD) in the case–control data, consistent with the result from a recent MR study 42 . The associations between lipids and AMD are controversial and results from different observational studies are inconsistent 43 . Our results support the observations that increased HDL-c is associated with increased risk of AMD 43,44,45 . It should be noted that LDL-c and TG also appeared to be associated with AMD before HEIDI-outlier filtering but the effects were not significant after HEIDI-outlier filtering (Supplementary Fig. 12), implying that the observed association between LDL-c (or TG) and AMD in epidemiological studies 43 might be due to pleiotropy.

Blood pressure and common diseases

We identified significant risk effects of SBP on hypertensive disease (OR = 4.38), dyslipidemia (OR = 1.50), CVD (OR = 1.40) and disease count (OR = 1.43) in the community data, and CAD (OR = 1.73) in the case–control data. The results for SBP and DBP were highly concordant (Fig. 2 Supplementary Fig. 10c). The risk effect of blood pressure on CAD is known to be causal as confirmed by RCTs 46,47 . Note that the power of the GSMR analysis for blood pressure was likely to be limited given the small number of instruments used (m < 30).

Conditional effects of risk factors on diseases

We have identified (from the analyses above) 45 significant causal associations between health risk factors and diseases (Fig. 2). As the risk factors are not independent, we further sought to estimate the effect of a risk factor on a disease adjusting for other risk factors. To do this, we first investigated the causal associations among the risk factors. We detected 19 significant associations by the GSMR analysis among the 7 risk factors at a FWER of 0.05 (PGSMR < 1.2 × 10 −3 ) (Supplementary Fig. 13). For example, BMI had a significant negative effect on HDL-c ( (hat b_ = - 0.29) ), and positive effects on TG ( (hat b_ = 0.28) ) and DBP ( (hat b_ = 0.15) ).

We developed an approach called mtCOJO (multi-trait-based conditional and joint analysis URLs) to perform a GWAS analysis for a trait conditioning on other traits using GWAS summary data (Methods Supplementary Fig. 5). We then re-ran the GSMR analysis using the adjusted GWAS summary data from the mtCOJO analysis (Methods). The mtCOJO analysis requires the estimates of b xy of the covariate risk factors on the target risk factor and disease, rg among the covariate risk factors, SNP-based heritability ( (h_<>>^2) ) for the covariate risk factors, and sampling covariance between SNP effects estimated from potentially overlapping samples, all of which can be computed from summary data (Methods Supplementary Tables 7–10). Given the similar GSMR results between BMI and WHRadjBMI and between SBP and DBP (Supplementary Fig. 10), we did not include DBP and WHRadjBMI in the conditional analysis to avoid over-correction.

Results from conditional analyses were largely consistent with those from unconditional analyses (Fig. 5 Supplementary Table 11), suggesting that most of the marginal effects are independent of the other risk factors analyzed in this study. Conditioning on the other risk factors, SBP, LDL-c and BMI were the three major risk factors for CAD, BMI was still a large risk factor for T2D and the protective effect of LDL-c on T2D remained largely unchanged (Supplementary Fig. 14). We show above that the GSMR analyses identified significant protective effects of HDL-c against CVD, CAD, T2D and hypertension (Supplementary Fig. 15). However, all the effects became non-significant conditioning on the covariates (i.e., BMI, LDL-c, TG, and SBP), suggesting that the marginal effects of HDL-c on the diseases are not independent of the covariates due to the bidirectional causative associations between HDL-c and the other risk factors as illustrated in Supplementary Fig. 13. It is difficult to distinguish whether the effects of HDL-c on the diseases are mediated or driven by the covariates (Supplementary Fig. 16) because of the complicated association network among risk factors and diseases (Supplementary Fig. 14). Nevertheless, there might be an exception, that is, the association between HDL-c and AMD, because HDL-c is the only risk that showed a significant effect on AMD (OR = 1.36 with PGSMR = 5.9 × 10 −16 ) and the effect size remained largely unchanged and highly significant conditioning on the covariates (conditional OR = 1.36 with PGSMR = 5.1 × 10 −13 ). We conclude that HDL-c is likely to be a direct risk factor for AMD and the effect size is independent of the covariate risk factors analyzed in this study.

GSMR vs. conditional GSMR. Shown are the results from the GSMR analyses compared with those from the conditional GSMR analyses. In the conditional GSMR analysis, the effect size of each risk factor on disease was estimated conditioning on the other risk factors (see Methods for details of the conditional method). “Community”: disease GWAS data from a meta-analysis of the two community-based studies. “Case–control”: disease GWAS data from independent published case–control studies. In gray are the associations that do not pass the P-value threshold 2.2 × 10 −4 in the conditional analysis

Given the estimates from conditional GSMR analyses (Fig. 5 Supplementary Table 11), we could use an approximate approach to calculate the aggregate effect of multiple risk factors on a disease, i.e., (log left( <>> ight) = <[x_ilog left( <>_i> ight)]>) . Here is a hypothetical example. If all the risk factors increase by 1 SD (i.e.,

19 mm Hg for SBP), we would have an increased risk of

2.3-fold to T2D (e 1.01−0.17 ), and 4.5-fold to CAD (e 0.41+0.47+0.14+0.48 ).

Effects of other phenotypes on diseases

Having identified a number of causal associations between seven modifiable risk factors and common diseases, we then sought to test whether there were causative associations between other phenotypes and diseases. We included in the analysis two traits, height 48 and years of schooling 49 (EduYears), for which there were a large number of instruments owing to the large GWAS sample sizes. We selected 811 and 119 near-independent genome-wide significant(GWS) SNPs for height and EduYears, respectively, using the clumping analysis (Methods). The threshold PGSMR after Bonferroni correction was 7.6 × 10 −4 correcting for 66 tests. The large number of instruments for height gave us sufficient power to detect a small effect (Fig. 6 Supplementary Table 12 Supplementary Note 5).

Effects of height and educational attainment on common diseases. Shown are the results from GSMR analyses with disease data a from a meta-analysis of the GERA and UKB studies and b from published independent case–control studies. Colors represent the effect sizes (as measured by odds ratios, ORs) of risk factors on diseases, red for risk effects and blue for protective effects. The significant effects after correcting for multiple testing (PGSMR < 7.6×10 −4 ) are labeled with ORs (P-values). The nominally significant effects (PGSMR < 0.05) are labeled with “*”

Our results also showed that EduYears had protective effects against almost all the diseases (Fig. 6 and Supplementary Table 12). It showed protective effect against PVD (OR = 0.54), hypertensive diseases (OR = 0.62), T2D (OR = 0.64), dyslipidemia (OR = 0.71) and CVD (OR = 0.73) in the community data, and RA (OR = 0.44), AD (OR = 0.61) and CAD (OR = 0.63) in the case–control data. It also showed significant protective effect on disease count (OR = 0.74), suggesting that educational attainment is protective for general health outcomes. The protective effect of EduYears against AD is consistent with the observed association from epidemiological studies 50 . On the other hand, however, EduYears showed a strong risk effect on autism spectrum disorder (OR = 2.30) (Supplementary Note 6), which is not influenced by SNP outliers (Supplementary Fig. 17) and consistent with a positive estimate of genetic correlation (r g = 0.28, SE = 0.038) from a bivariate LD score regression analysis 30 .

Reverse GSMR analysis

It is important to note that the causative associations identified from the GSMR analyses above are unlikely to be explained by reverse causality for two reasons. First, the individuals used in GWAS for risk factors were independent of the individuals used in GWAS for diseases (the only exception was that the blood pressure GWAS data set was part of the community-based disease GWAS data). Secondly, if the associations presented above are driven by reverse causality, we would expect to see strong association signals of the instruments with the diseases, which is not the case as demonstrated in Supplementary Fig. 18, an idea not too dissimilar to the asymmetry analysis that has been used to infer causality in a previous study 16,22 . Nevertheless, it is interesting to investigate the changes in risk factors after development of the diseases. To do this, we selected instruments for diseases from the disease GWAS data (i.e., GWS SNPs for the disease, hence the instruments used in the reverse-GSMR analysis were distinct from those used in the forward-GSMR analysis). The false positive rate of reverse-GSMR is well calibrated as demonstrated by simulation under the null that there is no reverse effect (Supplementary Fig. 19). We performed a reverse-GSMR analysis of the risk factors and diseases for which there was a significant association in the forward-GSMR analysis above (Supplementary Note 7). We identified 10 significant reverse effects (i.e., the effect of disease on risk factor) in the community data and 4 in the case–control data at a FWER of 0.05 (Preverse-GSMR < 1.0 × 10 −3 ) (Supplementary Table 13). The estimates of reverse effects were very small compared with those of the forward effects. To avoid an underpowered test, we limited the reverse-GSMR analysis to diseases with more than 10 instruments. Given the fact that some of the small estimates of reverse effects were highly significant (Supplementary Table 13), it is unlikely that the large difference in the estimated effect size between the forward and reverse analyses is due to the lack of power in the reverse analysis. We further confirmed by simulation that the GSMR estimate of b xy is unbiased irrespective of the sample size for the exposure (Supplementary Fig. 20). Interestingly, there were two cases where the estimated forward and reverse effects were in opposite directions, i.e., (hat b_<>> o >2>)> = 1.19) and (hat b_<>>2> o >)>> -0.07left(

> ight)) (hat b_<>> o >)> = 0.32) and (hat b_<>> o >)> = - 0.03) (left(

> ight)) , meaning that although BMI is risk factor for the two diseases, patients who have developed the diseases may tend to lose weight.

2 Data pre-processing

  • .ped files: The.ped file contains information on each study participant including family ID, participant ID, father ID, mother ID, sex, phenotype, and the full typed genotype. Here, each SNP is bi-allelic (i.e., only two nucleotides are observed at any given SNP across study participants) and coded as a pair of nucleotides (A, C, T, or G). Notably, the ordering in the pair is non-informative in the sense that the first alleles listed for each of the two SNPs are not necessarily on the same chromosome. file contains a row for each SNP with rsNumber (SNP) and corresponding chromosome (chr) and coordinate (BPPos) based on the current genome build.
  • .bim,.bed, and.fam files: The.bim file contains the same information as file as well as the two observed alleles at each SNP (A1 and A2) from the.ped file. It contains a row for each SNP and six columns, containing information for the chromosome number, rsNumber, genetic distance, position identifier, allele 1, and allele 2. The.bed file contains a binary version of the genotype data. This is the largest of the three files because it contains every SNP in the study, as well as the genotype at this SNP for each individual. The.fam file contains the participant identification information, including a row for each individual and six columns, corresponding the same columns described for the.ped file with the exception of the genotype data. Note that not all of these columns contain unique information. That is, in a population-based study of unrelated individuals, ‘family ID number’y and ‘individual ID number’ will be the same.
  • Clinical data file: An additional ascii.txt or.csv file is typically available, which includes clinical data on each study subject. The rows of this file represent each subject, and the columns correspond to available covariates and phenotypes. There may be redundancies in this file and the data contained in the columns labeled ‘sex’ and ‘phenotype’ in the.fam file.

2.1 Reading and formatting data in R (step 1)

In the data example provided, genotype information is available for 861,473 typed SNPs across n = 1401 individuals with available phenotype data.

As illustrated in Figure 1, once we have read in the the genotype and clinical information, we are ready to proceed with the next steps of the GWA data pre-processing. This involves two stages of filtering the data, at SNP and sample levels, respectively. Each of these is described in more detail in the succeeding texts, accompanied by the appropriate R code for implementation. We note again that the order of analysis may vary depending on whether a single GWA analysis is being performed (as described herein) or the analyst is preparing results to be incorporated into a larger meta-analysis that requires data harmonization across multiple studies. In the latter case, the following filtering steps (steps 2, 3, and 4) may be excluded or performed centrally after analysis (steps 7 and 8) as summary level data are combined across studies.

2.2 Single nucleotide polymorphism-level filtering – part 1 (step 2)

  • SNP-level filtering: call rate. The call rate for a given SNP is defined as the proportion of individuals in the study for which the corresponding SNP information is not missing. In the following example, we filter using a call rate of 95%, meaning we retain SNPs for which there is less than 5% missing data. More stringent cut points (e.g., less than 5%) may be employed in smaller sample settings.
  • SNP-level filtering: minor allele frequency (MAF). A large degree of homogeneity at a given SNP across study participants generally results in inadequate power to infer a statistically significant relationship between the SNP and the trait under study. This can occur when we have a very small MAF so that the large majority of individuals have two copies of the major allele. Here, we remove SNPs for which the MAF is less than 1%. In some instances, particularly small sample settings, a cut point of 5% is applied.

In the data example provided, we filter 203,287 SNPs based on call rate <0.95 and/or MAF <0.01.

2.3 Sample-level filtering (step 3)

  • Sample-level filtering: call rate. Similar to SNP-level filtering based on call rate, we exclude individuals who are missing genotype data across more than a pre-defined percentage of the typed SNPs. This proportion of missingness across SNPs is referred to as the sample call rate, and we apply a threshold of 95%. That is, individuals who are missing genotype data for more than 5% of the typed SNPs are removed. A new reduced dimension SnpMatrix genotype object is created, which incorporates this filter.
  • Sample-level filtering: heterozygosity. Heterozygosity refers to the presence of each of the two alleles at a given SNP within an individual. This is expected under HWE to occur with probability 2∗p∗(1 − p), where p is the dominant allele frequency at that SNP (assuming a bi-allelic SNP). Excess heterozygosity across typed SNPs within an individual may be an indication of poor sample quality, while deficient heterozygosity can indicate inbreeding or other substructure in that person 23 . Thus, samples with an inbreeding coefficient |F|=(1 − O/E) > 0.10 are removed, where O and E are respectively the observed and expected counts of heterozygous SNPs within an individual. Note that we calculate the expected counts for each individual based on the observed SNPs for that individual.

Sample-level filtering: cryptic relatedness, duplicates, and gender identity. Population-based cohort studies are often limited to unrelated individuals, and the generalized linear modeling approach described in step 7 (association analysis of typed SNPs) later assumes independence across individuals. Further discussion of alternative data structures and associated analysis tools is provided in Section 6. Importantly, in regional cohort studies (e.g., hospital-based cohort studies) of complex diseases, individuals from the same family can be recruited unintentionally. A common measure of relatedness (or duplication) between pairs of samples is based on identity by descent (IBD). An IBD kinship coefficient of greater than 0.10 may suggest relatedness, duplicates, or sample mixture. Typically, the individual of a related pair with lower genotype call rate is removed. We note that gender identity can also be checked at this stage to confirm that self-reported gender is consistent with the observed X and Y chromosomes however, in the data example provided, sex chromosomes are not available, and thus, an example of filtering on gender identity is not provided.

We begin by applying linkage disequilibrium (LD) pruning using a threshold value of 0.2, which eliminates a large degree of redundancy in the data and reduces the influence of chromosomal artifacts 6 . This dimension reduction step is commonly applied prior to both IBD analysis and PCA, applied in the succeeding texts for ancestry filtering, and results in large computational savings.

This reduces the number of SNPs from 658,186 at the end of step 2 to 72,812. Next, we calculate pairwise IBD distances to search for sample relatedness. A strategy is employed that iteratively removes subjects with the highest number of pairwise kinship coefficients >0.1.

In our example, none of the samples are filtered based on the IBD kinship coefficient >0.10.

Sample-level filtering: ancestry. PCA is one approach to visualizing and classifying individuals into ancestry groups based on their observed genetic makeup. We do this for two reasons: First, self-reported race and ethnicity can differ from clusters of individuals that are based solely on genetic information, and second, the presence of an individual not appearing to fall within a racial/ethnic cluster may be suggestive of a sample-level error. Note that we use the subset of 72,812 SNPs after LD pruning (step 3-c) as the input for the PCA. An alternative strategy to first-stage LD pruning, which also improves computational efficiency, is the ‘HapMap rooted’ analysis, which involves first performing PCA in a reference panel, for example, HapMap or 1000 Genomes, and then projecting the study sample onto the resulting space. This approach is not presented herein but can be implemented with existing functionalities of the Kinship-based INference for Gwas (KING) software 24 .

No additional samples are filtered based on visual inspection of PCA plots. Again, we expect this as the PennCATH data provided are pre-filtered.

2.4 Single nucleotide polymorphism-level filtering – part 2 (step 4)

SNP-level filtering: HWE. Violations of HWE can be an indication of the presence of population substructure or the occurrence of a genotyping error. While they are not always distinguishable, it is a common practice to assume a genotyping error and remove SNPs for which HWE is violated. If case-control status is available, we limit this filtering to analysis of controls as a violation in cases may be an indication of association. Departures from HWE are generally measured at a given SNP using a χ 2 goodness-of-fit test between the observed and expected genotypes. We remove SNPs for which the HWE test statistic has a corresponding p-value of less than 1 × 10 −6 in controls.

We filter out an additional 1,296 SNPs based on HWE p < 1×10 −6 in CAD controls. This results in 656,890 typed SNPs to be considered in the association analysis.


Data processing

To harmonize the set of genetic variants across all four datasets, we imputed the genotypes of all individuals in the four studies using the 1000G Phase 3 v5 as a common reference panel (Michigan Imputation Server [54]). Following imputation, only non-duplicated genetic variants with INFO score larger than 0.9 were retained. We filtered variants with Hardy-Weinberg Equilibrium (HWE) p values below 10 −5 , with missing genotype rate higher than 5%, and with minor allele frequency below 5% using PLINK v1.9 [55]. We used the remaining set of variants in all subsequent analyses unless otherwise noted. To exclude outlier individuals, we calculated genotype principal components (PCs) using smartpca [56]. Five outliers in the DICE dataset were identified and removed from downstream analyses.

To quantify gene expression levels, we used Kallisto [57] and summed the transcript per million (TPM) estimates of all GENCODE 19 [58] isoforms to obtain a gene-level TPM. The gene-level TPM were then scaled and quantile-quantile normalized as described before [17]. Gene expression principal components were calculated using the prcomp function in R . To quantify RNA splicing, RNA-seq reads were aligned to the hg19 reference gnome using STAR 2.6.0 [59] with the GENCODE 19 annotation. To avoid reads mapping with allelic bias, we used WASP [60] as implemented in STAR 2.6.0 by providing the corresponding genotype data. This is an important step as we found a substantial increase in the number of false positive splicing QTL due to allelic bias in read mapping. Indeed, when reads representing different alleles map to different regions of the genome, QTL mapping will be susceptible to identifying spurious associations between the alleles and read coverage at those genomic regions [23]. Exon-exon junctions were extracted using RegTools [61], and clustered and quantified using LeafCutter [23]. As expected, we observed that the number of exon-exon junctions identified in each sample is positively correlated with the sequencing depth in the DICE consortium (Figure S1). To harmonize quantification for splicing junction usage across cell types and datasets in all 18 immune cell types, clusters were merged and the merged union was used to re-calculate intron usage in all samples.

MashR analysis in the DICE dataset

To quantify the sharing of eQTLs and sQTLs in the DICE dataset, we followed the workflow provided by the authors of MashR ( that was previously described in [19]. Briefly, standard errors of QTL effect sizes were calculated from FastQTL nominal output, which were used together with effect sizes as the input for mash. To quantify the correlation structure of the null tests, 30% of all tests were randomly sampled (referred to as the “random” set). To obtain a confident set of QTLs for each feature (gene or intron), the SNP with the smallest P-value across all tested SNPs and all cell types were extracted for each feature. This resulted in a feature-by-sample matrix of effect sizes and their standard errors without missing values referred to as the “strong” set. For eQTLs, we included all protein coding genes. For sQTLs, we included all introns. Data-driven covariance matrices were calculated from the “strong” set. We then built a mash model using the “random” set with the exchange effects (EE) mode to estimate the priors. This model was then applied to the “strong” set to calculate the posterior mean effect sizes (mash effect sizes). Significant QTLs after mash analysis were feature-SNP pairs with local false sign rate (LFSR) below 0.05, as suggested by [19]. The level of QTLs sharing was quantified as both overall sharing and pairwise sharing. Overall, sharing was determined to be the number of cell types in which a given feature has a regulatory QTL (LFSR <0.05). Pairwise sharing was quantified both by magnitude and by sign. Share-by-magnitude between two cell types correspond to the proportion of QTLs that is significant in one of the cell types and posterior mean effect sizes differ by no more than twofold. Share-by-sign between two cell types correspond to the proportion of QTLs that was significant in one of the cell types and had the same sign. The 15 cell types in DICE were grouped into 6 cell groups based on the eQTL sharing-by-magnitude (see Fig. 2b).

Characterization of regulatory QTLs

To calculate the distance between eQTLs and their target genes, we defined the promoter of each gene as the region 2000 bp upstream and 500 bp downstream of TSS. We tested the enrichment of eQTLs in regulatory elements from Ensembl Regulatory Build and consensus ATAC-seq peak set from Calderon et al. [41]. We categorized all ATAC-seq peaks to be either an enhancer or a promoter based on whether they overlap with any promoter region (2000 bp upstream and 500 bp downstream of TSS). The observed and expected number of QTLs overlapping with each feature was estimated using the fenrich command from QTLtools [62], and the odds ratios of enrichment were calculated by supplying those number to Fisher’s exact test in R . We validated eQTLs from DICE in other datasets using π1 statistics [63], stratifying eQTLs by their levels of sharing across six cell groups estimated by mash (specific: in one cell group intermediate: 2–5 cell groups shared: 6 cell groups). The 95% confidence intervals of π1 was estimated using 1000 bootstraps (i.e., re-sampling DICE eQTLs with replacement).


COLOC Colocalization analyses were performed between eQTLs/sQTLs and 72 publicly available GWAS summary statistics for 11 autoimmune diseases (14 studies), namely, rheumatoid arthritis (RA) [64], Crohn’s disease (CD) [27, 30], ulcerative colitis (UC) [27, 30], inflammatory bowel disease (IBD) [27, 30], allergy and eczema (AE) [65], asthma, hay fever and eczema (allergy for short) [66], apoptotic dermatitis (ApD) [67], asthma [68, 69], systemic lupus erythematosus (SLE) [70] and multiple sclerosis [71]. We also collected 36 GWAS for blood-related traits [72], 11 GWAS related to heart functions and circulation system [73], and several other traits including type 2 diabetes (T2D) [74], Alzheimer’s disease (AD) [75], Parkinson’s disease (PD) [76], estimated glomerular filtration rate (eGFR) [77], height [78], and breast cancer survival [79] and other cancers/neoplasms [73]. We considered the 14 autoimmune and the 36 blood-related GWAS as immune GWAS, and the rest 22 GWAS as non-immune GWAS.

To assess colocalization between GWAS loci and QTLs, we first identified the lead GWAS variants and their flanking region in which colocalization was to be tested. Specifically, all variants available in the GWAS summary statistics were sorted by p-values in increasing order. Starting from the variant with the smallest p-value (lead variant), variants within the 500 Kb window on either side of the leading variant were removed. This resulted in a 1Mbp GWAS locus for colocalization analysis. The same procedure was then applied to the next most significant variant among the remaining variants, until no variant with p value below 10 -7 was left. The HLA region (Chr6: 25–35 Mb) was excluded from colocalization. Only GWAS with more than 10 identified loci were included in our analysis. For each GWAS locus identified above, colocalization was tested only if it harbored a regulatory QTL with beta-distribution permuted p value below 0.01 (bpval <0.01) as reported by FastQTL in the 1 Mb window flanking that leading GWAS SNP. Default priors were used for COLOC . We set PP4 >0.75 as the threshold for colocalization. The colocalization proportion was calculated as the proportion of colocalized loci among all identified loci in a GWAS.

Colocalization results were visualized using a function adapted from LocusCompare [80]. For a given locus, SNP with the largest posterior probability from COLOC was defined as the colocalized SNP. r 2 relative to the colocalized SNP were calculated from the genotypes in the QTL study. To visualize the sQTL in the form of a Sashimi plot [81], we first grouped individuals by their genotypes, and then extracted RNA-seq reads that mapped to the cluster that contains the intron to be visualized. To make the coverage comparable between different genotypes, we scaled the read coverage by the number of indivuduals that carry each genotype using the scaleFactor argument in bamCoverage from Deeptools [82] when generating bigWig files. The coverage was then visualized using pyGenomeTracks [83].

Cis-eQTL data of eQTLGen [43] was directly obtained from the website ( We also downloaded allele frequencies from 26,609 eQTLGen samples (excluding Framingham Heart Study), which were used in our colocalization analysis. Of note, the DGN dataset is also included in eQTLGen meta-analysis, but does not alter the interpretation of any of our analyses.

HyPrColoc The GWAS-gene pairs tested in HyPrColoc were selected in the same way as COLOC . We set PP >0.25 as the threshold for colocalization as recommended by the authors [52].

Validation of immune-cell-specific colocalization for non-immune traits We validated colocalization of 14 non-immune traits (11 heart-related, AD, PD and breast cancer survival) in DICE immune cells using the GTEx V7 eQTLs. We first chose several tissues in GTEx that are most relevant to each GWAS trait. For heart-related traits, we chose tissues in heart and circulation system (Artery - Aorta, Artery - Coronary, Artery - Tibial, Heart - Atrial Appendage, Heart - Left Ventricle). For AD and PD, we included the 13 brain tissues (Brain - Amygdala, Brain - Anterior cingulate cortex (BA24), Brain - Caudate (basal ganglia), Brain - Cerebellar Hemisphere, Brain - Cerebellum, Brain - Cortex, Brain - Frontal Cortex (BA9), Brain - Hippocampus, Brain - Hypothalamus, Brain - Nucleus accumbens (basal ganglia), Brain - Putamen (basal ganglia), Brain - Spinal cord (cervical c-1), Brain - Substantia nigra). For breast cancer survival, we used adipose tissues and breast tissue (Adipose - Subcutaneous, Adipose - Visceral (Omentum), Breast - Mammary Tissue). We then identified all the colocalized gene-SNP pairs for these 14 GWAS in DICE, and extracted their P values from GTEx eQTLs in the relevant tissues, as well as from DICE eQTLs in all immune cell types. Given that a large proportion of eQTLs are shared in DICE, we grouped the 15 immune cell types into 6 groups, assigning the smallest P value from all cell types within a given group to that group for each gene. We used Bonferroni correction to adjust P values for multiple testing. Finally, we calculated the proportion gene-SNP pair that has adjusted P value below 0.05 in DICE but not GTEx tissues.

Characterizations of uncolocalized GWAS loci We restricted this analysis to the loci from the 14 autoimmune GWAS that did not colocalize with a in BLUEPRINT QTL. All genes were classified into four categories: genes with an eQTL that colocalized at a GWAS locus, genes that are the closest to a GWAS locus, genes that are closest to a uncolocalized GWAS locus, and all remaining genes. We compared gene expression level in the three BLUEPRINT cell types separately. The gene expression level values for the three cell types were combined and plotted in Fig. 5a. We also obtained Enhancer-domain score (EDS) [39] and “loss-of-function observed/expected upper bound fraction” (LOEUF) [40] for all available genes and compared the distribution of EDS and LOEUF across the four categories above.

To test the enrichment of uncolocalized loci in ATAC-seq peaks in stimulated immune cells, we constructed a contingency table by counting the number of colocalized and uncolocalized loci overlapping stimulated and unstimulated ATAC-seq peaks, respectively. We then tested the hypothesis that uncolocalized loci were more highly enriched in stimulated open chromatin regions compared to colocalized loci using Fisher’s exact test. We estimated 95% confidential interval of estimates by bootstrapping uncolocalized GWAS loci 1000 times with replacement.

We reasoned that regulatory effects of many uncolocalized GWAS loci might be too small to be detected due to small sample sizes. To test this possibility, we ascertained eQTLs only at uncolocalized GWAS loci. Briefly, we extracted QTL tests at lead SNP of uncolocalized loci. GWAS locus-gene pairs that have already been tested in COLOC but did not colocalize were filtered. Since it is common for one lead SNP to be associated with many genes, we adjusted the P values by number of tested genes at each loci using Bonferroni correction and picked the gene with the smallest P value. We then calculated the proportion of genes with P value below 0.05. This analysis was applied to each autoimmune GWAS in each cell type in BLUEPRINT dataset.

RA samples collection and analysis

Sample collection and CUT&Tag experiment All of the clinical samples were obtained from Xijing Hospital. Peripheral blood and synovial fluid samples were collected from 6 RA patients at the Department of Clinical Immunology, Xijing Hospital. All of the RA patients fulfilled the 1987 revised American College of Rheumatology criteria and the ACR 2010 Rheumatoid Arthritis classification criteria [84], and their clinical characteristics are shown in Additional File 1: Table S7. In addition, peripheral blood samples were gathered from 4 healthy individuals. All blood and synovial fluid samples were subjected to gradient centrifugation using lymphocyte separation medium (MP Biomedicals, 0850494) to isolate mononuclear cells, which were cryopreserved for later experiments.

The cryopreserved mononuclear cells were thawed into RPMI/10%FBS, washed once in sterile phosphate-buffered saline (PBS Beyotime, ST476), and stained with the following antibodies in PBS for 30 min: anti-CD3-APC/Cy7 (Biolegend, 300426), anti-CD4-PE/Cy7 (Biolegend, 357410), anti-CD8-Percp/Cy5.5 (Biolegend, 301032), anti-CD25-PE/CF594(BD Horizon,562525), anti-CD19-FITC (Biolegend,302206), and anti-CD14-APC (Biolegend, 301808). CD4 + T cells (CD3 + , CD4 + , CD8 - ), CD8 + T cells(CD3 + , CD4 - , CD8 + ), T reg cells (CD3 + , CD4 + , CD8 - , CD25 + ), B cells (CD3 - , CD19 + ), and monocytes (CD3 - , CD14 + ) were sorted by FACSAria III (BD Pharmingen, San Diego, USA) directly into wash buffer for CUT&Tag, with a maximum of 1×10 5 cells for each cell type. We profiled H3K27ac (abcam ab4729) for each cell type following standard CUT&Tag protocol ( [21]. Samples were processed in different batches, and we ensured to include at least one healthy individual and one RA patient in each batch to minimize batch effects that align with biological differences that we are interested in.

CUT&Tag data analysis The DNA libraries were subjected to 150bp paired-end (PE) sequencing. Sequencing reads were aligned to human reference genome hg19 using Bowtie 2 [45] with parameters –local –very-sensitive-local –no-unal –no-mixed –no-discordant –phred33 –minins 10 –maxins 700 . Aligned reads were filtered using Samtools with -F 1804 -f 2 -q 30 [85]. Samples with fewer than 2M reads were excluded from subsequent analyses. Filtered BAM files for samples that have the same disease status (healthy/RA), tissue-type (PBMC/SF) and cell type were merged. Read coverage was calculated using bamCoverage in 10bp window normalized by RPKM [82]. H3K27ac peaks were called from the merged BAM files using MACS2 with parameters –format BAMPE –broad –broad-cutoff 0.1 –qvalue 0.1 –extsize 146 [46]. We reasoned that calling peaks from merged BAM files increases the signal-to-noise ratio. To generate a consensus peak set, we merged all the peaks using bedtools merge [86], resulting in 90,412 peaks. We then counted the number of fragments overlapping with the consensus peak set in each sample using featureCounts [87].

Differential peak analysis was performed using limma [88]. We calculated average log2CPM across samples with the same disease status, tissue-type, and cell type. This average log2CPM was only used to filter our peaks with low fragments counts. Peaks with average log2CPM below 2 in all groups were excluded from differential analysis. Then, normalization factors were calculated from the remaining peaks using the TMM method, and counts in each sample converted to log2CPM. Since samples were processed in different batches, we used ComBat to adjust for batches while including disease status, tissue-type, and cell type as our variable-of-interest. We constructed a contrast matrix comparing RA SF vs. RA PBMC, RA SF vs. Healthy PBMC, and RA PBMC vs. Healthy PBMC in each, and applied the trend method. Differential peaks were defined as log2-fold-change (log2(FC)) larger than 1 or smaller than -1, and FDR below 0.1.

We overlapped H3K27ac peaks up-regulated in RA samples with uncolocalized RA GWAS loci. We first fine-mapped RA GWAS summary statistics using SuSiE [51]. Fine-mapping was performed at each locus we used in our colocalization analysis. We supplied GWAS Z-scores, genotype correlation matrix from CEU and GBR from the 1000 Genome Project as the reference panel and the sample size of reference panel to the susie_rss function.

We estimated the enrichment of RA SNP heritability in our H3K27ac peaks using Stratified LD Score Regression (S-LDSC) [5]. We used MACS2 peaks from merged BAM files, which were extended by 500 bp on both sides. To reproduce the heritability analysis from Calderon et al. [41], we used the MACS2 peaks shared by the authors.

Study designs for enriching or prioritising rare variants

Study designs exploiting unique characteristics of different populations have been used to boost power in association studies of rare and low-frequency alleles. One notable example is population isolates, which provide powerful study designs for medical genetics due to a number of advantageous characteristics. For example, variants of medical importance that are rare in outbred populations might be found at higher frequencies in isolated populations due to past bottleneck events, genetic drift or adaptation and selection [43, 112], increasing power to detect associations with medically important phenotypes [113, 114].

A particularly interesting case of rare variation is variants that lead to inactivation of the corresponding protein. Such so-called loss-of-function (LoF) variants include variants predicted to lead to premature termination of the protein (stop-gain variants or protein-truncating variants) and insertion or deletion polymorphisms that affect the overall codon sequence of the protein (frameshift INDELS) or alter pre-mRNA splicing of essential exons (essential splice-site variants). LoF variants provide powerful tools to understand the impact of “knocking out” human genes, akin to gene knockout experiments commonly conducted in model organisms [115]. Understanding the phenotypic and clinical consequences of carrying LoF alleles, particularly when they are carried in the homozygous (i.e. complete knockout) state, has been shown to provide crucial insights into the identification of new disease genes and druggable pathways [116,117,118]. Further, studies of LoF variants in established drug targets, when carried by an otherwise healthy individual, provide evidence for safety of modulating that particular target to reduce disease risk. The data set of 60,706 individuals collated by the Exome Aggregation Consortium (ExAC) can assist in filtering of candidate disease-causing variants and in the discovery of human “knockout” variants in protein-coding genes [119].

Efforts to discover these mutations are boosted in populations with high rates of homozygosity, for example in populations with a tradition of consanguineous marriage, and where such variants occur more often in a homozygous state. Analysing samples from the PROMIS study, it was found that 961 genes were completely inactivated in at least one participant. Combined with rich phenotype information, this enabled the discovery of genotype–phenotype associations of clinical importance, such as the association of APOC3 with absent plasma apolipoprotein C-III levels [120]. Another study predicted LoF in 781 genes after analysing 3222 British Pakistani heritage adults with high parental relatedness [121]. The whole genomes of 2636 Icelanders together with imputing additional 101,584 chip-genotyped and phased Icelanders has begun to enable studies of rare complete human gene knockouts in the Icelandic population. The authors are also planning to characterise most homozygous LoF variants in the Icelandic population and to carry out bespoke phenotyping of the carriers [122]. A caveat of this approach is that the functional consequences of sequence variants are typically bioinformatically annotated as based on generic transcript annotations (for instance based on the most deleterious consequence among all annotated transcripts). LoF variants may therefore not lead to protein inactivation in a biologically relevant context, which could be due to gene redundancy, or to heterozygosity, or to genuine variants that do not actually disrupt gene function, or to variants that are only active in certain tissue-specific (or rare) isoforms [112, 115]. Thus, extensive and painstaking follow-up efforts are required to validate the predicted consequences of these variants.

Results: LD impact on power

The simulation data consisted of 1200 synthetic datasets, corresponding to 4 LD blocks × 3 effect sizes × 100 retrospective case-control datasets with 1000 subjects in each cohort. The estimated heritabilities h 2 are given in Table 3 and are all below h 2 =0.10. These were computed according to the subsequent formula (4), in which Gi represents the nine two-locus genotype combinations underlying g1×g2, and results immediately from the penetrance tables previously computed for each effect size (as Table 2 was an instance for β3=0.90 effect size and results in h 2 =0.083):

Furthermore, Table 4 shows that only 1 SNP is in moderate to strong LD with the causal locus DSL 1 (r 2 threshold of 0.75), while 60 SNPs are in very low LD with DSL 1 (r 2 threshold of 0.20). Moderate to strong LD with DSL 2 A, B, C and D is observed for 98, 107, 78 and 24 SNPs (at r 2 of 0.75), respectively. The number of tag SNPs (and thus the signal capture probability) increase with decreasing r 2 threshold. For instance, for a threshold of 0.45, respectively 2, 114, 110, 80 and 48 tag-SNPs for DSL 1, DSL 2 A, B, C and D are obtained.

The estimated signal sensitivities of MB-MDR to detect the simulated purely epistatic interaction (DSL 1, DSL 2), for different scenarios of DLS 2 position (DSL 2 A, DSL 2 B, DSL 2 C, DSL 2 D), three epistasis effect sizes and five LD pruning schemes before MB-MDR analysis are presented in Fig. 5, for signal sensitivity defined via r 2 ≥0.45-tagging and in Fig. 6 for tagging determined by r 2 ≥0.20. The estimated exact sensitivities are displayed on the lower panels of the aforementioned Figures. Note that estimates of exact sensitivity do not depend on block definitions. All estimates are tabulated in Table 5. The following observations are made:

For all scenarios of epistasis effect size and location of DSL 2, as well as tag-SNP block definition and pruning at different r 2 values ranging from 0.20 to 0.75, the signal sensitivity is always higher than the exact sensitivity.

Also when no pruning is performed (thus all SNP pairs are screened for epistasis, regardless of between-SNPs correlations), the exact sensitivity is smaller than the signal sensitivity.

Exact sensitivities dramatically decrease when pruning is applied. The worst results are obtained for scenarios A and C, for which the corresponding DSL 2 can be considered to reside at the boundary of an LD (sub-)block. The results are only slightly better for scenario D. In case both DSLs are located on different chromosomes, exact sensitivity estimates range from 0.10-0.18 (setting D, see Fig. 1). In contrast, exact sensitivity estimates in case DSL 2 is located in the middle of an LD block range from 0.16-0.64, again depending on the epistatic effect size and LD pruning threshold (setting B, see Fig. 2).

Signal sensitivity can be further improved by SNP-set reduction via pruning. In general, the more LD-pruning is involved, the higher the signal sensitivity. Whatever the SNP-tag block definition used, too heavy pruning at r 2 of 0.20 gives by far the lowest signal sensitivity. For all considered DSL 2 locations, little power (signal sensitivity) is lost by pruning further down from 0.75 to 0.60, retaining more SNPs. For setting C, power balances around 0.50 when more extensive pruning is done at r 2 of 0.50 instead of 0.60, which is similar to flipping a coin and highly unacceptable (see Fig. 5).

There are no clear patterns regarding increasing epistasis effect size leading to increased exact or signal sensitivity.

Sensitivities of MB-MDR to detect two-loci pure epistatic interaction in 4 settings at three effect sizes and with different LD pruning levels: Signal sensitivities (upper panel) and exact sensitivities (lower panel) are displayed at different LD pruning thresholds (unpruned data or LD pruning at 0.75, 0.60, 0.50 and 0.20). Signal sensitivities determined with tag-SNP subsets at LD r 2 ≥0.45 with causal SNPs

Sensitivities of MB-MDR to detect two-loci pure epistatic interaction in 4 settings at three effect sizes and with different LD pruning levels: Signal sensitivities (upper panel) and exact sensitivities (lower panel) are displayed at different LD pruning thresholds (unpruned data or LD pruning at 0.75, 0.60, 0.50 and 0.20). Signal sensitivities determined with tag-SNP subsets at LD r 2 ≥0.20 with causal SNPs


Institute of Sport, Exercise and Active Living (ISEAL), Victoria University, Melbourne, Australia

Sarah Voisin, David J Bishop & Nir Eynon

Department of Tourism and Recreation, Academy of Physical Education and Sport, Gdansk, Poland

Pawel Cieszczyk & Zbigniew Jastrzebski

Ural State University of Physical Culture, Chelyabinsk, Russia

Vladimir P Pushkarev, Dmitry A Dyatlov, Boris F Vashlyayev & Vladimir A Shumaylov

Faculty of Physical Culture and Health Promotion, University of Szczecin, Szczecin, Poland

Pawel Cieszczyk, Agnieszka Maciejewska-Karlowska & Marek Sawczuk

Cell Biology Department, Faculty of Biology, University of Szczecin, Szczecin, Poland

Murdoch Childrens Research Institute, The Royal Children’s Hospital, Melbourne, Australia

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

Corresponding author

Author Contributions

S.S.V., T.O.B., L.G., R.E.P., T.E.K., A.R.S., and M.D.R wrote the manuscript. S.S.V., T.O.B., L.G., R.E.P., T.E.K., A.R.S., M.D.R, J.-L.R., J.D.B., J.P.L., Y.B., B.D.M., Di.A., Da.A., R.A., K.B., G.C., K.C., J.H.C., J.-P.D., N.P.D., I.F.-C., P.F., M.G., T.G., G.F.G., B.G., P.A.G., W.H., L.H., E.-Y.K., H.-S.K., M.K., M.T.M.L., R.M., J.M., D.M.R., E.S., M.S., J.G.S., J.M.S.-M., J.M.tB., D.T., M.V., J.W., M.-S.W., R.W., and S.W. designed the research. S.V., T.B., L.G., J.-L.R., J.L., Y.B., T.K., A.S., and M.R. performed the research. S.S.V., T.O.B., L.G., R.E.P., T.E.K., A.R.S., M.D.R., J.-L.R., J.D.B., J.P.L., Y.B., and B.D.M. analyzed the data.

Filename Description
cpt1911-sup-0001-FigS1.tifTIFF image, 17.3 MB
cpt1911-sup-0002-FigS2.tifTIFF image, 16.6 MB
cpt1911-sup-0003-FigS3.tifTIFF image, 16.6 MB
cpt1911-sup-0004-FigS4.tifTIFF image, 16.6 MB
cpt1911-sup-0005-TableS1.xlsxapplication/excel, 9.2 KB
cpt1911-sup-0006-TableS2.xlsxapplication/excel, 9.1 KB
cpt1911-sup-0007-TableS3.xlsxapplication/excel, 11.5 KB
cpt1911-sup-0008-TableS4.xlsxapplication/excel, 11.8 KB
cpt1911-sup-0009-TableS5.xlsxapplication/excel, 11.5 KB
cpt1911-sup-0010-TableS6.xlsxapplication/excel, 11.8 KB
cpt1911-sup-0011-TableS7.xlsxapplication/excel, 11.7 KB
cpt1911-sup-0012-TableS8.xlsxapplication/excel, 14.7 KB
cpt1911-sup-0013-TextS1.docxWord document, 12.1 KB

Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.

Watch the video: Τα γονίδια και η κληρονομικότητα (October 2022).