SNPs mapping into protein

SNPs mapping into protein

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

Starting a new project on protein-protein interactions and SNP analysis tool development. I would like to ask how does SNPs is mapped into protein? What does mapping means?

Well SNPs are single nucleotide polymorphisms. Some SNPs are in the coding regions of genes and they can result in changes in the resulting protein. For example, the SNP rs1801131 is a human variation where some individuals have a G instead of an A at position 1515 of the gene MTHFR. When the gene is transcribed and then translated into protein, this variation (the presence of the G nucleotide) causes a Glutamate residue at position 429 of the protein to be a Glycine instead.

So, SNP rs1801131 maps to position 429 of the protein MTHFR.

Molecular Biology

Because restriction markers are not restricted to those genome changes that affect the phenotype, they provide the basis for an extremely powerful technique for identifying genetic loci at the molecular level. A typical problem concerns a mutation with known effects on the phenotype, where the relevant genetic locus can be placed on a genetic map, but for which we have no knowledge about the corresponding gene or protein. Many damaging or fatal human diseases fall into this category. For example cystic fibrosis shows Mendelian inheritance, but the molecular nature of the mutant function was unknown until it could be identified as a result of characterizing the gene. A hypothetical example is shown in Figure 3.4. This situation corresponds to finding 100% linkage between the restriction marker and the phenotype. It would imply that the restriction marker lies so close to the mutant gene that it is never separated from it by recombination.


Genetic variants that are associated with common human diseases do not lead directly to disease, but instead act on intermediate, molecular phenotypes that in turn induce changes in higher-order disease traits. Therefore, identifying the molecular phenotypes that vary in response to changes in DNA and that also associate with changes in disease traits has the potential to provide the functional information required to not only identify and validate the susceptibility genes that are directly affected by changes in DNA, but also to understand the molecular networks in which such genes operate and how changes in these networks lead to changes in disease traits. Toward that end, we profiled more than 39,000 transcripts and we genotyped 782,476 unique single nucleotide polymorphisms (SNPs) in more than 400 human liver samples to characterize the genetic architecture of gene expression in the human liver, a metabolically active tissue that is important in a number of common human diseases, including obesity, diabetes, and atherosclerosis. This genome-wide association study of gene expression resulted in the detection of more than 6,000 associations between SNP genotypes and liver gene expression traits, where many of the corresponding genes identified have already been implicated in a number of human diseases. The utility of these data for elucidating the causes of common human diseases is demonstrated by integrating them with genotypic and expression data from other human and mouse populations. This provides much-needed functional support for the candidate susceptibility genes being identified at a growing number of genetic loci that have been identified as key drivers of disease from genome-wide association studies of disease. By using an integrative genomics approach, we highlight how the gene RPS26 and not ERBB3 is supported by our data as the most likely susceptibility gene for a novel type 1 diabetes locus recently identified in a large-scale, genome-wide association study. We also identify SORT1 and CELSR2 as candidate susceptibility genes for a locus recently associated with coronary artery disease and plasma low-density lipoprotein cholesterol levels in the process.


Forward genetic screens are used to identify genes important for the expression of a particular phenotype of interest. After mutagenesis and genetic screens are performed, molecular biology techniques can be used to determine which gene has been mutated to result in a particular phenotype. There are several ways to map a mutation to a gene in the model organism Caenorhabditis elegans, but most classical approaches begin by determining which of the six chromosomes (five pairs of autosomes and one pair of sex chromosomes) contains the mutated gene (as in ref. [ 1, 2 ).

Single nucleotide polymorphisms (SNPs) are single nucleotide differences between the genetic material of one organism and another, and can be used as genetic markers to map a mutation to a chromosome. Different isogenic lab strains of C. elegans carry different complements of SNPs in their DNA. Two specific worm strains N2 (Bristol) and HI (Hawaiian) are commonly used for SNP mapping. In this lab exercise, students use several N2-HI SNPs to map a specific mutation that causes abnormal locomotion.

In this experiment, the mutation of interest is carried in the N2 background, while the HI background does not contain the mutation of interest. An N2 worm expressing the mutant phenotype (in this case, a worm that is uncoordinated and is homozygous for a recessive mutation) is mated to a HI male. The offspring of the mutant N2 × HI cross have a mix of DNA from each strain (see Fig. 1). All the F1s will inherit a mutated chromosome from their N2 parent, but because the mutation used in this lab is recessive, two copies of the mutant chromosome are needed for the expression of the mutant phenotype. Therefore, the F1 worms will not express the mutant phenotype. F1s are allowed to self-fertilize (these worms are hermaphrodites) to homozygose the mutation. The mutant phenotype is then observed in some portion (approximately 25%) of the F2 generation. Those F2s exhibiting the mutant movement phenotype must have inherited two copies of the N2 chromosome that contains the mutated gene. The offspring that do not show the mutant phenotype (the coordinated offspring) must have inherited at least one copy of the HI chromosome that contains the normal allele of the gene mutated in the N2 worms.

The next step is to determine, for those F2 worms that exhibit uncoordinated movement, which chromosomes are N2 and which are HI. SNPs are used to identify which F2 chromosomes are N2 and which are HI. To do this, students first amplify portions of the uncoordinated F2 chromosomes using polymerase chain reaction (PCR). The regions of DNA amplified contain known SNPs- so there will be differences in the sequence of the N2 and HI DNA. Students detect those SNPs using restriction fragment length analysis. In this case, SNPs that alter restriction enzyme recognition sites were chosen. Restriction enzymes cut amplified N2 DNA differently than they cut HI DNA because of the single nucleotide differences between their DNA sequences. Because the enzymes cut N2 and HI differently, the resulting DNA fragments for N2 and HI DNA will be of different sizes. Students determine the sizes of the DNA fragments (and thus the type of DNA—N2 vs. HI) by gel electrophoresis. The expectation is that the chromosome containing the mutated gene can only be made of N2 DNA, whereas the chromosomes that do not contain the mutation can contain either N2 or HI DNA.

This laboratory exercise introduces students to several molecular techniques. The exercise also introduces students to the strategy behind genetic mapping studies. The exercise can be adapted to map any recessive mutation that elicits an easily observed phenotype (and does not severely affect fecundity or cause lethality) and is located on chromosomes II or IV.


We are extremely grateful to all the families who took part in the ALSPAC study, the midwives for their help in recruiting them and the whole ALSPAC team, including interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists and nurses. We acknowledge J. Bowden for statistical support and advice relating to MR-Egger regression. This publication is the work of the authors, and J. Zheng will serve as guarantor for the contents of this paper. J.Z. is funded by a Vice-Chancellor’s Fellowship from the University of Bristol. This research was also funded by the UK Medical Research Council Integrative Epidemiology Unit (MC_UU_00011/1 and MC_UU_00011/4), GlaxoSmithKline, Biogen and the Cancer Research Integrative Cancer Epidemiology Programme (C18281/A19169). The UK Medical Research Council and Wellcome (grant no. 102215/2/13/2) and the University of Bristol provided core support for ALSPAC. A comprehensive list of grant funding is available on the ALSPAC website ( T.R.G. holds a Turing Fellowship with the Alan Turing Institute. G.H. is funded by the Wellcome Trust and the Royal Society (208806/Z/17/Z). M.V.H. is supported by a British Heart Foundation Intermediate Clinical Research Fellowship (FS/18/23/33512) and the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre. This work has been supported by the NIHR Biomedical Research Centre at University Hospitals Bristol and Weston NHS Foundation Trust and the University of Bristol (G.D.S. and T.R.G.). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. This work was supported by the Elizabeth Blackwell Institute for Health Research University of Bristol and the Medical Research Council Proximity to Discovery award. P.E. is supported by Cancer Research UK (CRUK C18281/A19169). S.L. is funded by the Bau Tsu Zung Bau Kwan Yeun Hing Research and Clinical Fellowship (200008682.920006.20006.400.01) from the University of Hong Kong. J.D. is funded by a NIHR Senior Investigator award. J.D. sits on the International Cardiovascular and Metabolic advisory board for Novartis (since 2010), the UK Biobank Steering Committee (since 2011), and is a member of the MRC International Advisory Group (ING) London (since 2013), the MRC High Throughput Science ‘Omics Panel’, London (since 2013), the Scientific Advisory Committee for Sanofi (since 2013), the International Cardiovascular and Metabolism Research and Development Portfolio Committee for Novartis and the AstraZeneca Genomics advisory board (since 2018). P.C.H. is supported by CRUK Population Research Postdoctoral Fellowship C52724/A20138.

Participants in the INTERVAL randomized controlled trial were recruited with the active collaboration of NHS Blood and Transplant England (, which has supported fieldwork and other elements of the trial. DNA extraction and genotyping was co-funded by the NIHR, the NIHR BioResource ( and the NIHR Cambridge Biomedical Research Centre at the Cambridge University Hospitals NHS Foundation Trust. The academic coordinating centre for INTERVAL was supported by core funding from the NIHR Blood and Transplant Research Unit in Donor Health and Genomics (NIHR BTRU-2014–10024), UK Medical Research Council (MR/L003120/1), British Heart Foundation (SP/09/002 RG/13/13/30194 RG/18/13/33946) and the NIHR Cambridge Biomedical Research Centre at the Cambridge University Hospitals NHS Foundation Trust. A complete list of the investigators and contributors to the INTERVAL trial is provided in Di Angelantonio et al. 59 . The academic coordinating centre thank blood donor center staff and blood donors for participating in the INTERVAL trial.

We gratefully acknowledge all studies and databases that have made their GWAS summary data available for this study: arcOGEN (Arthritis Research UK Osteoarthritis Genetics), BCAC (the Breast Cancer Association Consortium), C4D (Coronary Artery Disease Genetics Consortium), CARDIoGRAM (Coronary ARtery DIsease Genome-wide Replication and Meta-analysis), CKDGen (Chronic Kidney Disease Genetics consortium), DIAGRAM (DIAbetes Genetics Replication And Meta-analysis), EAGLE (EArly Genetics and Lifecourse Epidemiology Consortium), EAGLE Eczema (EArly Genetics and Lifecourse Epidemiology Eczema Consortium), EGG (Early Growth Genetics Consortium), ENIGMA (Enhancing Neuro Imaging Genetics through Meta-Analysis), GCAN (Genetic Consortium for Anorexia Nervosa), GEFOS (GEnetic Factors for OSteoporosis Consortium), GIANT (Genetic Investigation of ANthropometric Traits), GIS (Genetics of Iron Status consortium), GLGC (Global Lipids Genetics Consortium), GliomaScan (cohort-based GWAS of glioma), GPC (Genetics of Personality Consortium), GUGC (Global Urate and Gout consortium), HaemGen (hematological and platelet traits genetics consortium), IGAP (International Genomics of Alzheimer’s Project), IIBDGC (International Inflammatory Bowel Disease Genetics Consortium), ILCCO (International Lung Cancer Consortium), IMSGC (International Multiple Sclerosis Genetic Consortium), ISGC (International Stroke Genetics Consortium), MAGIC (Meta-Analyses of Glucose and Insulin-related traits Consortium), MDACC (MD Anderson Cancer Center), MESA (Multi-Ethnic Study of Atherosclerosis), Neale’s lab (a team of researchers from Benjamin Neale’s group, who made the UK Biobank GWAS summary statistics publically available), OCAC (Ovarian Cancer Association Consortium), IPSCSG (the International PSC study group), NHGRI-EBI GWAS catalog (National Human Genome Research Institute and European Bioinformatics Institute Catalog of published GWAS), PanScan (Pancreatic Cancer Cohort Consortium), PGC (Psychiatric Genomics Consortium), Project MinE consortium, ReproGen (Reproductive ageing Genetics consortium), SSGAC (Social Science Genetics Association Consortium), TAG (Tobacco and Genetics Consortium) and the UK Biobank.

J.Z. acknowledges his grandmother ChenZhu for all her support, may she rest in peace.

SNPs in microRNA target sites and their potential role in human disease

In the post-genomic era, the goal of personalized medicine is to determine the correlation between genotype and phenotype. Developing high-throughput genotyping technologies such as genome-wide association studies (GWAS) and the 1000 Genomes Project ( has dramatically enhanced our ability to map where changes in the genome occur on a population level by identifying millions of single nucleotide polymorphisms (SNPs). Polymorphisms, particularly those within the coding regions of proteins and at splice junctions, have received the most attention, but it is also now clear that polymorphisms in the non-coding regions are important. In these non-coding regions, the enhancer and promoter regions have received the most attention, whereas the 3′-UTR regions have until recently been overlooked. In this review, we examine how SNPs affect microRNA-binding sites in these regions, and how mRNA stability changes can lead to disease pathogenesis.

1. Introduction

Single nucleotide polymorphisms (SNPs) occur in 1% or more within the population [1]. Although these populations are identical in 99.5% at the DNA level [2], there are approximately 10 million SNPs in the human genome, indicating that they occur once in every 300 bp in both coding and non-coding regions of genes [3]. SNPs in the coding region can result in synonymous and non-synonymous changes, with the latter resulting in an amino acid change or the introduction of a stop codon [4]. These changes can lead to human diseases [5], and in fact at least 25% of the reported non-synonymous SNPs are predicted to negatively affect protein function [6,7].

Synonymous SNPs have been referred to as silent mutations because they do not change the amino acid [8]. However, there is a growing body of evidence indicating that synonymous SNPs do affect proper protein function [9]. For example, two synonymous SNPs in the sequence encoding the multidrug resistance protein 1 (MDR1) affect protein folding and function [10]. Moreover, the most common disease-causing mutation in the cystic fibrosis transmembrane conductance regulator (CFTR) gene is an out-of-frame deletion of phenylalanine-508 (ΔF508) that introduces a SNP at isoleucine-507 (I507) and this SNP contributes to the severity of the ΔF508 CFTR channel dysfunction [11,12].

Recently, more attention has been paid to the SNPs identified in non-coding regions. Interestingly, about 93% of functional SNPs in the GWAS catalogue are in non-coding regions [13]. They have been called regulatory SNPs (rSNPs) because they affect transcriptional regulation or post-transcriptional gene expression [14]. rSNPs can cause changes in cell function at different levels of gene regulation. For example, they can affect gene splicing [15] and transcription factor binding [16]. These rSNPs reside in the sequence of non-coding RNA in the promoter and enhancer regions [16]. They can also affect the half-life of messenger RNA (mRNA) and result in decreased protein levels through mRNA–microRNA (miRNA) interactions. SNPs in miRNA target sites in the 3′-UTR of mRNAs are referred to as poly-miRTSs [17]. The SNP dataset from the UCSC Genome browser (NCBI dbSNP, Build 130 [18]) consists of 18 833 531 human SNPs, while the genomic coordinates for a subset of 175 351 (approx. 11%) locates them in the 3'-UTRs of 16 810 genes [19]. Given that there are an estimated 19 000–20 000 genes in the human genome, this suggests that the majority of the genes could be regulated by miRNAs [20], indicating that the potential biological consequence of these mutations should be carefully considered. Furthermore, a substantial number of SNPs and rare mutations within pri-, pre- and mature miRNA sequences have been reported [21,22]. Although some of these miRNA SNPs are related to human diseases [23–27] (reviewed in [17]), their biological role is difficult to elucidate given that changes in any miRNA can have profound genome-wide effects since miRNAs can bind to hundreds of different mRNAs. Since 2008, when Sethupathy & Collins [17] critically reviewed reports of miRNA SNPs involved in human diseases and provided clear criteria for validation of such associations, a large number of novel human disease-related poly-miRTSs have been proposed. Furthermore, recently developed approaches dedicated to miRNA function, targeted genome editing with in silico methods provide novel tools for complex verification of miRNA SNP consequences. In this review, we focus on poly-miRTSs and their potential impact in human diseases.

2. SNPs in miRNA target sites

2.1. mRNA : miRNA association

miRNAs are short (approx. 22 nt) endogenous non-coding single-stranded RNAs which act as post-transcriptional regulators of gene expression [28]. In the cytosol, mature miRNAs that are a part of the Argonaute-containing silencing complexes called miRNA ribonucleoprotein complexes (miRNP) can downregulate a specific target mRNA by Argonaute-catalysed degradation of the mRNA target strand in P bodies or by translational repression [29,30]. Hence, the major consequence of miRNA : mRNA pairing is loss of protein expression, resulting from either decreased transcript levels or by translational repression [29].

Although the mechanism underlying the recognition of mRNA targets by miRNAs has been extensively studied, the minimal requirements for a functional mRNA : miRNA association are not fully understood. Furthermore, despite the fact that many mRNAs have conserved miRNA target sites, a variety of interactions through non-conserved sites has been reported [31]. Finally, the average size of the human 3′-UTR is about 950 nt (for highly expressed neuronal genes it is 1300 nt, whereas for genes specific to non-neuronal tissue it is only 700 nt) [32], while the efficient miRNA-binding site consists of 6–8 nt. Hence, the 3′-UTR of a specific mRNA can include tandem target sequences for a specific miRNA as well as target sequences for many other miRNAs. The composition of specific miRNAs associated with the 3′-UTR of a mRNA along with the efficiency of miRNA pairing to their target sequences impacts the mRNA's half-life and influences protein levels [33,34]. Hence, determining the consequences of SNPs in miRNA target sites is not a trivial endeavour.

That being said, it is well established that the complementary pairing of a 3′-UTR of a mRNA to a conserved heptametrical seed sequence is usually found at positions 2–7 from the miRNA 5′-end and is critical for mRNA target selection [35]. Initially, it was thought that perfect complementarity of the 3′-UTR of a mRNA to the miRNA seed sequence led to transcript degradation, and a partial match caused translational inhibition [35]. However, recent studies have shown that non-canonical sites also exist and can regulate mRNA degradation [36]. Furthermore, base pairing between mRNA and miRNA seed sequences do not always lead to decreased expression of target transcript [37]. The above findings suggest that additional features of mRNA target sequences play a crucial role in effective miRNA binding. The detailed analysis of seed sequences established 8-nt pairing (8-mer) with mRNA as the most effective, whereas 7- and 6-nt binding sites (7-mer and 6-mer) were less effective (figure 1). Although 6-mers often provide efficient pairing, even in an offset position (figure 1a,b), a 4-mer is a non-functional site in vivo [38]. Interestingly, 7-mer pairing efficiency relies strictly on sequence complementarity. Consequently, although the 7-mer-m8 site (an exact match to positions 2–8 of the mature miRNA—the seed and position 8 (figure 1c)) has increased seed pairing compared with the 6-mer, the 7-mer-A1 (an exact match to positions 2–7 of the mature miRNA—the seed followed by an ‘A1’) has similar seed pairing to 6-mer (figure 1d). The seed pairing including both the match at position m8 and the A1 is characteristic for a 8-mer site [37] (figure 1e). The effect of G : U base pairs and bulges in the seed were also considered showing that a single G : U wobble or target sites with a 1 nt bulge can still be functional [38] (figure 1f). However, the Watson–Crick base pairing is absolutely critical between nucleotides at positions 9–12 in the target site, since the hydrolysis of the phosphodiester backbone in mRNA cleaved by miRNA occurs only when the 10th and 11th nucleotides of mRNA are complementary to nucleotides at positions 2–15 in miRNA [39].

Figure 1. Types of mRNA : miRNA interactions. (a) 6-mer, (b) 6-mer offset, (c) 7-mer-m8, (d) 7-mer-A1, (e) 8-mer, (f) GU wobble pairing, (g) productive 3'-pairing, (h) compensatory site and (i) centred site.

Furthermore, additional mRNA pairing to the 3′ region of miRNA, termed as productive seed pairing, can increase the target recognition or it can compensate for the mismatch to the seed (3′ supplementary sites and 3′ compensatory sites, respectively) [36]. The substantial pairing of 3′ compensatory sites to mRNA increases the weak 5′ pairing, resulting in functional miRNA binding (figure 1g,h).

Interestingly, Shin et al. [30] indicated that centred mRNA sequences consisting of 11 nt create Watson–Crick pairs with miRNA nucleotides at positions 4–14 or 5–15 and serve as a type of miRNA target site. This unique class of miRNA target sites is devoid of both perfect seed pairing and 3′ compensatory pairing but can be supplemented by pairing to the other miRNA areas (figure 1i).

Based on the studies discussed above, mRNA target sites can be divided into two major groups. The first group consists of canonical sites with (i) strong seed pairing to the 5′ end of miRNA (low pairing energy) that is amplified through either strong base pairing to the 3′ end of the miRNA (an extension of the seed type) or (ii) strong seed pairing to the 5′ end of miRNA seed sites requiring little or no 3′-UTR pairing support. These canonical sites have pairing energy and are often functional in one copy. In contrast with these sites, the second groups are non-canonical seed sites with higher pairing energy that should exist in the 3′-UTR in more than one copy to be effective [38]. It has to be stressed that the seed region contributes the majority of binding energy and strong binding relies mainly on base pairing within this region, whereas an additional 3′ pairing only slightly reduces binding energy [40]. Interestingly, pairing beyond position 16 and at positions 10–11 increases binding energy that results in weakened binding [40].

Another factor to consider in miRNA : mRNA interactions is the location of the target mRNA sites. In general, the 3′-UTR mRNA sites are most efficient [37,41]. Furthermore, target mRNA sites positioned within at least 15 nt from the stop codon, sites located away from the centres of long 3′-UTR, as well as those miRNA target sites located in AU-rich regions are the most effective [37,41]. Additionally, a location of target mRNA sites in close proximity to protein-binding sites and to other miRNA-binding sites may also affect miRNA : mRNA associations [33,37]. The 3′-UTR mRNA quartiles bordering the mRNA poly(A) tail and the ORF exhibit more effective targeting than remaining two centred quartiles. However, this effect was apparent only for longer 3′-UTRs (more than 1300 nt) [37].

Taking into account the complexity of miRNA : mRNA pairing, the introduction of a SNP into a 3′-UTR can have numerous functional consequences by either introducing or removing miRNA target sequences or changing the binding efficiency. The poly-miRTSs within the canonical seed sequence can either create a novel mRNA target site from a preexisting 5-mer sequence (into 6-mer offset or 6-mer) or impair the existing target site 6-mer or 6-mer offset sequence (into 5-mer). Furthermore, since the introduction of poly-miRTSs into seed regions can also affect miRNA : mRNA binding efficiency, it can lead to either increased or decreased post-transcriptional mRNA regulation. Finally, poly-miRTSs may also affect miRNA-binding efficiency by changing supplemental seed pairing that applies to both canonical and non-canonical binding sites. Additionally, in the case of non-canonical binding sites, poly-miRTSs may introduce or remove tandem target sites, and thus change the miRNA effects. Finally, the introduction or removal of miRNA target sites may affect binding to other miRNA target sequences in the SNP's close proximity, which could have unforeseen effects on the mRNA half-life. Given the number of SNPs in the human population, it is not surprising that poly-miRTSs have been shown to affect the levels of numerous proteins that have been associated with various disorders (table 1) [39]. Below, we discuss examples of several studies identifying poly-miRTSs and their potential association with human disorders.

Table 1. Reports of poly-miRTS associations with human disease. Bold indicates the studies fulfilled the criteria for assigning SNPs as poly-miRTSs involved in human diseases and included: (i) functional experimental validation of SNPs related to differential mRNA targeting (ii) genetic testing of the association with the disease that takes into account the effects of population stratification and (iii) mechanistic testing underlying the mechanism by which poly-miRTSs contribute to the disease [17].

2.2. Creation of new miRNA target sites by SNPs

2.2.1. MDM4 | miR-191 or miR-877-3p

Mdm2-like p53-binding protein (MDM4) is an oncoprotein that negatively regulates the p53 tumour suppressor protein [70]. It is well documented that overexpression of this protein leads to cancer development [70]. Recent studies suggested that the variation in the 3'-UTR of MDM4 can lead to a decreased risk of various malignancies [42–47]. The occurrence of the C minor allele (SNP rs4245739 A>C) in the 3'-UTR of MDM4 has been shown to decrease the risk of cancer, and delay the progression of metastasis and cancer-related death [42–47]. Numerous studies have shown that introduction of this C minor SNP creates a new binding site for miR-191 [42–46] and/or miR-887-3p [42,43,47], and this leads to a decreased level of MDM4 protein. Moreover, a recently conducted meta-analysis of 69 477 subjects (19 796 cases of nine various type of cancer and 49 681 controls) showed that the above-mentioned SNP correlates with a reduced overall risk of cancer [71].

2.2.2. ΔNp63 | miR-140-5p

p63 is another tumour suppressor protein belonging to the p53 family. Because of different promoters and alternative splicing, there are two major isoforms of TP63: TAp63 (acidic transactivation domain present) and ΔNp63 (no transactivation domain) [72]. Interestingly, in vivo experiments indicate that TAp63 acts like a tumour suppressor gene, whereas ΔNp63 is an oncogene [73–75]. Wang et al. [48] found that the SNP rs35592567 (C>T) in the 3′-UTR of ΔNp63 has an impact on bladder cancer risk. Analysis showed that the T allele is correlated with a decreased risk of bladder cancer because miR-140-5p is able to bind to the 3'-UTR of ΔNp63. Overexpression of miR-140-5p in 5637 cells (urinary bladder grade II carcinoma cells) attenuated cell migration and invasion and inhibited cell proliferation [48].

2.2.3. HNF1B | miR-214-5p and miR-550a-5p

Another example of a positive effect of a SNP on a disease risk is rs2229295 (C>A), which is located in the 3′-UTR of hepatocyte nuclear factor 1-beta (HNF1B) mRNA. This gene encodes a transcription factor known to be a regulator of growth and development in the pancreas [76]. Since HNF1B has a role in controlling hepatic insulin activity and glucose metabolism in vivo [77], Goda et al. [49] suggested that the rs2229295 SNP may correlate with susceptibility for type 2 diabetes mellitus (T2DM). Using luciferase reporter vectors, they demonstrated that the A allele constructs were regulated by two miRNAs: miR-214-5p and miR-550a-5p, whereas C allele constructs were not. Hence, the presence of A allele decreases HNF1B protein levels and has a protective effect against T2DM [49].

2.2.4. APOC3 and APOA5 | miR-4271 and miR-485-5p

APOC3 and APOA5 are genes that encode apolipoprotein C3 and A5, respectively. Both of these proteins along with lipoprotein lipase (LPL) and apolipoprotein C2 (APOC2) are involved in triglyceride metabolism [50,51]. Hu et al. [50] demonstrated that decreased levels of APOC3 lead to lower triglyceride levels and reduce the risk of coronary heart disease (CHD). This is due to SNP (rs4225 G>T) found in the 3′-UTR of APOC3. When the T minor allele is present in the cell, miR-4271 is able to bind to the 3′-UTR of APOC3, and this leads to a decreased translation of APOC3. miR-4271, however, cannot bind to the variant containing the G major allele [50]. Similarly, APOA5 c.*158C>T (rs2266788) is also associated with alterations in triglyceride metabolism and results in hypertriglyceridaemia [51]. In this case, the rare c.*158C APOA5 allele creates a new functional binding site for miR-485-5p. Importantly, both miRNAs regulating APOC3 and APOA5 are endogenously expressed in the human liver, so if the SNP occurs, they may be involved in the regulation of triglyceride metabolism in vivo. However, both examples of SNPs and their impact on the risk of disease need further clarification, since different results have been obtained for different ethnic groups [50].

2.2.5. PLIN4 | miR-522

PLIN4 (perilipin 4) is a member of the perilipin family and these proteins coat the intracellular lipid storage droplets (LSD). PLIN4 has been proposed to promote uptake of free fatty acids from the blood to the LSD and is dependent upon the cell's nutritional status [78]. Meta-analysis of two populations of this gene, rs8887 (G>A), analysed with antropometric measurements, indicated that the two populations were different. Individuals with the A minor allele had an increased volume of visceral and subcutaneous adipose tissue, and higher BMI and weight compared to individuals with the G major allele [52]. This study reported that PLIN4 is regulated by miR-522 only in the rs8887A variant. It is not yet clear, however, if the lower expression of PLIN4 contributes to obesity because the results are conflicting [79,80].

2.2.6. FXN | miR-124-3p

Reduced expression of the mitochondrial frataxin (FXN) protein has been postulated to play a role in Friedreich's ataxia (FRDA), an inherited neurodegenerative disease [81]. Lower levels of frataxin are due to GAA repeat expansion in the FXN gene [81]. Additionally, Bandiera et al. [53] have suggested that miR-124-3p regulates FXN expression in vivo only in FRDA patients. They identified seven SNPs in the 3′-UTR of FXN in children and adults diagnosed with FRDA. One of them, rs11145043 (G>T), permits miR-124-3p binding only when the T allele is present. Although miR-124-3p is highly expressed in the nervous system [82], it is overexpressed in FRDA patients [83], suggesting its role in FRDA. However, its influence on FXN needs further clarification.

2.3. Loss of miRNA target sites by SNPs

2.3.1. SCNA | miR-34b

The α-synuclein SCNA gene polymorphism is considered a main risk for the common sporadic form of Parkinson's disease (PD approx. 90% of all PD cases) [84]. α-Synuclein is a crucial protein that creates immunoreactive aggregates in Lewy-bodies, which are typical for Parkinson's disease patients' brains [85]. Studies have indicated that miR-34b targets the α-synuclein mRNA3′-UTR in two distinct sites and represses translation of this protein [86]. Importantly, in PD patients' brains, the level of miR-34b in the substantia nigra is decreased. Kabaria et al. [54] have identified a SNP, rs10024743 (T>G), in the 3′-UTR of α-synuclein, which is localized in the target site 1 of miR-34b. This SNP diminishes the miR-34b-mediated repression of α-synuclein levels due to disruption of the miRNA : mRNA association. However, this study was performed only on SH-SY5Y cells and its association with PD remains unclear [54].

2.3.2. PALLD | miR-96 and miR-182

The PALLD gene encodes the actin-associated protein Palladin, whose expression correlates closely with the pathological cell motility characteristics of aggressive cancer cells. The expression level of Palladin in breast cancer patients is higher in invasive and malignant cancer cell types than in non-invasive and normal cell lines. The results suggest that Palladin promotes podosome formation, regulates the actin cytoskeleton via multiple pathways, participates in matrix degradation, and thus facilitates metastasis in breast cancer [87,88]. Gilam et al. [55] have reported that miR-96 and miR-182 reduce breast cancer cell migration and invasion by downregulating Palladin protein levels and that this process is disrupted by a SNP, rs1071738 (G < C), located in the 3′-UTR of the PALLD gene. This SNP is characterized by highest minor allele frequency (greater than 43%) and the alternate G allele is much more common than the ancestral minor C allele. If the C allele occurs in the binding site, the mRNA target sequence at the 3′-UTR of PALLD is fully complementary to the miR-96 and miR-182 seed regions, whereas the presence of the alternate G allele results in one mismatch. A significant decrease in Palladin levels is diminished by miR-96 and miR-182 expression (approx. 30% and approx. 70% reduction, respectively) in the presence of the C allele, but not in the presence of the G allele due to the disrupted miRNA:mRNA association. These findings suggest that although miR-96 and miR-182 may prevent breast cancer metastasis, the functional rs1071738 G variant abolishes this effect [55].

2.3.3. EFNB2 | miR-137

The EFNB2 (ephrin-B2) gene encodes an ephrin, a protein tyrosine kinase that is involved in remodelling and the development of synaptic connections that are regulated by activated NMDA receptor. Ephrin-B2 is essential for the Reelin pathway controlling neuronal migration. Additionally, the activation of EFNB2 is crucial for rescuing the Reelin defect and disruption of this pathway is associated with schizophrenia [56,89]. Recently, a negative correlation between miR-137 and EFNB2 expression was shown [56]. Importantly, the SNP rs550067317 (A>C) is located at the predicted target site of miR-137 in the 3′-UTR of EFNB2. The minor C allele of rs550067317 disrupts the formation of the typical stem-loop structure during base pairing of miR-137 with the predicted target sequence at the 3′-UTR, consequently reversing inhibition of EFNB2 expression.

2.3.4. HIF1A | miR-199a

The HIF1A gene encodes the HIF-1α protein (hypoxia-inducible factor 1), an oxygen dependent subunit and master transcriptional regulator of the mammalian cell response to oxygen deprivation, and is therefore important in both the cardiovascular and cancer fields. To date, numerous studies have demonstrated miRNA's role in regulation of HIF-1α levels [90–93]. Recently, a SNP (rs2057482 T>C) in the 3′-UTR of HIF1A located near the miR-199a binding site was identified [57,94]. The C allele of this variant has an increased frequency in pancreatic ductal adenocarcinoma patients and this CC genotype was characterized by a larger tumour size, shorter overall survival and a higher risk of this disease compared to CT and TT genotypes [57]. Additionally, the occurrence of the C allele was significantly associated with higher HIF1A mRNA and consequently upregulation of HIF1 levels, suggesting that this SNP impairs miR-199a : HIF1A binding [57].

2.3.5. DROSHA | miR-27b

A very interesting example of a synonymous mutation that leads to the loss of an miRNA binding site is SNP rs10719 (T>C) located in the 3′-UTR of the DROSHA gene. The Drosha enzyme, a member of the RNAase III family, plays a critical role in miRNA biogenesis. It liberates the pre-miRNA stem-loop by cleavage of the longer pri-miRNAs in the nucleus [95]. In addition to this function, Drosha also influences cell proliferation and apoptosis [96]. Since overexpression of Drosha is observed in bladder cancer, this SNP is associated with an increased risk of bladder cancer [58]. Yuan et al. [58] reported that DROSHA's 3′-UTR contains a target site for miR-27b, while rs10719 (T>C) is located in close proximity to this site (46 bp downstream of the miR-27b binding site). They have postulated that rs10719T to C transition leads to weaker mRNA : miRNA association at the miR-27b target site and consequently to increased Drosha expression.

2.4. SNPs affecting the miRNA : mRNA interaction

2.4.1. FGF20 | miR-433

An example of another poly-miRTS related to PD was provided by Wang et al. [59], who reported a correlation between SNP (rs127202208 C/T) in the 3′-UTR of fibroblast growth factor 20 (FGF20) and the development of PD. FGF20 is mainly expressed in substantia nigra and increases proliferation and promotes survival of dopaminergic neurons during the early stages of life. However, increased levels of FGF20 in the later stages of life enhance α-synuclein expression and can lead to the death of dopaminergic neurons [59]. miR-433, which is abundant in brain, was shown to downregulate the translation of FGF20, mainly because this reported SNP resides within the predicted binding site for miR-433. The allele C of this polymorphism represents a valid miRNA base pairing, whereas the T allele introduces a G : U wobble base pairing and consequently a mismatch, which affects the miRNA : mRNA interaction. However, this SNP does not eliminate the miRNA : mRNA binding, but attenuates it. This leads to increased FGF20 levels and indirectly to overexpression of α-synuclein. Importantly, the effect of this SNP on FGF20 expression and its relationship to miR-433 levels were confirmed in vivo [59].

3. Conclusion

The discussed examples of poly-miRTSs strongly suggest that these SNPs can be crucial factors in developing human pathologies and could contribute to genetic diversity. As mentioned, roughly 180 000 SNPs in the human genome that are located in the 3'-UTR region were identified along with about 2600 mature miRNA sequences which are deposited in the mirBase (v. 21), suggesting that a large number of these SNPs may introduce miRNA-binding changes. Furthermore, the recent development of deep sequencing techniques and advanced database/software tools like miRSNP and PolymiRTS Database 3.0 (see table 2 for complete list) allows researchers to initially access potential poly-miRTSs. Hence, in the near future, we can expect growing numbers of studies linking poly-miRTSs to human diseases.

Table 2. Current software and databases dedicated for poly-miRTS studies.

In 2008, Sethupathy & Collins [17] provided criteria for assigning SNPs as poly-miRTSs involved in human diseases that include: (i) functional (preferably in vivo) experimental validation of SNPs related to differential mRNA targeting (ii) genetic testing of the association with the disease that takes into account the effects of population stratification and finally (iii) mechanistic testing underlying the mechanism by which poly-miRTSs contribute to the disease [17]. Few current studies satisfy all these criteria (table 1), while the majority of them rely on population correlation effects and in silico modelling only, ignoring the necessity of the mechanistic approach. Importantly, commonly used methods to confirm differential miRNA : mRNA binding, in vitro luciferase reporter constructs and miRNA overexpression often do not consider the physiological miRNA levels in vivo. However, miRNA physiological levels are often undergoing dynamic changes due to epigenetic factors [101], and thus they can affect the verification of the poly-miRTS disease-related mechanisms. The luciferase-based reporter assays are usually performed in artificial cancer cell lines that permit easy AgoMiR (mimic) delivery, and are often characterized by low endogenous miRNA levels. The latter inhibits endogenous miRNAs from degrading the reporters prior to the miRNA overexpression. Importantly, the miRNA overexpression in these systems is often a hundred fold higher than in vivo conditions. Hence, in the case of validation of new target sites created by poly-miRTS, this experimental model may lead to false positive results, since it cannot differentiate between weak and strong binding to the targets. The vector-based miRNA expression system that provides inducible and scalable control over miRNA levels may provide more solid verification of potential miRNA : mRNA binding [102].

Recently, the development of morpholino-based target protector technology provides an elegant tool to test the functionality of novel potential miRNA : mRNA interactions that mimics physiological conditions [103,104]. Target protectors bind to specific target mRNA sequences and block miRNA access, however without triggering an RNAi response [105]. Hence, target protectors allow blocking the miRNA-mediated suppression of a specific target mRNA [105]. Importantly, these modified oligonucleotides can be used to evaluate the significance of miRNA : mRNA interactions in the context of physiological miRNA levels.

Furthermore, often changes in a gene's mRNA level are not reflected in its protein levels [106]. Hence, the studies of miRNA SNP-affected targets should be always accompanied by monitoring protein levels in cell lines related to the disease. Finally, although in research models usually one miRNA and one target are considered, the single miRNA usually is predicted to bind hundreds of target mRNAs, and have multiple effects on cellular metabolism. Hence, studying the mechanism of poly-miRTS involvement in human diseases requires verification that the miRNA effects are a result of indirect disease-related targets. Although this possibility cannot be totally eliminated, following genome-wide effects of specific miRNA modulation (with next generation sequencing) can support direct miRNA : mRNA interactions.

The most convincing and final criterion for linking poly-miRTSs to disease is establishing the disease-related mechanisms of differential miRNA binding. Taking into account complexity of a potential SNP effect on miRNA : mRNA pairing, this can be challenging. Nevertheless, the recent development of targeted genome editing tools (like CRISPR/Cas9 systems) allows one to make efficient, precise and targeted changes to the genome of the living cells, and opens novel possibilities to overcome this limitation [107]. Sadly, to date, no study has been reported in which targeted genome editing was applied in order to validate poly-miRTSs.

Analysing the specific effects of homozygotic and heterozygotic SNPs in both in vitro and in vivo disease models could provide the critical proof for the role of and frequency that poly-miRTSs occur in human diseases.

A practical view of fine-mapping and gene prioritization in the post-genome-wide association era

Over the past 15 years, genome-wide association studies (GWASs) have enabled the systematic identification of genetic loci associated with traits and diseases. However, due to resolution issues and methodological limitations, the true causal variants and genes associated with traits remain difficult to identify. In this post-GWAS era, many biological and computational fine-mapping approaches now aim to solve these issues. Here, we review fine-mapping and gene prioritization approaches that, when combined, will improve the understanding of the underlying mechanisms of complex traits and diseases. Fine-mapping of genetic variants has become increasingly sophisticated: initially, variants were simply overlapped with functional elements, but now the impact of variants on regulatory activity and direct variant-gene 3D interactions can be identified. Moreover, gene manipulation by CRISPR/Cas9, the identification of expression quantitative trait loci and the use of co-expression networks have all increased our understanding of the genes and pathways affected by GWAS loci. However, despite this progress, limitations including the lack of cell-type- and disease-specific data and the ever-increasing complexity of polygenic models of traits pose serious challenges. Indeed, the combination of fine-mapping and gene prioritization by statistical, functional and population-based strategies will be necessary to truly understand how GWAS loci contribute to complex traits and diseases.

1. Introduction

Most, if not all, phenotypic traits and diseases have a genetic component that influences their development, susceptibility or characteristics. Which genetic regions (loci) are linked to phenotypic traits has largely been determined by genome-wide association studies (GWASs) (figure 1a). GWASs compare and associate millions of relatively common genetic variants, usually single-nucleotide polymorphisms (SNPs), between a baseline (healthy) population and one with a trait of interest such as type 1 diabetes [1], coeliac disease [2] or height [3]. The trait-associated genetic loci obtained by GWASs are marked by specific variants referred to as marker or top variants. Each marker-variant signifies a haplotype containing many nearby variants that are in high linkage disequilibrium (LD), indicating that they are most likely to be inherited together [4] (figure 1b). Over 4000 GWASs have been published since 2002 [5], yielding almost 150 000 marker variant associations to hundreds of traits [6]. However, despite the method's great initial promise, GWASs have not provided immediate insights into the underlying biological mechanisms of each trait due to two major complicating factors.

Figure 1. Outline of the current post-GWAS workflow. (a) First, the correct context needs to be identified for the trait under study. (b) Subsequently, causal variants can be fine-mapped to better understand the fundamental mechanisms of transcription. Here, the causal variant (star) is not the strongest GWAS signal, but rather a variant in strong LD with the top effect located in an active enhancer region. (c) To gain insights into the biological processes leading to the phenotype, genes can be prioritized and causal networks constructed. GWAS variants are generally common in the population and have smaller effect sizes (blue). Thus, the genes that they impact are more likely to have a small effect on the phenotype as well (peripheral genes). The genes on which many peripheral genes converge (core genes) generally have stronger effects (red) on the phenotype. As such, the variants that affect core genes are more likely to be Mendelian disease variants.

Firstly, GWASs cannot distinguish the marker-variant signal from that of the other varaints that are in high LD. Over 95% of the variants in high LD (R 2 > 0.8) are located outside of genes in the non-coding DNA [7] and can be located up to 500 kb apart [8]. Consequently, any of them could be the actual causal variant (figure 1b).

Second, the effects of non-coding causal variants can be highly cell-type-, context- and disease-specific [9]. Non-coding DNA contains regulatory regions—enhancers and promoters—that can bind transcription factor (TF) proteins and regulate gene expression [10]. Which enhancers and promoters are used depends on the cell-type-specific abundance of approximately 1600 human TFs and their epigenetically regulated accessibility to a given regulatory region [11]. Variants can disrupt the binding of any of these TFs, resulting in changed enhancer or promoter activity. This, in turn, affects gene expression [12] and cellular pathways [13]. Thus, the cell-type and tissue- or disease-specific micro-environment greatly affect which variants, TFs, genes and pathways are involved (figure 1). These complexities make it difficult to understand how GWAS loci contribute to their associated traits and have significantly hampered the interpretation and application of GWAS results. To address this, many different fine-mapping approaches have been developed in the post-GWAS era with the aim of identifying the important variants and genes and interpreting their biological impact on diseases and traits [14–17].

Important to note is that to reduce fine-mapping complexity, most approaches assume that only a single variant per locus contributes to a trait. This is, however, not a proper reflection of reality as multiple variants within a single GWAS locus can have an effect on a single gene's expression. This can occur in one of two ways: either the effect of the variants adds up in a linear way (additive effect) or an interaction between two or more variants is required to affect gene expression (epistatic effect) [18,19]. Thus, multiple variants may play a role in a single locus, either within a single cell-type or in a context- and cell-type-specific manner [18]. This further complicates performing and interpreting fine-mapping and gene prioritization approaches. For simplicity, throughout this review, we continue to address variants that affect gene regulation and pathways in association with a GWAS trait in any way as causal, even though a collective of smaller contributing effects acting in unison per locus may be necessary to elicit a functional effect on a GWAS trait.

Here, we assess fine-mapping and gene prioritization approaches that have been used to translate GWAS loci to a functional understanding of the associated trait, while taking cell-type- and disease-specific context into account. Specifically, we review the genetics of lower effect size common variants identified through GWASs rather than high effect-size Mendelian disease variants (figure 1c). Moreover, we discuss the impact of the recent paradigm shift towards polygenic models and how these can be used to aid in the identification of gene networks that highlight core disease genes (figure 1c).

2. Fine-mapping from the variant perspective

Fine-mapping variants in GWAS loci require an understanding of the underlying mechanism by which a variant can contribute to a trait. Overcoming LD and identifying the context-specific variants that are causal to a trait is imperative for understanding disease mechanisms and confidently identifying which downstream genes and pathways are affected. Many functional and computational (high-throughput) fine-mapping methods have been developed and applied for this purpose. Below we review several fine-mapping methods according to their increasing ability to describe the complex role of variants in GWAS traits and diseases.

2.1. Identifying overlap with functional elements

The most straightforward fine-mapping approach is to overlap GWAS variants in high LD with functional elements such as promoters and enhancers (figure 2a). Currently, the best resource for functional elements has been compiled by the NIH Roadmap Epigenomics Mapping Consortium [20] (electronic supplementary material, table S1), which used ChIP-seq (electronic supplementary material, table S2) to measure histone marks to determine the location of functional elements in 127 different cell and tissue types [20,21]. Fine-mapping of GWAS variants from 21 autoimmune diseases using the NIH Roadmap and similar data estimated that approximately 60% of candidate causal variants map to immune cell enhancers, and another approximately 8% to promoters [12]. This was also reflected in the tissue-specific enrichment of type 1 diabetes susceptibility variants in lymphoid gene enhancers [22]. Moreover, candidate causal variants were enriched in enhancers defined by the histone mark H3K27ac in specific subsets of CD4+ T cells, CD8+ T cells and B cells [12]. This was also the case in another study in monocytes, neutrophils and CD4+ T cells [23]. Other studies have also identified tissue-specific enrichments of disease-associated variants via overlap with functional elements, showing that this approach can help specify which variants play a role in certain cell types [23,24].

Figure 2. An illustrative depiction of a GWAS locus showing example mechanisms by which variant effects on enhancer activity and gene expression can be detected. (a) Many trait-associated variants are shown with varying LD strength (scatterplot) when compared with the GWAS-identified marker variant (in black). In this example, the causal variant is located in an allele-dependent active enhancer (C-allele, caQTL) as shown by the open chromatin regions of the same locus (peak-density plot below the variant). The variant affects the TF binding site of the green TF with a strong binding preference for the C-allele, as shown by the enhancer activity in the ‘transcription factor binding affinity’ box. In addition, using 3D interactions (grey arches connecting the gene, promoter and enhancer), physical contact with the nearby ‘Gene X’ indicates the enhancer affects the gene's expression. (b) To highlight cell-type-specific effects, the influence of the causal variant is depicted in three cell types with varying TF availability. The mRNA expression of ‘gene X’ is stronger for the CC-genotype compared with the GG-genotype because of the increased TF binding affinity to the green TF (as shown in a). This mRNA expression remains low but stable for the GG-genotype in all three cell types regardless of the TF availability but decreases for the CC-genotype in cell types with reduced TF availability, which reduces cooperative TF binding.

Other ways of detecting regulatory regions that can be used to fine-map GWAS variants are either based on DNA accessibility, such as ATAC-seq [25] and DNase-seq [26] (electronic supplementary material, table S2), or identify the inherent transcriptional activity of enhancers and promoters [27,28], such as GRO-seq [29], PRO-seq [30] and CAGE [31] (electronic supplementary material, table S2). Collective public databases using these techniques—like the NIH Roadmap consortium [20], ENCODE [32], FANTOM5 [33] and the IHEC consortium [34]—are indispensable context-specific resources (electronic supplementary material, table S1). However, it appears to be more difficult than originally anticipated to specify the exact location of regulatory regions since all these methods show different sensitivities and accuracies in the mapping of active regulatory regions [35]. Moreover, overlap of a variant with an active regulatory region may not result in functional disruption of these elements, and thus does not definitively point to causality. This uncertainty limits the accuracy of fine-mapping through overlap with functional elements and still leaves us with a multitude of candidate causal variants.

2.2. Inferring allele-specific variant effects

In high-throughput methods such as ATAC-seq, the sequencing reads containing a variant can be separated based on its allele. The allele-specific abundance of sequencing reads can then directly inform us about the functionality of this variant on the open chromatin region. Variants that cause allelic imbalance in regulatory regions are called chromatin accessibility quantitative trait loci (caQTLs figure 2a) [25,36]. Many caQTLs were identified in primary CD4+ T-cell ATAC-seq peaks, and these showed a strong enrichment in candidate causal autoimmune variants [36]. Similarly, the existence of variants or histone-QTLs that affect regulatory regions by altering enhancer-associated H3K27ac or H3K4me1 histone peaks also implies that these variants have an effect on cell-type-specific enhancer activity [23]. Due to their functional effect on DNA accessibility and epigenetic marks, these variants are more likely to be causal variants for GWAS traits.

Another mechanism by which non-coding GWAS variants can have an allelic effect on gene expression is alternative splicing of genes. GWAS-associated variants have the potential to induce cell-type-specific alternative splicing (sQTL) or could affect trans-acting splicing regulation genes [37,38]. This was shown in a genome-wide approach where 622 exons with intronic sQTLs were identified. One hundred and ten of these exons harboured variants in LD with GWAS marker variants [37]. In a more specific example, the multiple sclerosis-associated PRKCA gene is seemingly affected by an intronic sQTL that increases the expression of a gene isoform more prone to nonsense-mediated decay, thereby reducing the likely protective PRKCA mRNA levels post-transcriptionally [39]. However, sQTLs appear to also act through more complex mechanisms such as indirectly through caQTLs [40], or by inducing alternative upstream transcription start sites [41]. These and many other examples [38] suggest that sQTLs may be an important but complex mechanism by which GWAS-associated variants affect a trait.

2.3. Identifying variants that disrupt underlying TF binding sites

Further prioritization of variants in regulatory regions that show allelic imbalances can be done by computational or functional analysis of the underlying TF binding sites (TFBS) or motifs. Regulatory regions consist of both very strict and more degenerate DNA motifs [42] to which TFs can bind in order to initiate local transcription (e.g. enhancer RNAs) and regulate nearby or distant genes [10,27]. Variants can change the TFBS, altering the binding affinity of the TF and changing the activity of a regulatory region (figure 2a) [18,43,44]. The specificity and location of potential TFBSs have been collected for many cell types in large databases such as JASPAR [45], FANTOM5 [33] and ENCODE [32] (electronic supplementary material, table S1), mostly using ChIP-seq and HT-SELEX [46] (electronic supplementary material, table S2).

An enrichment of TFBS disruption by putatively causal variants has been identified for 44 families of TFs [18]. For TFs like AP-1 and the ETS TF-family, regulatory regions containing these disrupted TFBSs also show effects on chromatin accessibility, indicating that the effect of variants on TF binding affinity leads to caQTLs [18]. Similarly, upon identification of nearly 9000 DNase-seq locations affected by allelic imbalances, it was found that the alleles associated with more accessible chromatin were also highly associated with increased TF binding [43]. In a more specific case, TFBS disruption analyses and in vitro confirmation by ChIP-seq led to the identification of rs17293632 as a likely causal SNP that increases Crohn's disease risk by disrupting an AP-1 TFBS [12]. Interestingly, this effect on AP-1 TFBSs was stimulation-specific: H3K27ac peaks with affected AP-1 TFBSs were enriched in stimulated CD4+ T cells compared with non-stimulated cells [12]. This highlights the importance of context-specificity and the need for tissue- and disease-relevant stimulations in experimental set-ups (figure 2b) [12,47]. Finally, in a study of leukaemia patients, a small DNA insertion resulting in a TFBS for MYB created an enhancer near TAL1, which led to activation of this oncogene and the onset of leukaemia [48]. Thus, decreased or increased affinity of TFs due to genetic variants or small DNA changes can have far-reaching effects.

Currently, only 10–20% of the potentially causal non-coding GWAS variants defined by allelic imbalances within a regulatory region can be shown to disrupt a known TFBS [12]. Therefore, the actual causal variants may potentially act through a different mechanism, or our understanding of TF binding may still be insufficient [49]. One complicating factor here is the potential cooperative binding of more than one TF at an overlapping TFBS. Detection of these cooperative binding motifs is currently being improved by both biological methods (such as SELEX-seq [50]) and computational methods, such as No Read Left Behind (NRLB) [44]) (electronic supplementary material, table S3). A striking example of context-specific cooperative binding of TFs is illustrated by an increased TFBS enrichment of p300, RBPJ and NF-kB in risk loci of GWAS traits as a consequence of the presence of Epstein–Barr virus (EBV) EBNA2 protein [51]. In this study, ChIP-seq data from EBV-transformed B-cell lines were used, together with the RELI algorithm (electronic supplementary material, table S3), to systematically estimate the enrichment of variants in TFBS [51]. In six out of the seven autoimmune disorders tested, RELI identified that 130 out of 1953 candidate causal variants [12] overlapped with EBNA2 binding sites in B-cell lines identified by ChIP-seq [51]. Interestingly, many autoimmune diseases, including coeliac disease and multiple sclerosis [52,53], are thought to be partially triggered by viral infections, suggesting that variants may only be causal when viral factors are also present. Moreover, TF motifs can be highly degenerate, and a small change in TF binding affinity can induce a subtle dosage effect on the activity of a regulatory region [44]. While this effect may be subtle, downstream genes could be affected sufficiently [44] to induce or affect a trait. Thus, a better understanding of how TF binding affinity to DNA motifs is mediated is necessary to comprehend how variants affect the functionality of a regulatory region.

2.4. Fine-mapping by detection of regulatory region activity

A more immediate fine-mapping approach is to directly measure the effect a variant can have on the strength of a regulatory region. Active promoters and enhancers have transcription start sites (TSSs), and the activity of an enhancer or promoter is directly correlated with the active transcription from these TSSs [27]. However, some promoter RNAs, and most enhancer RNAs, are very short-lived, making them difficult to detect with most RNA sequencing methods [10,27]. CAGE (electronic supplementary material, table S2) does allow for the identification of exact TSS locations, as well as expression levels of genes, by sequencing 5′-capped transcripts regardless of their stability [30]. CAGE has identified promoter and enhancer effects, and showed that 52% of the effects observed in promoter regions were in secondary CAGE peaks, highlighting that genes can have multiple active promoters depending on the genotype [54]. CAGE QTLs have been observed for loci associated with systemic lupus erythematous (SLE) and inflammatory bowel disorder [54], supporting their relevance in immune disease.

Reporter-plasmid assays can also be applied to directly measure the effects of variants on enhancer or promoter TSS activity by moving variant-containing DNA fragments from their natural environment to a plasmid and transfecting these into a cell type of interest. The most traditional reporter-plasmid assay, the luciferase assay (electronic supplementary material, table S2), was used to confirm a functional effect of rs1421085, which is associated with obesity risk, by showing that the risk-allele induces an increase in enhancer activity [55]. However, high-throughput reporter assay methods with high resolution are required to fine-map all potentially causal variants within entire GWAS loci based on regulatory region activity.

One such method, the massively parallel reporter assay (MPRA electronic supplementary material, table S2), can test over 30 000 candidate variants by synthetically creating 180 bp DNA fragments containing both alleles of a variant with a unique barcode and integrating these into GFP-reporter plasmids that are subsequently transfected into different cell lines [56]. An MPRA was used to identify the expression of 12% (3432) of the 30 000 candidate DNA fragments in three cell lines, with 842 showing allelic imbalances caused by SNPs. Indeed, 53 of these SNPs had previously been associated with GWAS traits [56]. Similar high-throughput fine-mapping methods that use patient-derived DNA instead of synthetically generated DNA sequences are STARR-seq [57] and SuRE [58] (electronic supplementary material, table S2). Using a whole-genome approach, the SuRE method managed to screen 5.9 million SNPs in the K562 red blood cell line, identifying over 30 000 SNPs that affect regulatory regions and allowing for in-depth fine-mapping of SNPs for 36 blood-cell-related GWAS traits [59]. Follow-up research on these reporter assays has identified a causal SNP (rs9283753) in ankylosing spondylitis [56] and another (rs4572196) in potentially up to 11 red blood cell traits [59]. Despite the obvious advantages of high-throughput fine-mapping screens, a major drawback is that these methods are usually applied in cancer or EBV-transformed cell lines. These cell lines can be significantly different from trait-specific tissue-derived cell types [60] and have often accumulated many somatic mutations as a consequence of years of culturing [61]. Thus, the wrong variants may be identified as causal because the relevant cell-type and context-specific effects have not been considered [62].

2.5. From causal variant to gene using the 3D interactome

When a causal variant has been identified, the gene expression effects of that variant can be directly assessed by mapping the necessary physical interaction of the regulatory region it affects with its target genes (figure 2a) [63,64]. For example, H3K27ac regions containing autoimmune-disease-prioritized variants were linked to the TSS of genes using HiChIP (electronic supplementary material, table S2) and shown to contain cell-type-specific interactions between the TSS of the IL2 gene and rs7664452 in Th17 cells and between rs2300604 and target gene BATF in memory T cells [63]. Interestingly, for 684 autoimmune-disease-associated variants assessed with HiChIP, 2597 gene–variant interactions were identified, indicating that autoimmune disease variants can regulate a multitude of genes. Moreover, only 14% (367) of these gene–variant interactions were with the gene closest to the variant [63]. Another example of a long-range interaction of a causal variant is that of the previously mentioned rs1421085, which is associated with obesity risk and located in an intron of FTO. TFBS disruption analyses have shown that rs1421085 disrupts the ARID5B TF binding motif and affects the activity of an enhancer that regulates IRX3 and IRX5, genes located 1.2 Mb upstream, instead of the initially expected co-localized FTO gene itself [55,65]. Thus, fine-mapping and interaction analysis has identified additional causal genes in this obesity-associated risk locus.

Hi-C (electronic supplementary material, table S2) is another high-throughput method for identifying specific promoter and enhancer gene interactions [19,66–68]. For example, Hi-C was used to prioritize four rheumatoid arthritis genes by overlapping promoter–gene interactions of various primary immune cells with rheumatoid arthritis GWAS variants [19]. Another study analysed Hi-C datasets of 14 primary human tissues and showed that frequently interacting regions (FIREs) are enriched for disease-associated GWAS variants [68]. However, the resolution limitations of Hi-C and other interaction data make it difficult to precisely pin-point the causal variant within a regulatory region [63,64,68]. In addition, cell-type and environmental effects influence regulatory region interactions with genes, as shown by the fact that 38.8% of FIREs were identified in only one tissue or cell type [68]. Thus, multiple strategies as described here and collected in databases such as the EnhancerAtlas2.0 [69] (electronic supplementary material, table S1) should be combined to confidently fine-map causal variants and link them to genes that play a role in GWAS traits.

3. Gene prioritization using GWAS traits

Traditional fine-mapping approaches focus on identifying the causal variants that affect a trait of interest. While very important, knowing which variants are causal does not identify the downstream effects of the variant on the trait. One way to gain such insights is by identifying the genes that are affected by each GWAS locus. Moreover, if the causal genes affected by a locus are known, this can reduce the credible set of potentially causal variants. Recent efforts in systems biology have focused on identifying such causal genes and their downstream effects.

3.1. Gene prioritization using expression quantitative trait loci

A more comprehensive approach to identifying the genes affected by a GWAS locus is through the use of quantitative trait loci (QTL figure 3a). While caQTLs are often indicative of a causal variant or regulatory region, a specific subset of QTLs called expression QTLs (eQTL) can be used to identify the genes affected by a GWAS locus [70–72]. The simplest way to perform gene prioritization using eQTL analysis is simply to overlap the marker variant of a GWAS locus with the top eQTL variant. An example of this is an SLE risk variant that is also a cis-eQTL for the TF IKF1. The eQTL on IKF1 affected the transcription of 10 genes in trans that are all regulated by IKF1 [70], highlighting this gene as a likely candidate causal gene for SLE. Additionally, these types of effects can be context-specific, as was shown for a cis-eQTL on TLR1 after stimulation of peripheral blood mononuclear cells (PBMCs) with Escherichia coli [73]. This cis-eQTL was also a strong trans regulator of the E. coli-induced response network, regulating another 105 genes [73], showing that an eQTL can strongly influence the immune response to pathogens.

Figure 3. Aspects of fine-mapping genes from GWAS loci. (a) Using eQTLs (dark blue) and CRISPRi/a-based assays, GWAS loci can be linked to genes when using the correct context. (b) Not every relationship between genetics and expression can be described additively. Epistatic effects (dark red) describe a relationship where two (or more) mutations are needed to arrive at the phenotype. (c) Using co-expression, regulatory relationships between genes can be quantified, but the specific role of genetics in these relationships is unknown. (d) Using PGSs, the joint effects of GWAS loci can be assessed, sacrificing resolution to obtain higher-level insights into the pathways affected by the genetics associated with a phenotype. (e) When assessed at single-cell resolution, the total network can be deconstructed into the cell-type relevant components. Affected cells can subsequently display an altered interaction with other cells within a tissue or individual, leading to a changed tissue- or individual-wide outcome for a phenotype.

However, the top eQTL variant might not always be the same as, or in LD with, the top GWAS marker variant due to noise in the eQTL data [74] or to multiple causal effects on a gene or disease in a locus [75]. As a result, many statistical frameworks have been created to give more accurate estimates of overlap or causality between a GWAS locus and a QTL locus, including FUMA [76], COLOC [77] and Mendelian randomization (MR electronic supplementary material, table S3). The latter is commonly used to estimate causality between GWAS and QTL profiles [78–84] and has been successfully applied to identify genes causally linked with complex traits [3,79–81]. For example, MR studies were able to identify a causal role for SORT1 on cholesterol levels [79,81], a role which has been experimentally validated [85]. Still, MR can be challenging as multiple variants in LD can affect the same gene (linkage), and several genes can be affected by the same causal variants (pleiotropy) [70,73,86]. More recent work on MR has focused on more accurately controlling for pleiotropy and linkage [79,81,82,84]. Independent variant selection for MR is currently done by either LD-based clumping or some form of stepwise regression using tools like GCTA's COJO [75] (electronic supplementary material, table S3), which only select for independence and not causality. Accurate fine-mapping can potentially help these efforts by improving the independent variant selection for MR since fine-mapping can reveal the true causal variants independent of linkage.

Recently, it has been suggested that approximately 70% of the heritability in mRNA expression is due to trans-eQTLs [87,88], which highlights the importance of trans-eQTL relationships. While trans-eQTLs have the potential to further our understanding of complex traits, the multiple testing burden is very large due to the large number of comparisons that have to be made when doing genome-wide trans-eQTL mapping (in the worst case, millions of variants times approx. 60 000 genes) [70,72]. Therefore, many eQTL studies opt to only map cis-eQTL effects genome-wide, as this dramatically reduces the number of comparisons that have to be made [70–72,74]. Another approach is to limit the number of comparisons by only mapping trans effects for a predefined subset of variants or genes [70,72,73,86]. However, since a full trans-eQTL mapping dataset is rarely available, overlap between trans-acting genes and GWAS loci will be missed.

An additional challenge with QTL-based gene prioritization approaches lies in the context-specificity of the QTL data used, as different tissues, cell types, time points and stimulation conditions can induce many different expression patterns and different interactions with the variants in a GWAS locus [23,73,89–92]. Consequently, the QTL information that is available might not be informative for the trait under study. This is especially challenging when studying traits that are present in a tissue other than blood, as is the case for neurological disorders [93,94], because sufficiently powerful cell-type- or context-specific QTL studies are usually not available. However, with the advent of single-cell RNA sequencing (scRNAseq) and the increasing availability of large-scale datasets for tissues other than blood, some of these challenges are being overcome [70,72,90,91]. scRNAseq (electronic supplementary material, table S2) allows for high-throughput eQTL analysis in individual cell types instead of a bulk population, as shown for PBMCs [90]. This allows for an increase in resolution and can help to assess only the trait-relevant cell types [91], as shown for eQTLs on TSPAN13 and ZNF414, which were only present in CD4+ T cells and not in bulk or other specifically assessed cell types [90]. Consortia that are amassing single-cell data at a large scale in many different tissues—like the Human Cell Atlas [95], Single-cell eQTLgen [96] and the LifeTime consortium [97] (electronic supplementary material, table S1)—will facilitate the use of single-cell sequencing data for traits where bulk RNA-seq obtained from blood is not informative.

3.2. Identifying downstream effects of GWAS loci using other QTLs

Beyond gene-expression-based eQTL, a plethora of other QTL types exist that affect the abundance of proteins (pQTL) [98,99], metabolites (mQTL) [100], DNA methylation (meQTL) [101], microbiota (miQTL) [102] and cells (cell-count or ccQTL) [103,104]. Naturally, these can all be overlapped with GWAS loci to obtain insights into their pathology. For example, the ex vivo cytokine response to stimulation has been shown to have strong genetic regulators [99]. Interestingly, all the associated effects found were trans (i.e. not in proximity to the cytokine genes), suggesting that the release of cytokines is controlled by genes in the receptor's pathways rather than being directly controlled by the mRNA levels of the cytokine. Moreover, context-specificity is important, as QTLs affecting cytokines from T cells were found to be enriched in autoimmune GWAS loci, whereas QTLs affecting cytokines from monocytes were more enriched in infectious-disease-associated loci [99]. Thus, the effects of genetics on traits should not only be studied at the level of gene expression, but also at levels more directly related to a phenotype.

3.3. Functional approaches to mapping genetic effects on expression

While eQTL analysis provides invaluable insights into the genes that affect a trait or disease, context- and cell-type-specific biases in the expression data and LD structure in GWAS loci cause potential errors in gene prioritization. With the recent introduction of CRISPR/Cas9-based screens [105] (electronic supplementary material, table S2), it is now possible to functionally validate eQTL effects in a high-throughput manner independent of LD structure and in a cell-type relevant to the trait of interest.

CRISPR-based assays use guide RNAs to bind specific regions of the genome and either activate (CRISPRa) or interfere (CRISPRi) with the transcription of genes or enhancers [106]. Recent advances in both scRNAseq and CRISPRi/a have facilitated methodologies that evaluate enhancer effects on genes in single cells [107]. For example, a recent effort evaluated the effects of 5920 candidate enhancers on gene expression using CRISPRi [107]. Strikingly, 664 showed a significant effect on gene expression in K562 cells. Thus, CRISPRi-based assays are capable of identifying enhancer–gene pairs in a high-throughput manner. However, as only approximately 10% of candidate enhancers were actually found to affect gene expression, identifying which enhancers are active based on already available data might not always be straightforward, even for a very well-characterized cell line such as K562 [20,32,34,58,59].

In addition to mapping active enhancer gene pairs, CRISPRi/a-based assays can be used to identify epistatic interactions between genes and to generate gene networks based on changes in co-expression in perturbed versus non-perturbed cells (figure 3b). Genes that are strongly co-expressed are likely to be regulated by a shared mechanism [86]. Therefore, identifying such genes can help reveal the gene network that leads to a disease-associated trait [94,108,109]. Indeed, a CRISPRi screen that targeted 12 TFs, chromatin modifying factors and non-coding RNAs was able to identify epistatic effects in cells perturbed by two guide RNAs [110]. In these cells, chromatin accessibility remained relatively stable in loci associated with autoimmune disease in cells with one perturbed TF. However, significant changes were observed when evaluating the chromatin accessibility for the same loci in cells also perturbed for NFKB1. This again highlights the importance of taking the entire context of a trait into account when fine-mapping or interpreting the role of a GWAS locus.

A major drawback of the majority of CRISPRi/a screens is that they are very laborious and therefore usually performed in easily manipulated, but also highly modified, cancer cell lines [61]. Fortunately, recent studies have shown that CRISPRi screens can be applied to primary T cells [111,112]. This, while challenging, needs to be extended to other tissues and model systems. These studies will greatly assist variant, regulatory region and gene fine-mapping efforts because they directly identify the active enhancer–gene pairs and the downstream gene network affected in specific cell types. In addition, future work could focus on performing CRISPRi/a screens in patient-derived cells that contain relevant risk genotypes to fully reach variant-level resolution.

3.4. Mapping gene–gene regulatory interactions using population data

Co-expression can also be modelled based on inter-individual variation in expression, which can be used to prioritize disease genes and make inferences about the downstream consequences of diseases (figure 3c) [94,108,109,113]. For example, DEPICT (electronic supplementary material, table S3) integrates gene co-regulation with GWAS data to provide likely causal genes and pathways relevant for the trait [113]. Moreover, the GADO tool (electronic supplementary material, table S3) correctly identified causal genes in 41% of a cohort of 83 patients with varying Mendelian disorders, and prioritized several novel causal candidate genes by combining trait-specific gene sets with a co-expression network [109]. Finally, eMAGMA (electronic supplementary material, table S3) used co-expression together with tissue-specific eQTLs in brain regions to prioritize 99 candidate causal genes for major depressive disorder [94]. These co-expression modules were enriched in brain regions but not in whole-blood, highlighting the tissue-specific nature of the co-expression networks [94].

Population-based co-expression networks describe the relationships between genes through both genetics and environment. Consequently, based on the co-expression alone, it is not possible to separate which part of the co-expression is due to genetics. Therefore, these networks have limited use for fine-mapping causal variants and are mainly used to identify genes and pathways affected by GWAS loci after gene prioritizations have been made. In addition, co-expression networks are not directed [108]. Genetic information of the individuals used to generate the co-expression network would solve this issue, as the genetic and environmental components could be separated and directionality could be added into the network [108], although this is not a trivial task. Fine-mapping would be of great value in modelling the genetic component of the network by facilitating the selection of true causal variants.

3.5. Fine-mapping under the omnigenic model

As discussed throughout this review, it is becoming increasingly clear that complex traits are highly polygenic and that many variants can deregulate cis- and trans-acting factors in a variety of ways (figure 2a). In the light of this, Boyle et al. [87] proposed an omnigenic model for complex traits in which each gene that is expressed in the cell will have an effect on the trait or disease in some way (figure 1c) [87,88]. For example, height is so polygenic that most 100 kb genomic windows seem to contribute to explaining its variance. Given that the effect sizes of the individual variant are getting so small, it raises the question: what does the causality of the individual variant mean in a complex trait [87,88,114]? If the omnigenic model is true, it presents a major challenge for fine-mapping GWAS loci, particularly for the interpretation of the downstream consequences as the complexity of genetic effects on traits will only increase. In addition, current functional assays may not be suited to model the small and subtle variant effects and gene–gene or gene–environment interactions observed in population studies using millions of individuals.

Instead, the complete GWAS signal from all loci associated with a trait can be used to estimate a polygenic score (PGS) that describes an individual's genetic predisposition for the given trait. In its most basic form, a PGS constitutes the linear combination of all independent risk genotypes weighted by the GWAS effect size, but many more sophisticated methods exist (figure 3d) [115–117]. The PGS for a trait can be associated with the expression level of genes (and proteins) in a population [72,118]. If there are strong correlations, GWAS loci together, as represented by the PGS, are jointly influencing these genes. These genes probably represent core genes in a disease-associated co-expression network. Although PGSs have issues when it comes to broad applicability across populations [119], they can be a useful abstraction layer to make sense of a polygenic trait.

Given we are becoming aware of the likely polygenic and even omnigenic nature of traits, fine-mapping the individual GWAS locus seems like an impossible task. However, with current approaches the stronger, and arguably more important, genetic effects associated with traits and diseases can be elucidated [70,72,73]. Moreover, by using abstraction layers such as PGS, inferences can be made about the joint consequences of these effects [72]. Indeed, the genes and pathways associated with stronger or joint genetic effects are more likely candidates for drug interventions [120] (electronic supplementary material, table S1). Although we might never fully comprehend all the tiny effects and interactions underlying a trait, we will probably see an increase in clever ways to arrive at the interpretable biological mechanisms behind traits.

4. Future perspectives

We have reviewed recent high-throughput GWAS fine-mapping approaches that can identify variants and genes causal for a trait or disease. The complexity and uncertainty present in aspects of these approaches illustrates that a single approach does not suffice to grasp the full cause and effect of candidate variants and genes. In addition, while large datasets, mostly in blood, have identified many potentially causal variants and genes associated with traits, these candidates need to be refined and validated using tissue- and cell-type-specific resources in combination with trait-specific environmental factors to recapitulate the true biological state of each trait as closely as possible. An additional challenge lies in translating these disease genes into clinical practice, as prioritized genes might not be existing, nor practical, drug targets.

Despite these challenges, we believe that combining the use of patient-derived material, with methods that find regulatory regions and their downstream genes will aid drug target identification for complex diseases. In addition, this knowledge could be used to generate prediction models that aid in the fast and non-invasive identification of trait-specific variants and genes in the general population. This will form the foundation of our understanding of complex traits, aid drug development and will allow tailored precision medicine in the near future.

Tissue-Specific Expression and Target Gene Identification in Endometriosis

Taken together, the studies reviewed above demonstrate complex interactions among genotype (DNA sequence variation), epigenetics, and gene transcription. In addition, tissue-specific and developmental differences in regulation of gene expression are further complicating factors [ 92]. Although many cis-eQTLs are observed across tissues, approximately 30% of cis-eQTLs are tissue-specific [ 92]. This has important consequences because we need to study gene expression in tissues relevant to the disease. To date, most studies with sufficient scale have been conducted using samples from blood or lymphoblastoid cell lines, although this is changing.

We do not know which tissue(s) and cell type(s) are targets for regulatory effects of SNP variation that increase endometriosis risk. Tissues contributing to the development and growth of endometriosis lesions could include deposition of viable endometrial tissue or endometrial stem cells via retrograde menstruation [ 93- 95], epithelial cells from the fallopian tubes [ 96], embryonic cell rests [ 97], mesothelium [ 98], and the immune system [ 99]. It is also possible that the tissue of origin may vary for different presentations of endometriosis.

Data for relevant cell lines for reproductive tissues in the ENCODE project are limited, and comprehensive data that map critical regulatory sequences in reproductive tissues are not currently available. This limits our ability to rapidly link the SNP variation to target genes. We have initiated the Endometrial Gene Expression Project (EGEP) for eQTL studies in endometrium to help prioritize genes and pathways for follow-up studies. We chose endometrium for these studies as one important tissue in which we can examine regulatory effects of SNP variation affecting endometriosis.

Large sample sizes will be essential to have sufficient power to evaluate important tissue-specific regulation of gene expression. The effect sizes of eQTL are large compared with results for GWAS. However, our ability to significantly detect eQTL is also limited by the increased multiple testing burden that is a feature of eQTL analyses. Power calculations show that a sample size of 100 tissue samples is required to detect an eQTL that explains ∼10% of the variance in gene expression with 80% power (at a study-wide corrected P value of 10 −9 ). Increasing the sample size to 400 individuals at the same type I error rate (10 −9 ) and power (80%), we could detect eQTLs that explain ∼6.3% of the variance in gene expression. Because the distribution of effect sizes is not uniform, and based on our data for whole blood, increasing sample size from 100 to 400 would translate to an increase in the numbers of eQTL detected from ∼1200 with >10% variance to ∼3100 with >6.3% variance. Given the fact that gene expression of endometrium varies throughout the menstrual cycle, even larger sample sizes with well-defined stage of menstrual cycle will be required to understand gene regulation in the endometrium and effects of the stage of the menstrual cycle.

Sequence Characterized Amplified Region (SCAR)

DNA fragments amplified by the Polymerase Chain Reaction (PCR) using specific 15-30 bp primers, designed from nucleotide sequences established in cloned RAPD (Random Amplified Polymorphic DNA) fragments linked to a trait of interest. By using longer PCR primers, SCARs do not face the problem of low reproducibility generally encountered with RAPDs. Obtaining a co-dominant marker may be an additional advantage of converting RAPDs into SCARs.


Interleukin-1 (IL-1) is a large family of cytokines (small cell signaling proteins) that mediate innate immune responses to defend the host against pathogens. The IL-1 family has 11 member proteins (IL-1F1 to IL-1F11) and they are encoded by 11 distinct human and mouse genes [1]–[3]. The first discovered family members, IL-1α (newly named IL-1F1) and IL-1β (IL-1F2), are secreted by macrophages and epithelial cells in response to pathogens and have strong proinflammatory properties leading to fever (affecting hypothalamus) and activation of T cells and macrophages.

IL-1 family members have been intensely studied (especially IL-1α and IL-1β) unraveling their roles in a number of autoinflammatory diseases [4]–[6]. Signaling initiated by the IL-1 cytokines increases the expression of adhesion factors on endothelial cells resulting in immune cells (such as phagocytes and lymphocytes) migration to the site of infection. The autoinflammatory disease is a class of chronic inflammation with increased secretion of active IL-1β, thus blocking IL-1β is therapeutically beneficial [7].

IL-1α and IL-1β can induce mRNA expression of hundreds of genes, including themselves (a positive-feedback loop), and their gene regulatory actions are conducted via a conserved signaling pathway [8]. Signal propagation mainly depends on mitogen-activated protein kinases (MAPKs), MAPK kinases (MKK/MAP2Ks), MKK kinases (MKKK/MAP3K/MEKKs) and the downstream proteins of MAPKs, finally leading to activation of transcription factors that regulate the expression of host defense proteins (Figure 1). The signal initiates by binding of IL-1α or IL-1β ligands to type I receptor (IL-1R1) and propagates with the help of the co-receptor IL-1 receptor accessory protein (IL-1RAP), forming a trimeric complex [9]. In this trimeric complex, the Toll- and IL-1R–like (TIR) domains on the cytoplasmic regions of IL-1R1 and IL-1RAP receptors get close to each other resulting in the recruitment of myeloid differentiation primary response gene 88 (MYD88), Toll-interacting protein (TOLLIP) [7] and IL-1 receptor-associated kinase 4 (IRAK4) [10], [11]. A stable complex is formed between IL-1, IL-1R1, IL-1RAP, MYD88 and IRAK4 [10]. MYD88 binding triggers phosphorylation of IL-1 receptor-associated kinases IRAK4, IRAK2 and IRAK1, leading to the recruitment and oligomerization of tumor necrosis factor-associated factor 6 (TRAF6) [12]–[14]. TRAF6 and phosphorylated IRAK1 and IRAK2 dissociate and migrate to the membrane to associate with TGF-β-activated kinase 1 (TAK1) and TAK1-binding proteins TAB1 and TAB2 [7]. The TAK1-TAB1-TAB2-TRAF6 complex migrates back to the cytosol, where TRAF6 is ubiquitinated and TAK1 is phosphorylated [7]. From this point, the signal can propagate via two main paths: IKK – IκB – NF-κB and/or MKK – MAPK/JNK/ERK (Figure 2). In the first path, phosphorylated TAK1 activates the inhibitor of nuclear factor kappa-B kinase subunit beta (IKKβ) and activated IKKβ phosphorylates the nuclear factor kappa-B inhibitor (IκB) which gets degraded so that nuclear factor kappa-B kinase (NF-κB) is released and migrates to the nucleus [7]. TAK1 can also activate mitogen-activated kinases (MAPK) p38, c-Jun N-terminal kinases (JNK) and extracellular signal-regulated kinases (ERK) by interacting with MAP kinase kinase (MKK) proteins. Downstream in this path, are transcription factors such as c-Jun, c-Fos, c-Myc and ATF2.

In this simplified diagram of the IL-1 signaling pathway, the signal initiates by the recognition of cytokines by IL-1 receptors and propagates via multiple sub-pathways involving family homologs or alternate pathways to activate transcription factors downstream.

This detailed map of IL-1 signaling presents the protein-protein interactions and the resulting cellular events. The colored nodes represent proteins having experimentally identified 3D structures and the white nodes are the proteins without 3D structures. The edges represent protein-protein interactions (straight/dashed arrows relate to available/unavailable 3D structures of proteins) or associations leading to cellular events such as cell cycle or gene expression (dashed arrows beginning with circular heads).

MAP kinase signaling pathways, which are conserved among eukaryotes, mediate cellular events triggered by extracellular signals such as cytokine binding [15] and they are essential for IL-1 signaling (Figures 1 and 2). This pathway builds upon a triple kinase cascade consisting of a MAP kinase kinase kinase (MKKK/MEKK), a MAP kinase kinase (MKK/MEK) and a MAP kinase (MAPK) and these kinases sequentially phosphorylate and activate each other [15]. The JNK and p38 MAP kinases, called stress activated MAP kinases, have roles in tumor suppression and can be both directly phosphorylated and activated by MKK4, which is also a tumor suppressor [16]–[18]. The successive activation mechanism takes place as follows: MEKK interacts with inactive MKK and phosphorylates it the complex dissociates, releasing the free and active MKK, which is ready to interact with inactive JNK to activate it [15]. Activation of JNK leads to disruption of the MKK-JNK interaction, freeing the active JNK to phosphorylate its downstream targets. There are several mechanisms through which stress activated MAP kinases regulate tumor suppression, including promoting apoptosis (p53, Bax, Bim/Bmf), inhibiting proteins that inhibit apoptosis (Bcl2, Bcl-XL, 14-3-3, Mcl-1), inhibiting tumor development (TGF-β1) and tumor growth (CDC25, CyclinD1/CDK4) [16]. Somatic mutations were identified in the JNK pathway via large-scale sequencing analyses of human tumor cells [19], [20], and JNK3 encoding gene (MAPK10) has been speculated to be a putative tumor suppressor gene as almost half of the brain tumors that were examined included mutations [21]. ERK1 and ERK2, the other members of MAPK family, are also upregulated in tumors [22].

Recently, inflammation has been related to cancer [23]–[25]. Cancers are mostly due to somatic mutations and environmental factors, and chronic inflammation is implicated by most of these risk factors [26]. Chronic inflammation, due to autoimmune diseases or infections, causes tumor development via several mechanisms, including oncogenic mutation induction [26]. Oncogenic mutations and single nucleotide polymorphisms (SNPs) are key players in inflammation-related cancers and it is crucial to map the mutations/SNPs on the corresponding 3D structures of the proteins to gain insight into how they affect protein function [27], [28]. SNPs that cause diseases, if not in the core of the protein, are frequently located in protein-protein interface regions rather than elsewhere on the surface [26], [28]. Structural knowledge can clarify the conformational and functional impact of the mutation/SNP on the protein [27]–[29]. The effect of a functional mutation can be expressed by a change in the specificity of the interactions between a mutated protein and its partners [30]. Quantitatively, a mutation changes the binding free energies of the mutant's interactions with its partners with respect to the free energies of its interactions in the native form [30]. The functional impact of the mutation on the specificity differs. The mutation can destabilize the protein and/or its interaction, leading to ‘loss-of-function’ or can lead to a change in the specificity of protein-partner interactions, resulting in a ‘gain-of-function’, or can gain new binding partners and hence a new biological function, i.e. result in a ‘switch-of-function’ [30].

Two recent studies used structural pathways for mapping mutations on protein-protein interfaces, one on smaller scale pathways [27] and the other on large scale [28]. Mosca et al. [27] mapped mutations onto proteins as an application of their useful computational modeling technique with the data limited to the Kyoto Encyclopedia of Genes and Genomes (KEGG) complement cascade pathway and the interactions of complement component 3 (C3) and complement factor H (CFH) which it includes. In a pioneering work, Wang et al. [28] explored genotype-phenotype relationships on a large scale. They systematically examined thousands of mutations and mapped them onto interaction interfaces and experimentally validating their predictions for the MLH1-PMS2, WASP-CDC42 and TP63–TP73 interactions. These and other studies emphasized the need for computational methods for large-scale interactome studies [31], [32]. We propose a method similar to ones used in the works of Mosca et al. [27] and Wang et al. [28], while introducing the advantage of in silico mutagenesis to observe the effects of mutations on protein-protein interactions on a large scale.