The International HapMap Project aimed to determine the common patterns of DNA sequence variation, their frequencies, and the correlations between them by genotyping samples from four large populations: Centre d'Etude du Polymorphisme Humain reference individuals from Utah, USA (CEU), Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan (JPT), and Yoruba in Ibadan, Nigeria (YRI), at a density of 1 SNP every 5 kb. The populations genotyped in the HapMap can serve as reference populations for selecting tagging SNPs (tSNPs) that capture most of the variation in the genome. This provides an important shortcut for candidate-gene and genome-wide association studies in a given population by minimizing the number of SNPs that need to be genotyped [1-3].
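The idea of tagging can be illustrated with a minimal sketch. It assumes a greedy strategy in which a SNP "captures" every SNP whose pairwise r² with it meets a threshold; the 0.8 threshold, the function names, and the use of the squared Pearson correlation of genotype dosages as a stand-in for haplotype-based r² are all illustrative assumptions, not the procedure used by HapMap or by any specific tool.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_tag_snps(genotypes, threshold=0.8):
    """Pick tag SNPs greedily until every SNP is captured at r^2 >= threshold.

    genotypes: dict mapping SNP name -> list of dosages in the reference sample.
    The threshold value is an assumption chosen for illustration.
    """
    names = list(genotypes)
    captures = {s: {s} for s in names}  # every SNP captures itself
    for a, b in combinations(names, 2):
        if r_squared(genotypes[a], genotypes[b]) >= threshold:
            captures[a].add(b)
            captures[b].add(a)
    untagged = set(names)
    tags = []
    while untagged:
        # choose the SNP that captures the most still-untagged SNPs
        best = max(names, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical reference-panel genotypes for four SNPs in six individuals.
reference = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],
    "snp3": [2, 1, 0, 2, 1, 0],
    "snp4": [0, 0, 1, 2, 2, 1],
}
tags = greedy_tag_snps(reference)
```

In this toy panel the first three SNPs are perfectly correlated, so one of them stands in for all three and only two SNPs would need to be genotyped.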
susceptibility. Ten candidate tagging SNPs (tSNPs) were selected from seven genes whose polymorphisms have been reported in the literature and in reliable databases to be related to gliomas, with a minor allele frequency (MAF) > 5% in the HapMap Asian population. The selected tSNPs were genotyped in 629 glioma patients and 645 controls from a Han Chinese population using the multiplexed SNP MassEXTEND assay. Two tSNPs in the RTEL1 gene were significantly associated with glioma risk by χ² test (rs6010620, P = 0.0016, OR: 1.32, 95% CI: 1.11-1.56; rs2297440, P = 0.001, OR: 1.33, 95% CI: 1.12-1.58). The “GG” genotype of rs6010620 was identified as a protective genotype for glioma (OR, 0.46; 95% CI, 0.31-0.7; P = 0.0002), as was the “CC” genotype of rs2297440 (OR, 0.47; 95% CI, 0.31-0.71; P = 0.0003). Furthermore, the RTEL1 haplotype “GCT” was associated with reduced glioma risk (OR, 0.7; 95% CI, 0.57-0.86; Fisher’s P = 0.0005; Pearson’s P = 0.0005), and haplotype “ATT” was associated with increased glioma risk (OR, 1.32; 95% CI, 1.12-1.57; Fisher’s P = 0.0013; Pearson’s P = 0.0013). In summary, two single variants in the RTEL1 gene, the “GG” genotype of rs6010620 and the “CC” genotype of rs2297440, together with the two haplotypes GCT and ATT, were associated with glioma development. Screening these RTEL1 tagging SNPs and haplotypes might be used to evaluate glioma risk.
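The kind of statistics quoted above (odds ratio, 95% confidence interval, χ² P value) can be sketched for a single 2×2 table. The counts below are hypothetical, since the excerpt does not report the underlying tables. The interval uses the common Woolf (log) approximation and the test is an uncorrected Pearson χ² with one degree of freedom; both are assumptions about method, not a reconstruction of the authors' analysis.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf 95% CI for a 2x2 table.

    a, b: cases with / without the allele or genotype
    c, d: controls with / without the allele or genotype
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def pearson_chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square statistic and P value (1 df) for a 2x2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, chosen only to exercise the functions.
or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
chi2, p = pearson_chi2_2x2(20, 80, 10, 90)
```

An odds ratio below 1 with an interval that excludes 1 corresponds to a protective association, and one above 1 to increased risk.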
For the phase II cohort, sixteen tagging SNPs for the same metabolism enzyme genes were selected, which covered 5,000 base pairs (bp) upstream and 1,000 bp downstream of these genes. Because two tagging SNPs in the CYP2E1 gene had a low design score on Illumina measurements, we selected four other SNPs located in the locus or coding region. One of them, rs3813867, tracked CYP2E1-RsaI (rs2031920), which was previously reported to be associated with NPC. For phase II samples, sixteen SNPs were genotyped by Illumina GoldenGate assay: rs7927381, rs6591256 and rs947895 for GSTP1; rs2071409 and rs2243828 for MPO; rs10517, rs1800566, rs4986998, rs689452, rs2917667, rs2917666 and rs1469908 for NQO1; and rs3813867, rs2070673, rs41299426 and rs41299434 for CYP2E1. Fourteen SNPs (all except rs41299426 and rs41299434 of CYP2E1, which were excluded due to low minimum allele frequency) were analyzed as described below.
We used the Tagger program within Haploview to select 18 ABCB1 tagging SNPs (HapMap Phase II release #24) within a 60 kb region encompassing rs2032582, and genotyped these and our three previously reported SNPs (rs1128503, rs2032582, and rs1045642) [10, 11] in optimally debulked AOCS patients that met our inclusion criteria (n=433). SNPs were analysed both independently and in forward and backward log-additive stepwise Cox proportional hazards (PH) models adjusted for tumor stage and level of residual disease, using a conservative p-value of 0.05 to enter or exit the model. We also investigated haplotype frequencies for all patients genotyped for these 21 SNPs (n=615) using the Beagle Genetic Analysis Software package v.3.3.2 for inferring haplotype phase or sporadic missing genotype data in unrelated individuals. The likelihood ratio test was used to compare regression models of the three most common ABCB1 haplotypes (alternate models) versus the rs2032582 (null model) to test the likelihood that any observed
after adjustment for multiple comparisons, only rs8068600 met the Bonferroni-adjusted threshold for statistical significance (P = 0.01). No significant or nominally significant (P < 0.05) associations with glucose or insulin traits, including fasting or 2 hour glucose and insulin from the OGTT, insulin AUC, glucose AUC, Matsuda Index or HOMA-IR, were found for any of the SNPs in the Amish after adjustment for age and sex. Similar results were found when BMI was included in the model. Data from the MAGIC GWAS identified a nominal association of rs4300700 with ln fasting insulin (P = 0.01). No additional associations were found in the MAGIC data for SOCS7 tagging SNPs for ln fasting insulin, fasting glucose, 2 hour glucose or HOMA-IR [22,23]. Review of the NHGRI GWAS database did not reveal any SOCS7 SNP associations with T2DM or other glucose homeostasis traits; however, SNPs included in the database are limited to those with P-values < 1.0 × 10⁻⁵.
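The Bonferroni adjustment mentioned above divides the family-wise significance level by the number of tests. The sketch below uses hypothetical SNP names and P values. If the conventional α = 0.05 was used, a threshold of 0.01 would correspond to five tests, but that is an inference rather than something the text states.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the Bonferroni threshold and which tests fall at or below it.

    p_values: dict mapping a test label (e.g. a SNP id) to its raw P value.
    alpha: family-wise error rate; 0.05 is an assumed conventional default.
    """
    threshold = alpha / len(p_values)
    passed = {name: p <= threshold for name, p in p_values.items()}
    return threshold, passed


# Hypothetical raw P values for five tagging SNPs.
raw = {"snp_a": 0.004, "snp_b": 0.03, "snp_c": 0.20, "snp_d": 0.049, "snp_e": 0.61}
threshold, passed = bonferroni_significant(raw)
```

Here `snp_b` and `snp_d` would be nominally significant at P < 0.05 but would not survive the correction.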
The individual htSNPs and the haplotypes they define in the present study were not associated with breast cancer, although it is possible that unidentified functional SNPs not in linkage disequilibrium with the selected htSNPs exist and could be associated with breast cancer risk. The efficiency of the haplotype tagging approach depends on the density of the markers used to choose the tagging SNPs. In this case, we used the markers from Bonnen and colleagues, which had an average density of about one SNP per 10 kb. This may not be sufficient to tag all common variants in ATM. For example, Letrero and colleagues demonstrated that carriers of the S49C SNP, a nonconservative SNP in the ATM coding region, were just as likely to be carriers of one of the common Bonnen and colleagues' haplotypes as noncarriers of the SNP, suggesting that it is possible for association studies to miss functional SNPs.
In summary, the findings of our case–control study provide evidence that the CTLA-4 rs16840252 C>T and rs231775 G>A SNPs are correlated with genetic susceptibility to CRC in an Eastern Chinese Han population. Additionally, this study is the first to highlight that the CTLA-4 rs16840252 C>T polymorphism increases susceptibility to CRC. Furthermore, the findings are consistent with the biological functions of tagging SNPs in the CTLA-4 gene and support the hypothesis that CTLA-4 tagging polymorphisms, which alter CTLA-4 mRNA and/or protein expression, may influence normal immune functions and lead to an increased risk of CRC.
As a negative control data set, we chose a GWAS with a relatively small sample size, so that the power to identify a real effect is very low. Data are from the Age-Related Eye Disease Study (AREDS), which was initially designed as a long-term, multicenter, prospective study to assess the clinical course of age-related macular degeneration (AMD) and age-related cataract. In addition to collecting natural history data, AREDS included a randomized clinical trial of high-dose vitamin and mineral supplements for AMD and a clinical trial of high-dose vitamin supplements for cataract [21-23]. Prior to study initiation, the protocol was approved by an independent data and safety monitoring committee and by the institutional review board for each clinical center. Written informed consent was obtained from all participants in accordance with the Declaration of Helsinki. AREDS participants were 55 to 80 years of age at enrollment and had to be free of any illness or condition that would make long-term follow-up or compliance with study medications unlikely or difficult. For the current analysis, a subset of the control group from the original AREDS study was included: 2000 Caucasian participants aged 60 and older who did not have AMD and were further screened to also exclude individuals with cataracts, retinitis pigmentosa, color blindness, other congenital eye problems, LASIK, artificial lenses, and other eye surgery. Mean spherical equivalent (MSE) of both eyes was calculated for study participants without either AMD or cataracts at the first study visit. A binary phenotype, hyperopia, was defined, with 858 cases having MSE ≥ +1 D and 602 controls having MSE < 0 D. Quality-controlled SNPs were imputed using MACH based on the HapMap phase 2 reference panel. To reduce the number of SNPs for analysis, LD pruning was performed using PLINK with a pairwise r² threshold of 0.99. A total of 908,293 common SNPs with complete genotypes remained for analysis. 
Further details about the genotype data have been published previously [25, 26].
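The binary hyperopia phenotype described above can be written as a short rule. This is a minimal sketch: the function name and the handling of participants whose MSE falls between 0 D and +1 D (returned as `None`, i.e. in neither group) are assumptions based on the stated cut-offs, not code from the study.

```python
def classify_hyperopia(se_right, se_left):
    """Assign case/control status from the spherical equivalents of both eyes.

    se_right, se_left: spherical equivalent of each eye, in diopters.
    Returns "case" for MSE >= +1 D, "control" for MSE < 0 D, and None for
    values in between, which the stated definition places in neither group.
    """
    mse = (se_right + se_left) / 2  # mean spherical equivalent of both eyes
    if mse >= 1.0:
        return "case"
    if mse < 0.0:
        return "control"
    return None


# Hypothetical participants.
labels = [
    classify_hyperopia(1.5, 0.5),    # MSE = +1.0 D
    classify_hyperopia(-0.5, -1.0),  # MSE = -0.75 D
    classify_hyperopia(0.25, 0.5),   # MSE = +0.375 D
]
```

Leaving a gap between the two cut-offs separates the groups more cleanly, at the cost of discarding participants in the intermediate range.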
6.1.3 Tagging for Low-Resource Languages
Learning part-of-speech taggers for severely low-resource languages (e.g., Malagasy) is very challenging. In addition to scarce (token-supervised) labeled resources, the tag dictionaries available for training taggers are tiny compared to other languages such as English. Garrette and Baldridge (2013) combine various supervised and semi-supervised learning algorithms into a common POS tagger training pipeline to address some of these challenges. They also report tagging accuracy improvements on low-resource languages when using the combined system over any single algorithm. Their system has four main parts, in order: (1) tag dictionary expansion using a label propagation algorithm, (2) weighted model minimization, (3) expectation maximization (EM) training of HMMs using auto-supervision, and (4) MaxEnt Markov Model (MEMM) training. The entire procedure results in a trained tagger model that can then be applied to tag any raw data. Step 2 in this procedure involves
We set out to develop a POS inventory for Tagger that would be intuitive and informative while at the same time simple to learn and apply, so as to maximize tagging consistency within and across narrations. Thus, we sought to design a corpus tag set that would capture standard parts of speech (noun, verb, etc.) as well as categories for token varieties seen mainly in secondary data. The 36 keywords were extracted and then tagged using secondary data to form the primary data, so that the different narrations in the text could be counted.
One of the most significant findings of our study was the multiple SNP–SNP interactions involving the ERCC1 rs2298881 and XPC rs1870134 polymorphisms, which were consistently identified by two different statistical approaches: multivariate logistic regression and MDR analyses. We found that the P value for the “ERCC1 rs2298881-XPC rs1870134-ERCC2 rs238417-ERCC5 rs873601” combination was more significant than that of the two-way interaction of “ERCC1 rs2298881 and XPC rs1870134”, but the four-way interaction was not verified by the multivariate logistic regression method, which might be due to the larger number of subgroups producing rare genotypes. Several studies have shown that the combined effect of multiple SNPs in several genes in one or more relevant DNA repair pathways could have a greater impact on pathological phenotypes than SNPs in single genes. We also found that the OR of the “ERCC1 rs2298881 and XPC rs1870134” interaction was higher than the single-locus OR (OR interaction: 2.11 vs. OR XPC: 1.67), which suggests that this two-way interaction is a superior combination model for the prediction of HCC risk. As the mechanism of these two SNPs is not yet clear, further functional studies are required to verify this finding.
Pharmacogenomics is the study of how genetic makeup determines the response to a therapeutic intervention. It has the potential to revolutionize the practice of medicine by individualisation of treatment through the use of novel diagnostic tools. This new science should reduce the trial-and-error approach to the choice of treatment and thereby limit the exposure of patients to drugs that are not effective or are toxic for them. Single nucleotide polymorphisms (SNPs) hold the key to defining an individual’s susceptibility to various illnesses and response to drugs. There is an ongoing process of identifying the common, biologically relevant SNPs, in particular those that are associated with the risk of disease. The identification and characterization of large numbers of these SNPs are necessary before we can begin to use them extensively as genetic tools. As SNP allele frequencies vary considerably across human ethnic groups and populations, the SNP consortium has opted to use an ethnically diverse panel to maximize the chances of SNP discovery. Currently, most studies are deliberately biased towards coding regions, and the data generated from them are therefore unlikely to reflect the overall distribution of SNPs throughout the genome. The SNP consortium protocol was designed to identify SNPs without any bias towards these coding regions. Most pharmacogenomic studies were carried out in heterogeneous clinical trial populations, using case-control or cohort association study designs employing either candidate gene or linkage disequilibrium (LD) mapping approaches. Concerns about the required patient sample sizes, the extent of LD, the number of SNPs needed in a map, the cost of genotyping SNPs, and the interpretation of results are some of the challenges that surround this field. While LD mapping is appealing in that it is an unbiased approach and allows a comprehensive genome-wide survey, the challenges and limitations are significant. 
An alternative such as the candidate gene approach does offer several advantages over LD mapping. Ultimately, as all human genes are discovered, the need for random SNP markers diminishes and gene-based SNP approaches will predominate. The challenges will then be to demonstrate convincing links between genetic variation and drug responses and to translate that information into useful pharmacogenomic tests.
The present study has some limitations. Importantly, some lipid-related outcomes, such as LDL-C and TAG concentrations, were not measured in the PRECISE study. The PRECISE study was also conducted in two populations, a UK cohort and a Danish cohort, which used different food frequency questionnaires, and this might have introduced measurement bias, even though the current results were adjusted for country in the regression analysis to avoid confounding. Another possible limitation is the use of a cross-sectional design (in both studies) to investigate genetic effects at a single point in time, whereas a longitudinal analysis design would have captured the genetic effects on lipid outcomes over a specific time period. The effect size of the minor allele of some of the studied SNPs was relatively small, and hence a large sample size is required to reliably detect any interaction between SNPs and dietary factors. Although this study was not adequately powered to detect such an interaction, it was sufficiently powered to detect the main effects (i.e., associations). Significant gene-diet interactions were identified; however, these did not reach the Bonferroni-corrected P value (P = 0.001) and hence need to be confirmed in larger cohorts. This study is strengthened by the
For sparse polygenic modeling approaches with Elastic Net and Lasso, genotypes were coded as 0, 1, or 2, after missing genotypes in VCF format were converted to reference alleles. Then, for each SNP site, coded genotypes of individuals were normalized to mean zero and variance one. We assumed a simple linear additive model for gene expression as in (1), and the R package glmnet (Friedman et al. 2010) was used to apply Lasso and Elastic Net for variable selection and joint estimation of effect sizes. The tuning parameter lambda was estimated by 10-fold cross-validation for each gene, as implemented in glmnet. As a result, per-normalized-genotype effect sizes for variants were estimated and used in our analysis. The assumed degree of polygenicity for gene expression is another parameter that may affect prediction performance with sparse polygenic modeling approaches. In Elastic Net, the mixing parameter α controls polygenicity, ranging from a small number of variants when α is close to one (the algorithm performs like Lasso), to all the variants when α is close to zero (the algorithm performs like Ridge), and can be set somewhere in between (0 < α < 1) (Zou and Hastie 2005). For Elastic Net, we use α = 0.5 in our data analyses, assuming that, for most genes, the number of cis-regulatory variants affecting gene expression is sparse, as previously suggested (Wheeler et al. 2016). For both Lasso and Elastic Net, we ranked SNPs by the absolute values of their effect sizes.
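The preprocessing and ranking steps described above can be sketched in plain Python; the model fitting itself is left to glmnet, as in the text. The helper names are hypothetical, biallelic sites are assumed, and the population variance (dividing by n) is an assumption, since the text does not say which variance estimator was used. The penalty function follows the form documented for glmnet, λ[(1 − α)/2 · ‖β‖₂² + α · ‖β‖₁], so that α = 1 gives the Lasso and α = 0 gives Ridge.

```python
import math


def code_genotype(gt):
    """Count alternate alleles in a VCF GT string; missing alleles count as reference."""
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for a in alleles if a not in ("0", "."))


def standardize(dosages):
    """Scale one SNP's dosages to mean zero and (population) variance one."""
    n = len(dosages)
    mean = sum(dosages) / n
    var = sum((d - mean) ** 2 for d in dosages) / n
    if var == 0:
        return [0.0] * n  # monomorphic site: nothing to scale
    sd = math.sqrt(var)
    return [(d - mean) / sd for d in dosages]


def elastic_net_penalty(beta, lam, alpha=0.5):
    """Elastic Net penalty in the glmnet parameterization."""
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return lam * ((1 - alpha) / 2 * l2_sq + alpha * l1)


def rank_snps(effects):
    """Order SNP ids by the absolute value of their estimated effect size."""
    return sorted(effects, key=lambda snp: abs(effects[snp]), reverse=True)


# Hypothetical genotypes at one site and hypothetical fitted effect sizes.
z = standardize([code_genotype(g) for g in ["0/0", "0/1", "1|1", "./."]])
ranking = rank_snps({"snp_a": 0.1, "snp_b": -0.5, "snp_c": 0.3})
```

Because every SNP is scaled to unit variance before fitting, the resulting effect sizes are on a common per-normalized-genotype scale, which is what makes ranking by absolute value meaningful.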
When the reads dataset contains variations (e.g. two alleles of the same individual, or two or more distinct individuals, or different isoforms of the same gene in RNA-Seq data, or different reads covering the same genome fragment in a sequencing process, etc.), the eBWT positional clustering described in the "eBWT positional clustering" subsection can be used to detect, directly from the raw reads (hence, without assembly and without the need for a reference genome), positions G[i] exhibiting possibly different values but followed by the same context: they will be in a cluster delimited by LCP minima and containing possibly different letters (corresponding to the read copies of the variants of G[i] in the reads set). We now describe how to use this theoretical framework to discover SNPs just by scanning the eBWT, LCP and GSA of the sets of reads, without aligning them or mapping them onto a reference genome.
We then performed fine mapping with SNPs within an area of ~400 kb surrounding the IL9 and D5S2017 microsatellites. In order to improve our chances of identifying significant association, we chose to genotype only promising SNPs from another pooling study in SZ on Illumina HumanHap550 arrays. That study used an extended sample of 574 Bulgarian trios and was finished just at the time when we were selecting SNPs to follow up (all 300 trios used in the microsatellite stage of the study were part of that larger sample). We selected 8 SNPs (4 for each region) that had shown nominal significance (p = 0.05) in pools hybridised on Illumina arrays. This considerably reduced the cost of our project, as the two regions contained nearly 200 SNPs on the Illumina arrays (and even more in the HapMap database). Individual genotyping of the SNPs in the full 615 parent-proband trios confirmed the pooling results for two SNPs: rs7715300 (p = 0.001) and rs6897690 (p = 0.032). SNPs rs17169180 and rs7443175 showed only trends toward association (p = 0.06). To show the validity of the pooling approach, we also report data on the 300 trios in our original pools in Table 2. However, our aim was to identify susceptibility loci for SZ, which is clearly best achieved by genotyping as large a sample as possible; therefore, the full data on the 615 trios constitute our primary analysis.
Figure 5 shows the laser-induced PL emission spectrum of Au NWs and multilayer NWs. The NWs were drop-cast on a silicon substrate for PL studies. The initial measurement shows a weak emission at 542 nm, which corresponds to the Si substrate (Figure 5i). Au NWs without tagging show a very weak and broad emission at 560 nm, which is close to the gold emission (Figure 5ii). Tagged Au NWs exhibited a shift of the emission maximum to 570 nm, indicating efficient tagging on the Au NWs (Figure 5iii). Figure 5iv illustrates the PL spectrum of multilayer NWs with an emission at 570 nm. The underlying reason for the low emission intensity is the reduction of Au surface area in Au/Ni/Au multilayers, which causes a smaller amount of tagging on the NW. Consequently, a larger amount of tagged DNA is adsorbed on the Au layer, but not by the Ni layer; thus, the fluorescence signal was quenched by the Ni segment (Figure 5iv). Shown in Figure 5v,vi are the bright and dark luminescence images of the Au NW and multilayer
Disease association studies often identify non-coding regions of the genome exhibiting significant association with disease. The exploration of those non-coding regions will benefit from the survey of gene expression variation and how it relates to genetic variation (eQTL mapping). For any disease-associated non-coding region (e.g. from a case-control study), it is possible to test whether the disease-associated SNPs and haplotypes are also associated with gene expression variation of nearby genes (as identified from eQTL studies; Figure 2). This enables conclusions to be drawn about the nature of the function of the causal variant. For instance, if the same haplotype that appears to increase the disease risk also appears to be associated with high expression of a nearby gene, it is possible to start making some connections between the biology of the affected gene and the disease itself. Moreover, one can hypothesise (and hopefully test) how levels of expression of a gene might affect disease risk. This simple connection between the two types of study could provide not only the identity of the gene that is linked to the disease, but also the consequence of genome variation that linked the gene with the disease. It may also provide some clues about other candidates (upstream transcriptional regulators, interacting proteins etc).
in the group of patients. When we analyzed these SNPs (-73C>T and 3’UTR 188C>T), our findings showed that most of the IVS4-73C allele carriers also have the OLR1 3’UTR 188T allele (89.5%). This finding suggests the possibility of an interaction between these 2 SNPs (-73C>T and 3’UTR 188C>T) and hypertension in the presence of CAD. Although the definite mechanism requires further research, we think that the intron 4 variations of the OLR1 gene may result in an increased risk of CAD by increasing the SBP levels.
Total genomic DNA of the samples was isolated from peripheral blood leukocytes according to a standard phenol-chloroform method. The extracted DNA was stored at -80°C. Genotypes of the four SNPs were determined using modified polymerase chain reaction and restriction fragment length polymorphism (PCR-RFLP). The SNPs were selected using two criteria: bioinformatics functional assessment and linkage disequilibrium (LD) structure. Computational analysis of GALNT2 SNPs (http://www.ncbi.nlm.nih.gov/SNP/buildhistory.cgi) ascribed potential functional characteristics to each variant allele. In addition, the four SNPs selected for genotyping were also chosen based on their frequency in the Beijing Han population from the Human Genome Project Database. The heterozygosity values were higher than 10% for the minor allele frequency. Transform bases were used for the genotyping. The sequences of the forward and backward primers used for GALNT2 rs1997947, rs2760537, rs4846913 and rs11122316 were 5’-TTGCTTGTTGGAGGTTGG-3’ and 5’-AGGAAGGGACTGTGCTGA-3’, 5’-CTGGCTGGAACCCCTCTTTA-3’ and 5’-ACACGCCCATCTCTCTTTCA-3’, 5’-CGCCACCTCCCATCACAGA-3’ and 5’-AAGCCTCACATCAACAGCAAAG-3’, and 5’-CACAGTGGTCCCGTAAGA-3’ and 5’-GGCATAAGCTCCAGAGGC-3’, respectively (Sangon, Shanghai, People’s Republic of China). Each reaction system of a total volume of 25 µl, comprised 100 ng (2 µl) of genomic DNA; 1.0 µl of each primer