
gene expression level, codon usage, chromosomal location and intrinsic gene characteristics

4.3 Materials and Methods 1 Genome and microarray data

4.4.3 Chromosomal location

Although it is tempting to arbitrarily select subsets of genes based on expression level, CAI range, or GC content, we instead segregated the data according to genome location, by strand and relative to the origin and terminus of replication (Figure 2). The correlations between gene expression level and all other parameters were then investigated again, using the methodologies presented above. ANOVA was used to investigate whether the distributions of the parameters differed between the four chromosomal locations. Results are summarized in Table 3. Analyses revealed that genes on the leading strand were more highly expressed than those on the lagging strand. Similarly, genes located from the origin to the terminus were more highly expressed than those located from the terminus to the origin. Genes were segregated by strand (Leading or Lagging) and relative to the terminus (from the Origin to the Terminus, OT, or from the Terminus to the Origin, TO) into four groups, namely LeTO, LeOT, LaTO and LaOT. A comparison of gene expression across these four groups revealed that LeOT genes were the most highly expressed, followed by LaTO genes, while LeTO and LaOT genes showed the lowest expression (Figure 4).
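The four-group comparison described above can be illustrated with a one-way ANOVA. The following is a minimal sketch, not the analysis pipeline used in this study: the function name `one_way_anova_f` and the LSM values are hypothetical, and only the F statistic and its degrees of freedom are computed (a p-value would require the F distribution, for example from a statistics package).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` maps a group label to a list of observations.
    """
    k = len(groups)
    n_total = sum(len(values) for values in groups.values())
    grand_mean = sum(sum(values) for values in groups.values()) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - mean(values)) ** 2
        for values in groups.values()
        for x in values
    )

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical expression values (LSM) per chromosomal location,
# made up for illustration only.
lsm_by_location = {
    "LeOT": [1.2, 0.9, 1.5],
    "LaTO": [0.6, 0.8, 0.7],
    "LeTO": [0.1, 0.3, 0.2],
    "LaOT": [0.2, 0.1, 0.3],
}
f_stat, df_b, df_w = one_way_anova_f(lsm_by_location)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value indicates that the between-location variation is large relative to the within-location variation, which is the pattern summarized in Table 3.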

In contrast, GC3 showed the opposite distribution across the four locations, with LeOT and LaTO showing a lower GC3 content than LeTO and LaOT (Table 3, Figure 4). CAI10 and CAI50 both showed differences between strands, with higher values on the leading strand. These results agree with the distributions observed in Figure 1, which show differences between CAI10/CAI50 and CAIall. The strand differences also explain the dual distribution observed for CAIall in Figure 1.

Additionally, correlation analyses were carried out after the data were split into the four chromosomal locations (Table 4). Differences between locations were observed, consistent with the ANOVA results (Table 3). Since LeOT and LaTO genes showed the highest expression levels (Table 3, Figure 4), particular attention was given to their correlations with other parameters. For LeOT, CAI showed the highest correlation, followed by GCall, GC1, GC2, size and RBS. Interestingly, CAI10/CAI50 showed the highest correlation coefficients, namely 0.52 and 0.51. For LaTO, GCall, GC1, CAIall, GC2, CAI10, size and CAI50 showed the highest correlations. Neither GC3 nor start showed any significant correlation, which is similar to the global results (Table 2).
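Splitting the data by location before correlating, as done for Table 4, can be sketched as follows. This is an illustrative outline under stated assumptions: the record layout `(location, cai, lsm)`, the function names and the values are all hypothetical, and Pearson's r is used only as one example coefficient.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_location(records):
    """Group (location, parameter, expression) records and correlate per group."""
    grouped = defaultdict(lambda: ([], []))
    for location, parameter, expression in records:
        grouped[location][0].append(parameter)
        grouped[location][1].append(expression)
    return {loc: pearson_r(xs, ys) for loc, (xs, ys) in grouped.items()}


# Hypothetical (location, CAI, LSM) records, made up for illustration only.
records = [
    ("LeOT", 0.2, 0.1), ("LeOT", 0.4, 0.3), ("LeOT", 0.6, 0.5),
    ("LaTO", 0.2, 0.5), ("LaTO", 0.4, 0.3), ("LaTO", 0.6, 0.1),
]
for location, r in correlation_by_location(records).items():
    print(location, round(r, 2))
```

In this toy data set the pooled correlation is zero, while each location shows a perfect correlation of opposite sign, which is why a global coefficient can hide location-specific relationships.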

Visualization of the correlation analyses for CAI10, GCall and size is shown in Figure 5. For select parameters, the gene distribution for each location is shown in Figure 6. CAI10 showed the strongest correlation with gene expression level for LeOT (Table 4). Most of the high CAI10 values correspond to genes with high expression levels (Figure 5). Most of the highly expressed genes are located on LeOT, and some on LaTO (Figures 5 and 6), while only a few genes located on LeTO and LaOT show expression above LSM = 1.0. In contrast, for the lagging strand, only a few genes show CAI10 above 0.55 (Figures 5 and 6), none of which are highly expressed.

When comparing the relationship between CAI10 and LSM globally (Figure 3) vs. by location (Figure 5), there is a location-specific difference. In contrast, the relationship between LSM and gene size, or between LSM and GCall, does not change when the data are segregated by location (Figures 3 and 5). This is consistent with a strand discrepancy in codon adaptation (CAI10), as seen in Figure 6.

Both GCall and GC3 contents are consistent regardless of location (Figure 6), with a value close to that of the genome in the case of GCall. For GC3, the value has to be lower than that of the genome: constraints on the first two codon positions result in a higher GC content at those positions, and since L. acidophilus is a low GC organism, the GC content at the third codon position has to be lower than that of the genome. Gene size distribution is seemingly equal throughout the chromosome, although many more genes are present on LeOT and LaTO. There is a strong difference between the two strands in codon adaptation (Table 3, Figures 4 and 6), which was observed regardless of the training set used. The codon adaptation index is always higher for the leading strand, regardless of direction relative to the origin or terminus of replication (Figure 6).

4.5 Discussion

The analysis of the relationships between gene expression levels, codon usage, chromosomal location and intrinsic gene properties in L. acidophilus revealed strong correlations between GC content, codon usage, chromosomal location and gene expression levels. However, there was no correlation between GC3 and gene expression level. Globally, chromosomal architecture seemed to influence gene expression strongly, with both a strand bias and a gene location and orientation effect relative to the origin and terminus of replication.

Globally, a relatively small number of genes showed high expression levels. Predicted highly expressed genes usually encompass ribosomal proteins (RP), transcription and translation processing factors (TF), chaperone proteins (CH), recombination and repair proteins, outer membrane proteins and energy metabolism enzymes (Karlin and Mrazek, 2000; Karlin et al., 2004). Throughout a variety of prokaryotes, those genes display a high codon bias (Karlin and Mrazek, 2000). Our results indicate that the 20 most highly expressed genes included genes involved in glycolysis, transcription, ATP synthesis and membrane construction, as well as genes encoding ribosomal proteins, regulators and a peptidase. Genes encoding glycolytic enzymes and translation factors have also been shown to be highly expressed in S. pneumoniae (Martin-Galiano et al., 2004). Although this is consistent with the RP and TF families of genes, the genes most highly expressed in L. acidophilus did not include CH genes.

Although most studies analyzing codon bias have relied on multivariate statistical analyses such as correspondence analysis (Perriere and Thioulouse, 2002), the major trends identified in codon usage account for a low proportion of the variation (Grocock and Sharp, 2002). In a thorough study of Pseudomonas aeruginosa, the first axis accounted for 17% of the variation, and the first four axes combined accounted for a total of 30% of the variation (Grocock and Sharp, 2002). In another study, the combination of the first three axes accounted for less than 23% of the variation in codon usage (McInerney, 1994).

Since our objective was to investigate correlation between gene features and expression levels, rather than describe the variation within CAI distributions, we used correlation analysis rather than correspondence analysis. Although no assumption can be made as to the linearity of the relationships between the parameters being tested, a linear regression was attempted nonetheless. Several correlation analyses were carried out, including both parametric and non-parametric analyses, namely Pearson, Spearman and Kendall, since no assumptions were made a priori regarding data distribution and linearity of the relationships. Spearman correlation analysis has previously been used in codon analysis studies, and Spearman ranking was considered a more appropriate statistic than the Pearson correlation coefficient (Coghlan and Wolfe, 2000). Similarly, Spearman correlation has been used previously to investigate the correlation between the effective number of codons in a gene (Nc) and CAI (Fuglsang, 2003). Additionally, Kendall correlation has been used to analyze the correlation between gene expression level and codon usage (dosReis et al., 2003). A combination of both Pearson and Spearman correlation analyses has also been used to investigate correlations between CAI and other parameters (Jansen et al., 2003). Pearson correlation coefficients have also been used to analyze the correlation between codon bias and microarray expression data (Fraser et al., 2004). Our strategy allows comparison of results obtained from both parametric (Pearson) and non-parametric (Kendall, Spearman) correlation tests. It was previously suggested that non-parametric tests are more appropriate for such analyses, since they are robust against non-linearity and non-normality (dosReis et al., 2003).
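The difference between the parametric and non-parametric coefficients discussed above can be made concrete with a small sketch. The implementations below are plain-Python illustrations written for this example (function names are hypothetical, and the Kendall variant shown is tau-b, chosen here as an assumption because it handles ties); they are not the software used in this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: measures linear association."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


def kendall_tau_b(xs, ys):
    """Kendall tau-b: based on concordant and discordant pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + tied_x_only) * (untied + tied_y_only)
    )


# A monotonic but non-linear relationship (y = x^3), for illustration only.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(pearson_r(x, y), spearman_rho(x, y), kendall_tau_b(x, y))
```

For a monotonic but non-linear relationship such as this one, the two rank-based coefficients reach their maximum while Pearson's r stays below it, which is the robustness to non-linearity referred to above.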

Prior studies using correspondence analysis to investigate the distribution of the CAI statistic have identified a major and a secondary trend, with the first axis appearing to differentiate genes according to their expression level (Lloyd and Sharp, 1992; Kliman et al., 2003). Although our results indicate a correlation between CAI and gene expression level, globally, our strongest correlation was established between gene expression level and GC content. Additionally, our investigation of the correlation between CAI and other statistics indicated that it is not correlated with GC3.

CAI has previously been shown to be the best codon usage bias indicator (Coghlan and Wolfe, 2000). CAI was also shown to be highly correlated with mRNA expression levels in S. cerevisiae (Coghlan and Wolfe, 2000), and a correlation between CAI and mRNA levels has been reported elsewhere (dosReis et al., 2003). In our study, we found a strong correlation between gene expression level and CAI10/CAI50. Although it was not as strong as that between gene expression level and GCall on a global scale, it was the strongest correlation for genes positioned on LeOT. Our results show that gene GC content is the parameter most highly correlated with gene expression, which differs from results shown previously (dosReis et al., 2003), but is similar to findings from rodents (Konu and Li, 2002).

Our results, indicating a positive correlation between gene expression level and gene size, differ from previous studies reporting a negative correlation between mRNA concentration and protein length (Coghlan and Wolfe, 2000) and between gene length and codon usage (Kliman et al., 2003). Perhaps this discrepancy reflects the differences between the organisms used in the studies, eukaryotic S. cerevisiae and prokaryotic L. acidophilus. The relationship between CAI and mRNA levels has been shown previously in S. cerevisiae (Coghlan and Wolfe, 2000), and in E. coli (dosReis et al., 2003). Also, a non-parametric regression on mRNA expression levels in E. coli has shown that gene size, followed by GC and then CAI, are the best predictors of mRNA concentration (dosReis et al., 2003).

Although several studies have used the CAI as an indicator of gene expression, the positive correlation found between codon bias and level of gene expression is variable. Historically, initial CAI studies claimed that the strong correlation between CAI and levels of gene expression allows the use of CAI as a predictor of gene expression (Sharp and Li, 1987). In contrast, we believe the correlation between CAI and gene expression level is indicative, rather than predictive, of the level at which a gene is expressed.
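For reference, the CAI of Sharp and Li (1987) is the geometric mean of the relative adaptiveness values of the codons in a gene, where each value is the frequency of a codon in a reference (training) set divided by the frequency of the most used synonymous codon. The sketch below illustrates that definition under stated assumptions: the function names are hypothetical, the standard genetic code is used, and a pseudocount of 0.5 for codons absent from the reference set is an assumption of this example rather than the procedure used to derive CAI10, CAI50 or CAIall.

```python
import math
from collections import Counter

# Standard genetic code, laid out in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(cds):
    """Split a coding sequence into complete codons (trailing bases dropped)."""
    seq = cds.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_cds, pseudocount=0.5):
    """Compute w for each codon from a reference set of coding sequences.

    Stop codons and single-codon amino acids (Met, Trp) are excluded.
    The pseudocount for unobserved codons is an assumption of this sketch.
    """
    counts = Counter(
        codon
        for cds in reference_cds
        for codon in split_codons(cds)
        if codon in CODON_TO_AA
    )
    families = {}
    for codon, aa in CODON_TO_AA.items():
        families.setdefault(aa, []).append(codon)

    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue
        adjusted = {codon: counts[codon] or pseudocount for codon in family}
        most_used = max(adjusted.values())
        for codon in family:
            weights[codon] = adjusted[codon] / most_used
    return weights


def cai(cds, weights):
    """Geometric mean of w over the scored codons of a gene."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("sequence contains no codons that can be scored")
    return math.exp(sum(logs) / len(logs))


# Toy reference set: lysine is encoded three times by AAA and once by AAG.
weights = relative_adaptiveness(["AAAAAAAAAAAG"])
print(cai("ATGAAAAAG", weights))
```

Because the weights depend entirely on the reference set, a gene's CAI measures similarity to that set's codon usage, which is consistent with treating it as an indicator rather than a direct predictor of expression.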

The a priori assumption that genes with codon bias close to that of highly expressed genes should be highly expressed is not consistent with the fact that some genes with very high CAI values are not highly expressed (Figure 5). Analysis of CAI and microarray gene expression levels in Streptococcus pneumoniae showed that CAI is not always predictive of gene expression (Martin-Galiano et al., 2004). Specifically, genes with high CAI are not always highly expressed, and genes with low CAI can be highly expressed (Martin-Galiano et al., 2004), which is also shown in our current findings (Figure 5). Interestingly, S. pneumoniae and L. acidophilus are both low GC Gram-positive lactic acid bacteria.

A small correlation (r² = 0.09) has been shown between CAI and microarray fluorescence in S. pneumoniae (Martin-Galiano et al., 2004). A similar correlation level (r² = 0.07) was shown in L. acidophilus. In contrast, a higher correlation (r² = 0.18) was found between GC and gene expression level in L. acidophilus.

The genomic DNA GC content varies widely between species, as a result of mutation pressure (Muto and Osawa, 1987). GC variation has been shown to be the most important parameter differentiating codon usage bias between organisms, in archaea and eubacteria (Chen et al., 2004). The relationship between codon usage bias and GC composition has been characterized across unicellular genomes (Wan et al., 2004). Specifically, GC3 was shown to be the primary factor within GC content to correlate highly with codon usage bias (Wan et al., 2004). Further, GC3 was hypothesized to be the key factor driving synonymous codon usage, independently of species (Zhang and Chou, 1994; Wan et al., 2003). Although those results were inferred across 70 bacterial species and 16 archaeal genomes, our results show this is not the case for L. acidophilus. We found no correlation between GC3 and CAI. The non-linearity of the relationship between codon usage bias measures and GC3 has been shown previously in a variety of bacteria and archaea (Wan et al., 2004).

The L. acidophilus NCFM genome is 34.7% GC, so it is not surprising that codon usage is related to base composition bias. The observed differences in GC content at the three codon positions reflect the overall GC content. Codon degeneracy is located primarily at the third position of the codon, since there are strict constraints on the first and second position of each codon (Zhang and Chou, 1994). As a result, the third codon position is representative of the GC content of an organism, and reflects differences between species (Muto and Osawa, 1987; Carbone et al., 2003). GC3 has previously been shown to vary between species (Zhang and Chou, 1994), explaining the species impact on the correlation between GC3 and CAI (Lloyd et al., 1992). Also, it was previously reported that CAI can most highly correlate with GC skew (Carbone et al., 2003), and that gene expression levels are correlated with GC3 (Kliman et al., 2003). The position-specific GC content within codons has been investigated previously (Muto and Osawa, 1987; Chen and Zhang, 2003) across species with varying GC content, indicating that low GC content bacteria have a higher GC content at the first codon position and a lower GC content at the third codon position than their overall genomic content, while that of the second codon position is close to their genomic content (Chen and Zhang, 2003). This is consistent with our findings in L. acidophilus (Figure 1). Early work showed that there is a codon position bias in GC content, which is correlated with genome GC content (Muto and Osawa, 1987). Specifically, the correlation between GC3 and genome GC content explains the discrepancies observed at the third codon position between species with varying GC content (Muto and Osawa, 1987).
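The position-specific GC statistics discussed above (GC1, GC2, GC3 and GCall) can be computed directly from a coding sequence. The following is a minimal sketch: the function name `gc_by_codon_position` is hypothetical, the input is assumed to be an in-frame coding sequence, and the example sequence is made up for illustration.

```python
def gc_by_codon_position(cds):
    """Return the GC fraction at each codon position and over the whole gene.

    `cds` is assumed to be an in-frame coding sequence whose length is a
    multiple of three.
    """
    seq = cds.upper()
    if not seq or len(seq) % 3 != 0:
        raise ValueError("expected a non-empty, in-frame coding sequence")

    def gc_fraction(bases):
        return sum(base in "GC" for base in bases) / len(bases)

    return {
        "GC1": gc_fraction(seq[0::3]),  # first position of every codon
        "GC2": gc_fraction(seq[1::3]),  # second position of every codon
        "GC3": gc_fraction(seq[2::3]),  # third (mostly degenerate) position
        "GCall": gc_fraction(seq),
    }


# Made-up sequence with codons GCA, GGT and TTA.
print(gc_by_codon_position("GCAGGTTTA"))
```

In this toy sequence the third position is entirely A or T while the first two positions are GC-rich, which mirrors in exaggerated form the pattern described for low GC bacteria.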

A previous study investigating codon bias in P. aeruginosa (Grocock et al., 2002) reported that for species with highly biased GC base composition, the CAI methodology may not be appropriate. While the study in P. aeruginosa (67% GC) illustrated this point for high GC organisms, our analyses in L. acidophilus (35% GC) might validate this theory for low GC organisms. It was recently suggested that lactic acid bacteria are a desirable group of organisms for analysis of codon usage (Fuglsang, 2003), but our results suggest that caution should be applied when using the CAI methodology.

Perhaps the high correlation between GC content and gene expression level is due to the genomic composition of L. acidophilus. The genomic GC content in prokaryotes ranges between approximately 25% and 75% (Muto and Osawa, 1987), which allows great codon usage flexibility and variability. Since L. acidophilus is a low GC organism (Altermann et al., 2004), perhaps the strong correlation between GC content and gene expression level is due to the importance of high GC content genes. Indeed, for a low GC organism such as L. acidophilus, genes with a high GC content differ widely from its genomic “fingerprint”, since GC content is a main component of genomic signature (Sandberg et al., 2003). Therefore, retaining genes that vary from its overall genomic signature may indicate that they are biologically important, and consequently highly expressed.

A correlation between RBS and gene expression level was found, although it was minor compared to that of GCall. Nonetheless, a positive correlation between a strong RBS and gene expression level is intuitive, and consistent with previous findings (Ma et al., 2002).

We observed a discrepancy between the genome signature (low GC) and highly expressed genes (high GC), perhaps indicating that the codon usage for highly expressed genes is different from that of the genome. Specifically, the genome-wide codon usage is characterized by a high AT content at the third codon position, which is consistent with a low GC organism. In contrast, genes with high codon bias showed a specific preference for high GC content at the third codon position for select amino acids (Table 1). However, GC3 was not a good indicator of gene expression (Tables 2 and 4). Perhaps this indicates that for low GC organisms, overall gene GC content is more representative of bias than codon usage.

Differences in base composition between strands have been shown previously (Grocock and Sharp, 2002; Lobry and Sueoka, 2002). The leading-lagging strand bias in codon usage has been shown in Borrelia burgdorferi (McInerney, 1998; Carbone et al., 2003). Additionally, replication selection is seemingly responsible for the presence of the majority of the genes on the leading strand, whereas transcription selection results in higher expression of genes present on the leading strand (McInerney, 1998).

Interestingly, location per se did not correlate with gene expression level globally (Table 2). This means that the position of the start of any gene on the chromosome does not correlate with gene expression level. However, it was shown previously that location is indeed an important factor in gene expression. We therefore further investigated the effect of both strand location, and orientation relative to the terminus on gene expression level.

The importance of chromosomal location has been illustrated before in P. aeruginosa (Grocock and Sharp, 2002), Borrelia burgdorferi (McInerney, 1998), and Treponema pallidum (Lafay et al., 1999). Specifically, differences between strands have been illustrated for codon usage (McInerney, 1998). Strand location was shown to be a major cause of variation in codon usage (McInerney, 1998). However, the correlation between gene location and expression level was estimated to be weak in P. aeruginosa, where gene location was only the tertiary trend in correspondence analysis, accounting for only 4.4% of the variation (Grocock and Sharp, 2002). Strand location accounted for 8.6% of the variation in codon usage, as the secondary source of variation (Lafay et al., 1999). In contrast, in B. burgdorferi, strand location is the primary parameter involved in codon usage, accounting for 13.7% of the variation (McInerney, 1998). Within species, inter-strand differences appear on the primary axis of correspondence analysis (Lafay et al., 1999). Nevertheless, it was shown that the position of a gene relative to the strand has an influence on codon usage (Grocock and Sharp, 2002). In addition to strand location, the orientation of a gene relative to the direction of DNA replication is also important in codon usage patterns (McInerney, 1998). However, the impact of both strand and orientation on gene expression had not been illustrated simultaneously prior to our study.

Chromosomal architecture has a major effect on gene expression, both through strand bias and through gene position and orientation relative to the terminus of replication. The impact of chromosomal architecture is important for many of the parameters measured in our study, showing a significant bias for genes converging towards the terminus. Although it was previously shown that the leading strand in low GC Gram-positives carries more than 75% of the genes (Karlin et al., 2004), this is not the case in L. acidophilus, where only 55% of the genes are on the leading strand. Nevertheless, very significant differences in codon usage, GC content and other parameters were observed between the two strands.

Interestingly, while codon usage, GC content and gene size all showed a global correlation with gene expression levels, CAI was the parameter which showed the most variability between chromosomal locations, relative to the strand bias, and the position and orientation relative to the terminus. Specifically, the correlation between CAI10 and gene expression level is higher for LeOT genes (Table 4). In contrast, the correlation between gene expression level and GCall or GC3 was consistent regardless of location. For CAI particularly, genes on the leading strand located between the origin and the terminus of replication show the most codon usage bias. Specifically, genes that show the most codon bias are located in this region, and are likely to be highly expressed (Figure 5).

Globally, it seems chromosomal architecture is a primary factor controlling gene expression in L. acidophilus. Perhaps the combination of replication efficiency and transcription efficiency underlies the impact of chromosomal location on gene expressivity. Indeed, replication is thought to be more efficient while co-directional with
