Codon Usage Bias Covaries With Expression Breadth and the Rate of Synonymous Evolution in Humans, but This Is Not Evidence for Selection


Araxi O. Urrutia and Laurence D. Hurst

Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, United Kingdom

Manuscript received May 22, 2001

Accepted for publication August 15, 2001

ABSTRACT

In numerous species, from bacteria to Drosophila, evidence suggests that selection acts even on synonymous codon usage: codon bias is greater in more abundantly expressed genes, the rate of synonymous evolution is lower in genes with greater codon bias, and there is consistency between genes in the same species in which codons are preferred. In contrast, in mammals, while nonequal use of alternative codons is observed, the bias is attributed to the background variance in nucleotide concentrations, reflected in the similar nucleotide composition of flanking noncoding and exonic third sites. However, a systematic examination of the covariants of codon usage controlling for background nucleotide content has yet to be performed. Here we present a new method to measure codon bias that corrects for background nucleotide content and apply this to 2396 human genes. Nearly all (99%) exhibit a higher amount of codon bias than expected by chance. The patterns associated with selectively driven codon bias are weakly recovered: Broadly expressed genes have a higher level of bias than do tissue-specific genes, the bias is higher for genes with lower rates of synonymous substitutions, and certain codons are repeatedly preferred. However, while these patterns are suggestive, the first two patterns appear to be methodological artifacts. The last pattern reflects in part biases in usage of nucleotide pairs. We conclude that we find no evidence for selection on codon usage in humans.

Corresponding author: Laurence D. Hurst, Department of Biology and Biochemistry, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom. E-mail: [email protected]

Does selection act on mutations within exons that do not alter the amino acid sequence of the coded protein? Originally it was asserted that these synonymous mutations must be neutral (King and Jukes 1969). However, it is well known that unequal use of alternative codons is a common phenomenon in many unicellular species, as well as in Drosophila and Caenorhabditis, and that this may reflect the activity of selection (Marais et al. 2001). In several unicellular species (Gouy and Gautier 1982; Sharp et al. 1986; Stenico et al. 1994) and some invertebrates (Duret and Mouchiroud 1999) it has been observed that codon usage bias is related to expression patterns. In these species it has been found that the extent of codon usage bias correlates with levels of gene expression, where highly expressed genes tend to have a greater bias (Gouy and Gautier 1982; Sharp et al. 1986; Stenico et al. 1994; Duret and Mouchiroud 1999). Evidence has been found to relate this to tRNA availabilities (Sharp et al. 1995; Moriyama and Powell 1997; Kanaya et al. 1999). These observations suggest that in these species codon usage bias is explained partly by translation efficiency-related pressures. Additionally, the rate of synonymous evolution covaries with the level of codon bias (see, e.g., Powell and Moriyama 1997), although this might be an artifact of the method (Dunn et al. 2001). There are also consistent preferences toward certain codons within any given genome that may be interpreted as a result of selection as they appear to be in the opposite direction to mutation bias.

In mammalian genomes, codon usage bias is also observed (Eyre-Walker 1991; Ikemura and Wada 1991). However, as mammalian genomes show a great variation in nucleotide concentrations across the genome (i.e., isochores; Bernardi 1995; Bernardi et al. 1997), codon usage bias has been attributed to this background nucleotide bias. In Figure 1, for example, we plot GC content of third sites in exons for 369 genes on human chromosomes 21 and 22, against the GC content of the 50 kb of DNA flanking the gene in question. As can be seen the two strongly covary. A similar covariance is also seen between GC content at third sites and intronic GC (see Duret and Hurst 2001 and references therein). As codon usage bias strongly reflects the GC content at the silent sites (Figure 2a), it has been difficult to assess the input of other variables to bias in codon usage in mammalian genes related to protein synthesis such as expression patterns and rates of substitutions.

Nonetheless in one case there has been a claim that a highly expressed set of genes (histones) does show codon usage that deviates from background nucleotide content in flanking regions (Debry and Marzluff 1994). This suggests that selection could operate on codon usage in humans as well. If this were generally true we might expect that codon usage bias in mammals should influence expression patterns. It is unknown whether, more generally, codon usage bias, after correction for background nucleotide content, covaries with any expression parameters. By contrast, there is now some evidence supporting the notion that codon usage affects expression. When nonmammalian genes are to be expressed in mammalian cells, the replacing of rare codons in the mammalian genome for common ones appears to have dramatic effects on the level of gene expression. This method, known as “mammalianization” or “humanization,” has been used for increasing the expression of several genes (e.g., Levy et al. 1996; Zolotukhin et al. 1996; Wells et al. 1999; Zhou et al. 1999).

In this study we present the results from the analysis of codon usage bias in a sample of over 2000 human genes designed to ask whether codon usage bias in mammals can be explained by background nucleotide content alone or whether such parameters as expression breadth might also be important. To achieve this we developed a tool to measure codon usage bias, correcting nucleotide biases.

Figure 1. Correlation of GC content at fourfold degenerate sites with the GC content at 50-kb flanking region (GC4 = −0.32 + 1.96 GC50kb; r² = 50.4%; P < 0.001).

Figure 2. Correlation of codon usage bias using the MCB method and G+C content at third sites (GC3s), (A) assuming equiprobability and (B) correcting by nucleotide bias in a sample of 2396 human genes (MCB = 0.57 + 0.067 GC3s, r² = 0.5%, P = 0.001; MCBequiprobability = 2.61 − 8.94 GC3s + 9.56 × GC3s², r² = 85.3%, P < 0.001).

METHODS

Sequences from 2396 genes were included in the sample. Accession numbers were obtained from the Duret and Mouchiroud (2000) database and sequences retrieved by ACNUC (Gouy et al. 1984). All incomplete sequences (i.e., with internal gaps, nondefined nucleotides, or no start or stop codon) were discarded. Data of expression patterns were also obtained from the Duret and Mouchiroud (2000) database. Breadth of expression was calculated by counting the number of tissues where the gene is expressed. Columns referring to the same tissue but in different developmental stages were treated as a single tissue.

Random sequences were generated for each gene conserving the base content at first, second, and third sites and for gene length. Start and stop codons were removed from the randomizations. During randomizations all sequences that contained an internal stop codon were discarded. The procedure was repeated until a total of 1000 random sequences were obtained.

Effective number of codons tests: Effective number of codons (ENC) values were obtained for all sequences and values of original sequences were compared with the distribution of random sequences. As the ENC index has a cutoff at 61 and all sequences with greater values are adjusted to 61, the variance of the distribution was estimated on the basis of the median instead of the mean and by using only the lower half of the distribution.

Defining amino acids: In all tests, nondegenerative amino acids (methionine and tryptophan) were not taken into account. For the majority of the amino acids all their alternative codons have the same bases at the first and second site. The exceptions are serine, arginine, and leucine.

Each of these amino acids was treated in all tests as two independent amino acids, one of twofold degeneracy and one of fourfold degeneracy.

Background nucleotide bias model expectations: To obtain expected proportions for each alternative codon correcting for background nucleotide content, all codons were split into three groups according to the number of different nucleotides (two, three, and four) that could appear at the third site without changing the amino acid encoded. The group of degeneracy two was further divided into two groups, those where the choice is between T and C and those ending in A or G. The expected proportions of each alternative codon for a given amino acid were derived from all the other sites with the same degree of degeneracy or greater (i.e., excluding the amino acid being analyzed). For example, for amino acids with two degrees of degeneracy that could use the nucleotides thymine and cytosine at the third site, expectations were calculated on the basis of all the other amino acids of two degrees of degeneracy that had a choice of the same nucleotides for the third site; all the amino acids of four degrees of degeneracy were also included by calculating the relative frequencies of thymine and cytosine. For isoleucine, expectations were calculated from the relative frequencies of adenine, cytosine, and thymine in fourfold degenerate amino acids. Finally, for all the fourfold degenerate amino acids, only the distributions of nucleotides at the third sites of other fourfold degenerate sites were used for calculating expectations.

To minimize the uncertainty in the expected values, all cases with <30 sites to base the expectations on were eliminated. It should be noted that by using this model as null expectation, we are not taking into account the codon bias caused by dinucleotide biases.

Probability of observed bias: Proportions of observed and expected codon usage for each amino acid were represented in terms of the minimal number of binomial variables. For amino acids with two alternative codons, codon usage is represented in terms of one variable, the frequency of one codon over the number of times the amino acid is present. For three alternative codons A, B, and C, codon usage can be represented with two variables: (a) the proportion of codon A over the total number of times the amino acid is present and (b) the proportion of codon B from the sum of frequencies of codons B and C. For amino acids with four alternative codons A, B, C, and D, the proportions of codon usage are represented by three variables: (a) the proportion of codons A + B over the frequency of the amino acid, (b) the proportion of codon A over the sum of frequencies of codons A + C, and (c) the proportion of B over the sum of frequencies of codons B + D. Under this method, the distribution of codon usage of a gene, and the expected one, can be represented by 38 binomial variables. All sequences in which [...] from analysis (leaving n = 1629). To estimate the probability of the bias observed for each gene under the null hypothesis, the deviation from expectation for each variable was represented in terms of the numbers of standard deviations away from the mean (z). The standard deviation for a binomial variable can be defined as

$$\sigma = \sqrt{\frac{P(p) \cdot P(q)}{N}}.$$

The squared z values for each of the 38 variables were calculated and then summed to obtain the overall score (x),

$$z = \frac{O - E}{\sigma}$$

$$x = \sum z^2.$$

Assuming that the binomial variables are normally distributed, the probability of occurrence of the observed bias can then be calculated with a χ² distribution of 38 d.f. that has the following probability density function:

$$f_x(x; 38) = \frac{1}{2^{38/2} \cdot \Gamma(38/2)} \, x^{(38/2) - 1} \cdot e^{-x/2}.$$

Analysis of overall bias: All sequences were concatenated into a single large sequence. Observed and expected codon distributions were obtained as previously described for individual genes. The probability of the observed bias or greater from background nucleotide bias model expectations for each amino acid with n alternative codons was estimated using two standard methods of goodness of fit that approximate the χ² distribution with n − 1 d.f.:

A. χ² test:

$$\chi^2 = \sum \frac{(O - E)^2}{E}.$$

B. G-test:

$$G = 2 \cdot \sum \left( O \cdot \ln(O/E) \right).$$

Probabilities were estimated for the values obtained by comparing with cumulative χ² distributions.

Dinucleotide analysis: Dinucleotide proportions were obtained for sites first to second, second to third, and third to first for each gene. Expected proportions for a given dinucleotide ($E_d$) at sites s2 and s3 given the nucleotide content in the sequence at each site were calculated as

$$E_{d,s2,s3} = p(n2) \cdot p(n3),$$

where p(n2) and p(n3) are the proportions of the second and third nucleotides of the dinucleotide at sites s2 and s3. Dinucleotide bias (DnB) was estimated for each gene by

$$DnB = \sum (E_{d,s2,s3} - O_{d,s2,s3})^2,$$

where $O_{d,s2,s3}$ is the observed proportion of the dinucleotide.

RESULTS

Background nucleotide content alone does not explain the codon bias observed in mammalian genes: The extent of codon usage bias in human genes is dominantly dictated by the nucleotide content of the chromosomal region within which the gene finds itself (Bernardi 1995). Does this alone explain the degree of codon usage bias? We studied codon usage bias in a sample of 2396 human genes. As a first approach to investigate whether the observed codon bias can be explained by nucleotide biases at synonymous sites, 1000 random sequences were generated for each gene, conserving gene length and the base content per site. ENC values (Wright 1990) were obtained as a measure of codon usage bias for original and random sequences. We then compared the ENC for the real sequences to the distribution of random sequences. Over one-half of the genes were more deviated than any of the random sequences generated and 81% were significantly deviated with α = 0.05 (see METHODS).

However, because the ENC has a cutoff at 61 it has limited use for sequences with low codon usage bias. In addition, randomizing the first and second positions could potentially influence the distribution of ENC values in the random sequences. We therefore performed a second test to estimate the probability of the bias observed, by comparing the proportions of alternative codons of each sequence from the same sample of genes to expected proportions based on the nucleotide content at the third sites of each sequence. If the bias from equiprobability of a gene can be explained by the nucleotide content of that gene, then the null expectation would be that frequencies for each codon should match proportions of bases at the third site of all the amino acids with the same degree of degeneracy in that gene. To estimate the probability of obtaining the observed bias or greater under the null hypothesis for each gene, the observed and expected frequencies of codon usage for each amino acid were represented in terms of the minimal set of binomial variables, which is a method to approximate a multinomial distribution (see METHODS). The probability of obtaining the observed bias or greater under null expectation was estimated by summing the squared z values of distances of observed from expected for each binomial variable and comparing this with a standard χ² distribution (see METHODS). Significant deviations from expected (defined as P < 0.01) were found for 99% of the genes in the sample (data not shown). We conclude that “background” nucleotide content explains some, but by no means all, of the [...] known to affect the bias in codon usage (Hanai and Wada 1988; Karlin and Mrazek 1996) so it was expected that some proportion of the genes would be more biased than expected by the background nucleotide content.

Prior methods for assessing codon usage bias have limitations: There are several methods to measure codon usage bias; however, many of them require a known set of preferred codons estimated from highly expressed genes. ENC (Wright 1990) is a popular method that does not assume preferred codons but is not especially suitable for statistical analysis, as it does not allow testing null hypotheses for codon usage distribution other than equiprobability. Karlin and Mrazek (1996) proposed an alternative method that permits the introduction of values of an expected distribution. However, we found that this method is sensitive to biases on the use of amino acids of different degrees of degeneracy; i.e., the proportion of fourfold degenerate amino acids of a sequence correlates with the index of codon usage bias (r² = 14.1%; Figure 3a).

Maximum-likelihood codon bias is a new method for determining codon usage bias correcting for background nucleotide content: Given the limitations of the available methods, we chose to develop an alternative method that is easy to obtain and not sensitive to amino acid biases. We wanted a method that could measure the degree of nonrandomness in the use of alternative codons that is minimally affected by the presence of rare amino acids. In addition the method should allow testing of a variety of null hypotheses for codon distribution (i.e., not just equiprobability of occurrence); in this article we use this method to correct for background nucleotide content, but it can be used to correct for dinucleotide biases as well.

The use of alternative codons can be thought of as an ensemble of several random variables, one per amino acid, each with two to six possible different outcomes or codons (amino acids encoded by only one codon cannot have codon usage bias), and each outcome with an associated probability of appearance. Each specific distribution of outcomes is a vector and the codon bias for one amino acid is the distance of the observed vector from the expected one. However, to obtain an index of codon usage bias for a complete gene, the biases of individual amino acids have to be added in a sensible way. Different amino acids within a gene vary in two aspects: frequency within a sequence and their degree of degeneracy. If an amino acid is rare, then the observed distribution is more likely to be far from the expected just by chance; therefore the bias of a rare amino acid should be downscaled to have less impact on the overall index of codon bias. The different amino acids also vary in the number of alternative codons by which they are encoded and this should also be taken into account when biases from different amino acids are to be added.
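As a rough illustration of the weighting idea, the sketch below computes, for each amino acid, the distance Σ(O − E)²/E between observed and expected codon proportions, multiplies it by the logarithm of the number of times the amino acid occurs, and averages over the contributing amino acids. This follows the gene-level index as it is defined in this section, but it is only a minimal sketch: the function names, the dictionary-based input format, and the use of the natural logarithm are assumptions made for illustration, and this is not the authors' script.

```python
import math


def amino_acid_bias(observed, expected):
    """Distance of observed from expected codon proportions for one
    amino acid: the sum over its codons of (O - E)^2 / E."""
    return sum((observed[c] - expected[c]) ** 2 / expected[c] for c in expected)


def mcb(codon_counts, expected_props):
    """Gene-level bias index (hypothetical helper).

    codon_counts:   {amino_acid: {codon: count}} observed in one gene
    expected_props: {amino_acid: {codon: expected proportion}}, for example
                    derived from background third-site nucleotide content
    The log base (natural log here) is an assumption.
    """
    total = 0.0
    contributing = 0
    for aa, expected in expected_props.items():
        counts = codon_counts.get(aa, {})
        n = sum(counts.get(c, 0) for c in expected)
        if n == 0:
            continue  # amino acid absent from the gene: no contribution
        observed = {c: counts.get(c, 0) / n for c in expected}
        # Weight by log(N_A) so that rare amino acids count for less.
        total += amino_acid_bias(observed, expected) * math.log(n)
        contributing += 1
    if contributing == 0:
        raise ValueError("no amino acid contributes to the index")
    return total / contributing


if __name__ == "__main__":
    expected = {"K": {"AAA": 0.5, "AAG": 0.5}}
    print(mcb({"K": {"AAA": 5, "AAG": 5}}, expected))
    print(mcb({"K": {"AAA": 10, "AAG": 0}}, expected))
```

An amino acid seen only once gets weight log(1) = 0, so it still counts toward the number of contributing amino acids but adds nothing to the sum; whether that matches the original implementation is not stated in the text.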

[...] we developed a new method that is easy to calculate and allows us to test different models to explain codon usage bias. The bias of an individual amino acid $B_A$ with frequency $N_A$ of level of degeneracy $T$, having the observed $O_c$ and expected $E_c$ proportions for each alternative codon, is obtained by

$$B_{AT} = \sum_c \frac{(O_c - E_c)^2}{E_c}.$$

The bias for a gene $B_g$ can then be obtained by summing over all amino acids,

$$B_g = \frac{\sum B_{AT} \cdot \log N_A}{A},$$

where $A$ is the number of amino acids contributing to the index.

All genes where more than five amino acids were missing or no index could be estimated were removed from all comparisons (leaving n = 2387). We [...] (MCB), where the contribution to the index of the bias of each amino acid is weighted by an estimation of the likelihood of occurrence of bias on each amino acid, given its frequency and degree of degeneracy. Nevertheless, MCB is not a maximum-likelihood method in a strict sense. We believe this method would be useful for interspecies comparisons by allowing correction for differences in nucleotide composition. Importantly, MCB is minimally affected by biases in amino acid content of different degrees of degeneracy (r² = 0.4%; Figure 3b) and appears to effectively remove the influence of background GC content (compare Figure 2A and 2B).

Figure 3. Correlation of codon usage bias and the proportion of fourfold degenerate sites in the sequence, using (A) the Karlin and Mrazek (K&M) method (K&M = 0.136 + 0.588 fourfold degeneracy, r² = 14.1%; P < 0.001) and (B) the MCB method (MCB = 0.535 + 0.145 fourfold degeneracy, r² = 0.4%; P = 0.002).

It should be noted that with any procedure that estimates the distance from randomness, the size of the sample of events affects the variance that is expected; since the length of genes varies it is expected that this would influence the MCB values that are obtained. Therefore it is important to carefully study the relation of gene length with the variables that are being tested against codon usage values. A script for calculating MCB is available from the authors.

MCB covaries with breadth of expression and rates of synonymous substitution: Expected distributions for each codon family were derived from the base composition of all third sites with the same or greater level of degeneracy within a given sequence (according to METHODS) and MCB values were obtained for all genes in the data sample. If the residual biases in codon usage, once correcting for nucleotide content, are due to selection then we could expect (a) higher bias in more broadly expressed genes, (b) consistently preferred codons, or (c) an inverse correlation with levels of synonymous substitutions (Ks).

We assessed the effect of breadth of expression on codon usage bias in our sample (see METHODS). Breadth of expression is not a direct measure of expression rate and therefore we may not necessarily be analyzing the key parameter. Nonetheless, the breadth of expression is known to covary with the intensity of purifying selection acting on the nonsynonymous sites (Duret and Mouchiroud 2000), so may reasonably be taken as a covariate to the strength of purifying selection. To assess the interaction between breadth of expression and codon usage bias the sample was divided into three groups according to the number of tissues in which they are expressed: (1) genes expressed in up to 5 tissues (n = 1242), (2) genes expressed in more than 5 but not more than 10 tissues (n = 494), and (3) genes expressed in between 11 and 15 tissues (n = 272). The levels of codon bias in the three groups were significantly different from each other (5 vs. 10, P = 0.001; 10 vs. 15, P < 0.001; 5 vs. 15, P < 0.001; Kruskal-Wallis test). In all comparisons genes expressed in fewer tissues tend to have a lower MCB value. The correlation line between MCB

[...] (P < 0.001, r² = 3.1%; see Figure 4). This result suggests that genes with broader expression show a higher degree of codon usage bias. These results cannot be explained by compositional biases caused by transcriptional coupled mutational biases (e.g., higher rate of C→T mutations in “breathing DNA”) since the MCB method already takes into account gene-specific background nucleotide concentrations.

An inverse correlation between codon usage bias and rates of silent site substitutions has been observed in bacteria (Sharp and Li 1987), Drosophila (Powell and Moriyama 1997), and yeast (L. D. Hurst, unpublished data). If codon usage bias is due to selective pressures then it is expected that genes with higher codon usage bias would have lower rates of synonymous substitutions, although the effect may be weak. When rates of synonymous substitutions (compared to mouse and rat orthologs; Duret and Mouchiroud 2000), using Li’s (1993) method (Li93) and removing tandem substitutions (data as in Duret and Mouchiroud 1999), were plotted against MCB values, we observed an inverse correlation between rates of silent site substitution and MCB values (r² = 1.2%, P < 0.001; see Figure 5). A similar result is obtained (r² = 1.4%, P < 0.001) when comparing MCB values with rates of substitution at the fourfold degenerate sites, using the Tamura and Nei protocol after removing tandem substitutions. While this result is consistent with selection, it must be treated with caution owing to the fact that estimators of Ks may be biased when nucleotide content is biased (Dunn et al. 2001). Indeed, the correlation is not present (or at most only weakly suggested) if instead we apply the maximum-likelihood method of Goldman and Yang (P = [...]

The covariance with expression breadth and synonymous substitution rates is also found when correlating the Karlin and Mrazek method as a measure of codon bias (breadth of expression, P < 0.001, r² = 0.9%; synonymous substitution rate using Li93, P = 0.002, r² = 0.4%). Although both the MCB and Karlin and Mrazek methods significantly correlate with synonymous rates of substitutions and breadth of expression, these weak correlations should be interpreted cautiously.

Figure 4. The relationship between MCB and expression patterns. Average values of MCB and standard error bars are shown for the genes divided into five groups according to the number of tissues where they are expressed: 1–3, 4–6, 7–9, 10–12, and 13–15 (n = 884, 484, 293, 203, and 144, respectively).

Figure 5. The relationship between the rate of silent site evolution (Ks) and codon usage bias. Average value and standard error bars of Ks (human-rodent comparison using the Li method; Duret and Mouchiroud 2000) are shown for genes grouped by MCB values: 0.21–0.39, 0.40–0.57, 0.58–0.75, 0.76–0.93, and >0.94 (n = 153, 938, 874, 264, and 67, respectively).

Codon preferences: The above results are suggestive of a role for selection. If selection is to explain the above effects then we should also expect to see certain codons repeatedly being favored among genes. To investigate if the observed biases were favoring specific codons over others we performed an overall analysis of the whole sample by concatenating all genes into one large sequence. If the biases are due to factors specific for individual genes these should cancel each other out in the whole sample. The proportions for each alternative codon were obtained and compared with expectations from the nucleotide biases. Significant differences from expectations were observed for all of the amino acids that have two or more synonymous codons using the two tests of goodness of fit (P < 0.001, see METHODS).

A more conservative test is to investigate the consistency of the direction of the biases for individual genes. If there is no significant tendency favoring a particular set of codons then it is expected that a codon would be overrepresented one-half of the times that it deviates from expectation. The majority of the codons have significantly less heterogeneity than expected by chance

[...] genes (Table 1). The above results are consistent with selective pressures favoring specific codons. Were this the result of selection we can predict that tRNA levels should be more highly skewed for the amino acids showing bias than for those showing little bias as has been shown for other species (cf. Sharp et al. 1995; Moriyama and Powell 1997; Kanaya et al. 1999); however, Lander et al. (2001) did not find support for this prediction in human genes.

TABLE 1

Heterogeneity of direction of bias for each codon

First codon    Second codon position         Third codon
position       U      C      A      G        position

U              (+)    ++     (+)    (+)      U
C              −−     ++     (−)    −−
A              +      (+)    (+)    (−)
G              −−     ++     +      −−

U              (−)    +      (−)    (−)      C
C              −−−    =      (+)    (−)
A              + + (+) (−) (+)
G              −−− + + − (+)

U              −−     (+)    STOP   STOP     A
C              −−−    ++     −−−    =
A              −−−    ++     (−)    +
G              −−−    +      =      +

U              ++     −−−    STOP   TRP      G
C              +++    −−−    +++    +
A              MET    −−−    (+)    −
G              +++    −−−    =      −

The degree of heterogeneity in the direction of bias toward under- or overrepresentation for each codon is shown. To facilitate interpretation, values were substituted by symbols. =, no significant deviation from expectation; + and −, significant over/under representation compared to expectation (P < 0.01); (+) and (−), significant deviation (P < 0.01) up to 10 standard deviations away from expected distribution; +/−, ++/−−, +++/−−−, between 10–20, 20–30, and >30 standard deviations.

Expression breadth and synonymous substitution patterns are most probably due to gene length effects: The above results are suggestive of selection possibly playing a role in codon usage bias in humans. However, as stated earlier, genes of different length are likely to have different MCB values owing to the nature of the method. [...] gene, we still find a weak positive correlation of the order of magnitude reported for the real genes (P < 0.001, r² = 4.0%). Likewise we find in the mean MCB vs. Ks regression a weak negative correlation of about the order reported for the real genes [Li93, P = 0.001, r² = 0.6%; Tamura and Nei method (TN93), P = 0.002, r² = 0.5%]. Moreover, when we subtract the average bias of the random sequences from the bias of the real sequences, the correlation with breadth of expression disappears and with rates of substitution weakens considerably (expression, P = 0.348, r² = 0.01%; Ks Li93, P = 0.014, r² = 0.03%). Therefore, the most conservative interpretation of our data is that MCB does covary with expression breadth and Ks, but this is likely to be because of a tendency of larger genes having lower expression breadth and higher rates of silent site substitution. The data appear not to support the hypothesis that covariance is due to selection on codon usage per se. It should be noted that for 96% of the sequences the MCB value of the real data was higher than the mean value for the random sequences.

Dinucleotide effects and preferred codons: We are left trying to understand why there is such a large residual variance in codon usage after background nucleotide content is taken into account. One possibility is that the biases are caused by mutation biases or selection associated with particular dinucleotides. We performed a dinucleotide analysis on the whole sample (see METHODS) and also found that the sequences of the sample show significant biases in the appearance of dinucleotides from the expectations based on nucleotide content variations, consistent with previous observations (Karlin and Mrazek 1996). Dinucleotide bias explains part of the codon usage bias that we find in sequences in our sample when correcting for background nucleotide content. Most notably both TA and CG are avoided. It has been suggested that the dearth of CpG is probably related to the mutation of methylated CpG sites to TpG dinucleotides. By contrast, the dearth of TA may be owing to selection related to the susceptibility of UA in mRNA to RNase activity (Beutler et al. 1989; but see Duret and Galtier 2000). A similar bias is also found in noncoding regions (Karlin and Mrazek 1996), suggesting some RNA-independent mechanism; however, the bias is significantly more profound in coding regions.
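The dinucleotide bias measure used in this analysis (expected dinucleotide proportion as the product of the nucleotide proportions at the two sites, and DnB as the sum of squared differences between expected and observed proportions) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the `first_site` argument (0, 1, or 2 for first-to-second, second-to-third, and third-to-first site pairs), and the restriction of the sum to the 16 A/C/G/T dinucleotides are assumptions made here.

```python
from collections import Counter
from itertools import product


def dinucleotide_bias(cds, first_site=1):
    """Sum over the 16 dinucleotides of (E - O)^2 for one pair of sites.

    cds:        in-frame coding sequence (DNA string)
    first_site: 0-based codon position of the first nucleotide of the pair;
                1 gives second-to-third sites, 2 gives the third site of one
                codon followed by the first site of the next codon
    """
    seq = cds.upper()
    pairs = [(seq[k], seq[k + 1]) for k in range(first_site, len(seq) - 1, 3)]
    if not pairs:
        raise ValueError("sequence too short for the requested site pair")
    n = len(pairs)
    left = Counter(a for a, _ in pairs)   # nucleotide counts at the first site
    right = Counter(b for _, b in pairs)  # nucleotide counts at the second site
    observed = Counter(pairs)
    dnb = 0.0
    for a, b in product("ACGT", repeat=2):
        expected = (left[a] / n) * (right[b] / n)  # E = p(n2) * p(n3)
        dnb += (expected - observed[(a, b)] / n) ** 2
    return dnb


if __name__ == "__main__":
    print(dinucleotide_bias("ACAAGT", first_site=1))
```

In this sketch a value of 0 means that the observed dinucleotide proportions equal the product of the single-site proportions; any ambiguous characters (such as N) are counted in the total number of pairs but excluded from the sum.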

(8)

codons among the codons that specify glutamic acid (GAA, GAG).

For each gene, the relative frequency of each codon was calculated with respect to the other codons that encode the same amino acid. Those amino acids whose codons have the same nucleotide at the second site and that have the same type of degeneracy were grouped. Three such groups can be formed: (1) tyrosine, histidine, asparagine, and aspartic acid; (2) glutamine, lysine, and glutamic acid of twofold degeneracy; and (3) proline, threonine, and alanine of fourfold degeneracy. Within each group of amino acids, subgroups of those codons that have the same nucleotides at the first and the second sites were formed. Within each subgroup the relative frequencies of codons were compared against each other with Mann-Whitney tests. A total of 21 comparisons were made (for amino acids with twofold degeneracy, only one subgroup of codons was formed since the second subgroup is complementary). In this test, the CG content variations do not affect the comparisons because the relative frequencies for each amino acid are calculated with respect to the other codons that encode the same amino acid. Assuming that there are no diamino acid biases or sources of bias other than dinucleotide effects, we can expect the relative proportions of codons of different amino acids to be nearly identical (i.e., not significantly different), since they are expected to interact with similar proportions of nucleotides at the first position of the next codon. The largest difference was found in the comparison of the codons CAA and GAA (encoding glutamine and glutamic acid, respectively), with mean frequencies of 0.24 and 0.38, respectively. Of all 21 possible comparisons within subgroups, only the comparison of the codons TAT and AAT (encoding tyrosine and asparagine, respectively) was not significant at an α value of 0.05. All but 4 were significantly different at an α value of 0.01.

Some of the differences observed might be due to the existence of trinucleotide biases, diamino acid biases, or more elaborate mutation patterns. These results show, however, that dinucleotide effects cannot alone account for all of the observed distribution of codons in human genes.

DISCUSSION

We found that codon usage bias in mammalian genes is not completely explained by background nucleotide content variation. We therefore developed a method to study the influence of other variables on codon usage. Unlike other methods, ours appears to be insensitive to influence from rare amino acids. When we apply this method to a sample of human sequences, correcting expected distributions for background nucleotide content, we find that codon usage bias covaries with breadth of expression and the rate of synonymous substitution. This could suggest selective pressures related to translation efficiency, as has been conjectured (Debry and Marzluff 1994). However, the fact that these two correlations disappear when the effect of gene length is included suggests that gene length could be a more relevant variable and that the suggestive results are just artifacts. It is nonetheless interesting to find a weak tendency for short genes to have broader expression and lower synonymous substitution rates. These effects are, however, so weak that it may be improper to suppose that they have any biological meaning.

We also observe that there are codons that are consistently over- or underrepresented. This pattern can be explained in part by dinucleotide biases that also influence codon usage. However, we have also shown that not all the bias can be explained by such a simple mutational bias. While the cause of the remaining bias is uncertain, we fail to provide support for the hypothesis that codon usage is owing to selection.

Can we be sure that selection does not affect codon usage in mammals? While the above results would tend to suggest an absence of selection, as might be assumed to be the dominant position, several caveats must be noted. First, the dearth of TA dinucleotides may be a result of selection, as we discussed. However, Duret and Galtier (2000) argue that this is a methodological artifact. Second, our expression analysis looked at breadth of expression, not rate of expression. Nonetheless, the breadth of expression is known to covary with the intensity of purifying selection acting on the nonsynonymous sites (Duret and Mouchiroud 2000), so may reasonably be taken as a covariate to the strength of purifying selection.

Third, we need to understand how to reconcile the present findings with the result that there are dramatic increases in the amount of gene expression observed when foreign sequences, to be expressed in mammalian cells, are modified to avoid having rare codons. One possibility is that negative results are not reported and therefore we are left only with the cases in which the increase in expression could be due to the change in some synonymous sites rather than the effect of codon usage per se. On the other hand, this observation could indeed be indicative of selective pressures related to translation efficiency acting on codon usage distributions. However, because we are using a method that measures distance from random use, rather than the degree to which optimal codons are used, we might not have adequate resolution to detect the patterns. Using a method to measure codon bias based on the degree of use of optimal codons, but correcting for the background nucleotide bias, could allow recovery of evidence of weak selective pressures acting on coding sequences in mammals. In the meantime, we may conclude that codon
usage bias covaries with expression breadth and the rate of synonymous evolution in humans but that this is not evidence for selection.

We thank Laurent Duret, Brian Charlesworth, Jody Hey, and two anonymous referees for comments on an earlier version of the manuscript. This work was funded by a grant from Conacyt to A.O.U. and by the Biotechnology and Biological Sciences Research Council (BBSRC) and the Royal Society to L.D.H.

LITERATURE CITED

Bernardi, G., 1995 The human genome: organization and evolutionary history. Annu. Rev. Genet. 29: 445–476.
Bernardi, G., D. Mouchiroud and C. Gautier, 1997 Isochores and synonymous substitutions in mammalian genes, pp. 197–208 in DNA and Protein Sequence Analysis, edited by M. J. Bishop and C. J. Rawlings. IRL Press, Oxford.
Beutler, E., T. Gelbart, J. H. Han, J. A. Koziol and B. Beutler, 1989 Evolution of the genome and the genetic-code—selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc. Natl. Acad. Sci. USA 86: 192–196.
Debry, R. W., and W. F. Marzluff, 1994 Selection on silent sites in the rodent H3 histone gene family. Genetics 138: 191–202.
Dunn, K. A., J. P. Bielawski and Z. H. Yang, 2001 Substitution rates in Drosophila nuclear genes: implications for translational selection. Genetics 157: 295–305.
Duret, L., and N. Galtier, 2000 The covariation between TpA deficiency, CpG deficiency, and G+C content of human isochores is due to a mathematical artifact. Mol. Biol. Evol. 17: 1620–1625.
Duret, L., and L. D. Hurst, 2001 The elevated GC content at exonic third sites is not evidence against neutralist models of isochore evolution. Mol. Biol. Evol. 18: 757–762.
Duret, L., and D. Mouchiroud, 1999 Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, Arabidopsis. Proc. Natl. Acad. Sci. USA 96: 4482–4487.
Duret, L., and D. Mouchiroud, 2000 Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17: 68–74.
Eyre-Walker, A. C., 1991 An analysis of codon usage in mammals—selection or mutation bias. J. Mol. Evol. 33: 442–449.
Gouy, M., and C. Gautier, 1982 Codon usage in bacteria—correlation with gene expressivity. Nucleic Acids Res. 10: 7055–7074.
Gouy, M., F. Milleret, C. Mugnier, M. Jacobzone and C. Gautier, 1984 Acnuc—a nucleic-acid sequence data-base and analysis system. Nucleic Acids Res. 12: 121–127.
Hanai, R., and A. Wada, 1988 The effects of guanine and cytosine variation on dinucleotide frequency and amino-acid composition in the human genome. J. Mol. Evol. 27: 321–325.
Ikemura, T., and K. Wada, 1991 Evident diversity of codon usage patterns of human genes with respect to chromosome-banding patterns and chromosome numbers; relation between nucleotide-sequence data and cytogenetic data. Nucleic Acids Res. 19: 4333–4339.
Kanaya, S., Y. Yamada, Y. Kudo and T. Ikemura, 1999 Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238: 143–155.
Karlin, S., and J. Mrazek, 1996 What drives codon choices in human genes? J. Mol. Biol. 262: 459–472.
King, J. L., and T. H. Jukes, 1969 Non-Darwinian evolution. Science 164: 788–798.
Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody et al., 2001 Initial sequencing and analysis of the human genome. Nature 409: 860–921.
Levy, J. P., R. R. Muldoon, S. Zolotukhin and C. J. Link, 1996 Retroviral transfer and expression of a humanized, red-shifted green fluorescent protein gene into human tumor cells. Nat. Biotechnol. 14: 610–614.
Li, W.-H., 1993 Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36: 96–99.
Marais, G., D. Mouchiroud and L. Duret, 2001 Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proc. Natl. Acad. Sci. USA 98: 5688–5692.
Moriyama, E. N., and J. R. Powell, 1997 Codon usage bias and tRNA abundance in Drosophila. J. Mol. Evol. 45: 514–523.
Powell, J. R., and E. N. Moriyama, 1997 Evolution of codon usage bias in Drosophila. Proc. Natl. Acad. Sci. USA 94: 7784–7790.
Sharp, P. M., and W. H. Li, 1987 The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol. Biol. Evol. 4: 222–230.
Sharp, P. M., T. M. F. Tuohy and K. R. Mosurski, 1986 Codon usage in yeast—cluster-analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res. 14: 5125–5143.
Sharp, P. M., M. Averof, A. T. Lloyd, G. Matassi and J. F. Peden, 1995 DNA-sequence evolution—the sounds of silence. Philos. Trans. R. Soc. Lond. Ser. B 349: 241–247.
Stenico, M., A. T. Lloyd and P. M. Sharp, 1994 Codon usage in Caenorhabditis elegans: delineation of translational selection and mutational biases. Nucleic Acids Res. 22: 2437–2446.
Wells, K. D., J. A. Foster, K. Moore, V. G. Pursel and R. J. Wall, 1999 Codon optimization, genetic insulation, and an rtTA reporter improve performance of the tetracycline switch. Transgenic Res. 8: 371–381.
Wright, F., 1990 The effective number of codons used in a gene. Gene 87: 23–29.
Zhou, J., W. J. Liu, S. W. Peng, X. Y. Sun and I. Frazer, 1999 Papillomavirus capsid protein expression level depends on the match between codon usage and tRNA availability. J. Virol. 73: 4972–4982.
Zolotukhin, S., M. Potter, W. W. Hauswirth, J. Guy and N. Muzyczka, 1996 A "humanized" green fluorescent protein cDNA adapted for high-level expression in mammalian cells. J. Virol. 70: 4646–4654.