The use of simulated annealing in chromosome reconstruction experiments based on binary scoring.


A. Jamie Cuticchia,¹ Jonathan Arnold and William E. Timberlake

Genetics Department, The University of Georgia, Athens, Georgia 30602

Manuscript received September 16, 1991
Accepted for publication June 17, 1992

ABSTRACT

We present a method of combinatorial optimization, simulated annealing, to order clones in a library with respect to their position along a chromosome. This ordering method relies on scoring each clone for the presence or absence of specific target sequences, thereby assigning a digital signature to each clone. Specifically, we consider the hybridization of oligonucleotide probes to a clone to constitute the signature. In that the degree of clonal overlap is reflected in the similarity of their signatures, it is possible to construct maps based on the minimization of the differences in signatures across a reconstructed chromosome. Our simulations show that with as few as 30 probes and a clonal density of 4.5 genome equivalents, it is possible to assemble a small eukaryotic chromosome into 33 contiguous blocks of clones (contigs). With higher clonal densities and more probes, this number can be reduced to fewer than 5 contigs per chromosome.

A variety of methods have previously been developed to produce nearly complete maps of the genomes of complex organisms (SMITH and KOLODNER 1988; SULSTON et al. 1988; ALBERTSON 1985; KOHARA, AKIYAMA and ISONO 1987) as well as important regions within the human genome (GREEN and OLSON 1990; BATES et al. 1991). One approach, chromosome walking, involves the production of probes from the ends of a previously cloned insert. These probes are then used to “walk” to the next insert by hybridization of the cloned end of a probe to the other probes, and this procedure is repeated until all clones are ordered. A modification of this approach, chromosome jumping, attempts to speed up this procedure by using clones containing widely spaced nucleotide sequences (COLLINS et al. 1987; POUSTKA and LEHRACH 1986; POUSTKA et al. 1987). However, both of these processes are tedious, subject to interference by repeated DNA sequences, and confounded when regions of the genome are absent from the clonal library.

Another method, DNA fingerprinting, has been successfully used to map the genomes of both Saccharomyces cerevisiae and Caenorhabditis elegans (OLSON et al. 1986; COULSON et al. 1986). A related method was used to build a highly detailed restriction map of Escherichia coli (KOHARA, AKIYAMA and ISONO 1987). The S. cerevisiae map was produced by assembling similar restriction profiles using a pair of six-base cutters, whereas a C. elegans map was made from profiles produced via a double digest with both a 4-base and a 6-base cutter. These experiments to assemble contiguous blocks of overlapping clones, or contig maps, are not only time consuming, but data entry requires digitizing complex and sometimes ambiguous banding patterns for each clone. Presently, both groups have exploited yeast artificial chromosomes

¹ Present address: Department of Medicine, Johns Hopkins Medical School, 1830 E. Monument Street, Baltimore, Maryland 21205.

Genetics 132: 591-601 (October, 1992)


592 A. J. Cuticchia, J. Arnold and W. E. Timberlake

TABLE 1

List of oligomers used in the chromosome reconstruction simulations

CTTGCCGGC CTTGCGGCC CTCTCGGCC CTCCTCCCG CTCCTCGCC CTCCCAGGC CTCCACCGC CTCCAGGCG CTGCTGCGC CTGCTGGGG CTGCCACCC CTGCCACCG CTGCCGACG CTMCGGTG CTGCAGCGG CTGGTGCGG CTGGTGGGC CTGGCGACG CTGGCGAGC CTGGCGAGG

CTGGCGGTC CTGGGAGCC CTGGGAGGC CTGGGAGGG CCTTGGCCG CCTCTCCGC CCTCTGGGC CCTGCCTGC CCTGCGGTG CCTGGCCTG CCTGGGAGC CCCTCTCCC CCCTGCAGC CCCGCCTTC CCACCCTGG CCACCAGGC CCAAGGCGG CCAGAGGCC CCAGGCTCC CACCTGGCG CACCAGCCC CACCGCCTC CAGGAGGCC CAGGAGGGC CAGGGCTGC CGCCTCTCC CGCCCTTCC CGCCGCTTC CGACGCAGC CGGCCTCTC

PROBE CHOICE

To ensure that a reasonable subset of the library will hybridize to a given probe, we employ a third-order Markov chain predictor to estimate the frequency of occurrence of any given oligonucleotide. Both E. coli and S. cerevisiae have been successfully modeled via a Markov chain (PHILLIPS, ARNOLD and IVARIE 1987a,b; ARNOLD et al. 1988). Moreover, it has been shown that probes hybridizing to 40% of the clones within a library lead to efficient mapping experiments (FU, TIMBERLAKE and ARNOLD 1992).

Probe choice must also take into account the expected $T_m$ of the oligonucleotide probes. A FORTRAN program, PCAP, which selects clones with the appropriate G + C composition and with a hybridization frequency of 40%, has been written by A. J. CUTICCHIA (CUTICCHIA, ARNOLD and TIMBERLAKE 1992a). Using 40 kilobases (kb) of Aspergillus nidulans DNA sequence, we have selected the 50 9-mers (Table 1) with the above properties to assign digital call numbers to cosmid clones with 40-kb inserts. If YAC vectors were used instead, with an average insert size of 150 kb in A. nidulans, then a battery of 50 11-mers could be selected as probes because the program PCAP can identify in A. nidulans highly abundant 11-mers with the appropriate G + C composition and a 40% hybridization frequency. NIZETIC, DRMANAC and LEHRACH (1991) have reported success with 10-mers, and HOHEISEL et al. (1991) report success with octamers. An alternate approach is to use single copy landmarks to assign digital fingerprints to clones. For example, cosmid clones (landmarks) could be hybridized with clones in a YAC library (HOHEISEL et al. 1991).
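The idea of a third-order Markov chain predictor can be illustrated with a minimal Python sketch. It is not the PCAP program: the function names are hypothetical, the chain is estimated from simple tri- and tetranucleotide counts of a training sequence, and the conversion from a per-site probability to a fraction of hybridizing clones uses a Poisson approximation that is our assumption rather than a formula given in the text.

```python
import math
from collections import Counter

BASES = "ACGT"

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def markov3_site_probability(oligo, training_seq):
    """Probability that oligo starts at a given position under a
    third-order Markov chain estimated from training_seq."""
    tri = kmer_counts(training_seq, 3)
    tetra = kmer_counts(training_seq, 4)
    # Start with the observed frequency of the first trinucleotide.
    p = tri[oligo[:3]] / sum(tri.values())
    # Multiply by P(next base | previous three bases) for each later base.
    for i in range(3, len(oligo)):
        context = oligo[i - 3:i]
        context_total = sum(tetra[context + b] for b in BASES)
        if context_total == 0:
            return 0.0
        p *= tetra[context + oligo[i]] / context_total
    return p

def expected_hybridization_fraction(p_site, insert_bp):
    """Assumed Poisson approximation: fraction of clones of length
    insert_bp that contain at least one target site."""
    return 1.0 - math.exp(-insert_bp * p_site)
```

Under this approximation, a probe reaches the 40% hybridization frequency on 40-kb inserts when its per-site probability is about -ln(0.6)/40000.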

ORDERING PROCEDURES

To simplify the reconstruction problem, the clonal library is first separated into chromosome-specific subcollections (BRODY et al. 1991). The problem of contig mapping is then reduced to ordering the clones in a specific subcollection along a chromosome. The ordering principle is as follows:

(1) Assume a piece of DNA has been randomly cloned into 4 fragments as shown below, where the numbers 1-8 represent the presence of a specific oligonucleotide pattern that is the target site for probe hybridization.

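[Diagram: target sites in the order 1 3 2 8 6 3 5 6 8 4 1 5 along the DNA (site 7 is drawn above the line), covered by the overlapping clones A, B, C and D, which form a contig.]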

(2) The scoring of the clones would be:

         Probe
Clone    1  2  3  4  5  6  7  8
A        1  0  0  0  1  0  0  0
D        1  0  0  1  1  1  0  1
C        0  1  1  0  1  1  0  1
B        1  1  1  0  0  0  1  1

where 1 represents hybridization and 0 represents the absence of hybridization.

(3) The statistic $d(c_i, c_j)$ is defined as the distance between clone $c_i$ and clone $c_j$:

$$d(c_i, c_j) = \sum_{x=1}^{m} I_x$$

where:

m = number of probes

$I_x$ = 1 if probe x hybridizes to exactly one of the clones

$I_x$ = 0 if probe x hybridizes to both or neither clone.

(4) The Manhattan matrix denoting the distances between each pair of clones is as follows:

       B  C  D
A      5  5  3
B         4  6
C            4

(5) The statistic D is defined as the total linking distance across the chromosome:

$$D = \sum_{i=1}^{n-1} d(c_i, c_{i+1}),$$

where:

n = number of clones

i = ordered index along a chromosome.
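The ordering principle in steps (1) through (5) can be checked on the four-clone example with a short Python sketch. The function names are ours, and the exhaustive search is included only because four clones give just 24 orderings; it is not a method that scales to a real library.

```python
from itertools import permutations

# Binary signatures of the four example clones (probes 1 through 8).
signatures = {
    "A": (1, 0, 0, 0, 1, 0, 0, 0),
    "B": (1, 1, 1, 0, 0, 0, 1, 1),
    "C": (0, 1, 1, 0, 1, 1, 0, 1),
    "D": (1, 0, 0, 1, 1, 1, 0, 1),
}

def distance(sig_a, sig_b):
    """d(ci, cj): number of probes that hybridize to exactly one clone."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def linking_distance(order, sigs):
    """D: sum of distances between neighbouring clones in the order."""
    return sum(distance(sigs[a], sigs[b]) for a, b in zip(order, order[1:]))

def best_order_exhaustive(sigs):
    """Try every ordering; feasible only for a handful of clones."""
    return min(permutations(sigs), key=lambda o: linking_distance(o, sigs))

best = best_order_exhaustive(signatures)
print(best, linking_distance(best, signatures))
```

For this example the order B-C-D-A, or equivalently its reverse, gives D = 4 + 4 + 3 = 11, which is the smallest value over all orderings and agrees with the layout of the clones along the DNA.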

[Figure 1: three panels plotted against Number of Probes (30, 40, 50) for libraries of 225, 338 and 450 clones. A. Number of Contigs. B. Maximum Contig Size. C. Error within Contigs.]

FIGURE 1.-Results of simulated annealing on the reconstruction of a simulated 2000-kb chromosome. Part A shows the expected number of contigs for 200 trials with varying probe number and clonal densities. Part B shows the expected size of the largest contig produced by simulated annealing. Part C shows the error of ordering along a chromosome. Error is measured by the increase in the number of steps required to cover the chromosome. Under perfect ordering there would be n - 1 steps of length 1 to cover the chromosome, where n is the number of clones, giving an error value of zero (1-2 is 1 step, 2-3 is 1 step, but 5-9 is 4 steps). All values were standardized to a chromosome consisting of 225 clones for purposes of comparison. Perpendicular lines represent 95% confidence intervals.

respect to clonal ordering. While as of yet there is no proof that

$$\lim_{n \to \infty} \mathrm{pr}(|D_{\min} - D_0| > \varepsilon) = 0,$$

i.e., that the minimum D converges in probability (pr) to the true library distance $D_0$ as the library size (n) grows, our solutions to this minimization problem were extremely close to the true $D_0$ (Figure 1). The degree of closeness of solutions to this problem is quantified in panel C of Figures 1, 2 and 3. Thus, the ordering of a clonal library simplifies to the minimization of D.

To order libraries with as few as 225 clones using an exhaustive approach would require considering all possible orderings. For n clones, the total number of possible orders is n!. Thus, for n > 10, the deduction of the minimal D cannot be performed exhaustively. We have chosen to use the process of simulated annealing (LUNDY 1985; GOLDSTEIN and WATERMAN 1987) to find the best contig map. Simulated annealing allows a trial D' from a random reordering of clones to be computed, and the new order is kept if D' < D. However, to avoid becoming trapped in a local minimum of D, a trial D' > D will be tolerated with decreasing probability throughout the trials. Therefore, the early stages of the annealing process provide an exploratory phase, which becomes circumscribed as D approaches its minimum. By employing a method of combinatorial optimization, simulated annealing, the number of necessary trial orderings is decreased from $10^{415}$ to less than $10^{7}$ while avoiding local optima with probability one under fairly general conditions (FAIGLE and KERN 1991).

The term simulated annealing comes from the algorithm's foundation in statistical mechanics. In order for a solid to form its minimal energy state it must first be heated to a sufficient temperature so as to break all solid structure ($T_0$), and then be cooled slowly enough so that all molecules have time to find the appropriate conformation (ΔT). The process of simulated annealing derives the solution of minimization problems of this type while avoiding local minima with probability one (METROPOLIS et al. 1953; KIRKPATRICK, GELATT and VECCHI 1983; LUNDY 1985; PRESS et al. 1986; GOLDSTEIN and WATERMAN 1987; FAIGLE and KERN 1991). The algorithm is as follows:

1. Choose a random order of clones whose linking

2. Compute the distance D.

3. Choose a random segment within the ordering.

4. Either perform a segment reversal (reverse all clones within the chosen segment) or a segment transport (relocate segment elsewhere in the order) each with equal probability.

[Figure 2: three panels (A. Number of Contigs; B. Maximum Contig Size; C. Error within Contigs) for libraries of 225, 338 and 450 clones.]

FIGURE 2.-Results of simulated annealing when unclonable repetitive DNA is introduced into the simulation. As in Figure 1, the number of contigs, maximum contig size, and error are plotted for the varying clonal densities and probe numbers for 200 simulations introducing 5% of unclonable DNA broken into 10 randomly interspersed pieces.

[Figure 3: three panels (A. Number of Contigs; B. Maximum Contig Size; C. Error within Contigs) plotted against Number of Probes for libraries of 225, 338 and 450 clones.]

FIGURE 3.-Results of simulated annealing when non-homologous hybridization is introduced into the simulation. As in the preceding figure, the number of contigs, maximum contig size, and error are plotted for the varying clonal densities and probe numbers for 200 simulations introducing a 1% chance that a probe lacking homology to the clone will hybridize for each probe which homologously hybridizes.

5. Compute the new distance D'.

order (where ΔD = D' - D). Thus, the probability

TABLE 2

Levels examined in the calibration of the annealing machine

Parameter    Values
T0           50, 100
F            0.25, 0.50, 0.75
M            100,000, 300,000, 500,000
S            0.05M, 0.10M

7. If the number of successful reversals or transports is greater than 0 for a chosen step on the staircase, decrease T by a factor F and continue on to the new step. The length of the step on the staircase function is determined by the parameters M (maximum number of trials at a given level T) and S (maximum number of successful rearrangements at a given level T). If the number of successful reversals or transports equals 0 for a given step, the process is complete.

For the process of simulated annealing to be effective, an annealing machine must first be fine-tuned. In simulated annealing the values of four parameters must be chosen. The first parameter, $T_0$, the initial temperature, must be set high enough so that the exploratory phase of the process is initially implemented (in reference to physics, the solid must be heated high enough to cause melting). The parameter F, the fraction by which T is decreased per step, must be set low enough to allow the annealing to be completed efficiently, while not so low as to increase the probability of choosing an order that is only a local minimum with respect to D.

The geometric staircase, $T_i = F^i T_0$, where i denotes the ith step, is one example of an annealing schedule. Others have been proposed, such as $T_0/\ln(i)$ (GEMAN and GEMAN 1984). We will return to why we have selected the simple geometric staircase in the next section. Finally, the parameters M and S, which determine the length of each step in the annealing schedule, must be set to allow for sufficient computation to minimize D. The constants S, the maximum number of successful rearrangements before decreasing T by the fraction F, and M, the maximum number of trials at a given value of T, must be carefully chosen. Each step is completed when either M or S is exceeded.
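The annealing machine described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: the acceptance rule is the standard Metropolis criterion exp(-ΔD/T), which we assume because the exact rule is not legible in this text; only segment reversals are used; moves that leave D unchanged are ignored so that the stopping rule can trigger; and `max_steps` is a safety cap that we added. The parameter names `t0`, `f`, `max_trials` and `max_successes` stand for $T_0$, F, M and S, with small default values chosen for toy inputs.

```python
import math
import random

def distance(sig_a, sig_b):
    """Number of probes that hybridize to exactly one of two clones."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def linking_distance(order, signatures):
    """Total linking distance D for a given clone order."""
    return sum(distance(signatures[a], signatures[b])
               for a, b in zip(order, order[1:]))

def anneal(signatures, t0=50.0, f=0.5, max_trials=5000,
           max_successes=250, max_steps=200, seed=0):
    """Minimize D by simulated annealing on a geometric staircase."""
    rng = random.Random(seed)
    order = list(range(len(signatures)))
    rng.shuffle(order)                      # random starting order
    d = linking_distance(order, signatures)
    best_order, best_d = order[:], d
    t = t0
    for _step in range(max_steps):          # safety cap (our assumption)
        successes = 0
        for _trial in range(max_trials):    # at most M trials per step
            i, j = sorted(rng.sample(range(len(order)), 2))
            trial = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
            d_trial = linking_distance(trial, signatures)
            delta = d_trial - d
            # Assumed Metropolis rule: always accept improvements,
            # accept uphill moves with probability exp(-delta / t).
            if delta < 0 or (delta > 0 and rng.random() < math.exp(-delta / t)):
                order, d = trial, d_trial
                successes += 1
                if d < best_d:
                    best_order, best_d = order[:], d
                if successes >= max_successes:   # at most S successes
                    break
        if successes == 0:                  # no accepted move: finished
            break
        t *= f                              # next step of the staircase
    return best_order, best_d
```

Called with a list of binary signatures, `anneal` returns the best order found (as indices into the list) together with its linking distance. Recomputing D from scratch for every trial keeps the sketch short; a practical implementation would update only the two junctions affected by a reversal.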

CALIBRATION OF THE ANNEALING MACHINE

To set the four parameters of the annealing algorithm, a full factorial design (WINER 1971) was performed for several values of each parameter as shown in Table 2. The effect of each combination was measured by the ability to minimize D across 225 clones probed with 50 probes in a simulation study. A 2000-kb fragment was simulated based on a third-order Markov chain and cut randomly into 225 cosmid (40 kb) clones, to each of which 50 probes were hybridized. The probes are shown in Table 1. Ten random permutations of the 225 clones were ordered by each of the annealing schemes and the resulting D recorded. In order to test for the possibility of a genome effect on the results, the entire simulation was repeated on another simulated genome. Both the modeled genome and the associated probes in Table 1 were generated from estimated tetranucleotide frequencies of the genome of A. nidulans.

There were significant (P < 0.01) main effects for both F and M, their interaction, as well as the T-F interaction. The only other significant effect was a T-F-M three-way interaction. The steepness of the staircase (F) interacts with the length of the step (M) and with the total height of the staircase ($T_0$) to determine the quality of the annealing schedule. There was no significant main effect based on a replicate of the simulated genome.

Analysis of the plot of the residuals as a function of expected values from the ANOVA model showed the classical fan-shape plot indicative of a deviation from homogeneity of variance, although the regression was insignificant (r < 0.001). Plots of residuals vs. each of the factors involved in the calibration of the annealing machine indicate a decrease in the variance of D with increasing F and increasing M. A Box-Cox transformation was performed showing $L_{\max}$ peaking at λ = -1.25 (DRAPER and SMITH 1961). The resulting ANOVA after transformation showed no deviation in the results for the significance of the annealing parameters as well as the T-F-M interaction, while eliminating much of the fan-shape in the residuals vs. expected value plot. Furthermore, the normal plots of both transformed and non-transformed data showed no significant deviation from normality.

In a multiple comparison analysis (Tukey-b) based on the 18 classes with significant T-F-M interaction, we found 6 classes that best minimized D without significant difference from each other. From these annealing schemes, $T_0$ = 50, F = 0.5, M = 500,000 and S = 0.05M is the recommended annealing scheme since it completed in the least amount of time. With the parameters of the annealing machine set, the algorithm was modified to perform only transports of segments or to perform only reversals of segments instead of an equal ratio of the two operations. The results showed no significant difference in the resulting values of D for reversals only vs. both reversals and transports. Because employing only reversals in the annealing machine decreased the time of annealing by half, the annealing machine was modified to perform only reversals.

Lastly, in a series of 10 additional replicates, we also tried the more gradual annealing schedule $T_0/\ln(i)$ (GEMAN and GEMAN 1984). The results did not differ significantly from the recommended annealing schedule and took two orders of magnitude longer to complete. We recommend the use of the geometric annealing schedule.
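The difference in cooling speed between the two schedules can be made concrete with a small Python sketch. The function names are ours, and the target temperature of 5 is an arbitrary value chosen for illustration; the sketch only counts schedule steps and says nothing about the running times measured in the replicates.

```python
import math

def geometric_steps(t0, f, target):
    """Steps i needed before T_i = f**i * t0 drops to target or below."""
    steps, t = 0, t0
    while t > target:
        t *= f
        steps += 1
    return steps

def logarithmic_steps(t0, target):
    """Smallest i with t0 / ln(i) <= target, i.e. i >= exp(t0 / target)."""
    return math.ceil(math.exp(t0 / target))

print(geometric_steps(50.0, 0.5, 5.0))   # 4
print(logarithmic_steps(50.0, 5.0))      # 22027
```

With $T_0$ = 50 and F = 0.5, the geometric staircase falls from 50 to below 5 in four steps, whereas the logarithmic schedule needs more than twenty thousand.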

TEST OF THE ALGORITHM

The ability of this ordering procedure to produce accurate physical maps was tested at three clonal densities and at three numbers of probes. One hundred simulated fragments of 2000 kb (based on the genome of A. nidulans) were generated and segmented into overlapping 40-kb fragments to simulate cloning of the random fragments into a cosmid library. The entire procedure was repeated for another 100 genomes in order to ensure that the sample size was sufficient. The statistics collected were the number of contigs present at the completion of the annealing, the maximum contig length (in terms of numbers of constituent clones), and the error within contigs (see figure legends for definitions).

A true contig is defined as a sequence of one or more clones, in which each clone overlaps with its neighbor(s). An observed contig is a sequence of one or more clones, in which each clone is inferred to overlap with its neighbor(s). Some authors prefer to define a contig as consisting of two or more clones to distinguish them from isolated clones. This distinction is unnecessary for the purposes of this paper. All counts of contigs in this paper are observed in the reconstructed chromosome.

Once again a full-factorial design was employed with respect to probe number and clonal density. The number of clones per 2000-kb fragment was 225, 338 or 450. These values represent genome equivalents of 4.5, 6.75 and 9.0, respectively. The number of probes studied was 30, 40 or 50. The effects of probe number and clonal density on number of contigs, maximum contig length, and error of ordering along a chromosome are presented in Figure 1. With a clonal density of 9 genome equivalents, an experiment using 50 probes would reduce the ordering problem from one of 450! (assuming each clone as a unique contig) to one of $5! \times 2^5$. These contigs can then be ordered with respect to one another by hybridization to larger fragments of DNA, such as those contained in YAC libraries (COULSON et al. 1988).
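The arithmetic behind the genome equivalents and the reduction of the search space can be reproduced with a few lines of Python. The function names are ours, and the factor $2^5$ is read here as the two possible orientations of each of the five contigs, which is our interpretation of the expression in the text.

```python
import math

def genome_equivalents(n_clones, insert_kb, chromosome_kb):
    """Clonal density: total cloned DNA divided by chromosome size."""
    return n_clones * insert_kb / chromosome_kb

def log10_factorial(n):
    """log10(n!) computed through the log-gamma function."""
    return math.lgamma(n + 1) / math.log(10)

def contig_arrangements(n_contigs):
    """Orders of the contigs times two orientations per contig."""
    return math.factorial(n_contigs) * 2 ** n_contigs

print(genome_equivalents(225, 40, 2000))   # 4.5
print(genome_equivalents(450, 40, 2000))   # 9.0
print(contig_arrangements(5))              # 3840
print(round(log10_factorial(450)))         # 1000
```

Ordering five contigs therefore involves a few thousand arrangements, compared with a number of roughly 1000 decimal digits for 450 unordered clones.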

Even with as few as 30 probes and 4.5 genome equivalents, it is possible to reduce the number of contigs to fewer than 40. Another measure of the quality of a map is the size of the largest contig. Frequently, the long range continuity of a physical map is measured by the maximum contig's size (MERRIAM et al. 1991). Low densities and probe numbers can yield contigs of as many as 25 clones, while higher densities and probe numbers can increase this to over 70.

The error in ordering in reconstructing a chromosome is now defined. Assume that clones in a library are indexed by their inferred ordering: 1, 2, ..., n. Further suppose a principle, such as the minimization of the linking distance D, is applied to reconstruct a chromosome. The true ordering (ranking) of clones is actually: $R_1, R_2, \ldots, R_n$. One measure of error, denoted e, might be:

$$e = \sum_{i=1}^{n-1} |R_{i+1} - R_i| - (n - 1).$$

If the true ordering were recovered, then the summation would reduce to (n - 1) because $(R_1, R_2, \ldots, R_n) = (1, 2, \ldots, n)$. In this case, we would have e = (n - 1) - (n - 1) = 0. The error e then measures the increase in number of steps to cover a chromosome. It is one measure of how far the true ranks $(R_1, R_2, \ldots, R_n)$ are from the inferred ordering, (1, 2, ..., n).

This measure of error is particularly appropriate in our setting because the cosmid library (BRODY et al. 1991) has been sorted into chromosome-specific subcollections. (We have not encountered problems in cloning centromeric regions as in D. melanogaster.) There is only one true contig, and we are interested in how accurately minimizing D reconstructs the whole chromosome in vitro. When more than one true contig exists, the measures of error may need modification. For example, it might be useful to have the error measure both misorderings within contigs as well as the incorrect assignment of clones to contigs.

The maximum value of this error statistic is

$$e_{\max} = \sum_{i=1}^{n-1} i + \mathrm{INTEGER}\left(\frac{n}{2} - 1\right) - (n - 1),$$

where:

n = number of clones.

For 225 clones, in the worst possible order clone 1 is between clones 224 and 225 in the following arrangement:

... 3-224-1-225-2-223-4 ...

The maximum value is 25087. As an example, suppose n = 5. Then the order (3, 2, 5, 1, 4) would maximize the error e, and

$$e_{\max} = (4 \cdot 5/2) + \mathrm{INTEGER}\left(\frac{5}{2} - 1\right) - (5 - 1) = 11 - 4 = 7.$$
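The error statistic e and its maximum can be verified numerically with a short Python sketch. The function names are ours, and the brute-force check enumerates all permutations, so it is practical only for very small n.

```python
from itertools import permutations

def ordering_error(true_ranks):
    """e: extra steps needed to cover the chromosome, given the true
    ranks of the clones listed in their inferred order."""
    n = len(true_ranks)
    steps = sum(abs(b - a) for a, b in zip(true_ranks, true_ranks[1:]))
    return steps - (n - 1)

def max_error(n):
    """Closed form: sum of 1..n-1, plus INTEGER(n/2 - 1), minus (n - 1)."""
    return n * (n - 1) // 2 + (n // 2 - 1) - (n - 1)

def brute_force_max_error(n):
    """Largest e over every possible ordering of n clones."""
    return max(ordering_error(p) for p in permutations(range(1, n + 1)))

print(ordering_error((1, 2, 3, 4, 5)))   # 0
print(ordering_error((3, 2, 5, 1, 4)))   # 7
print(max_error(225))                    # 25087
```

The closed form agrees with the worked example for n = 5 and with the value of 25087 quoted for 225 clones.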

[Figure 4: plot against Ranked Hexanucleotide Abundance (0 to 4000) with two series, Observed and Expected.]

FIGURE 4.-Hexanucleotide ordered abundance plot for A. nidulans. The 4096 hexanucleotide frequencies determined from 40 kb of GenBank sequences were ranked from highest to lowest abundance (observed). Superimposed on this are those frequencies predicted by a third-order Markov chain (expected).

in the reconstructed chromosome exists, it is caused by either the fact that no detectable overlap between ends of contigs exists or the fact that two non-overlapping clones have a higher probability of overlap than two overlapping clones. It is for these reasons that high clonal density and high probe number aid in the process of accurate chromosome reconstruction.

The purpose of this investigation is to order random clonal libraries by simulated annealing to aid in the mapping of the genome of A. nidulans. Our simulations have yielded estimates as to the expected numbers of contigs and maximum contig length based on a specified clonal density and number of probes. However, these estimates are based on three assumptions inherent in our simulations.

The first is that the genome of A. nidulans can be modelled by a Markov chain. Our investigation based on the amount of available sequence shows that the mean observed (obs) to expected (exp) ratios of hexanucleotides (a measure of error in genome modeling) within A. nidulans do not deviate significantly from those of previously modelled organisms (Figure 4). The mean of the ratio obs/exp is 1.35 with a standard error of 1.77 for a random predictor, while the Markov chain predictor yields a mean ratio of 1.17 with a standard error of 0.15. Actual hybridization results of chosen oligomers will provide the first direct empirical test as to the error of a Markov chain in the prediction of oligonucleotide occurrence across an entire genome.

The second assumption is that clones within a random cosmid library are truly random. In previous mapping endeavors there has been a tendency for libraries to deviate from random representation (OLSON et al. 1986). To test the effect that unclonable or missing regions of the genome might have on the ordering results, a simulation was run as in the full-factorial design, with probe number and clone densities varied as before and with the injection of unclonable regions into the model genome. Two replicates of the 100 genome factorial design were run with 5% of the 2000-kb fragment designated as missing and dispersed into 10 pieces of random length. We chose the value of 5% because it closely approximates the amount of repetitive DNA in A. nidulans, which is mostly ribosomal DNA (rDNA) (TIMBERLAKE 1978). The exact distribution of rDNA sequences in the genome is unknown at this time, but each clone in the library is being scored for hybridization (or no hybridization) with an rDNA probe. The potentially problematic clones are subtracted from the library in the chromosome reconstruction. As a consequence, clones containing repetitive sequences can be viewed as “missing” regions during the reconstruction. If clones with repetitive sequences were not set aside and if a probe were by chance selected to hybridize to a repetitive sequence, then all clones with this repetitive sequence would hybridize with the probe independent of whether or not these clones overlap. The effect of missing regions on the ordering procedure was to decrease the expected maximum contig size at all densities and numbers of probes, while actually decreasing error at low probe numbers and high clone numbers (Figure 2). From this experiment we conclude that if 5% of the A. nidulans genome is unclonable, or if we subtract all the rDNA genes (5% of the genome) from the library, the physical map will not be substantially degraded.

The final assumption is that all probes used in the procedure will hybridize independently of one another. This assumption can be tested from the resulting mapping data by an analysis of the association between the hybridization and lack thereof of each possible pair of probes. In the event that an association between combinations of probes exists, this will result in a decrease of information from the profiles based on probe number. Thus, 50 probes with a moderate degree of association may only yield the information of 30 independently hybridizing probes.
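One way such a pairwise analysis might be carried out is sketched below in Python. The text does not say which test of association is intended; the Pearson chi-square statistic for a 2 x 2 table (without continuity correction) is our choice for illustration, and the function names are hypothetical.

```python
def pair_table(col_x, col_y):
    """2 x 2 counts (n11, n10, n01, n00) for two probes scored
    as 0/1 across the same clones."""
    n11 = sum(1 for x, y in zip(col_x, col_y) if x and y)
    n10 = sum(1 for x, y in zip(col_x, col_y) if x and not y)
    n01 = sum(1 for x, y in zip(col_x, col_y) if not x and y)
    n00 = sum(1 for x, y in zip(col_x, col_y) if not x and not y)
    return n11, n10, n01, n00

def chi_square_2x2(n11, n10, n01, n00):
    """Pearson chi-square statistic (1 degree of freedom)."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0          # a probe that never (or always) hybridizes
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def probe_association(scores, x, y):
    """Chi-square for probes x and y; scores is a list of clone
    signatures, one tuple of 0/1 values per clone."""
    col_x = [s[x] for s in scores]
    col_y = [s[y] for s in scores]
    return chi_square_2x2(*pair_table(col_x, col_y))
```

A statistic above 3.84 would indicate association at the 5% level for a single pair; with 50 probes there are 1225 pairs, so some allowance for multiple comparisons would be needed in practice.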

[Figure 5: Distance (D), on a scale of 0 to 1500, plotted against Annealing Temperature (T) at the steps 50, 25, 12.5, 6.3, 3.1, 1.6, .78, .39, .19, .1, .05, .02.]

FIGURE 5.-A plot of map size D against the steps of the annealing schedule. The true D is 417, while the minimum D achieved in the ordering procedure is 420. The simulated chromosome is 2000 kb, the insert size for the cloning vector is 40 kb, and the library size is 225 clones.

a 1% probability that the cosmid would hybridize with a nonsimilar probe (CRAIG et al. 1990). It should be noted that the error probability of 1% in the simulations in Figure 3 is distinct from the 1.7% error rate reported in CRAIG et al. (1990), which is the total number of errors in the data divided by the product nm. When there is a 1% chance that a nonhomologous probe falsely hybridizes to a clone for each probe that truly hybridizes to a clone, there is a 6-10% chance of an error in the sense of CRAIG et al. (1990). Our 1% error rate is then much higher than what has been observed so far. False positives at this rate increased the number of contigs by a factor of two at low probe numbers and by a factor of three at high probe numbers. The expected maximum contig size decreased by a factor of six at high probe numbers, while error within contigs was increased by less than 25% even at high probe numbers (Figure 3). These results emphasize the importance of careful hybridization and scoring of clones in this method of chromosomal reconstruction.

While simulated annealing has been proven to be a very effective algorithm at avoiding local minima in D, for example, with probability one (FAIGLE and KERN 1991), this does not in and of itself provide a justification for the criterion D. What is needed is some consistency result, in which the minimum of D is shown to converge to the true $D_0$ as the clonal library (or the number of probes) grows large. In Figure 1 we provide simulation evidence, through the plotted decrease in error, that a limiting result as n (or the number of probes) approaches infinity may hold. In fact, in most runs the minimum found is very close to the D of the actual order (Figure 5). Figure 5 also serves to illustrate that while $D_{\min}$ may be close to the true $D_0$, they need not be equal, with $D_{\min}$ falling on both sides of $D_0$ in the 1800 runs summarized in Figure 1.

TABLE 3

Expected mapping results based on present clonal densities of A. nidulans

                                                       Number of Probes
Chromosome  Size (kb)ᵇ  Clonesᵈ  Statistic       30             40             50
1           3500        461      Contigsᵃ        79.9 (6.0)     56.4 (6.1)     40.1 (5.3)
                                 Errorᶜ          180.5 (9.9)    143.4 (12.5)   113.0 (10.6)
                                 Max. contigᵉ    21.7 (4.9)     30.9 (7.9)     41.8 (10.7)
                                 σᶠ              0.33 (0.01)ᵍ   0.40 (0.02)    0.46 (0.03)
2           4000        410      Contigs         95.3 (7.1)     71.9 (5.9)     56.9 (5.3)
                                 Error           167.5 (10.2)   137.5 (9.7)    113.5 (10.1)
                                 Max. contig     16.2 (4.3)     22.1 (5.1)     29.5 (6.9)
                                 σ               0.36 (0.02)    0.42 (0.02)    0.48 (0.02)
3           3000        432      Contigs         64.0 (6.4)     43.4 (5.6)     29.5 (5.1)
                                 Error           165.6 (11.2)   130.2 (11.4)   103.0 (11.5)
                                 Max. contig     24.7 (6.1)     35.4 (9.7)     51.1 (15.3)
                                 σ               0.33 (0.02)    0.40 (0.02)    0.47 (0.03)
4           2100        343      Contigs         38.4 (5.4)     24.0 (4.7)     14.5 (3.7)
                                 Error           129.2 (11.4)   99.5 (12.0)    79.1 (7.7)
                                 Max. contig     30.7 (8.8)     49.3 (16.8)    74.5 (24.8)
                                 σ               0.34 (0.02)    0.41 (0.03)    0.48 (0.04)
5           3500        326      Contigs         80.2 (5.4)     62.7 (4.9)     51.0 (4.9)
                                 Error           132.7 (7.6)    108.2 (7.6)    89.9 (8.4)
                                 Max. contig     14.8 (3.8)     20.2 (4.3)     24.5 (5.5)
                                 σ               0.38 (0.02)    0.44 (0.02)    0.50 (0.03)
6           3000        276      Contigs         64.4           51.4 (4.8)     40.7 (4.7)
                                 Error           107.6          89.2 (8.5)     71.2 (7.7)
                                 Max. contig     15.0 (3.0)     19.0 (4.7)     25.0 (6.4)
                                 σ               0.40 (0.02)    0.46 (0.03)    0.52 (0.03)
7           4800        468      Contigs         121.3          93.9 (5.9)     75.7 (5.8)
                                 Error           197.9          160.8 (9.5)    134.2 (10.1)
                                 Max. contig     15.2 (3.0)     21.1 (4.8)     25.4 (4.9)
                                 σ               0.35 (0.01)    0.41 (0.02)    0.47 (0.02)
8           5500        591      Contigs         153.6 (47.3)   120.7 (51.0)   95.6 (52.4)
                                 Error           253.1 (19.0)   208.2 (18.7)   169.2 (19.2)
                                 Max. contig     16.4 (3.7)     20.9 (4.5)     28.6 (7.7)
                                 σ               0.31 (0.07)    0.37 (0.10)    0.42 (0.13)

ᵃ Expected number of contigs and standard error in parentheses.
ᵇ Chromosome size (kb).
ᶜ Expected error of ordering and standard error.
ᵈ Number of clones uniquely identified to chromosome.
ᵉ Expected maximum contig size and standard error.
ᶠ σ = 1 - θ, where θ = minimal detectable percent overlap.
ᵍ Standard error of σ, calculated by the delta method.

PRACTICAL APPLICATIONS

As we stated at the start of this paper, we are attempting to utilize these procedures for the chromosome reconstruction of the genome of A. nidulans.

A simulation was conducted with 100 replicates of the eight A. nidulans chromosomes. This simulation involved the modelling of the eight chromosomes and the cloning into cosmid libraries of size equal to the number of cosmid clones known to hybridize uniquely to A. nidulans chromosomes (BRODY et al. 1991).

In Table 3 we show that we can expect to assemble contigs of over 70 clones in the case of chromosome 4 using 50 probes, and contigs of more than 20 clones using 50 probes in any chromosome. Presently, one-third of the clones in our A. nidulans collections are not uniquely assigned to any one chromosome. It is our hope that further assignments will increase clonal density, thus increasing the resolution of the resulting map.

From these simulation results in Table 3 we can also estimate a critical parameter θ, the “expected minimal detectable overlap” for this fingerprinting procedure. The quantity σ = 1 - θ reported in Table

(9)

contigs, average contig size, and other relevant char-

acteristics of a library (LANDER and WATERMAN 1988). We estimated u by the method of moments. We set

the average number of contigs in Table 3 equal to , where the quantity M denotes the insert size of the cloning vector, and N denotes the size of the chromosome. T h e standard error on the estimate of

c was computed by the delta method from the standard error in mean contig number in Table 3 (WEIR
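The moment estimate and its delta-method standard error can be written out explicitly. Solving the expected-contig formula $n e^{-\sigma n M / N}$ for σ gives σ = −(N / nM) ln(C / n), where C is the mean contig number, and the delta method gives SE(σ) ≈ (N / nM) · SE(C) / C. The Python sketch below is a minimal illustration of that algebra; the function name and the numbers in the demonstration are hypothetical and are not taken from Table 3.

```python
import math


def estimate_sigma(mean_contigs, se_contigs, n_clones, insert_kb, chrom_kb):
    """Method-of-moments estimate of sigma = 1 - theta, with delta-method SE.

    Solves  mean_contigs = n * exp(-sigma * n * M / N)  for sigma.
    """
    coverage = n_clones * insert_kb / chrom_kb  # n M / N, in genome equivalents
    sigma = -math.log(mean_contigs / n_clones) / coverage
    # Delta method: |d sigma / d C| = 1 / (coverage * C)
    se_sigma = se_contigs / (coverage * mean_contigs)
    return sigma, se_sigma


if __name__ == "__main__":
    # Hypothetical inputs: 500 clones, 40-kb inserts, 4000-kb chromosome,
    # mean contig number 67.7 with standard error 5.0.
    sigma, se_sigma = estimate_sigma(67.7, 5.0, 500, 40.0, 4000.0)
    print("sigma =", sigma, "SE =", se_sigma, "theta =", 1.0 - sigma)
```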

As an example, for the yeast physical map (OLSON et al. 1986) the expected minimal detectable overlap θ is estimated to be 0.63, while for the more detailed physical map of E. coli (KOHARA, AKIYAMA and ISONO 1987) θ is estimated to be 0.20. If we use 30 synthetic oligonucleotide probes, we have a mapping experiment with θ between 0.60 and 0.69 (from Table 3) for the different A. nidulans chromosomes, about the resolution of the yeast physical map; however, if we use 50 probes, we have a mapping experiment with θ between 0.52 and 0.59, which is intermediate in resolution between the S. cerevisiae and E. coli physical maps. With the estimates of θ, further operating characteristics of this digital fingerprinting procedure can be determined (LANDER and WATERMAN 1988).

We have also applied our recommended annealing schedule to reported binary fingerprint data on 20 clones, each with a 22-digit binary fingerprint, composing a portion of the HSV-I genome (CRAIG et al. 1990). Our results showed three contigs, with a single contig comprising two-thirds of the map. A plot of the inferred order of HSV-I clones against their actual order from the complete DNA sequence of HSV-I is given in Figure 6. This figure illustrates the efficacy of simulated annealing in producing physical maps from experimental data. Although the number of clones in this experiment was small, it would still be impossible to solve the problem exhaustively. Furthermore, the two breaks within our resulting map represent strong differences in hybridization profiles between supposedly overlapping clones within the correct order.
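To see why exhaustive search is ruled out even for 20 clones, note that they admit 20!/2 (about 1.2 × 10^18) distinct orderings, counting an order and its reverse as the same map. The Python sketch below shows the general shape of an annealing search over clone orders. It is not the authors' FORTRAN annealing machine: the cost (summed Hamming distance between adjacent fingerprints), the segment-reversal move, the geometric cooling schedule and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def hamming(a, b):
    """Number of probe positions at which two binary fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def order_cost(order, fingerprints):
    """Total fingerprint difference between neighbouring clones in an order."""
    return sum(hamming(fingerprints[order[i]], fingerprints[order[i + 1]])
               for i in range(len(order) - 1))


def anneal(fingerprints, t_start=5.0, t_end=0.01, cooling=0.95,
           moves_per_temp=100, seed=0):
    """Return (best_order, best_cost) found by a simple annealing run."""
    rng = random.Random(seed)
    n = len(fingerprints)
    order = list(range(n))
    rng.shuffle(order)
    cost = order_cost(order, fingerprints)
    best_order, best_cost = order[:], cost
    if n < 2:
        return best_order, best_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            # Assumed move: reverse a randomly chosen segment of the order.
            i, j = sorted(rng.sample(range(n), 2))
            candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
            delta = order_cost(candidate, fingerprints) - cost
            # Metropolis rule: always accept improvements, sometimes accept
            # a worse order with probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                order, cost = candidate, cost + delta
                if cost < best_cost:
                    best_order, best_cost = order[:], cost
        t *= cooling  # assumed geometric cooling
    return best_order, best_cost


if __name__ == "__main__":
    print(math.factorial(20) // 2)  # orderings of 20 clones, up to reversal
    # Toy data: clone k hybridizes to probes k, k+1 and k+2 out of 10.
    toy = [[1 if k <= p <= k + 2 else 0 for p in range(10)] for k in range(8)]
    print(anneal(toy))
```

Recomputing the full cost for every candidate keeps the sketch short; a practical implementation would update only the two adjacencies changed by a reversal.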

There are a number of different binary scoring schemes to which this algorithm for in vitro reconstruction of a chromosome could be applied. In this paper we have focused on the application of binary scoring of cosmid clones or smaller clones (CRAIG et al. 1990) using synthetic probes. HOHEISEL et al. (1991) have applied this approach to a cosmid library using 8-mers. There are limitations to this approach because of, for example, background hybridization to the E. coli host and non-repeatable hybridization patterns with short oligonucleotides. The former problem can be circumvented either by reducing the background hybridization or by increasing the signal (i.e., hybridization to the cloned DNA fragment). For example, if cosmid DNA preparations were done, this would eliminate the background hybridization. Alternatively, if there were hybridization to the host genome by a particular probe, then all colonies should light up, and a different probe could be selected.

FIGURE 6.-A plot of the inferred order of clones covering the genome of HSV-I against their actual order based on the complete genomic sequence. (Horizontal axis: Actual Order, 0 to 20.)

One might argue that long range continuity of the physical map might be better achieved by using a YAC library. One potential problem with YAC libraries is that insert size may vary. However, with the YAC library for D. melanogaster (MERRIAM et al. 1991), the mean size of a YAC clone is 197 ± 8 kb. There are a few clones that may be much smaller or larger. It might be desirable to size fractionate them for use in chromosome reconstruction, discarding those clones far from the mean in, for example, J. MERRIAM'S clone database (FLYBASE). With a YAC library longer oligonucleotides (i.e., 11-mers) could be used; however, there could still be a problem with background hybridization of probes to the yeast genome. Currently, this can be overcome only by pulsed field electrophoretic purification.

A third scheme to which our reconstruction methods could be applied is the binary scoring of YACs by single-copy landmarks (BARILLOT, DAUSSET and COHEN 1991), such as sequence-tagged sites (GREEN and OLSON 1990). While the small example in GREEN [...] database (GDB) (PEARSON et al. 1991; CUTICCHIA, ARNOLD and TIMBERLAKE 1992b).

A copy of the annealing machine in FORTRAN is available by e-mail request to ARNOLD@BSCF.UGA.EDU (CUTICCHIA, ARNOLD and TIMBERLAKE 1992c).

CONCLUSIONS

We have shown the ability of simulated annealing to reconstruct simulated chromosomes and one real genome accurately. Clonal densities as low as 4.5 genome equivalents are sufficient to allow contigs of considerable size to be constructed. By using the method of binary scoring, it may become possible for a single researcher to map significant portions of the genome of a simple eukaryote in a relatively short amount of time.

Data entry is simpler than in other methods in that it involves recording only the presence or absence of hybridization for each probe. As it is not necessary to digitize restriction profiles from single and double digests, data entry can be performed either from the keyboard or by clicking on spots corresponding to the site of hybridization on a graphically generated microtiter plate. It is even possible to allow for automated digitization from the autoradiograms themselves.

For this method, computational requirements are moderate. A VAXstation 2000 can perform the annealing procedure on 591 clones with 50 probes in a few hours. Computational time increases on average linearly with the number of clones per chromosome, as more clones correspond to higher clonal densities and a greater degree of overlap. It may be possible to use this procedure in higher organisms, whose repetitive DNA constitutes a larger portion of the genome, if the process is used to assemble regions known to lack repetitive DNA, and in other mapping problems (COX et al. 1990).

We wish to thank ROBERT IVARIE, HOWARD BRODY, MICHAEL CELLINO, ROBIN DEAN and two anonymous reviewers for the critical reading of the manuscript. We would also like to thank Y.-X. FU for much discussion on this topic. Many thanks to JOHN C. AVISE and MARY E. CASE for input into the project. Finally, we would like to thank MARJORIE ASMUSSEN, MICHAEL J. WEISE and the Biological Sequence/Structure Computational Facility at the University of Georgia for allocating the computer resources necessary to carry out this project. This work was supported by National Institutes of Health grant GM 42924.

LITERATURE CITED

ALBERTSON, D. G., 1985 Mapping muscle protein genes by using in situ hybridization using biotin labelled probes. EMBO J. 4: 2493-2498.

ARNOLD, J., A. J. CUTICCHIA, D. A. NEWSOME, W. W. JENNINGS and R. IVARIE, 1988 Mono- through hexanucleotide analysis of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res. 16: 7145-7158.

BARILLOT, E., J. DAUSSET and D. COHEN, 1991 Theoretical analysis of a physical mapping strategy using random single-copy landmarks. Proc. Natl. Acad. Sci. USA 88: 3917-3921.

BATES, G. P., M. E. MACDONALD, S. BAXENDALE, S. YOUNGMAN, C. LIN, W. L. WHALEY, J. J. WASMUTH, J. F. GUSELLA and H. LEHRACH, 1991 Defined physical limits of the Huntington disease gene candidate region. Am. J. Hum. Genet. 49: 7-16.

BRODY, H., J. GRIFFITH, A. J. CUTICCHIA, J. ARNOLD and W. E. TIMBERLAKE, 1991 Chromosome-specific recombinant libraries from the fungus Aspergillus nidulans. Nucleic Acids Res. 19: 3105-3109.

BURKE, D. T., G. F. CARLE and M. V. OLSON, 1987 Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236: 806-812.

COLLINS, F. S., M. L. DRUMM, J. L. COLE, W. K. LOCKWOOD, G. F. VANDE WOUDE and M. C. IANUZZI, 1987 Construction of a general human chromosome jumping library with application to cystic fibrosis. Science 235: 1046-1049.

COULSON, A., J. SULSTON, S. BRENNER and J. KARN, 1986 Toward a physical map of the genome of the nematode Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 83: 7821-7825.

COULSON, A., R. WATERSTON, J. KIFF, J. SULSTON and Y. KOHARA, 1988 Genome linking with yeast artificial chromosomes. Nature 335: 184-186.

COX, D. R., M. BURMEISTER, R. PRICE, S. KIM and R. M. MYERS, 1990 Radiation hybrid mapping: a somatic cell genetic method for reconstructing high-resolution maps of mammalian chromosomes. Science 250: 245-250.

CRAIG, A. G., D. NIZETIC, J. D. HOHEISEL, G. ZEHETNER and H. LEHRACH, 1990 Ordering of cosmid clones covering the Herpes simplex virus type I (HSV-I) genome: a test case for fingerprinting by hybridization. Nucleic Acids Res. 18: 2653-2659.

CUTICCHIA, A. J., J. ARNOLD and W. E. TIMBERLAKE, 1992a PCAP: probe choice and analysis package, a set of programs to aid in choosing synthetic oligomers for contig mapping. CABIOS (in press).

CUTICCHIA, A. J., J. ARNOLD and W. E. TIMBERLAKE, 1992b CMAP: contig mapping and analysis package, a relational database for chromosome reconstruction. CABIOS (in press).

CUTICCHIA, A. J., J. ARNOLD and W. E. TIMBERLAKE, 1992c ODS: ordering DNA sequences, a physical mapping algorithm based on simulated annealing. CABIOS (in press).

DRAPER, N., and H. SMITH, 1961 Applied Regression Analysis, Ed. 2. John Wiley & Sons, New York.

FAIGLE, U., and W. KERN, 1991 Note on the convergence of simulated annealing algorithms. SIAM J. Control Optim. 29

FU, Y.-X., W. E. TIMBERLAKE and J. ARNOLD, 1992 On the design of genome mapping experiments using short synthetic oligonucleotides. Biometrics 48: 337-359.

GEMAN, S., and D. GEMAN, 1984 Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6: 721-741.

GOLDSTEIN, L., and M. S. WATERMAN, 1987 Mapping DNA by stochastic relaxation. Advan. Appl. Math. 8: 194-207.

GREEN, E. D., and M. V. OLSON, 1990 Chromosomal region of the cystic fibrosis gene in yeast artificial chromosomes: a model for human genome mapping. Science 250: 94-98.

HOHEISEL, J. D., G. G. LENNON, G. ZEHETNER and H. LEHRACH, 1991 Use of high coverage reference libraries of Drosophila melanogaster for relational analysis. J. Mol. Biol. 220: 903-914.

KIRKPATRICK, S., C. D. GELATT and M. P. VECCHI, 1983 Optimi-

KOHARA, Y., K. AKIYAMA and K. ISONO, 1987 The physical map of the whole E. coli chromosome: application of a new strategy for rapid analysis and sorting of a large genomic library. Cell 50: 495-508.

LANDER, E. S., and M. S. WATERMAN, 1988 Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2: 231-239.

LEHRACH, H., 1990 Genetic and Physical Mapping, edited by K. E. DAVIES and S. M. TILGHMAN. CSH Press, Plainview, N.Y.

LUNDY, M., 1985 Applications of the annealing algorithm to combinatorial problems in statistics. Biometrika 72: 191-199.

MERRIAM, J., M. ASHBURNER, D. L. HARTL and F. C. KAFATOS, 1991 Toward cloning and mapping the genome of Drosophila. Science 254: 221-225.

METROPOLIS, N., A. W. ROSENBLUTH, M. N. ROSENBLUTH, A. TELLER and E. TELLER, 1953 Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087.

MICHIELS, F., A. G. CRAIG, G. ZEHETNER, G. P. SMITH and H. LEHRACH, 1987 Molecular approaches to genome analysis: a strategy for the construction of ordered overlapping clone libraries. CABIOS 3: 203-210.

NIZETIC, D., R. DRMANAC and H. LEHRACH, 1991 An improved bacterial colony lysis procedure enables direct DNA hybridization using short (10, 11 bases) oligonucleotides to cosmids. Nucleic Acids Res. 19: 182.

OLSON, M. V., J. E. DUTCHIK, M. Y. GRAHAM, G. M. BRODEUR, C. HELMS, M. MACCOLLIN, R. SCHEINMAN and M. FRANK, 1986 Random-clone strategy for genomic restriction mapping in yeast. Proc. Natl. Acad. Sci. USA 83: 7826-7830.

PEARSON, D. L., B. MAIDAK, M. CHIPPERFIELD and R. ROBBINS, 1991 The human genome initiative-do datamaps reflect current progress? Science 254: 214-215.

PHILLIPS, G. J., J. ARNOLD and R. IVARIE, 1987a Mono- through hexanucleotide composition of the E. coli genome: a Markov chain analysis. Nucleic Acids Res. 15: 2611-2626.

PHILLIPS, G. J., J. ARNOLD and R. IVARIE, 1987b The effect of codon usage on the oligonucleotide composition for E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis. Nucleic Acids Res. 15: 2627-2638.

POUSTKA, A., and H. LEHRACH, 1986 Jumping libraries and linking libraries: the next generation of molecular tools in mammalian genetics. Trends Genet. 2: 174-179.

POUSTKA, A., T. POHL, D. P. BARLOW, A. M. FRISCHAUF and H. LEHRACH, 1987 Chromosome construction by use of human chromosome jumping libraries from NotI-digested DNA. Nature 325: 353-355.

PRESS, W. H., B. P. FLANNERY, S. A. TEUKOLSKY and W. T. VETTERLING, 1986 Numerical Recipes, the Art of Scientific Computing. Cambridge University Press, Cambridge.

SMITH, C. L., and R. D. KOLODNER, 1988 Mapping of Escherichia coli chromosomal Tn5 and F insertions by pulsed field gel electrophoresis. Genetics 119: 227-236.

SMITH, C. L., J. G. ECONOME, S. SCHUTT, S. KLCO and C. R. CANTOR, 1987 A physical map of the Escherichia coli K12 genome. Science 236: 1448-1453.

SULSTON, J., F. MALLETT, R. STADEN, R. DURBIN, T. HORNSNELL and A. COULSON, 1988 Software for genome mapping by fingerprinting techniques. CABIOS 4: 125-132.

TAVARE, S., and B. W. GIDDINGS, 1989 Mathematical Methods for DNA Sequence Analysis, edited by M. S. WATERMAN. CRC Press, Boca Raton, Fla.

TIMBERLAKE, W. E., 1978 Low repetitive DNA content in Aspergillus nidulans. Science 202: 773-775.

WEIR, B. S., 1990 Genetic Data Analysis. Sinauer, Sunderland, Mass.

WINER, B. J., 1971 Statistical Principles in Experimental Design, Ed. 2. McGraw-Hill, New York.