The choice of program to use for finding genes in genomic sequence depends on a number of considerations. The reliability of the program’s output is obviously an important consideration. Ideally, a program would predict all exons in a gene (have high sensitivity), and would not predict any exons that were not part of a gene (have high specificity). However, no perfect system has yet been designed! Increasing the sensitivity of a program usually decreases its specificity (i.e. more true positives means more false positives) and each program has attempted to come to a compromise between these two measures. The two measures can be combined to give one measure of accuracy (the ‘correlation coefficient’ or CC of Burset & Guigo 1996), either at the nucleotide level or at the exon level. A lab-based investigator may prefer to have high specificity at the expense of lower sensitivity, as no computer-based prediction can be believed without supplementary lab-work, and the investigation of false positive predictions is time-consuming and costly, whereas exons missed by prediction programs would be easily discovered while confirming the true positive predictions. Some programs (e.g. GRAIL and SORFIND) give a classification of the predicted exons according to how confident the prediction is (GRAIL predicts ‘excellent’, ‘good’ and ‘marginal’ exons), which is useful when deciding which predictions to test experimentally.
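To make the three measures concrete, the following minimal sketch computes them at the nucleotide level from a per-base comparison of annotated and predicted coding status. It assumes the definitions I understand Burset & Guigo (1996) to use: sensitivity is TP/(TP+FN), specificity is TP/(TP+FP) (the proportion of predicted coding bases that are truly coding), and CC is the usual correlation coefficient over the four counts. The function name and input representation are illustrative only and are not taken from any of the programs discussed.

```python
from math import sqrt


def nucleotide_accuracy(actual, predicted):
    """Return (sensitivity, specificity, CC) at the nucleotide level.

    actual, predicted: equal-length sequences of truthy/falsy values,
    one per base, where a truthy value marks the base as coding.
    A measure whose denominator is zero is returned as None.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1          # coding base predicted as coding
        elif a and not p:
            fn += 1          # coding base missed
        elif p:
            fp += 1          # non-coding base predicted as coding
        else:
            tn += 1          # non-coding base predicted as non-coding

    sensitivity = tp / (tp + fn) if (tp + fn) else None
    # Assumed gene-finding convention: specificity = TP / (TP + FP).
    specificity = tp / (tp + fp) if (tp + fp) else None

    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else None
    return sensitivity, specificity, cc


# Toy example: 10 bases, 4 of them truly coding.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
sn, sp, cc = nucleotide_accuracy(actual, predicted)
```

In this toy case one coding base is missed and one non-coding base is wrongly predicted, so sensitivity and specificity are both 0.75, while CC is lower because it also takes the true negatives into account.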
Comparison of the accuracy of the different programs is difficult, as each has its own strengths and weaknesses (e.g. GRAIL often fails to predict small exons but GenLang claims to succeed for small exons, whereas GRAIL is better for larger exons than GenLang). Burset & Guigo (1996) used a large test set of new vertebrate sequences (eliminating those with similarity to sequences previously in the database, to minimise overlap with the training sets) to assess the performance of various gene finding programs by a variety of measures (sensitivity, specificity and CC at both nucleotide and exon levels). They found that of an older generation of programs (FGENEH, GeneID, GeneParser2, GenLang, GRAIL2, SORFIND and Xpound), GRAIL2 was amongst the most accurate (despite the fact that all predicted exons were tested and not just the ‘excellent’ predictions), but that two more recently introduced programs (GeneID+ and GeneParser3) performed better than GRAIL2. The version of GRAIL2 which incorporates GAPIII, the gene model assembly program, was not assessed. The authors of GRAIL2 claim that the GAP algorithm improves the accuracy of GRAIL2’s predictions so that 98% of exons in a gene model at least partially match real exons, but it is hard to assess this in comparison to other programs without testing on the same test set of sequences. Some of the programs are updated regularly, either in the algorithm or in the training set used, so assessments of accuracy quickly become out of date.
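The difference between an exon-level assessment and a claim that exons "at least partially match" real exons can be illustrated with a short sketch. It assumes that, at the exon level, a predicted exon counts as correct only if both of its boundaries agree exactly with an annotated exon, and it treats a partial match as an overlap of at least one base. Exons are represented as hypothetical (start, end) pairs with inclusive coordinates; none of this is taken from the GRAIL2 or GAP implementations.

```python
def exon_level_accuracy(actual_exons, predicted_exons):
    """Return (sensitivity, specificity) counting only exact exon matches.

    Exons are (start, end) tuples with inclusive coordinates.
    A measure whose denominator is zero is returned as None.
    """
    actual = set(actual_exons)
    predicted = set(predicted_exons)
    exact = len(actual & predicted)  # both boundaries must agree
    sensitivity = exact / len(actual) if actual else None
    specificity = exact / len(predicted) if predicted else None
    return sensitivity, specificity


def fraction_partially_matching(actual_exons, predicted_exons):
    """Fraction of predicted exons overlapping any real exon by >= 1 base."""
    if not predicted_exons:
        return None

    def overlaps(a, b):
        # Inclusive intervals overlap when each starts before the other ends.
        return a[0] <= b[1] and b[0] <= a[1]

    hits = sum(
        any(overlaps(p, a) for a in actual_exons) for p in predicted_exons
    )
    return hits / len(predicted_exons)


# Toy example: one exact match, one wrong 5' boundary, one spurious exon.
actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (310, 400), (700, 800)]
exact_sn, exact_sp = exon_level_accuracy(actual, predicted)
partial = fraction_partially_matching(actual, predicted)
```

Here only one of three predictions is exactly correct, yet two of three partially match a real exon, which is why a partial-match figure cannot be compared directly with exon-level sensitivity and specificity reported for other programs.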
The species the query sequence is from is important - some gene finding programs perform better on sequences from particular species, often because the training set used was species-specific (e.g. GRAIL’s published training set consisted of only human sequences but GenLang used a mixture of sequences from different vertebrates). GRAIL2 is now trained with datasets from five different organisms - human, mouse, Drosophila, Arabidopsis and E. coli - which will make it more useful for gene prediction in non-human sequences. The authors of GenLang (Dong & Searls 1994) noted that performance was significantly impaired when the query sequence was from an organism very different from that used in the training set, presumably because of species-specific differences between features of genes - e.g. on average, Drosophila introns are much smaller than mammalian introns and use subtly different splicing signals (Mount et al. 1992). The nature of the training set may also have some effect within a species whose genome is not homogeneous - genes found in GC-rich isochores may have different statistical characteristics (codon usage, etc.) from genes in GC-poor regions, genes transcribed at different levels or in different cell types may utilise different signals, and there may be subclasses of genes with different codon usage. Some algorithms (e.g. GRAIL2, Xu et al. 1994) take the GC content of the isochore and of the potential coding sequence into account in their coding potential predictions.
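As an illustration of the kind of quantity involved, the sketch below computes GC content for a whole sequence and in fixed windows along it, so that GC-rich and GC-poor stretches of a cosmid-sized sequence can be told apart. It is not GRAIL2's method, whose details are not reproduced here; the function names, window size and step are hypothetical choices made only for this example.

```python
def gc_content(seq):
    """Return the G+C fraction of seq, ignoring non-ACGT characters.

    Returns None if the sequence contains no unambiguous bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def windowed_gc(seq, window, step):
    """Return a list of (start, gc_fraction) for windows along seq.

    Only full-length windows are reported; start is a 0-based offset.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy example: a GC-rich half followed by an AT-rich half.
profile = windowed_gc("GGGGCCCCAAAATTTT", window=8, step=8)
```

A program that adjusts its coding potential estimate by regional base composition would need some measure of this kind for both the surrounding region and the candidate exon itself.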
For this project, GRAIL2 (implemented using xGRAIL) was used for every cosmid analysed so far. Although some other programs claim better accuracy of prediction than GRAIL, the slight reduction in accuracy of predictions is counterbalanced by its speed, its availability at the HGMP Resource Centre, its ease of use and its many useful features (gene model assembly, ease of database searches with predicted features and prediction of CpG islands, promoters, polyadenylation sites, simple repeats and repetitive elements). The results file used to store the GRAIL output and to produce the X-Windows-based graphical output is also in an easily read format which I have found useful for summarising and annotating results.