Residue frequency difference - Use of neural networks to model molecular structure and function

Figure 3.6 Comparison of neural network weights and the residue frequency difference: The residue frequency difference is plotted against the neural network weights. Each of the weights corresponds to a particular amino acid at a certain position. The weight is plotted against the number of times that amino acid appears at that position in the ATP/GTP-binding proteins minus the number of times that amino acid appears at that position in the non-binding proteins.

weights define the non-binding sequences. The weight matrix of the neural network, given in Table 3.4, was generated by running the network 100 times, training on all 349 sequences, so there were no randomly high weights. No clear pattern was deducible from the weights.

In Figure 3.5, the frequency of each amino acid at each position in the ATP/GTP-binding proteins (the residue frequency) is plotted against the corresponding neural network weight. There are 20 different amino acids and 17 different positions, so there are 340 points. Figure 3.6 is a similar plot, but the residue frequency of the non-binding proteins has been subtracted from the residue frequency of the binding proteins, to give the residue frequency difference. The correlation coefficient in Figure 3.5 is 0.22 and in Figure 3.6 this rises to 0.36. The higher correlation coefficient for Figure 3.6 indicates that the weights of the neural network are more strongly correlated with the residue frequency difference than with the residue frequency. This suggests that the neural network has incorporated information about the non-binding sequences, as well as the binding sequences, although, as neither coefficient is large, it also suggests that the neural network approach and the statistical method are not identical.
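The comparison plotted in Figures 3.5 and 3.6 can be sketched in a few lines. The following is a minimal illustration, not the code used in this work: the function names, the toy three-residue sequences and the hand-written Pearson coefficient are all assumptions made for the example.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def residue_counts(sequences, length):
    """Count how often each amino acid occurs at each position."""
    counts = {(aa, pos): 0 for aa in AMINO_ACIDS for pos in range(length)}
    for seq in sequences:
        for pos, aa in enumerate(seq[:length]):
            if (aa, pos) in counts:
                counts[(aa, pos)] += 1
    return counts

def frequency_difference(binding, non_binding, length):
    """Counts in the binding set minus counts in the non-binding set."""
    b = residue_counts(binding, length)
    n = residue_counts(non_binding, length)
    return {key: b[key] - n[key] for key in b}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Toy data (hypothetical): two "binding" and one "non-binding" window.
diff = frequency_difference(["GKS", "GKT"], ["AKS"], length=3)
```

For the 17-residue window described above, `frequency_difference` would return 20 x 17 = 340 values, one per network weight, and `pearson` would then be applied to those values and the corresponding weights.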

3.5 Conclusion

The results show that the neural network is slightly worse than the alignment by homology method, although comparable. It would appear that, if sufficient trouble is taken over developing a statistical method, it will perform as well as, or better than, a simple two-layer feed-forward neural network. Perceptron-type neural networks can only distinguish linearly separable functions. Thus, if the data are represented in two dimensions, a perceptron can classify the data only if it is possible to separate the two classes by a straight line. More involved architectures may perform better, but the development of these is problematic in the case of motif recognition, because the small number of known protein structures limits the number of weights that may be used in the neural network. Two-layer feed-forward networks have been used in other applications and their ‘success’ proclaimed as vindication of the neural network approach, when, in fact, alternative statistical methods have not been tried on the same data.
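The linear separability limit can be illustrated with a toy example. The sketch below is an illustration under assumed names (`train_perceptron`, `AND`, `XOR`), not the network used in this work. It trains a single threshold unit with the classic perceptron rule on two Boolean functions: AND, which a straight line can separate, and exclusive-or, which it cannot.

```python
def train_perceptron(samples, epochs=100, rate=1.0):
    """Train one threshold unit; return (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - output
            if delta != 0:
                # Perceptron rule: move the line towards the misclassified point.
                errors += 1
                w[0] += rate * delta * x1
                w[1] += rate * delta * x2
                b += rate * delta
        if errors == 0:
            return w, b, True   # every sample classified correctly
    return w, b, False          # no separating line found in the epoch budget

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = train_perceptron(AND)[2]
xor_converged = train_perceptron(XOR)[2]
```

Training converges on AND but not on XOR, however many epochs are allowed, because no single straight line separates the two XOR classes.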

Studies that rigorously compare the performance of a neural network with the performance of another state-of-the-art statistical method on the same data set are required to assess the utility of the neural network approach. The assessment is further complicated when one realises that there are many empirical parameters in the neural network approach (aside from the weights) that can be optimised. Some of these parameters include: the window size, the number of hidden units, the learning rate, the sampling of the data, the ratio of true to false examples, and the definition of convergence. While the optimisation of these parameters may be permissible, the final choices are sometimes just those that give the best result. It is also unclear, in some cases, whether the optimisation has been performed on the test or training set. The test set and training set must be clearly distinct and there should be no homology between the two. The existence of homology between the test and training sets can increase the performance of the network, because the network learns homology rules as well as detecting the general features that are of interest.
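One simple way to keep the two sets distinct is to discard test sequences that are too similar to any training sequence. The sketch below assumes equal-length, pre-aligned, non-empty sequence windows and uses fractional identity as a crude stand-in for homology. The function names and the 30% cut-off are hypothetical choices for illustration, not values taken from this work.

```python
def identity(a, b):
    """Fraction of identical positions in two equal-length aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def non_homologous(test_set, training_set, max_identity=0.3):
    """Keep only test sequences that resemble no training sequence.

    max_identity is an assumed threshold; a real study would choose it
    from an alignment-based definition of homology.
    """
    return [
        t for t in test_set
        if all(identity(t, s) <= max_identity for s in training_set)
    ]

# Toy data (hypothetical): "GKS" is too close to the training window "GKT".
kept = non_homologous(["GKS", "AAA"], ["GKT"], max_identity=0.5)
```

Any parameter optimisation would then be carried out on the training set alone, with the filtered test set used only for the final assessment.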

In the application of neural networks to protein structure problems, the attraction must be the possibility that a neural network with hidden units will be able to extract higher than first order information. Neural networks should be able to learn rules including complex conditional statements, such as 'the secondary structure is predicted to be helical if either leucine or valine are neighbours to the residue, but random coil if they are both neighbours'. Since rules similar to this one are very relevant for secondary structure prediction schemes, it has been hoped that the hidden unit layer would be important for such problems. However, several workers have reported that the database is simply too small for second order features to be exhibited as general features (Qian and Sejnowski, 1988; Rooman and Wodak, 1988). Thus, at present, neural networks are only able to use the first order information that other predictive methods use, and there is no evidence to suggest that they can use this information better than these other methods. For example, there are now several methods for predicting protein secondary structure that perform as well as the neural network approach (Gibrat et al., 1987; King and Sternberg, 1990; Ptitsyn and Finkelstein, 1989).

Nucleic acid sequence analysis by neural networks appears to have been more successful than protein applications. The detection and prediction by neural networks of translational initiation sites (Stormo et al., 1982) and promoter regions in E. coli (Nakata et al., 1988; Lushakin et al., 1989; O'Neill, 1991; Demeler et al., 1991), splice junctions (Nakata et al., 1985; Brunak et al., 1991) and coding regions in human DNA (Uberbacher and Mural, 1991) are significantly better than other statistical approaches. This may in part be due to the greater amount of nucleic acid sequence data and the fact that the bases of DNA can be encoded into a network with only four bits, whereas the encoding of amino acids requires twenty bits. The approach of Uberbacher and Mural (1991) suggests that the indirect analysis of protein sequences by neural networks may be more successful than the direct analyses so far, which have been limited by the size of the database being insufficient to exhibit, in a generalisable way, information higher than first order.
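The difference in input size follows directly from a one-bit-per-symbol (one-hot) encoding. The sketch below is an illustration only: the function name `one_hot` and the alphabet ordering are assumptions, and the 17-position window is borrowed from the motif example earlier in this chapter.

```python
DNA_BASES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence, alphabet):
    """Encode each symbol as len(alphabet) bits with exactly one bit set."""
    bits = []
    for symbol in sequence:
        if symbol not in alphabet:
            raise ValueError(f"unknown symbol: {symbol!r}")
        bits.extend(1 if symbol == letter else 0 for letter in alphabet)
    return bits

window = 17  # window length assumed for the comparison
dna_inputs = len(one_hot("A" * window, DNA_BASES))        # 17 * 4 = 68
protein_inputs = len(one_hot("A" * window, AMINO_ACIDS))  # 17 * 20 = 340
```

For the same window, a protein network has five times as many input weights to fit as a DNA network, so it needs correspondingly more training data.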

The use of aligned sequences in neural network applications, as outlined in this chapter, has been developed in more recent work by Frishman and Argos (1992) on motif recognition and Rost and Sander (1993a, 1993b) on secondary structure prediction. It is difficult to tell how much the success of these studies is due to the use of aligned sequences and how much is due to the use of neural networks. In applications to problems in protein sequence analysis, it is still unclear if neural networks have yielded significant improvements over other current methodologies. However, the success of the approach in the analysis of nucleic acid sequences and in other fields suggests that, as the protein structure database grows and perhaps with the incorporation of other information, the power of neural networks may be more fully exploited in the analysis of protein sequences.

Chapter 4

Quantitative structure-activity relationships: neural networks and inductive logic programming compared to statistical methods.

The inhibition of dihydrofolate reductase by pyrimidines

4.1 Synopsis

Recent innovations in methodology and in data representation applied to quantitative structure-activity relationship (QSAR) analysis have been evaluated. Neural networks and inductive logic programming (ILP) have been compared to the traditional statistical techniques of linear regression, nearest neighbour algorithms and decision trees. A new representation of drugs by physicochemical attributes (PCAs) (King et al., 1992) has been extended to cover more substituents. The PCA representation has been benchmarked against the widely used Hansch parameters for the QSAR of the inhibition of E. coli dihydrofolate reductase (DHFR) by 2,4-diamino-5-(substituted-benzyl)pyrimidines, and, in the subsequent chapter, the inhibition of rodent DHFR by 2,4-diamino-6,6-dimethyl-5-phenyl-dihydrotriazines. Neural networks and ILP perform better than the traditional statistical techniques on the PCA representation, but the difference is not statistically significant. The PCA representation does not consistently give more accurate QSARs than Hansch parameters, but does allow the formulation of rules relating the activity of the inhibitors to their chemical structure.

4.2 Introduction

In this chapter, the recent innovations in method and data representation in QSAR analysis have been assessed, using the inhibition of E. coli dihydrofolate reductase (DHFR) by 2,4-diamino-5-(substituted-benzyl)pyrimidines (Figure 4.1). Neural networks and inductive logic programming (ILP) have been compared against the traditional statistical techniques of linear regression, nearest neighbour algorithms and decision trees. A new representation of drugs by PCAs (King et al., 1992) has been extended to cover many more substituents, for use in QSAR analysis. This chapter and the subsequent chapter provide a thorough comparison of neural networks and ILP against traditional statistical methods in QSAR, and within this framework, the new PCA representation has been evaluated. The work presented here on ILP and decision trees was done by Dr. Ross King.

4.3 Methods

4.3.1 Data

Figure 4.1 Model of trimethoprim bound to DHFR, displayed using the graphics program PREPI by Dr. Suhail Islam.

The data used in the study were 74 2,4-diamino-5-(substituted-benzyl)pyrimidines (1), whose substituents are listed in Table 4.1. These data come from several sources. The first 44 drugs (numbers 1 - 44 in Table 4.1) have been analysed by linear regression (Hansch et al., 1982); these 44 and 11 more from Roth and coworkers (Roth et al., 1981; Roth et al., 1987) (numbers 1 - 55) were used in an ILP study (King et al., 1992); 43 of the 44, and another 25, were used in a more recent linear regression study (Selassie et al., 1991) (numbers 2 - 44, and numbers 56 - 74) and in a subsequent neural network analysis (So and Richards, 1992). Six of the 25 have not been included in this comparison, because the complete set of Hansch parameters for the substituents was not available at the time of this study. Biological activities had been measured by the association constant to DHFR from MB 1428 E. coli (Li et al., 1982). The substituents of the phenyl ring vary in the 3-, 4- and 5- positions. Not only has this reasonably large data set been extensively studied by QSAR methods, but there are also crystallographic studies of the complex formed between trimethoprim (2,4-diamino-5-(3,4,5-trimethoxybenzyl)pyrimidine) and DHFR from E. coli (Champness et al., 1986; Matthews et al., 1985). It is possible therefore, in this test case.

Index No.   Activity (logKi)   Substituent
                               3-               4-               5-
01          3.04               Œ                H                CH
02          5.60               H                O(CH2)6CH3       H
03          6.07               H                O(CH2)5CH3       H
04          6.18               H                H                H
05          6.20               H                NO2              H
06          6.23               F                H                H
07          6.25               O(CH2)7CH3       H                H
08          6.28               CH2OH            H                H
09          6.30               H                NH2              H
10          6.31               CH2OH            H                CH2OH
11          6.35               H                F                H
12          6.39               O(CH2)6CH3       H                H
13          6.40               H                OCH2CH2OCH3      H
14          6.45               H                a                H
15          6.46               Œ                Œ                H
16          6.47               OH               H                H
17          6.48               H                CH3              H
18          6.53               OCH2CH2OCH3      H                H
19          6.55               CH2O(CH2)3CH3    H                H
20          6.57               OCH2CONH2        H                H
