

CHAPTER 4: C53 CDNA CLONING, BIOINFORMATICS AND

4.1.2 The use of bioinformatics to assign function to unknown proteins

The recent flood of data from genome sequencing and functional genomics has given rise to the field of bioinformatics, which combines elements of biology and computer science. The field takes this new information (in the form of DNA and protein sequences) and applies ‘informatics’ techniques (derived from disciplines such as applied mathematics, computer science and statistics) to organise and understand it on a large scale. It concentrates primarily on the large datasets now available in molecular biology, such as genome sequences and the results of functional genomics experiments (e.g. expression data). By employing a wide range of computational techniques, including sequence and structural alignment, database design, data mining, prediction of protein structure and gene finding, the field tries to integrate a large variety of computational methods and heterogeneous data sources to produce searchable databases against which unknown sequences can be compared (Luscombe et al., 2001). The study described here is based on the use of these databases to screen sequences derived from a cDNA library for novel clones (see chapter 3) using the NCBI-based BlastN and BlastX programs. Because there is no direct functional data on C53, this chapter aims to take the raw sequence and, using various web-based bioinformatics programs, begin to describe potential structural elements of the protein that may give valuable clues about its function.

Chapter 4 cDNA, bioinformatics, expression

One of the most informative tools on the web is SWISS-PROT, which can be found on the ExPASy Molecular Biology Server (http://www.expasy.ch/). SWISS-PROT is a protein sequence database that aims to provide a high level of annotation (such as a description of the function of a protein, its domain structure, post-translational modifications, variants etc.), a minimal level of redundancy and a high level of integration with other databases. TrEMBL is a computer-annotated supplement to SWISS-PROT that contains all of the translations of EMBL nucleotide sequence entries not yet integrated into SWISS-PROT. Its links with other database sources, such as Pfam (a protein family database from the Sanger Centre, http://www.sanger.ac.uk/Software/Pfam/; Bateman et al., 2000), mean that any potential domains and protein family members can be easily identified. The most informative database entry for C53 (the C53 human orthologue, protein Accession No. NP_071437) indicated that it may be an integral membrane protein. This hint about potential protein structure can be investigated further using the programs SignalP, TMpred and TMHMM.

SignalP (http://www.cbs.dtu.dk/services/SignalP) is a web-based signal peptide prediction server that predicts the presence or absence of a signal peptide in the first 50-70 amino acids of the protein in question. Signal peptides control the entry of virtually all proteins to the secretory pathway, in both eukaryotes and prokaryotes (Gierasch, 1989), and comprise the N-terminal part of the amino acid chain, which is cleaved off when the protein is translocated through the membrane. A strong interest in the automated identification of signal peptides (mainly due to the huge amount of unprocessed data available and the industrial need to find more effective vehicles for the production of proteins in recombinant systems) has led to a neural network approach to the identification of signal peptides and their cleavage sites.

In essence, the SignalP server returns three scores between 0 and 1 for each position in the first 50-70 amino acids of the sequence. The C-score (raw cleavage site score) is the output from networks trained to recognise cleavage sites vs. other sequence positions; these networks are trained to give a high score at position +1 (immediately after the cleavage site) and a low score at all other positions. The S-score (signal peptide score) is the output from networks trained to recognise signal peptide vs. non-signal peptide positions; these networks are trained to give a high score at all positions before the cleavage site and a low score at thirty positions after the cleavage site and in the N-terminus of non-secretory proteins. The Y-score (combined cleavage site score) optimises the cleavage site prediction by combining the C- and S-scores: the cleavage site is expected where the C-score is high and the S-score changes from a high to a low value, and the Y-score formalises this by comparing the height of the C-score with the slope of the S-score. All three scores are averages of five networks trained on different partitions of the data. For each sequence, SignalP reports the maximal C-, S- and Y-scores and the mean S-score between the N-terminus and the predicted cleavage site. These scores are used to predict the presence of a signal peptide and, if one is present in the sequence, the cleavage site is predicted to lie immediately before the position with the maximal Y-score (Nielsen et al., 1999; Nielsen et al., 1997).
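The way a combined score can bring the two signals together may be easier to follow with a small numerical sketch. The Python below is illustrative only and is not the SignalP implementation: it assumes that the combined score is the geometric mean of the C-score and the drop in mean S-score across each position, and the window size `d`, the function names and the toy score profiles are all hypothetical.

```python
import math


def y_scores(c, s, d=3):
    """Combine per-position C- and S-scores into a Y-like score.

    Assumption: Y at position i is sqrt(C_i * slope_i), where slope_i is
    the mean S-score of the d positions before i minus the mean S-score
    of the d positions starting at i. Rising S-scores are clamped to 0.
    """
    y = [0.0] * len(c)
    for i in range(len(c)):
        before = s[max(0, i - d):i]
        after = s[i:i + d]
        if not before or not after:
            continue  # no slope can be computed at the very start
        slope = sum(before) / len(before) - sum(after) / len(after)
        y[i] = math.sqrt(c[i] * max(slope, 0.0))
    return y


def predict_cleavage(c, s, d=3):
    """Return (index of maximal Y, maximal Y, mean S before that index)."""
    y = y_scores(c, s, d)
    best = max(range(len(y)), key=y.__getitem__)
    mean_s = sum(s[:best]) / best if best > 0 else 0.0
    return best, y[best], mean_s


# Toy profiles (hypothetical): S is high for 20 residues and then drops,
# and C has a single peak at the first position after the drop.
s_toy = [0.9] * 20 + [0.1] * 10
c_toy = [0.05] * 30
c_toy[20] = 0.8

best, y_max, mean_s = predict_cleavage(c_toy, s_toy)
```

With these toy profiles the maximal combined score falls at index 20 (counting from 0), so the cleavage site would be placed immediately before that position, that is, after the twentieth residue.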


SignalP cannot distinguish between secreted molecules and membrane-bound molecules, so it is crucial to scan the sequence for trans-membrane domains to confirm that C53 has a similar structure to its human orthologue and is indeed integrated into the cell membrane. The TMpred program (http://www.ch.embnet.org/software/TMpred_form.html) (Hofmann and Stoffel, 1993) predicts membrane-spanning regions and their orientation. The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring trans-membrane proteins. TMbase is mainly based on SWISS-PROT and was initially intended to be a tool for analysing the properties of trans-membrane domain proteins. It now holds all the data in a form that can be easily queried, allowing trans-membrane domains to be identified in novel sequences.

As well as TMpred, a newer method for the prediction of trans-membrane domains is used. TMHMM (http://www.cbs.dtu.dk/services/TMHMM) is based on a hidden Markov model (HMM, described in Sonnhammer et al., 1998). Early methods for the prediction of trans-membrane (TM) domains used hydrophobicity analysis alone (Argos et al., 1982). However, another signal shown to be important for the identification of TM proteins is the abundance of positively charged residues on the cytoplasmic side of the membrane, the “positive inside” rule (von Heijne, 1994). By combining this charge bias with hydrophobicity analysis, better TM predictions can be obtained (von Heijne, 1992). Helical membrane proteins also follow a grammar in which the cytoplasmic and non-cytoplasmic loops have to alternate. TMHMM takes into account hydrophobicity, charge bias, grammatical constraints and also helix length (which has only been crudely modelled by other prediction programs). A detailed analysis of TMHMM performance has been carried out and shows that it correctly predicts 97-98% of TM helices and can discriminate between soluble and membrane proteins with both specificity and sensitivity better than 99% (Krogh et al., 2001).
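The two simpler signals described above, hydrophobicity and the “positive inside” charge bias, can be illustrated with a short Python sketch. This is not the algorithm used by TMpred or TMHMM. It assumes the Kyte-Doolittle hydropathy scale, a 19-residue sliding window and a threshold of 1.6 (common conventions that are not taken from this chapter), and the function names, flank length and toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; standard amino acids only,
# so any other letter in a sequence raises KeyError).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_means(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    return [
        sum(KD[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_helices(seq, window=19, threshold=1.6):
    """Return one (start, end) span per run of windows above the threshold.

    Spans are 0-based and half-open; each run is represented by its most
    hydrophobic window.
    """
    means = window_means(seq, window)
    helices, run = [], []
    # The -inf sentinel closes a run that reaches the end of the sequence.
    for i, mean in enumerate(means + [float("-inf")]):
        if mean >= threshold:
            run.append(i)
        elif run:
            best = max(run, key=lambda j: means[j])
            helices.append((best, best + window))
            run = []
    return helices


def cytoplasmic_side(seq, span, flank=15):
    """Guess the cytoplasmic flank by counting Lys and Arg on each side."""
    start, end = span
    n_pos = sum(seq[max(0, start - flank):start].count(aa) for aa in "KR")
    c_pos = sum(seq[end:end + flank].count(aa) for aa in "KR")
    if n_pos == c_pos:
        return "undetermined"
    return "N-terminal flank" if n_pos > c_pos else "C-terminal flank"


# Toy sequence (hypothetical): a basic N-terminal flank, a 19-residue
# leucine stretch and an acidic C-terminal flank.
toy = "MDDKRKRDDD" + "L" * 19 + "DDEESSDDDD"
helices = candidate_helices(toy)
sides = [cytoplasmic_side(toy, h) for h in helices]
```

A hydrophobicity scan of this kind says nothing about helix length or about the alternation of cytoplasmic and non-cytoplasmic loops, which are the additional constraints that TMHMM is described as modelling.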