Development of Integrated Protein Analysis Tool
Poorna Satyanarayana Boyidi*, Srinivas Kudipudi**, Sridhar GR***, Allam Appa Rao****, Mohan Rao CPVNJ**
* Chaitanya Engineering College, Visakhapatnam
** Avanthi Institute of Engineering & Technology, Makavarapalem, Visakhapatnam
*** Endocrine and Diabetes Centre, Visakhapatnam
**** JNTU, Kakinada, India
Abstract
We present an “Integrated Protein Analysis Tool” (IPAT) that performs the following tasks in segregating and annotating genomic data. Protein Editor: enables the entry of nucleotide or amino acid sequences. Utilities: IPAT converts a given nucleotide sequence to the equivalent amino acid sequence. Secondary Structure Prediction: available through three algorithms (GOR-I, the Gibrat method and DPM, the Double Prediction Method), with graphical display. Profiles and Properties: calculates eight physico-chemical profiles and properties, viz. Hydrophobicity, Hydrophilicity, Antigenicity, Transmembranous regions, Solvent Accessibility, Molecular Weight, Absorption factor and Amino Acid Content. IPAT also has a provision for viewing the Helical-Wheel Projection of a selected region of a given protein sequence and a 2D representation of the alpha-carbon. IPAT was modeled using UML (Unified Modeling Language), coded in Java, and subjected to unit testing, path testing, and integration testing.
This project mainly concentrates on Butyrylcholinesterase, predicting its secondary structure and its physico-chemical profiles and properties.
Key words: nucleotides, amino acids, secondary structure, prediction, in silico, bioinformatics
Introduction
With rapid identification of nucleotide sequences and proteins of unknown characteristics, methods are required to annotate their structure and possible functions (1). Using the method of homology, comparison among nucleotide or amino acid sequences can be used to infer relationships and possible functional roles. We have developed a tool that can be employed to compare primary sequences (nucleotides or amino acids), predict the secondary structure of amino acid sequences as well as display characteristics of the secondary structure, in an integrated portal.
Method
We employed Java v 1.5.0 to develop the tool.
3.3 SPECIFIC REQUIREMENTS
3.3.1 External Interfaces
The section gives the detailed description of all inputs into and outputs from the software system.
Name of Input Item : FASTA Protein file
Source of Input : from NCBI website
Destination of Output : Local System Hard Disk
Name of Output Item : .Structure Files
3.3.2 Function Requirements
3.3.2.1 Exact Sequence of Operations
The sequence of operations is:
• Validate the given FASTA file
• Select an algorithm to perform secondary structure prediction
• Load all initial structure values into buffers
• For each amino acid, perform the prediction over a window of the specified size
3.3.2.2 Relationship of outputs to inputs
Input/Output Sequences
Input : FASTA File
Process : Secondary Structure Prediction
Output : predicted structure of the sequence
Input : FASTA file
Process : Profile calculation
Output : all profiles, such as Hydrophobicity, Hydrophilicity, Transmembranous regions and Solvent Accessibility
Input : FASTA file
Process : Properties calculation
Output : molecular weight and absorption factor of the sequence
Representative code for determining the properties of the protein is given below. The remaining code is given in Appendix 1. UML diagrams are given in Appendix 2, testing details in Appendix 3, and screen shots in Appendix 4.
properties.java
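import java.io.*;

// Amino acid content of the loaded sequence.
// seqlen, ipseq and get_eq_num() are inherited from pssp (see Appendix 1).
class properties extends pssp {
    int[] counts = new int[20];   // one counter per standard amino acid

    public properties() {}

    public properties(String s) {
        super(s);
    }

    // The method header is missing from the source listing; the name
    // getAminoAcidCounts is a reconstruction, not the original.
    public int[] getAminoAcidCounts() {
        counts = new int[20];     // reset so repeated calls do not accumulate
        for (int i = 0; i < seqlen; i++)
            counts[get_eq_num(ipseq[i])]++;
        int[] array = new int[20];
        for (int i = 0; i < 20; i++) {
            array[i] = counts[i];
            System.out.println("" + counts[i]);
        }
        return array;
    }
}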
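Molecular weight is one of the properties the tool reports. The sketch below shows one common way to estimate it: sum the average residue masses and add one water molecule for the free termini. The class name MolecularWeight, the method name of, and the mass table are assumptions made for illustration (widely tabulated average masses in daltons); this is not the IPAT code, which is in Appendix 1.

```java
// Hypothetical helper, not part of IPAT: estimates molecular weight in daltons.
final class MolecularWeight {
    // One-letter codes and their average residue masses (assumed standard values).
    private static final String RESIDUES = "ARNDCQEGHILKMFPSTWYV";
    private static final double[] RESIDUE_MASS = {
        71.0788, 156.1875, 114.1038, 115.0886, 103.1388,
        128.1307, 129.1155, 57.0519, 137.1411, 113.1594,
        113.1594, 128.1741, 131.1926, 147.1766, 97.1167,
        87.0782, 101.1051, 186.2132, 163.1760, 99.1326
    };
    private static final double WATER = 18.01524;

    // Sums residue masses and adds one water for the N- and C-terminal groups.
    static double of(String sequence) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        double total = WATER;
        for (int i = 0; i < sequence.length(); i++) {
            int index = RESIDUES.indexOf(Character.toUpperCase(sequence.charAt(i)));
            if (index < 0) {
                throw new IllegalArgumentException(
                    "Not a standard amino acid at position " + i);
            }
            total += RESIDUE_MASS[index];
        }
        return total;
    }
}
```

Calling MolecularWeight.of("GA") gives roughly 146.15, which is close to the known mass of the dipeptide glycyl-alanine.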
Results/ Observations:
Protein Editor
The graphical editor allows entry of sequence data. It admits only the 20 valid amino acids into a file. It also supports all fundamental file operations such as opening, saving, printing, closing, editing, pasting, searching, and replacing.
Secondary Structure Prediction
This module predicts the secondary structure of a given protein using two algorithms: GOR-I (2) and the Gibrat method (3).
The tool also displays a graphical representation of the results.
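The window-based idea behind GOR-style prediction can be sketched as follows: for each residue, sum an information value over the neighbouring positions for every candidate state and keep the state with the highest total. The interface InformationTable and the class WindowPredictor are hypothetical names, and the information values must be supplied by the caller; the published GOR-I parameters are not reproduced here. The sketch uses current Java syntax rather than Java 1.5.

```java
// Hypothetical lookup: information that a residue at a given window offset
// carries about a conformational state at the central position.
interface InformationTable {
    double info(char state, char residue, int offset);
}

final class WindowPredictor {
    // Returns one state letter per residue. halfWindow = 8 gives the
    // 17-residue window used by GOR-style methods.
    static String predict(String sequence, String states, int halfWindow,
                          InformationTable table) {
        StringBuilder prediction = new StringBuilder();
        for (int j = 0; j < sequence.length(); j++) {
            char best = states.charAt(0);
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int s = 0; s < states.length(); s++) {
                char state = states.charAt(s);
                double sum = 0.0;
                for (int m = -halfWindow; m <= halfWindow; m++) {
                    int k = j + m;
                    if (k < 0 || k >= sequence.length()) {
                        continue;   // window positions past the ends are skipped
                    }
                    sum += table.info(state, sequence.charAt(k), m);
                }
                if (sum > bestScore) {   // ties keep the earlier state
                    bestScore = sum;
                    best = state;
                }
            }
            prediction.append(best);
        }
        return prediction.toString();
    }
}
```

With a toy table that scores alanine for helix (H) and valine for strand (E), a call such as WindowPredictor.predict("AAAAVVVV", "HETC", 1, table) labels the first half H and the second half E. A real predictor would plug in parameters derived from known structures.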
The calculator computes eight physico-chemical profiles and properties: Hydrophobicity, Hydrophilicity, Antigenicity, Transmembranous regions, Solvent Accessibility, Molecular Weight, Absorption factor and Amino Acid Content.
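A profile of this kind is usually a sliding-window average of a per-residue scale. The sketch below computes a hydrophobicity profile with the Kyte-Doolittle hydropathy scale; the paper does not say which scale IPAT uses, so the scale, the class name HydropathyProfile and the method name profile are all assumptions.

```java
// Hypothetical helper, not part of IPAT: sliding-window hydropathy profile.
final class HydropathyProfile {
    // One-letter codes and their Kyte-Doolittle hydropathy values (assumed scale).
    private static final String RESIDUES = "ARNDCQEGHILKMFPSTWYV";
    private static final double[] KYTE_DOOLITTLE = {
        1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
        3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2
    };

    private static double value(char residue) {
        int index = RESIDUES.indexOf(Character.toUpperCase(residue));
        if (index < 0) {
            throw new IllegalArgumentException("Not a standard amino acid: " + residue);
        }
        return KYTE_DOOLITTLE[index];
    }

    // Returns one mean value per window position (length - window + 1 values).
    static double[] profile(String sequence, int window) {
        if (window < 1 || window > sequence.length()) {
            throw new IllegalArgumentException("Window must be between 1 and the sequence length");
        }
        double[] means = new double[sequence.length() - window + 1];
        for (int start = 0; start < means.length; start++) {
            double sum = 0.0;
            for (int k = start; k < start + window; k++) {
                sum += value(sequence.charAt(k));
            }
            means[start] = sum / window;
        }
        return means;
    }
}
```

Plotting the returned values against window position gives the familiar hydropathy plot; long runs of high values are the usual hint of a transmembranous region.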
Helical-Wheel Projection
IPAT has a provision for viewing the Helical Wheel Projection of a selected region of a given protein sequence and a 2D representation of the alpha-carbon, which is useful for calculating the hydrophobic moment.
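The geometry behind a helical wheel is simple enough to sketch. In an ideal alpha helix (3.6 residues per turn) successive residues are rotated by 100 degrees, and the hydrophobic moment is the length of the vector sum of per-residue hydrophobicities placed at those angles. The class HelicalWheel and its methods are hypothetical names; the hydrophobicity values are supplied by the caller, so no particular scale is assumed.

```java
// Hypothetical helper, not part of IPAT: helical wheel angles and hydrophobic moment.
final class HelicalWheel {
    // Rotation between successive residues in an ideal alpha helix.
    static final double DEGREES_PER_RESIDUE = 100.0;

    // Angular position on the wheel of the residue at the given zero-based index.
    static double angleDegrees(int index) {
        return (index * DEGREES_PER_RESIDUE) % 360.0;
    }

    // Magnitude of the vector sum of hydrophobicities around the wheel.
    static double hydrophobicMoment(double[] hydrophobicity) {
        double sumSin = 0.0;
        double sumCos = 0.0;
        for (int n = 0; n < hydrophobicity.length; n++) {
            double theta = Math.toRadians(n * DEGREES_PER_RESIDUE);
            sumSin += hydrophobicity[n] * Math.sin(theta);
            sumCos += hydrophobicity[n] * Math.cos(theta);
        }
        return Math.hypot(sumSin, sumCos);
    }
}
```

A segment whose hydrophobic residues cluster on one face of the wheel gives a large moment, while 18 residues of equal hydrophobicity (five full turns) cancel out to a moment of about zero.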
Sequence Alignment
Pairwise alignment compares two protein files and shows their alignment and similarity (4,5). Multiple alignment takes a list of files from a specified directory, compares each one with the source file, and returns the similarity and global alignment.
The system was modeled using UML (Unified Modeling Language). It was coded in Java, which is an object-oriented language. The system was tested using unit testing, path testing and integration testing.
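Global alignment in the style of Needleman and Wunsch (5) can be sketched as a dynamic programming table in which each cell holds the best score for aligning two prefixes. The scoring values (match, mismatch, gap) are parameters chosen by the caller and are assumptions here; a protein tool would more likely use a substitution matrix. GlobalAlignment is a hypothetical class name, not the IPAT implementation.

```java
// Hypothetical helper, not part of IPAT: global alignment score with a linear gap penalty.
final class GlobalAlignment {
    // Returns the optimal global alignment score of a and b.
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[][] f = new int[a.length() + 1][b.length() + 1];
        // Aligning a prefix against the empty string costs one gap per residue.
        for (int i = 1; i <= a.length(); i++) {
            f[i][0] = i * gap;
        }
        for (int j = 1; j <= b.length(); j++) {
            f[0][j] = j * gap;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int pair = (a.charAt(i - 1) == b.charAt(j - 1)) ? match : mismatch;
                int diagonal = f[i - 1][j - 1] + pair;   // align the two residues
                int up = f[i - 1][j] + gap;              // gap in b
                int left = f[i][j - 1] + gap;            // gap in a
                f[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return f[a.length()][b.length()];
    }
}
```

The alignment itself can be recovered by tracing back from the bottom-right cell, choosing at each step the neighbour that produced the stored maximum.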
SCOPE
The current version is limited to FASTA format files and to an upper limit of 1000 amino acids. The secondary structure table can hold a maximum of 1000 amino acids. Enhancements can be made to extend these capabilities.
User Interface
The graphical user interface provides eight menus. All file operations are handled by the FILE menu. Typical editing operations on the sequence are handled by the EDIT menu. The METHODS menu allows selecting one algorithm to predict protein secondary structure. The PROPERTIES menu calculates the physico-chemical properties of the given sequence. The PROFILES menu allows finding out the chemical properties of the sequence. If no input is available, the user can generate a random amino acid sequence to use as input. The user can also view the helical wheel for a sequence.
The logical characteristics of the user interface between the software product and the user are:
Window format: The screen consists of the protein editor and structure viewer, with the results table at the right hand side.
Software interfaces
IPAT was developed using Java (Version 1.5.0; Sun Microsystems). This free software has strong support for file operations and can be ported easily because Java is platform independent.
System Interfaces
The integrated system is a stand-alone application that can be used in any environment, irrespective of the operating system.
PRODUCT FUNCTIONS
This project lets the user select one of five prediction algorithms to predict the secondary structure of a protein. The product has the following functionalities.
1. It converts a nucleotide sequence to an amino acid sequence
2. It generates a random amino acid sequence with a specified name and length
3. It predicts protein secondary structure using any of the five different algorithms
4. It calculates five profiles of the amino acid sequence
5. It calculates properties of the amino acid sequence
6. It plots the helical wheel projection
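The first function in the list, converting a nucleotide sequence to an amino acid sequence, can be sketched with the standard genetic code packed into a 64-character string. The class name CodonTranslator, the choice of reading frame (starting at the first base) and the use of '*' for stop codons are assumptions made for illustration; the paper does not describe how IPAT handles these details.

```java
// Hypothetical helper, not part of IPAT: translates DNA or RNA with the standard genetic code.
final class CodonTranslator {
    // Base order used to index the table below.
    private static final String BASES = "TCAG";
    // Standard genetic code for codons TTT, TTC, TTA, TTG, TCT, ... GGG; '*' marks a stop.
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    // Translates from the first base; a trailing partial codon is ignored
    // and translation continues past stop codons.
    static String translate(String nucleotides) {
        String dna = nucleotides.toUpperCase().replace('U', 'T');
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            int first = BASES.indexOf(dna.charAt(i));
            int second = BASES.indexOf(dna.charAt(i + 1));
            int third = BASES.indexOf(dna.charAt(i + 2));
            if (first < 0 || second < 0 || third < 0) {
                throw new IllegalArgumentException("Invalid base in codon at position " + i);
            }
            protein.append(AMINO_ACIDS.charAt(16 * first + 4 * second + third));
        }
        return protein.toString();
    }
}
```

For example, CodonTranslator.translate("ATGGCCTGGTAA") yields "MAW*": methionine, alanine, tryptophan and a stop.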
ASSUMPTIONS AND DEPENDENCIES
It is assumed that the amino acid sequence provided by the user of this system is valid and exactly in the specified FASTA format. FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. A simple example of one sequence in FASTA format:
>Fasta Seq
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
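Reading and validating such a file can be sketched as follows. The class FastaRecord and its methods are hypothetical names; the 20-letter alphabet and the 1000-residue limit come from the restrictions stated in this paper, while the decision to read only the first record is an assumption.

```java
// Hypothetical helper, not part of IPAT: parses and validates one FASTA record.
final class FastaRecord {
    static final String STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";
    static final int MAX_LENGTH = 1000;   // upper limit stated for the current version

    final String header;
    final String sequence;

    FastaRecord(String header, String sequence) {
        this.header = header;
        this.sequence = sequence;
    }

    // Parses the first record of the given FASTA text; later records are ignored.
    static FastaRecord parseFirst(String text) {
        String header = null;
        StringBuilder residues = new StringBuilder();
        for (String rawLine : text.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (header != null) {
                    break;                 // start of the next record
                }
                header = line.substring(1);
            } else if (header == null) {
                throw new IllegalArgumentException("Missing '>' description line");
            } else {
                residues.append(line.replace(" ", "").toUpperCase());
            }
        }
        if (header == null) {
            throw new IllegalArgumentException("No FASTA record found");
        }
        return new FastaRecord(header, residues.toString());
    }

    // Index of the first character outside the 20 standard codes, or -1 if none.
    int firstInvalidIndex() {
        for (int i = 0; i < sequence.length(); i++) {
            if (STANDARD_AMINO_ACIDS.indexOf(sequence.charAt(i)) < 0) {
                return i;
            }
        }
        return -1;
    }

    boolean isValid() {
        return !sequence.isEmpty()
            && sequence.length() <= MAX_LENGTH
            && firstInvalidIndex() == -1;
    }
}
```

Under this strict reading of the 20-letter rule, the example sequence above would be flagged at its final character, X, which denotes an unknown residue.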
CONCLUSION
The present system integrates, in a single window, all the individual operations that can be applied to a protein sequence. The same approach can be extended so that higher-order structure, i.e. tertiary and quaternary structure, can be predicted. The system assumes that the input consists only of FASTA format files. It may be extended to other protein file formats with a little modification.
• The design can be extended for tertiary structure prediction
• The design can be extended for quaternary structure prediction
• It can be extended to various file formats of protein files
• It can be extended up to three dimensional representation of helical wheel
• Converting current system to a web based system
REFERENCES
[1] Sridhar GR, Divakar Ch, Hanuman T, Appa Rao A. Bioinformatics approach to extract information from genes. Intl J Diab Dev Countries 2006; 26:149-51
[2] Garnier J, Osguthorpe DJ, Robson B. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 1978; 120:97-120
[3] Gibrat JF, Garnier J, Robson B. Further development of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J Mol Biol 1987; 198:425-43
[4] Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981; 147:195-197
[5] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970; 48:443-53
Web-links
http://www.ncbi.nlm.nih.gov/
http://www.bioinformatics.org
http://www.dnaftb.org
http://www.sun.java.com
http://www.predictioncenter.com
http://www.bioinformaticscenter.com
http://javaboutique.com
http://www.lecture.molgen.edu