Grid-Enabled Tools for Bioinformatics
What is Grid-Enabled Tools (GET)?
● As the amount of data from genomics and proteomics experiments increases, problems arise for current sequence analysis technology, mainly slower processing.
● Using GET, the submitted sequences are cut into batches and distributed to different computers in the cluster for processing.
● After computation, the results are sent back to the head node for recombination and are then ready for collection by the user.
● Analyzing sequence data this way reduces the total time that has to be spent on it.
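The split, distribute and recombine steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GET's actual implementation: the names `parse_fasta`, `make_batches`, `analyse_batch` and `run_on_cluster` are hypothetical, a local thread pool stands in for the cluster nodes, and the per-batch "analysis" simply measures sequence length.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def make_batches(records, batch_size):
    """Cut the record list into consecutive batches of at most batch_size."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def analyse_batch(batch):
    """Placeholder for the real analysis (assumption): report sequence lengths."""
    return [(header, len(sequence)) for header, sequence in batch]


def run_on_cluster(fasta_text, batch_size=2, workers=4):
    """Distribute batches to workers, then recombine results in input order."""
    batches = make_batches(parse_fasta(fasta_text), batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the submitted batches.
        per_batch = list(pool.map(analyse_batch, batches))
    return [result for batch in per_batch for result in batch]


EXAMPLE = """>seq1
ACGT
ACG
>seq2
TTGACA
>seq3
GG
"""

if __name__ == "__main__":
    print(run_on_cluster(EXAMPLE))
```

Because the batches are recombined in submission order, the user sees results in the same order as the sequences in the uploaded file, regardless of which worker finishes first.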
GET Flowchart
[Flowchart: GET Login -> Submit sequence in FASTA format -> choose a tool: GetEMBOSS (choose DNA or Protein analysis; runs EMBOSS), GetMSA (choose your parameters; runs ClustalW & HMMER) or GetANNO (choose your BLAST parameters; runs BLAST) -> Results: a zip file is sent via e-mail; download the zip file]
GET Registration
Type in your name, e-mail and password. Then go to your e-mail to activate your account.
Login Page
Type in your e-mail address and password to log in.
GetANNO
● GetANNO performs annotation: it attaches additional information to a particular point in a piece of data, here a sequence.
● Many proteins are modular in nature, and many contain small conserved regions called motifs.
● Motifs tend to correspond to core structural and functional elements of the proteins. They are surrounded by divergent regions that show a high degree of mutational change among members of the same protein family.
GetANNO
● Protein annotation compares the user input with databases to determine the family of the protein.
● The computation takes a long time because the databases are large: there are many protein classes and the protein sequences are long.
● GetANNO splits the user input into parts and sends them to different computers holding the databases, reducing the time taken to analyze the proteins.
GetANNO
GetANNO enables users to:
- Perform sequence similarity searches against databases such as RefSeq, SwissProt, Pfam and Gene Ontology.
- Obtain a description of the results as an Excel spreadsheet.
GetANNO
Click here to start GetANNO
Type in your title
Choose the sequence type: DNA or Protein
Paste in sequence
Load sequence from file
Choose the parameter
Choose E-value
Start the annotation
GetANNO Parameter
There are 4 databases available to BLAST against.
There are also parameters for choosing the E-value and the scoring matrix.
In addition, a check box restricts the result to the top 10 hits.
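The effect of the E-value parameter and the top-10 check box can be illustrated with a short Python sketch. It is not GetANNO's code: the hit records, the `select_hits` function and the default threshold of 1e-5 are assumptions made for the example. The only fact relied on is that a lower E-value indicates a more significant match.

```python
def select_hits(hits, max_evalue=1e-5, top_n=10):
    """Keep hits at or below the E-value threshold, best first.

    hits is a list of dicts with 'subject', 'evalue' and 'bitscore' keys
    (a hypothetical record layout). Pass top_n=None to keep every hit
    that passes the threshold.
    """
    kept = [hit for hit in hits if hit["evalue"] <= max_evalue]
    # Lower E-value is better; ties are broken by higher bit score.
    kept.sort(key=lambda hit: (hit["evalue"], -hit["bitscore"]))
    return kept if top_n is None else kept[:top_n]


# Made-up hits for illustration only.
example_hits = [
    {"subject": "hitA", "evalue": 1e-30, "bitscore": 120.0},
    {"subject": "hitB", "evalue": 0.5, "bitscore": 20.0},
    {"subject": "hitC", "evalue": 1e-8, "bitscore": 55.0},
]

if __name__ == "__main__":
    for hit in select_hits(example_hits):
        print(hit["subject"], hit["evalue"])
```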
Database
There are 4 databases to check against:
● RefSeq
● Gene Ontology
● Pfam
● SwissProt
All of them are accurate and reliable, since their information is updated frequently.
RefSeq
Provides a comprehensive, integrated and non-redundant set of sequences, including genomic DNA, transcript (RNA) and protein products.
Database
Gene Ontology
Provides structured, controlled vocabularies and classifications that cover molecular and cellular biology. Often used in
Database
Pfam
A large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.
SwissProt
Provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.
GetEMBOSS
EMBOSS collectively contains the processes of:
* Sequence alignment
* Rapid database searching with sequence patterns
* Protein motif identification, including domain analysis
* Nucleotide sequence pattern analysis
* Codon usage analysis for small genomes
* Rapid identification of sequence patterns in large-scale sequence sets
● GetEMBOSS saves time by splitting up jobs and sending them to different computers in the cluster, which increases the computational power available to each submission.
● GetEMBOSS allows users to run several sequence analysis options on a submitted batch of sequences.
GetEMBOSS
Click here to start GetEMBOSS
Type in your title
Paste your FASTA sequence
Load sequence from file
Choose the type of analysis and parameter
Click here to start analysis
GetEMBOSS Parameter
Finds and extracts open reading frames
Picks PCR primers and hybridization oligos
Finds restriction enzyme cleavage sites
Translates nucleic acid sequences
Predicts protein secondary structure
Protein statistics
Calculates the isoelectric point of a protein
Predicts coiled-coil regions
GetMSA
● Multiple Sequence Alignment
● Compares multiple DNA or amino acid sequences and aligns them to highlight their similarities.
● GetMSA helps to shorten the computation time needed.
● Allows users to align multiple sequences for comparison and to select further analysis options: predicting secondary structure and finding domains for
GetMSA
Click here to start GetMSA
Type in your title
Choose DNA or Protein sequence
Load sequence from file
Type in sequence
Pairwise Alignment options
Multiple Alignment options
Click here to start analysis
Search History
● The Search History page stores the data from past analyses.
● Results of submitted jobs are found
Search History
Click here to view the result and search history
Click here to view the sequence you enter and the result of the analysis
Our Project Plans
Original Plan
[Diagram labels: NGO, BII, TP, LSF, SGE, Database]
This system has limited capacity. Data in transit would often collide, since everything travels over a single transmission line.
Linux Virtual Server (LVS)
● The Linux Virtual Server, or LVS, is software used to balance load across the servers of a cluster.
● The architecture of the cluster is transparent to the end user, so the LVS cluster acts as a single high-performance virtual server.
● LVS is commonly used to build highly scalable Internet services such as HTTP, FTP and VoIP.
How LVS Works
[Diagram: User -> Internet -> Load Balancer -> LAN/WAN -> Real Servers]
How LVS Works
● LVS works by having a load balancer connected to a cluster.
● The real servers and the load balancer may be interconnected by either high-speed LAN or by geographically dispersed WAN.
● The load balancer dispatches requests to the different servers and makes the parallel services of the cluster appear as one virtual service on a single IP address. Request dispatching can use IP load balancing technologies or application-level load balancing technologies.
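The dispatching idea can be made concrete with a toy Python model. It is a sketch, not LVS itself (which works at the network level): the class `RoundRobinBalancer`, the server names and the documentation-range address 192.0.2.10 are hypothetical, and round-robin is chosen only because it is the simplest scheduling policy to show.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy dispatcher: one virtual IP in front of several real servers."""

    def __init__(self, virtual_ip, real_servers):
        if not real_servers:
            raise ValueError("at least one real server is required")
        self.virtual_ip = virtual_ip
        # cycle() hands out the servers in turn, wrapping around forever.
        self._next_server = cycle(list(real_servers))

    def dispatch(self, request):
        """Return (server, request): the real server chosen for this request."""
        return next(self._next_server), request


if __name__ == "__main__":
    balancer = RoundRobinBalancer("192.0.2.10", ["rs1", "rs2", "rs3"])
    for i in range(5):
        server, request = balancer.dispatch(f"req{i}")
        print(f"{request} sent to {balancer.virtual_ip} is handled by {server}")
```

Clients only ever see the virtual IP; which real server answers is decided by the balancer on each request.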
How LVS Works
● Scalability of the system is achieved by transparently adding or removing nodes in the cluster.
● High availability is provided by detecting node or daemon failures and reconfiguring the system appropriately.
● Thus, the service continues to function even if one real server is taken down for maintenance.
● A backup load balancer can be connected to the network to take over if the primary load balancer goes down, whether for maintenance or because of a service failure.
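The availability rules above amount to two small decisions, sketched here in Python. The function names and the boolean health flags are assumptions for illustration; how a real deployment detects failures is outside the scope of this sketch.

```python
def active_balancer(primary_up, backup_up):
    """Pick which load balancer serves traffic, preferring the primary."""
    if primary_up:
        return "primary"
    if backup_up:
        return "backup"
    return None  # neither balancer is reachable


def healthy_servers(status):
    """Return the real servers that may still receive requests.

    status maps a server name to True (up) or False (down).
    """
    return [name for name, up in status.items() if up]


if __name__ == "__main__":
    status = {"rs1": True, "rs2": False, "rs3": True}  # rs2 is under maintenance
    print(active_balancer(primary_up=False, backup_up=True))
    print(healthy_servers(status))
```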
How LVS Works
● LVS can handle more than 1 million simultaneous connections.
● Each connection takes 128 bytes of memory.
● At that rate, a computer with 1 gigabyte of memory can hold more than 8 million simultaneous connections.
● LVS can also produce statistics for each real server, such as the number of connections, packets and bytes, from which graphs can be created using other software.
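The connection figures above follow from simple arithmetic, worked through below. The sketch assumes that "1 gigabyte" means 2^30 bytes and that all of that memory is available for connection entries, which a real machine would not quite achieve.

```python
BYTES_PER_CONNECTION = 128   # figure quoted on the slide
GIB = 1024 ** 3              # assumption: 1 gigabyte = 2**30 bytes


def max_connections(memory_bytes, bytes_per_connection=BYTES_PER_CONNECTION):
    """Upper bound on connection entries that fit in memory_bytes."""
    return memory_bytes // bytes_per_connection


connections_in_1_gib = max_connections(GIB)                # 8_388_608
memory_for_1_million = 1_000_000 * BYTES_PER_CONNECTION    # 128_000_000 bytes

if __name__ == "__main__":
    print(connections_in_1_gib)
    print(memory_for_1_million)
```

So 1 million connections need roughly 128 MB, and 1 gigabyte holds about 8.4 million entries.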
Our Project Plans
[Diagram: Users -> LVS -> NGO, BII, TP; database synchronized]
This method uses software known as LVS as a router that links all the clusters together.
It is more efficient than the original plan.
Conventional Methods vs GET

Conventional BLAST vs GetANNO (analysis of 394 sequences)
Conventional BLAST:
Start
Select BLAST parameters
Can only submit 1 query sequence at a time. Does not allow file upload.
Obtain results
Repeat the same process for the other 393 sequences
GetANNO:
Start
The 394 sequences are combined into a single FASTA-format text file
Select BLAST parameters
Can submit more than 1 query sequence at a time. Allows file upload.
Obtain results
[Bar chart: Time (hr), GET vs Conventional]
For a set of 394 sequences, a normal protein BLAST takes about 18 hours, while GetANNO takes only 2 hours.

Conventional EMBOSS vs GetEMBOSS
Conventional EMBOSS:
Start
Can only submit 1 query sequence at a time. Can only select 1 EMBOSS program.
Obtain results [results are not compiled]
Repeat the same process for the other 9 sequences, and again for the other program
GetEMBOSS:
Start
The 10 sequences are combined into a single FASTA-format file
Select EMBOSS programs [how many depends on user preference]
Can submit more than 1 query sequence at a time, e.g. all 10 query sequences
The selected programs, e.g. Restrict and Eprimer3, run in parallel and each produces results
Results are compiled into 1 result text file
[Bar chart: Time (mins), GET vs Conventional]
For a 10-sequence DNA analysis with 2 programs, the Institut Pasteur website takes 30 minutes, while GetEMBOSS takes 2 minutes.

Conventional MSA vs GetMSA
Conventional MSA:
Start
Upload a file that contains more than 1 sequence
Choose parameters, e.g. window size, k-tuple
Obtain results [Jalview, alignment, phylogenetic tree] in individual files
GetMSA:
Start
Upload a file that contains more than 1 sequence
Choose parameters, e.g. window size, k-tuple
Users have the option to build an HMM profile
Obtain results [Jalview, alignment, phylogenetic tree, hmmbuild] in 1 text file
GetMSA offers the additional option of building an HMM profile for the user's sequences, which saves an extra step.
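The timing comparisons quoted above translate into speedup factors as follows. The snippet uses only the figures given in the slides (about 18 hours vs 2 hours for 394 sequences, and 30 minutes vs 2 minutes for 10 sequences); the function name `speedup` is just a label for the ratio.

```python
def speedup(conventional_time, get_time):
    """Ratio of conventional run time to GET run time (same unit for both)."""
    return conventional_time / get_time


blast_speedup = speedup(18, 2)    # hours: conventional BLAST vs GetANNO
emboss_speedup = speedup(30, 2)   # minutes: conventional EMBOSS vs GetEMBOSS

if __name__ == "__main__":
    print(f"GetANNO: about {blast_speedup:.0f}x faster")
    print(f"GetEMBOSS: about {emboss_speedup:.0f}x faster")
```

These ratios describe the two quoted runs only; they are not a general performance guarantee.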
Why use our program?
● GET completes an analysis faster than the conventional methods.
● GET provides multiple options for analysis.
● It is more user-friendly than the conventional methods.
Target Audiences
● Biologists
● Students
● Teachers
● Anyone who needs information on DNA or protein sequences.
Summary
● The Grid-Enabled Tools suite was developed so that biologists can access computing resources via a user-friendly web interface for high-throughput bioinformatics analysis.
● Provides a convenient resource for annotation extraction and sequence analysis.
● Capitalizes on the availability of cluster and grid
Thank you for listening!