(1)

Grid-Enabled Tools For Bioinformatics

(2)

What is Grid-Enabled Tools (GET)?

● The amount of data from genomics and proteomics experiments keeps increasing.

● This creates problems for current sequence analysis technology, mainly slower processing speed.

● Using GET, the submitted sequences are cut into batches and distributed to different computers in the cluster for processing.

● After computation, the results are sent back to the head node for recombination and are then ready for collection by the user.

● Processing and analyzing sequence data this way reduces the total time needed.
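The batching idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not GET's actual implementation: the names `parse_fasta`, `split_into_batches`, `analyze_batch` and `recombine` are hypothetical, and the per-batch "analysis" is a stand-in that simply reports sequence length.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_batches(records: List[Record], batch_size: int) -> List[List[Record]]:
    """Cut the submitted records into batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def analyze_batch(batch: List[Record]) -> List[Tuple[str, int]]:
    """Hypothetical stand-in for the analysis a compute node would run."""
    return [(header, len(seq)) for header, seq in batch]


def recombine(batch_results: List[List[Tuple[str, int]]]) -> List[Tuple[str, int]]:
    """Head-node step: flatten per-batch results back into submission order."""
    return [item for batch in batch_results for item in batch]


if __name__ == "__main__":
    fasta = ">seq1\nATGC\nGG\n>seq2\nTTAA\n>seq3\nCCCGGG\n"
    batches = split_into_batches(parse_fasta(fasta), batch_size=2)
    # In a real cluster each batch would go to a different computer.
    print(recombine([analyze_batch(b) for b in batches]))
```

Because the batches are kept in order and flattened in the same order, the recombined results line up with the sequences as the user submitted them.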

(3)

GET Flowchart

Login

Submit sequence in FASTA format

Choose a tool:

- GetEMBOSS: choose to perform either DNA or protein analysis (EMBOSS)

- GetMSA: choose your parameters (ClustalW & HMMER)

- GetANNO: choose your BLAST parameters (BLAST)

Results

Results in a zip file are sent via e-mail

Download the zip file

(4)

GET

(5)

Registration

Type in your name, e-mail address and password. Then check your e-mail to activate your account.

(6)

Login Page

Type in your e-mail address and password to log in.

(7)

GetANNO

● GetANNO performs annotation: adding information associated with a particular point in a piece of information, here a sequence.

● Many proteins are modular in nature, and many contain small conserved regions called motifs.

● Motifs tend to correspond to core structural and functional elements of the proteins. They are surrounded by divergent regions that show a high degree of mutational change among members of the same protein family.

(8)

GetANNO

● Protein annotation compares the user input with databases to determine the family of the protein.

● Computation takes a long time because the databases are large, with many protein classes and long protein sequences.

● GetANNO splits the user input into parts and sends them to different computers holding the databases, reducing the time taken to analyze the proteins.

(9)

GetANNO

GetANNO enables users to:

- Perform sequence similarity searches against databases such as RefSeq, SwissProt, Pfam and Gene Ontology.

- Obtain a description of the results as an Excel spreadsheet output.

(10)

GetANNO

Click here to start GetANNO

Type in your title

Choose the type: DNA or Protein

Paste in sequence

Load sequence from file

Choose the parameter

Choose E-value

Start the annotation

(11)

GetANNO Parameter

There are 4 types of databases available to BLAST against.

There are also parameters for choosing the E-value and the scoring matrix.

In addition, a check box lets you show only the top 10 hits in the result.
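The effect of the E-value parameter and the "top 10 hits" check box can be sketched as a post-filter on the hit list. This is a minimal illustration, not GetANNO's code. It assumes the hits are available in BLAST tabular format with the 12 default columns, where column 11 is the E-value and column 12 the bit score; the names `Hit`, `parse_tabular` and `filter_hits` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse tab-separated BLAST hits (assumed 12 default columns)."""
    hits: List[Hit] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        hits.append(Hit(cols[0], cols[1], float(cols[10]), float(cols[11])))
    return hits


def filter_hits(hits: List[Hit], max_evalue: float,
                top_n: Optional[int] = None) -> Dict[str, List[Hit]]:
    """Keep hits at or below max_evalue, best first, at most top_n per query."""
    per_query: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        if hit.evalue <= max_evalue:
            per_query[hit.query].append(hit)
    result: Dict[str, List[Hit]] = {}
    for query, query_hits in per_query.items():
        # Lower E-value is better; break ties with the higher bit score.
        query_hits.sort(key=lambda h: (h.evalue, -h.bitscore))
        result[query] = query_hits if top_n is None else query_hits[:top_n]
    return result
```

Calling `filter_hits(hits, max_evalue=1e-3, top_n=10)` would correspond to choosing an E-value of 0.001 with the top-10 box ticked.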

(12)

Database

There are 4 types of database to check against:

● RefSeq

● Gene Ontology

● Pfam

● SwissProt

All of them are accurate and reliable, since the information is frequently updated.

(13)

Database

RefSeq

Provides a comprehensive, integrated and non-redundant set of sequences, including genomic DNA, transcript (RNA) and protein products.

Gene Ontology

Provides structured, controlled vocabularies and classifications which cover molecular and cellular biology. Often used in

(14)

Database

Pfam

A large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.

SwissProt

Provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.

(15)

GetEMBOSS

EMBOSS collectively covers the processes of:

* Sequence alignment

* Rapid database searching with sequence patterns

* Protein motif identification, including domain analysis

* Nucleotide sequence pattern analysis

* Codon usage analysis for small genomes

* Rapid identification of sequence patterns in large scale sequence sets

(16)

● GetEMBOSS saves time by splitting up jobs and sending them to different computers in the cluster, which increases the available computational power.

● GetEMBOSS allows users to run several sequence analysis options on a submitted batch of sequences.

(17)

GetEMBOSS

Click here to start GetEMBOSS

Type in your title

Paste your FASTA sequence

Load sequence from file

Choose the type of analysis and parameter

Click here to start analysis

(18)

GetEMBOSS Parameter

Finds and extracts open reading frames.

Picks PCR primers and hybridization oligos.

Finds restriction enzyme cleavage sites.

Translates nucleic acid sequences.

Predicts protein secondary structure.

Protein statistics.

Calculates the isoelectric point of a protein.

Predicts coiled coil regions.
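To give a feel for the first option, here is a deliberately simplified open reading frame scan. It is a sketch under stated assumptions, not the EMBOSS algorithm or its options: it looks only at the three forward-strand frames, treats an ORF as running from an ATG to the first in-frame stop codon, and the function name `find_orfs` and its `min_codons` parameter are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 1) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons is the minimum number of codons before the stop codon,
    counting the ATG start codon.
    """
    dna = dna.upper()
    orfs: List[Tuple[int, int, str]] = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    for start, end, orf in find_orfs("CCATGAAATAGGG"):
        print(start, end, orf)
```

A production tool would also scan the reverse complement and support alternative genetic codes, which this sketch leaves out.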

(19)

GetMSA

● Multiple Sequence Alignment

● Compares multiple DNA or amino acid sequences and aligns them to highlight their similarities.

● GetMSA helps to shorten the computation time needed.

● Allows users to align multiple sequences for comparison and select further analysis options of predicting secondary structure and finding domains for

(20)

GetMSA

Click here to start GetMSA

Type in your title

Choose DNA or Protein sequence

Load sequence from file

Type in sequence

Pairwise Alignment options

Multiple Alignment options

Click here to start analysis

(21)

Search History

● The Search History is a page where data from past analyses is stored.

● Results of submitted jobs are found

(22)

Search History

Click here to view the result and search history

Click here to view the sequence you entered and the result of the analysis

(23)

Our Project Plans

Original Plan

[Diagram labels: NGO, BII, TP, LSF, SGE, Database]

Capacity in this system is limited. Collisions often occur between pieces of information in transit, since everything travels over a single transmission line.

(24)

Linux Virtual Server (LVS)

● The Linux Virtual Server, or LVS, is a piece of software that is used to balance loads on clusters.

● The architecture of the whole cluster is transparent to the end user, so the LVS cluster appears as a single high-performance virtual server.

● LVS is commonly used to build highly scalable services on the internet such as HTTP, FTP, VoIP and so on.

(25)

(26)

How LVS Works

[Diagram: User, Internet, Load Balancer, LAN/WAN, Real Servers]

(27)

How LVS Works

● LVS works by having a load balancer connected to a cluster.

● The real servers and the load balancer may be interconnected by either a high-speed LAN or a geographically dispersed WAN.

● The load balancer dispatches requests to the different servers and makes the parallel services of the cluster appear as a virtual service on a single IP address. Request dispatching can use IP load balancing technologies or application-level load balancing technologies.

(28)

How LVS Works

● Scalability of the system is achieved by transparently adding or removing nodes in the cluster.

● High availability is provided by detecting node or daemon failures and reconfiguring the system appropriately.

● Thus, the service will continue to function even if one real server is taken down for maintenance.

● A backup load balancer can be connected to the network to take over if the primary load balancer goes down due to either maintenance or a service failure.
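The dispatching and failure-handling behaviour described on these slides can be modelled with a toy round-robin balancer. This is only a user-space sketch for illustration: LVS itself performs load balancing inside the Linux kernel and offers several scheduling algorithms, and the class `RoundRobinBalancer` with its method names is hypothetical.

```python
from typing import Dict, List


class RoundRobinBalancer:
    """Toy model of a load balancer fronting a set of real servers."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("at least one real server is required")
        self._servers: List[str] = list(servers)
        self._healthy: Dict[str, bool] = {s: True for s in servers}
        self._next = 0

    def _set_health(self, server: str, healthy: bool) -> None:
        if server not in self._healthy:
            raise KeyError(f"unknown server: {server}")
        self._healthy[server] = healthy

    def mark_down(self, server: str) -> None:
        """Record a detected failure or a maintenance shutdown."""
        self._set_health(server, False)

    def mark_up(self, server: str) -> None:
        """Put a repaired server back into rotation."""
        self._set_health(server, True)

    def dispatch(self) -> str:
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if self._healthy[server]:
                return server
        raise RuntimeError("no healthy real servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["rs1", "rs2", "rs3"])
    print([lb.dispatch() for _ in range(4)])
    lb.mark_down("rs2")  # rs2 is taken down for maintenance
    print([lb.dispatch() for _ in range(3)])
```

Clients only ever talk to the balancer, so servers can be marked down or up without the caller noticing, which is the transparency the slides refer to.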

(29)

(30)

How LVS Works

● LVS can handle more than 1 million simultaneous connections.

● Each connection uses 128 bytes of memory.

● A computer with 1 gigabyte of memory can therefore handle more than 8 million simultaneous connections.

● LVS can also produce statistics for each real server, such as the number of connections, packets and bytes, from which graphs can be created using other software.
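The memory figure can be checked with a line of arithmetic. The sketch below assumes that 1 gigabyte means 2**30 bytes and that all of it is available for connection entries, which is a simplification; the function name `max_connections` is hypothetical.

```python
BYTES_PER_CONNECTION = 128      # figure quoted on the slide
MEMORY_BYTES = 1024 ** 3        # assumption: 1 gigabyte = 2**30 bytes


def max_connections(memory_bytes: int, bytes_per_connection: int) -> int:
    """Upper bound on connection entries that fit in the given memory."""
    return memory_bytes // bytes_per_connection


print(max_connections(MEMORY_BYTES, BYTES_PER_CONNECTION))  # 8388608
```

With a decimal gigabyte (10**9 bytes) the same calculation gives 7,812,500, so the "more than 8 million" figure relies on the binary definition.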

(31)

Our Project Plans

LVS

NGO BII TP

Users

Database synchronized

This method uses software known as LVS to act as a router that links all the clusters together.

This method is more efficient.

(32)

Conventional Methods

VS

(33)

Conventional Blast

Analysis of 394 Sequences

Start

Select Blast parameters

Can only submit 1 query sequence at a time. Does not allow file upload.

Repeat the same process for the other 393 sequences.

Obtain Results

(34)

GetAnno

The 394 sequences are combined into a single FASTA format text file.

Start

Select Blast parameters

Can submit more than 1 query sequence at a time. Allows file upload.

Obtain Results

(35)

Conventional Blast Vs GetAnno

[Bar chart: Time (hr), GET vs Conventional]

For 394 sequences, a normal protein BLAST takes about 18 hours, while GetANNO takes only 2 hours.

(36)

Conventional Emboss

Start

Can only submit 1 query sequence at a time. Can only select 1 Emboss program.

Obtain Results [results are not compiled]

Repeat the same process for the other 9 sequences, and also for the other program.

(37)

GetEmboss

The 10 sequences are combined into a single FASTA format file.

Start

Can submit more than 1 query sequence at a time, e.g. all 10 query sequences.

Select Emboss programs [how many depends on user preference]

Restrict and Eprimer3 running in parallel, each producing its own results

Compile into 1 result text file
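Running several programs over a batch of sequences in parallel and compiling everything into one result text can be sketched as below. This is an illustration under stated assumptions, not GetEmboss itself: threads stand in for cluster nodes, and `length` and `gc_content` are hypothetical stand-in analyses rather than EMBOSS programs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def length(seq: str) -> str:
    """Stand-in analysis: sequence length."""
    return str(len(seq))


def gc_content(seq: str) -> str:
    """Stand-in analysis: fraction of G and C bases."""
    if not seq:
        return "0.00"
    return f"{(seq.count('G') + seq.count('C')) / len(seq):.2f}"


def run_all(records: List[Record],
            programs: Dict[str, Callable[[str], str]],
            max_workers: int = 4) -> str:
    """Run every program on every record in parallel, compile one report."""
    jobs = [(name, header, seq)
            for name in programs
            for header, seq in records]

    def run(job: Tuple[str, str, str]) -> str:
        name, header, seq = job
        return f"{name}\t{header}\t{programs[name](seq)}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in job order, whatever order they finish in.
        lines = list(pool.map(run, jobs))
    return "\n".join(lines)


if __name__ == "__main__":
    records = [("s1", "GGCC"), ("s2", "ATAT")]
    print(run_all(records, {"length": length, "gc": gc_content}))
```

The user submits once and receives one compiled text, instead of repeating the submission for each sequence and each program.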

(38)

Conventional Emboss Vs GetEmboss

[Bar chart: Time (mins), GET vs Conventional]

For a DNA analysis of 10 sequences with 2 programs, the Institut Pasteur web server takes 30 minutes, while GetEmboss takes 2 minutes.
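The timings quoted in the two comparisons (about 18 hours vs 2 hours for 394 sequences with GetANNO, and 30 minutes vs 2 minutes for 10 sequences with GetEmboss) can be expressed as speed-up factors. The helper name `speedup` is hypothetical, and the figures are single measurements taken from the slides, so the ratios are indicative only.

```python
def speedup(conventional_time: float, grid_time: float) -> float:
    """Ratio of conventional run time to grid-enabled run time."""
    if grid_time <= 0:
        raise ValueError("grid_time must be positive")
    return conventional_time / grid_time


print(speedup(18, 2))  # GetANNO vs conventional BLAST (hours): 9.0
print(speedup(30, 2))  # GetEmboss vs conventional Emboss (minutes): 15.0
```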

(39)

Start

Upload a file that contains more than 1 sequence

Choose parameters, e.g. window size, k-tuple

Obtain results [Jalview, alignment, phylogenetic tree] in individual files

(40)

Start

Upload file that contains more than 1 sequence

Choose parameters, e.g. window size, k-tuple

Users have the option to build an HMM profile.

Obtain results [Jalview, alignment, phylogenetic tree, hmmbuild] in 1 text profile.

(41)

Conventional MSA

Vs GetMSA

GetMSA offers the additional option of building an HMM profile for the user's sequences, which saves an extra step.

(42)

Why use our program?

● GET completes a process faster than the conventional method.

● GET provides multiple options for analysis.

● It is more user-friendly than the conventional method.

(43)

Target Audiences

● Biologists

● Students

● Teachers

● Anyone who needs information on DNA or protein sequences.

(44)

Summary

● The Grid Enabled Tools Suite is developed for biologists to access computing resources via a user-friendly web interface for high-throughput bioinformatics analysis.

● Provides a convenient resource for annotation extraction and sequence analysis.

● Capitalize on the availability of cluster and grid

(45)

THANK YOU for listening!