Databases indexation

Laurent Falquet, Basel October, 2006

Swiss Institute of Bioinformatics

Swiss EMBnet node

Overview

- Data access concept
  - sequential
  - direct
- Indexing
  - EMBOSS
  - Fetch
  - Other
- BLAST
  - Why indexing?
  - formatdb
  - Parsing output
- Excel import/export
  - Tab delimited
  - Comma delimited

LF, Basel October 2006

Why indexing?

- Human tendency to classify and group
- Examples:
  - Dictionary
  - Book
  - Library
  - DVD chapters
  - iPod playlists
- Advantages:
  - Fast access
  - Easy data finding
- Disadvantages:
  - Time to prepare indices

Data access: sequential vs direct

- Sequential access: access times vary from very short to very long
- Direct access: very small variations in access time

[Figure labels: track, sector]

Similar concept for databases

- Flat files = sequential
- Indexing = simulated direct

>seq1
cgatgtcatgtg
>seq2
cgatcgtagctgtagctgtag
>seq3
catgtgcatgcgacgt

ID      Position (byte)   Length (byte)
seq1    0                 19
seq2    19                28
seq3    47                23
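The index table above can be reproduced with a few lines of code. The following is a minimal sketch, assuming a FASTA file with one-byte line endings; the function names `build_index` and `fetch` are illustrative and do not belong to any of the tools discussed in these slides.

```python
import io

def build_index(handle):
    """Return {sequence ID: (byte position, byte length)} for a FASTA stream.

    The handle must be opened in binary mode so that positions are byte offsets.
    """
    index = {}
    current_id = None
    start = 0
    position = 0
    for line in handle:
        if line.startswith(b">"):
            # A new header closes the previous entry.
            if current_id is not None:
                index[current_id] = (start, position - start)
            current_id = line[1:].split()[0].decode("ascii")
            start = position
        position += len(line)
    if current_id is not None:
        index[current_id] = (start, position - start)
    return index

def fetch(handle, index, seq_id):
    """Jump straight to one entry instead of reading the file sequentially."""
    start, length = index[seq_id]
    handle.seek(start)
    return handle.read(length).decode("ascii")

# The three sequences shown on the slide, held in memory for the demonstration.
FASTA = b">seq1\ncgatgtcatgtg\n>seq2\ncgatcgtagctgtagctgtag\n>seq3\ncatgtgcatgcgacgt\n"
flat_file = io.BytesIO(FASTA)
index = build_index(flat_file)
print(index)
print(fetch(flat_file, index, "seq2"))
```

Building the index still needs one sequential pass over the file, which is the "time to prepare indices" cost; every later lookup is a single seek.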

Tools

- EMBOSS
  - dbxflat
  - dbxfasta
  - dbiblast
  - seqret
  - seqretsplit
  - entret
- Other examples
  - SRS (Icarus language)
    - http://srs.ebi.ac.uk
    - http://www.lionbioscience.com/
  - indexer & fetch (warning: local SIB tool)
  - Relational (MySQL, Oracle…)
  - Web (Google!!)

EMBOSS: how to index?

- Where is your file?
- What is the format?
- Where should the indices be stored?
- Where is the emboss.default file? (.embossrc)
- Other EMBOSS tools
  - textsearch
  - whichdb
- More details
  - www.emboss.org

EMBOSS example

- Input file and directory
  - ~/embossidx/ECOLI.dat
  - cd embossidx
- Index creation
  - dbxflat -idformat swiss -dbname ecoli -filenames '*.dat' -dbresource swiss -directory . -release 1.0 -date 26/09/06 -fields id,acc
- Generates 5 files (default)
  - ECOLI.ent
  - ECOLI.pxac
  - ECOLI.pxid
  - ECOLI.xac
  - ECOLI.xid

.embossrc

- Example of queries
  - seqret ecoli:thio_ecoli
  - seqret ecoli:P00274
  - entret ecoli:thio_ecoli
- and even
  - seqret 'ecoli:*_ECOLI'

set emboss_filter 1
# Ecoli
DB ecoli [
  type: P
  comment: "E.coli proteome"
  method: emboss
  format: swiss
  dir: "{path}/embossidx"
  file: "ECOLI.dat"
  release: "1.0"
  indexdir: "{path}/embossidx"
]

Where {path} is the path to your home directory.

Indexer & fetch

- Warning: this is a local SIB tool!!
- Input file and directory
  - ~/embossidx/ECOLI.dat
  - cd embossidx
- Index creation
  - indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx
- Generates 1 file
  - ecoli.idx
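The regular expressions passed to indexer suggest how such a tool can work. The sketch below is an illustration only, not the SIB implementation: it assumes that -h marks the first line of an entry, -t the last line, -p captures the key, and -i makes keys case-insensitive. These meanings are inferred from the command line above, not from the tool's documentation. The sample entries are truncated toy records.

```python
import io
import re

def index_entries(handle, header=rb"^ID", terminator=rb"^//", key=rb"^ID\s+(\S+)"):
    """Return {key: (byte position, byte length)} for a flat file of entries.

    Assumption: an entry starts on a line matching `header` and ends on a
    line matching `terminator`; the key is the first group of `key`.
    """
    header_re = re.compile(header)
    end_re = re.compile(terminator)
    key_re = re.compile(key)
    index = {}
    entry_key = None
    start = 0
    position = 0
    for line in handle:
        if entry_key is None and header_re.search(line):
            match = key_re.search(line)
            if match:
                # Keys are lower-cased (assumed meaning of the -i flag).
                entry_key = match.group(1).decode("ascii").lower()
                start = position
        position += len(line)
        if entry_key is not None and end_re.search(line):
            index[entry_key] = (start, position - start)
            entry_key = None
    return index

# Toy, heavily truncated Swiss-Prot style entries (hypothetical content).
SAMPLE = (
    b"ID   THIO_ECOLI\n"
    b"AC   P00274;\n"
    b"//\n"
    b"ID   ACCA_ECOLI\n"
    b"AC   P30867;\n"
    b"//\n"
)
entry_index = index_entries(io.BytesIO(SAMPLE))
print(entry_index)
```

Because the entry boundaries and the key are all given as patterns, the same loop can index any flat-file format whose entries have a recognisable first and last line.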

Config file: fetch.conf

- Example of queries
  - fetch -c fetch.conf ecoli:thio_ecoli
  - fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'
- fetch.conf

#dbkey format indexfile datafile
ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat

BLAST

- Maintained at NCBI
- Source distributed freely with several accessory tools
  - ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
- May require compilation to install on your local computer
- blastall contains
  - blastp
  - blastn
  - blastx
  - tblastn
  - tblastx
- Other tools
  - blastpgp
  - megablast
  - formatdb

Available Blast programs

Program   Query                                Database
blastp    protein                              protein
blastn    nucleotide                           nucleotide
blastx    nucleotide (translated to protein)   protein
tblastn   protein                              nucleotide (translated to protein)
tblastx   nucleotide (translated to protein)   nucleotide (translated to protein)

What makes BLAST so fast?

- Indexing all words of 3 aa or 11 bp in the sequence database
- Searching the query for all words of a score > T
- Search the indexed database for all perfect matches
- Try to align matches that are

Indexing for Blast (1)

- Query words: REL, RSL
- List of all possible words with 3 amino acid residues (8000): AAA, AAC, AAD, ..., YYY
- List of words matching the query with a score > T: ACT, RSL, TVF, ...
- Words with a score < T: LKP, ...
- A substitution matrix is used to compute the word scores
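The word list can be sketched as follows. This is a toy illustration: `toy_score` is a hypothetical stand-in for a real substitution matrix such as BLOSUM62, and the pairs it treats as similar, as well as the threshold value, are assumptions chosen only to keep the example short.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical "similar" residue pairs; a real run would use a full matrix.
SIMILAR = {frozenset("ST"), frozenset("LI"), frozenset("RK")}

def toy_score(a, b):
    """Toy substitution score: identity 5, similar pair 2, anything else -2."""
    if a == b:
        return 5
    if frozenset((a, b)) in SIMILAR:
        return 2
    return -2

def neighborhood(word, threshold, score=toy_score):
    """Return every word whose score against `word` is greater than T."""
    words = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total > threshold:
            words.append("".join(candidate))
    return words

# All possible words of 3 residues: 20 * 20 * 20 = 8000.
all_words = ["".join(w) for w in product(AMINO_ACIDS, repeat=3)]
print(len(all_words))
print(neighborhood("RSL", 11))
```

Raising T shrinks the list and speeds up the search at the cost of sensitivity; lowering T does the opposite.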

Indexing for Blast (2)

- List of words matching the query with a score > T: ACT, RSL, TVF, ...
- Database sequences: searched for occurrences of these words (ACT, RSL, TVF)
- Result: list of sequences containing words similar to the query (hits)
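The lookup step can be sketched with a dictionary that maps each word to its positions in the database. This is a simplified illustration, not the formatdb file layout; the database sequences and query offsets below are hypothetical.

```python
from collections import defaultdict

def index_database(sequences, word_size=3):
    """Map each word to the (sequence ID, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, sequence in sequences.items():
        for offset in range(len(sequence) - word_size + 1):
            index[sequence[offset:offset + word_size]].append((seq_id, offset))
    return index

def find_hits(index, query_words):
    """Look up each (query offset, word) pair; only exact matches are hits."""
    hits = []
    for query_offset, word in query_words:
        for seq_id, db_offset in index.get(word, []):
            hits.append((seq_id, word, query_offset, db_offset))
    return hits

# Hypothetical database sequences and query word list.
database = {"db1": "MACTRSLK", "db2": "TVFGGRSL", "db3": "MLKPLKP"}
word_index = index_database(database)
hits = find_hits(word_index, [(0, "ACT"), (5, "RSL"), (9, "TVF")])
for hit in hits:
    print(hit)
```

Each lookup is a dictionary access, so the cost of the search depends on the number of query words and hits rather than on the size of the database.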

Indexing for Blast (3)

- Ungapped extension if: 2 "hits" are on the same diagonal but at a distance less than A
- Extension using dynamic programming, limited to a restricted region through a score drop-off threshold

[Figure axes: database sequence vs query; A marks the distance between two hits]
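The two-hit rule and the drop-off idea can be sketched as below. This is a simplification under stated assumptions: hits are (query offset, database offset) pairs for one database sequence, `simple_score` is a hypothetical match/mismatch score instead of a substitution matrix, and the drop-off is applied here to a rightward ungapped extension rather than to the dynamic-programming step mentioned on the slide.

```python
def two_hit_seeds(hits, max_distance):
    """Return (diagonal, first, second) for pairs of hits on one diagonal
    that are closer than max_distance (the A of the slide)."""
    by_diagonal = {}
    for query_offset, db_offset in sorted(hits):
        by_diagonal.setdefault(db_offset - query_offset, []).append(query_offset)
    seeds = []
    for diagonal, offsets in by_diagonal.items():
        for first, second in zip(offsets, offsets[1:]):
            if 0 < second - first < max_distance:
                seeds.append((diagonal, first, second))
    return seeds

def simple_score(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

def extend_right(query, subject, q_start, s_start, score, drop_off):
    """Extend without gaps; stop once the running score has fallen more
    than drop_off below the best score seen so far."""
    best = total = 0
    best_length = length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        total += score(query[q_start + length], subject[s_start + length])
        length += 1
        if total > best:
            best, best_length = total, length
        elif best - total > drop_off:
            break
    return best, best_length

query = "AAARSLGGGTVFAAA"
subject = "CCRSLGGGTVFCC"
seeds = two_hit_seeds([(3, 2), (9, 8)], max_distance=40)
best_score, best_length = extend_right(query, subject, 3, 2, simple_score, drop_off=3)
print(seeds, best_score, best_length)
```

Requiring two nearby hits on the same diagonal filters out most chance matches before the more expensive extension is attempted.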

BLAST indexing with formatdb

- Formatdb
  - mydb.seq must contain sequences in FASTA format
  - formatdb -i mydb.seq -p T -n mydb
- Generates 3 files
  - mydb.psq
  - mydb.pin
  - mydb.phr
- Then start a Blast:

Blast local vs remote

- blastall
  - Executed locally
  - Slow
  - No need to transfer db
- blastall.remote
  - Executed remotely
  - Fast
  - Requires special privileges and db transfer
- Using BioPerl (remoteblast.pm)
  - Blast at NCBI
  - No user db
  - See www.bioperl.org

Multiple Blasts?

- 1 seq vs db seq
  - 1 FASTA seq as input
- db seq vs db seq
  - Several single FASTA seq files as input or
  - 1 multiple FASTA seq file as input
- Possibility to export results as XML
- Use Perl to automate the queries and parse the output
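The slide suggests Perl for automation; the sketch below shows the same idea in Python for consistency with the other examples in this transcript. It splits a multiple FASTA string into single entries and builds one blastall command line per entry, using the legacy options -p, -d, -i, -o and -m 7 (XML output). The query sequences and file names are hypothetical, and the commands are only printed, not executed.

```python
def split_fasta(text):
    """Yield (identifier, record text) for each entry of a multiple FASTA string."""
    record = []
    for line in text.splitlines():
        if line.startswith(">") and record:
            yield record[0][1:].split()[0], "\n".join(record) + "\n"
            record = []
        if line.strip():
            record.append(line)
    if record:
        yield record[0][1:].split()[0], "\n".join(record) + "\n"

def blastall_command(program, database, query_file, output_file, xml=True):
    """Build the argument list for one legacy blastall run."""
    command = ["blastall", "-p", program, "-d", database,
               "-i", query_file, "-o", output_file]
    if xml:
        command += ["-m", "7"]  # -m 7 requests XML output
    return command

# Hypothetical queries; mydb is the database name used in the formatdb slide.
MULTI_FASTA = ">q1\nMACTRSLK\n>q2\nTVFGGRSL\n"
commands = [
    blastall_command("blastp", "mydb", f"{seq_id}.fasta", f"{seq_id}.xml")
    for seq_id, _record in split_fasta(MULTI_FASTA)
]
for command in commands:
    print(" ".join(command))
```

To actually run the searches, each record would be written to its query file and the command passed to `subprocess.run(command, check=True)` on a machine where blastall is installed.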

Parsing Blast output

BLASTP 2.2.10 [Oct-19-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
(325 letters)

Database: ecoli_blast
4339 sequences; 1,373,039 total letters

Searching...done

                                                                     Score    E
Sequences producing significant alignments:                          (bits) Value

ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyltransfe...     266  1e-72

Parsing Blast output (2)

>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
Length = 318

Score = 266 bits (681), Expect = 1e-72
Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
           L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
           +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
           E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
            EM+ L VP                       +ML+ STYSVISPEG A++L W K + A
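The statistics lines of an alignment block are regular enough to be read with two regular expressions. The sketch below is a minimal illustration on the score lines shown above, not a complete BLAST parser; the pattern names and the `parse_hsp_stats` function are hypothetical, and unusual E-value spellings are not handled.

```python
import re

SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*Expect\s*=\s*(\S+)"
)
IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")

def parse_hsp_stats(report):
    """Extract bit score, raw score, E-value and identities from one HSP."""
    score = SCORE_RE.search(report)
    ident = IDENT_RE.search(report)
    return {
        "bits": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "evalue": float(score.group(3)),
        "identities": int(ident.group(1)),
        "aligned": int(ident.group(2)),
    }

REPORT = """ Score =  266 bits (681), Expect = 1e-72
 Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)
"""
stats = parse_hsp_stats(REPORT)
print(stats)
```

Hand-written patterns like these break easily when the report layout changes between BLAST versions, which is one reason to prefer an existing parser or the XML output for anything beyond a quick check.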

Parsing Blast output (3)

- With BioPerl:

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Bio::SearchIO;

    # Open the BLAST report named on the command line.
    my $blast_report = Bio::SearchIO->new('-format' => 'blast',
                                          '-file'   => $ARGV[0]);

    print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

    # One result per query, one hit per database sequence, one HSP per alignment.
    while (my $result = $blast_report->next_result) {
        print $result->query_name(), "\t", $result->query_description(), "\n";
        while (my $hit = $result->next_hit()) {
            print "\t\t", $hit->name(), "\t", $hit->description();
            while (my $hsp = $hit->next_hsp()) {
                print "\t", $hsp->evalue(), "\t", $hsp->score();
            }
            print "\n";
        }
    }
    exit 0;

MS-Excel import/export

- Excel can import
  - Tab delimited
  - Comma delimited
- Excel can export
  - Tab delimited
  - Space delimited

AC/ID        desc                           score   e-value
THIO_ECOLI   thioredoxin Escherichia coli   234     2.1e-5
THIO_HUMAN   thioredoxin Homo sapiens       120     0.001

MS-Excel import/export

- Tab delimited file:
  - \t delimits the columns
  - \n delimits the lines
  - Optional first line contains the column titles
  - Example:

AC/ID\tdesc\tscore\te-value\n
THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n

MS-Excel import/export

- Comma delimited file:
  - , delimits the columns, each value is surrounded by ‘ ’
  - \n delimits the lines
  - Optional first line contains the column titles
  - Example:

AC/ID,desc,score,e-value\n
THIO_ECOLI,thioredoxin Escherichia coli,234,2.1e-5\n
THIO_HUMAN,thioredoxin Homo sapiens,120,0.001\n
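Both delimited formats can be produced with a standard library instead of hand-built strings. The sketch below uses Python's csv module on the example table; the helper name `to_delimited` is illustrative, and quoting every value with double quotes in the comma-delimited output is one common convention, chosen here as an assumption.

```python
import csv
import io

ROWS = [
    ["AC/ID", "desc", "score", "e-value"],
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]

def to_delimited(rows, delimiter, quote_all=False):
    """Serialise rows with the given column delimiter and \\n line endings."""
    buffer = io.StringIO()
    quoting = csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL
    writer = csv.writer(buffer, delimiter=delimiter, quoting=quoting,
                        lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

tab_text = to_delimited(ROWS, "\t")
comma_text = to_delimited(ROWS, ",", quote_all=True)

# Reading the comma-delimited text back gives the original table.
parsed = list(csv.reader(io.StringIO(comma_text)))
print(tab_text)
print(comma_text)
```

Quoting matters as soon as a description contains a comma; the tab-delimited form avoids the problem as long as no value contains a tab.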
