Database indexing
Laurent Falquet, Basel October, 2006
Swiss Institute of Bioinformatics
Swiss EMBnet node
Overview
Data access concept
sequential
direct
Indexing
EMBOSS
Fetch
Other
BLAST
Why indexing?
formatdb
Parsing output
Excel import/export
Tab delimited
Comma delimited
LF, Basel October 2006
Why indexing?
Human tendency to
classify and group
Examples:
Dictionary
Book
Library
DVD chapters
iPod play lists
Advantages:
Fast access
Easy data finding
Disadvantages:
Time to prepare indices
Data access: sequential vs direct
Sequential access: access times vary from very short to very long
Direct access: very small variations in access time
(Figure labels: track, sector)
Similar concept for databases
Flat files = sequential
Indexing = simulated direct
>seq1
cgatgtcatgtg
>seq2
cgatcgtagctgtagctgtag
>seq3
catgtgcatgcgacgt
ID     Position (byte)   Length (byte)
seq1   0                 19
seq2   19                28
seq3   47                23
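A minimal Perl sketch of the idea, assuming one-line FASTA headers and Unix line endings: the flat file is read once sequentially to record the byte position and length of each entry, after which any entry can be fetched with a single seek. The function names and the temporary file name are illustrative only, not part of any tool mentioned here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an index: ID => [byte position, byte length] for each FASTA record.
sub build_index {
    my ($path) = @_;
    my (%index, $id);
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    while (1) {
        my $pos  = tell($fh);          # position before reading the line
        my $line = <$fh>;
        last unless defined $line;
        if ($line =~ /^>(\S+)/) {      # header line starts a new record
            $id = $1;
            $index{$id} = [$pos, 0];
        }
        $index{$id}[1] += length($line) if defined $id;
    }
    close($fh);
    return \%index;
}

# Fetch one record by jumping straight to its position (simulated direct access).
sub fetch_record {
    my ($path, $index, $id) = @_;
    my $entry = $index->{$id} or return;
    my ($pos, $len) = @$entry;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    seek($fh, $pos, 0);
    read($fh, my $record, $len);
    close($fh);
    return $record;
}

# Demo with the three sequences shown above, written to a temporary file.
my $file = "demo_$$.fasta";
open(my $out, '>:raw', $file) or die "Cannot write $file: $!";
print $out ">seq1\ncgatgtcatgtg\n>seq2\ncgatcgtagctgtagctgtag\n>seq3\ncatgtgcatgcgacgt\n";
close($out);

my $index = build_index($file);
printf "%s\t%d\t%d\n", $_, @{ $index->{$_} } for sort keys %$index;
print fetch_record($file, $index, 'seq2');
unlink $file;
```

The lengths include the header line and the newline characters, which is why seq1 occupies 19 bytes although its sequence has only 12 letters.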
Tools
EMBOSS
dbxflat
dbxfasta
dbiblast
seqret
seqretsplit
entret
Other examples
SRS (icarus language)
http://srs.ebi.ac.uk
http://www.lionbioscience.com/
indexer & fetch (warning: local SIB tool)
Relational (MySQL, Oracle…)
Web (Google!!)
EMBOSS: how to index?
Where is your file?
What is the format?
Where should the indices be?
Where is the emboss.default file? (.embossrc)
Other EMBOSS tools
textsearch
whichdb
More details
www.emboss.org
EMBOSS example
Input file and directory
~/embossidx/ECOLI.dat
cd embossidx
Index creation
dbxflat -idformat swiss -dbname ecoli -filenames '*.dat' -dbresource swiss -directory . -release 1.0 -date 26/09/06 -fields id,acc
Generates 5 files (default)
ECOLI.ent
ECOLI.pxac
ECOLI.pxid
ECOLI.xac
ECOLI.xid
.embossrc
set emboss_filter 1
# Ecoli
DB ecoli [
  type: P
  comment: "E.coli proteome"
  method: emboss
  format: swiss
  dir: "{path}/embossidx"
  file: "ECOLI.dat"
  release: "1.0"
  indexdir: "{path}/embossidx"
]
Where {path} is the path to your home directory
Example of queries
seqret ecoli:thio_ecoli
seqret ecoli:P00274
entret ecoli:thio_ecoli
and even
seqret 'ecoli:*_ECOLI'
Indexer & fetch
Warning: this is a local SIB tool!
Input file and directory
~/embossidx/ECOLI.dat
cd embossidx
Index creation
indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx
Generates 1 file
ecoli.idx
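How such an index might be built can be sketched in Perl. This is a hypothetical re-implementation and not the SIB tool itself: it assumes that the three patterns on the command line mark the first line of an entry, its last line, and the key to extract, and that keys are stored case-insensitively. The two entries are abbreviated for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch; the real SIB indexer may behave differently.
# Returns key => [byte position, byte length] for each complete entry.
sub index_records {
    my ($fh, $start_re, $end_re, $key_re) = @_;
    my (%index, $key, $start);
    while (1) {
        my $pos  = tell($fh);
        my $line = <$fh>;
        last unless defined $line;
        if ($line =~ $start_re) {          # first line of an entry
            $start = $pos;
            ($key) = $line =~ $key_re;
        }
        if ($line =~ $end_re && defined $key) {
            # Store keys in lower case so that lookups are case-insensitive.
            $index{ lc $key } = [ $start, tell($fh) - $start ];
            undef $key;
        }
    }
    return \%index;
}

# Two abbreviated entries in a Swiss-Prot-like layout (illustration only).
my $data = "ID   THIO_ECOLI   STANDARD;\nAC   P00274;\n//\n"
         . "ID   ACCA_ECOLI   STANDARD;\nAC   P30867;\n//\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my $index = index_records($fh, qr/^ID/, qr{^//}, qr/^ID\s+(\S+)/);
close($fh);

for my $key (sort keys %$index) {
    my ($pos, $len) = @{ $index->{$key} };
    print "$key\t$pos\t$len\n";
    print substr($data, $pos, $len);       # the entry, fetched by position
}
```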
Config file: fetch.conf
#dbkey format indexfile datafile
ecoli sp ~/embossidx/ecoli.idx ~/embossidx/ECOLI.dat
Example of queries
fetch -c fetch.conf ecoli:thio_ecoli
fetch -c fetch.conf -f 'ecoli:thio_ecoli[20..50]'
BLAST
Maintained at NCBI
Source distributed freely with
several accessory tools
ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
May require compilation to
install on your local computer
blastall contains
blastp
blastn
blastx
tblastn
tblastx
Other tools
blastpgp
megablast
formatdb
Available Blast programs
Program   Query                                 Database
blastp    protein                               protein
blastn    nucleotide                            nucleotide
blastx    nucleotide (translated to protein)    protein
tblastn   protein                               nucleotide (translated to protein)
tblastx   nucleotide (translated to protein)    nucleotide (translated to protein)
What makes BLAST so fast?
Indexing all words of 3 aa or 11 bp in the sequence database
Searching the query for all words of a score > T
Search the indexed database for all perfect matches
Try to align matches that are
Indexing for Blast (1)
[Figure: each query word (e.g. REL) is compared with the list of all possible words with 3 amino acid residues (8000 words: AAA, AAC, AAD, ..., YYY). Words with score > T (ACT, RSL, TVF, ...) go into the list of words matching the query; words with score < T (e.g. LKP) are dropped.]
A substitution matrix is used to compute the word scores
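A small Perl sketch of this word-scoring step. The score table and the reduced nine-letter alphabet are toy assumptions chosen to keep the example short; a real search would use a full substitution matrix over all 20 residues, and the threshold value 8 is arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy substitution scores (assumption: a real search uses a full matrix).
my %score = (
    RR => 5, RK => 2, RQ => 1,
    EE => 5, ED => 2, EQ => 2, ES => 0,
    LL => 4, LI => 2, LM => 2,
);
my @alphabet = qw(R K Q E D S L I M);   # reduced alphabet for the demo

# Score of one residue pair; pairs missing from the toy table get -2.
sub pair_score {
    my ($x, $y) = @_;
    return $score{"$x$y"} // $score{"$y$x"} // -2;
}

# Score of a candidate word against a query word of the same length.
sub word_score {
    my ($query, $word) = @_;
    my $total = 0;
    $total += pair_score(substr($query, $_, 1), substr($word, $_, 1))
        for 0 .. length($query) - 1;
    return $total;
}

# Enumerate all words over the alphabet and keep those scoring > threshold.
sub neighbourhood {
    my ($query, $threshold) = @_;
    my @words = ('');
    for (1 .. length $query) {
        @words = map { my $prefix = $_; map { $prefix . $_ } @alphabet } @words;
    }
    return grep { word_score($query, $_) > $threshold } @words;
}

my @hits = neighbourhood('REL', 8);
print join(' ', sort @hits), "\n";
```

With these toy scores the word RSL passes the threshold for the query word REL, while most of the 729 candidate words do not.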
Indexing for Blast (2)
[Figure: the list of words matching the query with a score > T (ACT, RSL, TVF, ...) is searched for in the database sequences.]
List of sequences containing words similar to the query (hits)
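The lookup step can be sketched in Perl as a hash from every 3-letter word to the places where it occurs. The database sequences below are made up for illustration, and the three search words are simply assumed to have scored > T against the query.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Index every overlapping word of length $k:
# word => list of [sequence ID, offset].
sub index_words {
    my ($db, $k) = @_;
    my %index;
    for my $id (sort keys %$db) {
        my $seq = $db->{$id};
        for my $pos (0 .. length($seq) - $k) {
            push @{ $index{ substr($seq, $pos, $k) } }, [$id, $pos];
        }
    }
    return \%index;
}

# Made-up database sequences for illustration.
my %db = (
    dbseq1 => 'MKACTRSLQ',
    dbseq2 => 'GTVFARSLW',
    dbseq3 => 'PLKPLKPGG',
);
my $index = index_words(\%db, 3);

# Words assumed to have scored > T against the query.
for my $word (qw(ACT RSL TVF)) {
    my @hits = map { "$_->[0]:$_->[1]" } @{ $index->{$word} || [] };
    print "$word\t@hits\n";
}
```

Each lookup is a single hash access, so finding all perfect matches does not require scanning the database sequences again.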
Indexing for Blast (3)
[Figure: two plots of the query against a database sequence, with a window of width A along the diagonal.]
Ungapped extension if: 2 "hits" are on the same diagonal and at a distance less than A
Extension using dynamic programming, limited to a restricted region through a score drop-off threshold
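The two-hit condition can be sketched in Perl as follows. A hit is represented as a pair of offsets (query, database), and two hits lie on the same diagonal when the difference between the two offsets is equal. The hit positions and the value used for A are made up, and this simplified test ignores details such as overlapping words.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each hit is [query offset, database offset]; the diagonal is their difference.
# Returns [diagonal, first hit, second hit] for neighbouring hits on the same
# diagonal that are less than $max_dist apart.
sub two_hit_seeds {
    my ($hits, $max_dist) = @_;
    my %by_diag;
    push @{ $by_diag{ $_->[1] - $_->[0] } }, $_ for @$hits;

    my @seeds;
    for my $diag (sort { $a <=> $b } keys %by_diag) {
        my @sorted = sort { $a->[0] <=> $b->[0] } @{ $by_diag{$diag} };
        for my $i (1 .. $#sorted) {
            my $dist = $sorted[$i][0] - $sorted[ $i - 1 ][0];
            push @seeds, [ $diag, $sorted[ $i - 1 ], $sorted[$i] ]
                if $dist < $max_dist;
        }
    }
    return @seeds;
}

my @hits = ([3, 10], [12, 19], [40, 47], [5, 30]);   # made-up hit positions
for my $seed (two_hit_seeds(\@hits, 20)) {
    my ($diag, $first, $second) = @$seed;
    print "diagonal $diag: query offsets $first->[0] and $second->[0]\n";
}
```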
BLAST indexing with formatdb
formatdb
mydb.seq must contain sequences in FASTA format
formatdb -i mydb.seq -p T -n mydb
Generates 3 files
mydb.psq
mydb.pin
mydb.phr
Then start a Blast:
Blast local vs remote
blastall
Executed locally
Slow
No need to transfer db
blastall.remote
Executed remotely
Fast
Requires special privileges and db transfer
Using BioPerl (RemoteBlast.pm)
Blast at NCBI
No user db
See www.bioperl.org
Multiple Blasts?
1 seq vs db seq
1 FASTA seq as input
db seq vs db seq
Several single FASTA seq files as input, or 1 multiple FASTA seq file as input
Possibility to export results as XML
Use Perl to automate the queries and parse the output
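A hedged Perl sketch of such automation: it splits a multiple FASTA text into single entries and builds one blastall command line per entry. The sequences, file names and database name are made up, and the commands are only printed here, not executed. The option -m 7 requests XML output from blastall.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split multi-FASTA text into records keyed by sequence ID.
sub split_fasta {
    my ($text) = @_;
    my %records;
    for my $chunk (split /^(?=>)/m, $text) {
        next unless $chunk =~ /^>(\S+)/;
        $records{$1} = $chunk;
    }
    return \%records;
}

# Command line for one query file; -m 7 asks blastall for XML output.
sub blast_command {
    my ($query_file, $db) = @_;
    return ('blastall', '-p', 'blastp', '-d', $db, '-i', $query_file,
            '-m', '7', '-o', "$query_file.xml");
}

my $multi = ">seqA\nMKTAYIAKQR\n>seqB\nGSHMLEDPVA\n";   # made-up sequences
my $records = split_fasta($multi);
for my $id (sort keys %$records) {
    my $file = "$id.fasta";
    # In a real run: write $records->{$id} to $file, then call system(@cmd).
    my @cmd = blast_command($file, 'mydb');
    print "@cmd\n";
}
```

Passing the command as a list to system() avoids shell quoting problems when file names contain spaces.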
Parsing Blast output
BLASTP 2.2.10 [Oct-19-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
transferase subunit alpha (EC 6.4.1.2).
(325 letters)
Database: ecoli_blast
4339 sequences; 1,373,039 total letters
Searching...done
                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value
ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyltransfe...   266   1e-72
Parsing Blast output (2)
>ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha (EC 6.4.1.2).
Length = 318
Score = 266 bits (681), Expect = 1e-72
Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)
Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
           L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64
Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
           +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124
Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
           E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184
Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
            EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
Parsing Blast output (3)
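With BioPerl:
#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SearchIO;

# Open the BLAST report given as first command-line argument
my $blast_report = Bio::SearchIO->new(-format => 'blast', -file => $ARGV[0]);

# Header line of the tab-delimited output
print "Query name:\tQuery description:\tHit name:\tHit description:\tE-value\tScore\n";

# One result per query sequence in the report
while (my $result = $blast_report->next_result) {
    print $result->query_name(), "\t", $result->query_description(), "\n";
    # One hit per matching database sequence
    while (my $hit = $result->next_hit()) {
        print "\t\t", $hit->name(), "\t", $hit->description();
        # One HSP per aligned segment of the hit
        while (my $hsp = $hit->next_hsp()) {
            print "\t", $hsp->evalue(), "\t", $hsp->score();
        }
        print "\n";
    }
}
exit 0;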
MS-Excel import/export
Excel can import
Tab delimited
Comma delimited
Excel can export
Tab delimited
Space delimited
AC/ID        desc                           score   e-value
THIO_ECOLI   thioredoxin Escherichia coli   234     2.1e-5
THIO_HUMAN   thioredoxin Homo sapiens       120     0.001
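A short Perl sketch of writing and reading such a table as tab-delimited text, using the two rows shown above. The function names are illustrative; the sketch assumes that no field contains a tab or a newline, which holds for this kind of data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @header = ('AC/ID', 'desc', 'score', 'e-value');
my @rows = (
    ['THIO_ECOLI', 'thioredoxin Escherichia coli', 234, '2.1e-5'],
    ['THIO_HUMAN', 'thioredoxin Homo sapiens',     120, '0.001'],
);

# One record per line, fields separated by a tab character.
sub to_tsv {
    my @lines = map { join("\t", @$_) } @_;
    return join("\n", @lines) . "\n";
}

# Reverse operation: split each line on tabs.
sub from_tsv {
    my ($text) = @_;
    return map { [ split /\t/, $_, -1 ] } split /\n/, $text;
}

my $tsv = to_tsv(\@header, @rows);
print $tsv;
my @parsed = from_tsv($tsv);
print scalar(@parsed), " lines read back\n";
```

Tab is a convenient delimiter here because the description field contains spaces, so a space-delimited export would not split into the same columns.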