1
Sequence Formats and Retrieval



Charles Steward


2
Aims
• Acquaint you with different file formats and associated annotations.
• Introduce different nucleotide and protein databases.
• Show how to access different genomic data from a variety of databases, using UniProt and GQuery (Entrez).
3
Databases
• Nucleotide databases: DDBJ/EMBL/NCBI form the International Nucleotide Sequence Database Collaboration (INSDC) and store genomic, cDNA and EST sequences.
• Protein database: UniProt, which comprises Swiss-Prot (manually curated) and TrEMBL (automatically annotated) sequences.
• Accession numbers (a unique number or combination of letters and numbers assigned to each record in a database) identify such records.
4
Information is mirrored daily between DDBJ/EMBL/NCBI.
DDBJ/EMBL/GenBank: INSDC (International Nucleotide Sequence Database Collaboration).
DDBJ: DNA Data Bank of Japan.
CIB-DDBJ: Center for Information Biology and DNA Data Bank of Japan.
EBI: European Bioinformatics Institute.
ENA: European Nucleotide Archive, which contains the EMBL Nucleotide Sequence Database.
EMBL: European Molecular Biology Laboratory.
NCBI: National Center for Biotechnology Information.
IAM: International Advisory Meeting.
5
EMBL format
6
Abbreviations found in the EMBL flat file:
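Each line of an EMBL flat file starts with a two-letter code (for example ID, AC, DE, OS, SQ), and a record ends with `//`. The following is a minimal sketch of how those codes could be used to split one record into fields. The record text is a made-up illustration, not a real database entry, and the function name `parse_embl_record` is hypothetical. For real work a tested parser (for example from Biopython) would be a safer choice.

```python
from collections import defaultdict

# A tiny, invented record used only for illustration (not a real entry).
EXAMPLE_RECORD = """\
ID   X00001; SV 1; linear; mRNA; STD; HUM; 12 BP.
XX
AC   X00001;
XX
DE   Hypothetical example record for illustration only.
XX
OS   Homo sapiens (human)
XX
SQ   Sequence 12 BP; 3 A; 3 C; 3 G; 3 T; 0 other;
     acgtacgtac gt                                                            12
//
"""


def parse_embl_record(text):
    """Group the lines of one EMBL-style record by their two-letter line code."""
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue                      # ignore blank lines
        if line.startswith("//"):
            break                         # end-of-record terminator
        if in_sequence:
            # Sequence lines carry bases plus spacing and a running count;
            # keep only the letters.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, value = line[:2], line[5:].strip()
        if code == "XX":
            continue                      # XX lines are spacers only
        fields[code].append(value)
        if code == "SQ":
            in_sequence = True            # everything up to // is sequence
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result


record = parse_embl_record(EXAMPLE_RECORD)
print(record["AC"], record["sequence"])
```

Grouping values in lists reflects the fact that several line codes can occur more than once within a record.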
7
8
Sequence Read Archive (SRA) for next-generation sequencing submission.
The INSDC now accepts sequence data produced by next-generation sequencing machines.
This screenshot is taken from the ENA hosted at EBI. For further information go here:
http://www.ebi.ac.uk/ena/about/sra_submissions
http://www.ebi.ac.uk/ena/about/sra_format
9
NGS file formats
http://www.ensembl.org/info/website/upload/index.html
http://www.ensembl.org/info/website/upload/large.html
BAM format - a BAM file (.bam) is the binary version of a SAM file
bigBed format - indexed BED file (1 line per feature and at least 3 columns)
bigWig format - used for dense continuous data and displayed as a graph (wiggle plot)
VCF format - for variants
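To make the text-based formats above more concrete, here is a minimal sketch of reading one BED feature line (three mandatory tab-separated columns) and one VCF data line (eight fixed tab-separated columns). The function names and the example lines are invented for illustration. Binary formats such as BAM, bigBed and bigWig need dedicated libraries and are not covered here.

```python
def parse_bed_line(line):
    """Parse one BED feature line: chrom, start (0-based), end (exclusive)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 3:
        raise ValueError("a BED feature needs at least 3 columns")
    return {
        "chrom": cols[0],
        "start": int(cols[1]),
        "end": int(cols[2]),
        "extra": cols[3:],   # optional columns such as name, score, strand
    }


def parse_vcf_line(line):
    """Parse the fixed columns of one VCF data line (POS is 1-based)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, variant_id, ref, alt = cols[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),   # several alternate alleles are comma-separated
    }


# Invented example lines, for illustration only.
bed_feature = parse_bed_line("chr1\t100\t200\tfeatureA")
vcf_variant = parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50\tPASS\t.")
print(bed_feature["end"] - bed_feature["start"], vcf_variant["alt"])
```

Note the different coordinate conventions: BED starts are 0-based with an exclusive end, whereas VCF positions are 1-based.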
Next Generation Sequencing Courses
Wellcome Trust Next Generation Sequencing course, 6-13 April 2014
http://www.wellcome.ac.uk/Education-resources/Courses-and-conferences/
EBI, Monday, October 14, 2013 - Thursday, October 17, 2013
http://www.ebi.ac.uk/training/course/next-generation-sequencing-workshop-0
See here for more information:
10
11
GQuery (Entrez) entry point
http://www.ncbi.nlm.nih.gov/books/NBK3837/
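The slide shows the web entry point. As a programmatic counterpart, NCBI also offers the Entrez E-utilities. The sketch below only builds an ESearch URL and does not contact NCBI. The helper name `build_esearch_url` and the example search term are assumptions made for illustration, so check the E-utilities documentation for the parameters and usage limits before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the ESearch E-utility.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Return an ESearch URL for the given Entrez database and query term."""
    params = {"db": database, "term": term, "retmax": max_results}
    # urlencode escapes spaces and square brackets in the search term.
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Illustrative query only; the term is an invented example.
example_url = build_esearch_url(
    "nucleotide", "BRCA2[Gene Name] AND human[Organism]", max_results=5
)
print(example_url)
```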
12
Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule
Key:
Curated: NM_000000 (mRNA), NP_000000 (protein), NR_000000 (non-coding RNA)
Calculated: XM_000000 (predicted mRNA), XP_000000 (predicted protein), XR_000000 (predicted non-coding RNA)
NC_000000: chromosome
NT_000000: contig
Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
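The accession prefixes on this slide can be used to tell at a glance what kind of molecule a RefSeq record describes. A minimal sketch of such a lookup follows. The function name `classify_refseq` is hypothetical, and the table covers only the prefixes listed on the slide, not every RefSeq prefix.

```python
# Prefix -> (molecule type, status); status is None where the slide gives none.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NR_": ("non-coding RNA", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("predicted mRNA", "calculated"),
    "XR_": ("predicted non-coding RNA", "calculated"),
    "XP_": ("predicted protein", "calculated"),
    "NC_": ("chromosome", None),
    "NT_": ("contig", None),
}


def classify_refseq(accession):
    """Return (molecule type, status) for a RefSeq accession, or None if unknown."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)


for acc in ("NM_000000", "XP_000000", "NT_000000", "AB000000"):
    print(acc, classify_refseq(acc))
```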
13
CCDS
Comparison of common CDSs to form consensus gene set
QC by UCSC (filter out possible pseudogenes)
CCDS set is displayed in Ensembl/Vega/UCSC/NCBI
Build 104.0 – 27,752 agreed CCDS IDs
14
15
EBI search: access all databases
16
ENA sequence window
17
ENA data view
See here for a clone example:
18
UniProt
19
PE (protein existence) line
Format:
PE   Level: Evidence;
Values:
• 1: Evidence at protein level
• 2: Evidence at transcript level
• 3: Inferred from homology
• 4: Predicted
• 5: Uncertain
http://www.expasy.org/cgi-bin/lists?pe_criteria.txt
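Given the `PE   Level: Evidence;` layout above, the level can be pulled out of a flat-file line with a short regular expression. This is a sketch under the assumption that the line follows that layout exactly; the function name `parse_pe_line` is hypothetical.

```python
import re

# "PE", whitespace, a single digit, a colon, the evidence text, a final semicolon.
PE_PATTERN = re.compile(r"^PE\s+(\d):\s*(.+?);\s*$")


def parse_pe_line(line):
    """Return (level, evidence text) from a UniProt-style PE line."""
    match = PE_PATTERN.match(line)
    if match is None:
        raise ValueError("not a PE line: %r" % line)
    return int(match.group(1)), match.group(2)


level, evidence = parse_pe_line("PE   1: Evidence at protein level;")
print(level, evidence)
```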
"
20
TrEMBL entry
All information is automatically generated
21
Manually curated entry containing more information than TrEMBL
22
BLAST similarity searching
• Basic Local Alignment Search Tool
• There are many different databases available to search against, which may vary depending on which site you start from.
23
BLAST output
1) The score is a measure of the similarity of the query sequence to the subject sequence. It is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and subject sequence.
2) The E-value is an estimate of the number of matches with at least that score that would be expected to occur by chance. The E-value is calculated from the length of the query sequence, the size of the database and the score (or scoring system used), and so is specific to that search. Thus, two results from different databases may not be directly comparable.
But the take-home message: the smaller the E-value, the smaller the likelihood that the match has happened at random, and the more likely it is to be real.
For example:
0.000001: 1 in a million searches would produce a false positive with this score
0.01: 1 in 100 searches would produce a false positive with this score
1: 1 match above threshold is likely to be a false positive (FP)
100: 100 matches above threshold are likely to be FPs
For further details see Karlin & Altschul, PNAS 1990 87:2264-8
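To illustrate why the E-value on this slide depends on both the score and the size of the search, here is a minimal sketch based on the Karlin-Altschul relationship for normalised bit scores, E = m × n × 2^(-S'), where m is the query length, n the database length and S' the bit score. The function names are hypothetical, and real BLAST programs use effective (edge-corrected) lengths, so the numbers will not match BLAST output exactly.

```python
import math


def evalue_from_bit_score(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def probability_of_chance_hit(evalue):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-evalue)


# Invented numbers, for illustration only.
e_small_db = evalue_from_bit_score(40, 300, 1_000_000)
e_large_db = evalue_from_bit_score(40, 300, 2_000_000)
print(e_small_db, e_large_db, probability_of_chance_hit(e_small_db))
```

The same bit score gives twice the E-value when the database is twice as large, which is one reason results from different databases are not directly comparable. For small values, E and P are almost identical.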
"24
Worked examples