Module 1. Sequence Formats and Retrieval. Charles Steward

(1)

Sequence Formats and Retrieval

Charles Steward

(2)

Aims

• Acquaint you with different file formats and associated annotations.

• Introduce different nucleotide and protein databases.

• Show how to access different genomic data from a variety of databases, using UniProt and GQuery (Entrez).

(3)

Databases

• Nucleotide databases: DDBJ/EMBL/NCBI form the International Nucleotide Sequence Database Collaboration (INSDC) and store genomic, cDNA and EST sequences.

• Protein database: UniProt, comprising Swiss-Prot (manually curated) and TrEMBL (automated annotation) sequences.

• Accession numbers (a unique number or combination of letters and numbers assigned to each record in a database) identify such

(4)

Information is mirrored daily between DDBJ/EMBL/NCBI.

DDBJ/EMBL/GenBank: INSDC (International Nucleotide Sequence Database Collaboration).

DDBJ: DNA Data Bank of Japan.

CIB-DDBJ: Center for Information Biology and DNA Data Bank of Japan.

EBI: European Bioinformatics Institute.

ENA: European Nucleotide Archive, which contains the EMBL Nucleotide Sequence Database.

EMBL: European Molecular Biology Laboratory.

NCBI: National Center for Biotechnology Information.

IAM: International Advisory Meeting.

(5)

EMBL format

(6)

Abbreviations found in the EMBL flat file:
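As a rough illustration of how these abbreviations are used, the sketch below groups the lines of a flat-file record by their two-letter line code (for example ID, AC, DE, OS, SQ, with // ending the record). The record is made up for illustration and is not a real database entry, and the function name `group_by_line_code` is hypothetical. It assumes the usual layout of a two-letter code followed by three spaces.

```python
from collections import defaultdict

# A tiny, invented record used only for illustration (not a real entry).
EXAMPLE_RECORD = """\
ID   X00000; SV 1; linear; mRNA; STD; HUM; 30 BP.
AC   X00000;
DE   Hypothetical example mRNA.
OS   Homo sapiens
SQ   Sequence 30 BP;
     atggcgtacg ttagcctagg atccgatcga                                        30
//
"""


def group_by_line_code(record):
    """Group flat-file lines by their two-letter line code."""
    fields = defaultdict(list)
    in_sequence = False
    for line in record.splitlines():
        if line.startswith("//"):  # '//' terminates the record
            break
        code = line[:2]
        text = line[5:].strip()
        if code == "SQ":
            in_sequence = True
            fields["SQ"].append(text)
        elif in_sequence and code == "  ":
            # Sequence lines: keep the base blocks, drop the trailing counter.
            bases = "".join(tok for tok in text.split() if tok.isalpha())
            fields["sequence"].append(bases)
        elif code.strip():
            fields[code].append(text)
    return dict(fields)


fields = group_by_line_code(EXAMPLE_RECORD)
print(fields["AC"], fields["DE"], "".join(fields["sequence"]))
```

For real work an established parser (for example from Biopython) is likely a better choice than a hand-written one.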

(7)

(8)

Sequence Read Archive (SRA) for next-generation sequencing submission.

The INSDC now accepts sequence data produced by next-generation sequencing machines.

This screenshot is taken from the ENA hosted at EBI. For further information go here:

http://www.ebi.ac.uk/ena/about/sra_submissions

http://www.ebi.ac.uk/ena/about/sra_format

(9)

NGS file formats

http://www.ensembl.org/info/website/upload/index.html

http://www.ensembl.org/info/website/upload/large.html

BAM format - a BAM file (.bam) is the binary version of a SAM file.

bigBed format - an indexed BED file (1 line per feature and at least 3 columns).

bigWig format - used for dense continuous data and displayed as a graph (wiggle plot).

VCF format - for variants.
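To make the "1 line per feature and at least 3 columns" rule for BED concrete, here is a minimal sketch that parses a single BED line. It assumes the three required columns are chromosome, start and end, with a 0-based start and an end that is not included in the feature. The names `BedFeature` and `parse_bed_line` are illustrative, not part of any library.

```python
from typing import NamedTuple, Tuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based start position
    end: int    # end position, not included in the feature
    extra: Tuple[str, ...]  # any optional columns after the first three


def parse_bed_line(line):
    """Parse one whitespace-separated BED line into a BedFeature."""
    cols = line.split()
    if len(cols) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    return BedFeature(cols[0], int(cols[1]), int(cols[2]), tuple(cols[3:]))


feature = parse_bed_line("chr1\t100\t200\tfeatureA")
print(feature.chrom, feature.end - feature.start)
```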

Next Generation Sequencing Courses

See here for more information:

Wellcome Trust Next Generation Sequencing course, 6-13 April 2014

http://www.wellcome.ac.uk/Education-resources/Courses-and-conferences/

EBI, Monday, October 14, 2013 - Thursday, October 17, 2013

http://www.ebi.ac.uk/training/course/next-generation-sequencing-workshop-0

(10)

(11)

GQuery (Entrez) entry point

http://www.ncbi.nlm.nih.gov/books/NBK3837/
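Entrez records can also be retrieved from a script. The sketch below only builds a request URL and does not contact NCBI. It assumes the NCBI E-utilities EFetch endpoint and its `db`, `id`, `rettype` and `retmode` parameters. The accession is a placeholder rather than a real record, and the function name `efetch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed here; check NCBI's
# documentation for current usage guidelines before sending requests).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that would return one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Placeholder accession, used only to show the shape of the URL.
print(efetch_url("NM_000000"))
```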

(12)

Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule

Key: curated (NM_, NR_, NP_); calculated (XM_, XR_, XP_)

NC_000000: chromosome

NT_000000: contig

NM_000000: mRNA

NR_000000: non-coding RNA

NP_000000: protein

XM_000000: predicted mRNA

XR_000000: predicted non-coding RNA

XP_000000: predicted protein

Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
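The accession prefixes on this slide can be turned into a simple lookup. This is a minimal sketch covering only the prefixes shown here (RefSeq defines others), and the function name `classify_refseq` is illustrative.

```python
# Prefix-to-molecule mapping, limited to the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NT": "contig",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XR": "predicted non-coding RNA",
    "XP": "predicted protein",
}


def classify_refseq(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix, separator, _ = accession.partition("_")
    if not separator or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unrecognised RefSeq accession: {accession!r}")
    return REFSEQ_PREFIXES[prefix]


for acc in ("NM_000000", "XP_000000", "NC_000000"):
    print(acc, "->", classify_refseq(acc))
```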

(13)

CCDS

Comparison of common CDSs to form consensus gene set

QC by UCSC (filter out possible pseudogenes)

CCDS set is displayed in Ensembl/Vega/UCSC/NCBI

Build 104.0 – 27,752 agreed CCDS IDs

(14)

(15)

EBI search: access all databases

(16)

ENA sequence window

(17)

ENA data view

See here for a clone example:

(18)

UniProt

(19)

PE (protein existence) line

Format

PE   Level:Evidence;

Values

• 1: Evidence at protein level

• 2: Evidence at transcript level

• 3: Inferred from homology

• 4: Predicted

• 5: Uncertain

http://www.expasy.org/cgi-bin/lists?pe_criteria.txt
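The PE line format above can be read with a short regular expression. The sketch assumes a line of the form `PE   1: Evidence at protein level;`, and the helper name `parse_pe_line` is hypothetical.

```python
import re

# 'PE', whitespace, a one-digit level, a colon, the evidence text, then ';'.
PE_PATTERN = re.compile(r"^PE\s+(\d):\s*(.+?);\s*$")


def parse_pe_line(line):
    """Return (level, evidence) from a UniProt PE line."""
    match = PE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a PE line: {line!r}")
    return int(match.group(1)), match.group(2)


level, evidence = parse_pe_line("PE   1: Evidence at protein level;")
print(level, evidence)
```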

(20)

TrEMBL entry

All information is automatically generated

(21)

Manually curated entry containing more information than TrEMBL

(22)

BLAST similarity searching

• Basic Local Alignment Search Tool

• There are many different databases available to search against, which may vary depending on which site you start from.

(23)

BLAST output

1) The score is a measure of the similarity of the query sequence to the subject sequence. It is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and subject sequence.

2) The E-value is an estimate of how many matches with that score would be expected to occur by chance in that search. The E-value is calculated from the length of the query sequence, the size of the database and the score (or scoring system used), and so is specific to that search. Thus, two results from different databases may not be directly comparable.

But the take-home message: the smaller the E-value, the less likely it is that the match happened at random, and the more likely it is to be real.

For example:

E-value 0.000001: 1 in a million searches would produce a false positive with this score.

E-value 0.01: 1 in 100 searches would produce a false positive with this score.

E-value 1: 1 match above threshold is likely to be a false positive (FP).

E-value 100: 100 matches above threshold are likely to be FPs.

For further details see Karlin & Altschul, PNAS 1990, 87:2264-8.
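The way the E-value depends on the size of the search can be sketched with the commonly quoted simplified formula E = m × n × 2^(−S'), where m is the query length, n is the total database length and S' is the bit score. This is an assumption for illustration only. Real BLAST programs apply further corrections such as effective lengths, so the numbers will not match their output exactly, and the function name `expected_hits` is hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """Simplified E-value: matches expected by chance at this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# The same alignment score is less significant in a larger database.
small_db = expected_hits(bit_score=40, query_length=100, database_length=1_000_000)
large_db = expected_hits(bit_score=40, query_length=100, database_length=100_000_000)
print(f"{small_db:.2e} {large_db:.2e}")
```

Holding the score fixed, a database 100 times larger gives an E-value 100 times larger, which is why E-values from searches against different databases should not be compared directly.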

(24)

Worked examples