Using the Ensembl Core Databases with the Perl API

(1)

(2)

Useful links

•  Installation instructions: http://www.ensembl.org/info/docs/api/api_installation.html

•  Schema description: http://www.ensembl.org/info/docs/api/core/core_schema.html

•  Tutorial: http://www.ensembl.org/info/docs/api/core/core_tutorial.html

•  Documentation (Doxygen): http://www.ensembl.org/info/docs/Doxygen/core-api/index.html

•  Ensembl-dev mailing list: http://www.ensembl.org/info/about/contact/mailing.html

•  Ensembl helpdesk:

(3)

Ensembl databases

•  MySQL

•  Species-specific databases:

core: genomic sequences and most annotation

variation: genetic variation

funcgen: regulatory elements

•  Cross-species database:

(4)

Public MySQL servers

Ensembl

host      ensembldb.ensembl.org
user      anonymous
password  -
port      3306 (up to version 47), 5306 (version 48 onwards)

Ensembl Genomes

host      mysql.ebi.ac.uk
user      anonymous
password  -
port      4157
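As a small illustration of the port rule above, the following sketch builds the arguments for the standard mysql command-line client. The helper names ensembl_port and mysql_args are made up for this example; the host, user and port numbers are the ones listed on this slide.

```perl
use strict;
use warnings;

# Port of the public Ensembl MySQL server for a given release:
# 3306 up to version 47, 5306 from version 48 onwards.
sub ensembl_port {
    my ($release) = @_;
    return $release <= 47 ? 3306 : 5306;
}

# Arguments for the mysql command-line client (hypothetical helper).
sub mysql_args {
    my ($release) = @_;
    return ('-h', 'ensembldb.ensembl.org',
            '-u', 'anonymous',
            '-P', ensembl_port($release));
}

print join(' ', 'mysql', mysql_args(66)), "\n";
```

No password is needed for the anonymous user, so no password option is passed.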

(5)

Ensembl Core databases

•  The Ensembl Core databases store:

genomic sequence assembly information

gene, transcript and protein models

cDNA and protein alignments

cytogenetic bands, markers, repeats, CpG islands etc.

external references

•  Database names are built from four parts, e.g. homo_sapiens_core_66_37:

homo_sapiens  species
core          group
66            software version (release)
37            assembly version
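The naming pattern can be taken apart with a regular expression. This is only a sketch: parse_db_name is a hypothetical helper, and it recognises just the three species-specific groups named in these slides (core, variation, funcgen).

```perl
use strict;
use warnings;

# Split a database name such as homo_sapiens_core_66_37 into its parts.
# Returns a hash reference, or nothing if the name does not fit the pattern.
sub parse_db_name {
    my ($db_name) = @_;
    my ($species, $group, $release, $assembly) =
        $db_name =~ /^(.+)_(core|variation|funcgen)_(\d+)_(\w+)$/
        or return;
    return {
        species  => $species,
        group    => $group,
        release  => $release,
        assembly => $assembly,
    };
}

my $parts = parse_db_name('homo_sapiens_core_66_37');
print "$parts->{species}: group $parts->{group}, ",
      "release $parts->{release}, assembly $parts->{assembly}\n";
```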

(6)

MySQL

•  Very good knowledge of database schemas needed

•  Queries can quickly become very complex

(7)

Ensembl Core Perl API

•  Used to retrieve data from and store data in the Ensembl Core databases

•  Written in Object-Oriented Perl

•  Partly based on and compatible with BioPerl (version 1.2.3) objects (http://www.bioperl.org)

•  Used by the Ensembl analysis and annotation pipeline and the Ensembl web code

•  Robust, reliable and well-supported

(8)

What do we need?

•  Perl

•  BioPerl 1.2.3 (this is not the latest BioPerl version!)

•  Ensembl API:

http://www.ensembl.org/info/docs/api/api_installation.html

(9)

Versioning

•  API version must match database version

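•  Old scripts using the API should continue working with a newer API!

[Diagram: your Perl script run through API 65 against database 65 gives the output for e!65; the same script run through API 66 against database 66 gives the output for e!66]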

(10)

Data objects

•  Data objects model biological entities, e.g. genes, regulatory elements, variations, …

•  Each data object encapsulates information from one or a few specific MySQL tables

•  Name space: object modules start with Bio::EnsEMBL, e.g.

(11)

(12)

Object adaptors

•  Data objects are retrieved from and stored in the database using object adaptors

•  Object adaptors are data object factories

•  Each object adaptor is responsible for creating data objects of only one particular type

•  Name space: object adaptor modules start with Bio::EnsEMBL::DBSQL, e.g.

(13)

The Registry

•  The Registry is an object adaptor factory

•  Loads all databases of the same version as the API

(14)

Each script should start like this…

#!/usr/bin/perl -w

use strict;

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

## Load the databases into the registry
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

## Get the object adaptor for the object you're interested in
my $gene_adaptor  = $registry->get_adaptor('Human', 'Core', 'Gene');
my $slice_adaptor = $registry->get_adaptor('Human', 'Core', 'Slice');

(15)

Coordinate systems

•  Sequences stored in Ensembl are associated with sequence regions

•  Sequence regions are linked to a distinct hierarchy of coordinate systems

•  Coordinate systems vary from species to species:

human: chromosome, supercontig, clone, contig

zebrafish: chromosome, scaffold, contig

•  Sequence information is directly stored in the database for the ‘sequence level’ coordinate system

•  The coordinate system of the highest level in a given region is

(16)

Coordinate systems

[Figure: a chromosome (top level coordinate system), clones (tiling path) and contigs (sequence level coordinate system), with overlapping example contig sequences CCAGGCAGCGGGTT, GGTTAAGGCTTTTGATTTAGGGAG and AGGGAGAGGGACCTGG]

(17)

CoordSystem object

•  Retrieve using CoordSystemAdaptor

Attribute   Example value(s)                      Method(s)
name        chromosome, scaffold, contig, clone   name

(18)

Slices

•  A slice represents an arbitrary region of a genome

•  Slices are not directly stored in the database

•  Slices are used to obtain sequences or features from a

(19)

Slice object

•  Retrieve using SliceAdaptor

Attribute                Example value(s)                     Method(s)
coordinate system name   chromosome, scaffold, clone          coord_system_name
sequence region name     Y, Zv9_scaffold1219, AADC0109557.1   seq_region_name
start                    1                                    start
end                      59373566                             end
length                   59373566                             length
strand                   1, -1                                strand
name                     chromosome:GRCh37:Y:1:59373566:1     name
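The slice name in the last row packs the other attributes into one colon-separated string (coordinate system, version, sequence region, start, end, strand). A minimal sketch of taking such a name apart in plain Perl follows; parse_slice_name is a hypothetical helper and not part of the Ensembl API.

```perl
use strict;
use warnings;

# Split a slice name into its six colon-separated fields and add the
# length, which follows from the inclusive start and end coordinates.
sub parse_slice_name {
    my ($name) = @_;
    my %slice;
    @slice{qw(coord_system version seq_region start end strand)} =
        split /:/, $name;
    $slice{length} = $slice{end} - $slice{start} + 1;
    return \%slice;
}

my $slice = parse_slice_name('chromosome:GRCh37:Y:1:59373566:1');
printf "%s %s, %d bp\n",
    $slice->{coord_system}, $slice->{seq_region}, $slice->{length};
```

With a real Slice object the same values come from the methods in the table, so parsing the name is rarely needed in practice.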

(20)

Exercise 1

An easy exercise to get started:

Fetch the slice corresponding to basepair 32890000 to 32891000 of human chromosome 13 and print its sequence.

What do you need first, when you want to retrieve a slice?

Have a look in the Doxygen documentation at the list of methods available for the object(s) you’re using:

http://www.ensembl.org/info/docs/Doxygen/core-api/index.html

If you have time left:

Print the soft-masked and hard-masked versions as well as the reverse complement of the above sequence.

(21)

Features

•  Features have a defined location on the genome

•  All features have a start, end, strand and slice

•  The start coordinate of a feature is always less than its end coordinate, irrespective of the strand on which it is located (exception: insertion features)

(22)

Features

Object                                 Represent(s)
Gene, Transcript, Exon                 Ensembl gene models
PredictionTranscript, PredictionExon   Genscan gene models
DNAAlignFeature, ProteinAlignFeature   cDNAs, proteins
RepeatFeature                          repeats
MarkerFeature                          markers
OligoFeature                           microarray probes
KaryotypeBandFeature                   cytogenetic bands
SimpleFeature                          results of cpg, Eponine, FirstEF and tRNAscan
MiscFeature                            clones, ENCODE regions

(23)

Inheritance

•  Data objects inherit methods from their parent object

•  So, for example, all methods that apply to the Feature object also apply to its children, i.e. the Gene object, the Transcript object, the Exon object etc.
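The idea can be sketched in plain Perl with two toy classes. Toy::Feature and Toy::Gene are invented for this illustration and are not the real Bio::EnsEMBL classes; the coordinates are made up as well.

```perl
use strict;
use warnings;

# Parent class: anything with a location has these methods.
package Toy::Feature;
sub new    { my ($class, %args) = @_; return bless {%args}, $class }
sub start  { $_[0]{start} }
sub end    { $_[0]{end} }
sub length { $_[0]{end} - $_[0]{start} + 1 }

# Child class: adds its own method and inherits the rest.
package Toy::Gene;
our @ISA = ('Toy::Feature');
sub stable_id { $_[0]{stable_id} }

package main;

my $gene = Toy::Gene->new(
    stable_id => 'ENSG00000139618',
    start     => 101,
    end       => 200,
);

# start, end and length are found in Toy::Feature via @ISA
printf "%s: %d-%d (%d bp)\n",
    $gene->stable_id, $gene->start, $gene->end, $gene->length;
```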

(24)

Feature object

•  Retrieve by using FeatureAdaptor

•  Retrieve from Slice

Attribute       Example value(s)                                        Method(s)
name            AluSp, D13S1788                                         display_id
coordinates     13                                                      seq_region_name
                22398                                                   start
                22594                                                   end
                32912008                                                seq_region_start
                32912204                                                seq_region_end
                1                                                       strand
sequence        GATTGGTCAGGTAGACAGCAGCAAG ...                           seq
length          196                                                     length
slice           returns Slice object with which feature is associated   slice
feature slice   returns Slice object that covers feature                feature_Slice

slice relative
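The relation between the slice-relative coordinates (start, end) and the absolute ones (seq_region_start, seq_region_end) is simple arithmetic. The sketch below assumes a slice that starts at 32889611, the value implied by the example numbers in the table, and that ends at 32912204, which is purely an assumption. to_seq_region is a hypothetical helper; the API objects do this conversion for you.

```perl
use strict;
use warnings;

# Convert slice-relative feature coordinates to seq_region coordinates.
# Arguments: slice start, slice end, slice strand, then the feature's
# start, end and strand relative to that slice.
sub to_seq_region {
    my ($slice_start, $slice_end, $slice_strand, $start, $end, $strand) = @_;
    if ($slice_strand == 1) {
        return ($slice_start + $start - 1,
                $slice_start + $end   - 1,
                $strand);
    }
    # Reverse-strand slice: positions count back from the slice end,
    # and the feature's strand flips.
    return ($slice_end - $end   + 1,
            $slice_end - $start + 1,
            -$strand);
}

my @absolute = to_seq_region(32889611, 32912204, 1, 22398, 22594, 1);
print "@absolute\n";
```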

(25)

Exercise 2

Get the repeats on the sequence you retrieved in Exercise 1. Print the name of each repeat and its relative (slice) and absolute (chromosomal) coordinates.

Is there anything that strikes you with regard to the coordinates of the repeats?

(26)

Genes, transcripts, translations

•  Genes, transcripts and exons are features

•  Introns are not explicitly defined in the database

•  Translations are not features

•  Protein sequences are not stored in the database, but

(27)

Gene object

•  Retrieve by using GeneAdaptor

•  Retrieve from Slice

Attribute     Example value(s)                            Method(s)
stable ID     ENSG00000139618                             stable_id
name          BRCA2                                       external_name
description   breast cancer 2, early onset                description
biotype       protein_coding, miRNA                       biotype
analysis      ensembl, havana, ensembl_havana_gene        analysis->logic_name
status        KNOWN, NOVEL                                status
transcripts   returns listref of Transcript objects       get_all_Transcripts
exons         returns listref of Exon objects             get_all_Exons
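The methods in this table can be combined into a short summary routine. This is a sketch, not a fixed recipe: gene_summary is a hypothetical helper that accepts a Bio::EnsEMBL::Gene, or any object offering the same methods.

```perl
use strict;
use warnings;

# Build a one-line summary of a gene from the methods listed above.
sub gene_summary {
    my ($gene) = @_;
    my $transcripts = $gene->get_all_Transcripts();    # listref
    return sprintf '%s (%s), %s, %d transcript(s)',
        $gene->stable_id(),
        $gene->external_name() // 'no name',
        $gene->biotype(),
        scalar @{$transcripts};
}

# Usage with the adaptor from the start of the script (needs the API
# and a database connection, so it is left commented out here):
#   my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000139618');
#   print gene_summary($gene), "\n";
```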

(28)

Transcript object

•  Retrieve by using TranscriptAdaptor

•  Retrieve from Slice or Gene

Attribute   Example value(s)                                 Method(s)
stable ID   ENST00000380152                                  stable_id
name        BRCA2-001                                        external_name
biotype     protein_coding, nonsense_mediated_decay          biotype
analysis    ensembl, havana, ensembl_havana_transcript       analysis->logic_name
status      KNOWN, NOVEL                                     status
CDS         ATGCCTATTGGATCCAAAGAGAGGC...                     translateable_seq
UTRs        returns Seq object                               five_prime_utr, three_prime_utr

(29)

Transcript object (continued)

Attribute     Example value(s)                        Method(s)
translation   returns Translation object              translation
exons         returns listref of Exon objects         get_all_Exons
introns       returns listref of Intron objects       get_all_Introns

(30)

Exon object

•  Retrieve by using ExonAdaptor

•  Retrieve from Slice, Gene or Transcript

(31)

Translation object

•  Retrieve by using TranslationAdaptor

•  Retrieve from Transcript

Attribute   Example value(s)   Method(s)
stable id   ENSP00000369497    stable_id
length      3418               length
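Since a Translation is reached through its Transcript, a script usually has to allow for transcripts that have none. The sketch below shows one way to handle that; translation_summary is a hypothetical helper, and it assumes that translation returns undef for a transcript without a protein product.

```perl
use strict;
use warnings;

# Describe the translation of a transcript, if there is one.
sub translation_summary {
    my ($transcript) = @_;
    my $translation = $transcript->translation();

    # Non-coding transcripts have no Translation object
    return $transcript->stable_id() . ': no translation'
        unless defined $translation;

    return sprintf '%s: %s, %d aa',
        $transcript->stable_id(),
        $translation->stable_id(),
        $translation->length();
}
```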

(32)

Exercise 3

Write a script to retrieve the upstream sequences for a list of Ensembl Gene IDs.

The script should take as input (from the command line):

•  the species

•  the length of the upstream sequence

•  the name of the file containing the Ensembl Gene IDs

and give as output:

•  a file containing the upstream sequences in FASTA format

Take into account that a gene can be either on the forward or the reverse strand of the genome!

Use as input your own file of Ensembl Gene IDs, or use the file 100_human_genes.txt in /homes/evopadmin/Ensembl.
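For the output side of this exercise, a FASTA record is a header line starting with ">" followed by the sequence on one or more lines. The helper below is a sketch in plain Perl; fasta_record is a hypothetical name, and the default line width of 60 is a common convention rather than something the exercise prescribes.

```perl
use strict;
use warnings;

# Format one FASTA record: a ">" header line followed by the sequence
# wrapped to lines of at most $width characters (default 60).
sub fasta_record {
    my ($header, $sequence, $width) = @_;
    $width ||= 60;
    my $record = ">$header\n";
    for (my $pos = 0; $pos < length $sequence; $pos += $width) {
        $record .= substr($sequence, $pos, $width) . "\n";
    }
    return $record;
}

print fasta_record('example_id upstream', 'ACGT' x 20);
```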

(33)

External references

•  External references (Xrefs) are cross references to identifiers from other databases, e.g. HGNC, WikiGenes, UniProtKB/Swiss-Prot, RefSeq, OMIM etc.

•  External references can be on the gene, transcript or protein

(34)

DBEntry object

•  Retrieve by using DBEntryAdaptor

•  Retrieve from Gene, Transcript or Translation

Attribute       Example value(s)                  Method(s)
database name   HGNC, Uniprot_SWISSPROT, EMBL     dbname
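A sketch of listing the external references attached to an object follows. xref_lines is a hypothetical helper. It relies on get_all_DBEntries, which Gene, Transcript and Translation objects provide, and on the display_id method of DBEntry; neither appears in the table above, so check the Doxygen documentation for your API version.

```perl
use strict;
use warnings;

# Return one "database: identifier" line per external reference of a
# Gene, Transcript or Translation object.
sub xref_lines {
    my ($object) = @_;
    return map { sprintf '%s: %s', $_->dbname(), $_->display_id() }
           @{ $object->get_all_DBEntries() };
}

# Usage (needs the API and a database connection):
#   print "$_\n" for xref_lines($gene);
```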

(35)

DBAdaptor object

•  Retrieve from Registry

Attribute             Example value(s)                                      Method(s)
database name         homo_sapiens_core_66_37, danio_rerio_variation_66_9   dbname
database group        core, variation, compara, funcgen                     group
database species      homo_sapiens, danio_rerio                             species
database connection   returns DBConnection object                           dbc
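One way to use DBAdaptor objects is to ask the Registry for all adaptors of a group and read the species and database name from each. The sketch below assumes the Registry method get_all_DBAdaptors with a -group argument; core_database_names is a hypothetical helper, and the database name is read through the connection object (dbc).

```perl
use strict;
use warnings;

# List "species => database name" for every core database the
# registry knows about, sorted by species.
sub core_database_names {
    my ($registry) = @_;
    my $dbas = $registry->get_all_DBAdaptors(-group => 'core');   # listref
    return map  { sprintf '%s => %s', $_->species(), $_->dbc()->dbname() }
           sort { $a->species() cmp $b->species() }
           @{$dbas};
}

# Usage, after load_registry_from_db (needs the API and a connection):
#   print "$_\n" for core_database_names('Bio::EnsEMBL::Registry');
```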

(36)

Exercise 4

Write a script that gets for all Ensembl species the protein sequence of the canonical transcript for the genes that have been annotated with a given gene symbol.

The script should take as input (from the command line):

•  the gene symbol

and give as output:

•  a file containing the protein sequences in FASTA format with the Ensembl Gene ID and the species name in the FASTA header

There are several ways to loop through the core dbs for all species in Ensembl. You can use the DBAdaptor object or, if you feel adventurous, the GenomeDB object from the Compara API.