Useful links
• Installation instructions: http://www.ensembl.org/info/docs/api/api_installation.html
• Schema description: http://www.ensembl.org/info/docs/api/core/core_schema.html
• Tutorial: http://www.ensembl.org/info/docs/api/core/core_tutorial.html
• Documentation (Doxygen): http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
• Ensembl-dev mailing list: http://www.ensembl.org/info/about/contact/mailing.html
• Ensembl helpdesk:
Ensembl databases
• MySQL
• Species-specific databases:
  core: genomic sequences and most annotation
  variation: genetic variation
  funcgen: regulatory elements
• Cross-species database:
Public MySQL servers
Ensembl
  host      ensembldb.ensembl.org
  user      anonymous
  password  -
  port      3306 (up to version 47), 5306 (version 48 onwards)
Ensembl Genomes
  host      mysql.ebi.ac.uk
  user      anonymous
  password  -
  port      4157
Ensembl Core databases
• The Ensembl Core databases store:
  genomic sequence
  assembly information
  gene, transcript and protein models
  cDNA and protein alignments
  cytogenetic bands, markers, repeats, CpG islands etc.
  external references
• Database names are built from four parts, e.g. homo_sapiens_core_66_37: species (homo_sapiens), group (core), software version or release (66) and assembly version (37)
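The naming scheme can be taken apart with a regular expression. The following is a minimal sketch in plain Perl; parse_db_name is a hypothetical helper (not part of the Ensembl API), and it assumes the four-part species_group_release_assembly layout described above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API.
# Assumes the layout <species>_<group>_<release>_<assembly version>.
sub parse_db_name {
    my ($db_name) = @_;
    my ($species, $group, $release, $assembly) =
        $db_name =~ /^(.+)_([a-z]+)_(\d+)_([^_]+)$/
        or return;    # name does not follow the assumed layout
    return {
        species  => $species,
        group    => $group,
        release  => $release,
        assembly => $assembly,
    };
}

my $parts = parse_db_name('homo_sapiens_core_66_37');
printf "%s / %s / release %d / assembly %s\n",
    @{$parts}{qw(species group release assembly)};
# prints: homo_sapiens / core / release 66 / assembly 37
```

Names that do not follow this layout make the helper return nothing, so the caller can skip them.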
MySQL
• Very good knowledge of database schemas needed
• Queries can quickly become very complex
Ensembl Core Perl API
• Used to retrieve data from and store data in the Ensembl
Core databases
• Written in Object-Oriented Perl
• Partly based on and compatible with BioPerl (version 1.2.3)
objects (http://www.bioperl.org)
• Used by the Ensembl analysis and annotation pipeline and
the Ensembl web code
• Robust, reliable and well-supported
What do we need?
• Perl
• BioPerl 1.2.3 (this is not the latest BioPerl version!)
• Ensembl API:
http://www.ensembl.org/info/docs/api/api_installation.html
Versioning
• API version must match database version
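• Old scripts using the API should continue working with a newer API!
[Figure: the same Perl script, run with API version 65 against the version 65 databases, gives the output for e!65; run with API version 66 against the version 66 databases, it gives the output for e!66]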
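The version rule can be checked before connecting. This is a rough sketch in plain Perl: release_matches is a hypothetical helper, and the API version is passed in as a plain number so that the example runs without the API installed. In a real script the number could come from software_version() in Bio::EnsEMBL::ApiVersion.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API: does the release number
# embedded in a database name match the given API version?
sub release_matches {
    my ($db_name, $api_version) = @_;
    # The release is the number before the final underscore-separated field.
    my ($release) = $db_name =~ /_(\d+)_[^_]+$/
        or return 0;
    return $release == $api_version ? 1 : 0;
}

for my $api_version (65, 66) {
    printf "API %d and homo_sapiens_core_66_37: %s\n",
        $api_version,
        release_matches('homo_sapiens_core_66_37', $api_version)
            ? 'match' : 'mismatch';
}
```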
Data objects
• Data objects model biological entities, e.g. genes, regulatory
elements, variations, …
• Each data object encapsulates information from one or a few
specific MySQL tables
• Name space: object modules start with Bio::EnsEMBL, e.g.
Object adaptors
• Data objects are retrieved from and stored in the database
using object adaptors
• Object adaptors are data object factories
• Each object adaptor is responsible for creating data objects of
only one particular type
• Name space: object adaptor modules start with
Bio::EnsEMBL::DBSQL, e.g.
The Registry
• The Registry is an object adaptor factory
• Loads all databases of the same version as the API
Each script should start like this
…
#!/usr/bin/perl -w

use strict;

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

## Load the databases into the registry
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

## Get the object adaptor for the object you're interested in
my $gene_adaptor  = $registry->get_adaptor('Human', 'Core', 'Gene');
my $slice_adaptor = $registry->get_adaptor('Human', 'Core', 'Slice');
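Once an adaptor is available, fetching a data object is typically a single method call. The sketch below assumes that the Ensembl Core API is installed and that the public server is reachable. The stable ID is the BRCA2 example used elsewhere in these slides. The output depends on the release and assembly, so none is shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

# Load the databases that match the installed API version
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
);

# The gene adaptor creates Gene objects
my $gene_adaptor = $registry->get_adaptor('Human', 'Core', 'Gene');

# Fetch one gene by its stable ID
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000139618');
die "Gene not found\n" unless defined $gene;

# Print the ID, the name and the location on the sequence region
printf "%s %s %s:%d-%d:%d\n",
    $gene->stable_id,
    $gene->external_name,
    $gene->seq_region_name,
    $gene->seq_region_start,
    $gene->seq_region_end,
    $gene->strand;
```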
Coordinate systems
• Sequences stored in Ensembl are associated with sequence
regions
• Sequence regions are linked to a distinct hierarchy of
coordinate systems
• Coordinate systems vary from species to species:
  human: chromosome, supercontig, clone, contig
  zebrafish: chromosome, scaffold, contig
• Sequence information is directly stored in the database for the
‘sequence level’ coordinate system
• The coordinate system of the highest level in a given region is the ‘top level’ coordinate system
Coordinate systems
[Figure: a chromosome (top level coordinate system) built from a tiling path of clones and contigs; the contigs form the sequence level coordinate system, where the sequence itself is stored]
CoordSystem object
• Retrieve using CoordSystemAdaptor
Attribute  Example value(s)                     Method(s)
name       chromosome, scaffold, contig, clone  name
Slices
• A slice represents an arbitrary region of a genome
• Slices are not directly stored in the database
• Slices are used to obtain sequences or features from a specific region
Slice object
• Retrieve using SliceAdaptor
Attribute               Example value(s)                    Method(s)
coordinate system name  chromosome, scaffold, clone         coord_system_name
sequence region name    Y, Zv9_scaffold1219, AADC0109557.1  seq_region_name
start                   1                                   start
end                     59373566                            end
length                  59373566                            length
strand                  1, -1                               strand
name                    chromosome:GRCh37:Y:1:59373566:1    name
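The slice name packs the other attributes into one colon-separated string: coordinate system name, version, sequence region name, start, end and strand. The following plain-Perl sketch splits such a name. parse_slice_name is a hypothetical helper, not an API method, and it assumes that none of the fields contains a colon.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API.
# Assumes six colon-separated fields:
#   coord_system:version:seq_region:start:end:strand
sub parse_slice_name {
    my ($name) = @_;
    my @fields = split /:/, $name, -1;
    return unless @fields == 6;

    my %slice;
    @slice{qw(coord_system version seq_region start end strand)} = @fields;

    # Coordinates are inclusive, so the length is end - start + 1
    $slice{length} = $slice{end} - $slice{start} + 1;
    return \%slice;
}

my $slice = parse_slice_name('chromosome:GRCh37:Y:1:59373566:1');
printf "%s %s, %d bp\n",
    $slice->{coord_system}, $slice->{seq_region}, $slice->{length};
# prints: chromosome Y, 59373566 bp
```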
Exercise 1
An easy exercise to get started:
Fetch the slice corresponding to basepair 32890000 to
32891000 of human chromosome 13 and print its sequence.
What do you need first, when you want to retrieve a slice?
Have a look in the Doxygen documentation at the list of methods available for the object(s) you’re using:
http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
If you have time left:
Print the soft-masked and hard-masked version as well as the reverse complement of the above sequence.
Features
• Features have a defined location on the genome
• All features have a start, end, strand and slice
• The start coordinate of a feature is always less than or equal to its end
coordinate, irrespective of the strand on which it is located (exception: insertion features)
Features
Object                                Represent(s)
Gene, Transcript, Exon                Ensembl gene models
PredictionTranscript, PredictionExon  Genscan gene models
DNAAlignFeature, ProteinAlignFeature  cDNAs, proteins
RepeatFeature                         repeats
MarkerFeature                         markers
OligoFeature                          microarray probes
KaryotypeBandFeature                  cytogenetic bands
SimpleFeature                         results of cpg, Eponine, FirstEF and tRNAscan
MiscFeature                           clones, ENCODE regions
Inheritance
• Data objects inherit methods from their parent object
• So, for example, all methods that apply to the Feature object
also apply to its children, i.e. the Gene object, the Transcript object, the Exon object etc.
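Inheritance in Perl can be illustrated with two toy classes. Toy::Feature and Toy::Gene below are hypothetical stand-ins written for this sketch, not the real Bio::EnsEMBL classes, and the coordinates are made up. The point is only that a method defined on the parent is available on the child.

```perl
use strict;
use warnings;

# Toy parent class (hypothetical, not Bio::EnsEMBL::Feature)
package Toy::Feature;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}
sub start  { $_[0]{start} }
sub end    { $_[0]{end} }
sub length { $_[0]{end} - $_[0]{start} + 1 }

# Toy child class: inherits new, start, end and length from Toy::Feature
package Toy::Gene;
use parent -norequire, 'Toy::Feature';

sub external_name { $_[0]{external_name} }

package main;

# Illustrative coordinates, not real annotation
my $gene = Toy::Gene->new(
    start         => 1000,
    end           => 1999,
    external_name => 'BRCA2',
);

# length is defined only on the parent, yet works on the child
printf "%s: %d bp\n", $gene->external_name, $gene->length;
# prints: BRCA2: 1000 bp
```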
Feature object
• Retrieve by using FeatureAdaptor
• Retrieve from Slice
Attribute      Example value(s)                  Method(s)
name           AluSp, D13S1788                   display_id
coordinates    13                                seq_region_name
               22398                             start
               22594                             end
               32912008                          seq_region_start
               32912204                          seq_region_end
               1                                 strand
sequence       GATTGGTCAGGTAGACAGCAGCAAG ...     seq
length         196                               length
slice          returns Slice object with which   slice
               feature is associated
feature slice  returns Slice object that covers  feature_Slice
               feature
Note: start, end and strand are slice relative.
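The relation between slice-relative and sequence region coordinates is simple arithmetic. In the plain-Perl sketch below, to_seq_region is a hypothetical helper, not an API method, and it assumes that the slice lies on the forward strand. The slice start of 32889611 is also an assumption, chosen because it is consistent with the example values in the table.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API: convert slice-relative
# coordinates to sequence region coordinates.
# Assumes the slice is on the forward strand.
sub to_seq_region {
    my ($slice_start, $start, $end) = @_;
    # Position 1 on the slice corresponds to $slice_start on the region
    return ($slice_start + $start - 1, $slice_start + $end - 1);
}

my $slice_start = 32889611;    # assumed, consistent with the table values
my ($seq_region_start, $seq_region_end) =
    to_seq_region($slice_start, 22398, 22594);

print "$seq_region_start-$seq_region_end\n";
# prints: 32912008-32912204
```

In a real script the API methods seq_region_start and seq_region_end return these values directly.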
Exercise 2
Get the repeats on the sequence you retrieved in Exercise 1. Print the name of each repeat and its relative (slice) and
absolute (chromosomal) coordinates.
Is there anything that strikes you with regard to the coordinates of the repeats?
Genes, transcripts, translations
• Genes, transcripts and exons are features
• Introns are not explicitly defined in the database
• Translations are not features
• Protein sequences are not stored in the database, but computed on the fly from the transcript
Gene object
• Retrieve by using GeneAdaptor
• Retrieve from Slice
Attribute    Example value(s)                       Method(s)
stable ID    ENSG00000139618                        stable_id
name         BRCA2                                  external_name
description  breast cancer 2, early onset           description
biotype      protein_coding, miRNA                  biotype
analysis     ensembl, havana, ensembl_havana_gene   analysis->logic_name
status       KNOWN, NOVEL                           status
transcripts  returns listref of Transcript objects  get_all_Transcripts
exons        returns listref of Exon objects        get_all_Exons
Transcript object
• Retrieve by using TranscriptAdaptor
• Retrieve from Slice or Gene
Attribute  Example value(s)                            Method(s)
stable ID  ENST00000380152                             stable_id
name       BRCA2-001                                   external_name
biotype    protein_coding, nonsense_mediated_decay     biotype
analysis   ensembl, havana, ensembl_havana_transcript  analysis->logic_name
status     KNOWN, NOVEL                                status
CDS        ATGCCTATTGGATCCAAAGAGAGGC...                translateable_seq
UTRs       returns Seq object                          five_prime_utr
                                                       three_prime_utr
Transcript object (continued)
Attribute    Example value(s)                   Method(s)
translation  returns Translation object         translation
exons        returns listref of Exon objects    get_all_Exons
introns      returns listref of Intron objects  get_all_Introns
Exon object
• Retrieve by using ExonAdaptor
• Retrieve from Slice, Gene or Transcript
Attribute Example value(s) Method(s)
Translation object
• Retrieve by using TranslationAdaptor
• Retrieve from Transcript
Attribute  Example value(s)  Method(s)
stable ID  ENSP00000369497   stable_id
length     3418              length
Exercise 3
Write a script to retrieve the upstream sequences for a list of Ensembl Gene IDs.
The script should take as input (from the command line):
• the species
• the length of the upstream sequence
• the name of the file containing the Ensembl Gene IDs
and give as output:
• a file containing the upstream sequences in FASTA format
Take into account that a gene can be either on the forward or the reverse strand of the genome!
Use as input a file with your own Ensembl Gene IDs, or use the file 100_human_genes.txt in /homes/evopadmin/Ensembl.
External references
• External references (Xrefs) are cross references to identifiers
from other databases, e.g. HGNC, WikiGenes, UniProtKB/Swiss-Prot, RefSeq, OMIM etc.
• External references can be on the gene, transcript or protein level
DBEntry object
• Retrieve by using DBEntryAdaptor
• Retrieve from Gene, Transcript or Translation
Attribute      Example value(s)                  Method(s)
database name  HGNC, Uniprot_SWISSPROT, EMBL     dbname
DBAdaptor object
• Retrieve from Registry
Attribute            Example value(s)                                     Method(s)
database name        homo_sapiens_core_66_37, danio_rerio_variation_66_9  dbname
database group       core, variation, compara, funcgen                    group
database species     homo_sapiens, danio_rerio                            species
database connection  returns DBConnection object                          dbc
Exercise 4
Write a script that gets for all Ensembl species the protein sequence of the canonical transcript for the genes that have been annotated with a given gene symbol.
The script should take as input (from the command line):
• the gene symbol
and give as output:
• a file containing the protein sequences in FASTA format with
the Ensembl Gene ID and the species name in the FASTA header
There are several ways to loop through the core dbs for all
species in Ensembl. You can use the DBAdaptor object or, if you feel adventurous, the GenomeDB object from the Compara API.