Data Warehouse Technology
And The MSD Databases
Data Warehouses
The MSD Databases
Populating & using the Search
What is a Data Warehouse?
“
A
subject-oriented, integrated, nonvolatile,
and
time-variant
collection of data that is
used primarily in organizational decision
making.”
(W. H. Inmon, Building the Data Warehouse,
John Wiley & Sons, 2002)
Data Warehouse—Subject-Oriented
Organised around major subjects, such as
customer, product, sales
Focusing on the modelling and analysis of
data for decision makers, not on daily
operations or transaction processing
Provides
a simple and concise
view around
particular subject issues by
excluding data
that are not useful in the decision support
process
Data Warehouse
—
Integrated
Constructed by integrating multiple,
heterogeneous data sources
relational databases, flat files, on-line transaction records
Data cleaning and data integration
techniques are applied
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
When data are moved to the warehouse, they are converted into a consistent format to facilitate integration.
Data Warehouse
—
Non-Volatile
A physically separate store of data transformed from the operational environment
Operational update of data does not occur in the data warehouse environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
Data Warehouse
—
Time Variant
The time horizon for the data warehouse is
significantly longer than that of
operational systems
Operational database: current value data
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain
Dimensional Modelling
A typical commercial data warehouse is based on
multidimensional data model which views data in the form
of a data cube
A data cube allows data to be modelled and viewed in multiple dimensions (such as sales)
Dimension tables, representing important subject areas
such as item (item_name, brand, type), or time(day, week,
month, quarter, year)
Fact table containing varying levels of summarised data
(such as dollars_sold) and keys to each of the related
Data Warehouse Models
Star schema
A fact table, containing varying levels of summarised data, in
the middle connected to a set of dimension tables, representing important subject areas
Snowflake schema
A refinement of star schema where some dimensional hierarchy
is normalised into a set of smaller dimension tables, forming a shape similar to snowflake
Fact constellation
Multiple fact tables share dimension tables, viewed as
a collection of stars, therefore called galaxy schema
Snowflake Schema
SaleID ItemID Quantity SalePrice Amount Transactions ItemID Description QuantityOnHand ListPrice Category Merchandise SaleID SaleDate EmployeeID CustomerID SalesTax Sale CustomerID Phone FirstName LastName Address ZipCode CityID Customer CityID ZipCode City State CityDimension tables can join to other
Fact Constellation
time_key item_key branch_key location_key unit_sold euros_sold avg_salesSales Fact Table
time_key day day_of_the_week month quarter year Time branch_key branch_name branch_type Branch item_key item_name brand type supplier_key Item location_key street city province/street country Location time_key item_key shipper_key from_location_key to_location_key unit_shipped
Shipping Fact Table
shipper_key shipper_name location_key shipper_type
Why Build a Data Warehouse?
Access to data from multiple sources, have a
comprehensive data collection.
Separate transactional and analysis systems:
Improve query response time (without slowing downtransaction processing)
Easy formulation of complex queries
Access to historical data (not in operational
systems)
Improved data quality (fewer errors and
missing values)
The Data Warehouse Pipeline
Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Front-End Tools Serve Operational DBs other sources Data Storage OLAP ServerUsing the Warehouse
Ad Hoc query
simplistic submission of SQL statements from a command-line tool or a SQL-generation tool
Complex Analytical Questions
using custom written query tools or commercial online analytical processing (OLAP) tools
Data Mining
What is a Data Warehouse? (revisited)
“A data warehouse is simply a single,complete,
and consistent store of data obtained from a variety of sources and made available to end
users in a way they can understand and use it in
a business context.”
-- Barry Devlin, IBM Consultant
The MSD actually consists of two separate
databases:
the deposition database is highly normalised, with
thousands of relationships linking some 325 tables; the deposition database is the definitive archive for all
structural data at MSD
the search database is a simpler, but larger
denormalised database,which contains a large amount
of additional derived data, with data items duplicated
and aggregated into 170 much wider tables, making it more amenable to searching and retrieval of data
A third intermediate database is involved in
transforming the data from the deposition database to the search database and in calculating and adding the
derived data
The deposition database comprises:
common reference data, such as amino-acid connectivity, HET groups structures, etc.
older PDB entries, loaded from legacy files
schema includes strict constraints, enforcing internal
consistency and performing type checking and validation against the reference data
a huge amount of effort has gone into cleaning up the legacy data
new entries, loaded from recent PDB submissions
new entries are loaded on a weekly basis, subject to the same constraints and checks during loading as legacy data
NTX CHANGED TAX ID # OLD_TAX_ID * NEW_NAME_TXT * OLD_NAME_TXT * USERSTAMP o TIMESTAMP_NCBI NCBI SYNONYMS # NAME_CLASS # NAME_TXT NCBI # TAX_ID o PARENT_ID * SCIENTIFIC_NAME o PREFERRED_COMMON_NAME o RANK * HIDDEN ETAXI SYNONYMS # NAME_TXT # NAME_CLASS ETAXI o PARENT_ID o UPPER_NAME o FULL_NAME * COMPLETE_GENOME_FLAG o RANK * HIDDEN * SCIENTIFIC_NAME o LEFT_NUMBER o RIGHT_NUMBER * ANNOTATION_SOURCE DEP_PDB_TAXONOMY o PDB_COMMON_NAME o PDB_SCIENTIFIC_NAME o PDB_STRAIN DEP_NATURAL_SOURCE * CELL_ID o ATCC o CELL_LINE o CELL_LOCATION o CELL_TYPE o DETAILS o FRAGMENT o GENE o ORGAN o ORGANELLE o PLASMID o SECRETION o TISSUE o VARIANT DEP_ENTITY * NAME o ENTITY_SRC * ID o DETAILS o SYSTEM REF_CHEM_COMP o EBI_ID o HETGROUP_PARENT o MODEL_DETAILS o MOLECULAR_FORMULA o MOLECULAR_WEIGHT o EBI_RESERVED_NAME o RCSB_RESERVED_NAME ... REF_SEQ_REF_SEQ * COMPONENT * SERIAL REF_SEQ_REF * DB_NAME * PRIMARY_ID o SECONDARY_ID o VERSION DEP_POLY_ENTITY_SEQ * HETERO * SERIAL DEPOSITION * NUM_XTALS o PDB_EXP_TYPE * DEPOSITION_PROCESSED_BY * CREATION_DATE * LAST_UPDATE * TITLE ... DEP_RESIDUE DEP_SEQ_CORR * CONFLICT_ANNOTATED_FLAG * TYPE o CIF_TYPE o CONFLICT_TYPE o DETAILS DEP_SEQ_MATCH * CIF_ID o DETAILS DEP_SEQ_REF * PROC_MATCHED_FLAG o CIF_ID o DETAILS DEP_POLY_ENTITY * ENGINEERED * HETERO_FLAG * MUTANT_FLAG o FRAGMENT_FLAG o MUTATION_STRING * SYNTHETIC * MASTER_ID DEP_POLY_ENTITY_SEGMENT * RCSB_DEFINE_AS_ENTITY * RCSB_SERIAL_OFFSET DEP_POLY_ENTITY_MASTER has taxonomy taxonomy of is parent of is child of eta_ncb_fk eta_ncb_fk es_eta_fk es_eta_fk ns_ncb_fk ns_ncb_fk ncti_ncb_fk ncti_ncb_fk c d a b CP606 R/1374 JA3 JR3 compound defines compound defines is defined by parent comp child comp a b sequence of has sequence refers to referred to by conflict conflicts with matches entity matched by is a component of is composed of describes is described by referred to by referrs to precedes follows R/1939 CP623 R/1937 CP621
Is not a true data warehouse
Breaks several of the data warehouse
car
di
nal
“r
ul
es”
Non-volatile
Time Variant
But does make use of many of the data
warehousing concepts and techniques
Closest to a fact constellation
NOT The MSD Data Warehouse
Each data item occurs only once in the
deposition database, so that data from a single entry are spread across many tables
To make searching faster, the data are
aggregated into fewer, larger tables in the Search
database
Searching the Search database requires fewer
table joins, making database queries significantly faster and much less complex
The top-level entity in a structure entry is the assembly, as determined using the Protein Quaternary Structure server (PQS)
Chains Residues Atoms Exp. Result Assembly
ALT ASSEMBLY ASSEMBLY DATA ATOM ATOM DATA CHAIN RESIDUE ENTRY MODEL
Representing Macromolecular Structures (2)
Each level of the hierarchy can have associated properties, e.g.
Bound molecules
Domains
Site residues
Derived properties (e.g. asa)
Reference information (e.g.
standard geometry) ASU observed exp data Chains Residues Atoms Biological Unit(s) Independent units
During
transformation
from the deposition
database to the search database additional
derived data are added
Numerous processes are run on the
deposition data, including:
characterisation of ligand binding sites
derivation of secondary structure information mapping data onto other databases such as
UniProt, Pfam, InterPro, GO, SCOP and CATH
Search Database Data Models
Entity-relationship
data model
as used in the deposition database but denormalised to aid querying
could be generated from deposition data model
Dimensional
model
typical of commercial data warehouses requires separate data model
A hybrid of these two
required to handle to complexity of macromolecular structure data
very complex –fact tables could be dimensions for other fact tables
Part of Search Database Schema
MSD SEARCH DATABASE MARTS:
SECONDARY STRUCTURE COORDINATES-SEQUENCE TAXONOMY CATH
TURN
STRAND DATA SHEET ORDER
SHEET HBOND SHEET NCBI SYNONYMS NCBI MODEL HELIX HELIX HELIX DATA HAIRPIN MOTIF ETAXI SYNONYMS ETAXI DEPOSITION CATH_MAPPING COMPONENT DATA COMPONENT CHAIN BULGE ATOM DATA ATOM ASSEMBLY DATA ASSEMBLY ALT atom atom_data etaxi_ncbi_fk etaxi_ncbi_fk etaxi_syn_fk etaxi_syn_fk ncbi_syn_fk ncbi_syn_fk alt atl_atd_fk atd_model_fk atd_model_fk shb_fk8 shb_fk8 sheet_order_fk1 sheet_order_fk1 sheet_order_fk2 sheet_order_fk2 shb_fk2 shb_fk2 sheet_order_fk4 sheet_order_fk4 helix_data_fk2 helix_data_fk2 strand_data_fk1 strand_data_fk1 strand_data_fk2 strand_data_fk2 helix_d_fk2 helix_d_fk2 helix_d_fk1 helix_d_fk1 helix_data_fk1 helix_data_fk1 hairpin_motif_fk5 hairpin_motif_fk5 hairpin_motif_fk4 hairpin_motif_fk4 hairpin_motif_fk3 hairpin_motif_fk3 strand_data_fk4 strand_data_fk4 strand_data_fk5 strand_data_fk5 strand_data_fk6 _strand_data_fk6 strand_data_fk7 _strand_data_fk7 strand_data_fk8 _strand_data_fk8 strand_data_fk9 _strand_data_fk9 turn_fk1 turn_fk1 turn_fk3 turn_fk3 bulge_fk7 bulge_fk7 bulge_fk6 bulge_fk6 bulge_fk5 bulge_fk5 bulge_fk4 bulge_fk4 bulge_fk3 bulge_fk3 bulge_fk2 bulge_fk2 bulge_fk1 bulge_fk1 turn_fk4 turn_fk4 turn_fk5 turn_fk5 turn_fk6 turn_fk6 comp_tax_fk comp_tax_fk turn_fk2 turn_fk2 strand_data_fk3 strand_data_fk3 sheet_order_fk3 sheet_order_fk3 shb_fk7 shb_fk7 shb_fk1 shb_fk1 sheet_datak2 sheet_datak2 sheet_datak1 sheet_datak1 hairpin_motif_fk1 hairpin_motif_fk1 db_comp_fk db_comp_fk db_chain_fk db_chain_fk component_data_fk4 component_data_fk4 chain_tax_fk chain_tax_fk assembly_data_model_fk assembly_data_model_fk assembly_deposition_fk assembly_deposition_fk
Star Database Design
Interactions
Number of interactions Strongest interaction
Fact Table (residue_contact)
Residue Geometry Dimension Tables PDB Entry Model Assembly Ligand Neighbour Secondary structure Bond type Helix Turn Strand Chemical compound
From
Snowflake
to
Star
Residue Contact
Chemical compound Residue Assembly PDB entryFrom Deposition to Search Database
Deposition Database Search database
Transformation Normalised relationships authoritative complete Denormalised fewer relationships derived information subset of data
Search database 200 GB Deposition database 40 GB D I S T R I B U T I O N 130 GB 200 GB Transformation database 200 GB Web services External Processes derived data load via mmcif Web services deposit transform distribute search PDB files replicate
MSD Database Pipeline
Transformation
Moving from a complex normalised model to enforce integrity to a simple, efficient simple user oriented
model
Assignment of consistent identifiers
In addition to PDB identifiers
Extensive indexing
Based on a flexible metadata driven mechanism, in-house developed to overcome Oracle limitations
Models composite entities and their dependencies Allows incremental transformation
Post Transformation
Calculation of derived/aggregated data
Consistent across whole archive
Ability to query derived data
Adding value to the database
Active-site information
Structure , Sequence Alignment
Cross referencing SCOP, CATH, PFAM, UniProt, InterPro
Additional derived data and indexing required for web-based search services
Stored in search database, but conceptually part of search systems
Scientific parameterisation reflects requirements of search services
Efficiently Querying The Database
Requires using many of the tools provided by
the Oracle DBMS
STAR Joins
STAR Tranformations Bitmap indexes
Index Organized Tables
Set operations (intersect, minus, union)
3D Spatial searches
Search example: find the following triangle site: Cbeta of Isoleucine or Leucine
Cbeta of Tryptophan or Tyrosine or Phenylalanine Cbeta of Arginine 6-8 Angstroms 6-8 Angstroms 6-8 Angstroms
The Query
select d1.atom_data1_id, d1.atom_data2_id from
(select /*+ NO_MERGE INDEX(atomic_dists)*/
atom_data1_id, atom_data2_id from atomic_dists where dist_id in (select id from dists
where code_3_letter1 in ('ILE','LEU') and code_3_letter2 in ('TRP','TYR','PHE') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'
and dist in (6,7,8))) d1,
(select /*+ NO_MERGE INDEX(atomic_dists) */
atom_data1_id, atom_data2_id from atomic_dists where dist_id in (select id from dists
where code_3_letter1 = 'ARG' and code_3_letter2 in ('ILE','LEU') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'
and dist in (6,7,8))) d2,
(select /*+ NO_MERGE INDEX(atomic_dists) */
atom_data1_id, atom_data2_id from atomic_dists where dist_id in (select id from dists
where code_3_letter1 = 'ARG' and code_3_letter2 in ('TRP','TYR','PHE') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'
and dist in (6,7,8))) d3
where d1.atom_data1_id = d2.atom_data2_id and d1.atom_data2_id = d3.atom_data2_id and d2.atom_data1_id = d3.atom_data1_id;
Managing Data
The information available in the MSD database is organised in application areas (data marts)
Users may replicate only the data marts that they are interested in
Some data marts are quite valuable and still small enough to be used on desktop systems as in the demonstration
The data marts are also loosely interrelated and can be synchronised independently
Data Marts
The Search database is divided into application areas, or data marts:
Structure Data Descriptions Secondary Structure Taxonomy Ligands Experimental details Citations
Mapping to UniProt, SCOP, CATH, Pfam, InterPro, Go,
IntEnz, PubMed
Active Sites
Structural-Sequence alignment
Each data mart can be distributed and managed separately
What is the Search Database used for?
A target for data integration
eFamily
A direct backend for web-based services
MSDLite, MSDPro, MSDSite etc.
A source for data files
Data exported from DB to support web-based services (indirectly a backend)
Atlas pages
XML representation of sections of the DB
Including eFamily
Coordinates in PDB format for software that requires it
Data Mining
http://www.ebi.ac.uk/msd-srv/docs/dbdoc/