Data Warehouse Technology And The MSD Databases

(1)

Data Warehouse Technology

And The MSD Databases

(2)



Data Warehouses



The MSD Databases



Populating & using the Search

(3)

(4)

What is a Data Warehouse?

“

A

subject-oriented, integrated, nonvolatile,

and

time-variant

collection of data that is

used primarily in organizational decision

making.”

(W. H. Inmon, Building the Data Warehouse,

John Wiley & Sons, 2002)

(5)

Data Warehouse—Subject-Oriented



Organised around major subjects, such as

customer, product, sales



Focusing on the modelling and analysis of

data for decision makers, not on daily

operations or transaction processing



Provides

a simple and concise

view around

particular subject issues by

excluding data

that are not useful in the decision support

process

(6)

Data Warehouse

—

Integrated



Constructed by integrating multiple,

heterogeneous data sources

relational databases, flat files, on-line transaction records



Data cleaning and data integration

techniques are applied

Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

When data are moved to the warehouse, they are converted into a consistent format to facilitate integration.

(7)

Data Warehouse

—

Non-Volatile

A physically separate store of data transformed from the operational environment

Operational update of data does not occur in the data warehouse environment

Does not require transaction processing, recovery, and

concurrency control mechanisms

Requires only two operations in data accessing:

(8)

Data Warehouse

—

Time Variant



The time horizon for the data warehouse is

significantly longer than that of

operational systems

Operational database: current value data

Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)



Every key structure in the data warehouse

Contains an element of time, explicitly or implicitly

But the key of operational data may or may not contain

(9)

Dimensional Modelling

A typical commercial data warehouse is based on

multidimensional data model which views data in the form

of a data cube

A data cube allows data to be modelled and viewed in multiple dimensions (such as sales)

Dimension tables, representing important subject areas

such as item (item_name, brand, type), or time(day, week,

month, quarter, year)

Fact table containing varying levels of summarised data

(such as dollars_sold) and keys to each of the related

(10)

Data Warehouse Models



Star schema

 A fact table, containing varying levels of summarised data, in

the middle connected to a set of dimension tables, representing important subject areas



Snowflake schema

 A refinement of star schema where some dimensional hierarchy

is normalised into a set of smaller dimension tables, forming a shape similar to snowflake



Fact constellation

 Multiple fact tables share dimension tables, viewed as

a collection of stars, therefore called galaxy schema

(11)

(12)

(13)

Snowflake Schema

SaleID ItemID Quantity SalePrice Amount Transactions ItemID Description QuantityOnHand ListPrice Category Merchandise SaleID SaleDate EmployeeID CustomerID SalesTax Sale CustomerID Phone FirstName LastName Address ZipCode CityID Customer CityID ZipCode City State City

Dimension tables can join to other

(14)

Fact Constellation

time_key item_key branch_key location_key unit_sold euros_sold avg_sales

Sales Fact Table

time_key day day_of_the_week month quarter year Time branch_key branch_name branch_type Branch item_key item_name brand type supplier_key Item location_key street city province/street country Location time_key item_key shipper_key from_location_key to_location_key unit_shipped

Shipping Fact Table

shipper_key shipper_name location_key shipper_type

(15)

Why Build a Data Warehouse?



Access to data from multiple sources, have a

comprehensive data collection.



Separate transactional and analysis systems:

Improve query response time (without slowing down

transaction processing)



Easy formulation of complex queries



Access to historical data (not in operational

systems)



Improved data quality (fewer errors and

missing values)

(16)

The Data Warehouse Pipeline

Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Front-End Tools Serve Operational DBs other sources Data Storage OLAP Server

(17)

Using the Warehouse



Ad Hoc query

simplistic submission of SQL statements from a command-line tool or a SQL-generation tool



Complex Analytical Questions

using custom written query tools or commercial online analytical processing (OLAP) tools



Data Mining

(18)

(19)

What is a Data Warehouse? (revisited)

“A data warehouse is simply a single,complete,

and consistent store of data obtained from a variety of sources and made available to end

users in a way they can understand and use it in

a business context.”

-- Barry Devlin, IBM Consultant

(20)

The MSD actually consists of two separate

databases:

the deposition database is highly normalised, with

thousands of relationships linking some 325 tables; the deposition database is the definitive archive for all

structural data at MSD

the search database is a simpler, but larger

denormalised database,which contains a large amount

of additional derived data, with data items duplicated

and aggregated into 170 much wider tables, making it more amenable to searching and retrieval of data

A third intermediate database is involved in

transforming the data from the deposition database to the search database and in calculating and adding the

derived data

(21)

The deposition database comprises:

common reference data, such as amino-acid connectivity, HET groups structures, etc.

older PDB entries, loaded from legacy files

schema includes strict constraints, enforcing internal

consistency and performing type checking and validation against the reference data

a huge amount of effort has gone into cleaning up the legacy data

new entries, loaded from recent PDB submissions

new entries are loaded on a weekly basis, subject to the same constraints and checks during loading as legacy data

(22)

(23)

NTX CHANGED TAX ID # OLD_TAX_ID * NEW_NAME_TXT * OLD_NAME_TXT * USERSTAMP o TIMESTAMP_NCBI NCBI SYNONYMS # NAME_CLASS # NAME_TXT NCBI # TAX_ID o PARENT_ID * SCIENTIFIC_NAME o PREFERRED_COMMON_NAME o RANK * HIDDEN ETAXI SYNONYMS # NAME_TXT # NAME_CLASS ETAXI o PARENT_ID o UPPER_NAME o FULL_NAME * COMPLETE_GENOME_FLAG o RANK * HIDDEN * SCIENTIFIC_NAME o LEFT_NUMBER o RIGHT_NUMBER * ANNOTATION_SOURCE DEP_PDB_TAXONOMY o PDB_COMMON_NAME o PDB_SCIENTIFIC_NAME o PDB_STRAIN DEP_NATURAL_SOURCE * CELL_ID o ATCC o CELL_LINE o CELL_LOCATION o CELL_TYPE o DETAILS o FRAGMENT o GENE o ORGAN o ORGANELLE o PLASMID o SECRETION o TISSUE o VARIANT DEP_ENTITY * NAME o ENTITY_SRC * ID o DETAILS o SYSTEM REF_CHEM_COMP o EBI_ID o HETGROUP_PARENT o MODEL_DETAILS o MOLECULAR_FORMULA o MOLECULAR_WEIGHT o EBI_RESERVED_NAME o RCSB_RESERVED_NAME ... REF_SEQ_REF_SEQ * COMPONENT * SERIAL REF_SEQ_REF * DB_NAME * PRIMARY_ID o SECONDARY_ID o VERSION DEP_POLY_ENTITY_SEQ * HETERO * SERIAL DEPOSITION * NUM_XTALS o PDB_EXP_TYPE * DEPOSITION_PROCESSED_BY * CREATION_DATE * LAST_UPDATE * TITLE ... DEP_RESIDUE DEP_SEQ_CORR * CONFLICT_ANNOTATED_FLAG * TYPE o CIF_TYPE o CONFLICT_TYPE o DETAILS DEP_SEQ_MATCH * CIF_ID o DETAILS DEP_SEQ_REF * PROC_MATCHED_FLAG o CIF_ID o DETAILS DEP_POLY_ENTITY * ENGINEERED * HETERO_FLAG * MUTANT_FLAG o FRAGMENT_FLAG o MUTATION_STRING * SYNTHETIC * MASTER_ID DEP_POLY_ENTITY_SEGMENT * RCSB_DEFINE_AS_ENTITY * RCSB_SERIAL_OFFSET DEP_POLY_ENTITY_MASTER has taxonomy taxonomy of is parent of is child of eta_ncb_fk eta_ncb_fk es_eta_fk es_eta_fk ns_ncb_fk ns_ncb_fk ncti_ncb_fk ncti_ncb_fk c d a b CP606 R/1374 JA3 JR3 compound defines compound defines is defined by parent comp child comp a b sequence of has sequence refers to referred to by conflict conflicts with matches entity matched by is a component of is composed of describes is described by referred to by referrs to precedes follows R/1939 CP623 R/1937 CP621

(24)



Is not a true data warehouse



Breaks several of the data warehouse

car

di

nal

“r

ul

es”

Non-volatile

Time Variant



But does make use of many of the data

warehousing concepts and techniques

Closest to a fact constellation

NOT The MSD Data Warehouse

(25)

Each data item occurs only once in the

deposition database, so that data from a single entry are spread across many tables

To make searching faster, the data are

aggregated into fewer, larger tables in the Search

database

Searching the Search database requires fewer

table joins, making database queries significantly faster and much less complex

The top-level entity in a structure entry is the assembly, as determined using the Protein Quaternary Structure server (PQS)

(26)

Chains Residues Atoms Exp. Result Assembly

ALT ASSEMBLY ASSEMBLY DATA ATOM ATOM DATA CHAIN RESIDUE ENTRY MODEL

(27)

Representing Macromolecular Structures (2)

Each level of the hierarchy can have associated properties, e.g.

Bound molecules

Domains

Site residues

Derived properties (e.g. asa)

Reference information (e.g.

 standard geometry) ASU observed exp data Chains Residues Atoms Biological Unit(s) Independent units

(28)



During

transformation

from the deposition

database to the search database additional

derived data are added



Numerous processes are run on the

deposition data, including:

characterisation of ligand binding sites

derivation of secondary structure information mapping data onto other databases such as

UniProt, Pfam, InterPro, GO, SCOP and CATH

(29)

Search Database Data Models



Entity-relationship

data model

as used in the deposition database but denormalised to aid querying

could be generated from deposition data model



Dimensional

model

 typical of commercial data warehouses requires separate data model



A hybrid of these two

 required to handle to complexity of macromolecular structure data

very complex –fact tables could be dimensions for other fact tables

(30)

Part of Search Database Schema

MSD SEARCH DATABASE MARTS:

SECONDARY STRUCTURE COORDINATES-SEQUENCE TAXONOMY CATH

TURN

STRAND DATA SHEET ORDER

SHEET HBOND SHEET NCBI SYNONYMS NCBI MODEL HELIX HELIX HELIX DATA HAIRPIN MOTIF ETAXI SYNONYMS ETAXI DEPOSITION CATH_MAPPING COMPONENT DATA COMPONENT CHAIN BULGE ATOM DATA ATOM ASSEMBLY DATA ASSEMBLY ALT atom atom_data etaxi_ncbi_fk etaxi_ncbi_fk etaxi_syn_fk etaxi_syn_fk ncbi_syn_fk ncbi_syn_fk alt atl_atd_fk atd_model_fk atd_model_fk shb_fk8 shb_fk8 sheet_order_fk1 sheet_order_fk1 sheet_order_fk2 sheet_order_fk2 shb_fk2 shb_fk2 sheet_order_fk4 sheet_order_fk4 helix_data_fk2 helix_data_fk2 strand_data_fk1 strand_data_fk1 strand_data_fk2 strand_data_fk2 helix_d_fk2 helix_d_fk2 helix_d_fk1 helix_d_fk1 helix_data_fk1 helix_data_fk1 hairpin_motif_fk5 hairpin_motif_fk5 hairpin_motif_fk4 hairpin_motif_fk4 hairpin_motif_fk3 hairpin_motif_fk3 strand_data_fk4 strand_data_fk4 strand_data_fk5 strand_data_fk5 strand_data_fk6 _strand_data_fk6 strand_data_fk7 _strand_data_fk7 strand_data_fk8 _strand_data_fk8 strand_data_fk9 _strand_data_fk9 turn_fk1 turn_fk1 turn_fk3 turn_fk3 bulge_fk7 bulge_fk7 bulge_fk6 bulge_fk6 bulge_fk5 bulge_fk5 bulge_fk4 bulge_fk4 bulge_fk3 bulge_fk3 bulge_fk2 bulge_fk2 bulge_fk1 bulge_fk1 turn_fk4 turn_fk4 turn_fk5 turn_fk5 turn_fk6 turn_fk6 comp_tax_fk comp_tax_fk turn_fk2 turn_fk2 strand_data_fk3 strand_data_fk3 sheet_order_fk3 sheet_order_fk3 shb_fk7 shb_fk7 shb_fk1 shb_fk1 sheet_datak2 sheet_datak2 sheet_datak1 sheet_datak1 hairpin_motif_fk1 hairpin_motif_fk1 db_comp_fk db_comp_fk db_chain_fk db_chain_fk component_data_fk4 component_data_fk4 chain_tax_fk chain_tax_fk assembly_data_model_fk assembly_data_model_fk assembly_deposition_fk assembly_deposition_fk

(31)

Star Database Design

Interactions

Number of interactions Strongest interaction

Fact Table (residue_contact)

Residue Geometry Dimension Tables PDB Entry Model Assembly Ligand Neighbour Secondary structure Bond type Helix Turn Strand Chemical compound

(32)

From

Snowflake

to

Star

Residue Contact

Chemical compound Residue Assembly PDB entry

(33)

(34)

From Deposition to Search Database

Deposition Database Search database

Transformation Normalised relationships authoritative complete Denormalised fewer relationships derived information subset of data

(35)

Search database 200 GB Deposition database 40 GB D I S T R I B U T I O N 130 GB 200 GB Transformation database 200 GB Web services External Processes derived data load via mmcif Web services deposit transform _distribute search PDB files replicate

MSD Database Pipeline

(36)

Transformation

Moving from a complex normalised model to enforce integrity to a simple, efficient simple user oriented

model

Assignment of consistent identifiers

In addition to PDB identifiers

Extensive indexing

Based on a flexible metadata driven mechanism, in-house developed to overcome Oracle limitations

Models composite entities and their dependencies Allows incremental transformation

(37)

Post Transformation

Calculation of derived/aggregated data

Consistent across whole archive

Ability to query derived data

Adding value to the database

Active-site information

Structure , Sequence Alignment

Cross referencing SCOP, CATH, PFAM, UniProt, InterPro

Additional derived data and indexing required for web-based search services

Stored in search database, but conceptually part of search systems

Scientific parameterisation reflects requirements of search services

(38)

Efficiently Querying The Database

Requires using many of the tools provided by

the Oracle DBMS

STAR Joins

STAR Tranformations Bitmap indexes

Index Organized Tables

Set operations (intersect, minus, union)

(39)

3D Spatial searches

Search example: find the following triangle site: Cbeta of Isoleucine or Leucine

Cbeta of Tryptophan or Tyrosine or Phenylalanine Cbeta of Arginine 6-8 Angstroms _{6-8 Angstroms} 6-8 Angstroms

(40)

The Query

select d1.atom_data1_id, d1.atom_data2_id from

(select /*+ NO_MERGE INDEX(atomic_dists)*/

atom_data1_id, atom_data2_id from atomic_dists where dist_id in (select id from dists

where code_3_letter1 in ('ILE','LEU') and code_3_letter2 in ('TRP','TYR','PHE') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'

and dist in (6,7,8))) d1,

(select /*+ NO_MERGE INDEX(atomic_dists) */

where code_3_letter1 = 'ARG' and code_3_letter2 in ('ILE','LEU') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'

and dist in (6,7,8))) d2,

(select /*+ NO_MERGE INDEX(atomic_dists) */

where code_3_letter1 = 'ARG' and code_3_letter2 in ('TRP','TYR','PHE') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'

and dist in (6,7,8))) d3

where d1.atom_data1_id = d2.atom_data2_id and d1.atom_data2_id = d3.atom_data2_id and d2.atom_data1_id = d3.atom_data1_id;

(41)

Managing Data

The information available in the MSD database is organised in application areas (data marts)

 Users may replicate only the data marts that they are interested in

 Some data marts are quite valuable and still small enough to be used on desktop systems as in the demonstration

 The data marts are also loosely interrelated and can be synchronised independently

(42)

Data Marts

The Search database is divided into application areas, or data marts:

Structure Data Descriptions Secondary Structure Taxonomy Ligands Experimental details Citations

Mapping to UniProt, SCOP, CATH, Pfam, InterPro, Go,

IntEnz, PubMed

Active Sites

Structural-Sequence alignment

Each data mart can be distributed and managed separately

(43)

What is the Search Database used for?



A target for data integration

eFamily



A direct backend for web-based services

MSDLite, MSDPro, MSDSite etc.



A source for data files

Data exported from DB to support web-based services (indirectly a backend)

Atlas pages

XML representation of sections of the DB

Including eFamily

Coordinates in PDB format for software that requires it



Data Mining

(44)

http://www.ebi.ac.uk/msd-srv/docs/dbdoc/

(45)