• No results found

Data Warehouse Technology And The MSD Databases

N/A
N/A
Protected

Academic year: 2021

Share "Data Warehouse Technology And The MSD Databases"

Copied!
45
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Warehouse Technology

And The MSD Databases

(2)

Data Warehouses

The MSD Databases

Populating & using the Search

(3)
(4)

What is a Data Warehouse?

A

subject-oriented, integrated, nonvolatile,

and

time-variant

collection of data that is

used primarily in organizational decision

making.”

(W. H. Inmon, Building the Data Warehouse,

John Wiley & Sons, 2002)

(5)

Data Warehouse—Subject-Oriented

Organised around major subjects, such as

customer, product, sales

Focusing on the modelling and analysis of

data for decision makers, not on daily

operations or transaction processing

Provides

a simple and concise

view around

particular subject issues by

excluding data

that are not useful in the decision support

process

(6)

Data Warehouse

Integrated

Constructed by integrating multiple,

heterogeneous data sources

relational databases, flat files, on-line transaction records

Data cleaning and data integration

techniques are applied

Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

When data are moved to the warehouse, they are converted into a consistent format to facilitate integration.

(7)

Data Warehouse

Non-Volatile

A physically separate store of data transformed from the operational environment

Operational update of data does not occur in the data warehouse environment

Does not require transaction processing, recovery, and

concurrency control mechanisms

Requires only two operations in data accessing:

(8)

Data Warehouse

Time Variant

The time horizon for the data warehouse is

significantly longer than that of

operational systems

Operational database: current value data

Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse

Contains an element of time, explicitly or implicitly

But the key of operational data may or may not contain

(9)

Dimensional Modelling

A typical commercial data warehouse is based on

multidimensional data model which views data in the form

of a data cube

A data cube allows data to be modelled and viewed in multiple dimensions (such as sales)

Dimension tables, representing important subject areas

such as item (item_name, brand, type), or time(day, week,

month, quarter, year)

Fact table containing varying levels of summarised data

(such as dollars_sold) and keys to each of the related

(10)

Data Warehouse Models

Star schema

A fact table, containing varying levels of summarised data, in

the middle connected to a set of dimension tables, representing important subject areas

Snowflake schema

A refinement of star schema where some dimensional hierarchy

is normalised into a set of smaller dimension tables, forming a shape similar to snowflake

Fact constellation

Multiple fact tables share dimension tables, viewed as

a collection of stars, therefore called galaxy schema

(11)
(12)
(13)

Snowflake Schema

SaleID ItemID Quantity SalePrice Amount Transactions ItemID Description QuantityOnHand ListPrice Category Merchandise SaleID SaleDate EmployeeID CustomerID SalesTax Sale CustomerID Phone FirstName LastName Address ZipCode CityID Customer CityID ZipCode City State City

Dimension tables can join to other

(14)

Fact Constellation

time_key item_key branch_key location_key unit_sold euros_sold avg_sales

Sales Fact Table

time_key day day_of_the_week month quarter year Time branch_key branch_name branch_type Branch item_key item_name brand type supplier_key Item location_key street city province/street country Location time_key item_key shipper_key from_location_key to_location_key unit_shipped

Shipping Fact Table

shipper_key shipper_name location_key shipper_type

(15)

Why Build a Data Warehouse?

Access to data from multiple sources, have a

comprehensive data collection.

Separate transactional and analysis systems:

Improve query response time (without slowing down

transaction processing)

Easy formulation of complex queries

Access to historical data (not in operational

systems)

Improved data quality (fewer errors and

missing values)

(16)

The Data Warehouse Pipeline

Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Front-End Tools Serve Operational DBs other sources Data Storage OLAP Server

(17)

Using the Warehouse

Ad Hoc query

simplistic submission of SQL statements from a command-line tool or a SQL-generation tool

Complex Analytical Questions

using custom written query tools or commercial online analytical processing (OLAP) tools

Data Mining

(18)
(19)

What is a Data Warehouse? (revisited)

“A data warehouse is simply a single,complete,

and consistent store of data obtained from a variety of sources and made available to end

users in a way they can understand and use it in

a business context.”

-- Barry Devlin, IBM Consultant

(20)

The MSD actually consists of two separate

databases:

the deposition database is highly normalised, with

thousands of relationships linking some 325 tables; the deposition database is the definitive archive for all

structural data at MSD

the search database is a simpler, but larger

denormalised database,which contains a large amount

of additional derived data, with data items duplicated

and aggregated into 170 much wider tables, making it more amenable to searching and retrieval of data

A third intermediate database is involved in

transforming the data from the deposition database to the search database and in calculating and adding the

derived data

(21)

The deposition database comprises:

common reference data, such as amino-acid connectivity, HET groups structures, etc.

older PDB entries, loaded from legacy files

schema includes strict constraints, enforcing internal

consistency and performing type checking and validation against the reference data

a huge amount of effort has gone into cleaning up the legacy data

new entries, loaded from recent PDB submissions

new entries are loaded on a weekly basis, subject to the same constraints and checks during loading as legacy data

(22)
(23)

NTX CHANGED TAX ID # OLD_TAX_ID * NEW_NAME_TXT * OLD_NAME_TXT * USERSTAMP o TIMESTAMP_NCBI NCBI SYNONYMS # NAME_CLASS # NAME_TXT NCBI # TAX_ID o PARENT_ID * SCIENTIFIC_NAME o PREFERRED_COMMON_NAME o RANK * HIDDEN ETAXI SYNONYMS # NAME_TXT # NAME_CLASS ETAXI o PARENT_ID o UPPER_NAME o FULL_NAME * COMPLETE_GENOME_FLAG o RANK * HIDDEN * SCIENTIFIC_NAME o LEFT_NUMBER o RIGHT_NUMBER * ANNOTATION_SOURCE DEP_PDB_TAXONOMY o PDB_COMMON_NAME o PDB_SCIENTIFIC_NAME o PDB_STRAIN DEP_NATURAL_SOURCE * CELL_ID o ATCC o CELL_LINE o CELL_LOCATION o CELL_TYPE o DETAILS o FRAGMENT o GENE o ORGAN o ORGANELLE o PLASMID o SECRETION o TISSUE o VARIANT DEP_ENTITY * NAME o ENTITY_SRC * ID o DETAILS o SYSTEM REF_CHEM_COMP o EBI_ID o HETGROUP_PARENT o MODEL_DETAILS o MOLECULAR_FORMULA o MOLECULAR_WEIGHT o EBI_RESERVED_NAME o RCSB_RESERVED_NAME ... REF_SEQ_REF_SEQ * COMPONENT * SERIAL REF_SEQ_REF * DB_NAME * PRIMARY_ID o SECONDARY_ID o VERSION DEP_POLY_ENTITY_SEQ * HETERO * SERIAL DEPOSITION * NUM_XTALS o PDB_EXP_TYPE * DEPOSITION_PROCESSED_BY * CREATION_DATE * LAST_UPDATE * TITLE ... DEP_RESIDUE DEP_SEQ_CORR * CONFLICT_ANNOTATED_FLAG * TYPE o CIF_TYPE o CONFLICT_TYPE o DETAILS DEP_SEQ_MATCH * CIF_ID o DETAILS DEP_SEQ_REF * PROC_MATCHED_FLAG o CIF_ID o DETAILS DEP_POLY_ENTITY * ENGINEERED * HETERO_FLAG * MUTANT_FLAG o FRAGMENT_FLAG o MUTATION_STRING * SYNTHETIC * MASTER_ID DEP_POLY_ENTITY_SEGMENT * RCSB_DEFINE_AS_ENTITY * RCSB_SERIAL_OFFSET DEP_POLY_ENTITY_MASTER has taxonomy taxonomy of is parent of is child of eta_ncb_fk eta_ncb_fk es_eta_fk es_eta_fk ns_ncb_fk ns_ncb_fk ncti_ncb_fk ncti_ncb_fk c d a b CP606 R/1374 JA3 JR3 compound defines compound defines is defined by parent comp child comp a b sequence of has sequence refers to referred to by conflict conflicts with matches entity matched by is a component of is composed of describes is described by referred to by referrs to precedes follows R/1939 CP623 R/1937 CP621

(24)

Is not a true data warehouse

Breaks several of the data warehouse

car

di

nal

“r

ul

es”

Non-volatile

Time Variant

But does make use of many of the data

warehousing concepts and techniques

Closest to a fact constellation

NOT The MSD Data Warehouse

(25)

Each data item occurs only once in the

deposition database, so that data from a single entry are spread across many tables

To make searching faster, the data are

aggregated into fewer, larger tables in the Search

database

Searching the Search database requires fewer

table joins, making database queries significantly faster and much less complex

The top-level entity in a structure entry is the assembly, as determined using the Protein Quaternary Structure server (PQS)

(26)

Chains Residues Atoms Exp. Result Assembly

ALT ASSEMBLY ASSEMBLY DATA ATOM ATOM DATA CHAIN RESIDUE ENTRY MODEL

(27)

Representing Macromolecular Structures (2)

Each level of the hierarchy can have associated properties, e.g.

Bound molecules

Domains

Site residues

Derived properties (e.g. asa)

Reference information (e.g.

standard geometry) ASU observed exp data Chains Residues Atoms Biological Unit(s) Independent units

(28)

During

transformation

from the deposition

database to the search database additional

derived data are added

Numerous processes are run on the

deposition data, including:

characterisation of ligand binding sites

derivation of secondary structure informationmapping data onto other databases such as

UniProt, Pfam, InterPro, GO, SCOP and CATH

(29)

Search Database Data Models

Entity-relationship

data model

as used in the deposition databasebut denormalised to aid querying

could be generated from deposition data model

Dimensional

model

typical of commercial data warehousesrequires separate data model

A hybrid of these two

required to handle to complexity of macromolecular structure data

very complex –fact tables could be dimensions for other fact tables

(30)

Part of Search Database Schema

MSD SEARCH DATABASE MARTS:

SECONDARY STRUCTURE COORDINATES-SEQUENCE TAXONOMY CATH

TURN

STRAND DATA SHEET ORDER

SHEET HBOND SHEET NCBI SYNONYMS NCBI MODEL HELIX HELIX HELIX DATA HAIRPIN MOTIF ETAXI SYNONYMS ETAXI DEPOSITION CATH_MAPPING COMPONENT DATA COMPONENT CHAIN BULGE ATOM DATA ATOM ASSEMBLY DATA ASSEMBLY ALT atom atom_data etaxi_ncbi_fk etaxi_ncbi_fk etaxi_syn_fk etaxi_syn_fk ncbi_syn_fk ncbi_syn_fk alt atl_atd_fk atd_model_fk atd_model_fk shb_fk8 shb_fk8 sheet_order_fk1 sheet_order_fk1 sheet_order_fk2 sheet_order_fk2 shb_fk2 shb_fk2 sheet_order_fk4 sheet_order_fk4 helix_data_fk2 helix_data_fk2 strand_data_fk1 strand_data_fk1 strand_data_fk2 strand_data_fk2 helix_d_fk2 helix_d_fk2 helix_d_fk1 helix_d_fk1 helix_data_fk1 helix_data_fk1 hairpin_motif_fk5 hairpin_motif_fk5 hairpin_motif_fk4 hairpin_motif_fk4 hairpin_motif_fk3 hairpin_motif_fk3 strand_data_fk4 strand_data_fk4 strand_data_fk5 strand_data_fk5 strand_data_fk6 _strand_data_fk6 strand_data_fk7 _strand_data_fk7 strand_data_fk8 _strand_data_fk8 strand_data_fk9 _strand_data_fk9 turn_fk1 turn_fk1 turn_fk3 turn_fk3 bulge_fk7 bulge_fk7 bulge_fk6 bulge_fk6 bulge_fk5 bulge_fk5 bulge_fk4 bulge_fk4 bulge_fk3 bulge_fk3 bulge_fk2 bulge_fk2 bulge_fk1 bulge_fk1 turn_fk4 turn_fk4 turn_fk5 turn_fk5 turn_fk6 turn_fk6 comp_tax_fk comp_tax_fk turn_fk2 turn_fk2 strand_data_fk3 strand_data_fk3 sheet_order_fk3 sheet_order_fk3 shb_fk7 shb_fk7 shb_fk1 shb_fk1 sheet_datak2 sheet_datak2 sheet_datak1 sheet_datak1 hairpin_motif_fk1 hairpin_motif_fk1 db_comp_fk db_comp_fk db_chain_fk db_chain_fk component_data_fk4 component_data_fk4 chain_tax_fk chain_tax_fk assembly_data_model_fk assembly_data_model_fk assembly_deposition_fk assembly_deposition_fk

(31)

Star Database Design

Interactions

Number of interactions Strongest interaction

Fact Table (residue_contact)

Residue Geometry Dimension Tables PDB Entry Model Assembly Ligand Neighbour Secondary structure Bond type Helix Turn Strand Chemical compound

(32)

From

Snowflake

to

Star

Residue Contact

Chemical compound Residue Assembly PDB entry

(33)
(34)

From Deposition to Search Database

Deposition Database Search database

Transformation Normalised relationships authoritative complete Denormalised fewer relationships derived information subset of data

(35)

Search database 200 GB Deposition database 40 GB D I S T R I B U T I O N 130 GB 200 GB Transformation database 200 GB Web services External Processes derived data load via mmcif Web services deposit transform distribute search PDB files replicate

MSD Database Pipeline

(36)

Transformation

Moving from a complex normalised model to enforce integrity to a simple, efficient simple user oriented

model

Assignment of consistent identifiers

In addition to PDB identifiers

Extensive indexing

Based on a flexible metadata driven mechanism, in-house developed to overcome Oracle limitations

Models composite entities and their dependenciesAllows incremental transformation

(37)

Post Transformation

Calculation of derived/aggregated data

Consistent across whole archive

Ability to query derived data

Adding value to the database

Active-site information

Structure , Sequence Alignment

Cross referencing SCOP, CATH, PFAM, UniProt, InterPro

Additional derived data and indexing required for web-based search services

Stored in search database, but conceptually part of search systems

Scientific parameterisation reflects requirements of search services

(38)

Efficiently Querying The Database

Requires using many of the tools provided by

the Oracle DBMS

STAR Joins

STAR TranformationsBitmap indexes

Index Organized Tables

Set operations (intersect, minus, union)

(39)

3D Spatial searches

Search example: find the following triangle site: Cbeta of Isoleucine or Leucine

Cbeta of Tryptophan or Tyrosine or Phenylalanine Cbeta of Arginine 6-8 Angstroms 6-8 Angstroms 6-8 Angstroms

(40)

The Query

select d1.atom_data1_id, d1.atom_data2_id from

(select /*+ NO_MERGE INDEX(atomic_dists)*/

atom_data1_id, atom_data2_id from atomic_dists where dist_id in (select id from dists

where code_3_letter1 in ('ILE','LEU') and code_3_letter2 in ('TRP','TYR','PHE') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'

and dist in (6,7,8))) d1,

(select /*+ NO_MERGE INDEX(atomic_dists) */

atom_data1_id, atom_data2_id from atomic_dists where dist_id in (select id from dists

where code_3_letter1 = 'ARG' and code_3_letter2 in ('ILE','LEU') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'

and dist in (6,7,8))) d2,

(select /*+ NO_MERGE INDEX(atomic_dists) */

atom_data1_id, atom_data2_id from atomic_dists where dist_id in (select id from dists

where code_3_letter1 = 'ARG' and code_3_letter2 in ('TRP','TYR','PHE') and chem_atom1_name = 'CB' and chem_atom2_name = 'CB'

and dist in (6,7,8))) d3

where d1.atom_data1_id = d2.atom_data2_id and d1.atom_data2_id = d3.atom_data2_id and d2.atom_data1_id = d3.atom_data1_id;

(41)

Managing Data

The information available in the MSD database is organised in application areas (data marts)

Users may replicate only the data marts that they are interested in

Some data marts are quite valuable and still small enough to be used on desktop systems as in the demonstration

The data marts are also loosely interrelated and can be synchronised independently

(42)

Data Marts

The Search database is divided into application areas, or data marts:

Structure DataDescriptionsSecondary StructureTaxonomyLigandsExperimental detailsCitations

Mapping to UniProt, SCOP, CATH, Pfam, InterPro, Go,

IntEnz, PubMed

Active Sites

Structural-Sequence alignment

Each data mart can be distributed and managed separately

(43)

What is the Search Database used for?

A target for data integration

eFamily

A direct backend for web-based services

MSDLite, MSDPro, MSDSite etc.

A source for data files

Data exported from DB to support web-based services (indirectly a backend)

Atlas pages

XML representation of sections of the DB

Including eFamily

Coordinates in PDB format for software that requires it

Data Mining

(44)

http://www.ebi.ac.uk/msd-srv/docs/dbdoc/

(45)

References

Related documents

Players can create characters and participate in any adventure allowed as a part of the D&D Adventurers League.. As they adventure, players track their characters’

Genome editing using CRISPR/Cas9 was recently successfully applied in vivo and/or in cell lines of several major aquaculture species of Salmonidae (Atlantic salmon, Salmo salar

American Bar Association, Partnership Taxation Section American College of Trust and Estate Counsel (ACTEC), Business Planning and Asset Protection Committees Florida Bar

His current research interests are in the area of supramolecular chemistry, including the design and synthesis of functional dendritic and polymeric materials based on fullerene

Although not specifically corrosion related, there are other problems that occur in steam turbines including: deposition on blade surfaces; water droplet erosion of wet stage

Therefore, in the present study, the negative evidence is the two WCF types (direct and indirect) provided to the learners’ written work. The treatment, which includes the written

In 2018 STAR took data in Ru+Ru and Zr+Zr collisions to study the CME, and Au+Au collisions at √s NN = 27 GeV to look for difference in the global polarization between Λ and ¯Λ due

a) An α-helix is a right-handed structure containing disulfide bonds while a β-pleated sheet is a left-handed structure containing ionic bonds. c) An α-helix