• No results found

Triple Store Databases and their role in high throughput, automated, extensible, data analysis

N/A
N/A
Protected

Academic year: 2020

Share "Triple Store Databases and their role in high throughput, automated, extensible, data analysis"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

March 2005

March 2005 Jeremy FreyJeremy Frey ACS CINFACS CINF

Triple store databases

Triple store databases

and their role in high

and their role in high

throughput, automated

throughput, automated

extensible data analysis

extensible data analysis

Jeremy Frey

Jeremy Frey

ACS San Diego March 2005

ACS San Diego March 2005

Talk: Workflow

Talk: Workflow

!

!

Introduction to the Introduction to the CombechemCombechem Project Project

!

!

Smart Dark LabsSmart Dark Labs

!

!

Semantics & DatabasesSemantics & Databases

!

!

RDF and Triple StoresRDF and Triple Stores

!

(2)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

e

e

-Science

-Science

‘[The Grid] intends to make access to computing power, scientific data

repositories and experimental facilities as easy as the Web makes access to information.’ Tony Blair, 2002

!

What is the web?

!

Need to worry about what the data means as well as simple numerical value

The Comb

e

Chem Project

! The exponential world of combinatorial

synthesis and high throughput analysis meets the exponentially growing power of computing

! Provenance:- Trace all the way back from publication to the original data which may be in several different labs

! But then.... “Who wants provenance?”

!"#$%&!'()*&+$,&-./%&0.1 %&/.2&3 &4"5567&8 9 9 :

! ; 6*<,= &67&$)> $?5$*6"> $@ "5&A 6',B"'(* B*C#5(''6> *(@ $C

(3)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

Literature data

Expt Data

Calculation

Processes

Analysed Data

Expt Data Different laboratories

samples

Triple Stores

Agent

Instrumentation

Databases Archive Databases

Archive Suppliers Simulations Calculations

Collaboration

Collaboration

Notification Comms Security

XML WF

XML Researcher

Researcher Researcher

Suppliers

(4)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

Develop process on

few

Apply to the whole

library Synthesis,

analysis, measurements Design of

experiments

High-throughput vs Parallel

May impose a real time constraint in analysis

Chemists and programming

Chemists and programming

!

!

Many ChemistsMany Chemists think that they

think that they

can program

can program

!

!

So leave theSo leave the systems to the

systems to the

Chemists

Chemists

(5)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

Databases - Our experience

Databases - Our experience

!

!

What do you do when the actual usersWhat do you do when the actual users

keep changing their mind?

keep changing their mind?

!

!

Is a traditional relational databaseIs a traditional relational database

suitable?

suitable?

!

!

Danger of re-enforcing scientific biasDanger of re-enforcing scientific bias

against relational database for

against relational database for

laboratory data.

laboratory data.

!

(6)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

RDF

RDF

!

!

TriplesTriples

!

! Subject, predicate, objectSubject, predicate, object

!

!

Results in a chain of reasoning that canResults in a chain of reasoning that can be mapped by generic inference software

be mapped by generic inference software

!

!

Real power comes from the dataReal power comes from the data becomes self describing

becomes self describing

!

!

Together with an Ontology the dataTogether with an Ontology the data becomes

becomes ‘‘alivealive’’

RDF

RDF

!

! Project started in 1997 - W3C standard in 2000Project started in 1997 - W3C standard in 2000 !

! Aim to enable the Semantic WebAim to enable the Semantic Web

!

! But how many are there even though we have hadBut how many are there even though we have had the standard for 4 years

the standard for 4 years

!

! Enable search engines to collect informationEnable search engines to collect information from many sources

from many sources

!

! At present requires deliberate human controlledAt present requires deliberate human controlled efforts

efforts

!

! Advent of a critical mass of metadata facilitatesAdvent of a critical mass of metadata facilitates machine-based discovery.

(7)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

RDF

RDF

!

!

Capable of representing more complexCapable of representing more complex structures than XML which is

structures than XML which is

fundamentally a tree like structure

fundamentally a tree like structure

!

!

Relationships like Relationships like ‘‘similar tosimilar to’’ can be can be employed

employed

!

!

Not just for web & sharingNot just for web & sharing

!

!

Useful as a local storage and referenceUseful as a local storage and reference system

system

Chemical data

Chemical data

!

!

Chemical dataChemical data

!

! Requires extensive annotationRequires extensive annotation

!

! Purity, method, Purity, method, accuracy,conditionsaccuracy,conditions

!

!

RDFRDF

!

! Schema can supply this informationSchema can supply this information

!

! Easy to add new informationEasy to add new information

!

! Very flexibleVery flexible

!

(8)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

root

root

!

!

INChIINChI (International Chemical Identifier) (International Chemical Identifier)

!

! Can be too long so use MD5 hashCan be too long so use MD5 hash

!

!

Molecule centric viewMolecule centric view

!

! Full stereochemistry (if known)Full stereochemistry (if known)

!

! Phase & PolymorphsPhase & Polymorphs

!

! StructureStructure

!

!Calculated, derived, crystalCalculated, derived, crystal

!

!

PropertiesProperties

!

! Contains tracking informationContains tracking information

!

(9)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS Property in RDF

Property in RDF ! <c:OrganicMolecule

rdf:about="file:///storage/ba8efc2ce0edada69d63b02d1b8630c6.rdf"> !

<c:has-inchi>1.12Beta/C12H13NO2/c1-2-15-8-9-5-6-11(14)12-10(9)4-3-7-13-12/h1H3,2H2,3-7H,8H2,14H</c:has-inchi> ! <c:has-cas>22049-19-0</c:has-cas>

! <c:has-empirical-formula>C12H13NO2</c:has-empirical-formula> ! <c:has-stereocentres>0</c:has-stereocentres>

! <c:has-property> ! <c:MeltingPoint> ! <c:has-information> ! <c:Information>

! <c:has-value>150</c:has-value> ! <c:has-uncertainty>

! <c:Range>

! <c:has-value>16</c:has-value> ! </c:Range>

! </c:has-uncertainty> ! </c:Information> ! </c:has-information> ! </c:MeltingPoint> ! </c:has-property> ! </c:OrganicMolecule>

Currently testing on 200,000 compounds but about to go up by order of magnitude

3Store is a scaleable solution

Schema

Schema ! <rdfs:Class rdf:about="&c;OrganicMolecule">

! <rdfs:label>Organic Molecule</rdfs:label>

! <rdfs:subClassOf rdf:resource="&c;Molecule" />

! </rdfs:Class>

! <rdfs:Class rdf:about="&c;PhysicalProperty">

! <rdfs:label>Property</rdfs:label>

! </rdfs:Class>

! <rdfs:Class rdf:about="&c;PartitionCoefficient">

! <rdfs:label>Paritition Coefficient</rdfs:label>

! <rdfs:subClassOf rdf:resource="&c;PhysicalProperty" />

! <rdfs:description>Ratio of substance dissolved in octan-1-ol and water </rdfs:description>

! </rdfs:Class>

(10)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

Triple Store

Triple Store

!

!

Simply a store for a set of RDF triplesSimply a store for a set of RDF triples

!

!

Collection of triples can then be used forCollection of triples can then be used for inference as each triple is an assertion

inference as each triple is an assertion

!

!

Several triple stores are available Several triple stores are available –– JENA JENA

!

!

But we require a very large triple storeBut we require a very large triple store

!

! 250,000 molecules some with properties250,000 molecules some with properties needed 10 million triples

needed 10 million triples

!

! 100 million triples to represent the ZINC100 million triples to represent the ZINC database

database

(11)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS Semantic web

-But how to store it?

(12)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

Triple Store

Triple Store

3Store

3Store

!

!

Need to have a triple store that scalesNeed to have a triple store that scales

!

!

3Store from the AKT project looked like it3Store from the AKT project looked like it would support this quantity of data

would support this quantity of data

!

!

Current tests suggest that it can.Current tests suggest that it can.

!

!

Achieves this by careful construction of aAchieves this by careful construction of a type of index

type of index

!

!

Queries need to be done in RDQLQueries need to be done in RDQL

!

! This is not yet fully developed but should beThis is not yet fully developed but should be W3C standard soon

(13)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

Triple Stores

Triple Stores

!

!

Must be walked as the data structure isMust be walked as the data structure is not determined in advance

not determined in advance

!

! Slower than relational databaseSlower than relational database

!

(14)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

Scaling?

Scaling?

0 50000 100000 150000 200000 250000

0 5000 10000 15000 20000

no. molecules added /200

Time/s

t = 0.0003n2+8.487n

(15)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

Similar architecture to the smart lab systems – integrate 2 systems

SURIGSURIG SURIG Data stores Semantic

Data

Other services Weights &

Measures Bench Planner0

Viewer0

PHP

Java

SOAP

Jena

SURIG

Applications

Institutional archives

and metadata publication

The time is

right to think of the Future – now we have an RDF

Platform for experimental &

computational combinatorial Chemistry

(16)

March 2005

March 2005 Jeremy FreyJeremy Frey ACSACS

People

People

!

! KieronKieron Taylor (Chemistry) Taylor (Chemistry) !

! Rob Gledhill (Chemistry)Rob Gledhill (Chemistry) !

! HongchenHongchen Fu (Chemistry) Fu (Chemistry) !

! Jamie Robinson (Chemistry)Jamie Robinson (Chemistry) !

! Steve Harris (ECS)Steve Harris (ECS) !

! Hugo Mills (ECS)Hugo Mills (ECS) !

! Gareth Hughes (ECS)Gareth Hughes (ECS) !

! Jon Essex (Chemistry)Jon Essex (Chemistry) !

! Dave De Dave De RoureRoure (ECS) (ECS) !

! Nigel Nigel ShadboltShadbolt (ECS) (ECS)

Making sure other people can re-use

Making sure other people can re-use

your data easily and with confidence

your data easily and with confidence

Even when there is a huge amount of it!

Even when there is a huge amount of it!

But how big a triple store will we need?

References

Related documents

[r]

Predicting Forage Nutritive Value Using an In Vitro Gas Production Technique and Dry Matter Intake of Grazing Animals Using n-Alkanes..

600 free texts, to any UK network, any time Unlimited* web access for just 30p a day 20p per min flat rate to any UK network.. When you top up with £15

seized by force whatever parts of Finnish territory he wanted to use for prosecuting his crusade against Bolshevism. Considerable intelligence data suggested that at least some

“I am the head of the Engineering, Sales, Marketing, and Project Management training departments and UT’s Master’s in Engineering Management program looked like something that

Between 2012 and 2014, faecal samples from 119 feral cats trapped across both North and South Islands were examined for the presence of helminth eggs and protozoal stages

Wear appropriate personal protective equipment (such as hard hats, work gloves, safety shoes, and eye protection). Implement injury awareness training (such as dropped objects,

Military Grade Security &amp; Data Replication - Included Instant access to existing and deleted messages - Included Search across all Message Types Simultaneously - Included.