Metadata Requirements for Data Management

Paul Millar

Contents

The problem
The metadata group
Summary of use-cases
Metadata as keyword-value pairs
Performance issues
Looking at the keys
Looking at the values
Collections

The problem

HEPCAL I defines metadata as the experiment’s responsibility, but there is obviously commonality between different experiments. EDG provided two components for supplying metadata: the replica metadata catalogue (RMC) and Spitfire, a Grid-enabled database. RMC supported light-weight metadata; Spitfire was a fully functional database that could be used for storing application/experiment-specific metadata.

But both were solutions at too low a level: experiments had a common layer of functionality above these.

The metadata group

First meeting was May 2004 in Stepps (near Glasgow).

Current membership includes people from CMS, CDF/D0, BaBar, LHCb, EGEE and ATLAS (possibly ALICE soon).

Working groups have been formed to look into Use cases, Services, Deployment Architecture, Query Languages and Interfaces, Tools and Deliverables.

The group is trying to pull together common functionality from different experiments.

This common functionality can then be packaged and used by different experiments (less reinventing the wheel).

The group is also looking at existing software to see how it can be used by all experiments (again, less reinventing the wheel).

Summary of use-cases

Steve Hanlon summarised HEPCAL I, II and CDF use-cases. Here is the summary of the summary.

HEPCAL I defines event metadata (application dependent, dealt with by the experiments) and conditions data (running conditions and calibration data); both are types of metadata. It also defines metadata as keyword-value pairs.

HEPCAL II also mentions dataset metadata and event metadata. It introduces the idea of component metadata, e.g. the sub-detector or physics channel the component is relevant to. It also mentions metadata as a mechanism for datafile provenance and book-keeping.

SAM/CDF takes the view that everything has metadata: files, jobs, services, ... Metadata is keyword-value pairs, stored in a central database.

Metadata as keyword-value pairs

What is metadata? “Data about data”

Different people want to store different (sub-)sets of information against different objects.

To cover the lowest common denominator, metadata maps an arbitrary string to another string (with numerical data stored as strings?).

This isn’t very good for either the “keyword” part or the “value” part. Sometimes a string isn’t long enough: there are instances of metadata that exceed the string size limit.
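
One weakness of the string-to-string model can be sketched in a few lines. This is a minimal illustration, assuming a plain Python dict and made-up keywords (`run_number`, `beam_energy_gev`, `detector`), none of which come from the talk: comparing raw values is lexicographic, so every consumer has to know which keywords hold numbers.

```python
# Hypothetical metadata for one datafile: every keyword and value is a string.
metadata = {
    "run_number": "900",
    "beam_energy_gev": "7000",
    "detector": "tracker",
}

# Comparing the raw string value is lexicographic ("9" sorts after "1").
string_comparison = metadata["run_number"] < "1000"

# The numeric comparison needs an explicit conversion by the consumer.
numeric_comparison = int(metadata["run_number"]) < 1000

print(string_comparison)   # False
print(numeric_comparison)  # True
```

The two comparisons disagree, which is the kind of surprise a typed value would avoid.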

Performance issues

Keyword-value pairs, stored in a separate table, are a bad choice for performance. Queries are more complex and it’s more difficult for the database to optimise them.

Optimisation method discussed at the metadata meeting:

Monitor people’s queries.

Regularly (per month, week, or day?) choose the most popular keywords.

Build an optimisation table that summarises these keyword-value pairs as regular columns.

Dynamically translate user queries to use the optimisation table. What about synchronisation?
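
The optimisation method can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table names (`metadata`, `metadata_opt`), the keywords (`run`, `stream`) and the sample rows are invented, and the list of popular keywords is hard-coded rather than derived from query monitoring.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata (file_id INTEGER, keyword TEXT, value TEXT);
    INSERT INTO metadata VALUES
        (1, 'run', '100'), (1, 'stream', 'muon'),
        (2, 'run', '100'), (2, 'stream', 'electron'),
        (3, 'run', '101'), (3, 'stream', 'muon');
""")

# Keyword-value form: the query needs one self-join per keyword.
kv_query = """
    SELECT r.file_id
    FROM metadata AS r
    JOIN metadata AS s ON s.file_id = r.file_id
    WHERE r.keyword = 'run' AND r.value = '100'
      AND s.keyword = 'stream' AND s.value = 'muon'
"""
kv_result = [row[0] for row in conn.execute(kv_query)]

# Assumed outcome of monitoring queries: these keywords are the popular ones.
# They are trusted identifiers here, so they are interpolated into the SQL.
popular = ["run", "stream"]
columns = ", ".join(
    f"MAX(CASE WHEN keyword = '{k}' THEN value END) AS {k}" for k in popular
)

# Optimisation table: each popular keyword becomes a regular column.
conn.execute(
    "CREATE TABLE metadata_opt AS "
    f"SELECT file_id, {columns} FROM metadata GROUP BY file_id"
)

# The translated query is a plain filter on columns, with no self-joins.
opt_query = "SELECT file_id FROM metadata_opt WHERE run = '100' AND stream = 'muon'"
opt_result = [row[0] for row in conn.execute(opt_query)]

print(kv_result, opt_result)  # [1] [1]
```

In this sketch `metadata_opt` is a snapshot: any later insert into `metadata` is invisible to it until the table is rebuilt or updated, which is the synchronisation question raised above.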

Looking at the keys

Any keyword from a keyword-value pair must mean something. Is a text string good enough to define something’s meaning?

The precise meaning might vary from person to person, so merging metadata might be problematic.
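
A small sketch of why merging can be problematic, assuming two hypothetical groups that both use the keyword `energy` but in different units. The function `merge_metadata` and all the data are invented for illustration.

```python
def merge_metadata(first, second):
    """Merge two keyword-value mappings, collecting keywords whose values disagree."""
    merged = dict(first)
    conflicts = {}
    for keyword, value in second.items():
        if keyword in merged and merged[keyword] != value:
            # Keep the first value, but record that the two sources disagree.
            conflicts[keyword] = (merged[keyword], value)
        else:
            merged[keyword] = value
    return merged, conflicts

# Hypothetical: the same beam energy, recorded in GeV by one group and TeV by the other.
group_a = {"energy": "7000", "detector": "tracker"}
group_b = {"energy": "7", "stream": "muon"}

merged, conflicts = merge_metadata(group_a, group_b)
print(conflicts)  # {'energy': ('7000', '7')}
```

The conflict is reported although both groups describe the same quantity, and the opposite case (equal strings with different meanings) would pass unnoticed. A comparison of strings cannot tell the two situations apart.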

Looking at the values

If we define values as strings, only a subset of all possible strings is likely to be valid.

We need to be able to test whether the data is correct.

A regular expression (regexp) should be able to test the validity of any string (given a sufficiently long regexp).

Is a regexp test a sufficient condition?

What should we do when a validation test changes and existing metadata then fails?
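
A minimal sketch of regexp-based validation, using Python's `re` module. The keywords, the patterns and the rule that unknown keywords are accepted are all assumptions made for the example.

```python
import re

# Hypothetical validation rules: keyword -> compiled regexp for its value.
validators = {
    "run_number": re.compile(r"[0-9]+"),
    "date": re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),
}

def is_valid(keyword, value):
    """Return True if the whole value matches the regexp registered for the keyword."""
    pattern = validators.get(keyword)
    if pattern is None:
        return True  # assumption: keywords without a rule are accepted
    return pattern.fullmatch(value) is not None

good_run = is_valid("run_number", "12345")
bad_run = is_valid("run_number", "12a")
odd_date = is_valid("date", "2004-13-45")

print(good_run, bad_run, odd_date)  # True False True
```

The last case passes the pattern although month 13 does not exist, which suggests that a simple regexp test is necessary rather than sufficient.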

Collections

Collections are multiple datafiles.

All the files have something in common (even if it’s abstract).

If they have something in common, they have identical metadata. Should a collection change in size if the metadata changes?
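
One way to picture the question is a collection defined by a metadata query instead of a fixed file list. The sketch below assumes that reading; the file names, keywords and the `collection` helper are hypothetical.

```python
# Hypothetical datafiles with their keyword-value metadata.
files = {
    "file_a": {"run": "100", "stream": "muon"},
    "file_b": {"run": "100", "stream": "electron"},
    "file_c": {"run": "101", "stream": "muon"},
}

def collection(files, required):
    """Names of all files whose metadata contains every required keyword-value pair."""
    return sorted(
        name
        for name, md in files.items()
        if all(md.get(k) == v for k, v in required.items())
    )

before = collection(files, {"stream": "muon"})

# A metadata change on one file changes the membership of the collection.
files["file_b"]["stream"] = "muon"
after = collection(files, {"stream": "muon"})

print(before)  # ['file_a', 'file_c']
print(after)   # ['file_a', 'file_b', 'file_c']
```

Under this definition the collection grows when the metadata changes; a collection stored as an explicit list of files would stay the same size but could drift away from its defining metadata.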

Hierarchies of metadata

Hierarchies of inheritance (“subsets” from OOP): more metadata is added as one descends the hierarchy and the object becomes more specific.

Hierarchies of scope (who gets to see the metadata). Some metadata is sensitive. Also avoids confusion.

Hierarchies of extent: applying metadata to a collection of objects.

Confusing these different hierarchies can cause headaches later.
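
The inheritance hierarchy can be sketched with `collections.ChainMap` from the Python standard library. The three levels (experiment, run, file) and all keywords and values are assumptions chosen for illustration.

```python
from collections import ChainMap

# Hypothetical levels, from most general to most specific.
experiment_md = {"experiment": "example-exp", "owner": "collaboration"}
run_md = {"run": "100", "beam_energy_gev": "7000"}
file_md = {"stream": "muon", "owner": "shift-crew"}

# Lookups search the most specific level first, then fall back to the parents.
file_view = ChainMap(file_md, run_md, experiment_md)

print(file_view["experiment"])  # example-exp (inherited from the experiment)
print(file_view["owner"])       # shift-crew (overridden at file level)
print(len(file_view))           # 5 distinct keywords visible at file level
```

This models only inheritance. Scope (who may see a keyword) and extent (which objects a value applies to) would need their own structures, which is one reason to keep the three hierarchies apart.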

Consolidation layer

ARDA defines different services, with metadata as a separate one. Users have asked to do JOINs across metadata and replica location.

Solutions:

One way is to merge the metadata and replica location services.

One way is to add extra functionality to either the replica location service or the metadata service, so that it queries the other.

One way is to have some “consolidation” layer that links the two in as optimal a way as possible.
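
The third option can be sketched as a thin function sitting above two independent services. Everything here is a stand-in: the dictionaries play the role of the metadata and replica location services, and the function names, logical file names and site names are hypothetical, not an ARDA or EDG interface.

```python
# Stand-in for the metadata service: logical file name -> keyword-value pairs.
METADATA = {
    "lfn:/data/f1": {"run": "100"},
    "lfn:/data/f2": {"run": "100"},
    "lfn:/data/f3": {"run": "101"},
}

# Stand-in for the replica location service: logical file name -> sites.
REPLICAS = {
    "lfn:/data/f1": ["site-a", "site-b"],
    "lfn:/data/f2": ["site-b"],
    "lfn:/data/f3": ["site-a"],
}

def query_metadata(keyword, value):
    """Metadata service call: logical file names whose keyword has the given value."""
    return sorted(lfn for lfn, md in METADATA.items() if md.get(keyword) == value)

def locate_replicas(lfn):
    """Replica location service call: sites holding a copy of the file."""
    return REPLICAS.get(lfn, [])

def files_at_site(keyword, value, site):
    """Consolidation layer: join the metadata query with replica locations."""
    return [
        lfn for lfn in query_metadata(keyword, value)
        if site in locate_replicas(lfn)
    ]

result = files_at_site("run", "100", "site-a")
print(result)  # ['lfn:/data/f1']
```

This naive join issues one replica lookup per matching file. A real consolidation layer would presumably batch or reorder the calls, which is where "as optimal a way as possible" comes in.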

Conclusions

Metadata is difficult.

Just having a database doesn’t solve some of the longer-term issues.

A working group has been formed to allow rapid development and to evolve towards a common solution.

Do we need (want) a completely flexible system? How good is good enough?
