Metadata Requirements for Data Management

Paul Millar

Contents

The problem
The metadata group
Summary of use-cases
Metadata as keyword-value pairs
Performance issues
Looking at the keys
Looking at the values
Collections

The problem

HEPCAL I defines metadata as the experiment’s responsibility, but there is obviously commonality between different experiments.

EDG provided two components for supplying metadata: the replica metadata catalogue (RMC) and Spitfire, a Grid-enabled database. RMC supported light-weight metadata; Spitfire was a fully functional database that could be used for storing application/experiment-specific metadata.

But both were solutions at too low a level: the experiments had a common layer of functionality above these.

The metadata group

First meeting was in May 2004 in Stepps (near Glasgow).

Current membership includes people from CMS, CDF/D0, BaBar, LHCb, EGEE and ATLAS (possibly ALICE soon).

Working groups formed to look into Use cases, Services, Deployment Architecture, Query Languages and Interfaces, Tools and Deliverables.

Trying to pull together common functionality from different experiments.

This common functionality can then be packaged and used by different experiments (reducing reinvention of the wheel).

Also looking at existing software to see how it can be used by all experiments (again reducing reinvention of the wheel).

Summary of use-cases

Steve Hanlon summarised the HEPCAL I, HEPCAL II and CDF use-cases. Here is the summary of the summary.

HEPCAL I defines event metadata (application dependent, dealt with by the experiments) and conditions data (running conditions and calibration data). Both are types of metadata. It also defines metadata as keyword-value pairs.

HEPCAL II also mentions dataset metadata and event metadata. It also introduces the idea of component metadata, e.g. the sub-detector or physics channel the component is relevant to. It also mentions metadata as a mechanism for datafile provenance and book-keeping.

SAM/CDF takes the view that everything has metadata: files, jobs, services, ... Metadata is keyword-value pairs, stored in a central database.

Metadata as keyword-value pairs

What is metadata? “Data about data”

Different people want to store different (sub-)sets of information against different objects.

To cover the lowest common denominator, metadata maps an arbitrary string to another string (numerical data stored as strings?)

This isn’t very good for either the “keyword” part or the “value” part. Sometimes a string isn’t long enough: there are instances of metadata that exceed the string size limit.
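One consequence of storing numerical data as strings can be shown with a minimal sketch. The keywords and values below are invented for illustration and are not taken from any experiment's schema.

```python
# Hypothetical metadata for two datafiles: every keyword and value is a plain string.
metadata = {
    "run_number": "10",
    "detector": "calorimeter",
}
other = {"run_number": "9"}

# Comparing the raw strings is lexicographic, so "10" sorts before "9".
string_order = metadata["run_number"] < other["run_number"]

# Comparing after conversion gives the numeric ordering instead.
numeric_order = int(metadata["run_number"]) < int(other["run_number"])

print(string_order)   # True
print(numeric_order)  # False
```

Any range query on such a value therefore has to know that the string is really a number, which is information the string-to-string map itself does not carry.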

Performance issues

Keyword-value pairs, stored in a separate table, are a bad choice for performance. Queries are more complex and it’s more difficult for the database to optimise them.

Optimisation method discussed at the metadata meeting:

Monitor people’s queries.

Regularly (per month, week, or day?) choose the most popular keywords.

Build an optimisation table that summarises these keyword-value pairs as regular columns.

Dynamically translate user queries to use the optimisation table.

What about synchronisation?
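The optimisation method above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table names (kv, kv_opt), the column names, the sample rows and the list of "popular" keywords are all invented, and query monitoring is replaced by a hard-coded list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Keyword-value pairs in a separate table: one row per (file, keyword).
cur.execute("CREATE TABLE kv (file_id TEXT, keyword TEXT, value TEXT)")
cur.executemany(
    "INSERT INTO kv VALUES (?, ?, ?)",
    [
        ("f1", "run", "10"), ("f1", "detector", "calo"), ("f1", "owner", "alice"),
        ("f2", "run", "10"), ("f2", "detector", "muon"),
        ("f3", "run", "11"), ("f3", "detector", "calo"),
    ],
)

# "run = 10 AND detector = calo" needs one self-join per keyword in the query.
kv_query = """
    SELECT a.file_id
    FROM kv AS a JOIN kv AS b ON a.file_id = b.file_id
    WHERE a.keyword = 'run' AND a.value = '10'
      AND b.keyword = 'detector' AND b.value = 'calo'
"""
kv_result = [row[0] for row in cur.execute(kv_query)]

# Assumption: monitoring has shown these to be the most popular keywords.
# They are trusted names here, so they are interpolated directly into the SQL.
popular = ["run", "detector"]

# Optimisation table: one row per file, one regular column per popular keyword.
columns = ", ".join(
    f"MAX(CASE WHEN keyword = '{k}' THEN value END) AS {k}" for k in popular
)
cur.execute(
    f"CREATE TABLE kv_opt AS SELECT file_id, {columns} FROM kv GROUP BY file_id"
)

# The same question, translated to the optimisation table, is a flat lookup.
opt_query = "SELECT file_id FROM kv_opt WHERE run = '10' AND detector = 'calo'"
opt_result = [row[0] for row in cur.execute(opt_query)]

print(kv_result)   # ['f1']
print(opt_result)  # ['f1']
```

In this sketch kv_opt is a snapshot: rows inserted into kv afterwards do not appear in it until it is rebuilt, which is one concrete form of the synchronisation question.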

Looking at the keys

Any keyword from a keyword-value pair must mean something. Is a text string good enough to define something’s meaning?

The precise meaning might vary from person to person, so merging metadata might be problematic.
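A small sketch of how merging can go wrong when two people use the same keyword with different meanings. The keyword "energy", the units and the merge helper are all hypothetical and chosen only for illustration.

```python
# Two people annotate the same file with the same keyword.
alice = {"energy": "7000"}  # assumption: Alice means beam energy in GeV
bob = {"energy": "7"}       # assumption: Bob means beam energy in TeV


def merge(a, b):
    """Merge two keyword-value maps and report keywords whose values disagree."""
    merged = dict(a)
    conflicts = []
    for keyword, value in b.items():
        if keyword in merged and merged[keyword] != value:
            conflicts.append(keyword)
        else:
            merged[keyword] = value
    return merged, conflicts


merged, conflicts = merge(alice, bob)
print(conflicts)  # ['energy']
```

The text string "energy" alone cannot say that both values describe the same physical quantity; the merge can only report that they differ.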

Looking at the values

If we define values as strings, only a subset of all available strings is likely to be valid.

Need to be able to test if the data is correct.

A regular expression (regexp) should be able to test the validity of any string (given a sufficiently long regexp).

Is a regexp test a sufficient condition?

What to do when the validation test changes and existing metadata then fails it?
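A minimal sketch of regexp validation, and of a case where a simple regexp accepts a value that is still wrong. The keyword being validated (a run date in YYYY-MM-DD form) and the function names are assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical rule: the value must look like YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def passes_regexp(value):
    """True if the whole string has the expected shape."""
    return DATE_PATTERN.fullmatch(value) is not None


def is_real_date(value):
    """True if the string also names a date that exists in the calendar."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


for value in ["2004-05-17", "2004-13-45", "May 2004"]:
    print(value, passes_regexp(value), is_real_date(value))
```

The second value has the right shape but is not a real date, so this particular pattern is a necessary check rather than a sufficient one. A longer regexp could reject it, at the cost of being much harder to read and maintain.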

Collections

Collections are multiple datafiles.

All the files have something in common (even if it’s abstract).

If they have something in common, they have identical metadata. Should a collection change in size if the metadata changes?
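The question of whether a collection should change in size can be made concrete with a sketch that contrasts a collection frozen at creation time with one re-evaluated from the metadata. The catalogue contents and the select helper are invented for illustration.

```python
# Hypothetical catalogue: datafile name -> keyword-value metadata.
catalogue = {
    "file_a": {"run": "10", "stream": "muon"},
    "file_b": {"run": "10", "stream": "electron"},
    "file_c": {"run": "11", "stream": "muon"},
}


def select(keyword, value):
    """Return the files whose metadata currently has keyword == value."""
    return sorted(
        name for name, md in catalogue.items() if md.get(keyword) == value
    )


# A static collection is a list of files frozen when the collection is made.
static_collection = select("stream", "muon")

# The metadata of one file is later corrected.
catalogue["file_b"]["stream"] = "muon"

# A dynamic collection is re-evaluated, so it grows; the static one does not.
dynamic_collection = select("stream", "muon")

print(static_collection)   # ['file_a', 'file_c']
print(dynamic_collection)  # ['file_a', 'file_b', 'file_c']
```

Neither behaviour is obviously the right one: the static list is reproducible, while the dynamic one always agrees with the current metadata.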

Hierarchies of metadata

Hierarchies of inheritance (“subsets” from OOP): more metadata is added as one descends the hierarchy and the object becomes more specific.
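A hierarchy of inheritance can be sketched with Python's collections.ChainMap, where each more specific level adds keywords and falls back to the levels above it. The three levels and their keywords are assumptions chosen for the example.

```python
from collections import ChainMap

# Hypothetical levels, from general to specific.
experiment = {"experiment": "demo", "beam": "proton"}
run = {"run": "10"}
datafile = {"stream": "muon"}

# ChainMap searches the first mapping, then falls back to the later ones,
# so the most specific level is listed first.
run_view = ChainMap(run, experiment)
file_view = ChainMap(datafile, run, experiment)

print(len(run_view))      # 3
print(len(file_view))     # 4
print(file_view["beam"])  # proton
```

The datafile sees every keyword defined above it plus its own, so the amount of metadata grows as one descends the hierarchy.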

Hierarchies of scope (who gets to see the metadata). Some metadata is sensitive. Also avoids confusion.

Hierarchies of extent, applying metadata to a collection of objects.

Confusing these different hierarchies can cause headaches later.

Consolidation layer

ARDA defines different services, with metadata as a separate one. Users have asked to do JOINs across metadata and replica location.

Solutions:

One way is to merge the metadata and replica location services.

Another way is to add extra functionality to either the replica location service or the metadata service so that it queries the other.

A third way is to have some “consolidation” layer that allows the two to be linked in as optimal a way as possible.
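The third option can be sketched as a thin layer that sits above both services and performs the join on the client's behalf. All class names, method names and data below are hypothetical stand-ins; they are not the ARDA interfaces.

```python
class MetadataService:
    """Hypothetical stand-in: maps a logical file name to keyword-value pairs."""

    def __init__(self, records):
        self._records = records

    def find(self, keyword, value):
        return [
            lfn for lfn, md in self._records.items() if md.get(keyword) == value
        ]


class ReplicaLocationService:
    """Hypothetical stand-in: maps a logical file name to its replica sites."""

    def __init__(self, replicas):
        self._replicas = replicas

    def locate(self, lfn):
        return self._replicas.get(lfn, [])


class ConsolidationLayer:
    """Queries both services and joins the results on the logical file name."""

    def __init__(self, metadata, replicas):
        self._metadata = metadata
        self._replicas = replicas

    def files_at_site(self, keyword, value, site):
        matches = self._metadata.find(keyword, value)
        return [lfn for lfn in matches if site in self._replicas.locate(lfn)]


metadata = MetadataService({
    "lfn:a": {"stream": "muon"},
    "lfn:b": {"stream": "muon"},
    "lfn:c": {"stream": "electron"},
})
replicas = ReplicaLocationService({
    "lfn:a": ["glasgow", "cern"],
    "lfn:b": ["cern"],
    "lfn:c": ["glasgow"],
})
layer = ConsolidationLayer(metadata, replicas)
result = layer.files_at_site("stream", "muon", "glasgow")
print(result)  # ['lfn:a']
```

This version always queries the metadata service first. A real layer would presumably choose the order that returns the smaller intermediate result, which is where the "as optimal a way as possible" part comes in.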

Conclusions

Metadata is difficult.

Just having a database doesn’t solve some of the longer-term issues.

Working group formed to allow rapid development and evolve towards a common solution.

Do we need (want) a completely flexible system? How good is good enough?