Metadata Requirements for Data Management
Paul Millar
Contents
The problem
The metadata group
Summary of use-cases
Metadata as keyword-value pairs
Performance issues
Looking at the keys
Looking at the values
Collections
The problem
HEPCAL I defines metadata as the experiment’s responsibility, but there is clear commonality between different experiments.
EDG provided two components for supplying metadata: the replica metadata catalogue (RMC) and Spitfire, a Grid-enabled database.
RMC supported light-weight metadata; Spitfire was a fully functional database that could be used for storing application/experiment-specific metadata.
But both solutions were at too low a level: the experiments had a common layer of functionality above these.
The metadata group
The first meeting was in May 2004 in Stepps (near Glasgow).
Current membership includes people from CMS, CDF/D0, BaBar, LHCb, EGEE and ATLAS (possibly ALICE soon).
Working groups were formed to look into Use cases, Services, Deployment Architecture, Query Languages and Interfaces, Tools and Deliverables.
The group is trying to pull together common functionality from the different experiments.
This common functionality can then be packaged and used by the different experiments (reducing reinvention of the wheel).
The group is also looking at existing software to see how it can be used by all experiments (again reducing reinvention of the wheel).
Summary of use-cases
Steve Hanlon summarised the HEPCAL I, HEPCAL II and CDF use-cases. Here is a summary of that summary.
HEPCAL I defines event metadata (application dependent, dealt with by the experiments) and conditions data (running conditions and calibration data). Both are types of metadata. It also defines metadata as keyword-value pairs.
HEPCAL II also mentions dataset metadata and event metadata. It introduces the idea of component metadata, e.g. the sub-detector or physics channel the component is relevant to. It also mentions metadata as a mechanism for datafile provenance and book-keeping.
SAM/CDF takes the view that everything has metadata: files, jobs, services, ... Metadata is keyword-value pairs, stored in a central database.
Metadata as keyword-value pairs
What is metadata? “Data about data”.
Different people want to store different (sub-)sets of information against different objects.
To cover the lowest common denominator, metadata maps an arbitrary string to another string (numerical data stored as strings?)
This isn’t very good for either the “keyword” part or the “value” part. Sometimes a string isn’t long enough: there are instances of metadata that exceed the string size limit.
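As a small illustration of the "numerical data stored as strings?" question, here is a minimal Python sketch of a string-to-string metadata map. The keywords (`run_number`, `detector`) and their values are invented for the example and do not come from any experiment's schema.

```python
# Hypothetical metadata records: every keyword and every value is a plain string.
metadata = {"run_number": "9", "detector": "calorimeter"}
other = {"run_number": "10"}

# Comparing the stored strings is lexicographic, so "9" does not sort before "10".
string_order = metadata["run_number"] < other["run_number"]

# The caller has to know the value is numeric and convert it explicitly.
numeric_order = int(metadata["run_number"]) < int(other["run_number"])

print(string_order, numeric_order)  # False True
```

The store itself cannot tell which comparison is intended; that knowledge lives outside the keyword-value pairs.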
Performance issues
Keyword-value pairs, stored in a separate table, are a bad choice for performance. Queries are more complex and it’s more difficult for the database to optimise them.
Optimisation method discussed at the metadata meeting:
Monitor people’s queries.
Regularly (per month, week, or day?) choose the most popular keywords.
Build an optimisation table that summarises these keyword-value pairs as regular columns.
Dynamically translate user queries to use the optimisation table.
What about synchronisation?
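The steps above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table names (`kv`, `kv_opt`), the keywords (`run`, `detector`) and the data are all invented, and the "popular" keywords are hard-coded rather than derived from monitoring.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kv (file_id INTEGER, keyword TEXT, value TEXT);
    INSERT INTO kv VALUES
        (1, 'run', '100'), (1, 'detector', 'ecal'),
        (2, 'run', '100'), (2, 'detector', 'hcal'),
        (3, 'run', '101'), (3, 'detector', 'ecal');
""")

# Against the raw keyword-value table, each extra keyword costs a self-join.
kv_query = """
    SELECT a.file_id
    FROM kv AS a JOIN kv AS b ON a.file_id = b.file_id
    WHERE a.keyword = 'run' AND a.value = '100'
      AND b.keyword = 'detector' AND b.value = 'ecal'
    ORDER BY a.file_id
"""
kv_result = [row[0] for row in conn.execute(kv_query)]

# Assumption: monitoring found these two keywords to be the most popular.
# They are fixed literals here, so building SQL from them is safe in this sketch.
popular = ["run", "detector"]

# Build the optimisation table: one regular column per popular keyword.
columns = ", ".join(
    f"MAX(CASE WHEN keyword = '{k}' THEN value END) AS {k}" for k in popular
)
conn.execute(
    f"CREATE TABLE kv_opt AS SELECT file_id, {columns} FROM kv GROUP BY file_id"
)

# The translated query is a plain filter on ordinary columns.
opt_query = (
    "SELECT file_id FROM kv_opt "
    "WHERE run = '100' AND detector = 'ecal' ORDER BY file_id"
)
opt_result = [row[0] for row in conn.execute(opt_query)]

print(kv_result, opt_result)  # [1] [1]
```

In this sketch `kv_opt` is a snapshot: rows added to `kv` afterwards do not appear in it until it is rebuilt, which is exactly the synchronisation question raised above.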
Looking at the keys
Any keyword from a keyword-value pair must mean something. Is a text string good enough to define something’s meaning?
The precise meaning might vary from person to person, so merging metadata might be problematic.
Looking at the values
If we define values as strings, only a subset of all possible strings is likely to be valid.
We need to be able to test whether the data is correct.
A regular expression (regexp) should be able to test the validity of any string (given a sufficiently long regexp).
Is a regexp test a sufficient condition?
What do we do when the validation tests change and existing metadata then fails?
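A minimal sketch of per-keyword regexp validation, using Python's standard `re` module. The keywords and patterns are hypothetical, chosen only to show where a regexp test helps and where it falls short.

```python
import re

# Hypothetical validation rules: one regexp per keyword.
validators = {
    "run_number": re.compile(r"[0-9]+"),
    "start_date": re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),
}

def is_valid(keyword, value):
    """Return True if the whole value matches the regexp registered for the keyword."""
    pattern = validators.get(keyword)
    if pattern is None:
        return False  # unknown keywords are rejected in this sketch
    return pattern.fullmatch(value) is not None

print(is_valid("run_number", "1234"))        # True
print(is_valid("run_number", "12a4"))        # False
print(is_valid("start_date", "2004-13-45"))  # True, although it is not a real date
```

The last call passes the regexp yet is not a meaningful date, which is one way of seeing why a regexp test alone may not be a sufficient condition.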
Collections
Collections are multiple datafiles.
All the files have something in common (even if it’s abstract).
If they have something in common, they have identical metadata. Should a collection change in size if the metadata changes?
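One way to picture the last question is to treat a collection as a query over metadata rather than a fixed list of files. The sketch below assumes exactly that; the catalogue contents and the `collection` helper are hypothetical.

```python
# Hypothetical catalogue: datafile name -> metadata (keyword-value pairs).
catalogue = {
    "file_a": {"run": "100", "stream": "muon"},
    "file_b": {"run": "100", "stream": "electron"},
    "file_c": {"run": "101", "stream": "muon"},
}

def collection(common):
    """Return the sorted names of all files whose metadata contains every pair in `common`."""
    return sorted(
        name
        for name, meta in catalogue.items()
        if all(meta.get(k) == v for k, v in common.items())
    )

before = collection({"stream": "muon"})
catalogue["file_b"]["stream"] = "muon"  # the metadata of one file changes
after = collection({"stream": "muon"})

print(before)  # ['file_a', 'file_c']
print(after)   # ['file_a', 'file_b', 'file_c']
```

With this definition the collection grows when the metadata changes; a collection stored as an explicit file list would not, and the two behaviours answer the question differently.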
Hierarchies of metadata
Hierarchies of inheritance (“subsets” from OOP): more metadata is added as one descends the hierarchy and the object becomes more specific.
Hierarchies of scope (who gets to see the metadata). Some metadata is sensitive. Also avoids confusion.
Hierarchies of extent: applying metadata to a collection of objects.
Confusing these different hierarchies can cause headaches later.
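The inheritance hierarchy (only that one, not scope or extent) can be sketched with `collections.ChainMap` from the Python standard library. The three levels and their keywords are assumptions made for illustration.

```python
from collections import ChainMap

# Hypothetical three-level hierarchy: experiment -> run -> datafile.
experiment = {"experiment": "example-exp"}
run = ChainMap({"run": "100"}, experiment)
datafile = run.new_child({"stream": "muon"})

# Each level sees its own metadata plus everything inherited from above it.
print(dict(run))
print(dict(datafile))

# Metadata added lower down is not visible higher up.
print("stream" in run)  # False
```

Each level only stores what it adds, so a lookup on `datafile` falls back through `run` to `experiment` when the keyword is not found locally.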
Consolidation layer
ARDA defines different services, with metadata as a separate one. Users have requested the ability to do JOINs across metadata and replica location.
Solutions:
One way is to merge the metadata and replica location services.
One way is to add extra functionality to either the replica location service or the metadata service so that it queries the other.
One way is to have some “consolidation” layer that links the two in as optimal a way as possible.
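A toy sketch of the third option, assuming both services can be queried by file name. The two dictionaries stand in for the real services, and the function name, file names and site names are all hypothetical; nothing here reflects the actual ARDA interfaces.

```python
# Hypothetical stand-ins for the two separate services.
metadata_service = {
    "file_a": {"run": "100"},
    "file_b": {"run": "101"},
    "file_c": {"run": "100"},
}
replica_service = {
    "file_a": ["site1.example.org", "site2.example.org"],
    "file_b": ["site2.example.org"],
    "file_c": ["site1.example.org"],
}

def files_at_site_with(site, keyword, value):
    """Join the two services on file name: files with keyword == value and a replica at `site`."""
    matching = (
        name for name, meta in metadata_service.items() if meta.get(keyword) == value
    )
    return sorted(name for name in matching if site in replica_service.get(name, []))

print(files_at_site_with("site2.example.org", "run", "100"))  # ['file_a']
```

Here the join is done by filtering on metadata first and then checking replica locations; a real consolidation layer would have to choose the order that is cheapest for each query.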
Conclusions
Metadata is difficult.
Just having a database doesn’t solve some of the longer-term issues.
A working group has been formed to allow rapid development and to evolve towards a common solution.
Do we need (want) a completely flexible system? How good is good enough?