Apache Tika for Enabling
Metadata Interoperability
Apache: Big Data Europe September 28 – 30, 2015
Budapest, Hungary
Presented by Michael Starch (NASA JPL) and Nick Burch (Quanticate)
Proposed by Giuseppe Totaro (Sapienza University of Rome) and Chris Mattmann (NASA JPL)
Summary
• What is Apache Tika
• Tika and Metadata
• Metadata Interoperability
• Tika for Enabling Metadata Interoperability
WHAT IS TIKA
Apache Tika as the de facto babel fish for digital documents
What is Tika
• Java-based toolkit to detect and extract metadata and text from heterogeneous files
• Built from source code using Maven
• Provides a single Parser interface to wrap around third-party parsing libraries
• Enables recursive parsing
• Performs language detection and translation
Supported Formats
• HTML, XML, XHTML
• Microsoft Office document formats
• OpenDocument Format
• iWorks document formats
• EPUB, PDF, RTF
• Compression and packaging formats
• Text formats
• Audio, Image and Video formats
• Mail formats
Detection
• Tika tries to identify the right file type
• Custom MIME types registry
• Detection methods
– File name
– Content-Type hints
– Magic bytes
– Character encodings
– Combined approaches
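The combined approach listed above can be sketched in plain Java. This is a minimal illustration and not Tika's detector: the class name `DetectionSketch`, the rule set (one PDF signature, two file extensions), and the fallback type are assumptions chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Illustrative only: magic bytes first, file name second. Not Tika's code. */
public class DetectionSketch {

    // "%PDF-" is the well-known signature at the start of PDF files.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if data begins with the given magic bytes (offset 0). */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Combined approach: trust the content first, then fall back to the name. */
    static String detect(byte[] head, String fileName) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".pdf")) {
                return "application/pdf";
            }
            if (lower.endsWith(".txt")) {
                return "text/plain";
            }
        }
        return "application/octet-stream"; // generic fallback for unknown input
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(head, "report.bin"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

Checking content before the file name means a misnamed file is still typed by what it contains; the name is only a hint when the bytes are inconclusive.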
Extending Tika Parsers
• Add your MIME type (tika-mimetypes.xml)
<mime-type type="application/x-isatab-investigation">
  <_comment>ISA-Tab Investigation file</_comment>
  <magic priority="50">
    <match value="ONTOLOGY SOURCE REFERENCE" type="string" offset="0"/>
  </magic>
  <glob pattern="i_*.txt"/>
</mime-type>
• Create your Parser class
public class ISArchiveParser implements Parser {
    ...
    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context) {...}
}
New Features
• The last stable release is Tika 1.10
• Some new features of the last releases:
– Upgraded to Java 7 (TIKA-1536)
– Extraction of biomedical information relying on Apache cTAKES (TIKA-1645, TIKA-1642)
– Probabilistic mimetype detection
– tika-batch module for directory to directory
TIKA AND METADATA
Extraction of metadata with Apache Tika
What is Metadata
• Informally defined as “data about data”
– Descriptive, structural, administrative, rights management, preservation [NISO (2004)]
• E.g., Title, Author, Creation Date, Rights
• Metadata schemas may vary considerably:
– Naming (e.g., Description and Info)
– Correspondences (e.g., Creator and FirstName /
LastName)
Model Perspective of Metadata
[Figure 5 from "A Survey of Techniques for Achieving Metadata Interoperability": Metadata building blocks from a model perspective. Abstraction levels: M3 Universal Modelling Language (meta-meta-model), M2 Schema Definition Language (meta-model), M1 Metadata Schema (model), M0 Metadata; each level is an instance of the level above it.]
Especially from an interoperability point of view, rigid and semantic precise definitions enable consistent interpretation across system boundaries [Seidewitz 2003].
Metadata models and meta-models are arranged on different levels that are orthogonal to the previously mentioned levels of information. On the lowest level we can find metadata (descriptions) that are (valid) instances of a metadata model (e.g., Java classes, UML model, database relations) that reflects the elements of a certain metadata schema. The metadata model itself is a valid instance of a metadata meta-model being part of a certain schema definition language. Due to this abstraction, it is possible to create meta-model representations of metadata (e.g., metadata instances of an UML model can also be represented as instances of the UML meta-model).
The MOF specification [OMG 2006a] offers a definition for these different levels: M0 is the lowest level, the level of metadata instances (e.g., Title=Lake Placid 1980, Alpine Skiing, I. Stenmark). M1 holds the models for a particular application; i.e., metadata schemes (e.g., definition of the field Title) are M1 models. Modeling languages reside on level M2; their abstract syntax or meta-model can be considered as model of a particular modeling system (e.g., definition of the language primitive attribute). On the topmost level, at M3, we can find universal modeling languages in which modeling systems are specified (e.g., core constructs, primitive types). Figure 5 illustrates the four levels, their constituents and dependencies.
[HASLHOFER, KLAS (2010)]
Tika and Metadata
• Tika enables metadata extraction (if present)
• Tika maps metadata onto common, consistent key-value pairs in Metadata
• (Some) Metadata APIs
– org.apache.tika.metadata
• Metadata class: multi-valued metadata container
• TikaCoreProperties: core set of basic properties
– org.apache.tika.xmp
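What "multi-valued metadata container" means can be sketched with a small stand-in class. This is not Tika's Metadata class: the name `MultiValuedMetadataSketch` and its four methods are assumptions for illustration, showing only how one key can hold several values while still offering a single-value view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative stand-in for a multi-valued container; not Tika's Metadata class. */
public class MultiValuedMetadataSketch {

    // Each metadata name maps to an ordered list of values.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value, keeping any values already stored under the name. */
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces all values stored under the name with a single value. */
    public void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    /** Returns the first value for the name, or null if none is stored. */
    public String get(String name) {
        List<String> stored = values.get(name);
        return (stored == null || stored.isEmpty()) ? null : stored.get(0);
    }

    /** Returns every value for the name as a read-only list (empty if absent). */
    public List<String> getValues(String name) {
        List<String> stored = values.get(name);
        return stored == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(stored);
    }
}
```

A document with two authors would call `add("creator", ...)` twice; `get("creator")` then returns the first author and `getValues("creator")` returns both.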
Solr's ExtractingRequestHandler
• Apache Solr uses Tika to ingest binary and/or
structured documents
• Solr's ExtractingRequestHandler uses
Tika to upload binary files into Solr
• Input parameters (configuration)
– fmap.<source_field>=<target_field>
– uprefix=<prefix>
• Performs only name mapping
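The effect of the two parameters can be sketched as a pure renaming function. This is not Solr's code: the class `FieldNameMappingSketch`, the method `resolve`, and the sample field names are hypothetical, and the sketch assumes the fmap rename is applied before the check against the schema.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: mimics the effect of fmap.* and uprefix, not Solr's code. */
public class FieldNameMappingSketch {

    /**
     * Resolves the field name a metadata key would be stored under.
     * Assumption: the fmap rename happens before the schema check.
     */
    static String resolve(String source, Map<String, String> fmap,
                          Set<String> schemaFields, String uprefix) {
        // fmap.<source_field>=<target_field>: plain rename, identity if unmapped.
        String name = fmap.getOrDefault(source, source);
        // uprefix=<prefix>: prefix names that the schema does not define.
        if (uprefix != null && !schemaFields.contains(name)) {
            name = uprefix + name;
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> fmap = new HashMap<>();
        fmap.put("Author", "author");
        Set<String> schema = new HashSet<>();
        schema.add("author");
        schema.add("title");
        System.out.println(resolve("Author", fmap, schema, "attr_"));
        System.out.println(resolve("Last-Modified", fmap, schema, "attr_"));
    }
}
```

Only names change here; values pass through untouched, which is the limitation the last bullet points at.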
Metadata Roadmap
• The six-point roadmap includes:
– Reorganize metadata keys internally
– Move XMP output to an extra XMP module of Tika
– Correct parsers where necessary
– Add support for structured data to the Metadata class
– Introduce a versioning scheme for metadata mappings
– Introduce the ability for clients to define their own mappings
METADATA INTEROPERABILITY
Interoperability as prerequisite for uniform data access
Metadata Interoperability
• Prerequisite for uniform access to media
objects
“Metadata interoperability is a qualitative property of metadata information objects that enables systems and applications to work with or use these objects across system boundaries.”
Metadata Heterogeneities
Predominant heterogeneities were originally identified by:
[Sheth, Larson (1990)] [Ouksel, Sheth (1999)] [Wache (2003)] [Visser et al. (1997)]
[HASLHOFER, KLAS (2010)]
Interoperability Techniques
• Model Agreement
– e.g., Standardized Metadata Schema
• Meta-Model Agreement
– e.g., Global Conceptual Model
• Model Reconciliation
– Language Mapping (M2)
– Schema Mapping (M1)
– Instance Transformation (M0)
(Schema Mapping and Instance Transformation together constitute Metadata Mapping)
Metadata Mapping
• Technique that subsumes:
– schema mapping (crosswalks)
– instance transformation (functions)
Given a source schema $S^s$ and a target schema $S^t$, each consisting of a set of schema elements $e^s \in S^s$ and $e^t \in S^t$, a mapping $M$ is a directional relationship between two sets of elements $e_i^s \in S^s$ and $e_j^t \in S^t$.
Mapping Relationship
• Mapping expressions [Spaccapietra et al. (1992)]:
– Exclude: $I(e_i^s) \cap I(e_j^t) = \emptyset$
– Equivalent: $I(e_i^s) \equiv I(e_j^t)$
– Include: $I(e_i^s) \subseteq I(e_j^t) \lor I(e_j^t) \subseteq I(e_i^s)$
– Overlap: $I(e_i^s) \cap I(e_j^t) \neq \emptyset \land I(e_i^s) \not\subseteq I(e_j^t) \land I(e_j^t) \not\subseteq I(e_i^s)$
Elements of Metadata Mapping
• A mapping element $m \in M$ has a cardinality, a mapping expression $p \in P$, and an instance transformation function $f \in F$
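The four mapping expressions compare the instance sets of a source and a target element, so they can be checked with ordinary set operations. The sketch below is illustrative: the class `MappingExpressionSketch` and the order in which the cases are tested are assumptions, and instance sets are modelled as plain `java.util.Set` values.

```java
import java.util.Collections;
import java.util.Set;

/** Illustrative classification of two instance sets I(e_s) and I(e_t). */
public class MappingExpressionSketch {

    enum Expression { EXCLUDE, EQUIVALENT, INCLUDE, OVERLAP }

    /**
     * Classifies the relationship between two instance sets.
     * The test order is a design choice: equality is checked first so that
     * identical sets (including two empty sets) are reported as EQUIVALENT
     * rather than INCLUDE or EXCLUDE.
     */
    static <T> Expression classify(Set<T> source, Set<T> target) {
        if (source.equals(target)) {
            return Expression.EQUIVALENT;          // same instances
        }
        if (Collections.disjoint(source, target)) {
            return Expression.EXCLUDE;             // empty intersection
        }
        if (target.containsAll(source) || source.containsAll(target)) {
            return Expression.INCLUDE;             // one side contains the other
        }
        return Expression.OVERLAP;                 // shared part, neither contains the other
    }
}
```

With string sets, {a, b} against {b, c} is an overlap, while {a} against {a, b} is an include.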
TIKA FOR ENABLING METADATA
INTEROPERABILITY
Introduce the ability for clients to define their own mappings
Tika for Enabling Metadata
Interoperability
• To be integrated into Tika (TIKA-1691) as a new component
• Based on the following improvements:
– MappedMetadata class
– Mapping utilities (schema and instance)
– MetadataConfig class
MappedMetadata class
• Wrapper of the Metadata class
• Decorates two methods of Metadata:
– get: maps metadata on the getter side (default)
– set: maps metadata on the setter side
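The difference between getter-side and setter-side mapping can be sketched with a small decorator over a key-value store. This is a hypothetical illustration and not the TIKA-1691 code: the class `MappedMetadataSketch`, its constructor flag, and the sample field names are all assumptions, and the wrapped store is a plain map instead of Tika's Metadata.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a mapping wrapper; not the TIKA-1691 implementation. */
public class MappedMetadataSketch {

    private final Map<String, String> raw = new HashMap<>();  // wrapped store
    private final Map<String, String> sourceToTarget;         // name mapping
    private final boolean mapOnGet;                            // true: getter side

    public MappedMetadataSketch(Map<String, String> sourceToTarget, boolean mapOnGet) {
        this.sourceToTarget = new HashMap<>(sourceToTarget);
        this.mapOnGet = mapOnGet;
    }

    /** Setter side: the name is translated once, when the value is written. */
    public void set(String name, String value) {
        String key = mapOnGet ? name : sourceToTarget.getOrDefault(name, name);
        raw.put(key, value);
    }

    /** Getter side: the store keeps source names; translation happens on read. */
    public String get(String name) {
        if (mapOnGet) {
            for (Map.Entry<String, String> e : sourceToTarget.entrySet()) {
                if (e.getValue().equals(name) && raw.containsKey(e.getKey())) {
                    return raw.get(e.getKey());
                }
            }
        }
        return raw.get(name); // unmapped names pass through unchanged
    }
}
```

In this sketch, getter-side mapping leaves the stored names exactly as the parser wrote them, so the original key still resolves; setter-side mapping stores only the target name.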
Utilities and Configuration
• Mapping Methods
– CrosswalkUtils class (schema)
– TransformationsUtils class (instance)
• Mapping Configuration
– MetadataConfig class
• works like TikaConfig (parses an XML config file)
– Fine-grained configuration
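The split between schema-level and instance-level mapping can be sketched as follows. The code does not reproduce CrosswalkUtils or TransformationsUtils: `CrosswalkSketch`, `applyCrosswalk`, and `CREATOR_FROM_NAMES` are hypothetical names, and the sample fields reuse the Description/Info and Creator/FirstName/LastName pairs mentioned earlier in the slides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: one schema-level and one instance-level mapping. */
public class CrosswalkSketch {

    /** Schema mapping (crosswalk): renames keys; unknown keys are kept as-is. */
    static Map<String, String> applyCrosswalk(Map<String, String> source,
                                              Map<String, String> crosswalk) {
        Map<String, String> target = new HashMap<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            String name = crosswalk.getOrDefault(e.getKey(), e.getKey());
            target.put(name, e.getValue());
        }
        return target;
    }

    /** Instance transformation: builds one target value from two source values. */
    static final Function<Map<String, String>, String> CREATOR_FROM_NAMES =
            m -> (m.getOrDefault("FirstName", "") + " "
                    + m.getOrDefault("LastName", "")).trim();
}
```

A crosswalk alone covers naming differences such as Description versus Info; a correspondence such as Creator versus FirstName / LastName also needs a function over the values.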
Example of MappedMetadata (1/2)
CONCLUSION AND FUTURE WORK
Future directions for tika-metadata
Conclusion
• tika-metadata is a new component to enable metadata interoperability on the client side
• Pros
– Highly configurable technique
– Fine-grained mapping
• Cons
– Configuring a new mapping from scratch may require much time
Future Work
• Integration with the next releases of Tika
• Simplify configuration and provide a complete sample with documentation
• Support strategies for unknown metadata
• Return current mappings as a graphical representation (i.e., Hierarchical Edge Bundling)
Acknowledgements
• This work was started by Giuseppe Totaro and Chris Mattmann at NASA JPL
• Thanks to Michael Starch, who presented this proposal
• Thanks to Nick Burch, who is kindly supporting this idea
• Thanks to Tim Allison who is actively