Apache Tika for Enabling Metadata Interoperability

(1)

Apache Tika for Enabling Metadata Interoperability

Apache: Big Data Europe, September 28 – 30, 2015

Budapest, Hungary

Presented by Michael Starch (NASA JPL) and Nick Burch (Quanticate)

Proposed by Giuseppe Totaro (Sapienza University of Rome) and Chris Mattmann (NASA JPL)

(2)

Summary

• What is Apache Tika

• Tika and Metadata

• Metadata Interoperability

• Tika for Enabling Metadata Interoperability

(3)

WHAT IS TIKA

Apache Tika as the de facto babel fish for digital documents

(4)

What is Tika

• Java-based toolkit to detect and extract metadata and text from heterogeneous files

• Built from source code using Maven

• Provides a single Parser interface to wrap around third-party parsing libraries

• Enables recursive parsing

• Performs language detection and translation

(5)

Supported Formats

• HTML, XML, XHTML

• Microsoft Office document formats

• OpenDocument Format

• iWork document formats

• EPUB, PDF, RTF

• Compression and packaging formats

• Text formats

• Audio, image and video formats

• Mail formats

(6)

Detection

• Tika tries to identify the right file type

• Custom MIME types registry

• Detection methods

– File name

– Content-Type hints

– Magic bytes

– Character encodings

– Combined approaches
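The detection methods listed above can be combined in a simple way: trust content evidence (magic bytes) first and fall back to the file name. The following is a minimal sketch in plain Java, not Tika's detector. The class name `SimpleTypeDetector` and its two hard-coded rules are assumptions made for illustration; the ISA-Tab rule mirrors the magic string and glob of the custom MIME type in this deck.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Illustrative detector: magic bytes first, file name as fallback. Not Tika's implementation. */
final class SimpleTypeDetector {
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ISATAB_MAGIC =
            "ONTOLOGY SOURCE REFERENCE".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if data begins with the given magic bytes (offset 0). */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Combined approach: content evidence wins, the file name is only a hint. */
    static String detect(byte[] header, String fileName) {
        if (startsWith(header, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(header, ISATAB_MAGIC)) {
            return "application/x-isatab-investigation";
        }
        if (fileName != null) {
            String name = fileName.toLowerCase(Locale.ROOT);
            if (name.endsWith(".pdf")) {
                return "application/pdf";
            }
            // Hand-written equivalent of the glob i_*.txt
            if (name.startsWith("i_") && name.endsWith(".txt")) {
                return "application/x-isatab-investigation";
            }
        }
        return "application/octet-stream"; // generic fallback for unknown content
    }
}
```

A caller would pass the first bytes of the stream together with the file name, for example `SimpleTypeDetector.detect(header, "i_investigation.txt")`. Checking the content before the name means a mislabelled file is still typed by what it actually contains.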

(7)

Extending Tika Parsers

• Add your MIME type (tika-mimetypes.xml)

<mime-type type="application/x-isatab-investigation">
  <_comment>ISA-Tab Investigation file</_comment>
  <magic priority="50">
    <match value="ONTOLOGY SOURCE REFERENCE" type="string" offset="0"/>
  </magic>
  <glob pattern="i_*.txt"/>
</mime-type>

• Create your Parser class

public class ISArchiveParser implements Parser {
    ...
    // Signature required by org.apache.tika.parser.Parser
    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {...}
}

(8)

New Features

• The last stable release is Tika 1.10

• Some new features of the last releases:

– Upgraded to Java 7 (TIKA-1536)

– Extraction of biomedical information relying on Apache cTAKES (TIKA-1645, TIKA-1642)

– Probabilistic mimetype detection

– tika-batch module for directory to directory

(9)

TIKA AND METADATA

Extraction of metadata with Apache Tika

(10)

What is Metadata

• Informally defined as “data about data”

– Descriptive, structural, administrative, rights management, preservation [NISO (2004)]

• E.g., Title, Author, Creation Date, Rights

• Metadata schemas may vary a lot in:

– Naming (e.g., Description and Info)

– Correspondences (e.g., Creator and FirstName / LastName)

(11)

Model Perspective of Metadata

Excerpt from A Survey of Techniques for Achieving Metadata Interoperability, p. 11:

[Figure: Fig. 5. Metadata building blocks from a model perspective. Abstraction levels: M0 Metadata, M1 Model (Metadata Schema), M2 Meta-Model (Schema Definition Language), M3 Meta-Meta-Model (Universal Modelling Language); each level is an instance of the level above it.]

Especially from an interoperability point of view, rigid and semantic precise definitions enable consistent interpretation across system boundaries [Seidewitz 2003].

Metadata models and meta-models are arranged on different levels that are orthogonal to the previously mentioned levels of information. On the lowest level we can find metadata (descriptions) that are (valid) instances of a metadata model (e.g., Java classes, UML model, database relations) that reflects the elements of a certain metadata schema. The metadata model itself is a valid instance of a metadata meta-model being part of a certain schema definition language. Due to this abstraction, it is possible to create meta-model representations of metadata (e.g., metadata instances of an UML model can also be represented as instances of the UML meta-model).

The MOF specification [OMG 2006a] offers a definition for these different levels: M0 is the lowest level, the level of metadata instances (e.g., Title=Lake Placid 1980, Alpine Skiing, I. Stenmark). M1 holds the models for a particular application; i.e., metadata schemes (e.g., definition of the field Title) are M1 models. Modeling languages reside on level M2; their abstract syntax or meta-model can be considered as model of a particular modeling system (e.g., definition of the language primitive attribute). On the topmost level, at M3, we can find universal modeling languages in which modeling systems are specified (e.g., core constructs, primitive types). Figure 5 illustrates the four levels, their constituents and dependencies.

[Haslhofer, Klas (2010)]

(12)

Tika and Metadata

• Tika enables metadata extraction (if present)

• Tika maps metadata onto common, consistent key-value pairs in Metadata

• (Some) Metadata APIs

– org.apache.tika.metadata

  • Metadata class: multi-valued metadata container

  • TikaCoreProperties: core set of basic properties

– org.apache.tika.xmp
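To make the idea of a multi-valued key-value container concrete, here is a minimal stand-in written against the Java standard library only. It is not Tika's `Metadata` class: the class name `MultiValueMetadata` is an assumption chosen for illustration, and the method names merely echo the `add`, `set`, `get` and `getValues` operations that Tika's own class provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative multi-valued container (a stand-in, not org.apache.tika.metadata.Metadata). */
final class MultiValueMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value, keeping any values already stored under the same key. */
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Replaces all values stored under the key with a single value. */
    void set(String key, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(key, single);
    }

    /** Returns the first value for the key, or null if the key is absent. */
    String get(String key) {
        List<String> list = values.get(key);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    /** Returns all values for the key as a read-only list (empty if absent). */
    List<String> getValues(String key) {
        List<String> list = values.get(key);
        return list == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(list);
    }
}
```

Keeping a list per key is what lets a document with several authors or keywords be represented without inventing numbered keys.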

(13)

Solr's ExtractingRequestHandler

• Apache Solr uses Tika to ingest binary and/or structured documents

• Solr's ExtractingRequestHandler uses Tika to upload binary files into Solr

• Input parameters (configuration)

– fmap.<source_field>=<target_field>

– uprefix=<prefix>

• Performs only name mapping
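What "only name mapping" means can be shown with a simplified model of the two parameters: `fmap.*` rules rename a field, and `uprefix` is prepended to any field the index schema does not know. The sketch below is plain Java and not Solr code; the class `FieldNameMapper`, its constructor arguments and the sample field names used with it are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Simplified model of name-only mapping (fmap.* rules plus uprefix). Not Solr code. */
final class FieldNameMapper {
    private final Map<String, String> fmap;   // source field -> target field
    private final Set<String> schemaFields;   // fields a (hypothetical) schema defines
    private final String uprefix;             // prefix for fields unknown to the schema

    FieldNameMapper(Map<String, String> fmap, Set<String> schemaFields, String uprefix) {
        this.fmap = new HashMap<>(fmap);
        this.schemaFields = new HashSet<>(schemaFields);
        this.uprefix = uprefix;
    }

    /** Applies the rename rule first, then prefixes names the schema does not define. */
    String mapName(String source) {
        String target = fmap.getOrDefault(source, source);
        return schemaFields.contains(target) ? target : uprefix + target;
    }

    /** Renames every key; values pass through unchanged, which is the limitation noted above. */
    Map<String, String> mapAll(Map<String, String> extracted) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            out.put(mapName(e.getKey()), e.getValue());
        }
        return out;
    }
}
```

Because values are never touched, a correspondence such as one Creator value versus separate FirstName and LastName values cannot be expressed this way.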

(14)

Metadata Roadmap

• The six-point roadmap includes:

– Reorganize metadata keys internally

– Move XMP output to an extra XMP module of Tika

– Correct parsers where necessary

– Add support for structured data to metadata class

– Introduce versioning scheme for metadata mappings

– Introduce the ability for clients to define their own mappings

(15)

METADATA INTEROPERABILITY

Interoperability as prerequisite for uniform data access

(16)

Metadata Interoperability

• Prerequisite for uniform access to media objects

“Metadata interoperability is a qualitative property of metadata information objects that enables systems and applications to work with or use these objects across system boundaries.”

(17)

Metadata Heterogeneities

The predominant heterogeneities were originally identified by:

[Sheth, Larson (1990)] [Ouksel, Sheth (1999)] [Wache (2003)] [Visser et al. (1997)]

[Haslhofer, Klas (2010)]

(18)
(19)

Interoperability Techniques

• Model Agreement

– e.g., Standardized Metadata Schema

• Meta-Model Agreement

– e.g., Global Conceptual Model

• Model Reconciliation

– Language Mapping (M2)

– Schema Mapping (M1)

– Instance Transformation (M0)

Schema Mapping and Instance Transformation together form Metadata Mapping.

(20)

Metadata Mapping

• Technique that subsumes:

– schema mapping (crosswalks)

– instance transformation (functions)

Given a source schema $S^s$ and a target schema $S^t$, each consisting of a set of schema elements $e^s \in S^s$ and $e^t \in S^t$, a mapping $M$ is a directional relationship between two sets of elements $e_i^s \in S^s$ and $e_j^t \in S^t$.

(21)

Mapping Relationship

• Mapping expressions [Spaccapietra et al. (1992)]:

– Exclude: $I(e_i^s) \cap I(e_j^t) = \emptyset$

– Equivalent: $I(e_i^s) \equiv I(e_j^t)$

– Include: $I(e_i^s) \subseteq I(e_j^t) \lor I(e_j^t) \subseteq I(e_i^s)$

– Overlap: $I(e_i^s) \cap I(e_j^t) \neq \emptyset \land I(e_i^s) \not\subseteq I(e_j^t) \land I(e_j^t) \not\subseteq I(e_i^s)$

• A mapping element $m \in M$ combines a mapping expression $p \in P$, an instance transformation function $f \in F$, and a cardinality
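The four mapping expressions (Exclude, Equivalent, Include, Overlap) can be read as set relations between the instances of a source element and those of a target element. The sketch below classifies two finite sets accordingly. It rests on two assumptions: that $I(e)$ denotes the set of instance values of element $e$, and that equivalence can be approximated by set equality. The names `MappingExpression` and `MappingExpressions` are illustrative and do not come from the slides.

```java
import java.util.HashSet;
import java.util.Set;

/** The four mapping expressions named on the slide. */
enum MappingExpression { EXCLUDE, EQUIVALENT, INCLUDE, OVERLAP }

/** Illustrative classifier over finite instance sets. */
final class MappingExpressions {
    /**
     * Returns the most specific expression that holds between the two sets.
     * Equality is tested first because equal sets would also satisfy "include";
     * an empty intersection is tested next, so disjoint sets are always EXCLUDE.
     */
    static <T> MappingExpression classify(Set<T> source, Set<T> target) {
        if (source.equals(target)) {
            return MappingExpression.EQUIVALENT;
        }
        Set<T> common = new HashSet<>(source);
        common.retainAll(target);
        if (common.isEmpty()) {
            return MappingExpression.EXCLUDE;
        }
        if (target.containsAll(source) || source.containsAll(target)) {
            return MappingExpression.INCLUDE;
        }
        return MappingExpression.OVERLAP;
    }
}
```

In practice instance sets are rarely enumerable, so a check like this is mainly useful on samples of extracted values when deciding which expression to declare for a mapping.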

(22)

Elements of Metadata Mapping

(23)

TIKA FOR ENABLING METADATA INTEROPERABILITY

Introduce the ability for clients to define their own mappings

(24)

Tika for Enabling Metadata Interoperability

• To be integrated into Tika (TIKA-1691) as a new component

• Based on the following improvements:

– MappedMetadata class

– Mapping utilities (schema and instance)

– MetadataConfig class

(25)

MappedMetadata class

• Wrapper of Metadata class

• Decorates two methods of Metadata:

– get: maps metadata on getter side (default)

– set: maps metadata on setter side
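The difference between mapping on the getter side and on the setter side can be sketched with a small wrapper around a key-value store. This is a hypothetical illustration, not the MappedMetadata class proposed in TIKA-1691, whose code is not shown in these slides. The class `MappedStore`, the `Side` enum and the key names used with it are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical mapping wrapper; not the MappedMetadata class from TIKA-1691. */
final class MappedStore {
    enum Side { GETTER, SETTER }

    // Stands in for the wrapped Metadata object.
    private final Map<String, String> delegate = new HashMap<>();
    // Extracted (source) key -> client (target) key, plus the reverse lookup.
    private final Map<String, String> sourceToTarget;
    private final Map<String, String> targetToSource = new HashMap<>();
    private final Side side;

    MappedStore(Map<String, String> sourceToTarget, Side side) {
        this.sourceToTarget = new HashMap<>(sourceToTarget);
        for (Map.Entry<String, String> e : sourceToTarget.entrySet()) {
            targetToSource.put(e.getValue(), e.getKey());
        }
        this.side = side;
    }

    /** Setter side: the key is translated once, when the parser writes it. */
    void set(String key, String value) {
        String stored = side == Side.SETTER ? sourceToTarget.getOrDefault(key, key) : key;
        delegate.put(stored, value);
    }

    /** Getter side: keys stay as extracted and are translated when the client reads. */
    String get(String key) {
        String stored = side == Side.GETTER ? targetToSource.getOrDefault(key, key) : key;
        return delegate.get(stored);
    }
}
```

With getter-side mapping the original keys stay available alongside the mapped ones, which is one plausible reason for it being the default; setter-side mapping stores only the client's vocabulary.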

(26)

Utilities and Configuration

• Mapping Methods

– CrosswalkUtils class (schema)

– TransformationsUtils class (instance)

• Mapping Configuration

– MetadataConfig class

  • works like TikaConfig (parses an XML config file)

– Fine-grained configuration
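The split between schema-level and instance-level mapping can be illustrated with the deck's own examples: Description versus Info is a pure renaming, whereas Creator versus FirstName / LastName requires changing the value. The following is a hedged sketch in plain Java. The class `MappingSketch` and both method names are hypothetical and do not reproduce the CrosswalkUtils or TransformationsUtils APIs, and the name-splitting rule is deliberately naive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of schema-level and instance-level mapping; not the TIKA-1691 utilities. */
final class MappingSketch {
    /** Schema mapping (crosswalk): keys are renamed, values are left untouched. */
    static Map<String, String> applyCrosswalk(Map<String, String> source,
                                              Map<String, String> crosswalk) {
        Map<String, String> target = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            target.put(crosswalk.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return target;
    }

    /**
     * Instance transformation: one source value becomes two target values (1:n).
     * Naive rule: everything before the last space is the first name.
     */
    static Map<String, String> splitCreator(String creator) {
        Map<String, String> out = new LinkedHashMap<>();
        String trimmed = creator.trim();
        int lastSpace = trimmed.lastIndexOf(' ');
        if (lastSpace < 0) {
            out.put("LastName", trimmed); // single token: treat it as the last name
            return out;
        }
        out.put("FirstName", trimmed.substring(0, lastSpace).trim());
        out.put("LastName", trimmed.substring(lastSpace + 1));
        return out;
    }
}
```

A configuration file would then only need to name which crosswalk entry and which transformation function apply to each key, which is where the fine-grained configuration comes in.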

(27)

Example of MappedMetadata (1/2)

(28)
(29)

CONCLUSION AND FUTURE WORK

Future directions for tika-metadata

(30)

Conclusion

• tika-metadata is a new component to enable metadata interoperability on client side

• Pros

– Highly configurable technique

– Fine-grained mapping

• Cons

– Configuring a new mapping from scratch may require much time

(31)

Future Work

• Integration with the next releases of Tika

• Simplify configuration and provide a complete sample with documentation

• Support strategies for unknown metadata

• Return current mappings as a graphical representation (i.e., Hierarchical Edge Bundling)

(32)

Acknowledgements

• This work has been started by Giuseppe Totaro and Chris Mattmann at NASA JPL

• Thanks to Michael Starch who has presented this proposal

• Thanks to Nick Burch who is kindly supporting this idea

• Thanks to Tim Allison who is actively
