LAMUS & LAT Archiving software

Academic year: 2021


The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands

Daan Broeder

(2)

•  MPI for Psycholinguistics research corpora: child language, bilingualism, gesture, sign language, Corpus Spoken Dutch, second-language learner corpora, etc.

•  Archive for the DOBES project

•  Hosting (and inviting) corpora for other projects in need (UNESCO study: 80% of all material is endangered)

–  DBD, NGT, Leiden Univ. language documentation corpora

–  Donated endangered language corpora

–  Eibl-Eibesfeldt human ethology collection

•  Maintain a metadata catalog for properly described resources from other institutes

–  BAS, C-ORAL-ROM (Univ. Florence), …

–  LR from Lund Univ, INL, other archive partners

•  Copy of CHILDES and Talkbank corpora from CMU

Mainly annotated audio/video recordings

50 TB: 200k MD records, 250k AV resources, 200k annotation files, lexicons, sketch grammars, etc.

(3)

History

•  Started in 2000 to try to solve the mounting data chaos at the MPI for Psycholinguistics

•  First needed proper data descriptions

•  Archive software development linked to the IMDI metadata set for Language Resources

•  First archive was basically a file-system with metadata descriptions and resource files

•  Tools operating directly on the files

•  A researcher's notebook disk was just as

(4)

IMDI – ISLE Metadata Initiative

•  Metadata schema for Language Resources

•  Developed from 2000 onward, also in several EU projects: ISLE, ECHO, INTERA

•  Especially multi-media/multi-modal recordings

•  3 XML metadata schemas + special profiles for specific communities: Sign-Language, SL-acquisition, …

(5)

TLA Archive Organization

•  Archiving formats only

•  Metadata in XML files

•  Relations represented by links

•  DBs only as helpers

•  Data safety through HSM, pushing data to TLs

[Figure: TLA archive tree. Node labels: language, expedition, age group, genre, sessionX, media file, annot. file. Brackets mark "IMDI metadata" and "resources".]

(6)

[Figure: archive access overview. Labels: local tools (ARBIL, ELAN), WWW browser, local data (media files, metadata, annotations), archive, IMDI-Browser, HTTP server (resource download), Browsing/Search/Visualization, LAMUS, AMS, Archive Access, Upload data, LARI, TROVE.]

All resources accessible by HTTP if authorized

PID service

All web-apps can be configured to use either Shibboleth or a local LDAP for authentication

(7)

Archive Administration

[Figure: archive administration components. Labels: archive, archive manager, LAMUS, crawler, AMS, IMDI browser, IMDI search, content search, IMDI lucene idx; databases imdidb, corpus structure, amsdb, annexdb, lamusdb; components connected through APIs.]

(9)

Why ‘user managed’ deposition?

•  Increasing costs

–  New, cheaper technologies for recording, digitization and storage cause a huge increase in data quantities.

•  Using depositor knowledge

–  Researcher/depositor knows where to put the data in the logical structure (catalogue) of the archive.

–  Communication with archive managers is overhead.

•  Offer remote archiving services

–  Support distributed projects

•  Stricter checking

–  Make checks explicit

–  Archive managers have short contracts, so knowledge seems to get lost.

•  Maximizing deposition

–  80 percent of all recordings are in danger (UNESCO report)

–  We want to open our archive for external depositors

(10)

LAMUS is a web application that allows:

•  Uploading and naming individual resources (media, annotations, information files)

•  Specifying ‘limited’ metadata and mutual relations for and between resources

•  Creating relevant linguistic groupings for the data (sub-corpora)

LAMUS will:

•  Carry out checks for consistency and coherence: check for accepted formats etc. (configurable list)

•  Update databases and indexes

•  Issue PIDs for the new resources and metadata records
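
The checks listed above can be illustrated with a minimal sketch. It is not LAMUS code: the `Upload` class, the extension list, and the `hdl:0000` PID prefix are all assumptions made up for this example, and real PIDs would come from the PID service.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
import uuid

# Hypothetical "configurable list" of accepted formats (not the real LAMUS list).
ACCEPTED_EXTENSIONS = {".wav", ".mpg", ".eaf", ".imdi", ".txt"}


@dataclass
class Upload:
    name: str        # file name chosen by the depositor
    linked_to: list  # names of other uploads this resource refers to


def check_uploads(uploads, accepted=ACCEPTED_EXTENSIONS):
    """Return a list of problems; an empty list means the upload set passes."""
    problems = []
    names = {u.name for u in uploads}
    for u in uploads:
        ext = PurePosixPath(u.name).suffix.lower()
        if ext not in accepted:
            problems.append(f"{u.name}: format {ext or '(none)'} not accepted")
        for target in u.linked_to:
            if target not in names:
                problems.append(f"{u.name}: link to missing resource {target}")
    return problems


def issue_pids(uploads, prefix="hdl:0000"):
    """Assign a placeholder PID to every upload (prefix is an assumption)."""
    return {u.name: f"{prefix}/{uuid.uuid4()}" for u in uploads}
```

In this sketch PIDs would only be issued once `check_uploads` returns an empty list, mirroring the order of the steps on the slide.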

(11)

[Figure: data moving between the archive, a workspace, and the local disk]

(12)

The Archive

•  Check out into a workspace

•  Modify/add/.. in the workspace

•  Check in: add to original after

–  consistency check

–  versioning

Local tools:

•  Arbil,

•  ELAN,

•  Shoebox,

•  …

[Figure labels: "Using Arbil", "using LAMUS"]

(13)

TLA – Versioning of resources

TLA versioning policy

•  Nothing gets actually deleted

•  Users can delete resources; these are removed from the visible collection (corpus tree) but remain in the archive

•  Users can update (replace) existing resources

–  The new version will get a new PID

–  The old version will be shelved but keeps its PID

•  Access to old versions is managed by the owner
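
A toy model may make the policy concrete. The class below is a sketch under stated assumptions, not the TLA implementation: the `VersionedArchive` name, the in-memory dictionaries, and the `pid-N` format are invented for illustration.

```python
import itertools


class VersionedArchive:
    """Toy model: nothing is deleted, and every version has its own PID."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.store = {}    # PID -> content; entries are never removed
        self.visible = {}  # resource name -> PID of the current version
        self.shelved = {}  # resource name -> PIDs of older versions

    def _new_pid(self):
        # Placeholder PID format, not the one issued by the real PID service.
        return f"pid-{next(self._counter)}"

    def add(self, name, content):
        pid = self._new_pid()
        self.store[pid] = content
        self.visible[name] = pid
        return pid

    def update(self, name, content):
        # The old version is shelved and keeps its PID; the new one gets a new PID.
        self.shelved.setdefault(name, []).append(self.visible[name])
        return self.add(name, content)

    def delete(self, name):
        # Only the entry in the visible tree goes away; the content stays in store.
        return self.visible.pop(name)
```

Owner-managed access to shelved versions is left out of the sketch.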

(14)

AMS – Access Management System

•  User role administration: archive manager, domain curator, domain manager, domain editor

•  Set a required license

•  Set access rules per media type: annotations, images, audio, video, info

•  A rule sets access/denial to user/group for type of data

•  Special groups: ‘all’, ‘registered user’

•  Rules have priority

•  Inheritance of rules by descendant nodes

[Figure: corpus tree with rules (Rule 1, Rule 2, Rule 3) attached to nodes; other labels: "Sign", "academic"]
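
One possible reading of "rules have priority" plus "inheritance by descendant nodes" is sketched below. The slides do not say how the two interact, so everything here is an assumption: the `Rule` and `Node` classes, "larger priority number wins", "the nearer node wins on a tie", and "deny when no rule matches".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    group: str       # e.g. "all", "registered user", or a named group
    media_type: str  # "annotations", "images", "audio", "video", "info"
    allow: bool
    priority: int    # assumption: the larger number wins


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    rules: list = field(default_factory=list)


def is_allowed(node, user_groups, media_type):
    """Walk from the node up to the root and apply the strongest matching rule."""
    best = None  # ((priority, -distance), allow)
    distance = 0
    current = node
    while current is not None:
        for rule in current.rules:
            if rule.media_type == media_type and rule.group in user_groups:
                key = (rule.priority, -distance)
                if best is None or key > best[0]:
                    best = (key, rule.allow)
        current = current.parent
        distance += 1
    # Assumption: access is denied when no rule applies.
    return best[1] if best else False
```

With a corpus node that denies audio to ‘all’ at priority 1 and allows it to ‘registered user’ at priority 2, a session below that node inherits both rules, and a registered user gets access while an anonymous one does not.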

(15)

IMDI-Browser & Metadata Search

•  Browse the hierarchy of corpora

•  Inspect metadata records

•  Create bookmarks

–  resources

–  IMDI-Browser showing resources

•  Show PIDs, URLs for resources and metadata

•  Make resource access requests

•  Search the metadata:

–  simple keyword,


(21)

Regional Archives Initiative: cooperation of TLA/MPI-PL with other organizations interested in EL archiving

They use the TLA LAT archiving software

•  Encourage local resource collecting & archiving

•  A network of South American archives has been established, and contacts with CLARA were made

(22)

Data Synchronization I

Synchronization physical structure

•  Use rsync software

•  Complete replication

•  No special conditions possible

•  Use for backup to computing centers

Synchronization logical structure

•  Special software needed

•  Per corpus copy to a selected target

•  Owner can make special exceptions

•  Use to synchronize between archives

[Figure: corpus trees copied between archives; label: "Logical synchronization"]
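
For the physical variant, a complete replication with rsync could look roughly like the command built below. The paths and the host name are placeholders, and the flag choice (`--archive`, `--delete`, `--dry-run`) is an assumption about a typical mirror setup, not the command TLA actually runs.

```python
import shlex


def rsync_mirror_command(source_dir, target, dry_run=True):
    """Build an rsync command line that mirrors source_dir onto target."""
    args = ["rsync", "--archive", "--delete"]
    if dry_run:
        # Show what would be transferred without changing the target.
        args.append("--dry-run")
    # A trailing slash makes rsync copy the directory contents, not the directory.
    args += [source_dir.rstrip("/") + "/", target]
    return args


# Hypothetical source path and backup host.
cmd = rsync_mirror_command("/archive/data", "backup.example.org:/mirror/data")
print(shlex.join(cmd))
```

Because `--delete` removes anything on the target that is missing from the source, the copy is all-or-nothing, which matches "no special conditions possible" on the slide.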

(23)

Data Synchronization II

[Figure: corpus trees on both sides; labels: LAMUS, archive API, HTTP server, COSIX]

COSIX: complex logic to compare corpus trees and determine

•  what is new

•  what to replace

•  what to add

•  what to delete

In a cooperation with CMU, COSIX is used to copy CHILDES and Talkbank corpora into our archive. CMU generates IMDI records on the fly from their DBs.
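
The kind of comparison described for COSIX can be sketched in a much reduced form. This is not the COSIX algorithm: it assumes each corpus tree has been flattened to a mapping from node path to content checksum, and the function name and return shape are invented for the example.

```python
def compare_trees(source, target):
    """Compare two flattened corpus trees (dict: node path -> checksum).

    Returns the paths to add to, replace in, and delete from the target.
    """
    src, tgt = set(source), set(target)
    return {
        "add": sorted(src - tgt),
        "replace": sorted(p for p in src & tgt if source[p] != target[p]),
        "delete": sorted(tgt - src),
    }


# Hypothetical trees: the remote source and the local copy.
remote = {"corpus/s1.imdi": "aa", "corpus/s2.imdi": "bb", "corpus/s3.imdi": "cc"}
local = {"corpus/s1.imdi": "aa", "corpus/s2.imdi": "b0", "corpus/old.imdi": "dd"}
plan = compare_trees(remote, local)
```

Here "what is new" is covered by the `add` and `replace` lists together; real corpus trees also carry links between metadata and resources, which this sketch ignores.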

(24)

Technical Info

•  Java web applications running inside a Tomcat servlet container

•  PostgreSQL DBMS

•  Platform: Linux

•  Web-app frameworks: JSP, Applets, JSF, FLEX, Wicket, …

•  Works with most web browsers (Explorer,

(25)

LAMUS & LAT Future

•  TLA is part of CLARIN and is promoting CMDI, so …

•  We are planning the transition from LAMUS-IMDI to LAMUS-CMDI

•  We analyzed our set-up and still like the LAT fundamentals, e.g. file-based, modularity, …

•  But we will also alleviate some current problems and inconveniences:

–  limited metadata editing in LAMUS

–  Insufficient provenance tracking of resources

–  Better handling of download/modify/upload cycle

–  Better integration with other (LAT) archives and

(27)

Thank you for your attention

CLARIN has received funding from the European Community's Seventh Framework Programme

