The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands
LAMUS & LAT Archiving software
Daan Broeder
• MPI for Psycholinguistics research corpora: child language,
bilingualism, gesture, sign language, Corpus Spoken Dutch, second language learner corpora, etc.
• Archive for the DOBES project
• Hosting (and inviting) corpora for other projects in need
(UNESCO study: 80% of all material is endangered)
– DBD, NGT, Leiden Univ. language documentation corpora
– Donated endangered language corpora
– Eibl-Eibesfeldt human ethology collection
• Maintain a metadata catalog for properly described resources from
other institutes
– BAS, C-ORAL-ROM (Univ. Florence), …
– LR from Lund Univ, INL, other archive partners
• Copy of CHILDES and TalkBank corpora from CMU
Mainly annotated audio/video recordings
50 TB: 200k MD records, 250k AV resources, 200k annotation files, lexicons, sketch grammars, etc.
History
• Started in 2000 to try to solve the mounting data
chaos at the MPI for Psycholinguistics
• First needed proper data descriptions
• Archive software development linked to the
IMDI metadata set for Language Resources
• First archive was basically a file system with
metadata descriptions and resource files
• Tools operating directly on the files
• A researcher's notebook disk was just as
IMDI – ISLE Metadata Initiative
• Metadata schema for Language Resources
• Developed from 2000 also in several EU projects
ISLE, ECHO, INTERA
• Especially multi-media/multi-modal recordings
• 3 XML metadata schemas + special profiles for
specific communities: Sign-Language, SL-acquisition, …
TLA Archive Organization
• Archiving formats only
• Metadata in XML files
• Relations represented by links
• DBs only as helpers
• Data safety through HSM, pushing data to TLs
[Figure: TLA archive tree with IMDI metadata nodes above the resources; example levels: language, expedition, age group, genre, session X, media file, annotation file]
[Figure: archive access overview. Labels: local tools (ARBIL, ELAN), WWW browser, local data (media files, metadata, annotations), archive, HTTP server (resource download), IMDI-Browser (browsing/search/visualization), LAMUS (upload data), AMS (archive access), LARI, TROVE]
All resources accessible by HTTP if authorized
PID service
All web-apps can be configured to use either Shibboleth or a local LDAP for authentication
Archive Administration
[Figure: component overview. Labels: archive, imdidb, structurecorpus, amsdb, annexdb, lamusdb, LAMUS, AMS, crawler, archive manager, content search, IMDI Lucene index, IMDI search, IMDI browser; the components are connected through APIs]
Why ‘user managed’ deposition?
• Increasing costs
– New, cheaper technologies for recording, digitization and storage
cause a huge increase in data quantities.
• Using depositor knowledge
– Researcher/depositor knows where to put the data in the logical
structure (catalogue) of the archive.
– Communication with archive managers is overhead.
• Offer remote archiving services
– Support distributed projects
• Stricter checking
– Make checks explicit
– Archive managers have short contracts, knowledge seems to get lost.
• Maximizing deposition
– 80 percent of all recordings are in danger (UNESCO report)
– We want to open our archive to external depositors
LAMUS is a web-application that allows
• Uploading and naming individual resources (media,
annotations, information files)
• Specifying ‘limited’ metadata and mutual relations for
and between resources
• Creating relevant linguistic groupings for the data
(sub-corpora)
LAMUS will:
• Carry out checks for consistency and coherence: check
for accepted formats etc. (configurable list)
• Update databases and indexes
• Issue PIDs for the new resources and metadata records
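The format check against a configurable list can be pictured with a minimal sketch. This is not LAMUS code: the function name `check_upload` and the extension list are assumptions made for illustration, and a real check would inspect file content as well as names.

```python
from pathlib import PurePosixPath

# Hypothetical configurable list of accepted formats (not LAMUS's real list).
ACCEPTED_EXTENSIONS = {".wav", ".mpg", ".eaf", ".imdi", ".txt", ".pdf"}


def check_upload(filenames, accepted=ACCEPTED_EXTENSIONS):
    """Split uploaded file names into accepted and rejected lists."""
    accepted_files, rejected_files = [], []
    for name in filenames:
        # Compare extensions case-insensitively.
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in accepted:
            accepted_files.append(name)
        else:
            rejected_files.append(name)
    return accepted_files, rejected_files


ok, bad = check_upload(["session1.WAV", "session1.eaf", "notes.docx"])
```

Keeping the accepted list as a parameter mirrors the "configurable list" on the slide: an archive manager could change it without touching the check itself.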
[Figure: workspace model. Using LAMUS: check out from the archive into a workspace, modify/add/.., check in; changes are added to the original after a consistency check and versioning. Using Arbil: work on the local disk with local tools (Arbil, ELAN, Shoebox, …).]
TLA – Versioning of resources
TLA versioning policy
• Nothing gets actually deleted
• Users can delete resources; these are removed
from the visible collection (corpus tree) but remain in the archive
• Users can update (replace) existing resources
– The new version will get a new PID
– The old version will be shelved but keeps its PID
• Access to old versions is managed by the owner
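The policy above can be illustrated with a small sketch. It is only a model of the rules on this slide, not the archive's implementation: the class names, the `pid:0001`-style identifiers and the in-memory storage are all assumptions (real PIDs are issued by a PID service).

```python
import itertools
from dataclasses import dataclass, field

_pid_counter = itertools.count(1)


def new_pid():
    """Return a made-up identifier; real PIDs come from a PID service."""
    return f"pid:{next(_pid_counter):04d}"


@dataclass
class Version:
    pid: str
    content: bytes
    shelved: bool = False


@dataclass
class Resource:
    name: str
    versions: list = field(default_factory=list)
    visible: bool = True  # part of the visible corpus tree?

    def update(self, content):
        """Add a new version with a new PID; shelve the previous one."""
        if self.versions:
            self.versions[-1].shelved = True  # keeps its own PID
        version = Version(new_pid(), content)
        self.versions.append(version)
        return version.pid

    def delete(self):
        """Hide the resource from the corpus tree; nothing is destroyed."""
        self.visible = False


resource = Resource("session1.eaf")
first_pid = resource.update(b"version 1")
second_pid = resource.update(b"version 2")
resource.delete()
```

After these calls both versions are still stored, each under its own PID, and only the visibility flag has changed.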
AMS – Access Management System
• User role administration: archive
manager, domain curator, domain manager, domain editor
• Set a required license
• Set access rules per media type:
annotations, images, audio, video, info
• A rule grants or denies access to a user/
group for a type of data
• Special groups: ‘all’, ‘registered
user’
• Rules have priority
• Inheritance of rules by descendant
nodes
[Figure: corpus tree with Rules 1–3 attached to nodes and inherited by their descendants; screenshot label: "Sign academic"]
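How prioritized, inherited rules might resolve to an access decision can be sketched as follows. The slides do not say how AMS orders or combines rules, so everything here is an assumption for illustration: a higher priority number wins, the nearest node wins on a tie, and access is denied when no rule matches.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Rule:
    group: str       # e.g. "all", "registered user", or a named group
    data_type: str   # annotations, images, audio, video, info
    allow: bool
    priority: int    # assumption: higher number wins


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    rules: list = field(default_factory=list)


def is_allowed(node, user_groups, data_type):
    """Resolve access by walking from the node up to the root."""
    groups = set(user_groups) | {"all"}  # everybody belongs to 'all'
    best = None  # ((priority, -distance), allow)
    distance = 0
    current = node
    while current is not None:
        for rule in current.rules:
            if rule.data_type == data_type and rule.group in groups:
                key = (rule.priority, -distance)
                if best is None or key > best[0]:
                    best = (key, rule.allow)
        distance += 1
        current = current.parent  # inherited rules from ancestors
    return best[1] if best else False  # assumption: deny by default


corpus = Node("corpus", rules=[
    Rule("all", "annotations", True, 1),
    Rule("all", "video", False, 1),
])
session = Node("session", parent=corpus, rules=[
    Rule("registered user", "video", True, 2),
])
```

With this set-up an anonymous visitor can read annotations in `session` through the inherited rule but not video, while a member of 'registered user' can also open the video because the session-level rule has the higher priority.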
IMDI-Browser & Metadata Search
• Browse the hierarchy of corpora
• Inspect metadata records
• Create bookmarks
– resources
– IMDI-Browser showing resources
• Show PIDs, URLs for resources and metadata
• Make resource access requests
• Search the metadata:
– simple keyword,
Regional Archives Initiative
• Cooperation of TLA/MPI-PL with other organizations interested in EL archiving
• They use TLA LAT archiving software
• Encourage local resource collecting & archiving
• Network of South American archives has been established and contacts
with CLARA were made
Data Synchronization I
Synchronization of the physical structure
• Use rsync software
• Complete replication
• No special conditions possible
• Use for backup to computing
centers
Synchronization of the logical structure
• Special software needed
• Per-corpus copy to a selected
target
• Owner can make special
exceptions
• Use to synchronize between
archives
[Figure: logical synchronization of corpus trees between archives]
Data Synchronization II
[Figure: corpus trees, LAMUS, archive API, HTTP server, COSIX]
COSIX: complex logic to compare corpus trees and determine
• what is new
• what to replace
• what to add
• what to delete
In cooperation with CMU, COSIX is used to copy the CHILDES and TalkBank corpora into
our archive. CMU generates IMDI records on the fly from its databases.
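The kind of tree comparison COSIX performs can be pictured with a much simplified sketch. This is not the COSIX algorithm: it assumes each corpus tree has been flattened into a mapping from resource path to checksum, and it only derives three sets (add, replace, delete), leaving out the separate "what is new" category from the slide. All names and values are hypothetical.

```python
def compare_trees(source, target):
    """Compare two flattened corpus trees (path -> checksum).

    Returns which paths to add to, replace in, or delete from the target
    so that it matches the source.
    """
    add = sorted(p for p in source if p not in target)
    replace = sorted(
        p for p in source if p in target and source[p] != target[p]
    )
    delete = sorted(p for p in target if p not in source)
    return {"add": add, "replace": replace, "delete": delete}


# Hypothetical trees: paths and checksums are made up for illustration.
source_tree = {
    "corpus/session1/media.wav": "a1",
    "corpus/session1/annot.eaf": "b2",
    "corpus/session2/media.wav": "c3",
}
target_tree = {
    "corpus/session1/media.wav": "a1",
    "corpus/session1/annot.eaf": "b1",
    "corpus/session0/old.wav": "z9",
}

plan = compare_trees(source_tree, target_tree)
```

Under the versioning policy described earlier, a "delete" in such a plan would hide a resource from the corpus tree rather than destroy it.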
Technical Info
• Java web-applications running inside Tomcat
servlet container
• PostgreSQL DBMS
• Platform: Linux
• Web-app frameworks: JSP, Applets, JSF, FLEX,
Wicket,…
• Works with most web browsers (Explorer,
LAMUS & LAT Future
• TLA is part of CLARIN and is promoting CMDI, so …
• We are planning the transition from LAMUS – IMDI to
LAMUS CMDI
• We analyzed our set-up and still like the LAT
foundations, e.g. file-based, modularity, …
• But we will also alleviate some current problems and
inconveniences:
– limited metadata editing in LAMUS
– Insufficient provenance tracking of resources
– Better handling of download/modify/upload cycle
– Better integration with other (LAT) archives and
Thank you for your attention
CLARIN has received funding from
the European Community's Seventh Framework Programme