Data Categories and Codes

V0.01 First version

WG Charter

Standardisation of data categories and codes for human communication resources

Value Proposition

A specific description of who will benefit from the adoption / implementation of the CWG outcomes /“deliverables” and what tangible impacts the adoption / implementation of the deliverables should have.

The outputs of this Working Group will support greater data sharing, easier data discovery, and interoperability of repositories and archives.

Who benefits:

• interdisciplinary researchers

• in-domain researchers

• researchers’ institutions (through increased visibility and accessibility)

• data collections/repositories (visibility, accessibility)

Outcomes:

• Engagement with ISO processes in support of standardisation of ISO 639 by TC37/SC2

• Establishment of Australian mirror committee of TC37

Deliverables:

• Recommended set of CMDI core components and ISO data categories onto which they map

• Recommended CMDI schema with mappings to metadata schemas currently accepted in the relevant domains

Impacts:

1. Improved practice in identification of all aspects of human communication in resource descriptions

2. Easier identification of language resources in other repositories where language is not the primary focus

3. Easier identification of non-language aspects (e.g. music) in repositories where language is the primary focus

4. Users of language archives will receive improved access to resources

5. Language archives will be able to more easily share resources

6. Increased researcher influence on standards that they use

7. Greater discovery (and hence potential for re-use) of resources

8. More explicit semantics around language and music resources, enabling more informed re-use and facilitating automated re-use

Engagement with existing work in the area

A brief review of related work and plan for engagement with any other activities in the area.

A number of existing initiatives are working in related areas. The following table shows these initiatives by impact area. This Working Group will liaise closely with these initiatives to avoid duplication of effort and to ensure coordination across this space.

Initiative | Full name | Relevant impacts
CMDI | Component MetaData Infrastructure | [Applied in all]
E-MELD | Electronic Metastructure for Endangered Languages Data | 1, 6, 8
DOBES/LIBES | Dokumentation Bedrohter Sprachen (Documentation of Endangered Languages) | 1, 3, 4, 5, 6, 7, 8
CLARIN ERIC | Common Language Resources and Technology Infrastructure | 1, 4, 5, 6, 7, 8
ISO TC-37 | | 1, 8
META-SHARE | Multilingual Europe Technology Alliance | All
FROLIC | Framework for the Organization of Language Identification Codes | 1, 6, 8
RELISH | Rendering Endangered Language Lexicons Interoperable through Standards Harmonization | 1, 3, 4, 5, 6, 7, 8

Plan for engagement with any other activities

HuNI

(Humanities Networked Infrastructure)

The HuNI Project is using Linked Open Data technology to integrate 28 of Australia’s most important cultural datasets into a ‘virtual laboratory’. Many of these datasets contain material relevant to research on human communication, although not all are primarily oriented to such material. Engagement with this project is therefore important in developing the WG’s aim to enable access to human communication resources outside of domain-specific repositories.

HCS vLab

(Human Communication Science Virtual Laboratory)

The HCS vLab will connect HCS researchers, their desks, computers, labs, and universities and so accelerate HCS research and produce emergent knowledge that comes from novel application of previously unshared tools to analyse previously difficult-to-access data sets. The HCS vLab infrastructure will overcome resource limitations of individual desktops; allow easy access to shared tools and data; and provide the guided use of workflow tools and options to allow researchers to cross disciplinary boundaries. One of the bases of this project is sharing of data; engagement with it will provide important input to the WG’s activities in improving interoperability of HCS datasets.

ISO TC37

(ISO Technical Committee 37: Terminology and other language and content resources)

TC37 is the body which has responsibility for overseeing the development of international standards for identifying codes for languages, language families and varieties within languages. Engagement with this committee is crucial to the planned activities of the WG in order to ensure that any recommendations align with proposed standards.

CLARIN

Common Language Resources and Technology Infrastructure

CLARIN aims at providing easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. The aims of the WG are closely aligned with those of CLARIN; engagement with the organisation will ensure the input of European expertise in the WG’s activity and will provide a platform for the adoption of outputs in European research communities.

LinguistList

Through its involvement in projects such as E-MELD and RELISH (see above), LinguistList has become a point of contact for US-based work on interoperability of language resources. Engagement with this organisation will ensure the input of North American expertise in the WG’s activity and will provide a platform for the adoption of outputs in international research communities (given the worldwide reach of LinguistList as a medium of communication).

Repositories/Aggregators

The endpoint of the WG’s activity should be the adoption of its recommendations by the repositories active in the relevant fields. This would also flow through to aggregators of metadata from repositories (such as OLAC, the Open Language Archiving Community). Engagement with these stakeholders from the WG’s inception is therefore critical. Such engagement will be achieved through direct contact with repositories (WG members Drude, Barwick and Thieberger are staff members of relevant repositories) and through representative bodies such as DELAMAN (Digital Endangered Languages and Music Archives Network).

Work Plan

A specific and detailed description of how the CWG will operate including:

a) The form and description of final deliverables of the candidate Working Group,

• Recommended set of CMDI core components and an associated schema for use by repositories holding resources which include a language component;

• Recommended set of data categories onto which the CMDI components map, to be proposed as candidate standards within the ISOcat process;

• Mappings between common metadata schemas currently in use in relevant repositories and the CMDI schema.

b) The form and description of milestones and intermediate documents, code or other deliverables that will be developed during the course of the CWG’s work,

For each of the deliverables listed above:

i) Discussion papers detailing the issues to be addressed and canvassing possible solutions.

ii) Draft recommendations based on i) and input from relevant communities.

Additionally:

Establishment of an Australian mirror committee of ISO TC37 as an avenue for increased engagement of research communities in the ISO 639 processes.

Once agreed, it is expected that the deliverables will be progressively implemented in computer software and systems. See also Adoption Plan below.

c) a description of the Working Group’s mode and frequency of operation (e.g. on-line and/or on-site, how frequently will the group meet, etc.),

The WG will work mainly on-line; however, opportunities for physical meetings by members of the WG (e.g. alongside other conferences or meetings) will also be used.

d) a description of how the Working Group plans to develop consensus, address conflicts, stay on track and within scope, and move forward during operation, and

The WG will proceed by means of open processes in online environments. Representatives of key stakeholders are involved, which will ensure that proposals made by the WG have an excellent chance of acceptance by consensus across those groups. The Australian involvement is based on existing networks which have had an outstanding record of co-operative endeavour over at least the last five years (e.g. HCSNet). Similarly, the European involvement in the WG is based partially on the CLARIN network, which has an excellent record in fostering co-operative work. Wider co-operation between these two groups has been underway for the past year. Whilst the WG will call on this basis of co-operation to enable its work, the involvement of the stakeholder groups will also ensure sound governance and oversight of the progress of the group.

e) a description of the CWG’s planned approach to broader community engagement and participation

The interim documents detailed at b) above will be made available as widely as possible. The WG will have a distribution list of major stakeholders and documents will be sent to them for consideration; in addition, documents will be made available online for general consultation and this process will be publicised in relevant fora, e.g. LinguistList.

Adoption Plan

A specific plan for adoption / implementation of the CWG deliverables / outcomes within the organizations and institutions represented by CWG members, as well as plans for adoption / implementation of the deliverables / outcomes more broadly within the community. Such adoption / implementation should start within the 12-18 month timeframe, prior to the completion of the Working Group.

Initial implementation of the WG deliverables will mean use of the CMDI components and schema in description of resources in the two repositories represented in the WG (The Language Archive and PARADISEC) [approximate timescale: 12 months]. At the same time, the data categories associated with the CMDI schema will be proposed as standards within the ISOcat framework [approximate timescale: 12 months]. When these two initial stages are completed, the outcomes of the WG’s activities will be visible to relevant communities and activity will move to advocating the adoption of the schema and standards more widely. This will be accomplished by conference presentations, publications and demonstrations [approximate timescale: 12-18 months].

Initial Membership

A specific list of initial members of the CWG and a description of initial leadership of the CWG.

Linda Barwick, PARADISEC/USyd

Anna Belew, Linguist List/Eastern Michigan University

Steve Cassidy, HCS V-Lab/Macquarie

Sebastian Drude, TLA/MPI

Dominique Estival, MARCS/UWS

Ingrid Mason, HuNI/Intersect

Simon Musgrave, ARGILARE/Monash/AusNC (Initial leader)

Gary Simons, SIL/Graduate Institute of Applied Linguistics/OLAC

Nick Thieberger, PARADISEC/UniMelb

Michael Walsh, AIATSIS/USyd

Menzo Windhouwer, TLA/DANS
