LANGUAGE CODING IN INFORMATION TECHNOLOGIES

(1)

LANGUAGE CODING IN INFORMATION TECHNOLOGIES

TKE 2014: Language Codes at the Crossroads

Peter Constable

(2)

Does industry need or care about ISO 639-3?

(3)

“The main question that I have is whether language identification should be a task for ISO, the International Organization for Standardization… ISO is basically an organization [f]or industry, not for science… The reason why ISO got involved in language name issues in the first place is of course the economic significance of translation and localization, which is far greater than the relevance of distant stars for businesses. But does this mean that someone needs ISO’s industry standard to identify little-known languages of small communities that are never or hardly used in writing and that are often in danger of extinction?”

(4)

“Business has an interest in the stable identification of economically significant languages, for example for translation and computer localization, and this is why the ISO 639-1 and 639-2 standards were established in the first place. However, those standards are adequate for the needs of industry; business has no significant interest in the many small, unwritten and often endangered languages with no measurable economic impact.”

Source: “Kwamikagami” (Wikipedia contributor), April 2014

(5)

Information technologies rely heavily on ISO 639, including ISO 639-3

IETF BCP 47 is a key industry technology using ISO 639-3

Many of linguists’ needs for language identification can be accommodated by the same BCP 47 mechanisms used in the IT industry

(6)

AGENDA

Use of language identifiers in information technologies

History: development of ISO 639-3 and industry adoption via BCP 47

Which languages are significant?

Overview of IETF BCP 47 (language tags)

Utility of industry mechanisms for linguists

(7)

USE OF LANGUAGE IDENTIFIERS IN INFORMATION TECHNOLOGIES

(8)

USE OF LANGUAGE IDENTIFIERS

Tagging content to declare the language of content

• Text, audio, video

Tagging of software resources for language-specific processing

Matching user language preferences with content

(9)

USE OF LANGUAGE IDENTIFIERS

Examples

Display of content in my preferred language

• Web pages, videos, captions, etc.

Display of application user interfaces in my preferred language

Activating input methods for different languages

Spell checking

Text-to-speech

… (many others)

(10)

DEVELOPMENT OF ISO 639-3 AND INDUSTRY ADOPTION VIA BCP 47

(11)

LANDSCAPE CIRCA 2000

Language documentation / applied linguistics

Research, literature development in 1000s of languages

Large language corpora

(12)

LANDSCAPE CIRCA 2000

Industry

Limited “locale” identifier mechanisms

• Windows: numbers (512 maximum)

• Mac: numbers (150 defined)

• Internet: RFC 1766 — based on ISO 639-1 and ISO 3166-1, e.g., “en-US”

• XML: using RFC 1766

ISO 639-1: ≈ 180 languages

ISO 639-2: ≈ 350 languages

(13)

LANDSCAPE CIRCA 2000

Changing industry landscape

Unicode Consortium mission: “This Corporation’s specific purpose shall be to enable people around the world to use computers in any language…”

Unicode 3.0, finally becoming mainstream in software

• Office 97, Windows 2000, Mac OS X, XeTeX, XML, .NET, Java, C++, ECMAScript, Pango…

Rapidly-growing interest in expanding language support

Major vendors: “We don’t want to be a bottleneck for language

(14)

LANDSCAPE CIRCA 2000

“We need a comprehensive language coding standard!”

(15)

DEVELOPMENT SINCE 2000

ISO 639-3

• 2002: start of work

• 2007: published

BCP 47

• 2001: IETF RFC 3066 — incorporation of ISO 639-2

• 2006: IETF RFC 4646 — enhancements to compatibility, stability, structure

• 2009: IETF RFC 5646 — incorporation of ISO 639-3

Widespread adoption of BCP 47, ISO 639-3 across technologies, Web, and

(16)

CURRENT INDUSTRY USE OF ISO 639-3

Unicode CLDR 25:

• Used in Android, Mac OS, iOS, Windows, Debian Linux, Apache, …

• Data for 369 languages in ISO 639-3 not in ISO 639-1 / ISO 639-2

• Exemplar data for 600+ more

General support for all of ISO 639-3

Windows 8:

• Explicit use of 115 IDs in ISO 639-3 not in ISO 639-1 / ISO 639-2

(17)
(18)

WHICH LANGUAGES ARE SIGNIFICANT FOR INDUSTRY?

Expanded Graded Intergenerational Disruption Scale (EGIDS)

See:

https://www.ethnologue.com/about/language-status

(19)

[Chart: EGIDS levels 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10 plotted against a scale of 0 to 3000]

(20)

[Chart: EGIDS levels 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10 plotted against a scale of 0 to 3000]

Institutional

Mass media / publishing

Libraries

Education

Commerce, marketing

Product localization,

(21)

[Chart: EGIDS levels 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10 plotted against a scale of 0 to 3000]

Developing – waning

Limited-to-no institutional support, mass media, etc.

Use of ICTs:

• End-user content: Web, SMS, email

• Some product localization

• Significant enhancement to language stabilization, vitality

(22)

[Chart: EGIDS levels 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10 plotted against a scale of 0 to 3000]

Dying – extinct

Use of ICTs:

• Language documentation

• XML

(23)

WHICH LANGUAGES ARE SIGNIFICANT FOR INDUSTRY?

Some will be more used and better supported by industry than others…

… but all need and get some level of industry support

(24)
(25)

BCP 47

IETF Best Current Practice specification

Reference: http://tools.ietf.org/html/bcp47

History:

• 1995: RFC 1766

• 2001: RFC 3066

• 2006: RFC 4646 + RFC 4647

• 2009: RFC 5646 + RFC 4647

Designed to accommodate language variations

• Language, writing system, orthography, dialect, …
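
RFC 4647, listed in the history above, covers matching language tags against user preferences. The following is a minimal Python sketch of its "lookup" style of matching, in which a preferred tag is progressively shortened until it matches an available one. It assumes a simplified reading of the RFC (no wildcard ranges, no quality weights), and the function name `lookup` and the sample tags are illustrative rather than taken from the slides.

```python
def lookup(preferred, available, default=None):
    """Return the best available tag for an ordered list of preferred tags.

    Simplified sketch of RFC 4647 lookup: each preferred tag is truncated
    from the right, one subtag at a time, until it equals an available tag.
    """
    # Tags compare case-insensitively; remember the original spelling.
    by_lower = {tag.lower(): tag for tag in available}
    for wanted in preferred:
        subtags = wanted.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # A single-character subtag (extension singleton or "x") is
            # dropped together with the subtag that followed it.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


print(lookup(["pt-BR"], ["en", "pt"]))
print(lookup(["az-Cyrl-AZ"], ["az-Latn", "az-Cyrl"]))
```

Under these assumptions a user who asks for pt-BR is served pt content when no Brazilian-specific content exists, which is the "matching user language preferences with content" use case named earlier in the deck.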

(26)

HANDLING VARIATIONS

Start with IDs for discrete languages from ISO 639-1, ISO 639-3

Using BCP 47, add qualifiers to language tags as needed

Examples:

• pt-BR = Portuguese as used in Brazil

• az-Cyrl = Azerbaijani written in Cyrillic script

• ca-valencia = Valencian

• de-1996 = German using 1996 orthographic conventions

(27)

KEY COMPONENTS OF BCP 47

Tag syntax

Subtag registry (maintained by IANA)

Mechanism to register “variant” subtags

Mechanism to register extensions

(28)

BCP 47 SYNTAX

Language-Tag = langtag / privateuse

langtag      = language           ; ISO 639
               ("-" script)?      ; ISO 15924
               ("-" region)?      ; ISO 3166-1 or UN M.49
               ("-" variant)*     ; registered
               ("-" extension)*   ; registered RFC
               ("-" privateuse)?

extension    = singleton ("-" alphanum{2,8})+

privateuse   = "x" ("-" alphanum{1,8})+

(29)

BCP 47 SYNTAX

Examples:

• haw language

• pt-BR language + ISO 3166-1 region

• es-419 language + UN M.49 region

• az-Cyrl language + script

• ca-valencia language + variant

• pww-Latn-fonipa language + script + variant

• x-foobar private use

• fil-x-foobar language + private use
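
The syntax and examples above can be sketched as a small Python parser. This is a simplified illustration, not a complete RFC 5646 validator: it ignores extended language subtags and grandfathered tags, does not check subtags against the IANA registry, and the names `LANGTAG` and `parse_tag` are hypothetical.

```python
import re

# Simplified langtag pattern following the slide's grammar (assumption:
# no extlang, no grandfathered tags, no registry validation).
LANGTAG = re.compile(
    r"""^(?:
        (?P<language>[a-z]{2,8})                                  # ISO 639
        (?:-(?P<script>[a-z]{4}))?                                # ISO 15924
        (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                       # ISO 3166-1 / UN M.49
        (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)    # registered variants
        (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)      # singleton + subtags
        (?:-(?P<private>x(?:-[a-z0-9]{1,8})+))?                   # private use
      |
        (?P<private_only>x(?:-[a-z0-9]{1,8})+)                    # whole tag is private use
    )$""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_tag(tag):
    """Split a language tag into its named components."""
    match = LANGTAG.match(tag)
    if match is None:
        raise ValueError(f"not a well-formed tag under this sketch: {tag!r}")
    if match.group("private_only"):
        return {"privateuse": match.group("private_only")}
    parts = {"language": match.group("language")}
    for name in ("script", "region"):
        if match.group(name):
            parts[name] = match.group(name)
    variants = [v for v in match.group("variants").split("-") if v]
    if variants:
        parts["variants"] = variants
    if match.group("extensions"):
        parts["extensions"] = match.group("extensions").lstrip("-")
    if match.group("private"):
        parts["privateuse"] = match.group("private")
    return parts


for example in ("haw", "pt-BR", "es-419", "az-Cyrl", "ca-valencia",
                "pww-Latn-fonipa", "x-foobar", "fil-x-foobar"):
    print(example, parse_tag(example))
```

Because subtag lengths and character classes differ by position (four letters for a script, two letters or three digits for a region, and so on), the components can be told apart without any lookup table, which is part of what makes the tags easy to process.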

(30)

VARIANT SUBTAGS

Registration requests can be submitted by anyone

Reviewed for best practice

Added to IANA Language Subtag Registry

Process: see http://tools.ietf.org/html/bcp47#section-3.5

64 variant subtags registered to date

(31)

VARIANT SUBTAGS

Examples:

Variant subtag    Meaning

aluku       Aluku dialect of the "Busi Nenge Tongo" English-based Creole continuum in Eastern Suriname and Western French Guiana

balanka     The Balanka dialect of Anii

itihasa     Epic Sanskrit

vallader    Vallader idiom of Romansh

1959acad    "Academic" ("governmental") variant of Belarusian as codified in 1959

1606nict    Late Middle French (to 1606, as in Jean Nicot, "Thresor de la langue francoyse", 1606)

baku1926    Unified Turkic Latin Alphabet (principles codified at the 1926 Turkological Conference in Baku)

(32)

VARIANT SUBTAGS

Registration form example:

LANGUAGE SUBTAG REGISTRATION FORM

1. Name of requester: Tomaž Erjavec

2. E-mail address of requester: tomaz.erjavec&ijs.si

3. Record Requested:

   Type: variant
   Subtag: metelko
   Description: Slovene in Metelko alphabet
   Prefix: sl
   Comments: The subtag represents the alphabet codified by Franc Serafin Metelko and used from 1825 to 1833.

4. Intended meaning of the subtag: The subtag marks texts written in Slovene using the historical Metelko alphabet, which is distinguished from the contemporary norm by borrowing (and modifying) letters from Cyrillic.

5. Reference to published description of the language (book or article):

   http://en.wikipedia.org/wiki/Metelko_alphabet
   Stabej, Marko. Franc Serafin Metelko in Metelčica. In (Janez Cvirn, ed.) Slovenska Kronika XIX. stoletja. (2001). Print.

6. Any other relevant information: The tag "sl-metelko" is relevant as a possible value of the @xml:lang attribute to be used by language technology applications for transcribing and modernising such texts, e.g. for text search in cultural heritage digital libraries. E.g. the National and University Library of Slovenia has plans to digitise about 5,000 pages of books written in the Metelko alphabet.
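
The "Record Requested" part of the form is a set of "Field: value" lines, so software can read it directly. The Python sketch below parses the record from the form above and checks whether a tag uses the variant together with its stated prefix. It is an illustration under simple assumptions (one field per line, no continuation lines), and the helper names `parse_record` and `matches_prefix` are hypothetical.

```python
# Fields copied from the registration form above.
RECORD = """\
Type: variant
Subtag: metelko
Description: Slovene in Metelko alphabet
Prefix: sl
"""


def parse_record(text):
    """Parse 'Field: value' lines into a dict of lists (fields may repeat)."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        field, _, value = line.partition(":")
        record.setdefault(field.strip(), []).append(value.strip())
    return record


def matches_prefix(tag, record):
    """True if the tag contains the variant and begins with one of its prefixes."""
    subtags = tag.lower().split("-")
    variant = record["Subtag"][0].lower()
    if variant not in subtags[1:]:
        return False
    before = "-".join(subtags[:subtags.index(variant, 1)])
    prefixes = [p.lower() for p in record.get("Prefix", [])]
    if not prefixes:
        return True
    return any(before == p or before.startswith(p + "-") for p in prefixes)


record = parse_record(RECORD)
print(matches_prefix("sl-metelko", record))
print(matches_prefix("de-metelko", record))
```

Here "sl-metelko" satisfies the check and "de-metelko" does not, reflecting the form's statement that the variant is intended for Slovene.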

(33)

EXTENSIONS

For concepts that go beyond language but have language as a core component

Created by IETF process

Details of an extension can be owned by other authorities

Existing extensions

• “t” — Transformed content

• RFC 6497

• Maintaining authority: Unicode Consortium

• “u” — Unicode locale

• RFC 6067

(34)

BCP 47 EXTENSIONS

Example: “t” extension — Transformed content

• Extension defined in RFC 6497 + Unicode specification UTS #35

• Example tag: und-Hebr-t-und-latn-m0-ungegn-1977

• Syntax: language + script + “t” extension

• Meaning: content in Hebrew script transformed from Latin script according to a

(35)

UTILITY OF INDUSTRY MECHANISMS FOR LINGUISTS

(36)

LINGUISTICS / LANGUAGE DOCUMENTATION

Goals:

Language development (literacy, lexicography, content development)

Scholarly documentation

Objects of scholarly investigation

Languages, dialects

(37)

HANDLING LINGUISTS’ NEEDS

Use existing BCP 47 mechanisms

• Used in xml:lang

• Used in Dublin Core Metadata Element Set, Version 1.1

Existing process to register “variant” subtags

• Existing subtags registered for different kinds of variants

• Dialect variants

• Pronunciation variants

• Historic variants
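
As an illustration of BCP 47 tags in xml:lang, the Python sketch below reads a small XML fragment and reports the language in effect for each element, with child elements inheriting the value from their parent unless they override it. The sample document and the function name `effective_languages` are made up for this example; the tag "sl-metelko" is the one from the registration form earlier in the deck.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: modern Slovene text with one passage in the
# historical Metelko alphabet.
SAMPLE = """
<text xml:lang="sl">
  <p>modern passage</p>
  <p xml:lang="sl-metelko">historical passage</p>
</text>
"""


def effective_languages(element, inherited=None):
    """Yield (element name, language tag in effect) for the whole tree."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


root = ET.fromstring(SAMPLE)
for name, lang in effective_languages(root):
    print(name, lang)
```

A search tool for a digital library could use the same traversal to route passages tagged "sl-metelko" to a transcription step and leave passages tagged "sl" untouched.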

(38)

HANDLING LINGUISTS’ NEEDS

Linguists could create a new BCP 47 extension

• Extension could use glottocode or other domain-specific vocabulary

• Example (hypothetical):

pww-l-gc-maes1238-sc-L2FSI2-sd-disarthr

ISO 639-3: Northern Pwo Karen

Extension (key-value pairs):

glottocode: Mae Sarieng variant

speaker competence: L2 speaker, estimated FSI level 2

speech defect: symptoms of dysarthria
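
To show how software might read such a tag, the Python sketch below extracts the key-value pairs from the hypothetical "l" extension above. No such extension is registered; the sketch assumes, by analogy with the tag shown, that keys are two characters long and that each key is followed by one or more longer value subtags. The function name `extension_pairs` is likewise hypothetical.

```python
def extension_pairs(tag, singleton):
    """Return {key: value} for the extension introduced by `singleton`.

    Assumes two-character keys followed by longer value subtags, as in
    the hypothetical example pww-l-gc-maes1238-sc-L2FSI2-sd-disarthr.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    try:
        # Skip position 0, which is always the primary language subtag.
        start = lowered.index(singleton.lower(), 1) + 1
    except ValueError:
        return {}
    pairs = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            break  # next extension singleton or private-use "x"
        if len(subtag) == 2:
            key = subtag.lower()
            pairs[key] = []
        elif key is not None:
            pairs[key].append(subtag)
    return {k: "-".join(v) for k, v in pairs.items()}


print(extension_pairs("pww-l-gc-maes1238-sc-L2FSI2-sd-disarthr", "l"))
```

The keys gc, sc, and sd correspond to the glottocode, speaker competence, and speech defect properties listed above.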

(39)

HANDLING LINGUISTS’ NEEDS

What if ISO 639, BCP 47 aren’t enough?

• May need properties that don’t belong in language tags (e.g., speaker’s social network)

• Create appropriate data schemas

• May need properties pertaining directly to language variation, but BCP 47 variants / extensions deemed not a good fit (e.g., tentative analysis, doesn’t align to existing ISO 639-3 categories)

• Create other metadata vocabularies and data schemas

(40)
(41)

SUMMATION

Information technologies rely heavily on ISO 639, including ISO 639-3, most often via BCP 47

Many needs of linguists can be accommodated by existing BCP 47 mechanisms

Additional needs of linguists might be accommodated by creation of a new BCP 47 extension
