LANGUAGE CODING IN INFORMATION TECHNOLOGIES

(1)

TKE 2014: Language Codes at the Crossroads

Peter Constable

(2)

Does industry need or care about ISO 639-3?

(3)

“The main question that I have is whether language identification should be a task for ISO, the International Organization for Standardization… ISO is basically an organization [f]or industry, not for science… The reason why ISO got involved in language name issues in the first place is of course the economic significance of translation and localization, which is far greater than the relevance of distant stars for businesses. But does this mean that someone needs ISO’s industry standard to identify little-known languages of small communities that are never or hardly used in writing and that are often in danger of extinction?”

(4)

“Business has an interest in the stable identification of economically significant languages, for example for translation and computer localization, and this is why the ISO 639-1 and 639-2 standards were established in the first place. However, those standards are adequate for the needs of industry; business has no significant interest in the many small, unwritten and often endangered languages with no measurable economic impact.”

“Kwamikagami” (Wikipedia contributor), April 2014

(5)

Information technologies rely heavily on ISO 639, including ISO 639-3

IETF BCP 47 is a key industry technology using ISO 639-3

Many of linguists’ needs for language identification can be accommodated by the same BCP 47 mechanisms used in the IT industry

(6)

AGENDA

• Use of language identifiers in information technologies
• History: development of ISO 639-3 and industry adoption via BCP 47
• Which languages are significant?
• Overview of IETF BCP 47 (language tags)
• Utility of industry mechanisms for linguists

(7)

USE OF LANGUAGE IDENTIFIERS IN INFORMATION TECHNOLOGIES

(8)

USE OF LANGUAGE IDENTIFIERS

• Tagging content to declare the language of content
    • Text, audio, video
• Tagging of software resources for language-specific processing
• Matching user language preferences with content

(9)

USE OF LANGUAGE IDENTIFIERS

Examples

• Display of content in my preferred language
    • Web pages, videos, captions, etc.
• Display of application user interfaces in my preferred language
• Activating input methods for different languages
• Spell checking
• Text-to-speech
• … (many others)
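
Matching user language preferences with content is often done by progressively truncating the preferred tag, in the style of the "lookup" scheme of RFC 4647. The following is a minimal Python sketch, not a complete implementation; the function name `lookup`, the default value, and the list of available content languages are assumptions made for illustration.

```python
def lookup(preferences, available, default="en"):
    """Return the first available tag that matches a user preference.

    Each preferred tag is truncated from the right, one subtag at a
    time, until it equals an available tag (compared case-insensitively).
    """
    by_lower = {tag.lower(): tag for tag in available}
    for pref in preferences:
        subtags = pref.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # A single-character subtag (such as "x" or "u") cannot end a
            # tag, so it is dropped together with the subtag after it.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


# Hypothetical set of languages in which some content is available.
available = ["en", "pt-BR", "az-Cyrl", "de"]

print(lookup(["de-1996", "en"], available))   # falls back from de-1996 to de
print(lookup(["az-Cyrl-AZ"], available))      # falls back to az-Cyrl
print(lookup(["haw"], available))             # nothing matches: default
```

Note that a preference of plain "pt" would not match "pt-BR" under this scheme; lookup only ever shortens the user's tag, never the content's.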

(10)

DEVELOPMENT OF ISO 639-3 AND INDUSTRY ADOPTION VIA BCP 47

(11)

LANDSCAPE CIRCA 2000

Language documentation / applied linguistics

• Research, literature development in 1000s of languages
• Large language corpora

(12)

LANDSCAPE CIRCA 2000

Industry

• Limited “locale” identifier mechanisms
    • Windows: numbers (512 maximum)
    • Mac: numbers (150 defined)
    • Internet: RFC 1766, based on ISO 639-1 and ISO 3166-1, e.g., “en-US”
    • XML: using RFC 1766
• ISO 639-1: ≈ 180 languages
• ISO 639-2: ≈ 350 languages

(13)

LANDSCAPE CIRCA 2000

Changing industry landscape

• Unicode Consortium mission: “This Corporation’s specific purpose shall be to enable people around the world to use computers in any language…”
• Unicode 3.0, finally becoming mainstream in software
    • Office 97, Windows 2000, Mac OS X, XeTeX, XML, .Net, Java, C++, ECMAScript, Pango…
• Rapidly-growing interest in expanding language support
• Major vendors: “We don’t want to be a bottleneck for language

(14)

LANDSCAPE CIRCA 2000

• “We need a comprehensive language coding standard!”

(15)

DEVELOPMENT SINCE 2000

• ISO 639-3
    • 2002: start of work
    • 2007: published
• BCP 47
    • 2001: IETF RFC 3066, incorporation of ISO 639-2
    • 2006: IETF RFC 4646, enhancements to compatibility, stability, structure
    • 2009: IETF RFC 5646, incorporation of ISO 639-3
• Widespread adoption of BCP 47, ISO 639-3 across technologies, Web, and

(16)

CURRENT INDUSTRY USE OF ISO 639-3

• Unicode CLDR 25:
    • Used in Android, Mac OS, iOS, Windows, Debian Linux, Apache, …
    • Data for 369 languages in ISO 639-3 not in ISO 639-1 / ISO 639-2
    • Exemplar data for 600+ more
• General support for all of ISO 639-3
• Windows 8:
    • Explicit use of 115 IDs in ISO 639-3 not in ISO 639-1 / ISO 639-2

(17)

(18)

WHICH LANGUAGES ARE SIGNIFICANT FOR INDUSTRY?

Extended Graded Intergenerational Disruption Scale (EGIDS)

See:

https://www.ethnologue.com/about/language-status

(19)

[Chart: counts (axis 0 to 3000) by EGIDS level: 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10]

(20)

[Chart: counts (axis 0 to 3000) by EGIDS level: 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10]

Institutional
• Mass media / publishing
• Libraries
• Education
• Commerce, marketing
• Product localization,

(21)

[Chart: counts (axis 0 to 3000) by EGIDS level: 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10]

Developing – waning
• Limited-to-no institutional support, mass media, etc.

Use of ICTs:
• End-user content: Web, SMS, email
• Some product localization
• Significant enhancement to language stabilization, vitality

(22)

[Chart: counts (axis 0 to 3000) by EGIDS level: 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10]

Dying – extinct

Use of ICTs:
• Language documentation
• XML

(23)

WHICH LANGUAGES ARE SIGNIFICANT FOR INDUSTRY?

Some will be more used and better supported by industry than others…

… but all need and get some level of industry support

(24)

(25)

BCP 47

• IETF Best Current Practice specification
• Reference: http://tools.ietf.org/html/bcp47
• History:
    • 1995: RFC 1766
    • 2001: RFC 3066
    • 2006: RFC 4646 + RFC 4647
    • 2009: RFC 5646 + RFC 4647
• Designed to accommodate language variations
    • Language, writing system, orthography, dialect, …

(26)

HANDLING VARIATIONS

• Start with IDs for discrete languages from ISO 639-1, ISO 639-3
• Using BCP 47, add qualifiers to language tags as needed
• Examples:
    • pt-BR = Portuguese as used in Brazil
    • az-Cyrl = Azerbaijani written in Cyrillic script
    • ca-valencia = Valencian
    • de-1996 = German using 1996 orthographic conventions

(27)

KEY COMPONENTS OF BCP 47

• Tag syntax
• Subtag registry (maintained by IANA)
• Mechanism to register “variant” subtags
• Mechanism to register extensions

(28)

BCP 47 SYNTAX

    Language-Tag = langtag / privateuse

    langtag      = language             ; ISO 639
                   ("-" script)?        ; ISO 15924
                   ("-" region)?        ; ISO 3166-1 or UN M.49
                   ("-" variant)*       ; registered
                   ("-" extension)*     ; registered RFC
                   ("-" privateuse)?

    extension    = singleton ("-" alphanum{2,8})+
    privateuse   = "x" ("-" alphanum{1,8})+
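
The simplified grammar above can be sketched as a regular expression. This is an illustration only: the subtag shapes (2 to 3 letters for language, 4 letters for script, 2 letters or 3 digits for region, 5 to 8 alphanumerics or a digit plus 3 alphanumerics for variant) follow RFC 5646, but extended language subtags, grandfathered tags, and any check against the IANA registry are deliberately left out. The names `LANGTAG`, `PRIVATE_ONLY`, and `parse` are hypothetical.

```python
import re

# One optional or repeated group per line of the simplified grammar.
LANGTAG = re.compile(
    r"""^
    (?P<language>[a-z]{2,3})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?
    $""",
    re.IGNORECASE | re.VERBOSE,
)
PRIVATE_ONLY = re.compile(r"^x(?:-[a-z0-9]{1,8})+$", re.IGNORECASE)


def parse(tag):
    """Split a tag into its components; well-formedness only, no registry lookup."""
    if PRIVATE_ONLY.match(tag):
        return {"privateuse": tag}
    match = LANGTAG.match(tag)
    if match is None:
        raise ValueError(f"not well-formed under this simplified grammar: {tag!r}")
    # Keep only the components that are actually present.
    parts = {name: value for name, value in match.groupdict().items() if value}
    if "variants" in parts:
        parts["variants"] = parts["variants"].strip("-").split("-")
    if "extensions" in parts:
        # Left as one raw string; extension-specific parsing is out of scope here.
        parts["extensions"] = parts["extensions"].lstrip("-")
    return parts


for example in ["haw", "pt-BR", "es-419", "az-Cyrl", "ca-valencia",
                "pww-Latn-fonipa", "x-foobar", "fil-x-foobar"]:
    print(example, parse(example))
```

A production system would normally rely on an existing BCP 47 library rather than a hand-written pattern; the sketch is only meant to show how the positional structure of a tag makes each subtag's role recoverable from its shape.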

(29)

BCP 47 SYNTAX

• Examples:
    • haw                language
    • pt-BR              language + ISO 3166-1 region
    • es-419             language + UN M.49 region
    • az-Cyrl            language + script
    • ca-valencia        language + variant
    • pww-Latn-fonipa    language + script + variant
    • x-foobar           private use
    • fil-x-foobar       language + private use

(30)

VARIANT SUBTAGS

• Registration requests can be submitted by anyone
• Reviewed for best practice
• Added to IANA Language Subtag Registry
• Process: see http://tools.ietf.org/html/bcp47#section-3.5
• 64 variant subtags registered to date

(31)

VARIANT SUBTAGS

Examples:

Variant subtag    Meaning
aluku             Aluku dialect of the "Busi Nenge Tongo" English-based Creole continuum in Eastern Suriname and Western French Guiana
balanka           The Balanka dialect of Anii
itihasa           Epic Sanskrit
vallader          Vallader idiom of Romansh
1959acad          "Academic" ("governmental") variant of Belarusian as codified in 1959
1606nict          Late Middle French (to 1606, as in Jean Nicot, "Thresor de la langue francoyse", 1606)
baku1926          Unified Turkic Latin Alphabet (principles codified at the 1926 Turkological Conference in Baku)

(32)

VARIANT SUBTAGS

Registration form example:

LANGUAGE SUBTAG REGISTRATION FORM

1. Name of requester: Tomaž Erjavec
2. E-mail address of requester: tomaz.erjavec&ijs.si
3. Record Requested:
   Type: variant
   Subtag: metelko
   Description: Slovene in Metelko alphabet
   Prefix: sl
   Comments: The subtag represents the alphabet codified by Franc Serafin Metelko and used from 1825 to 1833.
4. Intended meaning of the subtag: The subtag marks texts written in Slovene using the historical Metelko alphabet, which is distinguished from the contemporary norm by borrowing (and modifying) letters from Cyrillic.
5. Reference to published description of the language (book or article):
   http://en.wikipedia.org/wiki/Metelko_alphabet
   Stabej, Marko. Franc Serafin Metelko in Metelčica. In (Janez Cvirn, ed.) Slovenska Kronika XIX. stoletja. (2001). Print.
6. Any other relevant information: The tag "sl-metelko" is relevant as a possible value of the @xml:lang attribute to be used by language technology applications for transcribing and modernising such texts, e.g. for text search in cultural heritage digital libraries. E.g. the National and University Library of Slovenia has plans to digitise about 5,000 pages of books written in the Metelko alphabet.
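
Once registered, the "Record Requested" fields become a record in the IANA Language Subtag Registry, where records are separated by "%%" lines and long values continue on indented lines. The sketch below shows one way software might read such records and use the Prefix field to flag a variant that appears with an unexpected language. The sample text is hand-written from the form above plus the ca-valencia example, not copied from the registry, and `parse_records` and `variant_prefix_ok` are hypothetical helper names; the prefix test is a simplification of what RFC 5646 recommends.

```python
# Illustrative records only; the real registry carries further fields.
REGISTRY_SAMPLE = """\
Type: variant
Subtag: metelko
Description: Slovene in Metelko alphabet
Prefix: sl
Comments: The subtag represents the alphabet codified by Franc
  Serafin Metelko and used from 1825 to 1833.
%%
Type: variant
Subtag: valencia
Description: Valencian
Prefix: ca
"""


def parse_records(text):
    """Parse "%%"-separated records into dicts mapping field name to a list of values."""
    records = []
    for chunk in text.split("%%"):
        record, last_field = {}, None
        for line in chunk.splitlines():
            if not line.strip():
                continue
            if line[0].isspace() and last_field:
                # Indented line: continuation of the previous field's value.
                record[last_field][-1] += " " + line.strip()
            else:
                name, _, value = line.partition(":")
                last_field = name.strip()
                # Fields such as Description and Prefix may repeat.
                record.setdefault(last_field, []).append(value.strip())
        if record:
            records.append(record)
    return records


def variant_prefix_ok(tag, records):
    """Return False if a known variant follows something other than its Prefix."""
    subtags = tag.lower().split("-")
    for record in records:
        if record.get("Type") != ["variant"]:
            continue
        variant = record["Subtag"][0].lower()
        if variant not in subtags[1:]:
            continue
        prefixes = [p.lower() for p in record.get("Prefix", [])]
        before = "-".join(subtags[:subtags.index(variant, 1)])
        if prefixes and not any(
            before == p or before.startswith(p + "-") for p in prefixes
        ):
            return False
    return True


records = parse_records(REGISTRY_SAMPLE)
print(variant_prefix_ok("sl-metelko", records))   # variant used with its prefix
print(variant_prefix_ok("de-metelko", records))   # variant used with another language
```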

(33)

EXTENSIONS

• For concepts that go beyond language but have language as a core component
• Created by IETF process
• Details of an extension can be owned by other authorities
• Existing extensions
    • “t”: Transformed content
        • RFC 6497
        • Maintaining authority: Unicode Consortium
    • “u”: Unicode locale
        • RFC 6067

(34)

BCP 47 EXTENSIONS

• Example: “t” extension (Transformed content)
    • Extension defined in RFC 6497 + Unicode specification UTS #35
    • Example tag: und-Hebr-t-und-latn-m0-ungegn-1977
    • Syntax: language + script + “t” extension
    • Meaning: content in Hebrew script transformed from Latin script according to a
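
To make the structure of the example tag concrete, here is a rough Python sketch that splits a tag carrying a “t” extension into the content tag, the source tag, and the key-value fields. It assumes, following RFC 6497, that field keys are one letter followed by one digit (such as m0, which UTS #35 uses for the transform mechanism) and that everything between “t” and the first key is the source language tag. The function name `parse_t_extension` is hypothetical, and a “t” that appears inside a private-use sequence is not handled.

```python
import re


def parse_t_extension(tag):
    """Split a tag with a "t" extension into content, source, and fields."""
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return None
    start = subtags.index("t")
    source, fields, key = [], {}, None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # another singleton: a different extension or private use
        if re.fullmatch(r"[a-z][0-9]", subtag):
            key = subtag          # a field key such as "m0"
            fields[key] = []
        elif key is None:
            source.append(subtag)  # still inside the source language tag
        else:
            fields[key].append(subtag)
    return {
        "content": "-".join(subtags[:start]),
        "source": "-".join(source),
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }


print(parse_t_extension("und-Hebr-t-und-latn-m0-ungegn-1977"))
```

For the example tag this separates the content part (und-Hebr), the source part (und-latn), and a single field with key m0 and value ungegn-1977, all reported in lowercase.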

(35)

UTILITY OF INDUSTRY MECHANISMS FOR LINGUISTS

(36)

LINGUISTICS / LANGUAGE DOCUMENTATION

Goals:

• Language development (literacy, lexicography, content development)
• Scholarly documentation

Objects of scholarly investigation

• Languages, dialects

(37)

HANDLING LINGUISTS’ NEEDS

• Use existing BCP 47 mechanisms
    • Used in xml:lang
    • Used in Dublin Core Metadata Element Set, Version 1.1
• Existing process to register “variant” subtags
    • Existing subtags registered for different kinds of variants
        • Dialect variants
        • Pronunciation variants
        • Historic variants
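
As an illustration of BCP 47 tags in xml:lang, the following Python sketch walks a small XML document and reports the effective language of each element, with child elements inheriting the nearest enclosing xml:lang value. The sample document and the function name `effective_languages` are made up for this example; the attribute key is the standard XML namespace name as exposed by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical corpus fragment using tags mentioned in these slides.
SAMPLE = """\
<text xml:lang="sl">
  <p>Paragraph in the contemporary orthography.</p>
  <p xml:lang="sl-metelko">Paragraph in the Metelko alphabet.</p>
  <p xml:lang="pww-Latn-fonipa"><w>Phonetic transcription.</w></p>
</text>
"""


def effective_languages(element, inherited=""):
    """Yield (element name, language tag), inheriting xml:lang from ancestors."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


root = ET.fromstring(SAMPLE)
for name, lang in effective_languages(root):
    print(name, lang)
```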

(38)

HANDLING LINGUISTS’ NEEDS

• Linguists could create a new BCP 47 extension
    • Extension could use glottocode or other domain-specific vocabulary
    • Example (hypothetical):

        pww-l-gc-maes1238-sc-L2FSI2-sd-disarthr

        ISO 639-3: Northern Pwo Karen
        Extension, key-value pairs:
            glottocode: Mae Sarieng variant
            speaker competence: L2 speaker, estimated FSI level 2
            speech defect: symptoms of dysarthria
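
If such an extension existed, reading it would be straightforward. The sketch below is purely hypothetical, like the example itself: no “l” extension is registered, and the key names gc, sc, and sd, the table `KEY_NAMES`, and the function `parse_l_extension` are taken from or invented for this slide's example only.

```python
# Hypothetical keys from the example above; not part of any registered extension.
KEY_NAMES = {
    "gc": "glottocode",
    "sc": "speaker competence",
    "sd": "speech defect",
}


def parse_l_extension(tag):
    """Return the key-value pairs of the hypothetical "l" extension in a tag."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "l" not in lowered:
        return {}
    start = lowered.index("l")
    collected, key = {}, None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # a new singleton ends this extension
        if subtag.lower() in KEY_NAMES:
            key = KEY_NAMES[subtag.lower()]
            collected[key] = []
        elif key is not None:
            collected[key].append(subtag)
    return {name: "-".join(values) for name, values in collected.items()}


print(parse_l_extension("pww-l-gc-maes1238-sc-L2FSI2-sd-disarthr"))
```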

(39)

HANDLING LINGUISTS’ NEEDS

• What if ISO 639, BCP 47 aren’t enough?
    • May need properties that don’t belong in language tags (e.g., speaker’s social network)
        • Create appropriate data schemas
    • May need properties pertaining directly to language variation, but BCP 47 variants / extensions deemed not a good fit (e.g., tentative analysis, doesn’t align to existing ISO 639-3 categories)
        • Create other metadata vocabularies and data schemas

(40)

(41)

SUMMATION

• Information technologies rely heavily on ISO 639, including ISO 639-3, most often via BCP 47
• Many needs of linguists can be accommodated by existing BCP 47 mechanisms
• Additional needs of linguists might be accommodated by creation of a new BCP 47 extension