LANGUAGE CODING IN INFORMATION
TECHNOLOGIES
TKE 2014: Language Codes at the Crossroads
Peter Constable
Does industry need or care about ISO
639-3?
“The main question that I have is whether language identification should be a task for ISO, the International Organization for Standardization… ISO is
basically an organization [f]or industry, not for science… The reason why ISO got involved in language name issues in the first place is of course the
economic significance of translation and localization, which is far greater than the relevance of distant stars for businesses. But does this mean that someone needs ISO’s industry standard to identify little-known languages of small communities that are never or hardly used in writing and that are
often in danger of extinction?”
“Business has an interest in the stable identification of economically
significant languages, for example for translation and computer localization, and this is why the ISO 639-1 and 639-2 standards were established in the first place. However, those standards are adequate for the needs of
industry; business has no significant interest in the many small, unwritten and often endangered languages with no measurable economic impact.” — “Kwamikagami” (Wikipedia contributor), April 2014
Information technologies rely heavily on ISO 639, including ISO 639-3
IETF BCP 47 is a key industry technology using ISO 639-3
Many of linguists’ needs for language identification can be accommodated by the same BCP 47 mechanisms used in the IT industry
AGENDA
• Use of language identifiers in information technologies
• History: development of ISO 639-3 and industry adoption via BCP 47
• Which languages are significant?
• Overview of IETF BCP 47 (language tags)
• Utility of industry mechanisms for linguists
USE OF LANGUAGE IDENTIFIERS IN
INFORMATION TECHNOLOGIES
USE OF LANGUAGE IDENTIFIERS
• Tagging content to declare the language of content
• Text, audio, video
• Tagging of software resources for language-specific processing
• Matching user language preferences with content
USE OF LANGUAGE IDENTIFIERS
Examples
• Display of content in my preferred language
• Web pages, videos, captions, etc.
• Display of application user interfaces in my preferred language
• Activating input methods for different languages
• Spell checking
• Text-to-speech
• … (many others)
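Matching user language preferences with content can be sketched as a fallback search: try the user's full tag, then progressively shorter prefixes of it. The sketch below is loosely modeled on the "lookup" scheme of RFC 4647; the function name `lookup`, the default value, and the sample list of available tags are illustrative assumptions, not part of any product's API.

```python
def lookup(preferences, available, default="en"):
    """Return the best available tag for an ordered list of preferred tags."""
    # Language tags are case-insensitive, so compare lowercased forms.
    by_lower = {tag.lower(): tag for tag in available}
    for pref in preferences:
        subtags = pref.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            # Truncate from the right and try the shorter tag.
            subtags.pop()
            # Drop a dangling single-character subtag (e.g. "x" or "t").
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default  # nothing matched (assumed fallback)


available = ["en", "pt", "pt-BR", "az-Cyrl"]  # hypothetical content languages
print(lookup(["pt-PT", "en"], available))
print(lookup(["az-Cyrl-AZ"], available))
```

With this ordering, a user asking for pt-PT is served generic Portuguese content before falling through to their second preference.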
DEVELOPMENT OF ISO 639-3 AND
INDUSTRY ADOPTION VIA BCP 47
LANDSCAPE CIRCA 2000
Language documentation / applied linguistics
• Research, literature development in 1000s of languages
• Large language corpora
LANDSCAPE CIRCA 2000
Industry
• Limited “locale” identifier mechanisms
• Windows: numbers — 512 maximum
• Mac: numbers — 150 defined
• Internet: RFC 1766 — based on ISO 639-1 and ISO 3166-1, e.g., “en-US”
• XML: using RFC 1766
• ISO 639-1: ≈ 180 languages
• ISO 639-2: ≈ 350 languages
LANDSCAPE CIRCA 2000
Changing industry landscape
• Unicode Consortium mission: “This Corporation’s specific purpose shall be
to enable people around the world to use computers in any language…”
• Unicode 3.0, finally becoming mainstream in software
• Office 97, Windows 2000, Mac OS X, XeTeX, XML, .Net, Java, C++, ECMAScript, Pango…
• Rapidly-growing interest in expanding language support
• Major vendors: “We don’t want to be a bottleneck for language
LANDSCAPE CIRCA 2000
• “We need a comprehensive language coding standard!”
DEVELOPMENT SINCE 2000
• ISO 639-3
• 2002: start of work
• 2007: published
• BCP 47
• 2001: IETF RFC 3066 — incorporation of ISO 639-2
• 2006: IETF RFC 4646 — enhancements to compatibility, stability, structure
• 2009: IETF RFC 5646 — incorporation of ISO 639-3
• Widespread adoption of BCP 47, ISO 639-3 across technologies, Web, and
CURRENT INDUSTRY USE OF ISO 639-3
• Unicode CLDR 25:
• Used in Android, Mac OS, iOS, Windows, Debian Linux, Apache, …
• Data for 369 languages in ISO 639-3 not in ISO 639-1 / ISO 639-2
• Exemplar data for 600+ more
• General support for all of ISO 639-3
• Windows 8:
• Explicit use of 115 IDs in ISO 639-3 not in ISO 639-1 / ISO 639-2
WHICH LANGUAGES ARE SIGNIFICANT FOR
INDUSTRY?
Expanded Graded Intergenerational Disruption Scale (EGIDS)
See:
https://www.ethnologue.com/about/language-status
[Chart: language counts (0 to 3000) at each EGIDS level (0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10)]
Institutional
• Mass media / publishing
• Libraries
• Education
• Commerce, marketing
• Product localization,
Developing – waning
• Limited-to-no institutional support, mass media, etc.
Use of ICTs:
• End-user content: Web, SMS, email
• Some product localization
• Significant enhancement to language stabilization, vitality
Dying – extinct
Use of ICTs:
• Language documentation
• XML
WHICH LANGUAGES ARE SIGNIFICANT FOR
INDUSTRY?
Some will be more used and better supported by industry than others…
… but all need and get some level of industry support
BCP 47
• IETF Best Current Practice specification
• Reference: http://tools.ietf.org/html/bcp47
• History:
• 1995: RFC 1766
• 2001: RFC 3066
• 2006: RFC 4646 + RFC 4647
• 2009: RFC 5646 + RFC 4647
• Designed to accommodate language variations
• Language, writing system, orthography, dialect, …
HANDLING VARIATIONS
• Start with IDs for discrete languages from ISO 639-1, ISO 639-3
• Using BCP 47, add qualifiers to language tags as needed
• Examples:
• pt-BR = Portuguese as used in Brazil
• az-Cyrl = Azerbaijani written in Cyrillic script
• ca-valencia = Valencian
• de-1996 = German using 1996 orthographic conventions
KEY COMPONENTS OF BCP 47
• Tag syntax
• Subtag registry (maintained by IANA)
• Mechanism to register “variant” subtags
• Mechanism to register extensions
BCP 47 SYNTAX
Language-Tag = langtag / privateuse
langtag      = language           ; ISO 639
               ("-" script)?      ; ISO 15924
               ("-" region)?      ; ISO 3166-1 or UN M.49
               ("-" variant)*     ; registered
               ("-" extension)*   ; registered RFC
               ("-" privateuse)?
extension    = singleton ("-" alphanum{2,8})+
privateuse   = "x" ("-" alphanum{1,8})+
BCP 47 SYNTAX
• Examples:
• haw language
• pt-BR language + ISO 3166-1 region
• es-419 language + UN M.49 region
• az-Cyrl language + script
• ca-valencia language + variant
• pww-Latn-fonipa language + script + variant
• x-foobar private use
• fil-x-foobar language + private use
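The grammar and examples above can be turned into a small parser. The following is a minimal sketch, not a conformant BCP 47 implementation: it assumes 2- or 3-letter language subtags only, ignores grandfathered tags and extended language subtags, and does not check subtags against the IANA registry. The names `LANGTAG`, `PRIVATE_ONLY`, and `parse_tag` are hypothetical.

```python
import re

# Simplified well-formedness pattern following the slide's grammar.
LANGTAG = re.compile(
    r"""^(?P<language>[a-z]{2,3})
        (?:-(?P<script>[a-z]{4}))?
        (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
        (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
        (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
        (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?$""",
    re.IGNORECASE | re.VERBOSE,
)
PRIVATE_ONLY = re.compile(r"^x(?:-[a-z0-9]{1,8})+$", re.IGNORECASE)


def parse_tag(tag):
    """Split a language tag into named components, or return None if malformed."""
    if PRIVATE_ONLY.match(tag):
        return {"privateuse": tag}
    m = LANGTAG.match(tag)
    if m is None:
        return None
    # Keep only the components that are actually present.
    parts = {k: v for k, v in m.groupdict().items() if v}
    if "variants" in parts:
        # Variants arrive as one "-a-b" string; split them into a list.
        parts["variants"] = parts["variants"].strip("-").split("-")
    if "extensions" in parts:
        # Left as a raw string here; extensions define their own structure.
        parts["extensions"] = parts["extensions"].lstrip("-")
    return parts


for example in ["haw", "pt-BR", "es-419", "az-Cyrl", "ca-valencia",
                "pww-Latn-fonipa", "x-foobar", "fil-x-foobar"]:
    print(example, parse_tag(example))
```

Length and character class are enough to tell the subtag types apart in these examples (4 letters for script, 2 letters or 3 digits for region, 5 to 8 characters or a digit-initial 4 for variant), which is why a single regular expression suffices for a sketch.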
VARIANT SUBTAGS
• Registration requests can be submitted by anyone
• Reviewed for best practice
• Added to IANA Language Subtag Registry
• Process: see http://tools.ietf.org/html/bcp47#section-3.5
• 64 variant subtags registered to date
VARIANT SUBTAGS
Examples:
Variant subtag   Meaning
aluku            Aluku dialect of the "Busi Nenge Tongo" English-based Creole continuum in Eastern Suriname and Western French Guiana
balanka          The Balanka dialect of Anii
itihasa          Epic Sanskrit
vallader         Vallader idiom of Romansh
1959acad         "Academic" ("governmental") variant of Belarusian as codified in 1959
1606nict         Late Middle French (to 1606; as in Jean Nicot, "Thresor de la langue francoyse", 1606)
baku1926         Unified Turkic Latin Alphabet (principles codified at the 1926 Turkological Conference in Baku)
VARIANT SUBTAGS
Registration form example:
LANGUAGE SUBTAG REGISTRATION FORM
1. Name of requester: Tomaž Erjavec
2. E-mail address of requester: tomaz.erjavec&ijs.si
3. Record Requested:
   Type: variant
   Subtag: metelko
   Description: Slovene in Metelko alphabet
   Prefix: sl
   Comments: The subtag represents the alphabet codified by Franc Serafin Metelko and used from 1825 to 1833.
4. Intended meaning of the subtag: The subtag marks texts written in Slovene using the historical Metelko alphabet, which is distinguished from the contemporary norm by borrowing (and modifying) letters from Cyrillic.
5. Reference to published description of the language (book or article):
   http://en.wikipedia.org/wiki/Metelko_alphabet
   Stabej, Marko. Franc Serafin Metelko in Metelčica. In (Janez Cvirn, ed.) Slovenska Kronika XIX. stoletja. (2001). Print.
6. Any other relevant information: The tag "sl-metelko" is relevant as a possible value of the @xml:lang attribute to be used by language technology applications for transcribing and modernising such texts, e.g. for text search in cultural heritage digital libraries. E.g. the National and University Library of Slovenia has plans to digitise about 5,000 pages of books written in the Metelko alphabet.
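The "Record Requested" part of the form is a set of "Field: value" lines, and its Prefix field says which tags the variant is meant to follow. A rough sketch of reading such a record and checking a tag against it is given below. The record text is transcribed from the form above rather than fetched from the registry, the helper names are hypothetical, and the sketch assumes a single Prefix field per record.

```python
def parse_record(text):
    """Parse 'Field: value' lines; indented lines continue the previous field."""
    record = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and last is not None:
            # Continuation line: append to the field being built.
            record[last] += " " + line.strip()
        else:
            name, _, value = line.partition(":")
            last = name.strip()
            record[last] = value.strip()
    return record


def variant_matches_prefix(tag, record):
    """True if `tag` starts with the record's prefix and then uses its variant."""
    subtags = tag.lower().split("-")
    prefix = record["Prefix"].lower().split("-")
    starts_with_prefix = subtags[:len(prefix)] == prefix
    return starts_with_prefix and record["Subtag"].lower() in subtags[len(prefix):]


RECORD = """\
Type: variant
Subtag: metelko
Description: Slovene in Metelko alphabet
Prefix: sl
Comments: The subtag represents the alphabet codified by
  Franc Serafin Metelko and used from 1825 to 1833.
"""

record = parse_record(RECORD)
print(variant_matches_prefix("sl-metelko", record))
print(variant_matches_prefix("hr-metelko", record))
```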
EXTENSIONS
• For concepts that go beyond language but have language as a core
component
• Created by IETF process
• Details of an extension can be owned by other authorities
• Existing extensions
• “t” — Transformed content
• RFC 6497
• Maintaining authority: Unicode Consortium
• “u” — Unicode locale
• RFC 6067
BCP 47 EXTENSIONS
• Example: “t” extension — Transformed content
• Extension defined in RFC 6497 + Unicode specification UTS #35
• Example tag: und-Hebr-t-und-latn-m0-ungegn-1977
• Syntax: language + script + “t” extension
• Meaning: content in Hebrew script transformed from Latin script according to a
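The example tag can be taken apart mechanically. The sketch below assumes the general shape described in RFC 6497: after the "t" singleton comes an optional source language tag, followed by fields whose keys are a letter plus a digit (such as "m0"). The function name is hypothetical, everything is lowercased for simplicity, and a "t" appearing inside a private-use sequence is not handled.

```python
import re

FIELD_KEY = re.compile(r"^[a-z][0-9]$")  # e.g. "m0": one letter, one digit


def split_t_extension(tag):
    """Return (content_tag, source_tag, fields) for a tag with a "t" extension."""
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return tag.lower(), None, {}
    t_index = subtags.index("t")
    content = "-".join(subtags[:t_index])
    source, fields, key = [], {}, None
    for subtag in subtags[t_index + 1:]:
        if len(subtag) == 1:
            break  # another singleton starts a different extension
        if FIELD_KEY.match(subtag):
            key = subtag
            fields[key] = []
        elif key is None:
            source.append(subtag)      # still inside the source language tag
        else:
            fields[key].append(subtag)  # value belonging to the current field
    return content, "-".join(source) or None, fields


print(split_t_extension("und-Hebr-t-und-latn-m0-ungegn-1977"))
```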
UTILITY OF INDUSTRY MECHANISMS FOR
LINGUISTS
LINGUISTICS / LANGUAGE DOCUMENTATION
Goals:
• Language development (literacy, lexicography, content development)
• Scholarly documentation
Objects of scholarly investigation
• Languages, dialects
HANDLING LINGUISTS’ NEEDS
• Use existing BCP 47 mechanisms
• Used in xml:lang
• Used in Dublin Core Metadata Element Set, Version 1.1
• Existing process to register “variant” subtags
• Existing subtags registered for different kinds of variants
• Dialect variants
• Pronunciation variants
• Historic variants
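For the xml:lang usage listed above, a short sketch may help show how a registered variant travels with the content. The sample document is invented for illustration; only the xml:lang attribute itself and the tag "sl-metelko" come from the material in this deck. The sketch uses Python's standard xml.etree.ElementTree and applies the usual rule that an element inherits the language of its nearest ancestor that declares one.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: a Slovene text with one passage in the Metelko alphabet.
SAMPLE = """<text xml:lang="sl">
  <p>...</p>
  <p xml:lang="sl-metelko">...</p>
</text>"""


def languages_in_effect(element, inherited=None):
    """Yield (element name, language tag) pairs, applying xml:lang inheritance."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages_in_effect(child, lang)


print(list(languages_in_effect(ET.fromstring(SAMPLE))))
```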
HANDLING LINGUISTS’ NEEDS
• Linguists could create a new BCP 47 extension
• Extension could use glottocode or other domain-specific vocabulary
• Example (hypothetical):
pww-l-gc-maes1238-sc-L2FSI2-sd-disarthr
ISO 639-3: Northern Pwo Karen
Extension (key-value pairs):
glottocode: Mae Sarieng variant
speaker competence: L2 speaker, estimated FSI level 2
speech defect: symptoms of dysarthria
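To make the hypothetical tag concrete, here is a sketch of how software might read it. No "l" extension exists; the singleton, the keys (gc, sc, sd), and the convention that a two-letter subtag starts a new key while longer subtags are its value are all assumptions made for illustration, following the slide's example.

```python
def parse_hypothetical_extension(tag, singleton="l"):
    """Split a tag into (base tag, {key: value}) for an assumed key-value extension."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if singleton not in lowered:
        return tag, {}
    start = lowered.index(singleton)
    base = "-".join(subtags[:start])
    pairs, key = {}, None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # another singleton would start a different extension
        if len(subtag) == 2 and subtag.isalpha():
            key = subtag.lower()  # assumed rule: two letters introduce a key
            pairs[key] = []
        elif key is not None:
            pairs[key].append(subtag)
    return base, {k: "-".join(v) for k, v in pairs.items()}


base, pairs = parse_hypothetical_extension("pww-l-gc-maes1238-sc-L2FSI2-sd-disarthr")
print(base)
print(pairs)
```

Software that does not know the extension can still fall back to the base tag, which is the practical benefit of carrying domain-specific detail this way.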
HANDLING LINGUISTS’ NEEDS
• What if ISO 639, BCP 47 aren’t enough?
• May need properties that don’t belong in language tags (e.g., speaker’s social network)
• Create appropriate data schemas
• May need properties pertaining directly to language variation, but BCP 47 variants / extensions deemed not a good fit (e.g., tentative analysis, doesn’t align to existing ISO 639-3 categories)
• Create other metadata vocabularies and data schemas
SUMMATION
• Information technologies rely heavily on ISO 639, including ISO 639-3, most
often via BCP 47
• Many needs of linguists can be accommodated by existing BCP 47
mechanisms
• Additional needs of linguists might be accommodated by creation of a new
BCP 47 extension