Everything you always wanted to know about UTF-8* (*but never dared to ask)

(1)

Juliette Reinders Folmer | Advies en zo

(2)

“Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.”

Mark Pilgrim, April 2004

“Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.”

J. Graham, April 2004

(3)

Some common misconceptions

• Unicode !== UTF-8

• UTF-8 !== internationalization

• UTF-8 !== charset

(4)

Why worry about it anyway ?

• It’s all about being prepared:

– Company/Client gets taken over by a foreign company

– Mergers

– Expansion to other regions

– Local users/employees from other origins

• Code efficiency

• Cost

(5)

Some language statistics

• 7105 ‘living’ languages

• +/- 308 languages with > 1 million speakers

• Nr 1 language in the world? Mandarin Chinese

• Nr 2? Spanish

Did you know:

• That Germany has 27* officially recognized languages?

• That the country with the most languages is Papua New Guinea (836)?

* Alemannic, Bavarian, Danish, Frankish, Eastern Frisian, Northern Frisian, Standard German, Kabardian, Kölsch, Limburgish, Luxembourgeois, Mainfränkisch, Pfaelzisch, Plautdietsch, Polish, Balkan Romani, Sinte Romani, Vlax Romani, Saterfriesisch, Low Saxon, Upper Saxon, Lower Sorbian, Upper Sorbian, Swabian, Westphalien, Yeniche, Western Yiddish.

(6)

Top 20 languages in the world*

Rank  Language                   Total speakers (M)  Script
1     Mandarin Chinese           1,197               Vernacular Chinese
2     Spanish                    406                 Roman
3     English                    335                 Roman
4     Hindi                      260                 Devanagari
5     Arabic (standard)          223                 Arabic
6     Portuguese                 202                 Roman
7     Bengali                    193                 Bengali
8     Russian                    162                 Cyrillic
9     Japanese                   122                 Hiragana, Katakana, and Kanji
10    Javanese                   84                  Javanese
11    German (standard)          84                  Roman
12    Lahnda (Western Punjabi)   83                  Lahnda, Arabic
13    Telugu                     74                  Telugu
14    Marathi                    72                  Devanagari
15    Tamil                      69                  Tamil
16    French                     69                  Roman
17    Vietnamese                 68                  Roman
18    Korean                     66                  Korean (Hangul)
19    Urdu                       63                  Arabic
20    Italian                    61                  Roman

* Source: Ethnologue 2013

(7)

About writing systems

• There are approximately 180 writing systems in active use

• Most are used (with or without extensions) for several languages

• Some languages use more than one writing system

• Numerous other writing systems for ceremonial or religious use

• Or for fun ;-)

(8)

Distribution of writing systems

(9)

Writing system resources

• Info on languages:

http://www.ethnologue.com/

• Info on writing systems:

http://www.omniglot.com/

• Which characters are used in language X ?

http://www.eki.ee/letter/

• Info on Latin extensions for African languages

http://www.bisharat.net/A12N/

• And of course:

http://en.wikipedia.org/wiki/Writing_system

(10)

There Ain't No Such Thing As

Plain Text

(11)

On character sets and encoding

“A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points.”

(W3C)

“The character encoding reflects the way these abstract characters are mapped to

(12)

Unicode

“Unicode is a computing industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.”

(Wikipedia)

• Unicode Code charts:

(13)

UTF

• UTF = Unicode Transformation Format

• UTF-8 is one of the character encodings for implementing Unicode

• Alternatives are UTF-7 (legacy), UTF-16, UTF-32

• UTF-8 is (backward) compatible with ASCII, UTF-16/32 are not.
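To make the difference between the encodings concrete, here is a small sketch that prints the bytes of the same three characters in UTF-8, UTF-16 and UTF-32. It assumes PHP 7+ with the mbstring extension; hex_bytes() is a hypothetical helper name, not something from the slides.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to $encoding and show its raw bytes as hex.
function hex_bytes(string $utf8, string $encoding): string
{
    return bin2hex(mb_convert_encoding($utf8, $encoding, 'UTF-8'));
}

// "A" (U+0041), "é" (U+00E9) and "€" (U+20AC), written as escapes so the
// example does not depend on the encoding of this source file.
foreach (["A", "\u{00E9}", "\u{20AC}"] as $char) {
    printf(
        "%s  UTF-8: %s  UTF-16BE: %s  UTF-32BE: %s\n",
        $char,
        hex_bytes($char, 'UTF-8'),
        hex_bytes($char, 'UTF-16BE'),
        hex_bytes($char, 'UTF-32BE')
    );
}
```

Only in UTF-8 is "A" still the single byte 41 that ASCII uses, which is what the backward compatibility comes down to.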

(14)

Advantages of UTF-8

• Backward compatible with ASCII

• UTF-8 can encode any Unicode character

• XML parsers are required to support UTF-8 and UTF-16

• UTF-8 and UTF-16 are the standards for having Unicode in HTML. UTF-8 is preferred.

• Can be fairly reliably recognized with small chance of confusion.

• Sorting UTF-8 as arrays of unsigned bytes will result in the same order as sorting on Unicode code points
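The sorting property can be checked with a short sketch. It assumes PHP 7.4+ with mbstring; compare_by_code_points() is a hypothetical helper written for this illustration. Note that code point order is not the same thing as a linguistically correct sort order.

```php
<?php
// Hypothetical helper: compare two UTF-8 strings code point by code point.
function compare_by_code_points(string $a, string $b): int
{
    $toCodePoints = fn(string $s): array => array_map(
        fn(string $char): int => mb_ord($char, 'UTF-8'),
        mb_str_split($s, 1, 'UTF-8')
    );
    $cpA = $toCodePoints($a);
    $cpB = $toCodePoints($b);

    $shared = min(count($cpA), count($cpB));
    for ($i = 0; $i < $shared; $i++) {
        if ($cpA[$i] !== $cpB[$i]) {
            return $cpA[$i] <=> $cpB[$i];
        }
    }
    // All shared positions are equal: the shorter string sorts first.
    return count($cpA) <=> count($cpB);
}

$words = ["z", "\u{00E9}", "ab", "\u{20AC}", "Z", "a"];

$byBytes = $words;
usort($byBytes, 'strcmp');                       // strcmp() compares raw bytes

$byCodePoints = $words;
usort($byCodePoints, 'compare_by_code_points');  // compares Unicode code points

var_dump($byBytes === $byCodePoints);
```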

(15)

So, what’s the problem ?

• Everything defaults to non-UTF-8

Mostly the Latin writing system: ISO-8859-1 or US-ASCII

So, what’s the solution ?

• Be EXPLICIT everywhere (and I don’t mean $%&@-explicit)

(16)

We’ll be covering:

•Dependency on user’s computer setup

•Client side code – HTML, CSS, JS

•Communication with the client

•Server side code – PHP

•Communicating with a MySQL database

•MySQL

•Communicating with files

•Other common issues

(17)

User’s computer

Potential issues:

• Extended language support ?

• Code pages ?

• Font ?

(19)

Characteristics of text

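Every piece of text has a language, a writing system, a writing direction (horizontal and vertical), a character (sub)set, a character encoding, a font and a meaning.

Example 1: “What I really love”

• Language: English
• Writing system: Roman/Latin
• Writing direction (horizontal): Left to right
• Writing direction (vertical): Top to bottom
• Character (sub)set: Basic Latin
• Character encoding: UTF-16 (can vary)
• Font: Arial

Example 2: an Arabic phrase (the sample text is garbled in the source)

• Language: Arabic
• Writing system: Arabic
• Writing direction (horizontal): Right to left
• Writing direction (vertical): Top to bottom
• Character (sub)set: Arabic
• Character encoding: UTF-16 (...)
• Font: Arial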

(20)

Languages: English, Greek, Ukrainian, Mandarin Chinese, Japanese, Hindi, Korean, Kannada, Punjabi (Gurmukhi), Tamil, Tigre, Myanmar, Arabic, Farsi, Hebrew

(21)

About Fonts

• Unicode versus non-unicode fonts

Be aware & be wary !

• Few fonts capable of handling a wide range of Unicode characters.

Examples:

Arial Unicode MS, Bitstream Cyberbit, Code2000, GNU Unifont

(22)

Fonts used (one per language sample, in order): Verdana, Verdana, Verdana, SimSun, MS Mincho, Code2000, Batang, Arial Unicode MS, Lohit Punjabi, Latha, GS GeezMahtemUnicode, WinInnwa, Arial Unicode MS, Arial, Arial Unicode MS

(23)

These are the same phrases converted to the Verdana font.

WinInnwa is a non-Unicode compliant font...

(24)

To stress the importance of Unicode and

Unicode-compliant fonts:

(25)

These are the same phrases again, now converted to the Arial Unicode MS font.

The characters of the Ge’ez script (Ethiopic range) are not included in Arial Unicode MS.

(26)

Useful font-related resources:

• Gallery of Unicode fonts – find fonts per writing system/language:

http://www.wazu.jp/

• Unicode test pages and more:

http://www.alanwood.net/unicode/

• Unicode typefaces:

http://en.wikipedia.org/wiki/Unicode_typefaces

• Font frequency on computers:

(27)

Some font viewers:

• Free and easy Font viewer:

http://www.styopkin.com/details_free_and_easy_fonts_viewer.html

• ListFont:

http://www.heiner-eichmann.de/software/listfont/listfont.htm

(28)

Client side

• Always declare the character encoding

(30)

Client side – HTML

• Use meta-headers:

<meta http-equiv="Charset" content="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

• Tell the browser (and the search engines) the language too if you can:

<html dir="[DIRECTION – ltr/rtl]" lang="[LANGCODE – e.g. de-DE, or zh-cn for Mandarin Chinese]">
<meta name="language" content="[LANGCODE]">
<meta http-equiv="Content-Language"

(31)

Client side – XHTML

• Add an XML declaration at the top

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="[LANGCODE]" lang="[LANGCODE]">

(32)

Client side – CSS

• Add an encoding to the CSS file – must be on the very first line of the file!

@charset "utf-8";

• Use Unicode compliant fonts and tell the browser which to use with CSS

<p lang="zh-cn">我很喜欢</p>

P[LANG|="zh"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif;}

P[LANG="zh-cn"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif;}

P[LANG|="ar"] { font-family: Arial, "Arial Unicode MS", sans-serif; direction: rtl; }

(33)

Useful client side resources:

• W3C on best practices for internationalization:

http://www.w3.org/International/techniques/authoring-html

• ISO language codes:

http://www.sil.org/iso639-3/codes.asp

• ISO country codes:

http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements

• W3C on Language tags in HTML and XML:

http://www.w3.org/International/articles/language-tags/

(34)

Putting files on the server

• Save the file(s) as encoded in UTF-8

• Don’t forget to upload the file as binary rather than ascii !

(36)

URL’s

(37)

Client-server communication

Sending data to the client

• Send an HTTP header:

header( 'Content-Type: text/html; charset=utf-8' );

• .htaccess/ Apache’s httpd.conf

(38)

Client-server communication

.htaccess examples

• # Maps file extensions to a character encoding. Especially useful in content negotiation situations. (httpd.conf)

AddCharset utf-8 .utf8

• # Pass the default character encoding for content-type text/plain and text/html

AddDefaultCharset On|Off|charset
AddDefaultCharset UTF-8

AddDefaultCharset On => iso-8859-1

• # Add a default character encoding per file extension

AddType 'text/html; charset=UTF-8' html

• # Identify the encoding for a particular file:

<Files ~ "events\.html">
ForceType 'text/html; charset=UTF-8'
</Files>

(39)

Client-server communication

Receiving data from the client

• Make sure you send information to the server in the correct encoding

• This is especially important for user-input, i.e. Forms!!!
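On the receiving end it is worth checking that what arrives really is UTF-8 before using it. A minimal sketch, assuming the mbstring extension and a form that declares accept-charset="utf-8"; read_utf8_field() is a hypothetical helper name, and in a real page you would pass it $_POST.

```php
<?php
// Hypothetical helper: return the named field if it is a well-formed UTF-8 string,
// or null when it is missing, not a string, or not valid UTF-8.
function read_utf8_field(array $input, string $name): ?string
{
    if (!isset($input[$name]) || !is_string($input[$name])) {
        return null;
    }
    return mb_check_encoding($input[$name], 'UTF-8') ? $input[$name] : null;
}

// Simulated form submissions (instead of $_POST) so the sketch runs on its own.
$good = ['name' => "J\u{00FC}rgen"];   // "Jürgen" sent as UTF-8
$bad  = ['name' => "J\xFCrgen"];       // "Jürgen" sent as ISO-8859-1

var_dump(read_utf8_field($good, 'name'));  // the string itself
var_dump(read_utf8_field($bad, 'name'));   // NULL
```

Rejecting (or explicitly converting) invalid input at the door is a design choice; silently storing it tends to surface later as garbled text.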

(41)

PHP

• Currently not very friendly for UTF-8

• PHP6 development dormant

• Some PHP extensions come to the rescue:

– MBstring

– iconv

– ~Intl

• There are also some nifty function collections / classes available to help you.

Take note of:

(43)

PHP UTF-8 safe functions

Safe:

• explode()

• str_replace()

• PHP5+ ~htmlentities()

NOT safe:

• Everything else
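A quick illustration of the difference, as a sketch assuming PHP 7+ with mbstring. The sample string is made up for this example.

```php
<?php
$text = "na\u{00EF}ve caf\u{00E9}";    // "naïve café" in UTF-8: 10 characters, 12 bytes

// Safe: splitting on an ASCII delimiter never cuts through a multi-byte character.
$parts = explode(' ', $text);

// Not safe: strlen() and substr() work on bytes, not characters.
$byteLength = strlen($text);           // counts bytes
$cut        = substr($text, 0, 3);     // ends halfway through the two-byte "ï"
$cutIsValid = mb_check_encoding($cut, 'UTF-8');

// The multibyte-aware equivalent keeps the character intact.
$mbCut = mb_substr($text, 0, 3, 'UTF-8');

var_dump($parts, $byteLength, $cutIsValid, $mbCut);
```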

(44)

Test for well-formedness

function utf8_compliant( $string ) {
    if ( strlen( $string ) == 0 ) {
        return true;
    }
    return ( preg_match( '/^.{1}/us', $string, $array ) == 1 );
}
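For comparison, a sketch of the same check next to mb_check_encoding() from the mbstring extension (swapped in here as an alternative; it is not on the slide). The sample strings are made up.

```php
<?php
// The slide's check: with the /u modifier preg_match() returns false
// when the subject is not valid UTF-8, so the comparison with 1 fails.
function utf8_compliant(string $string): bool
{
    if (strlen($string) === 0) {
        return true;
    }
    return preg_match('/^.{1}/us', $string) === 1;
}

$samples = [
    'valid'     => "caf\u{00E9}",   // well-formed UTF-8
    'truncated' => "caf\xC3",       // lead byte without its continuation byte
    'latin1'    => "caf\xE9",       // ISO-8859-1 bytes, not UTF-8
];

foreach ($samples as $label => $sample) {
    printf(
        "%-9s preg: %s  mbstring: %s\n",
        $label,
        var_export(utf8_compliant($sample), true),
        var_export(mb_check_encoding($sample, 'UTF-8'), true)
    );
}
```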

(45)

PCRE

• PCRE can be relatively UTF-8 safe if compiled with Unicode support.

Use: preg_match('/^.+$/u', $string);

• Test whether PCRE has been compiled with Unicode support:

if ( preg_match( '/^.{1}$/u', "ñ", $UTF8_ar ) != 1 ) {
    trigger_error( 'PCRE is not compiled with UTF-8 support', E_USER_ERROR );
}

(46)

Handling text

• You don’t need htmlentities() anymore. Use htmlspecialchars() instead:

$html = htmlspecialchars( $utf8_string, ENT_COMPAT, 'UTF-8' );

• strlen() will count bytes, so use:

function utf8_strlen( $string ) {
    return strlen( utf8_decode( $string ) );
}
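On current PHP versions utf8_decode() is deprecated (as of PHP 8.2), so a sketch using mb_strlen() from the mbstring extension may be the safer route. This is an alternative to the slide's function, shown with made-up sample strings.

```php
<?php
$word = "\u{6211}\u{5F88}\u{559C}\u{6B22}";   // 我很喜欢

$byteCount = strlen($word);                    // bytes: each of these characters takes 3
$charCount = mb_strlen($word, 'UTF-8');        // characters

// htmlspecialchars() only touches the HTML special characters and leaves
// multi-byte characters alone, as long as it is told the string is UTF-8.
$escaped = htmlspecialchars("<b>caf\u{00E9}</b>", ENT_COMPAT, 'UTF-8');

var_dump($byteCount, $charCount, $escaped);
```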

(47)

utf8_encode() & utf8_decode()

• Only useful for converting between ISO-8859-1 and UTF-8
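The same kind of conversion can be sketched with mb_convert_encoding(), which names both encodings explicitly (an alternative chosen for this illustration, assuming the mbstring extension). The last step shows what happens when text that is already UTF-8 gets converted a second time.

```php
<?php
$latin1 = "caf\xE9";                                              // "café" in ISO-8859-1

$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');      // now "caf" + C3 A9
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');        // round trip

// Converting already-converted text again produces the familiar "cafÃ©" garbage.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump(bin2hex($utf8), $back === $latin1, bin2hex($doubleEncoded));
```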

(48)

MBstring extension

• Multibyte aware implementations of some of the most common PHP string functions, the POSIX extended regex extension and the mail function.

• Mbstring supports many different character sets, most importantly UTF-8.

• Allows for conversion between character sets and implements some level of encoding
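A few of the multibyte-aware functions next to their byte-oriented counterparts, as a sketch with made-up sample strings (assuming PHP 7+ with mbstring):

```php
<?php
$phrase = "caf\u{00E9} au lait";                            // "café au lait"

$bytePos = strpos($phrase, 'au');                            // byte offset
$charPos = mb_strpos($phrase, 'au', 0, 'UTF-8');             // character offset

$firstWord = mb_substr($phrase, 0, 4, 'UTF-8');              // first four characters

$upper = mb_strtoupper("\u{00E9}t\u{00E9}", 'UTF-8');        // "été" to upper case

var_dump($bytePos, $charPos, $firstWord, $upper);
```

Passing the encoding on every call is deliberate here: it keeps the example independent of whatever mb_internal_encoding() happens to be set to.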

(49)

Iconv extension

• Bundled since PHP 5+.

• Main purpose of iconv : converting between different character sets.

• From PHP 5+, iconv has implementations of some common string functions, but is slower than mbstring for UTF-8.
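A small sketch of both uses, assuming the iconv extension is available and that the underlying iconv library knows the name CP1252 (Windows-1252). The sample string is made up.

```php
<?php
// In Windows-1252 the byte 0x80 is the euro sign.
$cp1252 = "\x80 5,00";

$utf8 = iconv('CP1252', 'UTF-8', $cp1252);
if ($utf8 === false) {
    // iconv() returns false when the conversion fails.
    throw new RuntimeException('Conversion from CP1252 to UTF-8 failed');
}

$byteCount = strlen($utf8);                  // bytes
$charCount = iconv_strlen($utf8, 'UTF-8');   // characters

var_dump(bin2hex(substr($utf8, 0, 3)), $byteCount, $charCount);
```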

(50)

Intl extension

• Bundled since PHP 5.3+, but not always enabled.

• Modules:

– Collator

– Number Formatter

– Message Formatter

– Normalizer

– Locale
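Two of these modules in action, as a sketch that only runs its demo when the intl extension is loaded. sort_for_locale() and to_nfc() are hypothetical helper names, and the word list is made up.

```php
<?php
// Hypothetical helper: sort words the way a speaker of $locale would expect.
function sort_for_locale(array $words, string $locale): array
{
    $collator = new Collator($locale);
    $collator->sort($words);               // sorts in place and reindexes
    return $words;
}

// Hypothetical helper: normalize to NFC so "e" + combining accent becomes one "é".
function to_nfc(string $text): string
{
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input could not be normalized');
    }
    return $normalized;
}

if (extension_loaded('intl')) {
    $words = ["Zebra", "\u{00C4}rger", "Apfel"];      // "Ärger" starts with a two-byte character

    sort($words, SORT_STRING);                        // byte order puts "Ärger" last
    print_r($words);

    print_r(sort_for_locale($words, 'de_DE'));        // German collation puts it after "Apfel"

    var_dump(to_nfc("e\u{0301}") === "\u{00E9}");
}
```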

(51)

Useful resources

• http://www.php.net/mbstring

• http://www.php.net/iconv

• http://www.php.net/intl

• http://www.php.net/regexp.reference.unicode

• http://www.phpwact.org/php/i18n

• http://sourceforge.net/projects/phputf8

(52)

Communicating with MySQL

• The connection between PHP and MySQL defaults to a latin1 connection.

• The first query you should run after making your connection:

mysql_query( 'SET NAMES "utf8" [COLLATE "collation_name"]' );

OR

mysql_query( 'SET CHARACTER SET utf8' );

• PHP 5.2+:
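The mysql_* functions used above no longer exist in PHP 7 and later. One possible equivalent with the mysqli extension is sketched below; connect_utf8() is a hypothetical helper name, and the choice of utf8mb4 (MySQL's full four-byte UTF-8) instead of the slides' utf8 is an assumption of this example.

```php
<?php
// Hypothetical helper: open a mysqli connection and switch it to UTF-8
// before any query is run. Nothing is executed until the function is called.
function connect_utf8(string $host, string $user, string $password, string $database): mysqli
{
    $db = new mysqli($host, $user, $password, $database);
    if ($db->connect_errno) {
        throw new RuntimeException('Connection failed: ' . $db->connect_error);
    }

    // set_charset() also informs the client library, so escaping functions
    // such as real_escape_string() use the right encoding.
    if (!$db->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not switch the connection to utf8mb4');
    }

    return $db;
}

// Usage (placeholder credentials):
// $db = connect_utf8('localhost', 'user', 'secret', 'mydb');
```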

(54)

Finding out current settings

• Find out how your system is set up:

mysql> SHOW VARIABLES LIKE 'character_set%';
mysql> SHOW VARIABLES LIKE 'collation%';

(56)

Finding out what’s available

• To find out which character encodings are available and what their default collation is:

SHOW CHARACTER SET;

• To find out which collations are available:

(57)

Setting up a server

• Add the following to your /etc/my.cnf file:

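[mysqld]
...
default-character-set=utf8
default-collation=utf8_general_ci

(On MySQL 5.5 and later the corresponding [mysqld] options are named character-set-server and collation-server.)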

• If you are the only user you could even do:

(MySQL 5.x and later)

(not executed for super-user logins)

(58)

Setting up databases & tables

• Make sure that the database, the tables and the text columns are all in UTF-8:

(CREATE | ALTER) DATABASE / TABLE ... ( ...

) [DEFAULT] CHARACTER SET utf8

[[DEFAULT] COLLATE collation]

(59)

Choosing the collation

• Collation == Sort order

• Guideline to the collations:

_ci = case insensitive

_cs = case sensitive

_bin = binary

• Test !

(60)

Collation Resources

• Collation charts:

http://www.collation-charts.org/

• Unicode collation charts:

http://www.unicode.org/charts/uca/

• Examples of collation choice effects:

http://dev.mysql.com/doc/refman/5.7/en/charset-collation-effect.html

(61)

Converting an existing database

• Using MySQL’s CONVERT function you can migrate ‘old’ data:

INSERT INTO utf8table (utf8column)

SELECT CONVERT(latin1field USING utf8) FROM latin1table;

• For a complete php script to convert your database:

(62)

Querying a database

• You can specify the collation to use for a specific query:

SELECT k FROM t1

ORDER BY k COLLATE utf8_spanish_ci;

• You can even use it in the WHERE clause:

SELECT * FROM t1

WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;

(63)

Common issue

• If you run into the following error message when running a query:

Illegal mix of collations (utf8_bin,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation

You may want to try and make your query more explicit with a character string literal:

SELECT * FROM table

(64)

GetText / .po files

• Poedit understands all encodings supported by the operating system and works in Unicode internally: http://www.poedit.net/

(66)

BOM ! Run for your life !

• ï»¿ or extra blank line

• BOM = Byte Order Mark

The character is the ZERO WIDTH NON-BREAKING SPACE. If not placed at the top, it shouldn’t give any problems.

• In UTF-16/32 the BOM is necessary to

(68)

Bom squad

• Some browsers may display the BOM.

• ‘Headers already sent’ problems.

It can also cause problems when in front of “#!” in a shell script.

• Check your editor settings

• Open the file in (another) editor; if you can see the BOM, delete it manually and save the file.

Sometimes even just opening the file and saving it again as UTF-8 will solve it.
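If files with a BOM keep turning up, stripping it in code is another option. A minimal sketch; strip_utf8_bom() is a hypothetical helper name and the file path in the usage line is a placeholder.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 byte order mark (bytes EF BB BF), if present.
function strip_utf8_bom(string $contents): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($contents, $bom, 3) === 0 ? substr($contents, 3) : $contents;
}

// Usage (placeholder path):
// $clean = strip_utf8_bom(file_get_contents('/path/to/file.txt'));

var_dump(strip_utf8_bom("\xEF\xBB\xBFHello"));   // "Hello"
var_dump(strip_utf8_bom("Hello"));               // unchanged
```

Only a BOM at the very start is removed; the same character further down in the text is left alone.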

(69)

Useful BOM resources:

• W3C on the BOM character:

http://www.w3.org/International/questions/qa-utf8-bom

• Webbased BOM-testing:

http://people.w3.org/rishida/utils/bomtester/

• Unicode Consortium on BOM character:

http://www.unicode.org/unicode/faq/utf_bom.html#bom1

(70)

We’ve covered:

(71)

Keep in touch!

(I’m self-employed, you can hire me ;-) )

Juliette Reinders Folmer

Web: http://www.adviesenzo.nl/

LinkedIn: http://nl.linkedin.com/in/julietterf

Twitter: http://twitter.com/jrf_nl

GitHub: http://github.com/jrfnl/

Please rate this talk on joined.in/9519

Endorsements and recommendations on

(72)