Juliette Reinders Folmer | Advies en zo
Everything you always wanted to know about UTF-8 *
“Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.”
Mark Pilgrim, April 2004
“Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.”
J. Graham, April 2004
Some common misconceptions
• Unicode !== UTF-8
• UTF-8 !== internationalization
• UTF-8 !== charset
Why worry about it anyway ?
• It’s all about being prepared:
– Company/Client gets taken over by a foreign company
– Mergers
– Expansion to other regions
– Local users/employees from other origins
• Code efficiency
• Cost
Some language statistics
• 7105 ‘living’ languages
• +/- 308 languages with > 1 million speakers
• Nr 1 language in the world ? Mandarin Chinese
• Nr 2 ? Spanish
Did you know:
• That Germany has 27* officially recognized languages ?
• That the country with the most languages is Papua New Guinea (836) ?
* Alemannic, Bavarian, Danish, Frankish, Eastern Frisian, Northern Frisian, Standard German, Kabardian, Kölsch, Limburgish, Luxembourgeois, Mainfränkisch, Pfaelzisch, Plautdietsch, Polish, Balkan Romani, Sinte Romani, Vlax Romani, Saterfriesisch, Low Saxon, Upper Saxon, Lower Sorbian, Upper Sorbian, Swabian, Westphalien, Yeniche, Western Yiddish.
Top 20 languages in the world *
Rank  Language                  Total speakers (M)  Script
1     Mandarin Chinese          1,197               Vernacular Chinese
2     Spanish                   406                 Roman
3     English                   335                 Roman
4     Hindi                     260                 Devanagari
5     Arabic (standard)         223                 Arabic
6     Portuguese                202                 Roman
7     Bengali                   193                 Bengali
8     Russian                   162                 Cyrillic
9     Japanese                  122                 Hiragana, Katakana, and Kanji
10    Javanese                  84                  Javanese
11    German (standard)         84                  Roman
12    Lahnda (Western Punjabi)  83                  Lahnda, Arabic
13    Telugu                    74                  Telugu
14    Marathi                   72                  Devanagari
15    Tamil                     69                  Tamil
16    French                    69                  Roman
17    Vietnamese                68                  Roman
18    Korean                    66                  Korean (Hangul)
19    Urdu                      63                  Arabic
20    Italian                   61                  Roman
* Source: Ethnologue 2013
About writing systems
• There are approximately 180 writing systems in active use
• Most are used (with or without extensions) for several languages
• Some languages use more than one writing system
• Numerous other writing systems for ceremonial or religious use
• Or for fun ;-)
Distribution of writing systems
Writing system resources
• Info on languages:
http://www.ethnologue.com/
• Info on writing systems:
http://www.omniglot.com/
• Which characters are used in language X ?
http://www.eki.ee/letter/
• Info on Latin extensions for African languages
http://www.bisharat.net/A12N/
• And of course:
http://en.wikipedia.org/wiki/Writing_system
There Ain't No Such Thing As Plain Text
On character sets and encoding
“A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points.”
(W3C)
“The character encoding reflects the way
these abstract characters are mapped to
Unicode
“Unicode is a computing industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.”
(Wikipedia)
• Unicode Code charts:
UTF
• UTF = Unicode Transformation Format
• UTF-8 is one of the character encodings for implementing Unicode
• Alternatives are UTF-7 (legacy), UTF-16, UTF-32
• UTF-8 is (backward) compatible with ASCII, UTF-16/32 are not.
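To make the ASCII compatibility concrete, here is a minimal sketch that prints the bytes of the same text in UTF-8 and in UTF-16. It assumes PHP 7+ (for the "\u{...}" escape) and the mbstring extension; the helper name encoded_hex() is made up for this illustration.

```php
<?php
// Hypothetical helper: hex dump of $utf8_text after converting it to $encoding.
function encoded_hex( string $utf8_text, string $encoding ): string {
    return bin2hex( mb_convert_encoding( $utf8_text, $encoding, 'UTF-8' ) );
}

echo encoded_hex( 'A', 'UTF-8' ), "\n";         // 41   (identical to the ASCII byte)
echo encoded_hex( 'A', 'UTF-16BE' ), "\n";      // 0041 (two bytes, not ASCII compatible)
echo encoded_hex( "\u{E9}", 'UTF-8' ), "\n";    // c3a9 (é takes two bytes in UTF-8)
echo encoded_hex( "\u{E9}", 'UTF-16BE' ), "\n"; // 00e9
```

A pure ASCII file is therefore already valid UTF-8, while the same file read as UTF-16 is not.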
Advantages of UTF-8
• Backward compatible with ASCII
• UTF-8 can encode any Unicode character
• XML parsers are required to support UTF-8 and UTF-16
• UTF-8 and UTF-16 are the standards for having Unicode in HTML. UTF-8 is preferred.
• Can be fairly reliably recognized with small chance of confusion.
• Sorting UTF-8 as arrays of unsigned bytes will result in the same order as sorting on Unicode code points
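The sorting property can be checked with a small sketch. It assumes PHP 7.2+ with mbstring (for mb_ord()); the sample characters are arbitrary and each string holds exactly one character.

```php
<?php
$chars = [ "\u{20AC}", 'z', "\u{E9}", 'a', "\u{1F600}" ]; // €, z, é, a, 😀

// Byte-wise sort: strcmp() compares the raw bytes as unsigned values.
$by_bytes = $chars;
usort( $by_bytes, 'strcmp' );

// Code point sort: compare the Unicode code point of each character.
$by_code_point = $chars;
usort( $by_code_point, function ( $a, $b ) {
    return mb_ord( $a, 'UTF-8' ) <=> mb_ord( $b, 'UTF-8' );
} );

var_dump( $by_bytes === $by_code_point ); // bool(true)
```

Note that this is code point order, not the alphabetical order a user expects; language-aware sorting needs a collation.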
So, what’s the problem ?
• Everything defaults to non-UTF-8
Mostly Latin writing system, ISO-8859-1 or US-ASCII
So, what’s the solution ?
• Be EXPLICIT everywhere (and I don’t mean $%&@-explicit)
We’ll be covering:
• Dependency on user’s computer setup
• Client side code – HTML, CSS, JS
• Communication with the client
• Server side code – PHP
• Communicating with a MySQL database
• MySQL
• Communicating with files
• Other common issues
User’s computer
Potential issues:
• Extended language support ?
• Code pages ?
• Font ?
Characteristics of text
Characteristic       Example 1              Example 2
Language             English                Arabic
Writing system       Roman/Latin            Arabic
Writing direction    Left to right          Right to left
Writing direction    Top to bottom          Top to bottom
Character (sub)set   Basic Latin            Arabic
Character encoding   UTF-16 (can vary)      UTF-16 (...)
Font                 Arial                  Arial
Meaning              “What I really love”   “ار ا يذ ا ا”
Languages:
• English
• Greek
• Ukrainian
• Mandarin Chinese
• Japanese
• Hindi
• Korean
• Kannada
• Punjabi Gurmukhi
• Tamil
• Tigre
• Myanmar
• Arabic
• Farsi
• Hebrew
About Fonts
• Unicode versus non-unicode fonts
Be aware & be wary !
• Few fonts capable of handling a wide range of Unicode characters.
Examples:
Arial Unicode MS, Bitstream Cyberbit, Code2000, GNU Unifont
Fonts used:
• Verdana
• Verdana
• Verdana
• SimSun
• MS Mincho
• Code2000
• Batang
• Arial Unicode MS
• Lohit Punjabi
• Latha
• GS GeezMahtemUnicode
• WinInnwa
• Arial Unicode MS
• Arial
• Arial Unicode MS
These are the same phrases converted to the Verdana font.
WinInnwa is a non-Unicode compliant font...
To stress the importance of Unicode and
Unicode-compliant fonts:
These are the same phrases again, now converted to the Arial Unicode MS font.
The characters of the Ge’ez script (Ethiopic range) are not included in Arial Unicode MS.
Useful font-related resources:
• Gallery of Unicode fonts – find fonts per writing system/language:
http://www.wazu.jp/
• Unicode test pages and more:
http://www.alanwood.net/unicode/
• Unicode typefaces:
http://en.wikipedia.org/wiki/Unicode_typefaces
• Font frequency on computers:
Some font viewers:
• Free and easy Font viewer:
http://www.styopkin.com/details_free_and_easy_fonts_viewer.html
• ListFont:
http://www.heiner-eichmann.de/software/listfont/listfont.htm
Client side
• Always declare the character encoding
Client side – HTML
• Use meta-headers:
<meta http-equiv="Charset" content="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
• Tell the browser (and the search engines) the language too if you can:
<html dir="[DIRECTION – ltr/rtl]"
lang="[LANGCODE – eg de-DE or zh-cn for Mandarin Chinese]">
<meta name="language" content="[LANGCODE]">
<meta http-equiv="Content-Language" content="[LANGCODE]">
Client side – XHTML
• Add an XML declaration at the top
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="[LANGCODE]" lang="[LANGCODE]">
Client side – CSS
• Add an encoding to the CSS file – must be on the very first line of the file!
@charset "utf-8";
• Use Unicode compliant fonts and tell the browser which to use with CSS
<p lang="zh-cn">我很喜欢</p>
P[LANG|="zh"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif;}
P[LANG="zh-cn"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif;}
P[LANG|="ar"] { font-family: Arial, "Arial Unicode MS", sans-serif; direction: rtl; }
Useful client side resources:
• W3C on best practices for internationalization:
http://www.w3.org/International/techniques/authoring-html
• ISO language codes:
http://www.sil.org/iso639-3/codes.asp
• ISO country codes:
http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements
• W3C on Language tags in HTML and XML:
http://www.w3.org/International/articles/language-tags/
Putting files on the server
• Save the file(s) as encoded in UTF-8
• Don’t forget to upload the file as binary rather than ascii !
URL’s
Client-server communication
Sending data to the client
• Send an HTTP header:
header( 'Content-Type: text/html; charset=utf-8' );
• .htaccess/ Apache’s httpd.conf
Client-server communication
.htaccess examples
• # Maps file extensions to a character encoding. Especially useful in content negotiation situations. (httpd.conf)
AddCharset utf-8 .utf8
• # Pass the default character encoding for content-type text/plain and text/html
AddDefaultCharset On|Off|charset
AddDefaultCharset UTF-8
AddDefaultCharset On => iso-8859-1
• # Add a default character encoding per file extension
AddType 'text/html; charset=UTF-8' html
• # Identify the encoding for a particular file:
<Files ~ "events\.html">
ForceType 'text/html; charset=UTF-8'
</Files>
Client-server communication
Receiving data from the client
• Make sure you send information to the server in the correct encoding
• This is especially important for user-input, i.e. Forms!!!
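On the receiving end it helps to verify that what arrives really is UTF-8 before using it. The following is a minimal sketch, assuming PHP 7.1+ and the mbstring extension; the function name get_utf8_input() and the form field 'name' are hypothetical. In the HTML form itself, the accept-charset="utf-8" attribute asks the browser to submit the data as UTF-8.

```php
<?php
// Hypothetical helper: return the submitted value only if it is valid UTF-8,
// and null if the field is missing, not a string, or malformed.
function get_utf8_input( array $source, string $key ): ?string {
    if ( ! isset( $source[ $key ] ) || ! is_string( $source[ $key ] ) ) {
        return null;
    }
    return mb_check_encoding( $source[ $key ], 'UTF-8' ) ? $source[ $key ] : null;
}

// Usage with form data ('name' is a made-up field name).
$name = get_utf8_input( $_POST, 'name' );
```

Rejecting malformed input is only one possible policy; converting from a known legacy encoding is another.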
PHP
• Currently not very friendly for UTF-8
• PHP6 development dormant
• Some PHP extensions come to the rescue:
MBstring
iconv
~Intl
• There are also some nifty function collections / classes available to help you.
Take note of:
PHP UTF-8 safe functions
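Safe:
• explode()
• str_replace()
• PHP5+ ~htmlentities()
NOT Safe:
• Everything else
Test for well-formedness
function utf8_compliant( $string ) {
    if ( strlen( $string ) == 0 ) {
        return true;
    }
    // With the /u modifier PCRE first validates the whole subject as UTF-8;
    // preg_match() returns false (not 1) when the string is malformed.
    return ( preg_match( '/^.{1}/us', $string, $array ) == 1 );
}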
PCRE
• PCRE can be relatively UTF-8 safe if compiled with Unicode.
Use: preg_match('/^.+$/u', $string);
• Test whether PCRE has been compiled with Unicode support:
if ( preg_match( '/^.{1}$/u', "ñ", $UTF8_ar ) != 1 ) {
    trigger_error( 'PCRE is not compiled with UTF-8 support', E_USER_ERROR );
}
Handling text
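• You don’t need htmlentities() anymore. Use htmlspecialchars() instead:
$html = htmlspecialchars( $utf8_string, ENT_COMPAT, 'UTF-8' );
• strlen() will count bytes, so use:
function utf8_strlen( $string ) {
    // utf8_decode() maps every UTF-8 character to a single byte (or "?"),
    // so strlen() then counts characters. Deprecated as of PHP 8.2.
    return strlen( utf8_decode( $string ) );
}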
utf8_encode() & utf8_decode()
• Only useful for converting between ISO-8859-1 and UTF-8
MBstring extension
• Multibyte aware implementations of some of the most common PHP string functions, the POSIX extended regex extension and the mail function.
• Mbstring supports many different character sets, most importantly UTF-8.
• Allows for conversion between character sets and implements some level of encoding
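A short sketch of the difference between the byte-oriented string functions and their mbstring counterparts, assuming PHP 7+ and the mbstring extension; the sample words are arbitrary.

```php
<?php
mb_internal_encoding( 'UTF-8' ); // default encoding for the mb_* calls below

$text = "Gr\u{FC}\u{DF}e"; // "Grüße"

echo strlen( $text ), "\n";                  // 7 (bytes)
echo mb_strlen( $text ), "\n";               // 5 (characters)
echo mb_substr( $text, 0, 3 ), "\n";         // Grü
echo mb_strtoupper( "\u{E9}t\u{E9}" ), "\n"; // ÉTÉ
```

Passing the encoding explicitly as the last argument of each mb_* function is an alternative to setting it once with mb_internal_encoding().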
Iconv extension
• Bundled since PHP 5+.
• Main purpose of iconv : converting between different character sets.
• From PHP 5+, iconv has implementations of some common string functions, but is slower than mbstring for UTF-8.
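As an illustration of iconv's main purpose, this sketch converts an ISO-8859-1 string to UTF-8 and counts its characters. It assumes the iconv extension is available; the sample name is arbitrary.

```php
<?php
// "Müller" as ISO-8859-1 bytes: ü is the single byte 0xFC.
$latin1 = "M\xFCller";

// Convert to UTF-8; iconv() returns false on failure.
$utf8 = iconv( 'ISO-8859-1', 'UTF-8', $latin1 );

echo bin2hex( $utf8 ), "\n";               // 4dc3bc6c6c6572
echo iconv_strlen( $utf8, 'UTF-8' ), "\n"; // 6
```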
Intl extension
• Bundled since PHP 5.3+, but not always enabled.
• Modules:
– Collator
– Number Formatter
– Message Formatter
– Normalizer
– Locale
Useful resources
• http://www.php.net/mbstring
• http://www.php.net/iconv
• http://www.php.net/intl
• http://www.php.net/regexp.reference.unicode
• http://www.phpwact.org/php/i18n
• http://sourceforge.net/projects/phputf8
Communicating with MySQL
• The connection between PHP and MySQL defaults to a latin1 connection.
• The first query you should run after making your connection:
mysql_query( 'SET NAMES "utf8" [COLLATE "collation_name"]' );
OR
mysql_query( 'SET CHARACTER SET utf8' );
• PHP 5.2+:
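The mysql_* functions used above were removed in PHP 7. With the mysqli extension the same idea could look like the sketch below. The function name connect_utf8() and all connection parameters are placeholders, and the default of utf8mb4 is an assumption on my part: MySQL's "utf8" stores at most three bytes per character, while utf8mb4 (MySQL 5.5.3+) covers all of Unicode.

```php
<?php
// Hypothetical helper: open a mysqli connection that talks UTF-8.
// Host, credentials and database name are placeholders supplied by the caller.
function connect_utf8( string $host, string $user, string $pass, string $db, string $charset = 'utf8mb4' ): mysqli {
    $link = new mysqli( $host, $user, $pass, $db );
    if ( $link->connect_errno ) {
        throw new RuntimeException( 'Connect failed: ' . $link->connect_error );
    }
    // Unlike a manual SET NAMES query, set_charset() also informs the client
    // library, so real_escape_string() works with the right character set.
    if ( ! $link->set_charset( $charset ) ) {
        throw new RuntimeException( 'Could not set charset: ' . $link->error );
    }
    return $link;
}
```

With PDO, the equivalent is adding charset=utf8mb4 to the DSN string.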
Finding out current settings
• Find out how your system is set up:
mysql> SHOW VARIABLES LIKE 'character_set%';
mysql> SHOW VARIABLES LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | latin1            |
| character_set_connection | latin1            |
| character_set_database   | latin1            |
| character_set_results    | latin1            |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | latin1_swedish_ci |
| collation_database       | latin1_swedish_ci |
| collation_server         | latin1_general_ci |
+--------------------------+-------------------+
Finding out what’s available
• To find out which character encodings are available and what their default collation is:
SHOW CHARACTER SET;
• To find out which collations are available:
Setting up a server
• Add the following to your /etc/my.cnf file:
[mysqld]
...
default-character-set=utf8
default-collation=utf8_general_ci
• If you are the only user you could even do:
(MySQL 5.x and later)
(not executed for super-user logins)
Setting up databases & tables
• Make sure that the database, the tables and the text columns are all in UTF-8:
(CREATE | ALTER) DATABASE / TABLE ... ( ...
) [DEFAULT] CHARACTER SET utf8
[[DEFAULT] COLLATE collation]
Choosing the collation
• Collation == Sort order
• Guideline to the collations:
_ci = case insensitive
_cs = case sensitive
_bin = binary
• Test !
Collation Resources
• Collation charts:
http://www.collation-charts.org/
• Unicode collation charts:
http://www.unicode.org/charts/uca/
• Examples of collation choice effects:
http://dev.mysql.com/doc/refman/5.7/en/charset-collation-effect.html
Converting an existing database
• Using MySQL’s CONVERT function you can migrate ‘old’ data:
INSERT INTO utf8table (utf8column)
SELECT CONVERT(latin1field USING utf8) FROM latin1table;
• For a complete php script to convert your database:
Querying a database
• You can specify the collation to use for a specific query:
SELECT k FROM t1
ORDER BY k COLLATE utf8_spanish_ci;
• You can even use it in the WHERE clause:
SELECT * FROM t1
WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
Common issue
• If you run into the following error message when running a query:
Illegal mix of collations
(utf8_bin,IMPLICIT) and
(latin1_swedish_ci,COERCIBLE) for operation
You may want to try and make your query more explicit with a character string literal:
SELECT * FROM table
GetText / .po files
• Poedit understands all encodings supported by operating system and works in Unicode internally: http://www.poedit.net/
BOM ! Run for your life !
•  or extra blank line
• BOM = Byte Order Mark
The character is the ZERO WIDTH NO-BREAK SPACE (U+FEFF). If not placed at the top, it shouldn’t give any problems.
• In UTF-16/32 the BOM is necessary to indicate the byte order
Bom squad
• Some browsers may display the BOM.
• ‘headers already sent’ problems.
It can also cause problems when in front of “#!” in a shell script.
• Check your editor settings
• Open the file in (another) editor, if you can see the BOM manually delete it and save the file.
Sometimes even just opening the file and saving it again as UTF-8 will solve it.
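When the file contents are read by PHP anyway, a leading UTF-8 BOM can also be stripped in code. A minimal sketch, assuming PHP 7+; the function name strip_utf8_bom() is made up for this example.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) from a string.
function strip_utf8_bom( string $content ): string {
    if ( strncmp( $content, "\xEF\xBB\xBF", 3 ) === 0 ) {
        return substr( $content, 3 );
    }
    return $content;
}

var_dump( strip_utf8_bom( "\xEF\xBB\xBFhello" ) === 'hello' ); // bool(true)
```

This only helps for data read at runtime (for example via file_get_contents()); a BOM at the top of the PHP script itself still has to be removed in the editor.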
Useful BOM resources:
• W3C on the BOM character:
http://www.w3.org/International/questions/qa-utf8-bom
• Webbased BOM-testing:
http://people.w3.org/rishida/utils/bomtester/
• Unicode Consortium on BOM character:
http://www.unicode.org/unicode/faq/utf_bom.html#bom1
We’ve covered:
• Dependency on user’s computer setup
• Client side code – HTML, CSS, JS
• Communication with the client
• Server side code – PHP
• Communicating with a MySQL database
• MySQL
• Communicating with files
• Other common issues
Keep in touch!
(I’m self-employed, you can hire me ;-) )
Juliette Reinders Folmer
Email: juliette@adviesenzo.nl
Web: http://www.adviesenzo.nl/
LinkedIn: http://nl.linkedin.com/in/julietterf
Twitter: http://twitter.com/jrf_nl
GitHub: http://github.com/jrfnl/
Please rate this talk on joined.in/9519
Endorsements and recommendations on