• No results found

How to represent characters?

N/A
N/A
Protected

Academic year: 2021

Share "How to represent characters?"

Copied!
56
0
0

Loading.... (view fulltext now)

Full text

(1)

Text

Python

Text

Copyright © Software Carpentry 2010

(2)
(3)

How to represent characters? American English in the 1960s:

(4)

How to represent characters? American English in the 1960s:

26 characters × {upper, lower} 26 characters × {upper, lower}

(5)

How to represent characters? American English in the 1960s:

26 characters × {upper, lower} 26 characters × {upper, lower} + 10 digits

(6)

How to represent characters? American English in the 1960s:

26 characters × {upper, lower} 26 characters × {upper, lower} + 10 digits

(7)

How to represent characters? American English in the 1960s:

26 characters × {upper, lower} 26 characters × {upper, lower} + 10 digits

+ punctuation

+ special characters for controlling teletypes (new line, carriage return, form feed, bell, …) (new line, carriage return, form feed, bell, …)

(8)

How to represent characters? American English in the 1960s:

26 characters × {upper, lower} 26 characters × {upper, lower} + 10 digits

+ punctuation

+ special characters for controlling teletypes (new line, carriage return, form feed, bell, …) (new line, carriage return, form feed, bell, …) = 7 bits per character (ASCII standard)

(9)
(10)

How to represent text? 1. Fixed-width records

(11)

How to represent text? 1. Fixed-width records

A crash reduces A crash reduces

your expensive computer to a simple stone.

(12)

How to represent text? 1. Fixed-width records

A crash reduces A crash reduces

your expensive computer to a simple stone.

A c r a s h r e d u c e s · · · · y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e . · · · · ·

(13)

How to represent text? 1. Fixed-width records

A crash reduces A crash reduces

your expensive computer to a simple stone.

Easy to get to line N

A c r a s h r e d u c e s · · · · y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e . · · · · ·

(14)

How to represent text? 1. Fixed-width records

A crash reduces A crash reduces

your expensive computer to a simple stone.

Easy to get to line N

A c r a s h r e d u c e s · · · · y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e . · · · · ·

Easy to get to line N But may waste space

(15)

How to represent text? 1. Fixed-width records

A crash reduces A crash reduces

your expensive computer to a simple stone.

Easy to get to line N

A c r a s h r e d u c e s · · · · y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e . · · · · ·

Easy to get to line N But may waste space

(16)

How to represent text? 1. Fixed-width records

2. Stream with embedded end-of-line markers 2. Stream with embedded end-of-line markers

(17)

How to represent text? 1. Fixed-width records

2. Stream with embedded end-of-line markers 2. Stream with embedded end-of-line markers

A crash reduces

your expensive computer to a simple stone.

A c r a s h r e d u c e s y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e .

(18)

How to represent text? 1. Fixed-width records

2. Stream with embedded end-of-line markers 2. Stream with embedded end-of-line markers

A crash reduces

your expensive computer to a simple stone.

A c r a s h r e d u c e s y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e .

(19)

How to represent text? 1. Fixed-width records

2. Stream with embedded end-of-line markers 2. Stream with embedded end-of-line markers

A crash reduces

your expensive computer to a simple stone.

A c r a s h r e d u c e s y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e .

More flexible

(20)

How to represent text? 1. Fixed-width records

2. Stream with embedded end-of-line markers 2. Stream with embedded end-of-line markers

A crash reduces

your expensive computer to a simple stone.

A c r a s h r e d u c e s y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e .

(21)

How to represent text? 1. Fixed-width records

2. Stream with embedded end-of-line markers 2. Stream with embedded end-of-line markers

A crash reduces

your expensive computer to a simple stone.

A c r a s h r e d u c e s y o u r e x p e n s i v e c o m p u t e r t o a s i m p l e s t o n e .

More flexible

Wastes less space

Skipping ahead is harder

(22)
(23)

Unix: newline ('\n')

(24)

Unix: newline ('\n')

Windows: carriage return + newline ('\r\n') Oh dear…

(25)

Unix: newline ('\n')

Windows: carriage return + newline ('\r\n') Oh dear…

Oh dear…

(26)

Unix: newline ('\n')

Windows: carriage return + newline ('\r\n') Oh dear…

Oh dear…

Python converts '\r\n' to '\n' and back on Windows To prevent this (e.g., when reading image files)

(27)

Unix: newline ('\n')

Windows: carriage return + newline ('\r\n') Oh dear…

Oh dear…

Python converts '\r\n' to '\n' and back on Windows To prevent this (e.g., when reading image files)

open the file in

binary mode

reader = open('mydata.dat', 'rb') reader = open('mydata.dat', 'rb')

(28)
(29)

Back to characters…

(30)

Back to characters…

How to represent ĕ, β, Я, …? 7 bits = 0…127

(31)

Back to characters…

How to represent ĕ, β, Я, …? 7 bits = 0…127

7 bits = 0…127

(32)

Back to characters…

How to represent ĕ, β, Я, …? 7 bits = 0…127

7 bits = 0…127

8 bits (a byte) = 0…255

Different companies/countries defined different meanings for 128...255

(33)

Back to characters…

How to represent ĕ, β, Я, …? 7 bits = 0…127

7 bits = 0…127

8 bits (a byte) = 0…255

Different companies/countries defined different meanings for 128...255

Did not play nicely together Did not play nicely together

(34)

Back to characters…

How to represent ĕ, β, Я, …? 7 bits = 0…127

7 bits = 0…127

8 bits (a byte) = 0…255

Different companies/countries defined different meanings for 128...255

Did not play nicely together Did not play nicely together

(35)
(36)

1990s: Unicode standard

(37)

1990s: Unicode standard

Defines mapping from characters to integers Does

not

specify how to store those integers Does

not

specify how to store those integers

(38)

1990s: Unicode standard

Defines mapping from characters to integers Does

not

specify how to store those integers Does

not

specify how to store those integers 32 bits per character will do it...

(39)

1990s: Unicode standard

Defines mapping from characters to integers Does

not

specify how to store those integers Does

not

specify how to store those integers 32 bits per character will do it...

(40)

1990s: Unicode standard

Defines mapping from characters to integers Does

not

specify how to store those integers Does

not

specify how to store those integers 32 bits per character will do it...

...but wastes a lot of space in common cases Use in memory (for speed)

(41)

1990s: Unicode standard

Defines mapping from characters to integers Does

not

specify how to store those integers Does

not

specify how to store those integers 32 bits per character will do it...

...but wastes a lot of space in common cases Use in memory (for speed)

Use something else on disk and over the wire Use something else on disk and over the wire

(42)

(Almost) everyone uses a

variable-length encoding

(43)

(Almost) everyone uses a

variable-length encoding

called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each First 128 characters (old ASCII) stored in 1 byte each

(44)

(Almost) everyone uses a

variable-length encoding

called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each First 128 characters (old ASCII) stored in 1 byte each Next 1920 stored in 2 bytes, etc.

(45)

(Almost) everyone uses a

variable-length encoding

called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each First 128 characters (old ASCII) stored in 1 byte each Next 1920 stored in 2 bytes, etc.

(46)

(Almost) everyone uses a

variable-length encoding

called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each First 128 characters (old ASCII) stored in 1 byte each Next 1920 stored in 2 bytes, etc.

0xxxxxxx 7 bits 110yyyyy 10xxxxxx 11 bits

(47)

(Almost) everyone uses a

variable-length encoding

called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each First 128 characters (old ASCII) stored in 1 byte each Next 1920 stored in 2 bytes, etc.

0xxxxxxx 7 bits 110yyyyy 10xxxxxx 11 bits 1110zzzz 10yyyyyy 10xxxxxx 16 bits 1110zzzz 10yyyyyy 10xxxxxx 16 bits

(48)

(Almost) everyone uses a

variable-length encoding

called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each First 128 characters (old ASCII) stored in 1 byte each Next 1920 stored in 2 bytes, etc.

0xxxxxxx 7 bits 110yyyyy 10xxxxxx 11 bits 1110zzzz 10yyyyyy 10xxxxxx 16 bits 1110zzzz 10yyyyyy 10xxxxxx 16 bits 11110www 10zzzzzz 10yyyyyy 10xxxxxx 21 bits

(49)

(Almost) everyone uses a

variable-length encoding

called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each First 128 characters (old ASCII) stored in 1 byte each Next 1920 stored in 2 bytes, etc.

0xxxxxxx 7 bits 110yyyyy 10xxxxxx 11 bits 1110zzzz 10yyyyyy 10xxxxxx 16 bits 1110zzzz 10yyyyyy 10xxxxxx 16 bits 11110www 10zzzzzz 10yyyyyy 10xxxxxx 21 bits

(50)
(51)

Python 2.* provides two kinds of string Classic: one byte per character

(52)

Python 2.* provides two kinds of string Classic: one byte per character

Unicode: "big enough" per character Unicode: "big enough" per character

(53)

Python 2.* provides two kinds of string Classic: one byte per character

Unicode: "big enough" per character Unicode: "big enough" per character Write u'the string' for Unicode

(54)

Python 2.* provides two kinds of string Classic: one byte per character

Unicode: "big enough" per character Unicode: "big enough" per character Write u'the string' for Unicode

Must specify

encoding

when converting from Unicode to bytes

(55)

Python 2.* provides two kinds of string Classic: one byte per character

Unicode: "big enough" per character Unicode: "big enough" per character Write u'the string' for Unicode

Must specify

encoding

when converting from Unicode to bytes

Use UTF-8

Use UTF-8

(56)

October 2010 created by

Greg Wilson

References

Related documents

Some of the papers cover several pressures (for example both river flows and nutrients), and some papers consider some pressures en route to estimating impacts on other pressures

Criteria that can be used to make a list of events that are important for monitoring of drug safety are ‘trigger for drug withdrawal’, ‘trigger for black box warning’, ‘leading

The Daily Balance is calculated by taking the beginning balance of purchases and other charges on your Account, adding any new purchases, debit adjustments for purchases and

Department of Economy, Political Science and Administration, University of Aalborg. 12 Demokrati og adminitrativ reform i

establish the effectiveness of extra mural centers in providing learners support services found areas of strength and weakness like lack of ICT, library and learning resources,

RealNetworks offers free access to Internet radio and some music videos on its Web site, recently including an exclusive live concert clip of The Vines, as well as paid content

Mobile devices, such as smartphones and tablets, are becoming more popular everyday. This new devices’ class is constantly evolving on what concerns computing capabilities,