(1)

CompSci 201

Huffman Coding and More

Owen Astrachan Jeff Forbes

December 1, 2017

12/1/17 CompSci 201, Fall 2017, Huff and More 1

(2)

W is for …

• World Wide Web

• www.totallyfun.org

• Wiki

• Every person’s way to share

• Wifi

• We need this everyday

• Windows

• From OS to …

(3)

PFTD

• Review of Huffman Compression

• Key aspects of compression at a high-level

• Decompression with more detail

• Looking at bits and information

• What's in a .jpeg compared to .mp3

• WOTO and finishing the semester in 201

(4)

A tale of two disks

• 10 MB for $2,990 in 1981

• 64GB for $29.90 in 2018

(5)

Lossy v Lossless Compression

• RAW format compared to JPEG format

• Tradeoffs – another example of "it depends"

• Why do you ZIP files/folders?

• Upload to Dropbox/Google Drive

• What are advantages of MP3

• You were 0-3 years old

(6)

Huffman is Optimal

• We create an encoding for each 8-bit character

• Can’t do better than this on per-character basis

• Normally ‘A’ is 65 and ‘Z’ is 90 (ASCII/Unicode)

• A is 01000001 and Z is 01011010

• Why does this make sense? 8- or 16-bit/char

• Why doesn’t this make sense?
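The fixed-width patterns above can be reproduced with a minimal sketch. The class and method names (`FixedWidthSketch`, `toEightBits`) are illustrative assumptions, not course code.

```java
// Illustrative sketch; names are assumptions, not part of the course code.
class FixedWidthSketch {
    // Returns the low 8 bits of c as a zero-padded binary string.
    static String toEightBits(char c) {
        String bits = Integer.toBinaryString(c & 0xFF);
        StringBuilder padded = new StringBuilder();
        // Left-pad with zeros so every character uses exactly 8 bits.
        for (int i = bits.length(); i < 8; i++) {
            padded.append('0');
        }
        return padded.append(bits).toString();
    }

    public static void main(String[] args) {
        System.out.println("A = " + (int) 'A' + " = " + toEightBits('A')); // A = 65 = 01000001
        System.out.println("Z = " + (int) 'Z' + " = " + toEightBits('Z')); // Z = 90 = 01011010
    }
}
```

Every character costs the same 8 bits here, no matter how often it occurs, which is the point the last bullet questions.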

(7)

Leveraging Redundancy

• If there are 1,000 “A” and 10 “Z” characters …

• Use fewer bits for “A” and more bits for “Z”

• Huffman treats all A’s equally, no context

• A as first letter in a file is the same as last letter

• Other compression techniques can do better

• Faster and better compression, more complex

(8)

Summary of Huff Compress

Count how many times every character occurs

Character is 8-bit “chunk”, use .readBits(8)

Create a Huffman Trie/Tree, greedy algorithm PQ

Infrequent chars are far away from root

Frequent chars are close to root

Create encodings from trie to write compressed file

Reset/reread file, look up encoding, write out
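The first step, counting 8-bit chunks, might look like the sketch below. It assumes a plain `java.io.InputStream` in place of the course's bit-stream class and `.readBits(8)`; `CountSketch` and `countChunks` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch only: InputStream.read() stands in for the course's readBits(8).
class CountSketch {
    // Counts how often each 8-bit value (0..255) occurs in the stream.
    static int[] countChunks(InputStream in) throws IOException {
        int[] frequencies = new int[256];
        int chunk;
        while ((chunk = in.read()) != -1) { // read() yields 0..255, or -1 at end of input
            frequencies[chunk]++;
        }
        return frequencies;
    }

    // Convenience overload for in-memory data.
    static int[] countChunks(byte[] data) {
        try {
            return countChunks(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] freq = countChunks("go go gophers".getBytes(StandardCharsets.US_ASCII));
        System.out.println("g=" + freq['g'] + " o=" + freq['o'] + " space=" + freq[' ']); // g=3 o=3 space=2
    }
}
```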

(9)

Starting to code… what first?

• If you write compress first, how to test

• The bits written aren't "readable" as is

• Shadow each .writeBits call with a .println?

• If you write decompress first, how to test?

• Until you've got a compressed file …

• We'll provide several compressed files!

(10)

Huffman Trie/Tree

[Figure: Huffman trie with root weight 60. The leaves are SPACE and the letters I, N, E, F, C, P, U, R, L, D, G, O, T, B, A, M, S, each labeled with its count.]

(11)

From counts to trie via PQ

After counting the occurrences of every 8-bit chunk, create the trie

Greedy approach: frequencies[x]= # x's in file

PriorityQueue<HuffNode> forest = new PriorityQueue<>();

for (int i = 0; i < 256; i++) {
    if (frequencies[i] > 0) { // frequencies computed elsewhere
        forest.add(new HuffNode(i, frequencies[i]));
    }
}

while (forest.size() > 1) {
    HuffNode left = forest.remove();
    HuffNode right = forest.remove();
    forest.add(new HuffNode(-1, left.weight() + right.weight(), left, right));
}

HuffNode root = forest.remove();
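For a version that runs on its own, here is a sketch of the same greedy loop with a stand-in node class. `SketchNode` is hypothetical; the course's `HuffNode` may use different fields and accessors.

```java
import java.util.PriorityQueue;

// Hypothetical stand-in for the course's HuffNode; names and fields are assumptions.
class SketchNode implements Comparable<SketchNode> {
    final int value;            // 8-bit chunk for a leaf, -1 for an interior node
    final int weight;           // total occurrences in this subtree
    final SketchNode left, right;

    SketchNode(int value, int weight, SketchNode left, SketchNode right) {
        this.value = value;
        this.weight = weight;
        this.left = left;
        this.right = right;
    }

    SketchNode(int value, int weight) {
        this(value, weight, null, null);
    }

    boolean isLeaf() {
        return left == null && right == null;
    }

    // Smaller weights come out of the priority queue first.
    @Override
    public int compareTo(SketchNode other) {
        return Integer.compare(weight, other.weight);
    }
}

class TrieBuildSketch {
    // Greedy construction: repeatedly merge the two lightest subtrees.
    static SketchNode build(int[] frequencies) {
        PriorityQueue<SketchNode> forest = new PriorityQueue<>();
        for (int i = 0; i < frequencies.length; i++) {
            if (frequencies[i] > 0) {
                forest.add(new SketchNode(i, frequencies[i]));
            }
        }
        if (forest.isEmpty()) {
            return null; // no input, so no trie
        }
        while (forest.size() > 1) {
            SketchNode left = forest.remove();
            SketchNode right = forest.remove();
            forest.add(new SketchNode(-1, left.weight + right.weight, left, right));
        }
        return forest.remove();
    }

    public static void main(String[] args) {
        int[] frequencies = new int[256];
        for (char c : "go go gophers".toCharArray()) {
            frequencies[c]++;
        }
        // The root weight equals the number of chunks counted: 13.
        System.out.println("root weight = " + build(frequencies).weight);
    }
}
```

Ties between equal weights are broken however the priority queue happens to order them, so different runs of this idea can give different tries with the same total encoded length.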

(12)

Trie used to encode & decode

• Compress: create encodings for each char/leaf

• Similar to LeafTrails APT

• Each 8-bit chunk/char mapped to encoding, e.g., in an array with codings[‘A’] == “010101”

• Decompress: Use trie/tree to decompress bits

• Trie unique to each file, part of compressed file

• Compress: write trie, Decompress: read trie
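Creating the encodings is a root-to-leaf walk that appends 0 for left and 1 for right. The sketch below uses a hypothetical `EncNode` class and a `Map` instead of the array mentioned above; both are assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal node; the course's node class may differ.
class EncNode {
    final int value;          // chunk value for a leaf, -1 for an interior node
    final EncNode left, right;

    EncNode(int value, EncNode left, EncNode right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

class EncodingSketch {
    // Maps each leaf value to its path from the root: "0" = left, "1" = right.
    static Map<Integer, String> encodings(EncNode root) {
        Map<Integer, String> codes = new TreeMap<>();
        if (root != null) {
            collect(root, "", codes);
        }
        return codes;
    }

    private static void collect(EncNode node, String path, Map<Integer, String> codes) {
        if (node.left == null && node.right == null) {
            codes.put(node.value, path); // a lone root leaf would get the empty path
            return;
        }
        // Interior nodes of a Huffman trie always have two children.
        collect(node.left, path + "0", codes);
        collect(node.right, path + "1", codes);
    }

    public static void main(String[] args) {
        EncNode trie = new EncNode(-1,
                new EncNode('a', null, null),
                new EncNode(-1, new EncNode('b', null, null), new EncNode('c', null, null)));
        System.out.println(encodings(trie)); // {97=0, 98=10, 99=11}
    }
}
```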

(13)

Compressing kjv10.txt

Encoding length   # values with this length
 3                1,159,124
 4                1,487,471
 5                  712,325
 6                  485,333
 7                  261,611
 8                   84,107
 9                   81,467
10                   48,019
11                   21,065
12                    1,863
13                    1,108
14                      664
15                      476
16                      225
17                       71
18                       44
19                       22
20                       11
21                        3
22                        6
23                        6
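The table can be turned into a size estimate by summing length times count and comparing with 8 bits per value. This is a sketch that assumes each row counts the chunks in the file that receive a code of that length; it ignores the cost of storing the trie. `SizeSketch` and its methods are hypothetical names.

```java
// Sketch: estimate compressed size from a table of (encoding length, count) rows.
class SizeSketch {
    // Total bits when counts[i] values each use lengths[i] bits.
    static long encodedBits(int[] lengths, long[] counts) {
        long total = 0;
        for (int i = 0; i < lengths.length; i++) {
            total += (long) lengths[i] * counts[i];
        }
        return total;
    }

    // Total bits when every value uses a fixed 8 bits.
    static long fixedBits(long[] counts) {
        long values = 0;
        for (long c : counts) {
            values += c;
        }
        return 8 * values;
    }

    public static void main(String[] args) {
        // Rows copied from the kjv10.txt table: lengths 3 through 23.
        long[] counts = {1159124, 1487471, 712325, 485333, 261611, 84107, 81467,
                48019, 21065, 1863, 1108, 664, 476, 225, 71, 44, 22, 11, 3, 6, 6};
        int[] lengths = new int[counts.length];
        for (int i = 0; i < lengths.length; i++) {
            lengths[i] = i + 3;
        }
        long huff = encodedBits(lengths, counts);
        long fixed = fixedBits(counts);
        System.out.println("Huffman bits: " + huff);
        System.out.println("8-bit bits:   " + fixed);
        System.out.println("ratio:        " + (double) huff / fixed);
    }
}
```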

(14)

Uncompression with Huffman

We need the trie to uncompress

• 000100100010011001101111

As we read a bit, what do we do?

Go left on 0, go right on 1

When do we stop? What to do?

How do we get the trie?

Could store 256 counts, use same code

Could store trie: read and write
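The left-on-0, right-on-1 walk can be sketched as follows. The sketch reads bits from a `String` and stops when the string runs out; a real compressed file needs its own stopping rule, which is the question posed above. `DecNode` and the three-letter trie are hypothetical.

```java
// Hypothetical minimal node; the course's node class may differ.
class DecNode {
    final int value;          // chunk value for a leaf, -1 for an interior node
    final DecNode left, right;

    DecNode(int value, DecNode left, DecNode right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

class DecodeSketch {
    // Walks the trie one bit at a time; emits a character at each leaf, then restarts at the root.
    static String decode(DecNode root, String bits) {
        StringBuilder out = new StringBuilder();
        DecNode current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.left == null && current.right == null) {
                out.append((char) current.value);
                current = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Small example trie: a = 0, b = 10, c = 11.
        DecNode trie = new DecNode(-1,
                new DecNode('a', null, null),
                new DecNode(-1, new DecNode('b', null, null), new DecNode('c', null, null)));
        System.out.println(decode(trie, "010110")); // abca
    }
}
```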

(15)

Reading and Writing Huff Trie

Similar to concept/techniques in Tree APTs

Distinguish interior and leaf nodes

In huff we label with 0 and 1 respectively

In Tree APT we store "null" explicitly

8 4 x 6 x x 12 10 x x 15 x x

Number? Read two subtrees

x? Return null, no recursion

8 [ 4 x 6 x x] [12 10 x x 15 x x]

12 [10 x x] [15 x x]
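The Tree APT style of reading can be sketched as a pre-order recursion over the tokens shown above. `AptNode` and `TreeReadSketch` are hypothetical names; the Huff version replaces the number/x tokens with bits.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical tree node for the "number or x" format.
class AptNode {
    final int info;
    final AptNode left, right;

    AptNode(int info, AptNode left, AptNode right) {
        this.info = info;
        this.left = left;
        this.right = right;
    }
}

class TreeReadSketch {
    // Pre-order read: "x" means null; a number means a node followed by its two subtrees.
    static AptNode read(Iterator<String> tokens) {
        String token = tokens.next();
        if (token.equals("x")) {
            return null; // no recursion for a missing child
        }
        AptNode left = read(tokens);
        AptNode right = read(tokens);
        return new AptNode(Integer.parseInt(token), left, right);
    }

    static AptNode parse(String text) {
        return read(Arrays.asList(text.trim().split("\\s+")).iterator());
    }

    // Pre-order write, the inverse of read.
    static String write(AptNode node) {
        if (node == null) {
            return "x";
        }
        return node.info + " " + write(node.left) + " " + write(node.right);
    }

    public static void main(String[] args) {
        AptNode root = parse("8 4 x 6 x x 12 10 x x 15 x x");
        System.out.println(write(root)); // 8 4 x 6 x x 12 10 x x 15 x x
    }
}
```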

(16)

Huff WOTO

http://bit.ly/201f17-huff-2

• How does decompress have access to Trie?

• When does decompressing stop, how many bits are written?

(17)

Anita Borg: 1949-2003

“Dr. Anita Borg tenaciously envisioned and set about to change the world for women and for technology. … she fought tirelessly for the development of technology with positive social and human impact.”

“Anita Borg sought to revolutionize the world and the way we think about technology and its impact on our lives.”

http://www.youtube.com/watch?v=1yPxd5jqz_Q

(18)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

00000100001001101

(19)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

0000100001001101

G

(20)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

000100001001101

(21)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

00100001001101

G

(22)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

0100001001101

(23)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

100001001101

G

(24)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

00001001101

(25)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

0001001101

GO

(26)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

001001101

(27)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

01001101

GO

(28)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

1001101

(29)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

001101

GOO

(30)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

01101

(31)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

1101

GOO

(32)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

101

(33)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

01

GOO

(34)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

1

(35)

Decoding a message

[Figure: the same Huffman trie as on slide 10 (root weight 60).]

01100000100001001101

GOOD

http://bit.ly/201-f17-1129-0

(36)

How to Interpret Bits

• What can we tell from file extensions?

• Foo.class, bar.jpg, file.txt, coolness.mp3

• How does OS know how to open these?

0000000: cafe babe 0000 0034 001d 0a00 0600 0f09 ...4...
0000010: 0010 0011 0800 120a 0013 0014 0700 1507 ...
0000020: 0016 0100 063c 696e 6974 3e01 0003 2829 ...<init>...()

0000000: ffd8 ffe0 0010 4a46 4946 0001 0200 0064 ...JFIF...d
0000010: 0064 0000 ffec 0011 4475 636b 7900 0100 .d...Ducky...
0000020: 0400 0000 5d00 00ff ee00 0e41 646f 6265 ....]...Adobe

0000000: 4944 3303 0000 0000 0048 5458 5858 0000 ID3...HTXXX..
0000010: 001a 0000 0045 6e63 6f64 6564 2062 7900 ...Encoded by.
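The leading bytes in the dumps above (`cafe babe`, `ffd8 ff`, and `4944 33`, which is "ID3" in ASCII) are enough to guess a file type without looking at the extension. A minimal sketch, with hypothetical names and only these three signatures:

```java
// Sketch: guess a file type from its first bytes rather than its extension.
class MagicNumberSketch {
    static String guessType(byte[] header) {
        if (startsWith(header, 0xCA, 0xFE, 0xBA, 0xBE)) {
            return "Java class file";
        }
        if (startsWith(header, 0xFF, 0xD8, 0xFF)) {
            return "JPEG image";
        }
        if (startsWith(header, 0x49, 0x44, 0x33)) { // 'I', 'D', '3'
            return "MP3 with ID3 tag";
        }
        return "unknown";
    }

    // Compares unsigned byte values against the expected prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 0x34};
        System.out.println(guessType(classHeader)); // Java class file
    }
}
```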

(37)

Bits in a .class file

• Does JVM read bit-by-bit? By symbol?

• Consider file in Hex or Binary, does it matter?

• Compare Foo.java to Foo.class

0000000: cafe babe 0000 0034 001d 0a00 0600 0f09 ...4...

0000010: 0010 0011 0800 120a 0013 0014 0700 1507 ...

0000020: 0016 0100 063c 696e 6974 3e01 0003 2829 ...<init>...()

0000000: 11001010 11111110 10111010 10111110 00000000 00000000 ...

0000006: 00000000 00110100 00000000 00011101 00001010 00000000 .4....)

(38)

PicassoGuernica.jpg

• Viewed using "open .." and via "xxd .."

• Wikimedia "knows" how to display?

0000000: ffd8 ffe0 0010 4a46 4946 0001 0100 0001 ...JFIF...

0000010: 0001 0000 ffdb 0043 0008 0606 0706 0508 ....C...

0000020: 0707 0709 0908 0a0c 140d 0c0b 0b0c 1912 ...

(39)

Limits of Compression

• How many values represented with 3 bits?

• 000, 001, 010, 011, 100, 101, 110, 111

• How many values represented with N bits? 2^N

• Can we compress all of these? Suppose N = 10

• 2 1-bit files, 4 2-bit files, … 512 9-bit files

• How many is this in total?

• Is this about lossy or lossless compression?
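The counting argument for N = 10 can be checked with a short sketch: there are more N-bit files than there are shorter non-empty files, so no scheme can shrink every N-bit file without two of them colliding. Names are hypothetical.

```java
// Sketch of the counting argument behind the limits of compression.
class CountingSketch {
    // Number of distinct files of exactly n bits: 2^n.
    static long filesOfLength(int n) {
        return 1L << n;
    }

    // Number of distinct non-empty files shorter than n bits: 2 + 4 + ... + 2^(n-1).
    static long shorterFiles(int n) {
        long total = 0;
        for (int len = 1; len < n; len++) {
            total += filesOfLength(len);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(filesOfLength(10)); // 1024
        System.out.println(shorterFiles(10));  // 1022
    }
}
```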

(40)

Measuring Information

• Original Huff explanation at Duke used example:

• Compress "go go gophers"

char   ASCII   binary    3-bit   Huffman
g      103     1100111   000     00
o      111     1101111   001     01
p      112     1110000   010     1100
h      104     1101000   011     1101
e      101     1100101   100     1110
r      114     1110010   101     1111
s      115     1110011   110     100
sp.     32     0100000   111     101

[Figure: Huffman tree for "go go gophers" with root weight 13. Leaf counts: g 3, o 3, space 2, p 1, h 1, e 1, r 1, s 1.]
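The three encodings in the table can be compared by totaling the bits each one needs for "go go gophers". This sketch hard-codes the variable-length codes from the table and ignores the cost of storing the tree; `GophersSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: total bits for "go go gophers" under each encoding in the table.
class GophersSketch {
    // Variable-length codes copied from the last column of the table.
    static Map<Character, String> tableCodes() {
        Map<Character, String> codes = new HashMap<>();
        codes.put('g', "00");
        codes.put('o', "01");
        codes.put('p', "1100");
        codes.put('h', "1101");
        codes.put('e', "1110");
        codes.put('r', "1111");
        codes.put('s', "100");
        codes.put(' ', "101");
        return codes;
    }

    // Sums the code length of every character in the text.
    static int encodedLength(String text, Map<Character, String> codes) {
        int bits = 0;
        for (char c : text.toCharArray()) {
            bits += codes.get(c).length();
        }
        return bits;
    }

    public static void main(String[] args) {
        String text = "go go gophers";
        System.out.println("8 bits per char: " + 8 * text.length());               // 104
        System.out.println("3 bits per char: " + 3 * text.length());               // 39
        System.out.println("Huffman codes:   " + encodedLength(text, tableCodes())); // 37
    }
}
```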

(41)

Autocomplete meets Huff

(42)

Autocomplete meets Huff

(43)

Autocomplete meets Huff

(44)

Autocomplete meets Huff

(45)

YAHW

http://bit.ly/201f17-huff-3
