International Journal of Advanced Research in Computer Science and Software Engineering

(1)

International Journal of Advanced Research in Computer Science and Software Engineering

Research Paper

Available online at: www.ijarcsse.com

Greedy Algorithm: Huffman Algorithm

Annu Malik, Neeraj Goyat, Prof. Vinod Saroha(Guide) Computer Science and Engineering

(Network Security)

B.P.S M.V.,Khanpur kalan Haryana, India

Abstract— This paper presents a survey on Greedy Algorithm. This discussion is centered on overview of huffman code, huffman algorithm and applications of greedy algorithm. A greedy algorithm is an algorithm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum. In many problems, a greedy strategy does not in general produce an optimal solution, but nonetheless a greedy heuristic may yield locally optimal solutions that approximate a global optimal solution in a reasonable time.Greedy algorithms determine the minimum number of coins to give while making change. These are the steps a human would take to emulate a greedy algorithm to represent 36 cents using only coins with values {1, 5, 10, 20}. The coin of the highest value, less than the remaining change owed, is the local optimum. (Note that in general the change-making problem requires dynamic programming or integer programming to find an optimal solution; However, most currency systems, including the Euro and US Dollar, are special cases where the greedy strategy does find an optimum solution.)

Keywords— Greedy, huffman, activity, optimal, algorithm etc

I. INTRODUCTION

Greedy Algorithm solves problem by making the choice that seems best at the particular moment. Many Optimization problems can be solved using a greedy algorithm. Some problems have no efficient

solution, but a greedy algorithm may provide an efficient solution that is close to optimal. A greedy algorithm works if a problem exhibit the following two properties:

1) Greedy Choice Property: A globally optimal solution can be arrived at by making a locally optimal solution. In other words, an optimal solution can be obtained by making “greedy” choices.

2) Optimal Substructure: Optimal solutions contains optimal sub solutions. In other words, solutions to sub problems of an optimal solution are optimal.

II. HUFFMAN CODES

Data can be encoded efficiently using Huffman codes. It is widely used and very effective technique for compressing data;

savings of 20% to 90% are typical, depending on the characteristics of the file being compressed. Huffman's greedy algorithm uses a table of the frequencies of occurrence of each character to build up an optimal way of representing each character as a binary string. Suppose we have 10⁵ characters in a data file. Normal storage: 8 bits per character (ASCII)-8*10⁵ bits in a file. But, we want to compress the file and store it compactly. Suppose only 6 characters appear in the file:

a b c d e f Total

Frequency 45 13 12 16 9 5 100

How can we represent the data in a compact way?

[1] Fixed Length Code: Each letter represented by an equal number of bits. With a fixed length code, at least 3 bits per character:

For example:

a 000

b 001

c 010

d 011

e 100

f 101

For a file with 105 characters, we need 3*10⁵ bits.

(2)

a 0

b 101

c 100

d 111

e 1101

f 1100

Number of bits = (45*1+13*3+12*3+16*3+9*4+5*4)*1000 =2.24*10⁵ bits.

Thus, 224000 bits to represent the file, a saving of approximately 25%. In fact, this is an optimal character code for this file.

Let us denote the characters by C₁, C₂, C₃,….., Cn and devote their frequencies by f₁, f₂,…, fn .

Suppose there is an encoding E in which a bit string S_i of length s_i represents C_i, the length of the file compressed by using encoding E is

L(E, F) = ∑ si * fi for i=1 to n III. PREFIX CODES

The prefixes of an encoding of one character must not be equal to a complete encoding of another character e.g. 1100 and 11001 are not valid codes because 1100 is a prefix of 11001. This constraint is called the prefix constraint. Codes in which no codewords is also a prefix f some other code word are called prefix codes. Shortening the encoding of one character may lengthen the encoding of others. To find an encoding E that satisfies the prefix constraint and minimizes L(E, F). Prefix codes are desirable because they simplify encoding(compression) and decoding. Encoding is always simple for any binary character code, we just concatenate the code words representing each character of the file. Decoding is also quite simple with a prefix code. Since no codeword is a prefix of any other, the codeword that begins an encoded file is unambiguous. We can simply identify the initial codeword, and repeat the decoding process on the remainder of the encoded file. The decoding process needs a convenient representation for the prefix code so that the initial codeword can be easily picked off. A binary tree whose leaves are the given characters provides one such representation. We interpret the binary codeword for a character as the path from the root to that character, where 0 means “go to the left child” and 1 means “go to the right child”. Note that these are not binary search trees, since the leaves need not appear in sorted order and internal nodes do not contain character keys. An optimal code for a file is always represented by a full binary tree, in which every non-leaf node has two children.

The fixed-length code in our example is not optimal since its tree, because it is not a full binary tree: there are code words beginning 10..., but none beginning 11... Since we can now restrict our attention to full binary trees, we can say that i f C is the alphabet from which the characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves , one for each letter of the alphabet, and exactly |C|-1 internal nodes. Fig. Trees corresponding to the coding schemes. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the weights of the leaves in its subtree.

(a). The tree corresponding to the fixed-length code a=100,...,f=100

100

86

58 28

14

b:13 c:12

d:16

e:9 f:5

a:45

1

1 1

1

1 0

0

(a). Not Optimal

(3)

July - 2013, pp. 296-303

(b). the tree corresponding to the optimal prefix code a=0,b=101,..., f=1100.

GREEDY ALGORITHM FOR CONSTRUCTING A HUFFMAN CODE Huffman invented a greedy algorithm that constructs an optimal prefix code called a Huffman Code.

The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C|-1 “merging” operations to create the final tree. In the pseudo code HUFFMAN(C), we assume that C is a set of n characters and that each character c C is an object with a defined frequency f[c]. A priority queue Q, keyed

100

55

30 14 25

b:13

c:12 d:16

f:5 e:9

a:45

1 0

1

1 1 0 1

0 0

0 (b) Optimal

A B A B

A+B

(4)

HUFFMAN(C) 3) n ← |C|

4) Q ← C

5) for I ← 1 to n-1

6) do z ← ALLOCATE-NODE() 7) x ← left[z] ← EXTRACT-MIN(Q) 8) y← right[z] ← EXTRACT-MIN(Q) 9) f[z] ← f[x] + f[y]

10) INSERT(Q, z)

11) return EXTRACT-MIN(Q)

The analysis of the remaining time of Huffman's algorithm assumes that Q is implemented as a binary heap. For a set of n characters , the initialization of Q in line Q in the 2 can be performed in O(n) time using the BUILD-HEAP operation requires time O(nlgn), the loop contributes O(nlgn) to the running time. Thus, the total running time of HUFFMAN on a set of n characters is O(nlgn).

The algorithm is based on a reduction of a problem with n characters to a problem with n-1 characters. A new character replaces two existing ones.

f:5 e:9 c:12 b:13 d:16

f:5 e:9

c:12 b:13 a:45

a:45

d:16 14

0 1

f:5 e:9 c:12 b:13

a:45 d:16

14 25

c:12 b:13

25 30

14 d:16

f:5 e:9

a:45

0

0 0

0

1 1 1

1

(5)

July - 2013, pp. 296-303

Example: Find an optimal huffman code for the following set of frequencies:

a : 50 b : 25 c : 15 d : 40 e : 75 Solution: Given that :C = {a, b, c, d, e}

f ( C ) = {50, 25, 15, 40, 75}

n = 5 Q ← c i.e.,

For i ← 1 to 4

i = Z ← Allocate node x ← Extract-Min(Q) y ← Extract-Min(Q)

b:13

25 30

14 d:16

f:5 e:9

a:45

c:12

55 100

0 0 0

0 0

1

1 0 101 100 111 1101 1100

a b c d e f

c 50 b 25 d 40 a 50 e 75

x y

(6)

F[z] = 40 i.e.,

Again, for i = 2

z ← Allocate node x ← 40

y ← 40 left[z] ← x Right[z] ← y F[z] = 40 + 40 = 80

40 c b

d 40 a 50 e 75

25 15

Left[z] Right[z]

x y

z

40 c b

d 40 a 50 e 75

25 15

x y

80 d

a 50 e 75

40 40

c 15 b 25

x y

Left[z] Right[z]

(7)

July - 2013, pp. 296-303 Similarly we apply the same process, we get

Using Huffman Codes

 Each message has a different tree. The tree must be saved with the message.

 Huffman codes are effective for long files where the savings in the message can offset the cost for storing the tree.

 Decode files by starting at root and proceeding down the tree according to the bits in the message (0 = left, 1 = right).

When a leaf is encountered, output the character at that leaf and restart at the root.

 Huffman codes are also effective when the tree can be pre-computed and used for a large number of messages (e.g., a tree based on the frequency of occurrence of characters in the English Language).

 Huffman codes are not very good for random files(each character about the same frequency).

80 d

a 50 e 75

40 40

c 15 b 25

125

80 d 40 a 50 e 75

40 c 15 b 25

125 205

0 0 0

0

1 1 1

1 Huffman Tree

(8)

best overall solution later. For example, all known greedy coloring algorithms for the graph coloring problem and all other NP-complete problems do not consistently find optimum solutions. Nevertheless, they are useful because they are quick to think up and often give good approximations to the optimum. If a greedy algorithm can be proven to yield the global optimum for a given problem class, it typically becomes the method of choice because it is faster than other optimization methods like dynamic programming. Examples of such greedy algorithms are Kruskal's algorithm and Prim's algorithm for finding minimum spanning trees, Dijkstra's algorithm for finding single-source shortest paths, and the algorithm for finding optimum Huffman trees.

V. CONCLUSION

Greedy algorithms are usually easy to think of, easy to implement and run fast. Proving their correctness may require rigorous mathematical proofs and is sometimes insidious hard. In addition, greedy algorithms are infamous for being tricky.

Missing even a very small detail can be fatal. But when you have nothing else at your disposal, they may be the only salvation. With backtracking or dynamic programming you are on a relatively safe ground. With greedy instead, it is more like walking on a mined field. Everything looks fine on the surface, but the hidden part may backfire on you when you least expect. While there are some standardized problems, most of the problems solvable by this method call for heuristics. There is no general template on how to apply the greedy method to a given problem, however the problem specification might give you a good insight. In some cases there are a lot of greedy assumptions one can make, but only few of them are correct . They can provide excellent challenge opportunities...

REFERENCES

 Wikipedia

 Google

 Algorithms Design and Analysis by Udit Agarwal