Quad-Byte Transformation using Zero-frequency Bytes


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 4, Issue 6, June 2014)


Jyotika Doshi 1, Savita Gandhi 2

1 GLS Institute of Computer Technology, Ahmedabad, INDIA
2 Department of Computer Science, Gujarat University, Ahmedabad, INDIA

Abstract— The byte pair encoding (BPE) algorithm, suggested by P. Gage, achieves data compression by encoding all instances of the most frequent byte-pair with a zero-frequency byte of the source data. This process is repeated for a maximum of m possible passes until no further compression is possible, either because there are no more frequently occurring byte pairs or because there are no more unused zero-frequency bytes to represent pairs. The proposed quad-byte transformation algorithm encodes quad-bytes (4-byte integers) instead of byte pairs. The proposed QBT-Z (Quad-Byte Transformation using Zero-frequency bytes) is applied on data blocks of a file. Block-wise QBT-Z is implemented with a varying number of passes and tested on various files. The aim is to minimize the compression time and achieve a better compression rate. QBT-Z with m passes transforms only one quad-byte at a time using an unused zero-frequency byte, and the process is repeated the maximum possible number of times, m. With k-pass QBT-Z, k<<m, half of the possible most-frequent quad-bytes are transformed in each pass except the last; in the last pass, all remaining possible quad-bytes are transformed. The algorithm is tested on 18 files of various types with a total size of nearly 40MB. The overall transformation rate (compression rate using transformation) is 17.04%, 18.91%, 19.25%, 19.31% and 22.75% using QBT-Z with 1, 2, 3, 4 and m passes respectively, showing that a larger number of passes gives a greater reduction in transformed file size. Total transformation time taken by k-pass QBT-Z with 1, 2, 3 and 4 passes is only 7.629, 12.183, 16.091 and 20.994 seconds respectively, which is significantly smaller than the 268.434 seconds taken by m-pass QBT-Z. As the number of passes increases, compression improves at the cost of increased execution time. Reverse transformation is very fast, taking only about 2 to 5 seconds.

Keywords—data compression, transformation, quad-byte encoding, substitution using zero-frequency bytes

I. INTRODUCTION

Data transformation converts data from one form to another in order to reduce data size or to encrypt the data. Here, we have applied data transformation on quad-bytes, i.e., groups of four adjacent bytes. Frequent quad-bytes are encoded using zero-frequency bytes (bytes not used in the data source).

Here, we have suggested a variation of BPE (Byte Pair Encoding). BPE as proposed by P. Gage [1] compresses data by transforming all instances of the most frequent pair of adjacent bytes in the data by substitution with a byte that is not used in the original data. The process is repeated until no further compression is possible, either because there are no more frequently occurring pairs or there are no more unused bytes to represent pairs.

The proposed QBT-Z is very similar to BPT-Z except for the number of consecutive bytes transformed: QBT-Z transforms 4 consecutive bytes.

QBT-Z is applied on blocks of data instead of on the entire file at once. Because a data block is small, this increases the possibility of finding zero-frequency bytes. It also enables a parallel encoding algorithm and gives the advantage of finding more frequent quad-bytes in nearby locality.

We have suggested the following variations of QBT-Z:

• QBT-Z with the maximum possible number of passes, m, where it transforms only one most frequent quad-byte in each pass.

• QBT-Z with k passes, where k<<m and k can be specified as a parameter. Here it transforms half of the maximum possible quad-bytes in each of the first k-1 passes. In the last pass, it transforms all possible quad-bytes.

The intention of the proposed variations of QBT-Z is to reduce the transformation time in addition to achieving compression.

II. LITERATURE REVIEW

In most source data, adjacent bytes are observed to be repeated many times [7]. Such byte pairs are therefore good candidates for achieving data compression through substitution with a single byte. Byte-Pair Encoding (BPE) [1, 5], digram encoding [3, 4] and Iterative Semi-Static Digram Coding (ISSDC) [2] are such algorithms. These algorithms are intended for text files. However, they can be applied to any type of source.


Digram coding first builds a dictionary and then performs the encoding in a second pass. In the semi-static dictionary of digram encoding, all of the individual 1-byte characters used in the source are added to the first part of the dictionary and the most frequently used digrams are added to the second part of the dictionary. If the source contains n individual characters, and the dictionary size is d, then the number of digrams that can be added to the dictionary is d − n. The n and d values and the dictionary are written to the destination file along with the encoded data for the decoder. Digrams are encoded using their index position in the dictionary. Altan Mesut and Aydin Carus [2], in their Iterative Semi-Static Digram Coding (ISSDC), use repeated digram coding. It requires the entire source to be in memory because of its two-pass, multi-iteration design.

Digram encoding and ISSDC are beneficial only when a small-size alphabet is used in the source. If all 256 symbols are used in the source, the dictionary needs to hold more than 256 entries and each digram in the dictionary will be encoded using more than 8 bits.

The authors of this paper have proposed BPT-I [7] and QBT-I [6]. Both of these algorithms are dictionary-based transformation techniques intended to introduce redundancy in the data for second-stage conventional data compression techniques. The algorithm forms logical groups of the most frequent byte-pairs or quad-bytes in a dictionary and encodes each byte-pair or quad-byte using a variable-length group number and the index position of the data within the group. Thus a byte-pair or quad-byte is encoded with less than 16 bits.

The byte pair encoding (BPE) algorithm as proposed by P. Gage [1] compresses data by finding the most frequent pair of adjacent bytes in the data and replacing all instances of this pair with a byte that was not in the original data. BPE is beneficial only when the source has a small alphabet size, because it needs unused zero-frequency byte symbols for encoding the most frequent byte-pairs. Another drawback is that its multiple passes require the entire data to be stored in memory to avoid frequent file input/output operations.

A variation of BPE applied on blocks [1, 5] requires buffering small blocks of data and compressing each block separately. Here an additional cost is incurred for storing the substitution information and the output block size for each data block. The main advantage of block-wise transformation is the increased chance of finding unused (zero-frequency) bytes in a small data block as compared to the entire file.

III. RESEARCH SCOPE

Digram encoding, ISSDC and BPT-I are all dictionary-based techniques that use a dictionary index to encode a byte pair.

Digram encoding and ISSDC perform well only when applied to a source with a small alphabet. BPT-I is intended for introducing data redundancy for second-stage data compression.

The problem with BPE is its large execution time, caused by the large number of passes.

We saw scope for reducing the large execution time taken by the above methods by transforming four consecutive bytes (quad-bytes) instead of digrams or byte-pairs.

QBT-I transforms quad-bytes using index positions in a dictionary of frequent quad-bytes. It is quite slow to execute.

We have proposed QBT-Z (Quad-Byte Transformation using Zero-frequency bytes) and its variations, which encode quad-bytes with zero-frequency bytes in a block in each pass. It is expected to execute faster and give better data compression.

IV. BRIEF INTRODUCTION TO M-PASS AND K-PASS QBT-Z

QBT-Z is applied on blocks of data in a file. In each pass, it determines the unused bytes and the frequency of each quad-byte, and then transforms the most frequent quad-bytes using zero-frequency bytes.

m-pass QBT-Z transforms only one most frequent quad-byte at a time. The maximum number of passes is determined by the program and cannot be specified by the user.

k-pass QBT-Z may transform more than one most-frequent quad-byte at a time. Additionally, the user can experiment with a varying number of passes.

A. Understanding m-pass QBT-Z with example

With m-pass quad-byte transformation, only one most frequent quad-byte is encoded in each pass. The header containing the substitution information is therefore 5 bytes: 4 bytes for the quad-byte to be encoded and 1 byte for the byte used to encode it. Each encoded quad-byte thus incurs an additional cost of 5 bytes. Encoding 4 bytes using 1 byte saves 3 bytes per occurrence, so encoding two occurrences of a quad-byte saves only 6-5=1 byte. Thus, if the frequency of a quad-byte is less than 3, there is practically no saving. The process therefore stops when the most frequent quad-byte occurs fewer than 3 times or when no zero-frequency byte remains.
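The cost argument above can be written as a small Python sketch. The function and constant names are illustrative and do not come from the paper; the numbers (5-byte header, 3 bytes saved per occurrence) are the ones stated in the text.

```python
HEADER_BYTES = 5          # 4 bytes for the quad-byte + 1 byte for its substitute
SAVED_PER_OCCURRENCE = 3  # a 4-byte quad-byte is replaced by a single byte


def net_saving(frequency: int) -> int:
    """Net bytes saved by substituting one quad-byte occurring `frequency` times."""
    return SAVED_PER_OCCURRENCE * frequency - HEADER_BYTES


for f in (1, 2, 3, 4):
    print(f, net_saving(f))
```

A frequency of 2 gives a net saving of a single byte, which is why the stopping threshold in the text is set at 3 occurrences.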

Consider "abcdababdcdabcdcdabcdab" as the source data to be encoded.

Pass-1

• Input: abcdababdcdabcdcdabcdab, length=23

• Most frequent quad-byte: cdab with frequency 4

• Encoding: substitute cdab by e

Thus abcdababdcdabcdcdabcdab is transformed to abeabdecdee with header information cdabe.

Header   Transformed data   Total length
cdabe    abeabdecdee        16

• Output (header, transformed data): cdabeabeabdecdee, length: 16

The process stops here, as there is no quad-byte with frequency > 2 in "cdabeabeabdecdee".
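The single pass of this example can be reproduced with a minimal Python sketch. It assumes that frequencies are counted over all (possibly overlapping) 4-byte windows and that substitution scans greedily from left to right; the helper names are hypothetical, and the byte `e` is chosen by hand as the unused byte, as in the example.

```python
from collections import Counter


def most_frequent_quad(buf: bytes):
    """Return (quad, count) for the most frequent 4-byte window in buf.

    Assumes len(buf) >= 4; overlapping windows are counted.
    """
    counts = Counter(buf[i:i + 4] for i in range(len(buf) - 3))
    return counts.most_common(1)[0]


def substitute(buf: bytes, quad: bytes, code: int) -> bytes:
    """Replace each occurrence of quad by the single byte code, left to right."""
    out = bytearray()
    i = 0
    while i < len(buf):
        if buf[i:i + 4] == quad:
            out.append(code)
            i += 4
        else:
            out.append(buf[i])
            i += 1
    return bytes(out)


source = b"abcdababdcdabcdcdabcdab"
quad, count = most_frequent_quad(source)
encoded = substitute(source, quad, ord("e"))
print(quad, count, encoded)  # b'cdab' 4 b'abeabdecdee'
```

Adding the 5-byte header (quad-byte plus substitute) to the 11 transformed bytes gives the total length of 16.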

B. Understanding k-pass QBT-Z with example

With k-pass QBT-Z, one or more quad-bytes are transformed in each pass. The header contains n, the number of quad-bytes to be transformed, followed by n substitution entries. Each entry takes 5 bytes (4 bytes for the quad-byte and 1 byte for its substitute), so the header requires (1+n*5) bytes.

Consider k-pass QBT-Z with k=1. Here, all possible frequent quad-bytes are transformed at once in a single pass. Consider "abcdababdcdabcdcdabcdab" as the source data to be encoded.

Pass-1

• Input: abcdababdcdabcdcdabcdab, length=23

• Most frequent quad-bytes: cdab with frequency 4, abcd with frequency 3

• Encoding: substitute cdab by e and abcd by g

Thus abcdababdcdabcdcdabcdab is transformed to gababdecdee with header information 2cdabeabcdg.

Header        Transformed data   Total length
2cdabeabcdg   gababdecdee        22

Note that cdab occurs 4 times but only three of its occurrences are encoded, because its first occurrence overlaps the abcd at the start of the data, which has already been replaced. Similarly, the frequency of abcd is 3, but only one of its occurrences is encoded.

• Output (header, transformed data): 2cdabeabcdg gababdecdee, length: 22
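A single pass that substitutes several quad-bytes at once can be sketched in Python as below. This is only an illustration under two assumptions not spelled out in the text: the scan is greedy from left to right, and an occurrence that overlaps an already replaced quad-byte is left unencoded. The names `encode_pass` and `decode_pass` are hypothetical.

```python
def encode_pass(buf: bytes, table: dict[bytes, int]) -> bytes:
    """Replace every quad-byte found in table by its one-byte code."""
    out = bytearray()
    i = 0
    while i < len(buf):
        code = table.get(buf[i:i + 4])
        if code is not None:
            out.append(code)   # quad-byte matched: emit its substitute
            i += 4
        else:
            out.append(buf[i])  # no match: copy one byte and slide by one
            i += 1
    return bytes(out)


def decode_pass(buf: bytes, table: dict[bytes, int]) -> bytes:
    """Expand every code byte back to its quad-byte."""
    reverse = {code: quad for quad, code in table.items()}
    out = bytearray()
    for ch in buf:
        out += reverse.get(ch, bytes([ch]))
    return bytes(out)


source = b"abcdababdcdabcdcdabcdab"
table = {b"cdab": ord("e"), b"abcd": ord("g")}
encoded = encode_pass(source, table)
print(encoded, decode_pass(encoded, table) == source)
```

Decoding needs no overlap handling: because the substitute bytes never occur in the input of the pass, every occurrence of a code in the transformed data is a genuine substitution.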

C. Variations of Block-wise QBT-Z m-pass BPT-Z:

In each pass, it replaces single most frequent quad-byte with single unused quad-byte. Here, m is for maximum number of possible passes. User can not specify value of m.

1-pass BPT-Z (k-pass BPT-Z with k=1):

All possible most frequent quad-bytes are replaced with unused bytes in one pass only.

k-pass BPT-Z where 1<k<<m:

In the first k-1 passes, only half of the maximum possible quad-bytes are transformed in each pass. In the last pass, all maximum possible quad-bytes are transformed just like 1-pass QBT-Z.

With all above proposed variations, an overhead cost is of storing substitution information in the header. Due to this overhead, it may not beneficial to transform less frequent data.

V. M-PASS QBT-Z ALGORITHM

A. m-pass Encoder:

File is processed reading blocks of specified number of bytes, say BlockSize. For each block read, it transforms maximum one quad-byte at a time and repeats this process maximum possible number of times. Finally it writes number of passes, block-size and the encoded output buffer.

• Repeat till end of file
  o Read a block of size block_size into an array, say buf
  o Determine the number of unused bytes, say m, in buf
  o Store the unused bytes in an array, say unused
  o Repeat m times (m-pass)
    - Find the most frequent quad-byte (or byte-pair) integer, say num, in buf
    - If its frequency is < 3, break the loop
    - Write header information (num and unused[m]) to the output buffer
    - Repeat till end of buf
      * Read 4 bytes of data
      * If it is the same as num, write unused[m] to the output buffer and move 4 bytes ahead in buf; else write the 1st byte to the output buffer and move to the next byte in buf
    - Copy the output buffer to buf for the next pass
  o Write the size of buf (final encoded block) to the output file
  o Write the number of passes m to the output file
  o Write the encoded buf to the output file

Thus, the content of each block transformed (encoded) is as follows:

i. Byte 1 and 2: size of encoded block
ii. Byte 3: number of passes m
iii. Byte 4 onwards: final encoded block after m passes
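The encoder steps and block layout above can be sketched for a single block as follows. This is a minimal Python illustration, not the authors' implementation: the big-endian size field, the counting of overlapping 4-byte windows, and the function name `encode_block_m_pass` are assumptions, and blocks are assumed small enough (8KB in the experiments) for the size to fit in 2 bytes.

```python
from collections import Counter


def encode_block_m_pass(block: bytes) -> bytes:
    """Encode one block: 2-byte size, 1-byte pass count, then the encoded buffer."""
    used = set(block)
    unused = [b for b in range(256) if b not in used]  # zero-frequency bytes
    buf = block
    passes = 0
    for code in unused:  # at most one pass per unused byte
        counts = Counter(buf[i:i + 4] for i in range(len(buf) - 3))
        if not counts:
            break
        quad, freq = counts.most_common(1)[0]
        if freq < 3:  # substitution would not pay for its 5-byte header
            break
        body = bytearray()
        i = 0
        while i < len(buf):
            if buf[i:i + 4] == quad:
                body.append(code)
                i += 4
            else:
                body.append(buf[i])
                i += 1
        # Header (quad-byte + substitute) is prepended; the result feeds the next pass.
        buf = quad + bytes([code]) + bytes(body)
        passes += 1
    return len(buf).to_bytes(2, "big") + bytes([passes]) + buf


print(encode_block_m_pass(b"abcdababdcdabcdcdabcdab"))
```

For the example string the first zero-frequency byte is 0x00, so the output carries size 16, one pass, the header `cdab` + 0x00 and the 11 transformed bytes.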

B. m-pass Decoder:

• Repeat till end of encoded file
  o Read 2 bytes block_len = size of encoded block from the encoded file
  o Read 1 byte m = number of passes
  o Read block_len bytes into the buffer, say buf
  o Repeat m times
    - Read the 4-byte encoded data, say num, from buf
    - Read the unused byte, say code, used to substitute num, from buf
    - Repeat till end of buf
      * Read 1 byte, say ch, from buf
      * If ch is the same as code, write the 4-byte num to the output buffer; else write the 1-byte ch to the output buffer
    - Copy the output buffer to buf
  o Write buf to the output file
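A matching decoder for one block can be sketched as below. As before, the big-endian size field and the name `decode_block_m_pass` are assumptions made for illustration; the header of the most recent pass is taken to sit at the front of the buffer, so passes are undone in reverse order.

```python
def decode_block_m_pass(encoded: bytes) -> bytes:
    """Undo the m passes of one block laid out as size (2), passes (1), buffer."""
    block_len = int.from_bytes(encoded[:2], "big")
    passes = encoded[2]
    buf = encoded[3:3 + block_len]
    for _ in range(passes):
        quad, code, body = buf[:4], buf[4], buf[5:]  # 5-byte header, then data
        out = bytearray()
        for ch in body:
            if ch == code:
                out += quad     # substitute byte expands to its quad-byte
            else:
                out.append(ch)  # ordinary byte is copied unchanged
        buf = bytes(out)
    return buf


# One-pass block from the worked example: size 16, 1 pass, header "cdabe".
block = (16).to_bytes(2, "big") + b"\x01" + b"cdabe" + b"abeabdecdee"
print(decode_block_m_pass(block))  # b'abcdababdcdabcdcdabcdab'
```

Each pass is a single linear scan with no frequency counting, which is consistent with the much smaller reverse transformation times reported in the experiments.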

VI. K-PASS QBT-Z ALGORITHM

A. k-pass Encoder:

File is processed reading blocks of specified number of bytes, say BlockSize. For each block read, it transforms half of the possible quad-bytes in initial k-1 passes. In the last pass, all possible quad-bytes are transformed.

At each pass, the encoded buffer contains the header information (number of quad-bytes to be transformed, values in substitution pairs (quad-byte, unused byte)) and then encoded data.

At the end of k passes, for each block, it writes final encoded buffer size and the encoded buffer itself.

• Repeat till end of file
  o Read a block of size block_size into an array, say buf
  o maxToReplace=0
  o Repeat for k-1 times or until maxToReplace=0
    - Determine the number of unused (zero-frequency) bytes, say m
    - Store the unused bytes in an array, say unused
    - Determine the frequency of quad-bytes
    - Sort the quad-bytes in the order of their frequency
    - Find the number of quad-bytes, say n, to be considered for transformation depending on frequency (e.g., frequency > 3)
    - Let maxToReplace = min(m, n)/2, the maximum number of quad-bytes chosen to be replaced by unused bytes
    - Write the value of maxToReplace and that many substitution pairs (quad-byte, unused byte) for the maxToReplace most frequent quad-bytes to the output buffer
    - Repeat till end of buf
      * Read a four-byte number from buf
      * If it is in the list of quad-bytes to be transformed, write the corresponding unused byte to the output buffer and move ahead by 4 bytes for the next quad-byte in buf; else copy the first byte of the number to the output buffer and move ahead by 1 byte in buf
  o Repeat the above process for the last pass with maxToReplace=min(m,n)
  o Write the length of the output buffer to the output file
  o Write the output buffer to the output file
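The k-pass encoder can be sketched for one block as follows. Several details are assumptions for illustration only: the threshold `MIN_FREQ` follows the "frequency > 3" example, the size field is 2 bytes big-endian as in the m-pass layout, and, unlike the early-exit rule above, this sketch always emits k headers (possibly with a count of 0) so that a decoder that knows k can undo exactly k passes.

```python
from collections import Counter

MIN_FREQ = 4  # assumed threshold, i.e. frequency > 3


def encode_pass(buf: bytes, last: bool) -> bytes:
    """One k-pass step: header (count + substitution pairs) followed by the data."""
    used = set(buf)
    unused = [b for b in range(256) if b not in used]          # m zero-frequency bytes
    counts = Counter(buf[i:i + 4] for i in range(len(buf) - 3))
    frequent = [q for q, f in counts.most_common() if f >= MIN_FREQ]  # n candidates
    limit = min(len(unused), len(frequent))
    if not last:
        limit //= 2  # only half of the possible quad-bytes before the last pass
    table = dict(zip(frequent[:limit], unused))
    header = bytearray([limit])
    for quad, code in table.items():
        header += quad + bytes([code])
    body = bytearray()
    i = 0
    while i < len(buf):
        code = table.get(buf[i:i + 4])
        if code is None:
            body.append(buf[i])
            i += 1
        else:
            body.append(code)
            i += 4
    return bytes(header + body)


def encode_block_k_pass(block: bytes, k: int) -> bytes:
    """Apply k passes to one block and prefix the final buffer with its 2-byte size."""
    buf = block
    for p in range(k):
        buf = encode_pass(buf, last=(p == k - 1))
    return len(buf).to_bytes(2, "big") + buf


print(encode_block_k_pass(b"abcdababdcdabcdcdabcdab", 1))
```

Recomputing the unused bytes inside every pass, as the steps above do, guarantees that a code introduced by an earlier pass is never reused by a later one.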

B. k-pass Decoder:

• Repeat till end of encoded file
  o Read block_len = size of encoded block from the encoded file
  o Read a buffer of block_len bytes from the file into buf
  o Read m (number of quad-bytes chosen for replacement) from buf
  o Read m substitution pairs (quad-byte, unused byte) into an array from buf
  o Repeat till end of buf
    - Read 1 byte, say ch, from buf
    - If ch is the same as any of the unused bytes in the array of substitution pairs, write the corresponding quad-byte to the output file; else write ch to the output file
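The decoder steps above undo one pass. A hedged Python sketch that repeats them k times for one block is given below; it assumes a 2-byte big-endian size field, one header per pass with the most recent header at the front, and that k is known to the decoder. None of these details is stated in the text, and the function names are hypothetical.

```python
def decode_pass(buf: bytes) -> bytes:
    """Undo one pass: read count n, n pairs (quad-byte, code), then expand the data."""
    n = buf[0]
    table = {}
    pos = 1
    for _ in range(n):
        table[buf[pos + 4]] = buf[pos:pos + 4]  # code byte -> quad-byte
        pos += 5
    out = bytearray()
    for ch in buf[pos:]:
        out += table.get(ch, bytes([ch]))
    return bytes(out)


def decode_block_k_pass(encoded: bytes, k: int) -> bytes:
    """Decode one block written as a 2-byte size followed by the encoded buffer."""
    block_len = int.from_bytes(encoded[:2], "big")
    buf = encoded[2:2 + block_len]
    for _ in range(k):
        buf = decode_pass(buf)
    return buf


one_pass = b"\x00\x11" + b"\x01cdab\x00" + b"ab\x00abd\x00cd\x00\x00"
print(decode_block_k_pass(one_pass, 1))  # b'abcdababdcdabcdcdabcdab'
```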

VII. EXPERIMENTAL RESULTS AND ANALYSIS OF QBT-Z

Block-wise Quad-Byte Transformation using Zero-frequency bytes (QBT-Z) is performed with 1 and k (2 or more) passes in addition to maximum possible m-pass.


TABLE I
TEST FILES USED IN EXPERIMENT

No.   Source File Name   Source Size (Bytes)
1     act2may2.xls       1348036
2     calbook2.txt       610856
3     cal-obj2           246814
4     cal-pic            513216
5     cycle.doc          1483264
6     every.wav          6994092
7     family1.jpg        198372
8     frymire.tif        3706306
9     kennedy.xls        1029744
10    lena3.tif          786568
11    linux.pdf          8091180
12    linuxfil.ppt       246272
13    monarch.tif        1179784
14    pine.bin           1566200
15    profile.pdf        2498785
16    sadvchar.pps       1797632
17    shriji.jpg         4493896
18    world95.txt        3005020
Total size (Bytes)       39796037

Results are recorded as the average of five runs for each test file. Most of the test files are selected from the Calgary corpus, the Canterbury corpus and the ACT web site. Test files are selected to include different file types and various file sizes, as shown in Table 1.

Table 2 gives the resulting file size of each file after transformation using various numbers of passes of QBT-Z. In most of the files, it can be observed that as the number of passes increases, the reduction in file size also increases. Maximum reduction is observed with m-pass.

TABLE II
TRANSFORMED FILE SIZE (BYTES) AFTER M-PASS AND K-PASS QBT-Z

File No.   m pass    1 pass    2 pass    3 pass    4 pass
1          730560    824736    765411    761401    760442
2          397802    474945    490188    490108    490183
3          163030    198080    189119    186066    185127
4          81331     184599    112377    99506     97813
5          574132    812832    665628    633909    628959
6          6996644   6996644   6997508   6998362   6999216
7          198431    198446    198472    198497    198522
8          1212183   1710533   1450974   1421200   1416383
9          335045    389700    350533    346246    346290
10         786243    786798    786858    786955    787052
11         5973207   6617753   6397862   6358141   6354790
12         179391    197036    186334    184066    183416
13         1130232   1153911   1166198   1166354   1166499
14         1110988   1271663   1242818   1232091   1226568
15         2493687   2495262   2494758   2495009   2495315
16         1729980   1751322   1744673   1743477   1743391
17         4479601   4488212   4487614   4487836   4488131
18         2170758   2461967   2544857   2544829   2545171

Table 3 shows the summarized results. It includes total file size after transformation, percentage of overall saving in file size (compression rate), average bits used per 8-bit symbol (showing bits saved per byte), total transformation time and total inverse transformation time.

TABLE III
OVERALL RESULTS OF ALL TEST FILES

QBT-Z    Total Transformed   Overall Transformation Rate   Overall BPS         Total Transformation   Total Inverse Transformation
         File Size (Bytes)   (% saving in file size)       (Bits Per Symbol)   Time (Sec)             Time (Sec)
m pass   30743245            22.748                        6.180               268.434                4.697
1 pass   33014439            17.041                        6.637               7.629                  2.147
2 pass   32272182            18.906                        6.488               12.183                 2.340
3 pass   32134053            19.253                        6.460               16.091                 2.432
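The rate and BPS columns of Table III follow directly from the total sizes. The short Python sketch below recomputes them, assuming the usual definitions (percentage of bytes saved, and 8 bits scaled by the size ratio); the function names are illustrative.

```python
SOURCE_TOTAL = 39796037  # total size of the 18 test files (Table I)


def transformation_rate(transformed: int, source: int = SOURCE_TOTAL) -> float:
    """Percentage saving in file size."""
    return 100.0 * (source - transformed) / source


def bits_per_symbol(transformed: int, source: int = SOURCE_TOTAL) -> float:
    """Average number of bits used per 8-bit source symbol."""
    return 8.0 * transformed / source


totals = {"m pass": 30743245, "1 pass": 33014439, "2 pass": 32272182, "3 pass": 32134053}
for name, size in totals.items():
    print(f"{name}: {transformation_rate(size):.3f}% {bits_per_symbol(size):.3f} BPS")
```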


VIII. OBSERVATIONS

Following are the observations based on summary results shown in Table 3.


With QBT-Z using 1, 2, 3, 4 and m passes, the overall transformation rate (reduction in file size due to transformation) of all test files is 17.04%, 18.91%, 19.25%, 19.31% and 22.75% respectively. A larger number of passes gives a greater reduction in total transformed file size, which can also be seen from the reduced BPS (bits used per 1-byte symbol). Figure 1 represents the overall compression rate. The maximum saving is seen with m-pass QBT-Z. With k-pass QBT-Z, as k increases, the reduction in file size also increases; after 3 passes, however, the additional reduction is not significant.

Figure I Percentage Of Reduction In File Size Due To Transformation


Figure 2 and Figure 3 represent the total time taken to transform (compress) and reverse transform (decompress) all 18 files, with a total size of nearly 40 MB. The time taken by m-pass QBT-Z is considerably higher than that of k-pass QBT-Z.

Figure II Total Transformation Time

Total transformation time of all files taken by QBT-Z with 1, 2, 3 and 4 passes is only 7.629, 12.183, 16.091 and 20.994 seconds respectively. This is significantly smaller than the 268.434 seconds taken by m-pass QBT-Z. With k-pass QBT-Z, transformation time increases with the number of passes.

Figure III Total Reverse Transformation Time

Reverse transformation is very fast. Total time taken by m-pass QBT-Z is 4.697 seconds. Time taken by k-pass QBT-Z is 2.147, 2.340, 2.432 and 2.543 seconds for k=1, 2, 3 and 4 respectively.

IX. OTHER FACTORS AFFECTING QBT-Z

Other than the number of passes, the size of the data block and the data structure used to store quad-bytes with their frequencies affect the results.

Here, we have used data blocks of size 8KB. With a larger block size, the probability of finding zero-frequency byte symbols reduces, so many files do not compress well.

The data structure used here is a binary search tree. Each node of the tree consists of two elements: the quad-byte integer data and its frequency. The data is then sorted on frequency using the quick-sort method. This helps to speed up searching during transformation.
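For illustration, the same frequency table can be built in Python with a hash map in place of the binary search tree and the built-in sort in place of quick-sort. This swaps in different data structures from the ones used in the paper and is only a sketch; packing each quad-byte into an integer in big-endian order is also an assumption.

```python
from collections import Counter


def quad_frequencies(buf: bytes) -> list[tuple[int, int]]:
    """Return (quad-byte integer, frequency) pairs, most frequent first."""
    counts = Counter(
        int.from_bytes(buf[i:i + 4], "big")  # pack 4 bytes into one integer key
        for i in range(len(buf) - 3)
    )
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


table = quad_frequencies(b"abcdababdcdabcdcdabcdab")
top_value, top_freq = table[0]
print(top_value.to_bytes(4, "big"), top_freq)  # b'cdab' 4
```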

X. CONCLUSION


QBT-Z with 3 passes can be considered optimal considering both compression rate and compression time. One may trade off between the compression rate and time as needed.

REFERENCES

[1] Philip Gage, "A New Algorithm for Data Compression", The C Users Journal, vol. 12(2), pp. 23–38, February 1994

[2] Altan Mesut, Aydin Carus, "ISSDC: Digram Coding Based Lossless Data Compression Algorithm", Computing and Informatics, Vol. 29, pp. 741–754, 2010

[3] Khalid Sayood, "Introduction to Data Compression", 2nd edition, Morgan Kaufmann, 2000

[4] Ian H. Witten, Alistair Moffat, Timothy C. Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images", 2nd edition, Morgan Kaufmann Publishers, 1999

[5] mattmahoney.net/dc/bpe2v2.cpp

[6] Jyotika Doshi, Savita Gandhi, "Quad-Byte Transformation as a Pre-processing to Arithmetic Coding", International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 12, December 2013, e-ISSN: 2278-0181