string Module - Operations for Text Processing

Operations for Text Processing

7.8 string Module

With the ASCII character set as the basis, a few character sets have been defined in the module string. They can be of use in text processing; the module can be imported and any set within it accessed as string.xx. The details are summarized in Table 7.4.
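As a brief illustration of how these sets are accessed, the following sketch imports the module and uses two of its constants. The helper name only_letters is hypothetical and is introduced here only for the example.

```python
import string

# The constants are ordinary str objects, accessed as string.<name>.
print(string.ascii_lowercase)   # the 26 lowercase ASCII letters
print(string.hexdigits)         # 0123456789abcdefABCDEF


def only_letters(text):
    """Keep only the ASCII letters of text (hypothetical helper)."""
    return "".join(ch for ch in text if ch in string.ascii_letters)


print(only_letters("Holidays, 2017!"))  # Holidays
```

Since each constant is a plain string, membership tests with the in operator are all that is needed to filter or classify characters.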

[Figure: conversions among integers (decimal, hex, binary, octal), floating point numbers, strings, characters, bytes, and bytearrays, using int() (string in any base up to 36), bin(), oct(), hex(), to_bytes(), from_bytes(), bytes.fromhex(), bytearray.fromhex(), ord(), chr(), (number).hex(), float.fromhex(), .encode(), and .decode().]

Fig. 7.14 A compact representation of the conversion possibilities between numbers and sequences in Python
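A few of the conversions summarized in the figure can be tried out as in the sketch below. The values used are arbitrary examples. Python's int type provides to_bytes() and from_bytes(); a bytearray can be obtained by passing the resulting bytes object to bytearray().

```python
n = 255

# Integer <-> string in a given base
print(bin(n), oct(n), hex(n))        # 0b11111111 0o377 0xff
print(int("ff", 16), int("zz", 36))  # 255 1295

# Integer <-> bytes / bytearray
b = n.to_bytes(2, "big")             # two bytes, most significant first
print(b, bytearray(b))
print(int.from_bytes(b, "big"))      # 255

# Hex string -> bytes / bytearray
print(bytes.fromhex("48 6f"))        # b'Ho'
print(bytearray.fromhex("486f"))     # bytearray(b'Ho')

# Character <-> code point
print(ord("A"), chr(65))             # 65 A

# Float <-> hex string
print((0.5).hex())                   # 0x1.0000000000000p-1
print(float.fromhex("0x1.8p1"))      # 3.0

# String <-> bytes
print("text".encode("ascii").decode("ascii"))  # text
```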


7.9 Exercises

1. The string SS = ‘Holidays’ is given. Center it, filling it with four ‘*’ characters on either side followed by five ‘@’ characters.

2. The string SS = ‘Holidays’ is given. Center it, filling it with ten copies of ‘*@’ on the left side and an equal number of ‘@*’ on the right.

3. The string SS = ‘Holidays’ is given. Center it, filling it with ten ‘*’ characters on the left side, and five ‘@’ characters followed by six ‘^’ characters on the right.
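The string method center(width, fillchar) is the natural starting point for the three exercises above. The sketch below shows one possible reading of Exercise 1 (four '*' on either side, then five '@' outside them); that reading is an assumption. Note that fillchar must be exactly one character, so multi-character patterns such as '*@' have to be built by string repetition and concatenation instead.

```python
SS = "Holidays"

# Width = len(SS) + 2 * 4, so four '*' are added on either side.
step1 = SS.center(len(SS) + 8, "*")

# A second call wraps the result with five '@' on either side.
step2 = step1.center(len(step1) + 10, "@")

print(step1)  # ****Holidays****
print(step2)  # @@@@@****Holidays****@@@@@
```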

4. Round off the numbers considered in Example 7.5 to 12 decimal places. Let N12 be such a rounded number. Obtain N11 from it by rounding to 11 decimal places. Similarly obtain N10 from N11, N9 from N10, and so on. Do this for both the numbers and explain any anomaly.

In classical cryptography, encryption, decryption, and cryptanalysis are all done using simple algebra with characters and their numerical representations. The following exercises relate to classical cryptography (Shyamala et al. 2011).

5. Take a long enough text (about 10,000 characters). If all the punctuation marks and white spaces in it (comma, full stop, colon, question mark, blank space, etc.) are removed and all capital letters are converted to small letters, we are left with a continuous sequence of small letters. Such a sequence is called a ‘plain text’ in cryptography parlance. With some effort a plain text can be converted back to (almost) the original text we started with. Write a program to convert a given text to plain text, and use it to prepare the plain text for the text we started with.
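One way to prepare a plain text is sketched below. The function name to_plain_text is hypothetical. The sketch assumes that everything other than the 26 ASCII letters (digits included) is to be dropped, which goes slightly beyond the punctuation listed in the exercise.

```python
import string


def to_plain_text(text):
    """Return text reduced to a run of lowercase ASCII letters."""
    return "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)


sample = "Holidays, at last: are we there yet?"
print(to_plain_text(sample))  # holidaysatlastarewethereyet
```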

6. Obtain the frequencies of all the 26 letters in the above plain text through a program for it.

7. The letter pairs ‘th’, ‘ht’, ‘in’, ‘on’, ‘gh’, … occur more commonly in normal English text. These are called ‘bigrams’. Write a program to get the frequencies of all the bigrams and retain the data for the most frequent 20 of them. Get the most frequent 20 bigrams for the above plain text.

Table 7.4 Constants deﬁned in the string module

Item Contents

ascii_lowercase ‘abcdefghijklmnopqrstuvwxyz’

ascii_uppercase ‘ABCDEFGHIJKLMNOPQRSTUVWXYZ’

ascii_letters ascii_lowercase + ascii_uppercase

digits ‘0123456789’

hexdigits ‘0123456789abcdefABCDEF’

octdigits ‘01234567’

punctuation Punctuation character set

whitespace Space, tab, linefeed, return, formfeed, and vertical tab

printable digits, ascii_letters, punctuation, and whitespace

8. A letter set of three like ‘ght’, ‘ion’, ‘the’ is called a ‘trigram’. Write a program to get the frequencies of all the trigrams and retain the data for the most frequent 20 of them. Get the most frequent 20 trigrams for the above plain text.
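Letter, bigram, and trigram frequencies differ only in the length of the letter group counted, so a single routine can serve all three. The following is a minimal sketch using collections.Counter; the function name ngram_counts and the short sample string are made up for illustration, and overlapping groups are counted (an assumption about what the exercises intend).

```python
from collections import Counter


def ngram_counts(plain, n):
    """Count every run of n consecutive letters in plain (overlapping)."""
    return Counter(plain[i:i + n] for i in range(len(plain) - n + 1))


plain = "thethingthatthethiefthought"
print(ngram_counts(plain, 1).most_common(3))   # single letters
print(ngram_counts(plain, 2).most_common(3))   # bigrams
print(ngram_counts(plain, 3).most_common(3))   # trigrams
```

For the exercises, most_common(20) retains the 20 most frequent entries.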

9. The normalized letter frequencies form the probabilities of occurrence of the respective letters. With $p_i$ as the probability of occurrence of the $i$th letter, $\sum_i p_i^2$ is called ‘the Index of Coincidence (IC)’. The IC values for general texts in different languages are known. For English texts IC = 0.0655; in contrast, for a completely random text it has the value of 0.0385. Write a program to get the IC value and get it for the given text.
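The IC as defined above is the sum of the squares of the normalized letter frequencies. A minimal sketch follows; the function name index_of_coincidence is an assumption. As a check, a text in which all 26 letters are equally likely gives 26 × (1/26)² = 1/26, which is the value 0.0385 quoted for random text.

```python
import string
from collections import Counter


def index_of_coincidence(plain):
    """Return the sum of squared relative letter frequencies of plain."""
    total = len(plain)
    counts = Counter(plain)
    return sum((c / total) ** 2 for c in counts.values())


# Each of the 26 letters occurs exactly once: IC = 1/26.
print(round(index_of_coincidence(string.ascii_lowercase), 4))  # 0.0385
```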

Armed with the letter frequencies, knowledge of the dominant bigrams and trigrams, and the IC values one should be able to do cryptanalysis of most of the common conventional ciphers.

10. The ‘Substitution Cipher’ uses a look-up table (LUT) to substitute every letter in the plain text with the one in the LUT to generate the cipher text. Write a program to generate the cipher text from the plain text using the LUT. Use it to get the cipher text for the given plain text.
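A look-up table of this kind maps directly onto str.maketrans() and str.translate(). In the sketch below the table LUT is an arbitrary rearrangement of the alphabet chosen only for illustration; any permutation of the 26 letters would serve.

```python
import string

# Hypothetical look-up table: a fixed rearrangement of the alphabet.
LUT = "qwertyuiopasdfghjklzxcvbnm"

encrypt_table = str.maketrans(string.ascii_lowercase, LUT)
decrypt_table = str.maketrans(LUT, string.ascii_lowercase)

plain = "holidays"
cipher = plain.translate(encrypt_table)
print(cipher)                           # igsorqnl
print(cipher.translate(decrypt_table))  # holidays
```

Building the reverse table from the same LUT gives decryption for free, which is convenient when checking the cryptanalysis exercise that follows.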

11. A cipher text obtained using the Substitution Cipher is given. One can get its letter frequencies, compare them with those of English text, and identify the substitution used for the most common letters like ‘e’, ‘s’, ‘t’, etc. Similarly one can identify the substitution used for the letters with the least frequencies like ‘z’, ‘q’, ‘x’, etc. Still some indecision remains. The most common bigrams and trigrams can be identified and compared. With these the substitution used for many of the letters can be identified. Armed with these and our familiarity with common English words (eng?ish → english, re?ain → remain, …), additional identification can be done. Identifying the plain text in this manner constitutes ‘Cryptanalysis’. Obtain the cipher text with a substitution cipher. Do cryptanalysis and retrieve the plain text (with readily available programs and a cipher text of about 300 letters, the exercise may take a few hours of effort).

12. With ‘a’, ‘b’, ‘c’, …, ‘z’ represented by 1, 2, 3, …, 26, the Affine Cipher uses the relation y = (ax + b) % 26 to substitute the letter represented by integer x by the letter represented by integer y. a and b are integers (the two together form the encryption/decryption ‘key’), with the constraint on a that its only common factor with 26 is one. Write a program to get the cipher text for a given plain text with the Affine Cipher (for given a and b values).
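The relation above can be sketched as follows. Two points are assumptions of this sketch and are not stated in the exercise: the function name affine_encrypt, and the treatment of y = 0, which is read here as 26 (the letter 'z') so that every result maps to a letter. The key values a = 5 and b = 8 are arbitrary.

```python
import string
from math import gcd

LETTERS = string.ascii_lowercase


def affine_encrypt(plain, a, b):
    """Encrypt lowercase plain text with y = (a*x + b) % 26."""
    if gcd(a, 26) != 1:
        raise ValueError("a must have no common factor with 26 other than 1")
    out = []
    for ch in plain:
        x = LETTERS.index(ch) + 1          # 'a' -> 1, ..., 'z' -> 26
        y = (a * x + b) % 26
        # Assumption: y == 0 stands for 26; index -1 then selects 'z'.
        out.append(LETTERS[y - 1])
    return "".join(out)


print(affine_encrypt("holidays", a=5, b=8))  # vepabmcy
```

The gcd() check enforces the constraint on a; without it two different letters could map to the same cipher letter and decryption would be impossible.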

13. The Affine Cipher is a special case of a Substitution Cipher. For a given cipher text one can obtain the letter frequencies; by comparison with the known frequencies of common texts, a few of the most dominant letters can be identified. By substitution in the equation y = (ax + b) % 26 the same can be confirmed. Do cryptanalysis of the crypto text in Exercise (12) above.

14. With a = 1, the Affine Cipher becomes a ‘Shift Cipher’. The ‘Vigenere Cipher’ is a generalized version of the ‘Shift Cipher’. It uses a set of m key values $\{b_1, b_2, \ldots, b_m\}$. The plain text is split into successive blocks of m letters each (normally m will be a single-digit integer). The first letter of each block is shifted by $b_1$, the second by $b_2$, and so on up to the mth letter (by $b_m$). Do this successively for all the blocks. This completes encryption. Prepare a program to do encryption conforming to the Vigenere Cipher. Get the cipher text for the given plain text.
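The block-wise shifting described above amounts to shifting the letter at position i by the key value at position i modulo m. A minimal sketch follows; the function names and the key [1, 2, 3] are illustrative assumptions, and letters are indexed from 0 here since only the shift matters.

```python
import string

LETTERS = string.ascii_lowercase


def vigenere_encrypt(plain, key):
    """Shift the i-th letter of plain by key[i % m], where m = len(key)."""
    m = len(key)
    return "".join(
        LETTERS[(LETTERS.index(ch) + key[i % m]) % 26]
        for i, ch in enumerate(plain)
    )


def vigenere_decrypt(cipher, key):
    """Undo vigenere_encrypt by applying the negated shifts."""
    return vigenere_encrypt(cipher, [-b for b in key])


cipher = vigenere_encrypt("holidays", [1, 2, 3])
print(cipher)                               # iqojfdzu
print(vigenere_decrypt(cipher, [1, 2, 3]))  # holidays
```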

15. Cryptanalysis of the Vigenere Cipher is a more challenging affair. One has to identify the value of m first and then the set $\{b_1, b_2, \ldots, b_m\}$. The IC concept can be used to identify the m value. With $c_i$ as the $i$th letter in the cipher text, the sub-sequence of letters $\{c_1, c_{1+m}, c_{1+2m}, \ldots\}$ forms a Shift Cipher type crypto-text with $b_1$ as the shift. It will have the characteristics of a normal text; its IC value will be close to that of plain text (0.0655). The same is true of the other (m − 1) sub-sequences. With different values of m (2, 3, 4, …), form the sub-sequences $\{c_1, c_{1+m}, c_{1+2m}, \ldots\}$. Get the character frequencies and the IC values.

The m-value which yields the IC closest to 0.0655 is the correct one. The procedure can be repeated with successive sub-sequences to confirm the m-value. Once the m-value is identified, with each of the m separate sub-sequences the procedure in Exercise (13) above can be used to get the set $\{b_1, b_2, \ldots, b_m\}$.

With the m value and the full set $\{b_1, b_2, \ldots, b_m\}$ known, the plain text can be recovered. For the cipher text in the last exercise, do cryptanalysis and retrieve the plain text (with all the programs available, cryptanalysis and plain text retrieval may take a few hours).
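The search for m can be sketched as below. The function names and the upper limit max_m are assumptions. Averaging the IC over all m sub-sequences, rather than inspecting only the first one, is a choice made in this sketch to reduce noise. Multiples of the true m also give IC values near 0.0655, so the result should be treated as a candidate to be confirmed.

```python
from collections import Counter


def index_of_coincidence(text):
    """Sum of squared relative letter frequencies of text."""
    total = len(text)
    return sum((c / total) ** 2 for c in Counter(text).values())


def average_ic(cipher, m):
    """Mean IC of the m sub-sequences cipher[k], cipher[k+m], ..."""
    return sum(index_of_coincidence(cipher[k::m]) for k in range(m)) / m


def guess_key_length(cipher, max_m=9):
    """Return the trial m whose mean IC is closest to 0.0655."""
    return min(range(1, max_m + 1),
               key=lambda m: abs(average_ic(cipher, m) - 0.0655))
```

The slice cipher[k::m] picks out every mth letter starting at position k, which is exactly the sub-sequence described in the exercise.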

16. Huffman Coding: One of the earliest schemes of lossless data compression was proposed by Huffman (Forouzan 2013). We shall go through a simpliﬁed version of the scheme. A data transmission scheme uses a set of four symbols {a, b, c, d} with probabilities of occurrence {0.45, 0.3, 0.15, 0.1} respectively.

The Huffman scheme for the set follows:

The symbols are arranged in descending order of probabilities. The most probable symbol is assigned the code value 0—a single bit. For the rest the ﬁrst bit is taken as 1. The second most probable symbol is assigned the second bit value—0 and its code is 10. For the rest the second bit is assigned the value 1; a third bit is also assigned to them with values of 0 for the more probable one and value of 1 for the less probable one respectively (see Fig. 7.15).

Fig. 7.15 Huffman coding scheme for the example in Exercise 7.16 (symbols a, b, c, d with probabilities 0.45, 0.30, 0.15, 0.10 and codes 0, 10, 110, 111)

The average number of bits per symbol is 0.45 × 1 + 0.3 × 2 + 0.15 × 3 + 0.1 × 3 = 1.8, a conspicuous gain over the value of 2 with brute force (fixed-length) encoding.

The general algorithm for assigning codes is as follows:

a. Arrange the symbols in descending order of probabilities of occurrence. The last symbol is the least probable one. Each symbol is assigned a node.

b. Combine the two least probable nodes into one node having the combined probability value.

c. If the number of nodes left is one, the ‘coding tree’ is complete. Else go to step (a).

Assign the bit value 0 (code value = 0) to the top node. Subsequent bit values and code values are assigned as in Fig. 7.15.
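The merging steps above can be sketched with a priority queue (heapq), which keeps the least probable nodes at the front so that no explicit re-sorting is needed. The function name huffman_codes is an assumption. The choice of which branch gets 0 at each merge is free in general; this sketch gives 0 to the branch containing the symbol that stands earlier in the descending order, a rule adopted here because it reproduces the codes of the four-symbol example.

```python
import heapq


def huffman_codes(probabilities):
    """Return {symbol: bit string} for a {symbol: probability} mapping."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    # Heap entry: (probability, best rank inside the node, partial codes).
    heap = [(probabilities[s], rank, {s: ""}) for rank, s in enumerate(ranked)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, r1, n1 = heapq.heappop(heap)   # least probable node
        p2, r2, n2 = heapq.heappop(heap)   # next least probable node
        # The branch holding the earlier-ranked symbol gets bit 0.
        zero, one = (n1, n2) if r1 < r2 else (n2, n1)
        merged = {s: "0" + c for s, c in zero.items()}
        merged.update({s: "1" + c for s, c in one.items()})
        heapq.heappush(heap, (p1 + p2, min(r1, r2), merged))
    return heap[0][2]


probs = {"a": 0.45, "b": 0.3, "c": 0.15, "d": 0.1}
codes = huffman_codes(probs)
print(codes)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(sum(p * len(codes[s]) for s, p in probs.items()))  # about 1.8
```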

Decoding the received bit sequence into symbols proceeds in the reverse sequence. With each succeeding bit, identify the branches and nodes until a symbol is identified. Once this is done, start all over again for identification of the next symbol.
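Because no code value is the leading part of another, decoding can be done by collecting bits until they match a code. The sketch below uses the code table of the four-symbol example; the function names encode and decode and the test message are assumptions. A dictionary look-up stands in for walking the branches of the coding tree.

```python
# Code table from the four-symbol example in the text.
CODES = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(message, codes):
    """Concatenate the code values of the symbols in message."""
    return "".join(codes[sym] for sym in message)


def decode(bits, codes):
    """Read bits one at a time until the collected bits match a code."""
    lookup = {code: sym for sym, code in codes.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:      # reached a leaf of the coding tree
            symbols.append(lookup[current])
            current = ""           # start over for the next symbol
    return "".join(symbols)


bits = encode("abacad", CODES)
print(bits)                 # 01001100111
print(decode(bits, CODES))  # abacad
```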

a. Write a Python program to assign code values to the given set of symbols, knowing their probabilities.

b. Write a Python program to produce the bit sequence given the message symbol sequence.

c. Write a Python program to decode the encoded bit sequence and produce the message symbol sequence.

d. A Notepad file is given. It is made up of ASCII characters. Prepare the table of symbols and their probabilities.

e. For the file in (d) above, do coding and decoding.

17. Arithmetic coding is an efficient scheme of lossless compression of data (Forouzan 2013). Operation of a simplified form of arithmetic coding is explained here through an example.

A message sequence is made up of the four symbols ‘A’, ‘B’, ‘C’, and ‘D’. An additional symbol ‘E’ is used as the last one to indicate the end of the message sequence. The probability of occurrence of each symbol is specified beforehand. Table 7.5 gives the assigned probability values. Symbol ‘E’ is arbitrarily assigned the (nominal, very low) probability of 0.05. The table also has the list of cumulative probability ranges. The encoding process is explained here with reference to the message sequence ‘BCADDBE’. Figure 7.16 depicts the procedure.

Table 7.5 The symbols, their probability values, and the cumulative probability values for the Example in Exercise 7.17

Symbol  Probability  Cumulative probability range

A  0.1  0.0–0.1

B  0.3  0.1–0.4

C  0.2  0.4–0.6

D  0.35  0.6–0.95

E  0.05  0.95–1.0


The symbol sequence is identified by its probability, and the probability value forms the basis to decide the code to be assigned to it. The first symbol ‘B’ is assigned the probability range 0.1–0.4 (P₁Q₁), as shown in the first line in the figure.

The second symbol ‘C’ has the absolute probability range 0.4–0.6. Hence the first and the second symbols together are assigned the absolute probability range within the (P₁Q₁) band as 0.1 + (0.4 − 0.1) × 0.4 to 0.1 + (0.4 − 0.1) × 0.6, that is 0.22–0.28, shown blown up in the second line. This range is represented by (P₂Q₂) in line 2 in the figure.

The third symbol ‘A’ has the absolute probability range 0.0–0.1. Hence the sequence ‘BCA’ is assigned the probability range 0.22 to 0.22 + (0.28 − 0.22) × 0.1, that is 0.22–0.226, shown blown up in the third line. This range is represented by (P₃Q₃) in line 3 in the figure.

Proceeding successively in the same vein, the probability range formation for the full sequence ‘BCADDBE’ is shown in the figure. Finally the sequence has the specific probability range 0.225142975–0.225154 assigned to it. The corresponding binary range is 0.0011100110100010111110000101001010001100111 to 0.0011100110100011101100010010101010010000101; any binary value within this range can be used to uniquely represent this sequence. The first 15 fractional bits, 001110011010001, are common to both limits, and one further bit gives 0.0011100110100011 (about 0.2251434), which lies within the range and suffices here. The additional bits are discarded since they do not add any information of interest to us here.

The code for the sequence ‘BCADDBE’ is generated from the binary value of the probability for it. It involves two changes:

a. Truncate the number of bits at a point where the value has crossed the point P7 in Fig. 7.16, signifying that the next symbol in the sequence is ‘E’ itself, that is, the sequence has ended (this has been done above).

b. Ignore the ‘0.’ part of the probability and use only the rest of the bit sequence. ‘0.’ is superfluous and does not add any information to the sequence.

Any source sequence of characters from the set in Table 7.5 can be encoded in the same manner. The encoding algorithm is summarized as follows:

a. Start with the table of probabilities and cumulative probabilities.

b. Identify the probability range $(P_1Q_1)$ for the first character.

c. Let the probability range $(P_{i-1}Q_{i-1})$ be $(D_{i-1,s}, D_{i-1,e})$, where ‘s’ and ‘e’ signify the Start and End of the range.

d. The $j$th character in the table has the absolute cumulative probability range $(C_{j-1}, C_j)$.

e. For all $i$ from 2 onwards up to the last character ($n$th) in the source sequence, do the following recursively:

f. Let $S_i$ be the $i$th character in the sequence and $s_i$ its position in the table. We have the recursive relations $D_{i,s} = D_{i-1,s} + (D_{i-1,e} - D_{i-1,s})\,C_{s_i-1}$ and $D_{i,e} = D_{i-1,s} + (D_{i-1,e} - D_{i-1,s})\,C_{s_i}$. Update the probability range for the character sequence up to and inclusive of $S_i$ using these.

g. With a total of $n$ characters in the sequence, $(D_{n,s}, D_{n,e})$ is the probability range representing the sequence up to and including the last character (‘E’). Truncate its binary value such that the truncated value lies within the $(D_{n,s}, D_{n,e})$ range. Remove the ‘0.’ part of the probability value of the truncated number to get the code for the sequence.

h. Successive characters in the source sequence affect only the trailing bits of the probability being evaluated. Hence the leading bits of the coded sequence can be progressively taken out from the left end and added to the code as soon as they stabilize in value.
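Steps (a) to (g) can be sketched as follows. Exact fractions are used so that the narrowing ranges are not disturbed by floating point rounding; that choice, and the names RANGES, final_range, and shortest_code, are assumptions of this sketch. The truncation in step (g) is done here by searching for the shortest bit string whose value lies inside the final range, which can be one bit longer than the leading part common to the two limits. Step (h), the progressive release of stable bits, is not attempted.

```python
from fractions import Fraction as F

# Cumulative probability ranges from Table 7.5.
RANGES = {
    "A": (F("0"), F("0.1")),
    "B": (F("0.1"), F("0.4")),
    "C": (F("0.4"), F("0.6")),
    "D": (F("0.6"), F("0.95")),
    "E": (F("0.95"), F("1")),
}


def final_range(message):
    """Narrow (start, end) once per symbol, as in the recursive relations."""
    start, end = F(0), F(1)
    for sym in message:
        low, high = RANGES[sym]
        width = end - start
        start, end = start + width * low, start + width * high
    return start, end


def shortest_code(start, end):
    """Shortest bit string b1 b2 ... with start <= 0.b1b2... < end."""
    nbits = 1
    while True:
        # Smallest multiple of 2**-nbits that is not below start.
        k = -((-start.numerator * 2 ** nbits) // start.denominator)
        if F(k, 2 ** nbits) < end:
            return format(k, "0{}b".format(nbits))
        nbits += 1


start, end = final_range("BCADDBE")
print(float(start), float(end))   # 0.225142975 0.225154
print(shortest_code(start, end))  # 0011100110100011
```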

The decoder algorithm is as follows:

a. Prefix the received sequence with ‘0.’ to form the cumulative probability $p_c$ of the sequence.

c. Subtract the cumulative probability $D_{1,s}$ represented by $P_1$ from $p_c$ to form $(p_c - D_{1,s})$.

d. Continue the procedure for encoding recursively in the reverse order until the end-of-message symbol ‘E’ is identified. This completes decoding/decompression.

Fig. 7.16 Arithmetic coding procedure for the message sequence ‘BCADDBE’

Prepare programs for encoding and decoding conforming to the above procedures. Test them with typical sequences.
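A decoder along these lines is sketched below. After each symbol is identified, its starting value is subtracted and the remainder is rescaled to the full 0 to 1 range, so that the same table look-up can be reused for the next symbol; this rescaling is one way of running the encoding procedure in reverse and is an assumption of the sketch, as are the names RANGES and decode and the safety limit max_symbols. The bit string used is 16 bits long and its value, about 0.2251434, lies inside the final range 0.225142975–0.225154 obtained for ‘BCADDBE’.

```python
from fractions import Fraction as F

# Cumulative probability ranges from Table 7.5.
RANGES = {
    "A": (F("0"), F("0.1")),
    "B": (F("0.1"), F("0.4")),
    "C": (F("0.4"), F("0.6")),
    "D": (F("0.6"), F("0.95")),
    "E": (F("0.95"), F("1")),
}


def decode(bits, end_symbol="E", max_symbols=1000):
    """Recover the symbols from the bit string of an arithmetic code."""
    # Prefixing '0.' to the bits is the same as dividing by 2**len(bits).
    value = F(int(bits, 2), 2 ** len(bits))
    message = []
    for _ in range(max_symbols):
        for sym, (low, high) in RANGES.items():
            if low <= value < high:
                break
        message.append(sym)
        if sym == end_symbol:
            return "".join(message)
        # Remove this symbol's contribution and rescale to the range 0-1.
        value = (value - low) / (high - low)
    raise ValueError("end symbol not found")


print(decode("0011100110100011"))  # BCADDBE
```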

18. Use the random.choice() function from the random module to generate sequences of 10, 20, and 30 characters. Use these to test the above two programs.
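Test sequences can be generated as in the sketch below. The function name random_message is an assumption, as is the decision to append the end symbol ‘E’ after the requested number of characters. random.choice() picks the symbols with equal likelihood; random.choices() with its weights argument can be used instead if the sequences should follow the probabilities of Table 7.5.

```python
import random

SYMBOLS = "ABCD"


def random_message(length, rng=random):
    """Return `length` symbols drawn uniformly from SYMBOLS, then 'E'."""
    return "".join(rng.choice(SYMBOLS) for _ in range(length)) + "E"


for length in (10, 20, 30):
    print(random_message(length))
```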

19. As long as the models used for encoding and decoding are identical the basic procedure for arithmetic coding can be modified/simplified/made more optimal/efficient in different ways. A few such modifications are suggested below which can be tried:

a. The source sequence can be split into sub-sequences of fixed length (say 100 characters each). With this, the need for the end-of-sequence character ‘E’ can be eliminated. The last sub-sequence can be padded with known dummy characters to make up its length.

b. Instead of ‘E’ a known small sequence of characters (a rarely occurring combination) can be used to signify the end of sequence.

c. The probability table can be updated at regular intervals using the information from the sequence itself. This makes the scheme more optimal.

d. A ternary sequence can be used in place of the binary sequence; this may be better suited for transmission schemes which use three voltage levels (+V, 0, −V) for signaling.

e. One adaptation of arithmetic coding uses the following procedure:

Obtain the frequencies of all the characters in the source file. With n distinct characters in the source file, prepare a table of n entries for the character set, the characters being arranged in descending order of their frequencies, the most frequent character being in the first row. Here each code value is $\lceil \log_2 n \rceil$ bits long. For the most frequent 15 characters use 0h, 1h, 2h, …, Eh as the code values, leaving out Fh. For all the rest use a different coding table starting with 0h. Assign code values afresh for the second lot, prefixing each value with F. Thus the full code comprises two coding tables: one for the most common 15 characters (each of 4 bits) and the other for the rest (all starting with F). The encoding table is prefixed to the coded sequence. The encoding is efficient only if the latter set has conspicuously low probabilities.

f. An adaptation of arithmetic coding works directly on the binary sequence to be encoded. The source file is split into 12-bit blocks. The frequencies of all the 2¹² possible blocks are obtained. They are arranged in descending order and code values assigned as in (e) above.

20. On the lines discussed in Exercises 2–4 in Chap. 5, write a program to convert a number in a given base to one in another base. The function int(a, base=b) has to be used in a functional loop for this. The base here can be any integer up to 36. Convert a number from one base to another and do the reverse to verify the correctness.
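One way to organize the conversion is to read the digit string with int(a, base=b) and then write the resulting integer out in the target base by repeated division. The sketch below follows that route; the names DIGITS, to_base, and convert are assumptions, lowercase letters are used for digit values 10 to 35, and only non-negative numbers are handled.

```python
import string

DIGITS = string.digits + string.ascii_lowercase   # 36 digit symbols


def to_base(n, base):
    """Return the non-negative integer n as a string in the given base."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, base)     # r is the next least significant digit
        out.append(DIGITS[r])
    return "".join(reversed(out))


def convert(text, from_base, target_base):
    """Convert the digit string text from one base to another."""
    return to_base(int(text, base=from_base), target_base)


print(convert("ff", 16, 2))        # 11111111
print(convert("11111111", 2, 16))  # ff
```

Converting forward and then back, as in the two calls above, gives the verification asked for in the exercise.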

References

Forouzan B (2013) Data communications and networking, 5th edn. McGraw Hill, New York

Original UTF-8 paper. (http://doc.cat-v.org/plan_9/4th_edition/papers/utf)

Padmanabhan TR (2007) Introduction to microcontrollers and their applications. Alpha Science International Ltd, Oxford

Shyamala CK, Harini N, Padmanabhan TR (2011) Cryptography and security. Wiley India, New Delhi

The Unicode Standard: A Technical Introduction. (http://www.unicode.org/standard/principles.html)

van Rossum G, Drake FL Jr (2014) The Python library reference. Python Software Foundation


In document Programming with Python [2017].pdf (Page 174-183)