
The Google 5-grams used by this program are distributed as .csv files whose records follow the tab-separated layout shown in Figure 11 [8].

ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

Figure 11: Format of Google 5-grams used by this method (boldface indicates fields

used in database construction).
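As an illustration of the record layout in Figure 11, the following minimal sketch splits one line into typed fields. The names `NgramRecord` and `parse_ngram_line` and the sample values are hypothetical and chosen only for this example; they are not taken from the program described here.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One record of the 5-gram file, with fields named as in Figure 11."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_ngram_line(line: str) -> NgramRecord:
    """Split a TAB-separated, NEWLINE-terminated record into typed fields."""
    ngram, year, match_count, page_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count),
                       int(page_count), int(volume_count))


# Illustrative record only; the numbers are invented for the example.
record = parse_ngram_line("a little lamb fleece as\t1990\t200\t180\t150\n")
words = record.ngram.split(" ")
```

The five words of the gram are then available in `words`, and the numeric fields can be filtered or summed as needed during database construction.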

Uncompressed, these files total 254 GB, but pruning and the techniques described below reduce them by over 99%. The 5-grams are first converted from their default .csv format into a relational MySQL database, both to reduce size and to allow real-time access.

‘wordlist’ is a small table that holds the 176,436 unique individual words contained in the stored 5-grams, each paired with a unique integer ID. This table allows each string to be saved to disk only once regardless of how many times it is used [9]. For example, the word ‘the’ appears over 34 million times in the ‘fivegrams’ table, which would require more than 1.6 GB to store as strings. The ‘wordlist’ table itself is only roughly 8 MB in size. Figure 12 shows the structure of the ‘wordlist’ table.

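+-----------+------------------+------+-----+---------+-------+
| Field     | Type             | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+-------+
| word      | varchar(50)      | NO   | MUL | NULL    |       |
| Wordindex | int(10) unsigned | NO   | PRI | NULL    |       |
+-----------+------------------+------+-----+---------+-------+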

Figure 12: Structure of the ‘wordlist’ relational database table, which allows each word to be saved to disk only once regardless of how many times it is used in the stored 5-grams.

Tables 4 and 5 show a small example of how words are stored in the ‘wordlist’ table and referenced in the ‘fivegrams’ table.

Word  wordIndex
the   1
dog   2
and   3
cat   4

Table 4: Example of how words are stored in ‘wordlist’ as a string with a unique

integer ID.

The 5-gram “The dog and the cat” then would be stored as

Gram 1  Gram 2  Gram 3  Gram 4  Gram 5
1       2       3       1       4

Table 5: Example of how n-grams are stored in ‘fivegrams’. The row shown represents “the dog and the cat” by storing word IDs rather than strings.
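The mapping in Tables 4 and 5 can be sketched in a few lines. This is a minimal in-memory illustration, assuming IDs are assigned sequentially from 1 in order of first appearance; the function name `encode_ngram` is hypothetical, and the actual system stores these values in MySQL tables rather than a Python dictionary.

```python
def encode_ngram(words, wordlist):
    """Replace each word with its integer ID, adding unseen words to wordlist.

    wordlist maps a lowercase word string to its unique ID (assumption:
    IDs start at 1 and grow sequentially, as in Table 4).
    """
    ids = []
    for word in words:
        key = word.lower()
        if key not in wordlist:
            wordlist[key] = len(wordlist) + 1
        ids.append(wordlist[key])
    return tuple(ids)


wordlist = {}
row = encode_ngram("The dog and the cat".split(), wordlist)
```

Here `wordlist` ends up holding the four entries of Table 4 and `row` holds the five IDs of Table 5, so the string ‘the’ is stored once even though it occurs twice in the gram.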

‘fivegrams’ is a moderately sized table that contains 70,783,464 of the Google 5-grams and is roughly 14 GB in size. Extensive pruning was done during database construction to clean up the data, removing non-English symbols and other tokens that would either not occur in natural text or were altered by the digitization performed by Google Books. This is an intermediate table and is not required to run the final system. The structure of the ‘fivegrams’ table is shown in Figure 13.

+-----------+-----------------------+------+-----+---------+----------------+
| Field     | Type                  | Null | Key | Default | Extra          |
+-----------+-----------------------+------+-----+---------+----------------+
| Gram_ID   | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| w1        | int(10) unsigned      | NO   | MUL | NULL    |                |
| w2        | int(10) unsigned      | NO   | MUL | NULL    |                |
| w3        | int(10) unsigned      | NO   | MUL | NULL    |                |
| w4        | int(10) unsigned      | NO   | MUL | NULL    |                |
| w5        | int(10) unsigned      | NO   | MUL | NULL    |                |
| Frequency | mediumint(8) unsigned | NO   |     | NULL    |                |
| lastUsed  | mediumint(9)          | YES  |     | NULL    |                |
+-----------+-----------------------+------+-----+---------+----------------+

Figure 13: The ‘fivegrams’ SQL table structure to store all pruned 5-grams. This is

an intermediate table not required to run the final system.

For example, periods, commas and usually quotation marks are all treated as separate words in the n-grams. Because it was felt that grams containing these tokens would likely carry context from multiple sentences or thoughts, these grams were removed. Table 6 shows some of the many 5-grams that were removed during database construction.

Gram 1    Gram 2  Gram 3  Gram 4  Gram 5
!         !       !       "       now
.         and     yet     I       hope
'         ve      gone    .       "
'         •       one     of      the
(         will    as      they    believe
interior  .       in      this    way

Table 6: Sample of the 5-grams from the original Google dataset (eng-1M-5gram-20090715) that were removed through pruning during relational database construction.

Construction of this table was quite straightforward. Every word was first converted to lowercase; then only those 5-grams in which every gram contained at least one alphabetic character were added to the table.
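The pruning rule just described can be sketched as follows. This is an illustration only, not the code used to build the database; the function name `prune` is hypothetical, and `str.isalpha` accepts any Unicode letter, so a stricter ASCII-only test may be closer to the goal of removing non-English symbols.

```python
def prune(words):
    """Lowercase a 5-gram and keep it only if every gram has a letter.

    Returns the lowercased list of words, or None if the gram is rejected.
    Assumption: "alphabetic" is approximated with str.isalpha.
    """
    lowered = [word.lower() for word in words]
    if all(any(ch.isalpha() for ch in word) for word in lowered):
        return lowered
    return None


kept = prune(["The", "dog", "and", "the", "cat"])
dropped = prune(["interior", ".", "in", "this", "way"])  # a row from Table 6
```

Under this rule a single punctuation-only token is enough to reject the whole 5-gram, which is why every row of Table 6 is removed.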

During trials, another database was constructed with minimal pruning as a comparison case. It increased accuracy by only a minor amount while increasing the size significantly; for example, ‘wordlist’ became four times bigger on disk.

The second step was to create four individual feature tables to further reduce size and allow portability. The four features are the before bigram, after bigram, split bigram and important word. The four corresponding tables, ‘ct_bb’, ‘ct_ba’, ‘ct_bs’ and ‘ct_bis’, are approximately 2.4 GB combined. Together with the ‘wordlist’ table, these are the tables used when the program is in operation. Each table stores the ID of the queried word, the count for the specific word, the IDs of the two context words, and the local and global frequencies. Figure 14 shows the table structure used by each feature.

+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| w_index     | int(11) | NO   | PRI | NULL    |       |
| word_count  | int(11) | NO   | PRI | NULL    |       |
| w1          | int(11) | NO   | MUL | NULL    |       |
| w2          | int(11) | NO   | MUL | NULL    |       |
| local_freq  | int(11) | YES  |     | 0       |       |
| global_freq | int(11) | YES  |     | 0       |       |
+-------------+---------+------+-----+---------+-------+

Figure 14: The SQL table structure used by each of the four feature tables. It stores the two-word context of the feature (w1 and w2) and the local and global frequencies of each context. It is indexed by the query near-synonymous word (w_index) and the number of unique contexts (word_count).

Local and global frequencies, the last two fields shown in Figure 14, can be described as follows:

Local frequency is the observed frequency of this context within the ‘fivegrams’ table. For example, suppose the before bigram feature table (ct_bb) contained the context ‘a little’ for the word ‘lamb’. Its local frequency would equal the number of times the word ‘lamb’ was preceded by the bigram ‘a little’. If the following two 5-grams existed in the ‘fivegrams’ table, this feature would have a local frequency of 2.

“a little lamb fleece as”

“a little lamb spent each”

Global frequency denotes the frequency counts of n-grams as recorded by Google. This is

meant to represent how often this phrase was found in all the books that were scanned.

When two contexts are the same, the Google frequencies are simply added together.

“a little lamb fleece as” frequency=200

“a little lamb spent each” frequency=100

In this example, the context ‘a little’ before ‘lamb’ would have a global frequency of 300.
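The two frequencies can be computed together in one pass, as in the following sketch for the before bigram feature. It is an illustration under stated assumptions: the function name `before_bigram_counts` is hypothetical, the 5-grams are held as word strings rather than word IDs, and the target word is matched at any position that has two preceding words, which may differ from the positions the actual system considers.

```python
from collections import defaultdict


def before_bigram_counts(fivegrams, target):
    """Return (local, global) frequency dicts keyed by the two-word context.

    fivegrams is an iterable of (words, google_frequency) pairs.
    Local frequency counts matching rows; global frequency sums the
    Google frequencies of those rows.
    """
    local = defaultdict(int)
    global_freq = defaultdict(int)
    for words, frequency in fivegrams:
        for i in range(2, len(words)):
            if words[i] == target:
                context = (words[i - 2], words[i - 1])
                local[context] += 1
                global_freq[context] += frequency
    return dict(local), dict(global_freq)


grams = [
    ("a little lamb fleece as".split(), 200),
    ("a little lamb spent each".split(), 100),
]
local, global_freq = before_bigram_counts(grams, "lamb")
```

For the two 5-grams above, the context (‘a’, ‘little’) receives a local frequency of 2 and a global frequency of 300, matching the worked example.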
