The Google 5-grams used by this program are distributed as .csv files whose records are
tab-separated, with the layout shown in Figure 11 [8].
ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
Figure 11: Format of Google 5-grams used by this method (boldface indicates fields
used in database construction).
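As a rough illustration of the record layout in Figure 11, one line can be split on tabs into typed fields. This is a minimal sketch, not the program's actual loader: the names `NgramRecord` and `parse_line` are hypothetical, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One record of the layout in Figure 11 (hypothetical helper type)."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_line(line: str) -> NgramRecord:
    """Split one tab-separated record and convert the numeric fields."""
    ngram, year, match_count, page_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count),
                       int(page_count), int(volume_count))


# Made-up sample line, for illustration only.
record = parse_line("the dog and the cat\t1999\t12\t11\t9\n")
words = record.ngram.split(" ")  # the five individual grams
```

Splitting the `ngram` field on spaces yields the five tokens that the later tables store individually.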
Uncompressed, these files total 254 GB, but pruning and the techniques described below
reduce this by over 99%. The 5-grams are first converted from their default .csv format
into a relational MySQL database, both to reduce size and to allow real-time access.
‘wordlist’ is a small table that assigns a unique integer ID to each of the 176,436
unique words contained in the stored 5-grams. It allows each string to be saved to disk
only once, regardless of how many times it is used [9]. For example, the word ‘the’
appears over 34 million times in the ‘fivegrams’ table; storing it as a string each time
would require more than 1.6 GB, whereas the entire ‘wordlist’ table is only roughly 8 MB
in size. Figure 12 shows the structure of the ‘wordlist’ table.
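+-----------+------------------+------+-----+---------+-------+
| Field     | Type             | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+-------+
| word      | varchar(50)      | NO   | MUL | NULL    |       |
| Wordindex | int(10) unsigned | NO   | PRI | NULL    |       |
+-----------+------------------+------+-----+---------+-------+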
Figure 12: The ‘wordlist’ relational database table structure, which allows each word
to be saved to disk only once regardless of how many times it is used in the
stored 5-grams.
Tables 4 and 5 show a small example of how words are stored in ‘wordlist’ and
referenced in the ‘fivegrams’ table.
Word wordIndex
the 1
dog 2
and 3
cat 4
Table 4: Example of how words are stored in ‘wordlist’ as a string with a unique
integer ID.
The 5-gram “The dog and the cat” would then be stored as:
Gram 1 Gram 2 Gram 3 Gram 4 Gram 5
1 2 3 1 4
Table 5: Example of how n-grams are stored in ‘fivegrams’: the 5-gram “the dog and
the cat” is stored as word IDs rather than strings.
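The mapping in Tables 4 and 5 can be sketched in a few lines. This is an illustrative sketch only, not the program's construction code: `build_wordlist` and `encode` are hypothetical names, and IDs are assumed to be assigned in order of first appearance, starting at 1.

```python
def build_wordlist(grams):
    """Assign each distinct word a unique integer ID, in order of first appearance."""
    wordlist = {}
    for gram in grams:
        for word in gram:
            if word not in wordlist:
                wordlist[word] = len(wordlist) + 1  # IDs start at 1, as in Table 4
    return wordlist


def encode(gram, wordlist):
    """Replace each word string in a gram with its integer ID."""
    return tuple(wordlist[word] for word in gram)


grams = [("the", "dog", "and", "the", "cat")]
wordlist = build_wordlist(grams)
encoded = encode(grams[0], wordlist)

print(wordlist)  # {'the': 1, 'dog': 2, 'and': 3, 'cat': 4}
print(encoded)   # (1, 2, 3, 1, 4)
```

Each string is held once in the dictionary, while every occurrence in a gram costs only one integer; the relational tables apply the same idea.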
‘fivegrams’ is a moderately sized table that contains 70,783,464 of the Google 5-grams
and is roughly 14 GB in size. Extensive pruning was done during database construction
to clean up the data and to remove non-English symbols and other tokens that either
would not occur in natural text or were altered by the digitization done by Google
Books. This is an intermediate table and is not required to run the final system. The
structure of the ‘fivegrams’ table is shown in Figure 13.
+-----------+-----------------------+------+-----+---------+----------------+
| Field     | Type                  | Null | Key | Default | Extra          |
+-----------+-----------------------+------+-----+---------+----------------+
| Gram_ID   | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| w1        | int(10) unsigned      | NO   | MUL | NULL    |                |
| w2        | int(10) unsigned      | NO   | MUL | NULL    |                |
| w3        | int(10) unsigned      | NO   | MUL | NULL    |                |
| w4        | int(10) unsigned      | NO   | MUL | NULL    |                |
| w5        | int(10) unsigned      | NO   | MUL | NULL    |                |
| Frequency | mediumint(8) unsigned | NO   |     | NULL    |                |
| lastUsed  | mediumint(9)          | YES  |     | NULL    |                |
+-----------+-----------------------+------+-----+---------+----------------+
Figure 13: The ‘fivegrams’ SQL table structure to store all pruned 5-grams. This is
an intermediate table not required to run the final system.
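To illustrate how the two tables work together, the sketch below recreates simplified versions of ‘wordlist’ and ‘fivegrams’ and joins them to recover the original strings. It is only an approximation under stated assumptions: Python's built-in SQLite stands in for MySQL, so the column types and indexes differ from Figures 12 and 13, and the `Frequency` and `lastUsed` values are made up.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wordlist (
        word      TEXT    NOT NULL,
        Wordindex INTEGER PRIMARY KEY
    );
    CREATE TABLE fivegrams (
        Gram_ID   INTEGER PRIMARY KEY AUTOINCREMENT,
        w1 INTEGER NOT NULL, w2 INTEGER NOT NULL, w3 INTEGER NOT NULL,
        w4 INTEGER NOT NULL, w5 INTEGER NOT NULL,
        Frequency INTEGER NOT NULL,
        lastUsed  INTEGER
    );
""")

# Rows from Table 4, then the encoded 5-gram from Table 5.
conn.executemany(
    "INSERT INTO wordlist (word, Wordindex) VALUES (?, ?)",
    [("the", 1), ("dog", 2), ("and", 3), ("cat", 4)],
)
conn.execute(
    "INSERT INTO fivegrams (w1, w2, w3, w4, w5, Frequency, lastUsed) "
    "VALUES (1, 2, 3, 1, 4, 12, 1999)"  # 12 and 1999 are made-up values
)

# Join each ID column back to 'wordlist' to recover the strings.
row = conn.execute("""
    SELECT a.word, b.word, c.word, d.word, e.word, g.Frequency
    FROM fivegrams AS g
    JOIN wordlist AS a ON a.Wordindex = g.w1
    JOIN wordlist AS b ON b.Wordindex = g.w2
    JOIN wordlist AS c ON c.Wordindex = g.w3
    JOIN wordlist AS d ON d.Wordindex = g.w4
    JOIN wordlist AS e ON e.Wordindex = g.w5
""").fetchone()

print(row)  # ('the', 'dog', 'and', 'the', 'cat', 12)
```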
For example, periods, commas and, usually, quotation marks are all treated as separate
words in the n-grams. Because grams containing these tokens were judged likely to span
multiple sentences or thoughts, they were removed. Table 6 shows some of the many
5-grams that were removed during database construction.
Gram 1 Gram 2 Gram 3 Gram 4 Gram 5
! ! ! " now
. and yet I hope
' ve gone . "
' • one of the
( will as they believe
interior . in this way
Table 6: Sample of the 5-grams from the original Google dataset
(eng-1M-5gram-20090715) that were removed through pruning during relational database
construction.
Construction of this table was straightforward. Every word was first converted to
lowercase; then only those 5-grams in which every gram contained at least one alphabetic
character were added to the table.
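The pruning rule just described can be sketched as a small filter. This is a minimal sketch rather than the code used to build the database: the function name `keep_gram` is hypothetical, and it assumes that "alphabetic character" means an ASCII letter a-z after lowercasing. The rejected samples are the rows of Table 6.

```python
def keep_gram(gram):
    """Lowercase a 5-gram; return it only if every token contains a letter.

    Assumption: a "letter" is an ASCII character in a-z after lowercasing.
    Returns None for grams that should be pruned.
    """
    lowered = tuple(token.lower() for token in gram)
    if all(any("a" <= ch <= "z" for ch in token) for token in lowered):
        return lowered
    return None


# The sample rows from Table 6, all of which were removed.
removed_samples = [
    ("!", "!", "!", '"', "now"),
    (".", "and", "yet", "I", "hope"),
    ("'", "ve", "gone", ".", '"'),
    ("'", "•", "one", "of", "the"),
    ("(", "will", "as", "they", "believe"),
    ("interior", ".", "in", "this", "way"),
]

pruned = [keep_gram(gram) for gram in removed_samples]
kept = keep_gram(("The", "dog", "and", "the", "cat"))
```

Under this rule every row of Table 6 is rejected, because each contains at least one token made up only of punctuation or symbols, while “The dog and the cat” is kept in lowercase form.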
During trials, another database was constructed with minimal pruning as a comparison
case, but it increased accuracy only by a minor amount while increasing size
significantly; for example, ‘wordlist’ became four times larger on disk.
The second step was to create four individual feature tables to further reduce size and
allow portability. The four features are the before bigram, after bigram, split bigram
and important word. The four corresponding tables, ‘ct_bb’, ‘ct_ba’, ‘ct_bs’ and
‘ct_bis’, are approximately 2.4 GB combined. These tables, together with the ‘wordlist’
table, are the ones used when the program is in operation. Each stores the ID of the
queried word, the count for the specific word, the IDs of the two context words, the
global frequencies and the local frequencies. Figure 14 shows the table structure used
by each feature.
+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| w_index     | int(11) | NO   | PRI | NULL    |       |
| word_count  | int(11) | NO   | PRI | NULL    |       |
| w1          | int(11) | NO   | MUL | NULL    |       |
| w2          | int(11) | NO   | MUL | NULL    |       |
| local_freq  | int(11) | YES  |     | 0       |       |
| global_freq | int(11) | YES  |     | 0       |       |
+-------------+---------+------+-----+---------+-------+