Figure 13.4: Graph 4

Figure 13.5: Graph 5

Figure 13.6: Two resulting Causal Commonality Graphs from Graph 1 and Graph 4

Chapter 14

Notes on the Software Implementation

14.1 Using the WordNet 3.0 Thesaurus

The basis for the dictionary used in 11.2 was the Princeton University WordNet 3.0 thesaurus; see [13] and [39].

The WordNet dictionaries have been processed to make them more suitable for use in the software. WordNet’s verb, noun and adjective dictionaries form most of the basis for the word corpus used by the software. WordNet’s index files are converted into csv files, which can be found in the ’dict’ directory of the software.

The rest of the files for pronouns, prepositions, articles and conjunctions have been created by hand, because they were not part of the thesaurus in a suitable form.

Additionally there is the option to add user-defined nouns, verbs and adjectives.

14.1.1 Word Corpus Data Flow

The index files were filtered with shell scripts to extract all unique words from each index file. The index files contain a large amount of cross-referencing data that is not relevant for use in the software. The software then read the processed index files and built the final files. This step was implemented in Java and is part of the source code distribution of the software, but the functionality is not accessible from within the software.
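The original filtering was done with shell scripts that are not reproduced here. As a rough illustration of the same idea in Java, the following sketch keeps only the first field (the word) of each data line and discards the cross-referencing fields. The class name IndexLemmaFilter, the assumption that header lines start with whitespace, and the sample lines in main are all made up for illustration and are not taken from the software.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of the original software. */
final class IndexLemmaFilter {

    /** Returns the unique words (first field of each data line) in sorted order. */
    static List<String> uniqueLemmas(List<String> indexLines) {
        TreeSet<String> lemmas = new TreeSet<>();
        for (String line : indexLines) {
            // Assumption: header lines start with whitespace; skip them and blank lines.
            if (line.isEmpty() || Character.isWhitespace(line.charAt(0))) {
                continue;
            }
            // The word is the first space-separated field; the rest is cross-reference data.
            int end = line.indexOf(' ');
            String lemma = end < 0 ? line : line.substring(0, end);
            lemmas.add(lemma.toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(lemmas);
    }

    public static void main(String[] args) {
        // Made-up sample lines that only imitate the shape of an index file.
        List<String> sample = List.of(
                "  1 this line stands in for a header line",
                "pilot v 1 0 1 0 01111111",
                "write v 1 0 1 0 02222222",
                "pilot v 1 0 1 0 01111111");
        System.out.println(uniqueLemmas(sample));
    }
}
```

A sorted set is used here so that duplicates disappear and the output order is stable; whether the original scripts sorted their output is not stated.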

The functions used to build the final word corpus can be found in the Java class com.causalis.textutils.MkWordList. The MkWordList class reads the base forms and generates a large number of inflections for them. Verbs, for example, are stored in the database in simple present, simple past, past, continuous and infinitive form. The verb ’to write’ can be found as write, wrote, written, writing and (again) write. Depending on what is known about irregular forms, inflections of the base forms are generated either procedurally according to the rules of English grammar or by looking up the irregular forms. The lookup of irregular forms uses the thesaurus, which lists all irregular forms. The absence of irregular forms in the thesaurus is taken as an indication that a word is inflected regularly. The inflections of regular words are created by the software.
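The split between lookup and rule-based generation can be sketched as follows. This is a minimal illustration and not the MkWordList code: the class name VerbInflectionSketch, the hard-coded irregular table (standing in for the thesaurus lookup) and the deliberately simplified spelling rules are assumptions. In particular, consonant doubling (’stop’, ’stopped’) is not handled. The output uses the column order of verbtable.csv.

```java
import java.util.Map;

/** Hypothetical sketch; not the actual MkWordList implementation. */
final class VerbInflectionSketch {

    // Assumed stand-in for the thesaurus lookup: base form -> {simple past, past}.
    private static final Map<String, String[]> IRREGULAR =
            Map.of("write", new String[] {"wrote", "written"});

    /** Returns infinitive, simple present, simple past, past and continuous form. */
    static String[] inflect(String base) {
        String[] irregular = IRREGULAR.get(base);
        // A missing irregular entry is taken to mean that the verb is regular.
        String simplePast = irregular != null ? irregular[0] : pastForm(base);
        String past = irregular != null ? irregular[1] : pastForm(base);
        return new String[] {base, thirdPerson(base), simplePast, past, ingForm(base)};
    }

    private static boolean endsInConsonantY(String w) {
        int n = w.length();
        return n >= 2 && w.charAt(n - 1) == 'y' && "aeiou".indexOf(w.charAt(n - 2)) < 0;
    }

    // Simplified rule: consonant + y -> ies, sibilant or o ending -> es, otherwise s.
    private static String thirdPerson(String w) {
        if (endsInConsonantY(w)) return w.substring(0, w.length() - 1) + "ies";
        if (w.matches(".*(s|x|z|ch|sh|o)$")) return w + "es";
        return w + "s";
    }

    // Simplified rule: final e -> d, consonant + y -> ied, otherwise ed.
    private static String pastForm(String w) {
        if (w.endsWith("e")) return w + "d";
        if (endsInConsonantY(w)) return w.substring(0, w.length() - 1) + "ied";
        return w + "ed";
    }

    // Simplified rule: ie -> ying, single final e is dropped, otherwise ing.
    private static String ingForm(String w) {
        if (w.endsWith("ie")) return w.substring(0, w.length() - 2) + "ying";
        if (w.endsWith("e") && !w.endsWith("ee")) return w.substring(0, w.length() - 1) + "ing";
        return w + "ing";
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", inflect("write")));
        System.out.println(String.join(",", inflect("pilot")));
    }
}
```

Under these assumptions the two calls in main print write,writes,wrote,written,writing and pilot,pilots,piloted,piloted,piloting.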

The resulting files are all located in the software distribution’s ’dict’ directory. The generated files are

• nountable.csv
• verbtable.csv and
• adjectivetable.csv.

Other files have been created by hand:

• conjunctions.lst,
• prepositions.lst and
• pronountable.csv.

If the files exist, the software will also read user-defined verb, noun and adjective files:

• verbtable.userdef.csv,
• nountable.userdef.csv and
• adjectivetable.userdef.csv

At system start the word corpus files are read and an in-memory database is built by the com.causalis.textutils.WordCorpus class.

File Formats

The csv files contain table data as comma-separated values. The lst files are simple lists in which each line has only one entry. All user-defined corpus files must be in the same format as the normal word corpus files.

All formats share the convention that all entries are written in lowercase letters.
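Reading the two file kinds can be sketched in a few lines of Java. The class CorpusFileReaderSketch and its method names are hypothetical and do not come from the software. Treating a missing file as empty (convenient for the optional user-defined files) and lowercasing entries on load are assumptions of this sketch, not documented behaviour.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical reader for the csv and lst corpus files; for illustration only. */
final class CorpusFileReaderSketch {

    /** Reads a csv table: one row per line, entries separated by commas. */
    static List<String[]> readTable(Path file) {
        List<String[]> rows = new ArrayList<>();
        for (String line : readList(file)) {
            rows.add(line.split(","));
        }
        return rows;
    }

    /** Reads an lst file: one entry per line. A missing file yields an empty list. */
    static List<String> readList(Path file) {
        List<String> entries = new ArrayList<>();
        if (!Files.exists(file)) {
            return entries; // assumption: optional files may simply be absent
        }
        try {
            for (String line : Files.readAllLines(file)) {
                // Assumption: normalise to lowercase, matching the file convention.
                String entry = line.trim().toLowerCase(Locale.ROOT);
                if (!entry.isEmpty()) {
                    entries.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dict");
        Files.write(dir.resolve("nountable.csv"),
                List.of("pilot,pilots", "landing_gear,landing_gears"));
        // Base table first, then the optional user-defined table (absent here).
        List<String[]> rows = new ArrayList<>(readTable(dir.resolve("nountable.csv")));
        rows.addAll(readTable(dir.resolve("nountable.userdef.csv")));
        for (String[] row : rows) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Because user-defined files share the format of the normal corpus files, the same reader can be applied to both and the results concatenated.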

verbtable.csv Format

The verbtable.csv file has 5 entries per line. Each line holds the different inflections of one verb; the position of an entry in the line determines its inflection type.

1. infinitive, without ’to’
2. simple present tense
3. simple past tense
4. past tense
5. continuous tense

Composite verbs are denoted with an underscore ’_’. For example ’to wish well’ is denoted as

wish_well,wishes_well,wished_well,wished_well,wishing_well

nountable.csv Format

The nountable.csv file has 2 entries per line. Each line holds the singular and plural inflections of a noun; the singular form comes first, the plural form second. Composite nouns are denoted with an underscore ’_’. For example ’landing gear’ is denoted as

landing_gear,landing_gears

adjectivetable.csv Format

The adjectivetable.csv file has 3 entries per line. Each line holds the different inflections of one adjective; the position of an entry in the line determines its inflection type.

1. adjective
2. comparative
3. superlative

pronouns.lst, conjunctions.lst and prepositions.lst Format

The pronouns.lst, conjunctions.lst and prepositions.lst files contain lists of pronouns, conjunctions and prepositions respectively. There is only one entry per line.

Looking up Words with the WordCorpus

The in-memory database reads each entry from the various sources and stores the associated form type with the entry. For example, the verbtable.csv entry for ’to pilot’ is

pilot,pilots,piloted,piloted,piloting

WordCorpus would store the following in the in-memory database:

• pilot is an infinitive form
• pilots is a simple present form
• piloted is a simple past form
• piloted is (also) a past form
• piloting is a continuous form

From the nountable.csv entry for ’pilot’

pilot,pilots

WordCorpus would store in the in-memory database:

• pilot is (also) a singular noun
• pilots is (also) a plural noun

If WordCorpus is queried for the word ’pilot’, it would deliver as a result that the word can be

• an infinitive form verb and
• a singular noun.
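The ’pilot’ example can be reproduced with a small map from word to the set of form types it can take. This is only a sketch of the idea and not the com.causalis.textutils.WordCorpus class: the names WordCorpusSketch, FormType, addVerbRow, addNounRow and lookup are hypothetical. The column positions follow the verbtable.csv and nountable.csv formats described earlier.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory lookup; not the actual WordCorpus class. */
final class WordCorpusSketch {

    enum FormType {
        INFINITIVE, SIMPLE_PRESENT, SIMPLE_PAST, PAST, CONTINUOUS,
        SINGULAR_NOUN, PLURAL_NOUN
    }

    // The column position within a row determines the form type.
    private static final FormType[] VERB_COLUMNS = {
        FormType.INFINITIVE, FormType.SIMPLE_PRESENT, FormType.SIMPLE_PAST,
        FormType.PAST, FormType.CONTINUOUS
    };
    private static final FormType[] NOUN_COLUMNS = {
        FormType.SINGULAR_NOUN, FormType.PLURAL_NOUN
    };

    private final Map<String, Set<FormType>> forms = new HashMap<>();

    void addVerbRow(String csvLine) {
        addRow(csvLine, VERB_COLUMNS);
    }

    void addNounRow(String csvLine) {
        addRow(csvLine, NOUN_COLUMNS);
    }

    private void addRow(String csvLine, FormType[] columns) {
        String[] entries = csvLine.trim().split(",");
        if (entries.length != columns.length) {
            throw new IllegalArgumentException("unexpected number of entries: " + csvLine);
        }
        for (int i = 0; i < entries.length; i++) {
            // A word may collect several form types, possibly from different files.
            forms.computeIfAbsent(entries[i], k -> EnumSet.noneOf(FormType.class))
                 .add(columns[i]);
        }
    }

    /** Returns every form type recorded for the word; empty if the word is unknown. */
    Set<FormType> lookup(String word) {
        return forms.getOrDefault(word, EnumSet.noneOf(FormType.class));
    }

    public static void main(String[] args) {
        WordCorpusSketch corpus = new WordCorpusSketch();
        corpus.addVerbRow("pilot,pilots,piloted,piloted,piloting");
        corpus.addNounRow("pilot,pilots");
        System.out.println(corpus.lookup("pilot"));   // [INFINITIVE, SINGULAR_NOUN]
        System.out.println(corpus.lookup("piloted")); // [SIMPLE_PAST, PAST]
    }
}
```

Storing a set per word keeps ambiguous words such as ’pilot’ ambiguous at lookup time, so that a later step can decide between the verb and noun reading from context.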

This information is later used to determine the function of a word in a factor text.
