Figure B.1: Weka’s organization structure of data preprocessors
B.2 Concept-based representation creators
Katoa implements the concept-based representation creators as data preprocessors (filters) in Weka. Figure B.1 shows Weka’s structure of filters. Katoa adds two new filters, StringToWikipediaConceptVector and StringToWordNetConceptVector, both in the unsupervised attribute filters category. Each takes a dataset with one or more string attributes as input and outputs a new dataset whose attributes correspond to the concepts identified in all string values, together with the non-string attributes of the input, such as the class attribute.
Figure B.2: Options of the filters for creating concept-based text representations: (a) StringToWordNetConceptVector, (b) StringToWikipediaConceptVector
Each filter has several options that specify how concepts are identified, as Figure B.2 shows. The two filters have eight options in common, listed below:
• IDFTransform: sets whether to transform a concept’s weight to its tf × idf weight.
• TFTransform: sets whether to transform a concept’s term frequency to log(1 + tf).
• conceptsToKeep: sets the maximum number of concepts to be kept. Default is 1000.
• debug: sets whether to turn on output of debugging information.
• lowerCaseTokens: sets whether to convert all letters to lower case before matching them against terms in the corresponding concept system.
• minConceptFreq: sets the minimum concept frequency, and is enforced on an all-classes basis.
• outputConceptCounts: sets whether to output concept counts rather than boolean 0 or 1 indicating absence or presence of a concept.
• reweightByCentrality: sets whether to weight a concept by its centrality within the local context, that is, tf × LocalCentrality. When enabled, the centrality can either take concept counts into account (the reweight by weighted centrality option) or ignore them (the reweight by binary centrality option).
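The weighting options above can be illustrated with a minimal sketch. It is not Katoa's code: the function name and arguments are hypothetical, the idf term is assumed to follow the common log(N / df) definition, and skipping the transforms for boolean output is also an assumption.

```python
import math


def concept_weights(counts, doc_freq, n_docs,
                    tf_transform=False, idf_transform=False,
                    output_counts=True):
    """Weight the concepts of one document (illustrative sketch only).

    counts:   {concept: occurrences in this document}
    doc_freq: {concept: number of documents containing the concept}
    n_docs:   total number of documents in the dataset
    """
    weights = {}
    for concept, tf in counts.items():
        if not output_counts:
            # Boolean representation: 1 marks the presence of a concept.
            # Assumption: no transform is applied in this case.
            weights[concept] = 1.0
            continue
        w = float(tf)
        if tf_transform:
            # TFTransform: replace tf by log(1 + tf).
            w = math.log(1.0 + w)
        if idf_transform:
            # IDFTransform: tf x idf, with idf assumed to be log(N / df).
            w *= math.log(n_docs / doc_freq[concept])
        weights[concept] = w
    return weights


# Toy example: one document in a collection of four.
example = concept_weights({"cat": 3, "dog": 1}, {"cat": 2, "dog": 4}, 4,
                          tf_transform=True, idf_transform=True)
```

In this toy example, "dog" occurs in every document, so its idf is log(4 / 4) = 0 and its weight vanishes, whereas "cat" keeps a positive weight.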
The StringToWordNetConceptVector filter has six more options. Three of them set the natural language processing models, which are language dependent; models for other languages are available at http://opennlp.sourceforge.net/models-1.5/. The six options are:
• posTaggerModel: sets the part of speech tagger model.
• sentenceDetectorModel: sets the sentence detection model.
• stopwords: sets the stopword list to be used. If the useStoplist option is turned on and no list is specified with this option, Weka’s default stopword list as described in Section 4.3 will be used.
• tokenizerModel: sets the tokenizer model.
• useStoplist: sets whether to remove stopwords before matching terms in input text against terms in WordNet.
• wordNetQuery: sets the Perl script for querying WordNet.
The StringToWikipediaConceptVector filter has three more options:
• cacheWikipedia: sets whether to cache relevant Wikipedia information in memory to improve efficiency.
• disambiguatorModel: sets the disambiguation model for Wikipedia concepts, which can be trained using the org.wikipedia.miner.annotation.Disambiguator class.
• stopwords: sets the stopword list used by the Wikipedia Miner, which should be a plain text file with one stopword per line. Words and phrases in this list will not be matched against the anchor text vocabulary.
Figure B.3: Options of the plain cosine and the semantically enriched distance functions: (a) CosineDistance, (b) EnrichedDistance
B.3 Similarity measures
Weka implements similarity measures as distance functions. Because all similarity measures in this thesis are bounded between 0 and 1, we take 1 − similarity as the distance value. Katoa implements three distance functions: CosineDistance, the cosine measure (see Section 2.2); EnrichedDistance, which enriches the similarity measure with the semantic concept relatedness described in Section 5.4.2; and LearnedDistance, the machine-learned measure described in Chapter 6. Figures B.3 and B.4 show the options for each class.
The CosineDistance is the most basic measure, with three options:
• attributeIndices: specifies the range of attributes used for computing the distance between the given texts. By default, all numeric attributes are used.
• binary: sets whether to count the weights of each attribute or just its presence or absence.
• invertSelection: sets whether to invert the selection, so that the attributes outside the range specified with attributeIndices are used instead.
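As a rough illustration of how these three options interact, the following sketch computes the distance as 1 − cosine similarity. It is not the Weka or Katoa implementation: the function names are hypothetical, indices are zero-based here for simplicity, and treating an all-zero vector as maximally distant is an assumption.

```python
import math


def select(vec, indices=None, invert=False):
    """Keep the attributes at `indices` (zero-based), or all others if inverted."""
    if indices is None:
        return list(vec)
    chosen = set(indices)
    return [x for i, x in enumerate(vec) if (i in chosen) != invert]


def cosine_distance(a, b, indices=None, invert=False, binary=False):
    """Distance = 1 - cosine similarity over the selected attributes."""
    a = select(a, indices, invert)
    b = select(b, indices, invert)
    if binary:
        # Only presence or absence of an attribute counts, not its weight.
        a = [1.0 if x != 0 else 0.0 for x in a]
        b = [1.0 if x != 0 else 0.0 for x in b]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an empty vector is maximally distant from anything.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

For non-negative weights the cosine similarity lies between 0 and 1, so the resulting distance does as well, which matches the 1 − similarity convention used in this section.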
The EnrichedDistance has four options: relatednessMeasure, which specifies the concept relatedness measure; binaryCentrality, which specifies whether a concept’s occurrence frequency should be counted when computing its centrality, or only its presence or absence; and attributeIndices and invertSelection, which are the same as in CosineDistance.
Figure B.4: Options of the DocumentPair filter and the LearnedDistance function: (a) DocumentPair filter, (b) LearnedDistance
The LearnedDistance class implements the learned similarity measure, as shown in Figure B.4. It relies on another unsupervised instance filter, DocumentPair, shown in Figure B.4(a), which converts a pair of documents into a new instance that describes their thematic similarity, using the features described in Section 6.2. Given a pair of texts (their bag-of-words and bag-of-concepts representations), the LearnedDistance measure first applies the DocumentPair filter to create an instance describing their relation, from which the trained regression model then predicts the similarity between the input texts.
Figure B.5: Options of the SimpleKMeansReweighted clustering algorithm
be configured. The regressionModel option sets the trained regression model, and predictionIsSimilarity sets whether the prediction of the regression model is the similarity rather than the distance between the two texts and thus should be converted to 1 − prediction.
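The flow described for LearnedDistance, including the effect of predictionIsSimilarity, can be sketched as follows. This is only an illustration under stated assumptions: `pair_features` stands in for the DocumentPair filter and `regression_model` for the trained model, and the toy feature (word overlap) and toy model below are hypothetical, not the features of Section 6.2.

```python
def learned_distance(doc_a, doc_b, pair_features, regression_model,
                     prediction_is_similarity=True):
    """Turn a regression prediction for a document pair into a distance."""
    # Step 1: describe the pair as one instance (the DocumentPair role).
    features = pair_features(doc_a, doc_b)
    # Step 2: let the trained model predict from that instance.
    prediction = regression_model(features)
    # Step 3: a predicted similarity is converted to 1 - prediction;
    # a predicted distance is returned unchanged.
    return 1.0 - prediction if prediction_is_similarity else prediction


def overlap_features(doc_a, doc_b):
    """Hypothetical stand-in feature: Jaccard overlap of the two word sets."""
    a, b = set(doc_a), set(doc_b)
    union = a | b
    return [len(a & b) / len(union) if union else 0.0]


def toy_model(features):
    """Hypothetical stand-in model: returns the single feature as the prediction."""
    return features[0]


distance = learned_distance(["a", "b"], ["b", "c"], overlap_features, toy_model)
```

Passing the functions in as arguments keeps the sketch independent of any particular regression library; any callable that maps a feature list to a number could be substituted.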