
5.5 Machine Learning Approach

5.5.1 Proposed Features

Here we explain some of the features used in the machine learning approach. Many of the features are selected because they reflect simple writing rules: a sentence should not be too long or overly complex, it should express a clear idea in and of itself, and it should be well written with a correct grammatical structure.

Features are also selected based on characteristics of causal sentences. For example, a causal sentence links at least two actors, the cause and the effect, and for the sentence to be clear the causal actors must be clearly defined. Causation that is encoded throughout a document, rather than in a single sentence, is not within the scope of this work. We focus on causation that is encoded in a single sentence.

Below we present several features that are explored for the approach, and we provide a short description of why they are presented:

Causal Marker Used This is the basic feature of causal sentences: all explicit causal sentences contain causal markers. A caveat is that some markers are ambiguous, so not every sentence with a causal marker is a causal sentence.

Contains Quotes Many sentences contain a quoted section. Quotes are often important declarations, and causal relations may be correlated with important declarations.

Word Count Since causation relates two actors in a sentence and the actors should be clearly defined, it follows that causal sentences may be longer than average.

Begins with Causal Marker If a sentence begins with a causal marker, then the cause and effect should be separated within the sentence, for example with a comma: “because of all the trouble, there has been a crisis”.

Major Sentence Before and After the Causal Marker If the causal actors are clearly defined, then the text segments before and after the causal marker should be major sentences. A major sentence contains a subject, verb and predicate.

Contains Numbers Encoding causation in a sentence implies a level of complexity in the sentence. Numbers in a sentence imply a level of detail that also implies complexity. It follows that, to keep the complexity of the sentence low, a causal sentence will not provide details such as numbers.

Stop Word Count Because of its complexity, a causal sentence may contain fewer stop words than a non-causal sentence.

Contains Proper Nouns A causal sentence should contain at least two nouns, one for the cause and one for the effect.

Verb Count A causal sentence should contain at least two verbs, one for the cause and one for the effect.

Noun Location Relative to Causal Marker A noun should be present before and after the causal marker or, if the sentence starts with a causal marker, the nouns should be separated by punctuation.

Verb Location Relative to Causal Marker A verb should be present before and after the causal marker, or the causal marker should start the sentence and the verbs should be separated by punctuation.

Verb Tense Mismatch Because of the temporal order characteristic of causal relations, we may find a temporal mismatch between the verbs in a sentence, for example “the broken economy leads to a rising of entrepreneurship”, where “broken” is past and “rising” is present.

Contains Temporal Phrase Many causal markers can also be used to indicate temporal order. A sentence encoding temporal order may include a temporal reference, for example “the greatest financial crisis since November has been the CITIGroup crash”. In this example the temporal phrase is “November”; other phrases may include the year or the day of the week.

Count of Entities More than two entities referenced in a sentence may indicate the sentence is not causal.

Contains News Topic References If the sentence contains more than two news topic references then it may be causal.

Temporal Order Between News Topic References If the sentence contains two news topic references that are temporally ordered, then it may be a strong indicator of causation.

Subsumed Noun Pair The causal actors are often the same type of event but at a different granularity level. For example, a war may cause a battle or vice versa. This may be an indicator of causation.

Synonyms or Near-synonyms Using synonyms or near-synonyms in a sentence increases the complexity of the sentence. It follows that this would be avoided in a causal sentence.

Contains Question Term There are often inquiries about the causes of activities. Sentences that contain a question word or a question mark are often questions. These sentences are non-causal.

Contains Negation The negation of the cause or the effect of an activity makes the sentence non-causal.

Position of Causal Marker If a sentence ends with a causal marker it may not be causal. For example, in “there has been no one since”, the causal marker “since” is not used to encode causality.

Causal Marker Ambiguity and Frequency Work has been done to quantify the frequency and ambiguity of many causal markers [Girju2003]. This information may be useful in defining causal sentences.

Count of Definite Article “The” The marker “the” indicates a unique activity. Having more than one unique activity may be an indicator of causality.
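Several of the surface features above can be computed with simple string operations. The following is a minimal sketch, not the extraction code used in this work: the marker, stop word, question word and negation lists are small illustrative assumptions, and the names find_marker and surface_features are hypothetical.

```python
import re

# Illustrative word lists only; they are assumptions, not the lists used in this work.
CAUSAL_MARKERS = ("because of", "because", "since", "leads to", "due to")
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "has", "been", "there"}
QUESTION_WORDS = {"what", "why", "how", "when", "where", "who", "which"}
NEGATIONS = {"not", "no", "never"}


def find_marker(sentence):
    """Return (marker, start, end) for the earliest causal marker, or None."""
    lowered = sentence.lower()
    best = None
    # Longest markers first, so "because of" wins over "because" at the same position.
    for marker in sorted(CAUSAL_MARKERS, key=len, reverse=True):
        match = re.search(r"\b" + re.escape(marker) + r"\b", lowered)
        if match and (best is None or match.start() < best[1]):
            best = (marker, match.start(), match.end())
    return best


def surface_features(sentence):
    """Compute a few of the surface features for one sentence."""
    lowered = sentence.lower()
    tokens = re.findall(r"[a-z0-9']+", lowered)
    marker = find_marker(sentence)
    # Skip opening quotes and closing punctuation when testing the sentence edges.
    first_char = len(lowered) - len(lowered.lstrip(" \"'“"))
    last_char = len(lowered.rstrip(" .!?\"'”"))
    return {
        "causal_marker": marker[0] if marker else None,
        "begins_with_marker": marker is not None and marker[1] == first_char,
        "ends_with_marker": marker is not None and marker[2] == last_char,
        "word_count": len(tokens),
        "contains_quotes": any(q in sentence for q in ('"', "“", "”")),
        "contains_numbers": re.search(r"\d", sentence) is not None,
        "stop_word_count": sum(1 for t in tokens if t in STOP_WORDS),
        "contains_question_term": "?" in sentence
        or any(t in QUESTION_WORDS for t in tokens),
        "contains_negation": any(t in NEGATIONS or t.endswith("n't") for t in tokens),
        "definite_article_count": tokens.count("the"),
    }


features = surface_features("Because of all the trouble, there has been a crisis.")
print(features["causal_marker"], features["begins_with_marker"])
```

Only the first marker in the sentence is considered here. A sentence with several markers would need one feature set per marker, which this sketch does not attempt.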

In this list we also include discovery features. These are features that have no clear correlation to causation but may still provide insight into how causal sentences are formed.

Character Count Simple words often have a small character count, so the character count may be an indicator of complexity.

Ratio of Upper to Lower Case Letters How many upper or lower case letters are found, or the ratio of upper case to lower case.

Number of Punctuation Marks How many punctuation marks are found.

Number of Adjectives How many adjectives are found.

Parse-tree Pattern An analysis of the parse tree information may provide insight into causal sentences, in particular what patterns appear before and after each specific causal marker, because each marker may have its own pattern. By pattern we mean the order of the part-of-speech tags in the sentence.

Part-of-speech Tag Count and Pattern An analysis of the part-of-speech information may provide insight into causal sentences, in particular what patterns appear before and after a causal marker in a causal sentence.

Repeated Words The intuition is that to maintain a distinction between the cause and the effect, there will be no repeated terms.

Words Before and After Causal Marker are Related and Causal In WordNet there are several definitions or examples of each word, and each is called a gloss. A gloss may contain a causal marker and also another word from the sentence, which may be an indicator of causality.
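The count-based discovery features need no linguistic resources. The sketch below shows one possible way to compute them. The function name discovery_features is hypothetical, and the choice to count only ASCII punctuation and to compare words case-insensitively is an assumption of this example.

```python
import string
from collections import Counter


def discovery_features(sentence):
    """Compute simple count-based discovery features for one sentence."""
    upper = sum(1 for ch in sentence if ch.isupper())
    lower = sum(1 for ch in sentence if ch.islower())
    # Strip surrounding punctuation and lowercase, so "The" and "the," match.
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    counts = Counter(w for w in words if w)
    return {
        "character_count": len(sentence),
        # Ratio is reported as 0.0 when the sentence has no lower case letters.
        "upper_lower_ratio": upper / lower if lower else 0.0,
        # string.punctuation covers ASCII marks only (curly quotes are not counted).
        "punctuation_count": sum(1 for ch in sentence if ch in string.punctuation),
        # Number of distinct words that occur more than once.
        "repeated_word_count": sum(1 for c in counts.values() if c > 1),
    }


print(discovery_features("The war caused the battle."))
```

Whether stop words such as “the” should count as repeated words is left open here. Filtering them out first may be closer to the intuition behind the Repeated Words feature.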

The features proposed here are intended to aid in discovering which features are valid and useful. Not all of the features may seem relevant, yet they may carry latent information; therefore we consider and extract those that are available.
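As a further illustration, the features that depend on part-of-speech information (verb count, noun and verb location relative to the marker, and the tag pattern before and after the marker) can be derived from a sentence that has already been tagged. The sketch below assumes Penn Treebank style tags, a single-token causal marker whose position is known, and a hypothetical function name tag_features. The tagger itself is outside this example.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def is_verb(tag):
    # Penn Treebank verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
    return tag.startswith("VB")


def tag_features(tagged, marker_index):
    """tagged: list of (token, tag) pairs; marker_index: position of the marker token."""
    before = tagged[:marker_index]
    after = tagged[marker_index + 1:]
    return {
        "verb_count": sum(1 for _, tag in tagged if is_verb(tag)),
        "noun_before_marker": any(tag in NOUN_TAGS for _, tag in before),
        "noun_after_marker": any(tag in NOUN_TAGS for _, tag in after),
        "verb_before_marker": any(is_verb(tag) for _, tag in before),
        "verb_after_marker": any(is_verb(tag) for _, tag in after),
        # The pattern is the order of the tags on each side of the marker.
        "pattern_before": " ".join(tag for _, tag in before),
        "pattern_after": " ".join(tag for _, tag in after),
    }


# Hand-tagged example sentence; the tags are supplied manually for illustration.
tagged = [
    ("The", "DT"), ("crisis", "NN"), ("happened", "VBD"),
    ("because", "IN"), ("banks", "NNS"), ("failed", "VBD"),
]
print(tag_features(tagged, marker_index=3))
```

Multi-word markers such as “because of” would need a start and end index rather than a single position, which this sketch leaves out.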