2.2 Word Cooccurrence and Collocation Analysis

Many approaches related to our model originate from the fields of corpus analytics, Web science, or text mining, and are concerned with the analysis of word cooccurrences or word collocations. The difference between cooccurrence and collocation is subtle and may vary between sources. A commonly accepted interpretation of word cooccurrence is the joint occurrence of words, whereas a collocation is a cooccurrence of words within a short distance of each other that recurs (significantly) more frequently than expected [170]. The concept of collocation is typically found in corpus linguistics, where it is based on the Firthian principle that words are defined by the words that frequently occur in their context, and serves as a basis for research into word associations. In contrast, many text and data mining approaches consider the more general cooccurrences to detect patterns or relations, often in the context of networks (for a brief introduction to complex networks and graphs, see Chapter 2.3). In the following, we first consider related work on cooccurrences in general, before discussing collocations.
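To make the distinction concrete, the following minimal sketch scores word pairs by pointwise mutual information (PMI), one common way of asking whether a pair cooccurs more often than expected under independence. PMI is used here purely for illustration and is not claimed to be the measure used in the cited work; the function name `window_pmi` and the parameters `max_distance` and `min_count` are hypothetical.

```python
import math
from collections import Counter


def window_pmi(tokens, max_distance=1, min_count=2):
    """Score unordered word pairs by PMI within a small cooccurrence window.

    A pair is counted whenever its two words are at most `max_distance`
    positions apart. Pairs seen fewer than `min_count` times are skipped,
    since PMI is unreliable for rare events.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + max_distance]:
            if left != right:
                pair_counts[tuple(sorted((left, right)))] += 1

    n_tokens = len(tokens)
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs          # observed pair probability
        p_a = word_counts[a] / n_tokens   # marginal probability of a
        p_b = word_counts[b] / n_tokens   # marginal probability of b
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return scores


tokens = "new york is large and new york is old".split()
scores = window_pmi(tokens, max_distance=1, min_count=2)
```

A positive score indicates that a pair occurs together more often than its individual frequencies would suggest, which is the statistical signal behind a collocation. A plain cooccurrence count, by contrast, corresponds to the raw entries of `pair_counts`.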

2.2.1 Word cooccurrence networks

Modelling and analyzing the cooccurrences of words has a long tradition that dates back at least to Van Rijsbergen, who proposed to drop the assumption of word occurrence independence in favour of measuring word dependencies with non-linear weighting functions [204]. Since graphs are a natural and intuitive way of representing sets of connected things, many works in the analysis of word cooccurrences use such a representation for a variety of applications. Some recent works on modelling and exploring term cooccurrences use networks to describe term relationships in a context-oriented framework, and employ network analysis techniques to derive measures from the network or to compare language-specific networks [38, 44, 118, 119]. Others exploit the properties of such networks to learn document representations and context-dependent relationships through embeddings [195]. However, these types of networks typically consider words or terms without associating them with additional semantics, and are therefore ill-suited to handling entity-centric tasks. Due to the potential density of word cooccurrence networks, these approaches often model term relations in very narrow cooccurrence windows.
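As a rough illustration of such a network, the sketch below builds an undirected, weighted cooccurrence graph from tokenized sentences using a narrow window. It is a generic construction under simplifying assumptions (pre-tokenized input, symmetric edges, raw counts as weights) and does not reproduce any of the cited approaches; the names `cooccurrence_network`, `weighted_degree`, and `max_distance` are hypothetical.

```python
from collections import defaultdict


def cooccurrence_network(sentences, max_distance=1):
    """Build an undirected, weighted word cooccurrence graph.

    Nodes are words. An edge weight counts how often two distinct words
    occur at most `max_distance` positions apart within the same sentence.
    The graph is stored as a nested dict: graph[u][v] == graph[v][u].
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, source in enumerate(tokens):
            for target in tokens[i + 1 : i + 1 + max_distance]:
                if source == target:
                    continue  # ignore self-loops
                graph[source][target] += 1
                graph[target][source] += 1
    return graph


def weighted_degree(graph, node):
    """Sum of the edge weights incident to `node` (0 for unknown nodes)."""
    return sum(graph.get(node, {}).values())


sentences = [["a", "b", "c"], ["a", "b"]]
graph = cooccurrence_network(sentences, max_distance=1)
```

Increasing `max_distance` adds edges between words that are further apart, so the graph quickly becomes denser. This is the practical reason for the narrow windows mentioned above.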

More recently, typed cooccurrence networks have been introduced, which include cooccurrences of (named) entities that are detected and extracted from text documents. Nodes still represent terms, but may also be associated with entity types, such as person or organization. Examples of such networks include entity graphs used for identifying entities that participate in trending events [160], time-term association graphs used to estimate the focus time of documents [103], and even mention graphs used for cross-document coreference resolution based on spectral graph clustering [51]. A different approach, which focuses on the broader word category of concepts rather than on entities, is designed to determine the semantic relatedness of documents from extracted concept graphs [145]. As is evident from these descriptions, these works cover vastly different applications, for which they are specialized. As a result, the underlying network structures tend to be highly optimized for a single purpose, and are not flexible enough to act as document models.

There are, of course, some approaches that do not rely on a graph representation, but instead employ cooccurrence statistics directly. An example of such an approach is SigniTrend, which uses word cooccurrences to detect events in text streams as spikes in the timeline [164]. However, the relation to graphs is still evident, since the relations between cooccurring words can be interpreted as weighted edges, and have even been applied directly to the visualization of significant cooccurrences in text streams as graph-like word clouds [163]. A downside of this approach is its minimalistic representation of cooccurrence counts, which enables high efficiency on high-volume text streams, but also prevents its use for search or exploration applications on the history of the stream as a collection.

Finally, network-based word cooccurrence analysis also motivated some of our previous work, including the extraction of social [69] and location [70] networks from Wikipedia, which influenced the implicit network model we present here. In these works, entity relations are counted as simple cooccurrences, and weighted directed edges are derived from cosine weights of the node-edge incidence matrix. In contrast to the implicit network model, these works focus on analyzing the topological and community structure of such cooccurrence networks, instead of the contents of the documents.

A direct predecessor of the implicit network model that we present here is the analysis of term-date networks in the text of Wikipedia [185], which restricts the exploration of word relations to the bipartite case of dates and terms. Thus, it can be seen as a special case of the model that we introduce in Chapter 3.

2.2.2 Word collocations

In contrast to the cooccurrences of adjacent words, word collocations are more subtle. While there are multiple definitions of what constitutes a collocation and the label is used more freely in natural language processing than in linguistics or corpus linguistics, a popular interpretation defines them as composites of words whose semantic and syntactic properties exceed what can be predicted from their components [56]. While this interpretation is (intentionally) vague, it highlights a requirement that goes well beyond the analysis of cooccurrences, which can only serve as indicators but not as evidence of collocations. Due to this variety of definitions, most of which are too far removed from our work on implicit networks, we focus on only a few specific examples to motivate why collocations are literally too narrow to account for entity relations. For an in-depth discussion and historic background of research into collocations, we refer to the thesis by Evert [56].

For the analysis of collocations, the cooccurrence window that is typically considered spans two to five words. For relations between entities, such a window size is too strict, since entity mentions are unlikely to occur in such close proximity. In fact, if the use of pronouns is considered, it is likely that two related entities do not even occur
