5 The study
6. Methods and material
6.1. Quantitative and qualitative methods
6.2.3 Co-Word Analysis
Co-word analysis (in Finnish yhteissana-analyysi, in Swedish cowordanalys) is a quantitative bibliometric method. In brief, bibliometrics (FI: bibliometriikka, SW:
bibliometri) is to study document collections with the help of quantitative methods (Persson 1991, 6). In this study, co-word analysis is used to study indexing frequency and the lexical network of the key terms of this study in several databases, and also in interview material. Attention is also paid to Finnish and British indexing practices.
According to Callon & al. (1991) the methodological foundation of co-word analysis is the idea that the co-occurrence of keywords describes the contents of the documents in a file. Methodologically, it is therefore a question of using one (or more) index (or indices) to measure the relative intensity of these co-occurrences and to achieve simplified representation of the networks to which they give rise. (Ibid. 161) (About the history, origins and scope of bibliometrics see e.g. Forsman 2005; Kärki &
Kortelainen 1998; Persson 1991)
Co-word analysis is very similar to co-citation analysis. Co-word analysis deals with co-occurrences of terms in documents, while co-citation analysis deals with shared citations. (Persson 1991, 51; Horton et al. 1998) It is thus about the relatedness of terms rather than documents. (About the methods origin, see e.g. Schneider 2004, 135-136.)
Co-word analysis is used in different kinds of studies. Von Ungern-Sternberg has compared the documentary languages in four international databases in the subject field of biotechnology using co-word analysis and bibliographic coupling (see von Ungern-Sternberg 1998 and 1994). With co-word analysis she studied the co-occurrence of index terms in different articles. Although the method and tools appeared to be very useful, there were also some problems:
“1. The weakness of the databases --- 2. The citing practice ---
There were many problems in building up the terminology for a documentation language. Different databases use different depth and specificity for their indexing. --- The two methods for mapping the knowledge structure have certain weaknesses. The results of the two methods differ somewhat. This can partly be explained by the roughness of both methods. The indexing language is built in a hierarchical structure which is not noted in the coword analysis. --- The methods used are quantitative and neglect to some extent qualitative aspects of research. As a result the methods show the knowledge structure according to the number of articles on a topic, but tell us little about the depth or value of the research.” (von Ungern-Sternberg 1994, 317-318) Co-word analysis has also been recently used to examine how to organise keywords in a web search environment. Ding, Chowdhury and Foo (Ding et al. 2000) have used the co-word analysis method to identify the relationships among words and to create keyword maps that may be useful for information retrieval purposes in the domain of information retrieval. The results of the co-word analysis were compared with traditional thesauri to identify the difference. The research shows, that
--- “the associations of words identified by co-word analysis were different from those obtained from traditional thesauri. --- co-word analysis can catch the changes of its domain area to provide better and timely information guide for users. --- co-word analysis can be used for organizing knowledge through keyword maps and they can be quite useful to compliment the traditional vocabulary tools.” (Ding et al. 2000) Schneider (2004) has studied the methodological and theoretical aspects of bibliometric methods in semi-automatic49 thesaurus construction. The bibliometric methods investigated were document citation analysis, citation context analysis, co-word analysis, and bibliometric ageing methods. When studying the creation of a conceptual network of noun phrases within a concept group by use of co-word-analysis (in a certain sub-field of dentistry), he found, that
“Frequently occurring noun phrases in a concept group are agreed upon, contextual, and most likely semantically related to each other. In addition, the indirect approach ensures that the noun phrases assigned to a concept symbol appear together because they share the same textual context surrounding the specific reference marker in citing papers. All noun phrases thereby become related to each other, either directly by occurring together in the same context, or indirectly through their common co-occurrence with the cited reference representing the concept symbol. Together all concept symbols and their assigned noun phrases refer to the common concept of the group. Consequently, first and second order co-occurrence analyses can be performed by use of co-word analyses in order to investigate its ability to disclose equivalence, hierarchical, and associative relationships. In addition, network analysis is used as a visual aid for the interpretation of the relational network structures.” (Ibid, 321) Schneider (ibid) concludes, that the “applied bibliometric methods are very suitable for selection of candidate thesaurus terms in the specialty area of periodontology”, and further, that “co-word analysis is able to identify thesaural relationships, albeit with a certain error rate”, and in addition that, “in all evitability the co-word analysis applied in
49 When automatic approaches are used as a tool for thesaurus constructers, and not as a means in itself, then we speak of semi-automatic thesaurus construction (Soergel 1974). (Cited here Schneider 2004, 14)
the present methodology will detect valid as well as defective definitional relationships.”
The method is still quantitative, and needs to be supplemented by human intellectual interpretation, which is considered time consuming. – “Please keep in mind it is time consuming to manually identify and verify candidate thesaurus terms, as well as investigate their potential relations.” “--- automatic methods can assist in, but not replace, the intellectual effort needed for the construction of an indexing language or a thesaurus”. (Ibid, 323-329)
There are two opposite perspectives about using titles versus keywords in co-word analysis. Those who prefer titles want to get rid of the “indexer-effect”: indexing is a subjective action and standardised vocabularies used in indexing are sometimes outdated for the present studies. Therefore, words from titles are considered to give more subtle and current conception of ongoing research. From the authors’ viewpoint the main idea of the titles is perhaps to create as high a level of precision as possible so that the most unique feature will be distinguished. The indexers task is thus the opposite – to find the balance between precision and recall so that seekers will not miss any relevant information and on the other hand will not be too loaded with irrelevant information.
Thus indexers discover those documents, whose titles are insufficiently formulated.
(Persson 1991, 52)
Persson (1991) specifies several problems in the use of title words, and points out that they are often undervalued. Sometimes authors form provocative and rhetorical titles in order to get more attention. The titles are also not standardised, which means that one concept can be described by several terms, and on the other hand one term can have several meanings. Furthermore, titles also include a lot of trivial words which separated from the context have no special meaning, and in the analysis there is a subjective stage if the researcher sometimes has to modify material (for example select between singular and plural forms of the words). (Ibid, 52-53)
Whittaker (1989) prefers the use of keywords in co-word analysis and states, that
“As long as titles are available, co-word analysis can be performed and may be expected to yield coherent results. For document sets which are tightly focused on a relatively narrow research area, title words are probably entirely adequate, but they may not be capable of revealing the rich detail of more heterogeneous fields or document sets. For such material, properly chosen keywords are clearly superior. --- On the negative side, title words suffer from some serious drawbacks. The main one is that the set of themes addressed by a long and complex article is less likely to be fully expressed in the title than in a set of keywords, so that the aim of the co-word analysis is to expose the network of concepts (and the like) to the fullest comprehensible extent, title words would almost inevitably yield inferior results. Lesser, but still important, problems arise from the non-standardization of title words and the subdivision of concepts which may occur. The present study has shown that sensible results can be obtained despite these difficulties, but in the end the keyword analysis provided substantially more information.” (Whittaker 1989)
In this study the aim is to discover the similarities and differences of the use of certain Finnish and English indexing terms and keywords with the help of a co-word analysis. The final aim is to see how these practices and needs diverge and how they can be served in a multilingual thesaurus. This study focuses on metadata (present indexing
practices) and thus keywords (indexing terms, descriptors) are used. It is still good to be aware of the indexer effect and time delay factors when discussing indexing discourse.
The indexers may have out-dated tools and lexicology when compared with social scientists, who use up-to-date vocabulary and can apply the entire range.
For the study’s purposes it is essential that the words used in searches appear in certain fields (titles, indexing terms, author keywords etc). Iannella (1998) reminds us that metadata is structured data about data and states, that “this structure is the crucial element that gives metadata the edge over full-text indexing. The benefit is that the structure alludes to the semantics of the metadata”.
The tool used in co-word analysis is Bibexcel (Bibexcel 2002), developed by Olle Persson. It is also used to visualise the results. With these co-word maps, produced with Bibexcel, it is possible to make observable what has been occluded or invisible. One still has to keep in mind, that “[it] is not the “object” that presents the figure itself; rather, figures stand out relative to interest, attention, cultural and macroperceptual features”
(Chen 2003, 25).
With Bibexcel the terms that occur the most (e.g. descriptors) are listed first, and then how often they co-occur together is counted. With the help of MDS-technique (multi-dimensional-scaling) the co-occurrences can then be represented as a map, where the distance between terms is vice versa in relation to the number of terms that occurred simultaneously. – The more the terms co-occur together the closer they are placed in the map. Because the frequency of a single word strongly influences co-occurrences, the effect of word frequencies has been eliminated via a normalising distance unit (the observed frequency divided by the expected frequency). (Persson 1991, 56)
The maps are used to create a picture of the studied words and their relationships.
However, it is important to remember that a picture can also be experienced in different ways.
The number of aspects of visual thinking can be stated - vision is a unique source for thinking – insight, foresight, hindsight, and oversight. Before we focus in words, we examine the overall picture. (McKim 1980, cited here Chen 2003, 12) One cannot represent the whole picture of a phenomenon, since the maps are results of a complex selection process. With maps one can shed light on the phenomenon. In maps all the nuances are usually not included, and some information may be lost. If the analysed material consists of dozens or even hundreds of descriptors, it is clear that one picture cannot include them all in a sense-making way. In this study the terms represented in maps are of many kinds. In the case of descriptors (indexing terms) the number of terms shown in maps is sometimes limited to descriptors that co-occur most often (within every map is reported the minimum number of co-occurrences to be included in the map and how the terms represented are selected). Because of the richness in variations of natural language expressions, in the case of word associations a need is expected to occur for some modification and standardisation. – When a term occurs in the dataset both in a plural and in a singular form, the most common version is selected. If they occur evenly, the plural form is selected. If the word associations are more similar to stories than terms, they are keyworded. In addition, Finnish terms will be translated into
English, since the publication language is English and the reader of the study is not expected to know Finnish. Word associations are thus expected to demand a lot of subjective decisions (cf. subjective stage in Persson 1991, 52-53). It is noteworthy, that the word association maps are used rather to provide an interpretation and illustration of the material than to represent it as authentic. Semantic lexicological maps of the thesauri studied are supposedly most authentic and neutral in the respect to their origin, since only modifications have been performed of possible different versions in the use of the plural versus the singular form.
The maps have another significant advantage; the thematic maps improve recall for textual information and inferences (Rittschof et al., 1994). (See also Chen 2003, 6.)
Co-word analysis is thus used according to the purposes of this study: the aim is at finding which and what kind of concepts and terms co-occur with the certain indexing terms and what is their frequency and networks. It is thus used to gain contextual information 1) about the content of the studied concept and 2) about the context of their use, i.e. about the documents, which are indexed with certain descriptors (family roles, breadwinners, heads of households, homemakers, housewives).