Machine Coding - Data Collection: The South Africa Interaction Data Set (SAID)


4 Research Design and Methods

4.3 Data Collection: The South Africa Interaction Data Set (SAID)

4.3.5 Machine Coding

Machine coding of event data processes an electronic text source as input and produces a systematically coded set of event data as output. This process includes several discrete steps (Figure 12).

Figure 12: Steps in Generating Machine-coded Event Data

Source: Schrodt and Gerner (2005: 2/23)

The first step of generating machine-coded event data is to download the raw text material (in this study, newswire stories from AP and Reuters) from the online databases LexisNexis and Factiva, using the search strings “South Africa!” (AP from LexisNexis) and “South Africa*” (Reuters from Factiva) for headlines and lead sentences. Text filter programs written in Perl were used to automate the download process and to remove irrelevant stories, such as sports reports and weather forecasts, from the text material. Another utility program combines the downloads into a single data file, eliminates duplicate stories and reformats the downloaded texts into the KEDS/TABARI input format.³⁴
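The filtering and de-duplication described above can be illustrated with a minimal Python sketch. It is not the Perl code used in the study: the keyword list, the story fields (`date`, `lead`) and the sample stories are assumptions made purely for illustration.

```python
import hashlib
import re

# Hypothetical keyword list; the study's actual Perl filters are not reproduced here.
IRRELEVANT = re.compile(r"\b(cricket|rugby|golf|weather forecast)\b", re.IGNORECASE)


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_stories(stories):
    """Drop stories matching the irrelevance keywords and same-day duplicate leads."""
    seen = set()
    kept = []
    for story in stories:
        if IRRELEVANT.search(story["lead"]):
            continue
        digest = hashlib.sha1(normalize(story["lead"]).encode("utf-8")).hexdigest()
        key = (story["date"], digest)
        if key in seen:
            continue
        seen.add(key)
        kept.append(story)
    return kept


# Invented sample stories (illustration only).
stories = [
    {"date": "850720", "lead": "South Africa declared a state of emergency on Saturday."},
    {"date": "850720", "lead": "South Africa  declared a state of emergency on Saturday."},
    {"date": "850721", "lead": "South Africa won the rugby test against New Zealand."},
]
cleaned = filter_stories(stories)
```

Hashing the normalized lead sentence only catches near-verbatim copies; rewritten versions of the same story would pass through this sketch.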

The second step is by far the most labor-intensive in the coding process: the development of the actor and verb dictionaries that the machine uses to code actors and events from the text material. The development of the original actor and verb dictionaries for the KEDS and PANDA projects, for example, cost the project teams about four person-years of working through Reuters lead sentences to identify relevant verb phrases and assign these phrases to appropriate event codes (Schrodt and Gerner 2000: 2/24/25). New dictionaries therefore have to be built on existing dictionaries from previous event data projects.

The SAID verb dictionary is based on the standard KEDS verb dictionary and is composed of several region-specific dictionaries that the KEDS team developed over the years (Schrodt and Gerner 2000; Schrodt et al. 1994; Huxtable 1997; Goldstein and Pevehouse 1997; Schrodt and Gerner 1998). The KEDS verb dictionary was modified slightly using the manual coding function in TABARI, which, for selected time periods, makes it possible to test the validity of the automatically coded lead sentences and to modify region-specific verb phrases. As a result, the SAID verb dictionary contains a total of 7,354 verb patterns that assign appropriate event codes to the events described in the text material.

The SAID actor dictionary is based on the KEDS West Africa dictionary, since southern Africa has not been a focus of the KEDS team and no region-specific dictionary has been developed yet. The West Africa dictionary nevertheless proved a useful starting point, since it already contained nearly 2,850 actor definitions (as of January 2005), including all major international actors and some of the important regional actors. The actor dictionary also needs periodic updates, since actor names and functions vary over time. New actors were identified using the Actor_Filter³⁵ program, which recognizes potential new actors in the text material by comparing patterns of consecutive capitalized words with actor names already known from an existing actor dictionary. I applied the rule that an actor has to show up at least twice over the whole period from 1977 to 1996 to be included in the actor dictionary (equivalent to an appearance in 0.01% of the total of lead sentences obtained from AP from 1977 to 1996). The modified SAID actor dictionary eventually includes 3,688 actor definitions that are used to identify the source and target actors of the coded events.

34 See the KEDS project website for text filters and additional utility programs: http://www.ku.edu/~keds/software.html. Accessed 2 June 2006.

35 The Actor_Filter program is also available on the KEDS website: http://www.ku.edu/~keds/software.dir/filters.html. Accessed 2 June 2006.
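The logic of the actor search described above (runs of consecutive capitalized words, compared against known actor names, with a minimum of two appearances) can be sketched in Python as follows. This is an illustrative approximation, not the Actor_Filter program itself; the regular expression, the function name `candidate_actors` and the sample lead sentences are assumptions.

```python
import re
from collections import Counter

# Runs of two or more consecutive capitalized words, e.g. "Desmond Tutu".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_actors(leads, known_actors, min_count=2):
    """Return capitalized phrases that are not yet in the dictionary
    and occur at least min_count times across all lead sentences."""
    known = {name.upper() for name in known_actors}
    counts = Counter()
    for lead in leads:
        for phrase in CAPITALIZED_RUN.findall(lead):
            if phrase.upper() not in known:
                counts[phrase] += 1
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


# Invented lead sentences and a tiny "existing dictionary" (illustration only).
leads = [
    "Anglican bishop Desmond Tutu appealed to the government for restraint.",
    "The government rejected the appeal by Desmond Tutu on Monday.",
    "Margaret Thatcher opposed sanctions against South Africa.",
]
known = ["SOUTH AFRICA", "MARGARET THATCHER"]
new_actors = candidate_actors(leads, known)
```

In this toy run only "Desmond Tutu" is proposed as a new actor; the two other capitalized phrases are already in the dictionary. A human coder would still have to decide which actor code each candidate receives.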

The third step is the actual machine coding of the text material. Machine coding of event data was essentially invented by the KEDS project in the early 1990s (Schrodt and Gerner 2004, 1994; Gerner et al. 1994). In 2000, Schrodt created a new program called Text Analysis by Augmented Replacement Instructions (TABARI). Like its predecessor KEDS, TABARI applies a computational method called ‘sparse parsing’ and some linguistic knowledge to identify the role of words in a sentence. Instead of trying to decipher a sentence fully, TABARI determines only those parts of the sentence that are required to code the events properly: the actors, compound nouns, compound verb phrases, and the referents of pronouns. The program first applies a large set of verb patterns (as defined in the verb dictionary) to determine the appropriate event code. An algorithm then identifies the relevant actors (according to the actor definitions in the actor dictionary): the source of an event is usually the first actor in the sentence and the target the second. This order is reversed when the sentence is in the passive voice. Compound noun phrases generate two or more events.
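The coding rule just described (match a verb pattern, take the first actor as source and the second as target, and reverse the order in passive sentences) can be illustrated with a deliberately simplified Python sketch. It is not TABARI: the two small dictionaries, the passive-voice test and the function name `code_sentence` are assumptions, and compound phrases and pronoun references are ignored.

```python
import re

# Toy dictionaries (assumptions); the event codes follow the labels shown in Table 11.
ACTORS = {"PRETORIA": "ZAFGOV", "WASHINGTON": "USAGOV", "LONDON": "GBRGOV"}
VERB_PATTERNS = {"ACCUSED": "112", "APPEALED TO": "020"}

# Crude passive-voice test: auxiliary + one word + "BY".
PASSIVE = re.compile(r"\b(?:WAS|WERE|IS|ARE|BEEN)\s+\w+\s+BY\b")


def code_sentence(sentence):
    """Return (source, target, event_code), or None if the sentence cannot be coded."""
    text = sentence.upper()
    event_code = next(
        (code for pattern, code in VERB_PATTERNS.items() if pattern in text), None
    )
    # Actor mentions sorted by their position in the sentence.
    hits = sorted(
        (match.start(), code)
        for name, code in ACTORS.items()
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text)
    )
    if event_code is None or len(hits) < 2:
        return None
    source, target = hits[0][1], hits[1][1]
    if PASSIVE.search(text):
        source, target = target, source  # passive voice reverses the roles
    return (source, target, event_code)


active = code_sentence("Washington accused Pretoria of stalling reform.")
passive = code_sentence("Pretoria was accused by Washington of stalling reform.")
uncoded = code_sentence("Pretoria announced new measures.")
```

Both the active and the passive sentence yield the same event, with USAGOV as source and ZAFGOV as target; the third sentence is left uncoded because no verb pattern and no second actor are found.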

Then, the program has to be fine-tuned. This fine-tuning is done in TABARI using the ‘coding’ option. Most modifications involve the addition of specific individual actors and of verb phrases describing behaviors specific to the research project being undertaken. The fine-tuning of the dictionaries also gives an indication of the accuracy of the coding system. Once the fine-tuning of the coding system has reached a satisfying level, the raw data are autocoded. Autocoding ensures that the coding rules have been applied consistently across the entire data set. It also allows the coding to be replicated at a later time. Table 11 illustrates a sequence of the event data that result as output from the machine coding process.

Table 11: Machine-Coded Event Data (Excerpt)

Date    Source  Target  Event type
850720  ZAFGOV  ZAF     172 (Impose administrative sanctions)
850720  USAGOV  ZAF     012 (Make pessimistic comment)
850722  USAGOV  ZAF     112 (Accuse)
850722  USAGOV  ZAFGOV  020 (Appeal)
850723  GBRGOV  GBR     050 (Support)
850723  GBRGOV  ZAFGOV  020 (Appeal)
850723  USAGOV  ZAF     020 (Appeal)

The fourth step is an aggregation in terms of time and level of measurement. The machine coding output consists of categorical time series data on a daily basis. The time unit of one day, however, is often too small, and the precision it implies is not warranted given the process that generated the data. Daily event data are therefore typically aggregated to weekly or (most often) monthly data. The time aggregation is done using the KEDS_count³⁶ program, which counts events in selected dyads within given intervals (typically one month) and produces a spreadsheet that can be imported into standard statistical programs.

In spite of this temporal aggregation, weekly or monthly event data still have a major advantage over time series drawn from official statistics, which are usually available only on a yearly or quarterly basis. Such highly aggregated data tend to “swallow up important interaction effects” (Schrodt et al. 1994: 207), whereas smaller temporal units, such as weeks or months, are better able to capture the causal mechanisms at work.
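The temporal aggregation, together with the one-a-day rule mentioned in footnote 36, can be sketched in a few lines of Python. This is not the KEDS_count program; the function names are assumptions. The first seven records are taken from Table 11, while the last two (a duplicate and an August event) are invented for illustration.

```python
from collections import Counter

# (date YYMMDD, source, target, event code)
events = [
    ("850720", "ZAFGOV", "ZAF", "172"),
    ("850720", "USAGOV", "ZAF", "012"),
    ("850722", "USAGOV", "ZAF", "112"),
    ("850722", "USAGOV", "ZAFGOV", "020"),
    ("850723", "GBRGOV", "GBR", "050"),
    ("850723", "GBRGOV", "ZAFGOV", "020"),
    ("850723", "USAGOV", "ZAF", "020"),
    ("850722", "USAGOV", "ZAF", "112"),  # invented duplicate (same day, dyad, code)
    ("850805", "USAGOV", "ZAF", "112"),  # invented August event
]


def one_a_day(records):
    """Keep only one event per dyad, event code and day."""
    seen = set()
    kept = []
    for record in records:
        if record not in seen:
            seen.add(record)
            kept.append(record)
    return kept


def monthly_counts(records, source, target):
    """Count events of one directed dyad per month (key: YYMM)."""
    counts = Counter()
    for date, src, tgt, _code in records:
        if src == source and tgt == target:
            counts[date[:4]] += 1
    return dict(counts)


filtered = one_a_day(events)
usa_zaf = monthly_counts(filtered, "USAGOV", "ZAF")
```

With the duplicate removed, the USAGOV-ZAF dyad has three events in July 1985 and one in August 1985; without the filter the July count would be four.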

For many analysis techniques, the level of measurement has to be raised as well. Three features were used to generate discrete or continuous interval-scaled time series for each dyad under investigation: 1) the number of events per dyad, 2) the net score on a conflict-cooperation scale, and 3) the number of events of specific types.

Counting the number of events that occurred in a selected dyad within a particular time interval is the simplest way to obtain a rough estimate of the intensity of interactions between a pair of actors (dyad). Naturally, this measurement ignores both the content of the interactions, whether friendly or hostile, and the relative weight of individual events, that is, which events have to be considered more relevant than others.

To include both the content and the relative weight of individual events, most event data projects convert nominal event counts into measurements of conflict and cooperation, based on the assumption that all behavior between political actors can be placed somewhere on a hostility-friendship continuum (Goldstein and Pevehouse 1997: 110). For every observation, a cooperation/conflict score is calculated by weighting each single event according to an interval-like scale. Summing these individual scores yields a net cooperation/conflict score per dyad and time interval. The most widely used interval-like scale in event data analysis was developed by Joshua Goldstein (2004) and is therefore also known as the Goldstein scale. Since the scale is based on the WEIS coding scheme, it has to be adapted to CAMEO for the purpose of this study.

36 KEDS_Count is available from the KEDS website: http://www.ku.edu/~keds/software.dir/utilities.html. Accessed 6 June 2006. A one-a-day filter was also applied to the data. This program filters event data using the rule that each dyad can have only one event per coding category per day (“daily unique dyad-code rule”). This algorithm eliminates all events coded from duplicate stories at the expense of a few false positives.

The Goldstein values for the individual event codes of the CAMEO-based SAID coding scheme are indicated in Table 8 (p. 105). Half of the CAMEO primary categories are identical to the corresponding WEIS categories, so the corresponding Goldstein values can be assigned directly. The remaining categories were weighted as follows: if the CAMEO code was a specification of a more general WEIS code, the more specific CAMEO code received the Goldstein value of the respective superior WEIS code. If a direct assignment to a broader WEIS code was not possible, the WEIS code most appropriate in terms of content, together with its Goldstein value, was assigned to the CAMEO code in question. For 15 CAMEO codes, neither of the two approaches allowed a meaningful assignment of a Goldstein value. These codes received a value relative to their position in the CAMEO coding scheme, considering the fact that the CAMEO categories show an increase in cooperation as one goes from category 01 to 09, and an increase in conflict as one goes from 10 to 20.

Apart from their attractiveness due to their quasi-interval scale characteristics, Goldstein scores also have their weaknesses. Conflictive and cooperative events that occur in the same time unit cancel each other out, with the result, for example, that dyads with both high conflict and high cooperation appear to show rather neutral behavior between the actors involved. Nor does the score reveal how many events contributed to it. The score can be the result of low interaction intensity with a few very cooperative or conflictive events, or it can emerge from very frequent interactions with a rather low level of conflict or cooperation. There is also considerable evidence that conflict and cooperation are not one-dimensional. Nations that cooperate extensively, for example in trade or alliances, also tend to have greater conflict than nations that are mutually isolated (Goldstein 1992: 160).
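A small Python sketch makes the calculation of the net score and the weaknesses just described concrete. The weights below are placeholders chosen for easy arithmetic, not the values from Table 8, and the three events are an invented illustration.

```python
from collections import defaultdict

# Placeholder weights (assumptions), NOT the Goldstein values used in the study.
PLACEHOLDER_WEIGHTS = {"020": 3.0, "112": -2.0, "172": -5.0}

# (date YYMMDD, source, target, event code); invented for illustration.
events = [
    ("850722", "USAGOV", "ZAF", "112"),
    ("850722", "GBRGOV", "ZAFGOV", "020"),
    ("850723", "GBRGOV", "ZAFGOV", "172"),
]


def net_scores(records, weights):
    """Sum the event weights and count the events per month and directed dyad."""
    totals = defaultdict(lambda: {"net": 0.0, "count": 0})
    for date, source, target, code in records:
        key = (date[:4], source, target)
        totals[key]["net"] += weights[code]
        totals[key]["count"] += 1
    return dict(totals)


scores = net_scores(events, PLACEHOLDER_WEIGHTS)
usa = scores[("8507", "USAGOV", "ZAF")]
gbr = scores[("8507", "GBRGOV", "ZAFGOV")]
```

Both dyads end up with the same net score of -2.0, although one rests on a single mildly conflictive event and the other on a cooperative and a strongly conflictive event that partly offset each other. Reporting the event count next to the net score, as in this sketch, makes that difference visible.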

Due to this particular weakness of score aggregates, event counts for individual event categories will be analyzed separately (Table 12).

Table 12: Event Category Aggregations

Category                           CAMEO Events
Verbal Cooperation (vercp)         Cue categories 01, 02, 03, 04, 05
Material Cooperation (matcp)       Cue categories 06, 07, 08
Verbal Conflict (vercf)            Cue categories 09, 10, 11, 12, 13
Material Conflict (matcf)          Cue categories 14, 15, 16, 17, 18, 19, 20
Mediation and Negotiation (meneg)  025, 026, 027, 028, 035, 036, 037, 038, 039, 066, 105, 106, 107, 108

Based on Schrodt and Gerner (1995: 315).

The aggregation of event categories as shown in Table 12 also reduces the potential effects of coding errors. For example, individual codes for different forms of verbal conflict can be assigned somewhat ambiguously to the different event types both by human coders and the machine coding systems. The aggregation of individual events to broader categories minimizes such potential ambiguities.
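The aggregation in Table 12 amounts to a lookup on the first two digits of each event code (the cue category), plus a separate list of mediation and negotiation codes. The following Python sketch shows one way to express this; the function names are assumptions, and treating `meneg` as an additional category that overlaps with the four main ones is an interpretation of the table, not something the text states explicitly.

```python
from collections import Counter

# Cue-category ranges as listed in Table 12.
CUE_RANGES = {
    "vercp": range(1, 6),    # 01-05
    "matcp": range(6, 9),    # 06-08
    "vercf": range(9, 14),   # 09-13
    "matcf": range(14, 21),  # 14-20
}
CUE_TO_CATEGORY = {
    f"{cue:02d}": category for category, cues in CUE_RANGES.items() for cue in cues
}
MENEG_CODES = {
    "025", "026", "027", "028", "035", "036", "037",
    "038", "039", "066", "105", "106", "107", "108",
}


def categories(event_code):
    """Return the aggregate categories an event code falls into."""
    result = [CUE_TO_CATEGORY[event_code[:2]]]
    if event_code in MENEG_CODES:
        result.append("meneg")  # assumption: counted in addition to the main category
    return result


def category_counts(event_codes):
    """Count events per aggregate category."""
    counts = Counter()
    for code in event_codes:
        counts.update(categories(code))
    return dict(counts)


# Event codes of the seven records shown in Table 11.
table_11_codes = ["172", "012", "112", "020", "050", "020", "020"]
counts = category_counts(table_11_codes)
```

For the Table 11 excerpt this gives five verbal cooperation events, one verbal conflict event and one material conflict event, so a miscoding between, say, two neighboring verbal conflict codes would not change the aggregate series.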

In a fifth step, the numerically aggregated event data allow for the application of different formal data analysis methods.³⁷ The data analysis techniques applied will be discussed in more detail in Section 4.5.

Before that discussion, the validity of the data sets that were produced through the different steps described above will be examined. Event data, like any data used in the social sciences, contain errors due to their source, coding techniques and other factors such as scaling and data aggregation. The advantages and disadvantages of event data have been explored extensively in the literature (Hudson et al. 2006; Schrodt 1995; Schrodt and Gerner 1994; Gerner et al. 1994; Hayes 1973; Merritt 1994; Duffy 1994; Huxtable and Pevehouse 1996; Sommer and Scarritt 1999). In the following, I therefore discuss only potential specific validity issues regarding the data used in this study. First, I will address some general concerns with event data related to the journalistic sources that are used in event data research and the applied coding procedures, in particular the machine coding of the data. Then, I turn to specific questions related to the SAID data set and discuss the data set’s potentials and limitations in more detail, in particular regarding its main source (AP).

In document Intervening against apartheid : the South Africa policy of the United States, West Germany, Sweden and Switzerland, 1977-1996 (Page 111-116)