4 Evidence I: Corpus results

4.1 Coding decisions

In the light of the various factors discussed in the preceding chapters, it was decided to code all ICE-GB and ICE-EA data for six contextual factor groups, the first four of which (#1–#4) are presented in Table 4.1 together with the dependent variable (DV). As Table 4.1 illustrates, the main variants of the dependent variable preposition placement were, of course, ‘stranded’ and ‘pied-piped’. In addition to this, an initial survey of the ICE-EA corpus showed that further variants had to be included for this variable (these are given in parentheses in Table 4.1):

(4.1) a. top managers with whom they will be doing business with. <ICE-EA:W1B-BK25>¹

b. Well this is an area which for quite some time I’ve I’ve featured on the forefront violence against women <ICE-EA:S1B037K>

c. Hostels no longer provide conditions to which students can study efficiently <ICE-EA:S2B032K>

d. It is a process you have to get initiated into it <ICE-EA:S1A026K:B>

Although Kenyan English is considered a variety with its own underlying rule system, it is notable, for the sake of comparison, that the tokens in (4.1) deviate from the forms expected in British English: in (4.1a) the preposition with occurs both stranded and pied-piped, while in (4.1b) the preposition of seems to be missing (cf. which … I’ve featured on the forefront of). In contrast to this, in (4.1c) the preposition to surfaces instead of the expected in (cf. conditions in which students can study efficiently). Finally, a resumptive pronoun occasionally appears after a preposition which was expected to be stranded ((4.1d); cf. a process you have to get initiated into).

1 Remember that the ICE-GB data was automatically extracted using ICECUP, which provides exact text code information including not only the text category but also the particular line an example is taken from. Since the ICE-EA data was extracted manually, it was only possible to give text category information but not the line number of an example.

As a result, tokens like (4.1a) were classified as ‘doubled’, those like (4.1b) were coded as ‘missing’ and those like (4.1d) were called ‘resumptive’. Moreover, tokens like (4.1c) were analysed as ‘Unexpected’ if the preposition was pied-piped and ‘unexpected’ if it was stranded.

The first group of independent factors, clause type, contains all of the clausal contexts discussed in section 3.1. In addition to these, however, it was also decided to code cleft-relatives separately from ordinary relative clauses due to the specific pragmatic foregrounding function of the former (cf. examples (3.13a–c) repeated here as (4.2a–c)):

(4.2) a. It was John who I talked to

b. It was John to whom I talked

c. It was to John that I talked

In order to capture the influence of idiosyncratic pragmatic effects on preposition placement in cleft-relatives, data such as (4.2) were classified as follows: (4.2a) was coded as a stranded cleft-relative and (4.2c) as a pied-piped cleft-relative. In contrast to this, cases such as (4.2b) were treated as pied-piped relative clauses. While this decision might appear somewhat arbitrary, it had the advantage of distinguishing cleft-relatives in which a PP was foregrounded (4.2c) from those in which only an NP was highlighted (4.2a). At the same time it made it possible to record the difference between (4.2a) and (4.2b).

Table 4.1 Dependent variable and factor groups #1–#4

Factor group                        Factors
DV  preposition placement           stranded, pied-piped, (doubled), (resumptive), (unexpected), (missing)
#1  clause types                    finite relative, non-finite relative, cleft-relative, main clause question, embedded interrogative, free relative, hollow, passive, comparative, preposed, exclamative
#2  displaced element               who, whom, which, Ø, that, what, whose, when, NP, wh-ever^a, where, how
#3  type of XP contained in         verb phrase, noun phrase, adjective phrase
#4  text type^b                     [spoken]: private dialogue, public dialogue, unscripted monologue, scripted monologue, mixed
                                    [written-as-spoken]: written-as-spoken
                                    [written]: private correspondence, business correspondence, legal presentations, non-professional writing, printed/edited texts

a This code included all free relative tokens ending in -ever, i.e. whatever, whoever, whomever, whichever, whosoever, wherever, whenever and however.

b Not all of these text types were sampled for both ICE-GB and ICE-EA; see text for details.

The number of factors in group #1 was furthermore increased by subdividing interrogative clauses into ‘main clause questions’ (Who did she talk to?) and ‘embedded interrogatives’ (I don’t know who she talked to.). Consequently, it was possible to test whether questions that are embedded in another clause behave differently from main clause questions.

The factors in the next two groups, displaced element and type of XP contained in, then contain no surprises. The group displaced element includes the logical complements of the preposition (cf. sections 3.1.1 and 3.1.3), and the group type of XP contained in codes whether the PP in question is embedded in a VP (Who did she [sleep with]VP?), an AdjP (the things he is [capable of]AP) or an NP (a girl who I couldn’t find [a present for]NP; cf. section 3.4).

In section 3.3, I claimed that a simple written–spoken dichotomy is insufficient for the evaluation of the effect of formality on preposition-stranding and pied-piping. Therefore, it was decided to use the various ICE text types as factors of the variable formality. As can be seen in Table 4.1, this made it possible to differentiate between formal and informal stylistic levels for both spoken and written English. The only point to remember with this factor group, though, is that the culture-specific situation in East Africa prevented the ICE-EA team from compiling a corpus which was perfectly matched with the British English component (see section 2.2.1.2). Unscripted monologues, for example, are not part of the Kenyan corpus. On top of that, ICE-EA has two additional text types not found in ICE-GB: ‘written-as-spoken’ texts (i.e. originally spoken material which had been transcribed by third parties and not the ICE-EA team) and legal presentations (formal written manuscripts which are to be read out). Nevertheless, since most text types are identical, it was possible to code both corpora for more or less the same factors.

In section 3.2 it was argued that a simple dichotomous complement–adjunct classification is inadequate for a detailed syntactic description of the factor group PP type (#5). For this variable it was instead decided to employ the fine-grained classification presented in Table 4.2.

Finally, in the last factor group variety (#6) data were coded as to whether they were taken from British English (ICE-GB) or Kenyan English (Kenyan English subcorpus of ICE-EA).

Thus all corpus tokens were classified according to their variant of the dependent variable preposition placement (DV), as well as the factor groups clause types (#1), displaced element (#2), type of XP contained in (#3), level of formality (#4), PP type (#5) and variety (#6).

Due to the great number of interaction effects discussed in sections 3.1 and 3.1.2.2, only the relative clause data from the corpora were then subjected to an additional multivariate analysis which, with the exception of clause type, contained all of the factor groups just mentioned. Furthermore, these tokens were also analysed for their finiteness, restrictiveness and complexity (Table 4.3).

Table 4.3 Additional factor groups for relative clause analysis

Factor group        Factors
finiteness          finite, non-finite
restrictiveness     restrictive, non-restrictive
complexity          2 = <2.5        a = 2.5–<3.0
                    3 = 3.0–<3.5    b = 3.5–<4.0
                    4 = 4.0–<4.5    c = 4.5–<5.0
                    5 = 5.0–<5.5    d = 5.5–<6.0
                    6 = 6.0–<6.5    e = 6.5–<7.0
                    7 = 7.0–<7.5    f = 7.5–<8.0
                    8 = 8.0–<8.5    g = 8.5–<9.0
                    9 = >9.0

Table 4.2 Factor group #5: PP type

Syntactic function of PP                                            Examples

OBLIGATORY COMPLEMENT
  Idiosyncratic stranding Ps                                        What … for / like
  ‘V-X-P’ idioms                                                    make light of, let go of, get rid of
  Prepositional ‘X’ (subcategorized P)                              sleep with ‘have sex with’, rely on, capable of
  Subcategorized PP                                                 put something in/on/over
  Obligatory complement                                             be/live in Spain/on the moon
OPTIONAL COMPLEMENT
  Optional complements                                              work at, talk to, postcards of, a proposal on, worried about
SPACE
  Affected location                                                 he sat on the chair, the book on the table
  Movement (goal, source, distance)                                 he rushed to the church, the paintings from the gallery
  Direction                                                         he ran along the road
  Position/location                                                 he killed the cat in the garden
TIME
  Position in time                                                  He died on Saturday, the game on Sunday
  Duration/frequency                                                He slept for seven hours
PROCESS
  Manner                                                            he ate the cake in a disgusting way
  Means/instrument                                                  He killed him with a knife
  Agent                                                             He was killed by John
RESPECT
  Accompaniment                                                     He came with Bill
  Respect                                                           For him, something’s always missing
                                                                    the article in which she states that…
                                                                    the house with red windows
CONTINGENCY
  Cause, reason, purpose, result, condition, concession             as a result of which / due to which
DEGREE
  Amplification, diminution …                                       the extent to which / degree to which

The first new factor group in Table 4.3, finiteness, was introduced because of the categorical effects of non-finite relative clauses (obligatory stranding in Ø-relative clauses and obligatory pied-piping in wh-relative clauses; see section 3.1.1). With respect to the token analysis, coding the data for their finiteness is straightforward, since these factors are overtly expressed in a sentence, e.g. by the non-finite marker to. In contrast to this, the classification of a relative clause as restrictive or non-restrictive is obviously more difficult.² Thus, following Olofsson (1981: 27ff.), in addition to the type of semantic information which a relative clause contributes to the meaning of the antecedent, the following set of criteria was employed to distinguish between the two types of relative clauses.

Non-restrictive relative clauses have weaker semantic ties with their antecedent than their restrictive counterparts. In spoken English, this is often indicated by the insertion of a short pause between antecedent and non-restrictive relative clause (Olofsson 1981: 30). Now, unlike ICE-EA (cf. Hudson-Ettle and Schmied 1999: 13), the spoken ICE-GB data is annotated for short pauses ‘<,>’. So at least for the ICE-GB data it was possible to use the presence of a pause marker as an indication of a non-restrictive relative clause. In (4.3), for example, the pause signals that the relative clause is merely a comment. Whereas the presence of such a pause marker turned out to be a reliable indication of non-restrictive clauses, its absence did not allow the identification of a relative clause as restrictive: in (4.4), for example, there is no pause, even though the relative clause only provides additional information about the antecedent:

In written English, the pause is mirrored by the orthographic convention of putting a comma between a non-restrictive clause and its antecedent (Huddleston, Pullum and Peterson 2002: 1058). Thus, just as with the pause marker in spoken English, the absence or presence of a comma in the written data was taken as an indicator of restrictive and non-restrictive clauses, respectively. Again, however, the absence of a comma did not necessarily allow the classification of a relative clause as restrictive in either the ICE-GB (4.5) or the ICE-EA data (4.6):

2 As discussed in section 3.1.2.2, one should rather see the ‘restrictive’ vs ‘non-restrictive’ distinction as an ‘obligatory’ vs ‘non-obligatory’ one. Due to their widespread acceptance, it was, however, decided to use the traditional terms.

(4.3) They ’ve got a throw-in <,> which they ’ll have to settle for on the far side <ICE-GB:S2A-014 #260:1:A>

(4.4) This is Humphrey Davy who you may have heard of in connection with nitrous oxide which he invented <ICE-GB:S2A-027 #1:1:A>

(4.5) a. You will need to show your sight test receipt and your AG 3 to the person from whom you buy your glasses. <ICE-GB:W2D-001 #86:1>

b. This novel was followed by Shadow of the Condor in which Ronald Malcolm reappears and was hailed by one critic as ‘the most likeable and unlikely CIA agent on record’. <ICE-GB:W2B-005 #59:1>

Whereas the relative clauses in (4.5a) and (4.6a) are clearly restrictive, the lack of a comma in (4.5b) and (4.6b) cannot be seen as an indication of restrictiveness: the relative clauses might convey important additional information about the novel or the period after the two weeks, but they are clearly optional for the identification of the antecedent’s reference.

Furthermore, in accordance with Olofsson’s classification (1981: 27ff.), all Ø-relative clauses turned out to be restrictive (cf. (4.7a) for an ICE-GB example and (4.7b) for one from the ICE-EA; also see section 3.1.2.2), as were most that-introduced ones. The only exceptions were cases like (4.8b), in which the context (cf. (4.8a)) has already established the antecedent’s reference and the that-clause is only functioning as an ‘aspect clause’, i.e. it is used ‘to indicate that a particular aspect of the antecedent is to be thought of’ (Olofsson 1981: 29). Interestingly, the data from the ICE-EA contained no examples such as (4.8b).

Since it was also possible to check the context of ambiguous examples like (4.8b) in the ICE-GB corpus, the criteria just outlined allowed for a comparatively unproblematic coding for the factor group restrictiveness.

As mentioned in section 3.5, it was also decided to test the relative clause data for purely structural complexity effects using Lu’s parsing-orientated ‘Mean Chunk Number’ (MCN) hypothesis (2002). In Lu’s approach the Instant Chunk Numbers (ICNs) of a clause are summed and divided by the number of words to be integrated, so the formula yields continuous variables. Since Goldvarb can only process discrete variables, a continuous variable like the MCN has to be arbitrarily divided into discrete categories. Yet, since MCNs are only a heuristic measure of complexity, this was not considered to be problematic. In fact, as the statistical analysis showed, it was possible to significantly reduce the categories of this variable (see below). On top of that, as Sigley points out, such an arbitrary division of continuous variables is unproblematic for the Goldvarb analysis, ‘provided that decisions are made consistently’ (1997: 20).

In order to guarantee a consistent classification of the various MCNs, standard rounding procedures were employed, which produced the categorical

(4.6) a. It concerns the content of the messages, the medium through which they are passed and the mechanisms at work in the passing of such messages. <ICE-EA:W2A016K>

b. Beetles were left for two weeks after which they were removed by sieving, leaving larvae to develop. <ICE-EA:W2A024K>

(4.7) a. I mean wouldn’t she have a grown up son and think god he’s exactly like the bloke [Ø] I fell in love with <ICE-GB:S1A-006 #138:1:B>

b. The machines [Ø] I talked about earlier isn’t a machine of a person who is able to speak or able to communicate or even able to perceive <ICE-EA:S1B021K:B>

(4.8) a. Actually it’s not a small garden …

b. nice nice size garden that she really looks after <ICE-GB:S1A-025 #137:1:B-#141:1:B>

factors as illustrated in Table 4.3. (Example (3.139) whom I think I had some designs or intentions on, for example, which was discussed in section 3.5, had an MCN of 4.7. According to the factor divisions given in Table 4.3 this token was coded for the factor ‘c’.)
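The mapping from MCN values to the factor codes of Table 4.3 can be sketched as follows. This is only an illustration, not the procedure actually used for the coding: the function name `complexity_code` is hypothetical, the MCN is compared directly with the band boundaries, and the treatment of a value of exactly 9.0 (which Table 4.3 leaves open) is an assumption.

```python
import math

# Factor codes of Table 4.3 in ascending order of MCN; every band between
# 2.5 and 9.0 is 0.5 wide, and digits alternate with letters.
CODES = "2a3b4c5d6e7f8g9"


def complexity_code(mcn: float) -> str:
    """Return the Table 4.3 factor code for a Mean Chunk Number."""
    if mcn < 2.5:
        return "2"
    if mcn >= 9.0:
        # Assumption: a value of exactly 9.0 is grouped with '>9.0'.
        return "9"
    # floor(2 * mcn) counts half-unit steps: 2.5 -> 5, 3.0 -> 6, and so on.
    # Subtracting 4 turns this into an index into CODES (2.5 -> 1 -> 'a').
    return CODES[math.floor(mcn * 2) - 4]


print(complexity_code(4.7))  # the token from example (3.139): prints c
```

Under these assumptions the token with an MCN of 4.7 receives the code ‘c’, as stated in the text.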

Even though continuous factors are not a problem for the Goldvarb analysis per se, attention must be drawn to the fact that any result involving the factor group complexity will nevertheless have to be interpreted carefully. For example, one flaw of the MCN calculation is that complex material has a greater effect the later it appears in a sentence. Compare e.g. the knife which[1] John[2] killed[3] the[4] man[4] with[5] and the knife which[1] the[2] man[2] killed[3] John[4] with[5], in which both relative clauses contain the same lexical material and should be expected to be equally complex. However, as the chunk annotation in the square brackets shows, the later the NP the man appears in the sentence, the ‘heavier’ it becomes: thus, the former sentence has an MCN of (1+2+3+4+4+5)/6 = 3.17, and the latter (1+2+2+3+4+5)/6 = 2.83. Nevertheless, despite these shortcomings, the MCN approach was still considered superior to other measures of complexity since it explicitly predicts that stranding is structurally more complex than pied-piping (see section 3.5).
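The arithmetic of the two examples can be reproduced with a few lines of Python. The sketch assumes that the chunk annotation (the bracketed ICN of each word) has already been assigned by hand; the function name `mean_chunk_number` and the variable names are illustrative only.

```python
def mean_chunk_number(icns: list[int]) -> float:
    """Sum the Instant Chunk Numbers and divide by the number of words."""
    if not icns:
        raise ValueError("at least one annotated word is required")
    return sum(icns) / len(icns)


# which[1] John[2] killed[3] the[4] man[4] with[5]
late_np = [1, 2, 3, 4, 4, 5]
# which[1] the[2] man[2] killed[3] John[4] with[5]
early_np = [1, 2, 2, 3, 4, 5]

print(round(mean_chunk_number(late_np), 2))   # prints 3.17
print(round(mean_chunk_number(early_np), 2))  # prints 2.83
```

The two lists differ only in the position of the pair of words forming the NP the man, which is enough to shift the mean by about a third of a chunk.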

As the initial run of the Goldvarb program (‘no recode’) showed, the data extracted from the ICE-GB corpus contained 1,768 relevant tokens, 985 of which were stranded and 783 of which were pied-piped. In contrast to this, the Kenyan data from the ICE-EA had 1,247 tokens, including 808 stranded and 439 pied-piped ones.³ In addition to this, the Kenyan part of the ICE-EA included 14 doubled-preposition, 22 missing, 18 unexpected (13 of which were stranded, 5 of which were pied-piped) and 7 resumptive tokens. Yet, before comparing mere frequencies it must be kept in mind that the two corpora from which the tokens were extracted differ in size: the ICE-GB corpus consists of 1,060,000 words, while the Kenyan subcorpus of the ICE-EA only has 791,695 words (see section 2.2.1). This is not a problem for the statistical analyses of the data (since Goldvarb and HCFA both successfully correct for such distributional dependence effects; see Sigley 1997: 248–50 and Gries 2008: 247, respectively). For the sake of illustration, however, I will occasionally provide normalized figures of the ICE-EA results to make them comparable to those from the ICE-GB. These normalized figures were obtained by multiplying the ICE-EA figures by a factor of 1.34 (= words in ICE-GB/words in ICE-EA = 1,060,000/791,695).⁴

3 These figures do not include subject-contained PPs that precede the subject, 17 instances of which can be found in the ICE-GB (Certain books or scores … of which details are given in the leaflet … <ICE-GB:W2D-006 #100:1>) and 4 in the ICE-EA (the Mijikenda of which about 70% belong to the Giriyama tribal group <ICE-EA:W2A027K>). As pointed out in section 3.4, stranding is not an option in these cases, but a larger corpus is clearly needed to investigate the factors which lead to preposing such PPs to a pre-subject position.

4 The normalization calculations were carried out using Excel. The normalized results are always rounded to whole numbers.

Applying this normalization procedure, a one-million word ICE-EA corpus was predicted to contain 1,670 Kenyan tokens, 1,082 of which should be stranded and 588 of which should be pied-piped. Note, however, that in the following, it can be assumed that all figures given are actual frequencies unless explicitly stated otherwise and that only raw frequencies were used for all statistical tests.
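The normalization step amounts to a single multiplication and can be sketched as follows. The original calculations were carried out in Excel, so the function below is only an illustrative re-computation; the names `normalize`, `ICE_GB_WORDS` and `ICE_EA_KENYA_WORDS` are hypothetical. The sketch uses the unrounded ratio of the two corpus sizes, since this is what reproduces the figures given above.

```python
ICE_GB_WORDS = 1_060_000
ICE_EA_KENYA_WORDS = 791_695


def normalize(ice_ea_count: int) -> int:
    """Scale a raw ICE-EA frequency to the size of ICE-GB, rounded to a whole number."""
    return round(ice_ea_count * ICE_GB_WORDS / ICE_EA_KENYA_WORDS)


print(round(ICE_GB_WORDS / ICE_EA_KENYA_WORDS, 2))  # prints 1.34
print(normalize(1247))  # all Kenyan tokens: prints 1670
print(normalize(808))   # stranded: prints 1082
print(normalize(439))   # pied-piped: prints 588
```

Multiplying by the rounded factor 1.34 instead would give slightly different results for some counts (e.g. 1,671 rather than 1,670 for the total), which is why the unrounded ratio is used here.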

Since all the factor groups have been introduced, I will now present the results of the corpus studies. I will begin with the categorical clause contexts (section 4.2), before moving on to the data displaying variation with respect to preposition placement (section 4.3). Finally, since relative clauses potentially exhibit additional constraints not found in other clause types, I will focus on relative clauses in both varieties (section 4.4).

4.2 Categorical clause contexts
