
4.3 Considerations behind Corpus Design

4.3.1 Data collection

The question of how to collect text messages parallels that faced by compilers of written or spoken corpora, who must similarly identify the nature of the data required, locate it, and access it (Hunston 2002: 26-27),2 and whose methods similarly differ according to the nature of the data, the research questions and practicalities. The decision made in this study to recruit family and friends emerged from an evaluation of the drawbacks and benefits of the more quantitative and more qualitative approaches used in previous texting studies, in the light of the current aims and the pragmatic need to collect sufficient data.

Data collection methods used in previous studies of text messaging are tabulated below. Broadly categorised, alongside the recruitment of family and friends (Hard af Segersteg 2002), they include qualitative approaches which probe the texting practices of a few individuals (Grinter and Eldridge 2002; Kasesniemi and Rautiainen 2002) and quantitative, often anonymous and web-based, methods (Fairon and Paumier 2005; How and Kan 2005; University of Leicester 2006).


Table 4.3 Data collection procedures used across the text messaging literature

Study: Grinter and Eldridge 2001; 2003
Corpus specifications: 477 messages; 5 participants aged 15-16.
Data collection methods: Record ‘logs’, in which participants recorded messages sent and received during 7 consecutive days, followed up by video-taped group discussions and questionnaires.
Research focus: Communicative functions, analysis of conversation ‘threads’ and abbreviations.

Study: Hard af Segersteg 2002
Corpus specifications: 1152 messages; 17,024 words.
Data collection methods: Various: 112 from an anonymous webpage, 252 messages forwarded from volunteers (two males and two females aged 12-25) and 788 from family and friends.
Research focus: Abbreviation and structural ellipsis, to determine how written communication adapts to technology.

Study: Kasesniemi and Rautiainen 2002
Corpus specifications: Nearly 8000 messages from nearly 1000 teenagers, accompanied by interview, field notes, and observation journals (completed by participants).
Data collection methods: Teenagers recruited through various channels, including television, the Internet, teachers at schools and the ‘snowball’ technique. Messages submitted by teenagers, with ‘cover letters’, along with the observation journals.
Research focus: Social-scientific study into the communicative practices of mobile technology.

Study: Oksman and Turtianen 2004
Research focus: Exploration of texted interaction through a symbolic interactional approach to determine whether new forms of social interaction are being produced.

Study: Thurlow 2003
Corpus specifications: 544 messages; 135 British university students, aged around 19.
Data collection methods: 5 text messages were transcribed and submitted by students at end of university lecture, during which they were recruited.
Research focus: Linguistic study of abbreviations, communicative functions and message length.

Study: How 2004; How and Kan 2005
Corpus specifications: 10,000 messages.
Data collection methods: 3,348 messages from a ‘website collection programme’ with the 146 undergraduate participants financially rewarded for their contributions; 6,167 messages from a ‘small pool’ of 20 participants aged 18-22; and 602 from a Yahoo SMS chat website.
Research focus: Studies aimed at improving predicted text entry.

Study: Fairon and Paumier 2006
Corpus specifications: 30,000 messages (from an initial 75,000); 3,200 participants aged 12 to 65.
Data collection methods: Project was broadcast nationally (October and December 2004) and participants requested to send texts to a free mobile number (sociolinguistic information including ‘ability to decrypt SMS’ and ‘writing habits’).
Research focus: Compiling and preprocessing the corpus to serve as a reference corpus, and translating the language into ‘standard’ French for future study.

Study: University of Leicester 2006 (main researcher: Dr Tim Grant, now at Centre for Forensic Linguistics, Aston)
Corpus specifications: [final corpus specifications pending, Grant 2009 pers comm.]
Data collection methods: Participants in the study were asked to submit 10 messages to a website and the researchers hoped to recruit at least one hundred texters. Sampling technique snowballing used, whereby participants were encouraged to urge friends to contribute.
Research focus: Forensic linguistic research which aims to analyse ‘linguistic consistency and variation in individuals’ texting style; and also ‘the influence of peer groups upon writing style and texting language’.

Study: Ling and Baron 2007
Corpus specifications: 191 text messages (1473 words) collected.
Data collection methods: Paper diaries distributed to university students at an American university.
Research focus: Linguistic comparison with IM communication.

The studies above point to several advantages of recruiting family and friends over quantitative methods, which often involve anonymous participants. These include the greater likelihood of ensuring authenticity (Hard af Segersteg 2002: 209-210; How and Kan 2005; Fairon and Paumier 2006; University of Leicester 2006),1 greater familiarity with participants’ backgrounds, the ability to acquire personal information (Fairon and Paumier 2006),2 and the fact that recruiting friends and family achieves depth (allowing greater understanding of individuals’ behaviour) as well as breadth. The advantages over more in-depth studies are savings in time and cost (Kasesniemi and Rautiainen 2002),3 as well as a greater focus on breadth, owing to the enhanced possibility of recruiting willing participants (Hard af Segersteg 2002).

Choice of method, however, also depends upon the current research focus. The advantage of the snowballing technique,4 for example, is that it enables researchers to explore not only how individuals vary in linguistic practices, but also how networks of texters differ: ultimately the goal of the University of Leicester (2006). Other studies that aim to describe linguistic features typical of texting require breadth and variety of textual data rather than (or as well as) in-depth knowledge of participants’ backgrounds, and so adopt quantitative procedures for at least part of their collection process (Fairon and Paumier 2006; Hard af Segersteg 2002). Similarly, How and Kan’s (2005) purpose is to increase typing speeds through rearrangements of the phone keyboard layout, for which they need textual rather than sociological data. In contrast, social-scientific researchers such as Kasesniemi and Rautiainen (2002: 171), who aim to document ‘text messaging as it relates to the life of teenagers’, require text messages to support insights gleaned through qualitative fieldwork, rather than a wide range of participants. Recruiting family and friends yields a substantial number of participants and text messages, whilst facilitating the collection of some biographical information, and is appropriate for a study which prioritises textual data rather than sociological information.

1 The 75,000 messages received by Fairon and Paumier (2006) were filtered down to 30,000 in part by removing messages written for their team’s attention. Hard af Segersteg (2002: 209) notes that some messages ‘seemed to actually be filled in by someone … making things up for the fun of it’ and that these were ‘ignored’. How and Kan (2005) also report removing ‘non-genuine’ messages, although they do not explain how they were identified.

2 Of Fairon and Paumier’s (2006) 3200 participants, only 2500 returned sociolinguistic data, while other studies did not request it (How and Kan 2005; Hard af Segersteg 2002).

3 Even in a relatively small-scale study such as Hard af Segersteg’s (2002), the four participants were paid for messages submitted, by having their pay-as-you-go cards topped up (p. 210).

4 The snowballing technique adopted by the University of Leicester involves initial participants being urged to invite friends to join the study, who then invite their friends and so on. Participants were also asked to devise ID codes for themselves and the recipients of the submitted texts, presumably so that it could be established whether recipients later joined the study as participants in their own right.

The decision in the current study to supplement messages from friends and family with 441 messages from an AOL website reflects those made by other researchers intent on ensuring breadth and depth. Also evident in these decisions, as with all data collection decisions, is the need to be pragmatic in order to obtain sufficient data or, in Hunston’s (2002: 26) words, to ‘make use of as much data as is available’. This is not only because of time, cost and the practical difficulties involved in transferring text messages into computer-readable formats, but also because of people’s reluctance to part with messages (Hard af Segersteg 2002: 207), given their well-documented private nature (Kasesniemi and Rautiainen 2002: 181-2; Harkin 2003; Taylor and Harper 2003). Attempts to overcome this include obtaining data from various available sources, including online text message services (Hard af Segersteg 2002; How and Kan 2005). The drawbacks to using online text messages, such as those from the AOL website, are that message content and language may reflect the fact that they were sent through a third-party service, that it is not possible to know who sent the messages or to ensure that they constitute genuine communication, and that it is impossible to gain consent from senders. At the same time, in this case, there is no reason to believe messages were made up, whilst the last objection (potentially more serious) is offset by the fact that the AOL service was never particularly private.1 Accordingly, for pragmatic reasons and to increase breadth of participants, the online text messages were included.

In document A corpus linguistics study of SMS text messaging (Page 80-84)