
Contents

Preface . . . ix

Period 1: Impressionistic observation . . . ix

Period 2: Diaries and biographies . . . x

Period 3: Transcripts . . . x

Period 4: Computers . . . xi

Period 5: Connectivity and Exploratory Reality . . . xii

Three Goals and Three Tools . . . xii

Some Words of Appreciation . . . xiii

How to Use this Book . . . xiv

Changes and the Future . . . xv

1: Principles of Transcription . . . 1

1.1 The Promise of Computerized Transcription . . . 1

1.2 Some Words of Caution . . . 2

1.2.1 The Dominance of the Written Word . . . 2

1.2.2 The Misuse of Standard Punctuation . . . 3

1.2.3 The Advantages of Working with a Videotape . . . 4

1.3 Transcription and Coding. . . 4

2: The CHAT Transcription System . . . 5

2.1 The Goals of a Transcription System . . . 5

2.2 Learning to Use CHAT . . . 6

2.2.1 minCHAT and minDOS . . . 6

2.2.2 Analyzing One Small File . . . 6

2.2.3 midCHAT . . . 7

2.2.4 Problems with Forced Decisions . . . 8

2.3 minCHAT. . . 8

2.3.1 The Form of Files . . . 9

2.3.2 The Form of Utterances . . . 9

2.3.3 The Documentation File . . . 10

2.3.4 Checking Syntactic Accuracy . . . 12

2.3.5 ASCII and Special Characters . . . 12

3: File Headers . . . 13

3.1 Obligatory Headers . . . 13

3.2 Constant Headers . . . 15

3.3 Changeable Headers . . . 18

4: Transcribing Words . . . 23

4.1 The Form of the Main Line . . . 24

4.2 Special Learner Form Markers. . . 24

4.3 Unidentifiable Material . . . 27

4.4 Incomplete and Omitted Words . . . 30

4.5 Standardized Spellings . . . 32

4.5.1 Letters . . . 33

4.5.2 Acronyms. . . 33

4.5.3 Numbers and Titles . . . 34

4.5.4 Kinship Forms . . . 34

4.5.6 Assimilations . . . 35

4.5.7 Exclamations . . . 36

4.5.8 Interactional Markers . . . 37

4.5.9 Spelling Variants . . . 38

4.5.10 Colloquial Forms . . . 38

4.5.11 Baby Talk . . . 39

4.5.12 Dialectal Variants . . . 40

4.5.13 Disambiguating Homophones in Japanese . . . 41

4.5.14 Punctuation in French and Italian . . . 41

4.5.15 Shortenings in Dutch . . . 42

5: Transcribing Morphemes . . . 43

5.1 Codes for Morphemicization . . . 43

5.2 Standard Forms for Affixes and Clitics . . . 45

5.3 Placing Morphemicizations in Brackets. . . 46

6: Utterances and Tone Units . . . 49

6.1 One Utterance or Many? . . . 49

6.2 Discourse Repetition . . . 51

6.3 Basic Utterance Terminators . . . 52

6.4 Tone Unit Marking. . . 53

6.4.1 Terminated Tone Units . . . 53

6.4.2 Non-final Tone Markers . . . 54

6.5 Prosody within Words . . . 55

6.6 The Comma . . . 56

6.7 Pauses . . . 57

6.8 Special Utterance Terminators . . . 58

6.9 Utterance Linkers . . . 60

7: Scoped Symbols . . . 63

7.1 Postcodes . . . 71

8: Dependent Tiers . . . 73

8.1 Standard Dependent Tiers . . . 73

8.2 Creating Additional Dependent Tiers . . . 79

8.3 Synchrony Relations . . . 80

9: Adapting CHAT for special topics . . . 83

9.1 Code-switching and Voice-switching . . . 83

9.2 Elicited Narratives and Picture Descriptions . . . 84

9.3 Signed Language . . . 85

9.4 Written Language . . . 90

9.5 Children with Disfluencies. . . 91

10: UNIBETs. . . 93

10.1 A UNIBET for English . . . 93

10.2 Sample Transcriptions Using UNIBET . . . 98

10.3 UNIBET and CLAN. . . 98

10.4 UNIBETs for Other Languages . . . 99

10.4.1 Dutch . . . 99

10.4.3 German . . . 102

10.4.4 Italian . . . 104

10.4.5 Japanese . . . 105

10.4.6 Portuguese . . . 106

10.4.7 Spanish . . . 107

10.5 Romanization of Cyrillic . . . 108

11: PHONASCII. . . 111

11.1 Segment Strings . . . 111

11.2 Suprasegmentals . . . 112

11.3 Consonants . . . 113

11.3.1 Stops . . . 114

11.3.2 Fricatives . . . 115

11.3.3 Other Consonant Types . . . 115

11.4 Vowels . . . 117

11.5 Diacritics . . . 119

11.6 Cover Symbols . . . 120

11.7 Prosodies and Suprasegmentals . . . 120

11.8 Sample Transcriptions . . . 121

12: Error Coding . . . 125

12.1 Coding Format . . . 125

12.2 Specific Error Codes . . . 128

12.3 Hesitation Codes . . . 131

12.4 Codes for CED . . . 132

12.5 Examples . . . 132

13: Speech Act Codes . . . 135

13.1 Interchange Type Categories . . . 136

13.2 Categories of Illocutionary Force. . . 137

13.2.1 Directives . . . 137

13.2.2 Speech Elicitations . . . 137

13.2.3 Commitments . . . 138

13.2.4 Declarations . . . 138

13.2.5 Markings . . . 138

13.2.6 Statements . . . 138

13.2.7 Questions . . . 138

13.2.8 Performances . . . 139

13.2.9 Evaluations . . . 139

13.2.10 Demands for clarification . . . 139

13.2.11 Text editing . . . 139

13.2.12 Vocalizations . . . 139

14: Morphosyntactic Coding . . . 141

14.1 Morphological Coding . . . 141

14.2 Part of Speech Codes . . . 142

14.3 Stems . . . 143

14.4 Affixes and Clitics . . . 144


14.6 Sample Morphological Tagging for English . . . 146

14.7 Error Coding on the %mor Tier . . . 150

14.8 Coding Syntactic Structure. . . 150

14.9 Codes for Grammatical Morphemes . . . 151

14.10 Parts of Speech and Markedness Conventions. . . 155

14.10.1 Specialized Codes for Hungarian . . . 155

14.10.2 Specialized Codes for German . . . 157

15: Examples of Transcribed Data . . . 159

16: Recording, Digitization, and Transcription Techniques . . . 163

16.1 Techniques for Recording . . . 163

16.2 Recording Equipment. . . 164

16.3 Transcribing Equipment . . . 164

16.4 Audio Digitization . . . 165

16.5 Video Digitization . . . 166

17: CHAT Symbol Summary . . . 169

18: Introduction to CLAN . . . 175

18.1 Learning CLAN . . . 175

18.2 Installing CLAN . . . 175

18.3 Starting CLAN . . . 175

18.4 Setting Directories . . . 176

18.5 CLAN Commands . . . 176

18.6 Redirection . . . 177

18.7 Shell Commands . . . 177

18.8 Online Help . . . 179

18.9 Testing CLAN . . . 179

18.10 Bug Reports . . . 180

18.11 Program Modification Requests. . . 180

19: CLAN tutorial . . . 183

20: The Editor. . . 197

20.1 Editor Mode. . . 197

20.1.1 File, Edit, and Font Menus . . . 198

20.1.2 Searching . . . 198

20.1.3 Keyboard Commands . . . 198

20.1.4 Tiers Menu . . . 198

20.1.5 Tier Exclusion . . . 198

20.1.6 Running check inside the Editor. . . 199

20.2 Non-ASCII Characters . . . 199

20.2.1 Roman-based Character Sets . . . 199

20.2.2 Non-Roman Scripts . . . 199

20.3 Editor Preferences and Options . . . 200

20.4 Mode Toggling . . . 200

20.5 Sonic CHAT . . . 201

20.6 Sonic Transcriber Mode . . . 201

20.7 Continuous Playback . . . 202


20.9 Disambiguator Mode . . . 203

20.10 Coder Mode . . . 203

20.10.1 Entering Codes . . . 204

20.10.2 Setting up your codes file . . . 205

21: CHAINS – Sequences of Interactional Codes . . . 209

21.1 Sample Runs. . . 209

21.2 CHAIN Options . . . 213

22: CHECK – Verifying Data Accuracy . . . 215

22.1 How CHECK Works . . . 215

22.2 The Construction of “depfile.cut” . . . 215

22.3 Running CHECK. . . 217

22.4 Some Hints for Using CHECK . . . 217

22.5 CHECK Options . . . 218

23: CHIP – Analysis of Interaction – Jeff Sokolov . . . 219

23.1 The Tier Creation System . . . 219

23.2 The Coding System . . . 220

23.3 Word Class Analysis . . . 221

23.4 Summary Statistics . . . 222

23.5 CHIP Options . . . 225

24: CHSTRING – Altering Strings in Files . . . 227

25: COMBO – Boolean Searching . . . 231

25.1 Composing Search Strings . . . 231

25.2 Examples of Search Strings . . . 232

25.3 Referring to Files in Search Strings . . . 233

25.4 Cluster pairs in COMBO . . . 234

25.5 Searching for Clausemates . . . 234

25.6 Tracking Final Words. . . 234

25.7 Tracking Initial Words . . . 235

25.8 Adding Excluded Characters . . . 235

25.9 Limiting with COMBO . . . 235

25.10 COMBO Options . . . 236

26: COOCCUR – Cooccurrence Analysis . . . 239

27: DIST – Distances Between Codes . . . 241

28: DSS – Developmental Sentence Score . . . 243

28.1 CHAT File Format Requirements . . . 243

28.2 Selection of a 50-sentence Corpus . . . 243

28.3 Automatic Calculation of DSS . . . 244

28.4 Interactive Calculation . . . 245

28.5 DSS Output . . . 246

28.6 DSS Summary . . . 247

28.7 DSS Options . . . 249

29: FREQ – Frequency Counts . . . 251

29.1 What FREQ Ignores . . . 251


29.3 Using Wild Cards with FREQ. . . 252

29.4 Directing the Output of FREQ. . . 254

29.5 Limiting in FREQ . . . 255

29.6 Studying Unique Words and Shared Words . . . 256

29.7 FREQ Options . . . 257

29.8 FREQMERG - Merging FREQ Output . . . 258

29.9 FREQPOS – Positional Frequency Analysis . . . 258

30: GEM – Tagging Interesting Passages . . . 261

30.1 Sample Runs. . . 261

30.2 Limiting with GEM. . . 262

30.3 GEM options . . . 262

30.4 GEMFREQ – Frequency Counts by Activity Types . . . 263

30.5 GEMLIST – Profiling “Gems” within Files. . . 264

31: KEYMAP – Contingency Analysis . . . 267

31.1 Sample Runs. . . 267

31.2 KEYMAP options . . . 267

32: KWAL – Key Word and Line . . . 269

32.1 Limiting in KWAL. . . 269

32.2 KWAL Options . . . 270

33: MAXWD – Tracking String Length . . . 273

34: MLT – Mean Length of Turn . . . 275

34.1 MLT Defaults . . . 275

34.2 Sample Runs. . . 276

34.3 MLT options . . . 276

35: MLU – Mean Length of Utterance . . . 279

35.1 MLU Defaults . . . 279

35.2 Sample Runs. . . 280

35.3 Including and Excluding Utterances in MLU and MLT. . . 280

35.4 MLU options . . . 281

36: MODREP – Matching Words Across Tiers . . . 285

36.1 Exclusions and Inclusions . . . 286

36.2 Using a %mod Line . . . 286

36.3 MODREP and COMBO -- Cross-tier COMBO . . . 287

36.4 MODREP Options . . . 287

37: MOR – Morphological Analysis . . . 289

37.1 MOR Options . . . 291

37.2 MOR Lexicons . . . 292

37.3 File Preparation . . . 293

37.4 Lexicon Building . . . 294

37.5 Creating Rule Files. . . 295

37.6 Arules . . . 297

37.7 Crules . . . 299

37.8 Problems with MOR . . . 301

39: RELY – Measuring Code Reliability . . . 305

40: SALTIN – Converting SALT Files . . . 307

41: STATFREQ – Outputting to Statistical Analyses . . . 309

42: TEXTIN – Converting Unstructured Text to CHAT . . . 311

43: TIMEDUR – Quantifying pauses and overlaps . . . 313

44: WDLEN – Graphs of Word Length . . . 315

45: CLAN Options . . . 317

45.1 An Alphabetical Listing of Options . . . 317

45.2 Metacharacters for Searching. . . 324

46: Utilities . . . 327

46.1 COLUMNS – Display of CHAT Files in Columns. . . 327

46.2 DATES – Computing Ages and Dates . . . 329

46.3 FLO – Creating a simple output . . . 329

46.4 LINES – Adding Line Numbers . . . 329

46.5 MAKEDATA – Creating files for other platforms . . . 330

47: Word Lists . . . 333

48: CHILDES/BIB and LEX. . . 345

48.1 CHILDES/BIB . . . 345

48.2 LEX - A Lexical Development Norms Database. . . 345

48.2.1 LEX Installation. . . 347

48.2.2 LEX Operation. . . 347

49: The Database . . . 349

49.1 Documentation and Quality Control . . . 349

49.2 Retrieving Materials through the Internet . . . 350

49.3 Obtaining Materials on CD-ROM . . . 351

49.4 Contributing Data to CHILDES . . . 351

49.5 The Documentation File. . . 351

49.6 Acknowledgments and Contributors . . . 351

50: Research Based on CHILDES . . . 365

50.1 Grammatical Development . . . 365

50.2 Input Studies. . . 369

50.3 Computational Modeling and Connectionism . . . 370

50.4 Lexical Learning . . . 371

50.5 Narrative Structure . . . 374

50.6 Language and Literacy . . . 375

50.7 Language Impairments . . . 377

50.8 Phonological Development . . . 379

50.9 Articles about CHILDES . . . 380

51: Future Directions . . . 383

51.1 Database Development. . . 383

51.2 User-Friendliness . . . 384

51.3 Exploratory Reality . . . 384

51.5 Discourse Analysis . . . 385

51.6 Lexical Analysis . . . 386

51.7 Morphosyntactic Analysis . . . 387

51.8 A Final Word . . . 387

52: References. . . 389

53: Index . . . 393

Preface

Language acquisition research thrives on data collected from spontaneous interactions in naturally occurring situations. It is easy to turn on a tape recorder or videotape, and, before you know it, you will have accumulated a library of dozens or even hundreds of hours of naturalistic interactions. But simply collecting data is only the beginning of a much larger task, because the process of transcribing and analyzing naturalistic samples is extremely time-consuming and often unreliable. In this book, we will examine a set of computational tools designed to facilitate the sharing of transcript data, increase the reliability of transcriptions, and automate the process of data analysis. These new computational tools have brought about revolutionary changes in the way that research is conducted in the child language field. In addition, they have equally revolutionary potential for the study of second language learning, adult conversational interactions, sociological content analyses, and language recovery in aphasia. Although the tools are of wide applicability, this book concentrates on their use in the child language field, in the hope that researchers from other areas can make the necessary analogies to their own topics.

Before turning to a detailed examination of the current system, it may be helpful to take a brief historical tour over some of the major highlights of earlier approaches to the collection of data on language acquisition. These earlier approaches can be grouped into five major historical periods.

Period 1: Impressionistic observation

The first attempt to understand the process of language development appears in a remarkable passage from the Confessions of St. Augustine. In this passage, Augustine actually claims that he remembered how he had learned language:

This I remember; and have since observed how I learned to speak. It was not that my elders taught me words (as, soon after, other learning) in any set method; but I, longing by cries and broken accents and various motions of my limbs to express my thoughts, that so I might have my will, and yet unable to express all I willed or to whom I willed, did myself, by the understanding which Thou, my God, gavest me, practise the sounds in my memory. When they named anything, and as they spoke turned towards it, I saw and remembered that they called what they would point out by the name they uttered. And that they meant this thing, and no other, was plain from the motion of their body, the natural language, as it were, of all nations, expressed by the countenance, glances of the eye, gestures of the limbs, and tones of the voice, indicating the affections of the mind as it pursues, possesses, rejects, or shuns. And thus by constantly hearing words, as they occurred in various sentences, I collected gradually for what they stood; and, having broken in my mouth to these signs, I thereby gave utterance to my will. Thus I exchanged with those about me these current signs of our wills, and so launched deeper into the stormy intercourse of human life, yet depending on parental authority and the beck of elders.


This remarkable passage set the mark for child language studies through the Middle Ages and even the Enlightenment. However, Augustine's recollection technique is no longer of much interest to us, since few of us believe in the accuracy of recollections from infancy, even if they come from Saints.

Period 2: Diaries and biographies

The second major technique for the study of language production was pioneered by Charles Darwin. Using notecards and field books to track the distribution of hundreds of species and subspecies in places like the Galapagos and Indonesia, Darwin was able to collect an impressive body of naturalistic data in support of his views on natural selection and evolution. In his study of gestural development in his son, Darwin (1877) showed how these same tools for naturalistic observation could be adapted to the study of human development. By taking detailed daily notes, Darwin showed how researchers could build diaries that could then be converted into biographies documenting virtually any aspect of human development. Following Darwin's lead, scholars such as Ament, Preyer, Gvozdev, Szuman, Stern, Ponyori, Kenyeres, and Leopold created monumental biographies detailing the language development of their own children.

Darwin’s biographical technique also had its effects on the study of adult aphasia. Following this tradition, studies of the language of particular patients have been presented by Low (1931), Pick (1913; 1971), Wernicke (1874), and many others.

Period 3: Transcripts

The limits of the diary technique were always quite apparent. Even the most highly trained observer could not keep pace with the rapid flow of normal speech production. Anyone who has attempted to follow a child about with a pen and a notebook soon realizes how much detail is missed and how the notetaking process interferes with the ongoing interactions.

The introduction of the tape recorder in the late 1950s provided a way around these limitations and ushered in the third period of observational studies. The effect of the tape recorder on the field of language acquisition was very much like its effect on ethnomusicology, where researchers like Alan Lomax were suddenly able to produce high-quality field recordings using this new technology. This period was characterized by projects in which groups of investigators collected large datasets of tape recordings from several subjects across a period of two or three years. Much of the excitement in the 1960s regarding new directions in child language research was fueled directly by the great increase in raw data that was possible through the use of tape recordings and typed transcripts.

This increase in the amount of raw data had an additional, seldom discussed consequence. In the period of the baby biography, the final published accounts closely resembled the original database of note cards. In this sense, there was no major gap between the observational database and the published database. In the period of typed transcripts, a wider gap emerged. The size of the transcripts produced in the 1960s and 1970s made it impossible to publish the full unanalyzed corpora. Instead, researchers were forced to publish only high-level analyses based on data that was not available to others. This led to a situation in which the raw empirical database for the field was kept only in private stocks, unavailable for general public examination. Comments and tallies were written into the margins of ditto master copies, and new, even less legible copies were then made by thermal production of new ditto masters. Each investigator devised a project-specific system of transcription and project-specific codes. As we began to compare handwritten and typewritten transcripts, problems in transcription methodology, coding schemes, and cross-investigator reliability became more apparent.

Recognizing this problem, Roger Brown took the lead in attempting to share his transcripts from Adam, Eve, and Sarah (Brown, 1973) with other researchers. These transcripts were typed onto stencils and mimeographed in multiple copies. The extra copies were lent to and analyzed by a wide variety of researchers. In this model, researchers took their copy of the transcript home, developed their own coding scheme, applied it (usually by making pencil markings directly on the transcript), wrote a paper about the results and, if very polite, sent a copy to Roger. Some of these reports (Moerk, 1983) even attempted to disprove the conclusions drawn from those data by Brown himself! The original database remained untouched. The nature of each individual's coding scheme and the relationship among any set of different coding schemes could never be fully plumbed.

Period 4: Computers

Just as these data analysis problems were coming to light, a major technological opportunity was emerging in the shape of the powerful, affordable microcomputer. Microcomputer word-processing systems and database programs allowed researchers to enter transcript data into computer files which could then be easily duplicated, edited, and analyzed by standard data-processing techniques. In 1981, when the CHILDES Project was first conceived, researchers basically thought of computer systems as large notepads. Although researchers were aware of the ways in which databases could be searched and tabulated, the full analytic and comparative power of the computer systems themselves was not yet fully understood.

Rather than serving only as an "archive" or historical record, a focus on a shared database can lead to advances in methodology and theory. However, to achieve these additional advances, researchers first needed to move beyond the idea of a simple data repository. At first, the possibility of utilizing shared transcription formats, shared codes, and shared analysis programs shone only as a faint glimmer on the horizon, against the fog and gloom of handwritten tallies, fuzzy dittoes, and idiosyncratic coding schemes. Slowly, against this backdrop, the idea of a computerized data exchange system began to emerge. It was against this conceptual background that the Child Language Data Exchange System (CHILDES) was conceived. The origin of the system can be traced back to the summer of 1981, when Dan Slobin, Willem Levelt, Susan Ervin-Tripp, and Brian MacWhinney discussed the possibility of creating an archive for typed, handwritten, and computerized transcripts to be located at the Max-Planck Institut für Psycholinguistik in Nijmegen. In 1983, the MacArthur Foundation funded meetings of developmental researchers in which Elizabeth Bates, Brian MacWhinney, Catherine Snow, and other child language researchers discussed the possibility of soliciting MacArthur funds to support a data exchange system. In January of 1984, the MacArthur Foundation awarded a two-year grant to Carnegie Mellon University for the establishment of the Child Language Data Exchange System, with Brian MacWhinney and Catherine Snow as Principal Investigators. These funds provided for the entry of data into the system and for the convening of a meeting of an Advisory Board for the System. Twenty child language researchers met for three days in Concord, Massachusetts, and agreed on a basic framework for the CHILDES system, which Catherine Snow and Brian MacWhinney would then proceed to implement.

Period 5: Connectivity and Exploratory Reality

Since 1984, when the CHILDES Project began in earnest, the world of computers has gone through a series of remarkable revolutions, each introducing new opportunities and challenges. The processing power of the home computer now dwarfs the power of the mainframe of the '80s, new machines are now shipped with built-in audiovisual capabilities, and devices such as CD-ROMs, DAT tapes, and optical disks offer enormous storage capacity at reasonable prices. This new hardware has now opened up the possibility for multimedia access to transcripts of aphasic language production. In effect, a transcript is now the starting point for a new Exploratory Reality in which the whole interaction is accessible through the transcript in terms of both full audio and video images. The current shape of the CHILDES system reflects many of these new realities. In the pages that follow, you will learn about how we are using this new technology to provide rapid access to the database and to permit the linkage of transcripts to digitized audio and video records, even over the Internet.

Three Goals and Three Tools

The reasons for developing a computerized exchange system for language data are immediately obvious to anyone who has produced or analyzed transcripts. With such a system, we can address these three basic goals:

1. to provide more data for more children from more ages, speaking more languages;

2. to obtain better data in a consistent and fully-documented transcription system; and

3. to automate the process of data analysis.

The CHILDES system has addressed each of these goals by developing three separate, but integrated, tools. The first tool is the CHAT transcription and coding format. The second tool is the CLAN package of analysis programs, and the third tool is the CHILDES database itself. These three tools are like the legs of a three-legged stool. The transcripts in the database have all been put into the CHAT transcription system. The CLAN programs are designed to make full use of the CHAT format to facilitate a wide variety of searches and analyses. Many research groups are now using both CHAT and CLAN to enter new data sets. Eventually, these new data sets will be available to other researchers as a part of the growing CHILDES database. In this way, CHAT, CLAN, and the database function as a coarticulated set of complementary tools.

Some Words of Appreciation

The construction of the database has depended upon the generosity of the dozens of scholars listed in chapter 27. The CLAN programs are the brainchild of Leonid Spektor. Spektor began his work by relying on extensions to the public domain HUM concordance package generously provided to us by Bill Tuthill at the University of California at Berkeley. The HUM package served as a solid base during the initial period of development, although no HUM code is contained in the current version. In addition, the SALT transcription system of Miller and Chapman (Miller & Chapman, 1983) provided solid initial guidelines regarding basic practices in transcription and analysis. We also derived ideas for the MODREP and PHONFREQ programs from aspects of the PAL analysis system developed by Clifton Pye.

Darius Clynes ported CLAN to the Macintosh and added a variety of features specific to the Macintosh, including a facility for building CLAN commands through dialog boxes. Jeffrey Sokolov wrote the CHIP program and the CLAN tutorial given in chapter 24 was constructed by Pam Rollins, Barbara Pan, and Catherine Snow. Mitzi Morris designed the MOR analyzer and the morphological rules upon which it depends.

Jane Desimone, Mary MacWhinney, Jane Morrison, Kim Roth, and Gergely Sikuta worked many long hours bringing the CHILDES database into conformity with the CHAT coding system. Helmut Feldweg provided an enormous service by supervising a parallel effort with the German and Dutch data sets. Mike Blackwell, Julia Evans, Kris Loh, Mary MacWhinney, Lucy Hewson, and Gergely Sikuta helped with the seemingly never-ending task of checking and formatting this book. Barbara Pan, Jeff Sokolov, and Pam Rollins also provided a reading of the final draft. Steven Gillis, Kim Plunkett, and Sven Strömqvist have helped propagate the CHILDES system at universities in Northern and Central Europe. Gillis has also built a MOR system for Dutch and established a CHILDES file server at the University of Antwerp. Yuriko Oshima-Takane has established a vital group of child language researchers using CHILDES to study the acquisition of Japanese. Julia Evans has been instrumental in providing recommendations for improving the design of both the audio and visual capabilities of the editor.

Catherine Snow played a pivotal role throughout the formation of the CHILDES system in shaping policy and direction, helping in the building of the database, organizing workshops, and determining the shape of CHAT and CLAN. We also received a great deal of extremely helpful input from a variety of other sources regarding the CHAT codes described in Part I. Some of the most detailed comments have come from George Allen, Elizabeth Bates, Nan Bernstein-Ratner, Giuseppe Cappelli, Paola Cipriani, Annick De Houwer, Jane Desimone, Jane Edwards, Julia Evans, Judi Fenson, Paul Fletcher, Steven Gillis, Kristen Keefe, Mary MacWhinney, Jon Miller, Barbara Pan, Lucia Pfanner, Kim Plunkett, Catherine Snow, Jeff Sokolov, Leonid Spektor, Joseph Stemberger, Frank Wijnen, and Antonio Zampolli. Comments developed in Edwards (1992) were useful in shaping parts of chapters 1, 4, 5, and 6. George Allen was the principal author of the PHONASCII system (Allen, 1988).

The CHILDES system has an ongoing commitment to the further refinement of the codes, programs, and database. Detailed comments and critiques are solicited from all interested parties. This work is currently supported by an ongoing grant from the National Institutes of Health (NICHHD).

How to Use this Book

This book is intended primarily as a manual for users of the three CHILDES tools. Chapters 1 through 17 provide a manual for CHAT; chapters 18 through 46 provide a manual for CLAN; and chapters 46 through 51 discuss additional CHILDES facilities. Different users will wish to approach these tools in different orders. Users interested primarily in producing new child language transcript data will want to focus on learning to use the CHAT and CLAN tools. Working alongside a computer, they will first learn to enter transcripts in CHAT using the editor and then how to run CLAN programs on the data they have entered. These users will need to have copies of the CLAN programs.

A second group of users will be most interested in analyzing data already available in the CHILDES database. These users will want to focus first on the guide to the database in chapters 25-32. Using this guide and following the instructions in chapter 25, they can get copies of the data either on CD-ROM or using anonymous FTP. They will then want to scan the description of CHAT in chapter 2 in order to understand the basic coding conventions used in the database. Then, they will want to look over the description of the CLAN programs with an eye toward determining how the programs can help them analyze and quantify patterns in the database.

A third group of users may be most interested in the CHILDES tools as ways of teaching language analysis to students. These users will want to construct additional materials that will guide students through the learning of CHAT and CLAN and will encourage them to explore the current database to test out particular hypotheses.

The CHILDES system was not intended to address all issues in the study of language learning or even to be used by all students of spontaneous interactions. The CHAT system is comprehensive, but it is not ideal for all purposes. The CLAN programs are powerful, but they cannot solve all analytic problems. It is not the goal of CHILDES to be everything to everybody or to force all research into some uniform mold. Forced uniformity, even on the level of transcription standards, would be a great disservice to scientific progress. It is important for researchers to pursue a variety of approaches to the study of language learning. Indeed, we estimate that the three CHILDES tools will never be used by at least half of the researchers in the field of child language. There are three common reasons why individual researchers may not find CHILDES useful:

1. some researchers may have already committed themselves to use of another transcription system;


2. some researchers may have collected so much data that they can work for many years without needing to collect more data and without comparing their own data to other researchers’ data; and

3. some researchers may not be interested in studying spontaneous speech.

Of these three reasons for not needing to use the three CHILDES tools, the third is the most frequent. For example, researchers studying comprehension would only be interested in CHILDES data when they wish to compare findings arising from studies of comprehension with patterns occurring in spontaneous production.

Changes and the Future

The CHILDES tools have been extensively tested for ease of application, accuracy, and reliability. However, change is fundamental to the research enterprise. Researchers are constantly pursuing better ways of coding and analyzing data. It is important that the CHILDES tools keep pace with these changing requirements. For this reason, there will be revisions to CHAT, CLAN, and the database as long as the CHILDES project is active. Some of the important directions for the future are discussed in chapter 33.

1: Principles of Transcription

The CHAT system is a standardized format for computerized transcripts of face-to-face conversational interactions. Face-to-face interactions may be between children and their parents, between a doctor and a patient, or between a teacher and second language learners. Despite the differences between interactions of these different types, there are enough features common to the various forms of face-to-face interaction to make the idea of a general transcription system reasonable. The system being proposed here is designed for use with both normal and disordered populations. It can be used with learners of all types, including children, second language learners, and adults recovering from aphasic disorders. The system provides options for basic discourse transcription, as well as detailed phonological and morphological analysis. The system bears the acronym "CHAT," which stands for Codes for the Human Analysis of Transcripts. CHAT is the standard transcription system for the CHILDES (Child Language Data Exchange System) Project. With the exception of a few corpora of historical interest, all of the transcripts in the CHILDES database are in CHAT format. In addition, approximately 60 groups of researchers around the world are currently actively involved in new data collection and transcription using the CHAT system. Eventually the data collected in these projects will be contributed to the database. The CHAT system is specifically designed to facilitate the subsequent automatic analysis of transcripts by the CLAN programs that will be discussed in chapters 18 to 24.

1.1 The Promise of Computerized Transcription

Public inspection of experimental data is a crucial prerequisite for serious scientific progress. Imagine how genetics would function if every experimenter had their own individual strain of peas or drosophila and refused to allow them to be tested by other experimenters. What would happen in geology, if every scientist kept their own set of rock specimens and refused to compare them with those of other researchers? In some fields the basic phenomena in question are so clearly open to public inspection that this is not a problem. The basic facts of planetary motion are open for all to see, as are the basic facts underlying Newtonian mechanics.

Unfortunately, in language studies, a free and open sharing and exchange of data has not always been the norm. In earlier decades, researchers often jealously guarded their field notes from a particular language community or subject type, refusing to share them openly with the broader community. Various justifications were given for this practice. It was sometimes claimed that other researchers would not fully appreciate the nature of the data or that they might misrepresent crucial patterns. Sometimes, it was claimed that only someone who had actually participated in the community or the interaction could understand the nature of the language and the interactions. In some cases, these limitations were real and important. However, such restrictions on the sharing of data inevitably impede the progress of the scientific study of language learning.

Within the field of language acquisition studies it is now understood that the advantages of sharing data outweigh the potential dangers. The question is no longer whether data should be shared, but rather how they can be shared in a reliable and responsible fashion. The computerization of transcripts opens up the possibility for many types of data sharing and analysis that otherwise would have been impossible. However, the full exploitation of this opportunity requires the development of a standardized system for data transcription and analysis.

1.2 Some Words of Caution

Before we examine the CHAT system, we need to consider certain dangers involved in computerized transcriptions. These dangers arise from the need to compress a complex set of spoken and nonspoken messages into the extremely narrow channel required for the computer. In most cases, these dangers also exist when one creates a typewritten or handwritten transcript. Let us look at some of the dangers surrounding this enterprise.

1.2.1 The Dominance of the Written Word

Perhaps the greatest danger facing the transcriber is the tendency to treat spoken language as if it were written language. The decision to write out stretches of vocal material using the forms of written language involves a major theoretical commitment. As Ochs (1979) showed so clearly, these decisions inevitably turn transcription into a theoretical enterprise. The most difficult bias to overcome is the tendency to map every form spoken by a learner – be it a child, an aphasic, or a second language learner – onto a set of standard lexical items in the adult language. Transcribers tend to assimilate nonstandard learner strings to standard forms of the adult language. For example, when a child says "put on my jamas," the transcriber may instead enter "put on my pajamas," reasoning unconsciously that "jamas" is simply a childish form of "pajamas." This type of regularization of the child form to the adult lexical norm can lead to misunderstanding of the shape of the child's lexicon. For example, it could be the case that the child uses "jamas" and "pajamas" to refer to two very different things (Clark, 1987; MacWhinney, 1989).

There are two types of errors possible here. One involves mapping a learner's spoken form onto an adult form when, in fact, there was no real correspondence. This is the problem of overregularization. The second type of error involves failing to map a learner's spoken form onto an adult form when, in fact, there is a correspondence. This is the problem of underregularization. The goal of transcribers should be to avoid both the Scylla of overregularization and the Charybdis of underregularization. Steering a course between these two dangers is no easy matter. A transcription system can provide devices to aid in this process, but it cannot guarantee safe passage.
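One CHAT device that helps a transcriber steer this course is worth previewing here: the main line can retain the child's actual form while a bracketed replacement records the presumed adult target. The fragment below is an invented sketch (the speaker code and name are hypothetical), and it assumes the [: text] replacement notation described in the chapter on scoped symbols.

@Begin
@Participants:	CHI Kim Child
*CHI:	put on my jamas [: pajamas].
%com:	the child's form is preserved; the bracketed form is only the transcriber's gloss of the adult target.
@End

Written this way, a later analysis can count either the child form or the glossed target, and the regularization decision remains visible and reversible.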

Transcribers also often tend to assimilate the shape of sounds spoken by the learner to the shapes that are dictated by morphosyntactic patterns. For example, Fletcher (1985) noted that both children and adults generally produce "have" as "uv" before main verbs. As a result, forms like "might have gone" assimilate to "mightuv gone." Fletcher believed that younger children have not yet learned to associate the full auxiliary "have" with the contracted form. If we write the children's forms as "might have," we then end up mischaracterizing the structure of their lexicon. To take another example, we can note that, in French, the various endings of the verb in the present tense are distinguished in spelling, whereas they are homophonous in speech. If a child says /mAnZ/ "eat," are we to transcribe it as the first person singular mange, the second person singular manges, or the imperative mange? If the child says /mAnZe/, should we transcribe it as the infinitive manger, the participle mangé, or the second person formal mangez?

CHAT deals with these problems by providing a uniform way of transcribing discourse phonemically, called UNIBET. Using UNIBET, we can code "mightuv" as /maItUv/ and mangez/manger/mangé as /mAnZe/. It is a pity that phonological transcriptions are not more widely used, because they offer a level of accuracy that is difficult to obtain in other ways. However, for those who wish to avoid the work involved in phonemic transcription, CHAT also allows for the specification of nonstandard lexical forms, so that the form "mightav" would be universally recognized as the spelling of the contracted form of "might have." For the French example, CHAT allows for a general neutral suffix written as -e. Using this, we would write mang-e, rather than mang-ez, mang-é, or mang-er.
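As a concrete sketch of the two options just described, the invented fragment below puts the nonstandard lexical form on the main line and adds a %pho dependent tier in UNIBET. The speaker is hypothetical, and only the string maItUv is taken from the text above; the rest of the %pho line is a rough guess rather than an authoritative UNIBET rendering (see chapter 10 for the actual symbol set).

@Begin
@Participants:	CHI Kim Child
*CHI:	he mightav gone.
%pho:	hi maItUv gOn
@End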

As a supplement to the use of UNIBET codes, the editor supports transcription in standard IPA characters. The editor also allows the user to link a full digitized audio record of the interaction directly to the transcript. This is the system called "sonic CHAT". With these sonic CHAT links, it is possible to double click on a sentence and hear its sound immediately. Having the actual sound produced by the child directly available in the transcript takes some of the burden off of the transcription system. However, whenever computerized analyses are based not on the original audio signal, but on transcribed orthographic forms, one must continue to understand the limits of transcription conventions.

1.2.2 The Misuse of Standard Punctuation

Transcribers have a tendency to write out spoken language with the punctuation conventions of written language. Written language is organized into clauses and sentences delimited by commas, periods, and other marks of punctuation. Spoken language, on the other hand, is organized into tone units clustered about a tonal nucleus and delineated by pauses and tonal contours (Crystal, 1969; Crystal, 1975; Halliday, 1966; Halliday, 1967; Halliday, 1968). Work on the discourse basis of sentence production (Chafe, 1980; Jefferson, 1984; MacWhinney, 1985) has demonstrated a close link between tone units and ideational units. Retracings, pauses, stress, and all forms of intonational contours are crucial markers of aspects of the utterance planning process. Moreover, these features also convey important sociolinguistic information. Without special markings or conventions, there is no way to directly indicate these important aspects of interactions.

One way of dealing with punctuation bias is to supplement UNIBET phonological coding with prosodic markings that indicate tonal stress and rises and falls in intonation. For those who do not wish to construct a complete phonological transcription, CHAT makes available a set of prosodic markers that can be combined either with standard words or with a phonological transcription to code the details of tone units and contours. In addition, CHAT provides a set of conventions for marking retracings, pauses, and errors.
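As a brief illustration, the invented main line below marks a retracing and a short pause; it assumes the pause symbol (#) from the chapter on pauses and the retracing-with-correction symbol ([//]) from the chapter on scoped symbols, and the utterance itself is hypothetical.

*CHI:	<I want> [//] # I need the red one.
%com:	the angle brackets and [//] mark material that was retraced and corrected; # marks a short pause.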


Again, having the actual audio record available through sonic CHAT helps keep a link between the transcript and the original data.

1.2.3 The Advantages of Working with a Videotape

Whatever form a transcript may take, it will never contain a fully accurate record of what went on in an interaction. A transcript of an interaction can never fully replace an audiotape, because an audiotape of the interaction will always be more accurate in terms of preserving the actual details of what transpired. By the same token, an audio recording can never preserve as much detail as a video recording with a high-quality audio track. Audio recordings record none of the nonverbal interactions that often form the backbone of a conversational interaction. Hence, they systematically exclude a source of information that is crucial for a full interpretation of the interaction. Although there are biases involved even in a videotape, it is still the most accurate record of an interaction that we have available. For those who are trying to use transcription to capture the full detailed character of an interaction, it is imperative that transcription be done from a videotape and the videotape be repeatedly consulted during all phases of analysis.

The editor is currently being extended to facilitate its use with videotapes. Our plan is to make available a floating window in the shape of a VCR controller that can be used to rewind the videotape and to enter time stamps from the videotape into the CHAT file. The alternative way of analyzing video is to record from tape onto QuickTime movies and to link these digitized movies to your transcript.

1.3 Transcription and Coding

It is important to recognize the difference between transcription and coding. Transcription focuses on the production of a written record that can lead us to understand, albeit only vaguely, the flow of the original interaction. Transcription must be done directly off an audiotape or, preferably, a videotape. Coding, on the other hand, is the process of recognizing, analyzing, and taking note of phenomena in transcribed speech. Coding can often be done by referring only to a written transcript. For example, the coding of parts of speech can be done directly from a transcript without listening to the audiotape. For other types of coding, such as speech act coding, it is imperative that coding be done while watching the original videotape.
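The difference can be pictured with a small invented fragment: the main line is the transcription, and the %spa line is one piece of coding added afterward. The speech act code shown is a placeholder, not an official code from the speech act chapter.

*MOT:	can you give me the ball?
%spa:	$RQ
%com:	$RQ stands in here for whatever speech act category the analyst's coding scheme assigns to this request.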

The CHAT system includes conventions for both transcription and coding. When first learning the system, it is best to focus on learning how to transcribe. The CHAT system offers the transcriber a large array of coding options. Although few transcribers will need to use all of the options, everyone needs to understand how basic transcription is done on the "main line." Additional coding is done principally on the secondary or "dependent" tiers. As transcribers work more with their data, they will include further options from these dependent tiers. However, the beginning user should focus first on learning to correctly use the conventions for the main line. The manual includes several sample transcripts to help the beginner learn the transcription system.

2: The CHAT Transcription System

2.1 The Goals of a Transcription System

Like other forms of communication, transcription systems are subjected to a variety of communicative pressures. The view of language structure developed by Slobin (1977) sees structure as emerging from the pressure of three conflicting charges or goals. On the one hand, language is designed to be clear. On the other hand, it is designed to be processible by the listener and quick and easy for the speaker. Unfortunately, ease of production often comes in conflict with clarity of marking. The competition between these three motives leads to a variety of imperfect solutions that satisfy each goal only partially. Such imperfect and unstable solutions characterize the grammar and phonology of human language (Bates & MacWhinney, 1982). Only rarely does a solution succeed in fully achieving all three goals.

Slobin's view of the pressures shaping human language can be extended to analyze the pressures shaping a transcription system. In many regards, a transcription system is much like any human language. It needs to be clear in its markings of categories, while still preserving readability and ease of transcription. However, unlike a human language, a transcription system needs to address two types of audiences. One audience is the human audience of transcribers, analysts, and readers. The other audience is the digital computer and its programs. In order to successfully deal with these two audiences, a system for computerized transcription needs to achieve the following goals:

1. Clarity: Every symbol used in the coding system should have some clear and definable real-world referent. The relation between the referent and the symbol should be consistent and reliable. Symbols that mark particular words should always be spelled in a consistent manner. Symbols that mark particular conversational patterns should refer to actual patterns consistently observable in the data. In practice, codes will always have to steer between the Scylla of overregularization and the Charybdis of underregularization discussed earlier. Distinctions must avoid being either too fine or too coarse. Another way of looking at clarity is through the notion of systematicity. Systematicity is a simple extension of clarity across transcripts or corpora. Codes, words, and symbols must be used in a consistent manner across transcripts. Ideally, each code should always have a unique meaning independent of the presence of other codes or the particular transcript in which it is located. If interactions are necessary, as in hierarchical coding systems, these interactions need to be systematically described.

2. Readability: Just as human language needs to be easy to process, so transcripts need to be easy to read. This goal often runs directly counter to the first goal. In the CHILDES system, we have attempted to provide a variety of CHAT options that will allow a user to maximize the readability of a transcript. We have also provided CLAN tools that will allow a reader to suppress the less readable aspects of a transcript when the goal of readability is more important than the goal of clarity of marking.

3. Ease of data entry: As distinctions proliferate within a transcription system, data entry becomes increasingly difficult and error-prone. The CLAN programs provide three tools for dealing with this problem. One is a program called CHECK that verifies the syntactic accuracy of a transcript. The second is an editor mode that provides computer assistance for applying a coding scheme to a transcript. The third is a program called MOR, which provides automatic morphological analysis of the words in a transcript.

2.2 Learning to Use CHAT

CHAT is designed to provide options for users on two levels – basic and advanced. The basic level of CHAT is called minCHAT. Everyone should start out by learning minCHAT.

2.2.1 minCHAT and minDOS

At the basic level, CHAT requires a minimum of coding decisions. This minimalist version of CHAT, called minCHAT, is discussed in section 2.3. MinCHAT looks much like other intuitive transcription systems that are in general use in the fields of child language and discourse analysis. It makes sense for the new user to focus on the use of minCHAT and to ignore the rest of this manual at first. However, eventually, many users will find that there is something that they want to be able to code that goes beyond minCHAT. At that point, the next chapters to read are chapters 4, 7, and 8, which explain the remaining details of the basic coding of words on the main line.

The beginning user also has to become familiar with the basic use of a microcomputer. For users working with a Macintosh computer, this means learning how to navigate around on the desktop and how to use the mouse to open up menus and select items. For users working with machines running MS-DOS, this means learning how to use these DOS commands: type, dir, cd, path, mkdir, rmdir, delete, copy, and rename. We refer to this restricted set of DOS commands as minDOS. Acquainting yourself with minDOS requires patience and a careful reading of your MS-DOS manuals. While learning minDOS, you also need to learn the editor.

2.2.2 Analyzing One Small File

For researchers who are just now beginning to use CHAT and CLAN, there is perhaps one single suggestion that can save literally hundreds of hours of possibly wasted time. The suggestion is to transcribe and analyze one single small file completely and perfectly before launching a major program of transcription and analysis. The idea is that you should learn just enough about minCHAT, minDOS, and minCLAN to see your path through these four crucial steps:

1. entry of a small sample of your data into a correctly formatted CHAT file,

2. successful running of the CHECK program inside the editor to guarantee accuracy in your CHAT file,

3. development of a series of codes that will interface with the particular CLAN programs most appropriate for your analysis, and

4. running of the relevant CLAN programs, so that you can be sure that the results you will get will properly test the hypotheses you wish to develop.

If you go through these steps first, you can guarantee in advance the successful outcome of your project. You can avoid ending up in a situation in which you have transcribed hundreds of hours of data in a way that simply does not match correctly with the input requirements for the CLAN programs.

2.2.3 midCHAT

After having learned minCHAT, the learner is ready to move on to midCHAT. Before doing that, it is probably a good idea to learn the basics of CLAN. To do this, first consult chapter 18, which introduces the CLAN system. To begin, you will want to learn CLAN only up to the level of minCLAN, which corresponds to the minCHAT level. However, once you have learned midCHAT, it also makes sense to learn the rest of the CLAN system. Learning midCHAT involves mastering additional material in chapters 4, 5, 7, and 8. This material includes the following aspects of transcription, several of which are illustrated in the sketch after the list:

1. the use of canonical spellings (chapter 4),

2. using explanations on the main line (chapter 8),

3. marking omitted words and morphemes (chapters 4 and 5),

4. indicating suffixes and prefixes on the main line (chapter 5),

5. marking utterance incompletion (chapter 7) and overlap (chapter 8), and

6. marking retracings and errors (chapter 8).
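The invented main line below sketches several of these conventions at once: an omitted word marked with 0, a plural suffix separated by a dash, and a retracing marked with [//]. The utterance and speaker are hypothetical, and the exact symbols are defined in the chapters cited above.

*CHI:	<that doggie> [//] that 0is two doggie-s.
%com:	0is marks an omitted copula, -s separates the plural suffix, and [//] marks a retracing with correction.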

Having mastered the CHAT manual through chapter 8, the learner has picked up all of the basic CHAT conventions. In many cases, the transcriber may not need to learn any more about CHAT. However, there are five topic areas for which researchers will have to look at other chapters.

1. Phonological Transcription. If your research deals with speech that diverges strongly from the standard in phonological terms, you will want to make use of some form of phonological transcription. This is often important when you are dealing with very young children, language-impaired subjects, dialect speakers, and second language learners. If you need to do phonological transcription, first read chapter 10, which describes a system for phonemic transcription, including codes for stress and tone contours. If this level of detail is insufficient, the full extended IPA phonetic system given in chapter 11 may be needed.

2. Speech Acts. If your research focuses on speech acts, you will wish to work out a system of the sort outlined in chapter 13. That chapter provides only a sketch of a fuller set of codes that will be provided in future versions of CHAT.

3. Error Analysis. If your research deals with the analysis of phonological, morphological, syntactic, or semantic errors, you will want to read chapter 12, which presents a system for the detailed coding of errors. That chapter has a number of examples given at the end and there are further examples of error coding that can be found in chapter 15. Even if you do not use this full system, you may wish to mark errors using the asterisk symbol on the main line.

4. Timing Analyses. If you wish to construct detailed analyses of pause times and the times for spoken material, you will want to look at the use of the %tim coding line discussed in chapter 9.

5. Morphological Analysis. If you wish to analyze the child's learning of morphological markings, you will first want to look at the main line coding of morphemes discussed in chapter 5. However, for those who want to go beyond the simple tabulation of types of markings, it is better to use the complete system for morphological and syntactic coding presented in chapter 14.

Finally, there will be researchers who find that none of the CHAT conventions properly express the categories that they wish to code. In such cases, researchers can create their own CHAT codes, following the basic principles discussed in chapter 9.

2.2.4 Problems with Forced Decisions

Transcription and coding systems often force the user to make difficult distinctions. For example, a system might make a distinction between grammatical ellipsis and ungrammatical omission. However, it may often be the case that the user cannot decide in a given case whether an omission is grammatical or not. In that case, it may be helpful to have some way of blurring the distinction in the particular case. CHAT has certain symbols that can be used when a categorization cannot be made. It is important to remember that many of the CHAT symbols are entirely optional. Whenever you feel that you are being forced to make a distinction, check with the manual to see whether the particular coding choice is actually required. If it is not required, then simply omit the code altogether.

2.3 minCHAT

This section describes the minimum set of standards for a CHAT file. Files that follow these standards can use most aspects of the CLAN programs effectively. The basic requirements for minCHAT involve the form of the file, the form of utterances, the writing of documentation, and the use of ASCII symbols.

2.3.1 The Form of Files

There are several minimum standards for the form of a minCHAT file. These standards must be followed for the CLAN programs to run successfully on CHAT files:

1. When doing normal coding in English, every character in the file must be in the basic ASCII character set (see the following section).

2. Every line must end with a carriage return.

3. The first line in the file must be an @Begin header line.

4. The last line in the file must be an @End header line.

5. There must be an @Participants header line listing three-letter codes for each participant, the participant’s name, and the participant’s role.

6. Lines beginning with * indicate what was actually said. These are called “main lines.” Each main line should code one and only one utterance. When a speaker produces several utterances in a row, code each with a new main line.

7. After the asterisk on the main line comes a three-letter code in upper case letters for the participant who was the speaker of the utterance being coded. After the three-letter code comes a colon and then a tab.

8. What was actually said is entered starting in the ninth column.

9. Lines beginning with the % symbol can contain anything. Typically, these lines include codes and commentary on what was said. They are called “dependent tier” lines.

10. Dependent tier lines begin with the % symbol. Then comes a three-letter code in lower case letters for the dependent tier type, such as “pho” for phonology, a colon, and then a tab. The text of the dependent tier begins in the ninth column.

11. Continuations of main lines and dependent tier lines begin with a tab.
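
As a minimal illustrative sketch (the speaker code, the utterance, and the comment are all invented, and the tabs required after the speaker code and at the start of the continuation are shown here simply as spacing), a main line followed by a dependent tier, with the main line continued onto a second line, would be laid out like this:

*CHI:	that is a really really big doggie and I want to
	take it home .
%com:	CHI is pointing at a stuffed animal .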

2.3.2 The Form of Utterances

In addition to these minimum requirements for the form of the file, there are certain minimum ways in which utterances and words should be written on the main line:

1. Utterances should end with an utterance terminator. The basic utterance terminators are the period, the exclamation mark, and the question mark.

2. Commas should be used sparingly.

3. Use upper case letters only for proper nouns and the word “I.” Do not use upper case letters for the first words of sentences. This will facilitate the identification of proper nouns. However, for languages like German that use capitalization to mark part of speech, this restriction can be modified so that only nouns are capitalized.

4. Unintelligible words with an unclear phonetic shape should be transcribed as xxx.

5. If you wish to note the phonological form of an incomplete or unintelligible phonological string, write it out with an ampersand, as in &guga.

6. Incomplete words can be written with the omitted material in parentheses, as in (be)cause and (a)bout.

Here is a sample that illustrates these principles. This file is syntactically correct and uses the minimum number of CHAT conventions while still maintaining compatibility with the CLAN analysis programs.

@Begin
@Participants:	ROS Ross Child, BRI Brian Father
*ROS:	why isn't Mommy coming?
%com:	Mother usually picks Ross up around 4 PM.
*BRI:	don't worry.
*BRI:	she'll be here soon.
*ROS:	good.
@End

For further examples of minCHAT coding, see chapter 15.

2.3.3 The Documentation File

CHAT files typically record a conversational sample collected from a particular set of speakers on a particular day. Sometimes researchers study a small set of children repeatedly over a long period of time. This is a longitudinal study. For such studies, it is best to break up CHAT files into one collection for each child. Such a collection of files constitutes a corpus. A corpus can also be composed of a group of files from different groups of speakers when the focus is on a cross-sectional sampling of larger numbers of language learners from various age groups. In either case, each corpus should be accompanied by a documentation file. By convention, the name for this file should be “00readme.cdc”. The name of this file begins with two zeroes in order to assure that it appears first in directory listings. This “readme” file should contain a basic set of facts that are indispensable for the proper interpretation of the data by other researchers. The minimum set of facts that should be in each “00readme.cdc” file are:

1. Acknowledgments. There should be a statement that asks the user to cite some particular reference when using the corpus. For example, researchers using the Adam, Eve, and Sarah corpora from Roger Brown and his colleagues are asked to cite Brown (1973). In addition, all users can cite this current manual as the source for the CHILDES system in general.

2. Restrictions. If the data is being contributed to the CHILDES system, contributors can set particular restrictions on the use of their data. For example, researchers may ask that they be sent copies of articles that make use of their data. Many researchers have chosen to set no limitations at all on the use of their data.

3. Warnings. This documentation file should also warn other researchers about limitations on the use of the data. For example, if an investigator paid no attention to correct transcription of speech errors, this should be noted.

4. Pseudonyms. The 00readme.cdc file should also include information on whether informants gave informed consent for the use of their data and whether pseudonyms have been used to preserve informant anonymity. In general, real names should be replaced by pseudonyms. This replacement may not be desirable when the subject of the transcriptions is the researcher’s own child.

5. History. There should be detailed information on the history of the project. How was funding obtained? What were the goals of the project? How was data collected? What was the sampling procedure? How was transcription done? What was ignored in transcription? Were transcribers trained? Was reliability checked? Was coding done? What codes were used? Was the material computerized? How?

6. Codes. If project-specific codes are being used, these should be described.

7. Biographical data. Where possible, extensive demographic, dialectological, and psychometric data should be provided for each informant. There should be information on topics such as age, gender, siblings, schooling, social class, occupation, previous residences, religion, interests, friends, and so forth. Information on where the parents grew up and the various residences of the family is particularly important in attempting to understand sociolinguistic issues regarding language change, regionalism, and dialect. Without detailed information about specific dialect features, it is difficult to know whether these particular markers are being used throughout the language or just in certain regions.

8. Table of contents. There should be a brief index to the contents of the corpora. This could be in the form of a list of files with their dates and the age of the target children involved. If MLU data are available for the children, these should be included. Such data are often extremely helpful to other researchers in making an initial judgment regarding the utility of a data set for their particular research objectives.

9. Situational descriptions. General situational descriptions such as the shape of the child’s home and bedroom can be included in the readme file. More specific situational information should be included in each separate file, as discussed in chapter 3.
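
As a purely illustrative sketch (the citation, the restrictions, and all of the details below are invented and would be replaced by the facts of the actual corpus), the skeleton of a 00readme.cdc file covering these nine points might look like this:

Acknowledgments:  Please cite Smith (1995) when using these data.
Restrictions:     Please send the contributor a copy of any article based on this corpus.
Warnings:         Speech errors were not transcribed systematically.
Pseudonyms:       All names are pseudonyms; informed consent was obtained from the parents.
History:          Weekly home recordings, 1992-1994; transcribers were trained and reliability was checked.
Codes:            No project-specific codes were used.
Biographical data: Target child, female, only child; parents grew up in Pittsburgh.
Table of contents: 01.cha  age 2;3.10   02.cha  age 2;4.12   ...
Situational descriptions: Most recordings were made in the child's bedroom.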

2.3.4 Checking Syntactic Accuracy

Each CLAN program runs a very superficial check to see if a file conforms to minCHAT. This check looks only to see that each line begins with either @, *, %, a tab, or a space. This is the minimum that the CLAN programs must have to function. However, the correct functioning of many of the functions of CLAN depends on adherence to further standards for minCHAT. In order to make sure that a file matches these minimum requirements for correct analysis through the CLAN programs, researchers should run each file through the CLAN program called CHECK. The program can be run directly inside the editor, so that you can verify the accuracy of your transcription as you are producing it. The CHECK program will detect errors such as failure to start lines with the correct symbols, use of incorrect speaker codes, or missing @Begin and @End symbols. CHECK can also be used to find errors in CHAT coding beyond those discussed in this chapter. Using the CHECK program is like brushing your teeth. It may be hard to remember to use the program, but the more you use it the easier it becomes and the better the final results.
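
As an illustrative sketch only (the file name sample.cha is hypothetical, and the exact way of invoking CHECK on your system is described in the CLAN documentation), checking a single transcript from the CLAN command line might look like this:

check sample.cha

CHECK then reports any lines that violate the minCHAT standards so that they can be corrected in the editor.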

2.3.5 ASCII and Special Characters

By default, CHAT files are in ASCII code. Basic ASCII is composed of these 96 printing symbols:

a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0 - = [ ] ' ` ; / \ . ,

! @ # $ % ^ & * ( ) _ + { } " ~ : ? | > <

This core set of characters is constant across computers, but the next 128 characters used on many computers are not standardized. These additional 128 characters are called “extended ASCII”. The exact assignment of special characters such as Spanish “ñ” to a particular extended ASCII value varies from font to font on different systems. By default, CHAT files use the Monaco font values on Macintosh and the Courier font values on Windows. Languages with characters outside the extended set can be represented using special fonts.
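
As a concrete illustration (the numeric values are cited here only as an example of the general problem and are not part of the CHAT specification), the character “ñ” is commonly assigned the value 241 in the ISO Latin-1 and Windows character sets but the value 150 in the classic Macintosh character set, so a file created under one convention may display the wrong symbol under the other unless the appropriate font is selected.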

3: File Headers

The three major components of a CHAT transcript are the file headers, the main tier, and the dependent tiers. In this chapter we discuss creating the first major component – the file headers. A computerized transcript in CHAT format begins with a series of “header” lines, which tell us about things such as the date of the recording, the names of the participants, the ages of the participants, the setting of the interaction, and so forth. Most of these header lines occur only at the very beginning of the file. These are what we call “constant headers,” because they refer to information that is constant throughout the file. Other headers can occur within the main body of the file. These “changeable headers” refer to information that varies during the course of the interaction.

A header is a line of text that gives information about the participants and the setting. All headers begin with the “@” sign. Some headers require nothing more than the @ sign and the header name. These are “bare” headers such as @Begin or @New Episode. However, most headers require some additional material. This additional material is called an “entry.” Headers that take entries must have a colon, which is then followed by one or two tabs and the required entry. By default, tabs are usually understood to be placed at eight-character intervals. The only purpose for the tabs is to improve the readability of the file header information. The material up to the colon is called the “header name.” In the following example, “@Age of CHI:” and “@Date:” are both header names.

@Age of CHI:	2;6.14
@Date:	25-JAN-1983

The text that follows the header name is called the “header entry.” In the example cited earlier, “2;6.14” and “25-JAN-1983” are the header entries. The header name and the header entry together are called the “header line.” The header line should never have a punctuation mark at the end. In CHAT, only utterances actually spoken by the subjects receive final punctuation.
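
To illustrate this last point with a purely hypothetical contrast, the header line “@Date:	25-JAN-1983” is correct, whereas “@Date:	25-JAN-1983.” with a final period would violate this rule, even though a period would be required at the end of a declarative utterance on a main line.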

This chapter presents a set of headers that researchers have considered important. You may find this list incomplete. In that case, CHAT allows you to add to it. You may also find many of the headers unnecessary. Except for the @Begin, @Participants, and @End headers, none of the headers are required, and you should feel free to use only those headers that you feel are needed for the accurate documentation of your corpus.

3.1 Obligatory Headers

CHAT uses three types of headers – obligatory, constant, and changeable. There are only three obligatory headers – @Begin, @Participants, and @End. Without these obligatory headers, the CLAN programs will not run correctly.

@Begin

This header is placed at the beginning of the file. It is needed to guarantee that no material has been lost at the beginning of the file. This is a “bare” header that takes no entry and uses no colon.

@Participants:

This header must be included as the second line in the file. It lists all of the actors within the file. The entry for this header is XXX Name Role, XXX Name Role, ..., XXX Name Role. XXX stands for the three-letter speaker ID. Here is an example of a completed @Participants header line:

@Participants: SAR Sue_Day Target_Child, CAR Carol Mother

Participants are identified by three elements: their speaker ID, their name, and their role:

1. Speaker ID. The speaker ID is usually composed of three letters. The code may be based either on the participant's name, as in *ROS or *BIL, or on her role, as in *CHI or *MOT. In this type of identifying system, several different children could be indicated as *CH1, *CH2, *CH3, and so on. Speaker IDs must be unique because they will be used to identify speakers both in the main body of the transcript and in other headers. In many transcripts, three letters are enough to distinguish all speakers. However, even with three letters, some ambiguities can arise. For example, suppose that the child being studied is named Mark (MAR) and his mother is named Mary (MAR). They would both have the same speaker ID and you would not be able to tell who was talking. So you must change one speaker ID. You would probably want to change it to something that would be easy to read and understand as you go through the file. A good choice is that speaker’s role. In this example, Mary’s speaker ID would be changed to MOT (Mother). You could change Mark’s speaker ID to CHI, but that would be misleading if there are other children in the transcript. So a better solution would be to use MAR and MOT as shown in the following example:

@Participants: MAR Mark Target_Child, MOT Mary Mother

Combinations of speaker and addressee can be indicated by combining three-letter codes, as in *CHI-MOT.
