6 Exercises
6.2 MLU50 Analysis
The first CLAN analysis we will perform calculates MLU for each child on a sample of 50 utterances. By default, the MLU program excludes the strings xxx, yyy, and www, as well as any string immediately preceded by one of the following symbols: 0, &, +, -, #, $, or : (see the CHAT manual for a description of transcription conventions). The MLU program also excludes from all counts material in angle brackets followed by [/], [//], or [% bch] (see the CLAN manual for a list of the symbols CLAN considers to be word, morpheme, or utterance delimiters).
Remember that to perform any CLAN analysis, you need to be in the directory where your data is located when you issue the CLAN command. In this case, we want to be in /childes/clan/lib/ne20. The command string we used to compute MLU for all five children is:
mlu +t*CHI +z50u +f *.cha
+t*CHI Analyze the child speaker tier only
+z50u Analyze the first 50 utterances only
+f Save the results in a file
*.cha Analyze all files ending with the extension .cha
The only constraint on the order of elements in a CLAN command is that the name of the program (here, MLU) must come first. Many users find it good practice to put the name of the file on which the analysis is to be performed last, so that they can tell at a glance both what program was used and what file(s) were analyzed. Other elements may come in any order.
The option +t*CHI tells CLAN that we want only CHI speaker tiers considered in the analysis. Were we to omit this string, a composite MLU would be computed for all speakers in the file.
The option +z50u tells CLAN to compute MLU on only the first 50 utterances. We could, of course, have specified the child’s first 100 utterances (+z100u) or utterances from the 51st through the 100th (+z51u-100u). With no +z option specified, MLU is computed on the entire file.
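The effect of the +z forms mentioned above can be pictured as a simple slice over the speaker's utterances. The following Python sketch is only an illustration, not CLAN code: the function name select_utterances is hypothetical, and it handles only the two utterance-unit forms shown in the text (+zNu and +zMu-Nu).

```python
def select_utterances(utterances, z_option):
    """Return the utterances a +z option such as '+z50u' or '+z51u-100u' would pick.

    Illustrative only: this models the utterance-unit forms described in the
    text, not CLAN's actual option parser.
    """
    spec = z_option[2:]  # drop the leading '+z'
    if "-" in spec:
        start, end = spec.split("-")
        first = int(start.rstrip("uU"))
        last = int(end.rstrip("uU"))
    else:
        first = 1
        last = int(spec.rstrip("uU"))
    # Utterances are numbered from 1; slicing simply stops early when the
    # transcript is shorter than the requested range.
    return utterances[first - 1:last]


if __name__ == "__main__":
    sample = [f"utterance {i}" for i in range(1, 121)]  # made-up transcript
    print(len(select_utterances(sample, "+z50u")))
    print(select_utterances(sample, "+z51u-100u")[0])
```

Because the slice stops at the end of the list, a transcript with fewer utterances than requested simply yields all of the utterances it has.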
The option +f tells CLAN that we want the output recorded in output files, rather than simply displayed onscreen. CLAN will create a separate output file for each file on which it computes MLU. If we wish, we may specify a three-letter file extension for the output files immediately following the +f option in the command line. If a specific file extension is not specified, CLAN will assign one automatically. In the case of MLU, the default extension is .mlu.cex. The .cex at the end is mostly important for Windows, since it allows the Windows operating system to know that this is a CLAN output file.
Finally, the string *.cha tells CLAN to perform the analysis specified on each file ending in the extension .cha found in the current directory. To perform the analysis on a single file, we would specify the entire file name (e.g., 68.cha). It was possible to use the wildcard * in this and following analyses, rather than specifying each file separately, because:
1. All the files to be analyzed ended with the same file extension and were in the same directory; and
2. In each file, the target child was identified by the same speaker code (i.e., CHI), thus allowing us to specify the child’s tier by means of +t*CHI.
Using wildcards whenever possible is more efficient than repeatedly typing similar commands. It also cuts down on typing errors.
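To see what the wildcard saves, the following Python sketch expands a pattern such as *.cha into the individual commands one would otherwise type by hand. It is an illustration under stated assumptions: the helper name expand_wildcard is hypothetical, and every file name other than 68.cha is made up for the example. It only builds command strings; it does not run CLAN.

```python
from fnmatch import fnmatchcase


def expand_wildcard(file_names, pattern, options="+t*CHI +z50u +f"):
    """Build one mlu command string per file name matching the pattern."""
    matches = sorted(name for name in file_names if fnmatchcase(name, pattern))
    return [f"mlu {options} {name}" for name in matches]


if __name__ == "__main__":
    # Hypothetical directory listing; only 68.cha appears in the text.
    listing = ["68.cha", "70.cha", "notes.txt"]
    for command in expand_wildcard(listing, "*.cha"):
        print(command)
```

Each generated line corresponds to one single-file command; the wildcard form covers all of them at once, provided both conditions listed above hold.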
By default, CLAN computes MLU in morphemes, rather than words, if the transcript is morphemicized on the main line. The user may override this default and have CLAN ignore morphemicization symbols by using the -c option, followed by those symbols to be ignored. For example, -c# would instruct CLAN to ignore the prefix symbol in words such as un#tie; -c#- would result in both the # and - symbols in un#tie-ed being disregarded. Thus, researchers can choose not to count morphemes they believe the child is not yet using productively. To have all morphemicization symbols ignored, one would use -c#&-.
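The effect of ignoring morphemicization symbols can be sketched with a deliberately simplified counting model: one morpheme for the stem, plus one for each symbol that is still being counted. This Python sketch is an assumption-laden illustration, not CLAN's algorithm; the function name count_morphemes and its ignored parameter are hypothetical, and the model leaves out the exclusion rules described earlier.

```python
# The three morphemicization symbols named in the text.
MORPHEME_SYMBOLS = "#&-"


def count_morphemes(word, ignored=""):
    """Count morphemes in a word, skipping any symbols listed in `ignored`.

    Simplified model: 1 for the stem plus 1 per counted symbol.
    """
    counted = [symbol for symbol in MORPHEME_SYMBOLS if symbol not in ignored]
    return 1 + sum(word.count(symbol) for symbol in counted)


if __name__ == "__main__":
    print(count_morphemes("un#tie-ed"))                 # all symbols counted
    print(count_morphemes("un#tie-ed", ignored="#"))    # prefix symbol ignored
    print(count_morphemes("un#tie-ed", ignored="#&-"))  # word counted as a unit
```

Under this model, un#tie-ed counts as three morphemes by default, two when # is ignored, and one when all three symbols are ignored, which is the word-based count.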
For illustrative purposes, let us suppose that we ran the above analysis on only a single child (68.cha), rather than for all five children at once (by specifying *.cha). We would use the following command:
mlu +t*CHI +z50u 68.cha
The output for this command would be as follows:
> mlu +t*CHI +z50u 68.cha
mlu +t*CHI +z50U 68.cha
Wed Oct 20 11:46:51 1999
mlu (18-OCT-99) is conducting analyses on:
  ONLY speaker main tiers matching: *CHI;
****************************************
From file <68.cha>
MLU for Speaker: *CHI:
  MLU (xxx and yyy are EXCLUDED from the utterance and morpheme counts):
    Number of: utterances = 50, morphemes = 133
    Ratio of morphemes over utterances = 2.660
    Standard deviation = 1.570
MLU reports the number of utterances (in this case, the 50 utterances we specified), the number of morphemes that occurred in those 50 utterances, the ratio of morphemes over utterances (MLU in morphemes), and the standard deviation of utterance length in morphemes. The standard deviation statistic gives some indication of how variable the child’s utterance length is. This child’s average utterance is 2.660 morphemes long, with a standard deviation of 1.570 morphemes.
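The relationship among the four reported figures can be checked by hand: 133 morphemes divided by 50 utterances gives 2.660. The Python sketch below computes the same kind of summary from a list of per-utterance morpheme counts. It is a minimal illustration, not CLAN's implementation: the function name mlu_stats is hypothetical, the sample counts are made up, and the standard deviation is computed over n rather than n - 1, which is an assumption, since the text does not say which formula CLAN uses.

```python
import math


def mlu_stats(morphemes_per_utterance):
    """Return (utterances, morphemes, MLU, standard deviation).

    Assumption: the standard deviation divides by n (population form).
    """
    n = len(morphemes_per_utterance)
    if n == 0:
        raise ValueError("at least one utterance is required")
    total = sum(morphemes_per_utterance)
    mean = total / n
    variance = sum((m - mean) ** 2 for m in morphemes_per_utterance) / n
    return n, total, mean, math.sqrt(variance)


if __name__ == "__main__":
    # Made-up morpheme counts for four utterances.
    n, total, mlu, sd = mlu_stats([1, 2, 3, 4])
    print(f"utterances = {n}, morphemes = {total}")
    print(f"Ratio of morphemes over utterances = {mlu:.3f}")
    print(f"Standard deviation = {sd:.3f}")
    # The ratio reported in the text: 133 morphemes over 50 utterances.
    print(f"{133 / 50:.3f}")
```

A larger standard deviation relative to the mean indicates that the child's utterances vary more widely in length.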
Check line 1 of the output for typing errors in entering the command string. Check lines 3 and possibly 4 of the output to be sure the proper speaker tier and input file(s) were specified. Also check to be sure that the number of utterances or words reported is what was specified in the command line. If CLAN finds that the transcript contains fewer utterances or words than the number specified with the +z option, it will still run the analysis but will report the actual number of utterances or words analyzed.