GRS LX 700 Lang Acq and Ling Thy
CHILDES lab

This lab is due September 30. The deadline was extended from September 25 because of a typo that could have affected your counts.

To do this lab you will need access to the CHILDES data and the CLAN analysis program. There are two ways that you can use the analysis programs. One is to download the program (available for Mac, Windows, and some flavors of Unix) and the relevant databases to your own computer, and the other is to use the web interface (WebCLAN). See "Picking a version of CLAN to use" below for comments on the options. In the text below, I will mark things that apply only to CLAN installed on your own machine with [C] and things that apply only to WebCLAN with [W].

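On this page, you will find information on:

  • Picking a version of CLAN to use
  • Locating Nina's transcripts with WebCLAN [W]
  • Downloading Nina's transcripts [C]
  • Starting CLAN on your own machine [C]
  • The structure of a (Web)CLAN command
  • Your assignment
  • Comments on combo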

(The task you'll be doing in this lab assignment was originally formulated by Martha McGinnis, U. Calgary)


Picking a version of CLAN to use

You need to choose between a) downloading and installing CLAN on your computer, along with the corpus we'll be working with, and b) using WebCLAN online.

The CLAN program (and WebCLAN) is really a collection of several different commands that you can execute to analyze your data. If you install CLAN on your computer, then you type these commands entirely within its command box. If you are using WebCLAN then you pick the command you want to run from a popup menu, and just type the arguments in the blank next to the popup menu, then press "Run".

There are advantages to both. The availability of WebCLAN is a giant leap forward in terms of making labs like this for a course easy to manage, because there are always various types of headaches that go with trying to download and install things. A big advantage of WebCLAN is that all you need is an internet connection and a browser. For the purposes of doing this lab, it can do everything you need it to do. However, it does rely on your being connected to the internet (and also, it relies on CMU's web site being up). It also does not automatically save your output anywhere, so you have to save it yourself.

On the other hand, installing CLAN yourself makes sense if you are going to be doing more than just this lab with it, for example, if you are going to do research with CHILDES corpora for your final project. This is primarily because the version of CLAN that you install can do slightly more. The biggest thing that it allows you to do is to provide your search criteria in an external file (rather than typing them in on the command line). This can be very helpful when trying to do complex searches. The ability of CLAN to recall previously used commands easily is also quite helpful (although this can be accomplished for WebCLAN in a limited way by using the back button in your browser).

What I would probably recommend is: use WebCLAN for this lab, but once you have an idea of how it works, try to install CLAN on your own computer if you want to continue working with CHILDES.

If you want to install CLAN on your computer, go to the CHILDES page, click on "CLAN programs", and choose the version for your computer. I will assume you basically know how to install things, although if you are trying to do this and having problems, I might be able to help in a limited way.


Locating Nina's transcripts with WebCLAN: [W]

If you are using WebCLAN, you find Nina's transcripts as follows.

  • Go to the WebCLAN page.
  • Click on the "Eng-USA" button.
  • Click on the "Suppes" button.
  • You should now see several files like "nina01.cha"; these are the transcripts.

Downloading Nina's transcripts (this is the data we will be working with): [C]

If you opt to install CLAN, you will also need to download the data that we will be working with. Here is where you find it:

  • Go to the CHILDES page.
  • Click on "CHILDES database"
  • Click on "English-USA" (not the "tagged" one, the regular one)
  • Download "suppes.zip" and un-zip it so that you have a "suppes" folder.
  • The "suppes" folder should contain 52 files, starting with "nina01.cha".

Starting CLAN on your own machine: [C]

If you have installed CLAN on your own machine, start the program (double-click on the CLAN icon). To use it, you will need to tell it a couple of things about where to find certain files. This is accomplished by using the "Working", "Output", "Lib", and "Mor lib" buttons on the main CLAN window. When you press each of these buttons, you will be asked to find a folder on your computer.

First, you should be able to leave "Lib" and "Mor lib" alone, but they should point to the folder in which the CLAN application resides.

The "Working" folder is where CLAN looks for the input data. You want to point this to the "suppes" folder you downloaded earlier, which contains the Nina transcripts.

The "Output" folder is where CLAN will store any output files you ask for. You probably want to point this to a new folder you create for this purpose.


The structure of a (Web)CLAN command:

  • A CLAN command comes in three parts. Here is an example of one such command.
    mlu +t*CHI nina* > mlu-nina.txt
  • This command computes the MLU from the utterances in a file. The first word on the line is the name of the command. In CLAN, you type this command name followed by a space, and in WebCLAN, you choose "mlu" from a popup list.
  • The part after the command name and before the ">" is the list of parameters. There are two parameters given here.
  • The first parameter is "+t*CHI", which indicates that you are only interested in the "*CHI" tier. That is, you only want the MLU to be computed for the child utterances. To know that the child utterances are on the "*CHI" tier (meaning that they are on lines that begin with "*CHI:" in the transcript), you may have to look at the file. *CHI is common, but sometimes it is some abbreviation for the child's name instead. You can also examine other tiers (like *MOT for "mother", if that's what it's called in the transcript) in the same way. "+t" stands for "examine tier." NOTE: There are no spaces here. You cannot put a space anywhere in "+t*CHI" or it will not work. Parameters are identified by looking for symbols between spaces, and this is a single parameter.
  • The second parameter is "nina*". This parameter indicates which transcript files you want to process. The "*" is a "wildcard" here, and so "nina*" means that you want to look at any file whose name begins with "nina" (so, all of them). You could also say "nina01.cha" to look at just the first transcript, etc. This parameter that indicates the input files is generally the last parameter included (so if you have other parameters, they should come before the parameter that identifies the input files).
  • After the parameters, a CLAN command will often have ">" followed by a filename. This filename is the name of the file that the output of the command will be stored in. [W] There is no equivalent for WebCLAN: a WebCLAN command never includes the ">" or anything that follows it. Instead, the WebCLAN command will send its output to your web browser, and you save the result manually by choosing Save in your browser and entering the filename to save it as.
  • Note: [W] The commands given in the lab will be in the same format as the one above, that is, what you would enter into CLAN. If you are using WebCLAN, make the appropriate adjustments for WebCLAN as needed (choose the command from the popup menu, and enter just the parameters, not the ">" or anything following the ">").

CLAN:

[Image: clan-mlu]

WebCLAN:

[Image: webclan-mlu]
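If it helps to see the three parts spelled out, here is a minimal Python sketch (Python is not part of CLAN, and the function names are made up for illustration). It splits a command line of the kind used in this lab into the command name, the parameters, and the output file, and then builds what you would type into WebCLAN: the parameters only, with no ">" part and, as noted in the combo comments, no quotation marks around the search string. It assumes parameters never contain spaces or ">", which holds for the commands on this page.

```python
def split_clan_command(line):
    """Split a CLAN command line into (command, parameters, output_file)."""
    # Everything after ">" names the output file; it may be absent.
    before, _, after = line.partition(">")
    words = before.split()            # parameters are separated by spaces
    command, parameters = words[0], words[1:]
    output_file = after.strip() or None
    return command, parameters, output_file


def webclan_arguments(line):
    """Return the text to type in the blank next to the WebCLAN popup menu."""
    _, parameters, _ = split_clan_command(line)
    # WebCLAN takes the search string without quotation marks.
    return " ".join(p.replace('"', "") for p in parameters)


command, parameters, output_file = split_clan_command(
    "mlu +t*CHI nina* > mlu-nina.txt"
)
print(command)       # mlu
print(parameters)    # ['+t*CHI', 'nina*']
print(output_file)   # mlu-nina.txt
print(webclan_arguments("mlu +t*CHI nina* > mlu-nina.txt"))  # +t*CHI nina*
```

In WebCLAN you would then choose "mlu" from the popup menu and type only the last line of output into the blank.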


Your assignment:

The lab assignment comes in six parts. I will mark things to hand in with a [H] symbol.

 

Part 1: Use CLAN to determine MLU.

Use the mlu command to determine the MLU for Nina's transcripts.

You can do this with the following CLAN command, discussed in the instructions above. If you use CLAN, the results will be in a file called "mlu-nina.txt" in the Output directory. If you use WebCLAN, it will be displayed in your browser, and you should save the file yourself as "mlu-nina.txt".

mlu +t*CHI nina* > mlu-nina.txt

 

Part 2: Record Nina's age and MLU for each file 01-19.

The file "mlu-nina.txt" should now contain the MLU for Nina's utterances in each transcript ("ratio of morphemes over utterances").

Open each transcript file from 01 through 19. (Note: there is no file nina08.cha.) Observe that at the top of each file, Nina's age in that file is recorded.

[H] To hand in: A list containing, for each file from nina01.cha to nina19.cha, Nina's age in the transcript, and the MLU you computed for the transcript using the mlu command from part 1.

 

Part 3: Use CLAN to determine word frequencies

For two representative samples, we will use CLAN to determine the frequency with which each word in the transcript appears. To do this, we use the freq command. It works very much like the mlu command described above. We will run freq on nina10.cha and nina19.cha, and you can use the following commands to do this.

freq +t*CHI nina10.cha > freq-nina10.txt
freq +t*CHI nina19.cha > freq-nina19.txt

After having done this, you will have two lists of words and numbers (one from file 10, one from file 19). We will look at each, and pick a regular verb that occurs the most often from each file.

I found that eat and see seemed to be equally popular verbs in the nina10.cha file. Somewhat arbitrarily, we'll look at eat (see is complicated by the fact that it often occurs as "See?", which properly lacks a subject). I discounted have because it can be an auxiliary (and auxiliaries behave differently), also an unnecessary complication for what we are trying to do.

In the nina19.cha file, I picked get as the verb to look at. It's a common verb, not as popular as irregular go, but go is involved in some auxiliary uses like have. Want would be a reasonable verb to pick as well, but it isn't even as interesting to look at as get.

 

Part 4: Use CLAN to look at subject drop in a small sample of two files

Part 4a. Search the transcripts for the examples.

Having picked a common verb from each file, what we're going to do is look at each time the verb is used in the transcript and count how often it appears with a subject.

To do this, use the following CLAN commands.

combo +t*CHI -w2 +s"eat*" nina10.cha > selected-nina10.txt
combo +t*CHI -w2 +s"get*" nina19.cha > selected-nina19.txt

Make sure you know why it does what it does; read the combo notes at the end of this web page.

This will give you two files (selected-nina10.txt and selected-nina19.txt), which contain the child utterances containing the verbs you've picked and the two lines preceding each.

Part 4b. Count up the totals.

Now, go through each example and decide which of the following categories it falls under. Be sure to read the "exclusion" criteria carefully. You may find it helpful to print this out and do it with a pencil.

  • X. Excluded. The utterance is (a) a repetition of an immediately preceding utterance (either by the child or the adults), (b) incomprehensible, (c) part of a rote-learned expression (e.g., "...how I wonder what you are"), (d) an imperative. We do not want to count these because either they are not certain to reflect the child's productive grammar, or because no subject is required in the adult language.
  • O. Overt subject. The verb has an overt (pronounced) subject.
  • N. Null subject. The verb should have had a subject, but the subject is missing.
  • F. Fragment. These look a lot like null subjects, but if a child answers a question like "what are you doing?" with "Eating sandwiches", it isn't accurate to call that a null subject utterance. However, in response to "What were the monkeys eating?", "Eating a balloon" should count as a null subject (not as a fragment), since this is not a well-formed fragment in adult speech. We will exclude these from the analysis, but you might as well differentiate them from the X category.

Part 4c. Describe what you found

[H] Create a 2 x 3 table of results (2 rows and 3 columns) like the one below. Fill in the null and overt subject numbers for each file. In the third column, compute the percentage of included utterances for each file that have overt subjects (divide the number of overt subjects by the sum of both overt and missing subjects, and then multiply by 100).

                   Null subjects       Overt subjects      Percentage with overt subject
nina10.cha; eat    N for nina10.cha    O for nina10.cha    100 * O/(N+O)
nina19.cha; get    N for nina19.cha    O for nina19.cha    100 * O/(N+O)
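The arithmetic for the third column can be checked with a few lines of Python. This is only a sketch: the function name is made up, and the counts in the example are hypothetical, not Nina's actual numbers.

```python
def overt_subject_percentage(overt, null):
    """Return 100 * O / (N + O): the share of included utterances with an overt subject."""
    included = overt + null   # X (excluded) and F (fragment) utterances are in neither count
    if included == 0:
        raise ValueError("no included utterances to compute a percentage from")
    return 100 * overt / included


# Hypothetical counts, for illustration only.
print(overt_subject_percentage(overt=6, null=4))  # 60.0
```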

[H] Write a sentence that describes the results (e.g., does the percentage of dropped subjects decrease as Nina gets older?).

 

Part 5: Use CLAN to study Nina's use of subject drop in wh-questions

Search Nina's transcripts 01 through 19 for occurrences of the following wh-words: who, what, where, when, how, why, whose, which. You should create two output files, one for transcripts 01 through 09, and one for transcripts 10 through 19.

I'll leave you on your own to figure out how to do this. Consult the combo notes on the bottom of this page for some tips. It will probably be easier to count things if you print out your results, but you might consider printing 2-up (or even 4-up) to save paper.

Also: be sure your search will find not only what but what's and what'll and so forth. If you don't, you'll find that you have almost no instances in the first output file. There is an easy way to do this, by adding the "*" (wildcard) character.
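To see what the wildcard buys you, here is a small Python illustration. It uses Python's own fnmatch module as a stand-in for wildcard matching, so it shows the general idea rather than CLAN's actual matcher, and the word list is made up.

```python
from fnmatch import fnmatchcase


def matching_words(words, pattern):
    """Return the words that match a wildcard pattern ("*" stands for anything)."""
    return [w for w in words if fnmatchcase(w, pattern)]


# Hypothetical word list; only the wildcard behaviour is being illustrated.
words = ["what", "what's", "what'll", "whatever", "that", "where's"]

print(matching_words(words, "what"))    # ['what']
print(matching_words(words, "what*"))   # ['what', "what's", "what'll", 'whatever']
```

Note that the wildcard version also picks up "whatever", so expect to weed out a few irrelevant hits by hand.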

If Nina is repeating something she or someone else just said, don't count that utterance.

Go through your two output files in detail. For each output file, tally up and record how many utterances fall into each of the following four classes:

  • A. impossible to tell whether wh-word is the subject or not (e.g. one-word utterance)
  • B. wh-word is the subject
  • C. wh-word is not the subject, and the subject is dropped
  • D. wh-word is not the subject, and the subject is overt

A note on how to count: You will find that there are a lot of utterances like What's that? or Where's another boy with a valentine?. Although one could potentially look at these a couple of different ways, I would count these in class D, that is, the class where the wh-word is not the subject and the subject is overt, deriving them from that is what and another boy with a valentine is where. (I had originally suggested classifying these as class "C", but that was a typo; "D" is what I meant.)

[H] Create a 2 x 3 table of results (2 rows and 3 columns) like the one below. Let the first row represent Nina's early transcripts (01-09) and the second row represent her later transcripts (10-19). This works just like the table from before. Let the first column represent the number of utterances in class C for each set of transcripts, and the second column represent the number of utterances in class D for each set of transcripts.

                             Non-subject wh-word,    Non-subject wh-word,    Percentage of wh-questions
                             null subject            overt subject           with overt subject
Early transcripts (01-09)
Later transcripts (10-19)

For the third column of your table, calculate the percentage of these (non-subject wh-word) utterances that have an overt subject, by adding the class C and class D amounts for each set of transcripts together, then dividing the class D amount by the result and multiplying by 100 (that is, 100 * D / (C+D)). Put the resulting percentage of overt subjects for each set of transcripts in the third column.
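If you keep your tallies electronically rather than on paper, something like the following Python sketch would do the counting and the percentage in one go. The function name and the list of labels are hypothetical; the labels stand for the class (A, B, C, or D) you assigned by hand to each utterance in one output file.

```python
from collections import Counter


def summarize_wh_codes(codes):
    """Tally hand-assigned class labels and return (counts, percent with overt subject)."""
    counts = Counter(codes)
    non_subject = counts["C"] + counts["D"]   # classes A and B are left out of the percentage
    percent_overt = 100 * counts["D"] / non_subject if non_subject else None
    return counts, percent_overt


# Hypothetical labels for eight utterances, for illustration only.
codes = ["A", "D", "D", "C", "B", "D", "A", "C"]
counts, percent_overt = summarize_wh_codes(codes)
print(counts["C"], counts["D"])   # 2 3
print(percent_overt)              # 60.0
```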

[H] Describe what you see (e.g., does the percentage of overt subjects increase as Nina gets older?).

 

Part 6: Discuss the comparison with Valian's (1991) results.

Consider the tables below, from O'Grady (1997), based on data from Valian (1991). They show overall percentages of dropped subjects in general, not just in (non-subject) wh-questions.

[H] Describe how your results on subject omission for eat and get (from part 4) compare with what Valian found. Mention things like whether you found more or less omission than Valian found, and pay particular attention to the groups of children whose age and/or MLU match the transcript you are looking at.

[H] Describe how your results on subject omission in wh-questions (from part 5) compare to the overall rate of subject omission. Mention things like whether subjects are dropped more often or less often in wh-questions.

[H] Consider your results in light of the hypothesis that "topic drop" accounts for some of the cases of subject omission in Child English (cf. comments about Bromberg & Wexler 1995 from the class handouts). Do your results support this hypothesis? Briefly explain why or why not.

Table 1. English-speaking children in Valian's study
(based on Valian 1991:38)

Group    No. of children    Age range     MLU
I        5                  1;10 - 2;2    1.53 - 1.99
II       5                  2;3 - 2;8     2.24 - 2.76
III      8                  2;3 - 2;6     3.07 - 3.72
IV       3                  2;6 - 2;8     4.12 - 4.38

 

Table 2. Proportion of utterances containing a subject
(based on Valian 1991:44-45)

Group    Mean    Range
I        69%     55 - 82%
II       89%     84 - 94%
III      93%     87 - 99%
IV       95%     92 - 97%
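Since the groups overlap in age, it can be fiddly to work out which ones a given transcript matches. As a rough aid, here is a Python sketch that encodes the two tables and reports the groups whose age range and whose MLU range contain a given child. The function names are made up, the example child is hypothetical, and the sketch assumes ages are written "years;months", optionally followed by ".days", which it ignores.

```python
def age_in_months(age):
    """Convert an age such as "2;3" to months (any ".days" part is ignored)."""
    years, _, rest = age.partition(";")
    months = rest.split(".")[0] or "0"
    return 12 * int(years) + int(months)


# (group, youngest, oldest, lowest MLU, highest MLU, mean % of utterances with a subject)
VALIAN_GROUPS = [
    ("I",   "1;10", "2;2", 1.53, 1.99, 69),
    ("II",  "2;3",  "2;8", 2.24, 2.76, 89),
    ("III", "2;3",  "2;6", 3.07, 3.72, 93),
    ("IV",  "2;6",  "2;8", 4.12, 4.38, 95),
]


def matching_groups(age, mlu):
    """Return (groups whose age range contains age, groups whose MLU range contains mlu)."""
    months = age_in_months(age)
    by_age = [g for g, lo, hi, *_ in VALIAN_GROUPS
              if age_in_months(lo) <= months <= age_in_months(hi)]
    by_mlu = [g for g, _, _, lo, hi, _ in VALIAN_GROUPS if lo <= mlu <= hi]
    return by_age, by_mlu


# Hypothetical child aged 2;5 with an MLU of 3.2.
print(matching_groups("2;5", 3.2))   # (['II', 'III'], ['III'])
```

An MLU that falls in the gap between two groups matches neither, in which case you will want to compare against both neighbors.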

 

O'Grady, William (1997). Syntactic Development. Chicago: University of Chicago Press.

Valian, Virginia (1991). Syntactic subjects in the early speech of American and Italian children. Cognition 40:21-81.


Comments on combo:

CLAN includes a relatively powerful searching tool called combo. I will outline a couple of points here, although you should probably refer to the CLAN manual for more information. [W] NOTE: Although in CLAN, you should surround your search string with quotation marks (as shown below), in WebCLAN you should not put quotation marks around the search string. It's weird, but that's what I found. So, where it says +s"what^my" below in the CLAN command, you should just type +swhat^my if you are using WebCLAN.

An example of the combo command is given below:

combo +t*CHI +w2 -w2 +s"what^my" nina* > whatmy.txt

This command says:

  • combo: the command
  • +t*CHI: restrict attention to the lines uttered by the child
  • +w2: show me the line you find and 2 lines after it.
  • -w2: show me the line you find and two lines before it.
  • +s"what^my": search for "what" followed directly by "my". ([W] See note above: this should be +swhat^my for WebCLAN.)
  • nina*: search all of the files in the Working directory that begin with "nina".
  • > whatmy.txt: Save the results in a file called "whatmy.txt" in the Output directory.

This will look for "what" immediately followed by "my" in any of the nina files, returning something like this:

*** File "Moxie:CLAN:suppes:nina19.cha": line 254.
  *CHI: I want to play with you here .
  *CHI: look what my got .
  *CHI: look (1)what (1)my got .
  *MOT: I see what you got .
  *MOT: what did you get ?

You can see that we used the "^" character in the search string. This character means "immediately followed by", so what we searched for was "what" immediately followed by "my". In these search strings there are several other special characters that you can use.

  • x^y
    • Finds x immediately followed by y. x and y are full words
  • *
    • Finds anything
  • _
    • Finds any one character
  • x+y
    • Finds x or y
  • !x
    • Finds anything except x

You can combine these in various ways to get useful effects. A couple of common things you might use are:

  • x^*^y
    • Finds x eventually followed by y (unlike with x^y, y does not need to immediately follow x). Literally this means, search for x, immediately followed by anything, immediately followed by y.
  • *ing
    • Finds anything that ends in ing. For example, verbs like swimming. Of course it will also get some irrelevant things like thing, boring, etc.
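The difference between x^y and x^*^y can be pictured with a short Python sketch that works on a list of words. This is not how combo is implemented, and the function names are made up; in particular, the sketch assumes that "eventually followed by" also covers the case where y comes right after x.

```python
def immediately_followed(tokens, x, y):
    """Rough analogue of x^y: is some x directly followed by y?"""
    return any(a == x and b == y for a, b in zip(tokens, tokens[1:]))


def eventually_followed(tokens, x, y):
    """Rough analogue of x^*^y: does y occur anywhere after some x?"""
    return any(a == x and y in tokens[i + 1:] for i, a in enumerate(tokens))


utterance = "look what my got".split()
print(immediately_followed(utterance, "what", "my"))   # True
print(immediately_followed(utterance, "look", "my"))   # False
print(eventually_followed(utterance, "look", "my"))    # True
```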

Some example combo commands are:

  • combo +t*CHI +w2 -w2 +s"the^*^!grey^*^(dog+cat)" nina*
    • This will search for "the" followed eventually ("^*^" means "followed by anything followed by...") by something other than "grey" ("!grey" means "not grey"), followed eventually by either "dog" or "cat" ("dog+cat" means either "dog" or "cat"). It will not find "the grey cat" but it will find "the black cat", "the big red dog", etc.
  • combo +t*CHI +w2 -w2 +s"my^*^*ing" nina*
    • This will search for all instances of "my" followed eventually by something that ends in "ing".

[C] Instead of typing in the thing you are searching for each time, you can also use a "search" file (but this function is not available in WebCLAN). The "search" file is a text file that contains the things you want to search for. An example search file might look like this (searching for first person pronouns).

I
I'*
me
me'*
my
my'*

If you save this file as "search-1pron.txt" in your Working directory, then you could do the search with the following combo command, where the @ tells combo to look in your file for the list of things to search for.

combo +t*CHI +w2 -w2 +s@search-1pron.txt nina* > pron1-nina.txt

[W] Because you cannot use a search file with WebCLAN, you have to enter everything as part of the parameters. In order to do the search described above, you would need to provide this as the argument for combo instead:

+t*CHI +w2 -w2 +sI+I'*+me+me'*+my+my'* nina*
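If your list of search terms is long, typing the "+"-joined version by hand is error-prone. A few lines of Python (not part of CLAN; the function name is made up) can build the argument from the same text you would have put in a search file:

```python
def webclan_search_argument(search_file_text):
    """Join the non-empty lines of a search file with "+" (combo's "or") after "+s"."""
    terms = [line.strip() for line in search_file_text.splitlines() if line.strip()]
    return "+s" + "+".join(terms)


search_file_text = """I
I'*
me
me'*
my
my'*
"""
print(webclan_search_argument(search_file_text))  # +sI+I'*+me+me'*+my+my'*
```

You would then paste the result into the WebCLAN blank along with the other parameters, as in the line above.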