|
|

Lab 2: CHILDES and Excel
Before you begin
Install CHILDES. If you are working in the CAS 330 lab, CHILDES should be installed on the machines already. If you are working at home, you will need to install it. See the Installation notes page for tips.
If you are working in CAS 330, you may also want to see the notes about saving your work in the CAS 330 lab as well.
On this page, you will find information on:
In-class exercises
The tasks: Download the Bloom 1970 corpus, for the transcripts
of Peter, and look at the case of subject pronouns in file 7.
Background: There's a short
PowerPoint presentation about Excel here and what we're about to do
here that might help you see how everything fits together (or the PDF
handout version)
To get Peter's transcripts:
- Go to the CHILDES page.
- Click on "Database"
- Click on "English-USA" (under "zipped transcripts")
- Download "Bloom70.zip" and un-zip it so that you have a "Bloom70" folder.
- The "Bloom70" folder should contain 2 folders, "eric" and "peter". It's "peter" we will be working with, specifically peter07.cha.
- If you are having trouble getting it, you can download the local copy of the peter07.cha file.
- Bloom, L., Hood, L., & Lightbown, P. (1974). Imitation in language development: If, when and why. Cognitive Psychology 6:380-420.
- Bloom, L., Lightbown, P., & Hood, L. (1975). Structure and variation in child language. Monographs of the Society for Research in Child Development 40 (Serial no. 160).
Step one: Start CLAN and set the Working directory
- You should know how to do this from the previous lab. But it is the first step. Make sure that the peter07.cha file is in your Working directory.
Step two: Create the search file
- We're going to be looking at the case of the (first person) subject
pronouns, and we need to create a file with a list of the things we
want to look for in the transcripts.
- Open TextEdit (Mac) or Notepad (Windows) and create a new file (New under the File menu) if one is not already open. If you are using TextEdit on a Mac, choose Make plain text under the Format menu.
- CLAN has a built-in text editor, but it is more trouble than it is worth, if you can even get it to work at all for this purpose. I found that on the current Mac version, anyway, the Unicode (UTF8) option cannot be disabled, which means every file you edit has "@UTF8" at the top, which then confuses the CLAN utilities. Maybe it will work better in Windows, but there are other, easier, options anyway. To wit:
- Enter the pronouns we want to look for in the editor window, each on its own line (local
copy):
- Then choose Save from the File menu to save
the file. Call the file prons1.txt and save it into your Working directory.
Step three: Execute the COMBO command to find the pronouns
- Type this into the CLAN command window
combo +t*CHI +w2 -w2 +s@prons1.txt peter07.cha > prons1found.txt
- This means:
- combo: search the transcript (using the combo command)
- +t*CHI: restrict the search to just the child utterances.
- +w2: show us two lines after each "hit".
- -w2: show us two lines before each "hit".
- +s@prons1.txt: Look in the prons1.txt file (in the Working directory) for a list of things that we want to search for.
- peter07.cha: Search the file peter07.cha (in the Working directory)
- > prons1found.txt: saves the result in a file called prons1found.txt.
- After you've done this, you should have a file called prons1found.txt in your Working directory.
- It will look like the local copy of prons1found.txt.
Step four: Get the data into Excel
- Start Excel.
- Choose Open... from the File menu and select prons1found.txt (you may
need to change the "Files of type:" option to "text files").
- Do not drag prons1found.txt onto Excel to open the file. This (at least on the Mac) will skip the "import" step described next and cause you trouble later on (because tab will have been treated as a delimiter, see the next purple comment).
- Click Next on the import box ("delimited" should be chosen),
then uncheck all of the "delimiters" and click Next, then
click Finish. The file should appear as an Excel spreadsheet.
- It's important that you follow these instructions pretty closely. Be
sure that none of the "delimiters" are checked. This
will probably mean that you need to uncheck "tab".
If you don't, you'll run into trouble, but it won't be obvious until
a few steps down the road.
- Choose Save As... from the File menu, change the Format to "Microsoft
Excel workbook" and save the file to your working directory as prons1found.xls.
- If you're working on the take-home exercise and working with nina13.cha,
and you are having trouble importing the file, you can download the prons3found.xls spreadsheet that I ended
up with after this step.
Step five: Create "sort" fields to allow us to distill
out the child utterances alone (without context).
- Change the name of the worksheet to raw by double-clicking
on the prons1found.txt tab at the bottom of the worksheet and typing
raw in its place.
- Next, we create some cells for sorting on.
- Select column A and Insert two columns (choose Column from the Insert menu, twice).
- At the top of the new column A (in cell A1), type "starcount",
and at the top of the new column B (in cell B1), type "utt".
These are our labels.
- Select the cell in column A that's in the same row as the first row
of dashes before the transcript data (should be row 7). Enter the following
into that cell (when you hit Return, it should show "0"):
=((LEFT(C7,1)="*")+A6)*(1-(LEFT(C7,3)="---")) |
What it does:
It looks at the transcript line, which is in the current row (7)
in column C, and checks to see if it starts with "---",
and if so, the formula will evaluate to zero. If not, the formula
will copy the number above it (A6) and if the transcript line
starts with a "*", it will add one. The effect of this
is that it will keep a running tally of the number of transcript
lines that start with a "*" since that last line of
dashes. Using this, we will always be able to identify the child's
utterance as being the fourth such row.
How it does it: The formula LEFT(C7,1) yields the first character (1 character long, starting from the left) in cell C7. The formula (LEFT(C7,1)="*") yields 1 if the first character in cell C7 is an asterisk, and 0 otherwise. A6 is the cell just above our formula; at the outset, there is nothing there, so it will be treated as zero. So, ((LEFT(C7,1)="*")+A6) means "take the number just above, and add one to it if C7 starts with an asterisk." The second half of the formula looks to see if the leftmost 3 characters of C7 is "---" (because a line of hyphens separates each result CLAN gave us). Again, we get 1 from (LEFT(C7,3)="---") if C7 starts with three hyphens, and 0 if it doesn't. Subtracting it from 1 inverts the result (1-1 is 0, 1-0 is 1). By multiplying them together, we reset the number to zero each time a line starts with "---" (because (1-(LEFT(C7,3)="---") will be zero), and we add one to the number above each time a line starts with "*". If the line starts with something other than "*" or "---", then the number above will simply be copied. |
- Then select the cell in column B that's in the same row as the row
of dashes and enter the following into that cell:
| =B6+(A7=4) |
What it does:
It either copies the number above it (B6) or adds one to that
number (if the starcount is at 4). The effect of this is that
we count the number of found utterances.
How it does it: Just like before, we use the fact that (A7=4) will yield 1 if the condition is true, and 0 if the condition is false. So, in the fourth line, that is, when the count in the A column has reached 4, (A7=4) will be 1 and so it will add one to the number above it. When the count in the A column is anything other than 4, the number above is simply copied (because we add zero to it, since (A7=4) yields zero). NOTE: The reason we are looking at line 4 is that when we ran the combo command in CLAN we chose to see the two lines preceding the "hit" (-w2), and each hit starts with a line like *** File "peter07.cha": line 5011. So, if you choose another "window size", you need to replace the number 4 in this formula with the number two greater than your -w option. |
- Now, select the two cells together (A7 and B7), choose Copy from the
Edit menu, select all of the cells in columns A and B between row 8
and the last row for which there is transcript data (down around 690
in this file), and then press Enter. This will copy those formulas into
all of the cells you had selected. There should now be numbers in columns
A and B.
- Insert two more columns before the transcript utterances (click on column C, choose Insert > Column twice). We will use these columns later for the purpose of marking utterances for exclusion. In row 1, put the labels hilite for column C and exclude for column D.
- There is a local copy of the spreadsheet as it looks after this step here: prons1found-step5.xls.
Step six: Create a worksheet with just the context-free child
utterances
- Create a new worksheet by choosing Worksheet from the Insert menu.
Name it data.
- Put the label rawline in A1.
- Select A2 and enter the following into that cell. It will probably say 11 once you've entered it.
=MATCH(ROW(A1),raw!$B$1:$B$800,0) |
What it does:
It looks in the "raw" worksheet and finds the number
of the first row in which the current number appears in the "utt" column. This will be the child's utterance.
How it does it: The MATCH function will search a range of cells. You provide it with the value to search for, the range of cells to search, and a "type" of search to do. We have set the type of search to "0" which means "exact match". We are searching the range raw!$B$1:$B$800, that is, cells from the raw worksheet (raw!...) from B1 to B800 (...$B$1:$B$800). The reason the $ is there before each column and row reference is just for insurance, really. Normally, if you copy a formula from one column to another or one row to another, row references will be translated. This means if you copy a formula that refers to B1 one column to the right, you will find that the new formula refers to C1. If you then copy the formula down one row, it will refer to C2. But a formula referring to $B$1 will always refer to B1 no matter where you copy it; $ means "don't translate". The value we are searching for is ROW(A1). This seems kind of dumb, since the ROW function just returns the row of a given cell, and ROW(A1) is very obviously 1. However, when we copy that down to lower rows, the A1 will be translated to A2, A3, A4, etc., because there is no $ there. The result is that the formula will be looking for 1 in the first row, 2 in the second row, 3 in the third row, and so forth. The ultimate result of the MATCH function is a number that tells us how far down the search range our number was. So, if we are looking for 4 in raw!$B$1:$B$800 and it is in B33, the formula will return 33, since our search range starts at B1. I picked 800 because it's more than the number of rows in the raw worksheet. |
- Select B2 and enter the following into that cell:
| =INDIRECT("raw!R"&A2&"C5",FALSE) |
What it does: This formula displays the child utterance found by the MATCH function in the previous formula, so we can see it.
How it does it: This formula gets the value of a cell, but one which we refer to indirectly. We construct
the "address" of the cell by putting together a string like "raw!R1C1"
(which would refer to the "raw" workbook, row 1, column
1, a.k.a. cell A1). The reason it is indirect is that we will
use the row number we just found in the cell in the A column,
so that we can put all of the child examples together without their context. The child examples in the raw worksheet are in column 5. The
FALSE at the end of the INDIRECT command tells Excel that the reference
will look like "R1C1" instead of like "A1". |
- Note:
if you do this and find that all you see in each cell in column B is
just "CHI:" then you forgot to uncheck "tab"
in the delimiters box back in Step 4. You can salvage the situation
by doing step 4 again (open the CHILDES output file, run through the
import procedure again but this time unchecking "tab"), selecting
column A (click on the "A" above the column), selecting Edit>Copy
from the menu bar, going back to the spreadsheet you were working on,
clicking on the "raw" tab at the bottom to bring back the
"raw" worksheet, and then selecting column E (where the utterances
are), and choosing Edit>Paste from the menu bar. This should copy
the utterances into your "raw" spreadhseet the way they were
supposed to be. You may need to force Excel to recalculate at this point,
which I think can be accomplished by either pressing F9 or Command-=.
- Select cells A2 and B2, copy them, and then paste them into the cells
below. You want to fill up column A and B down to around row 100 or
so.
- You'll see that at line 97, both columns start to read "#N/A".
That's because there are only 95 child utterances found, so it can't
find the row of the 96th utterance. You can delete the rows from 97
down to where you stopped pasting in the previous substep.
- There is a local copy of the spreadsheet as it looks after this step here: prons1found-step6.xls.
Step seven: Tag each line for subject/irrelevant
- Now, we want to weed out the non-subjects, and while we're at it we
may as well classify the subjects as I, me, or my. Select column B and
insert four columns, and then label them I, me, my, and other.
- If Excel asks you if you are managing a list and offers to take you to the "list manager", politely decline.
- Select columns A-E (all at once, click on the A at the top and drag rightward) and make them smaller by using Format > Column > Autofit selection.
- You may also want to split the screen horizontally so that you can
always see the top row; you do this by dragging the small handle on
the right edge of the screen at the top, down just a little. This control
is very hard to find, actually, but it is just above the top
of the vertical scrollbar on the right side of your worksheet.
- Go through the examples and decide if the pronoun is a subject pronoun
or not. If it is, put a 1 in the I, me, or my column, as appropriate. If it is a possessive
or object pronoun, or if it is completely unclear, mark 1 in the other column. If a line contains two pronouns, they should each be represented
in the number (i.e. enter "2" under my if there
are two "my" subjects), but you should not count immediate
repetitions (on the same line); those should count as just one. This will provide us with
a rough cut of the data, so we can concentrate just on the subjects.
Step eight: Colorize the worksheet to help restrict our attention
to the relevant data.
- This is perhaps not crucial, but it's somewhat nifty. We'll make it so that all rows that have a subject relevant to the analysis are highlighted. What makes line "relevant" is if there is at least one subject that is either "me", "my", or "I".
- To do this, put the cursor on the row number for row 2, and
click and drag to select all of the rows with data in them (rows 2 through 96). Then choose Conditional
Formatting from the Format menu.
- Note: It is actually important that you start at row 2 and drag down. Don't start at the bottom and drag up or this won't work. The 2s in the formula we are about to enter are evaluated relative to where the cursor is at the time you enter the formula.
- In the window that comes up, choose Formula Is (rather
than Cell Value Is) and enter the following:
| =(SUM($B2:$D2)>0) |
| What it does:
apply the formatting you are going to choose anytime this formula
evaluates to true -- that is, whenever the cell in column E of
the current row is 1. Notice that the cell references are written like $B2. The cell references are evaluated to where the cursor is (even though many cells are selected, the cursor still is somewhere), so assuming the cursor is in row 2, $B2 will refer to column B in the current row (in every cell that this formatting condition is evaluated in). The reason we need to "lock" this to column B using $ is that otherwise the next cell over would check columns C-E to see if it should be highlighted, and the one after that would check D-F, and so forth. We want every cell, no matter what column it is in, to be highlighted if and only if there is a number in columns B-D of the current row. |
- Click Format..., move to the Patterns pane, and choose a bright background
color (perhaps yellow) to put behind all of the utterances we want to highlight. When you've picked
it, click OK. You'll see an example of what it will look like.
- Then click OK, and the relevant child utterances in your worksheet
should now be highlighted.
Step nine: Tag the verb forms
- Now, we'll look at the verb forms for all of the utterances that have relevant subjects. Insert three more columns between
"irr" and the transcript data (click on the column that has
the transcript data and then choose Column from the Insert menu, three
times), and label them fin, nonfin and ambig.
You may also want to Format > Column > Autofit selection again
to make them the proper width.
- Now go through the relevant (highlighted) utterances and mark them
as either having an unambiguously finite verb, an unambiguously nonfinite
verb, or as having a verb which is ambiguous with respect to finiteness.
You will find that they are nearly all ambiguous.
Step ten: Isolate the
child utterances with unambiguous verb forms and relevant subjects.
- In order to isolate the
data points we care about, we'll make a column that will be 1 whenever there is both an unambiguous verb form (there is a number in column F or column G) and a relevant subject (there is a number in columns B through D), and whenever it hasn't been marked for exclusion on the raw worksheet.
- Insert a two new columns before the utterances and label the first one exclude and the second one include.
- In the exclude column, we want to simply copy the value of the exclude column from the raw worksheet, so go to the cell in row 2 and enter the following formula (presuming that your exclude column on the raw worksheet is column D):
=INDIRECT("raw!R"&A2&"C4", FALSE)
|
- In the include column, go to the cell
in row 2 and enter the following formula (presuming your exclude column in the data worksheet is column I), to implement the behavior described at the beginning of this step:
| =(SUM($B2:$D2)>0)*(SUM($F2:$G2)>0)*NOT($I2) |
- Copy the cell down through the column as before.
- There is a local copy of the spreadsheet as it looks after this step (sans numbers) here: prons1found-step10.xls.
Step eleven: Create a list of relevant examples with context
restored
- Now, we're going to use conditional formatting again to highlight the lines in the raw worksheet that are relevant for the data analysis (they have a relevant subject, they have an unambiguous verb form). We will then look at each of those utterances in context so that we can exclude repetitions or other utterances that from the context should be excluded.
- Switch to the raw worksheet.
- In the hilite column, we want to simply copy the value of the include column from the data worksheet. So, in the hilite column, go to the cell in row 7 (the first line of dashes)
and enter the following formula (presuming your include column in the raw worksheet is column J, the 10th column):
| =IF((B6=B7),0,INDIRECT("data!R"&(1+B7)&"C10", FALSE)) |
| What this does: The IF function takes a condition, a value to return if the condition is true, and a value to return if the condition is false. The condition we are checking is whether the utterance count is the same as in the line above (recall that it switches on each "hit"), and the utterance count is in column B. So (B6=B7) will be true if the utterance count has not changed. If it has not changed, we want the cell to be zero. If it has changed, then we want to copy the value of the include column from the data worksheet, so we use the INDIRECT function, reading from the row corresponding to this utterance (which happens to be one higher than the current "utterance number") and column 10. |
- Copy the cell down through the column as before. You can do this by selecting the cell you just entered the formula into, Edit > Copy, selecting the cell just below it, then scrolling down with the scroll bar to the end of the data, holding down Shift while selecting the cell in the hilite column at the bottom of the data, and then pressing Enter.
- Now, we'll apply conditional formatting. Select row 7, scroll to the bottom of the data using the scroll bar, and then hold down Shift while selecting the last row of data.
- Format > Conditional Formatting.... Choose Formula Is (instead of Cell Value Is) and enter the following:
| =$C7 |
| What this does: This just checks to see if there is anything other than zero in the hilite column, and, if so, highlights it. |
- Click Format..., move to the Patterns pane, and choose a bright background color (like yellow). Click OK to accept the highlighting choice, and then click OK to apply the formatting.
Step twelve: Assess the highlighted examples and exclude those that should be excluded
- The reason that we did that last part is that it is important to be
sure that the examples that are left are appropriate for counting. In particular
you want to look to be sure that you don't have immediate repetitions
of either the adults or the child himself, since those should probably
not be counted as separate utterances. This is the function of the exclude column on the raw worksheet.
- Go through the highlighted examples on the raw worksheet, and if you find an example which should be
omitted, put a 1 in the exclude column. The highlighting should disappear. We will make reference to that later
when we tabulate the results.
- As it turned out, when I went through
this myself I didn't find any to omit, but it's worth having this capability.
Step thirteen: Tabulate the results.
- Let's make a table to see what we got.
- Insert a new worksheet, call it analysis.
- Make a small 3x2 table with columns nom, acc,
gen, and rows finite and nonfinite.
- In the nom/finite cell, enter the following. As far as Excel is
concerned, this is a "multiple-value" formula, so you will
need to enter this by pressing Command-Enter (Mac)
Ctrl-Shift-Enter (Windows) after typing the formula. (It's not obvious that
it would be a "multiple-value" formula, since it returns only a single value, but nevertheless,
this is the way it is).
| =SUM(IF((data!$F$2:$F$400 > 0)*(data!$J$2:$J$400
> 0),data!$B$2:$B$400)) |
| What it does:
This one is somewhat hard to decipher, but here's what you need
to know: the fin column in the data worksheet is referred to by data!$F$2:$F$400,
the include column is represented by data!$J$2:$J$400,
and the I (nominative) column is represented by data!$B$2:$B$400.
What it will do is, if both the fin column and the include column have numbers in them, then it will
add the number from the I (nominative) column to the total. The number 400 was chosen only because it is greater than the number of rows with relevant data in the data worksheet. |
- Great, that's one cell (nominative, finite), but we need to finish the table. To do this, copy this cell into the other
two cells in the finite row. Then change $B (which stood for nominative subjects) in the formula to $C (for the acc column), or $D (for
the gen column). Then, copy the three cells in the finite row to the
three cells in the nonfinite row, and change $F (which stood for finite) in the formula to $G (for nonfinite). Don't
forget to use Command-Enter or Ctrl-Shift-Enter
at the end of each.
Step fourteen: Ponder.
- For what it's worth, the 9 utterances that I decided I should be counting
in the example spreadsheet (with all of these steps completed, downloadable
here if you want to look at it), were those listed below. You may have made decisions different from those I made and have come
up with slightly different numbers, though.
- Finite, nominative:
- this is # (1)I'll show you # (2)I'll show you
.
- <(1)I'll &sh xxx> [//] (2)I'll show
it to you .
- (1)I'm do it .
- (1)I'm a play the [*] toys .
- ok # (1)I did it .
- (1)I got it .
- (1)I'm gonna get a button .
- Nonfinite, genitive:
- (1)my writing .
- (1)my writing . (this second one's questionable,
maybe should have been discarded as a repetition)
Take-home exercises
Your task: Repeat the in-class exercise except now,
instead of looking for first-person pronouns, check Nina's use of third
person pronouns in the transcript nina13.
The data for Nina is in Suppes.zip, available from CHILDES. You want
to look at nina13.cha.
Your search file should look something like this:
he
he'*
his
him
him'*
she
she'*
her
her'* |
If it's helpful, you can download the ninapronouns.txt
file and the nina13.cha file here. (To
save the files on your local computer, try right-clicking or Command-clicking
on the link and choosing something like "Save file as..." or
"Download file..."). You may also want/need to skip the CLAN
steps altogether because of the way the computers are set up in the CAS
330 lab. If so, you can download prons3found.txt
from here and start from basically step 4 in the above instructions.
What I want you turn in is:
- The spreadsheet you created
- The counts you got for
- nominative pronouns (he, she) as subjects of unambiguously finite
verbs
- nominative pronouns (he, she) as subjects of unambiguously non-finite
verbs
- accusative pronouns (him, her) as subjects of unambiguously finite
verbs
- accusative pronouns (him, her) as subjects of unambiguously non-finite
verbs
- genitive pronouns (his, her) as subjects of unambiguously finite
verbs
- genitive pronouns (his, her) as subjects of unambiguously non-finite
verbs
- A short (couple of paragraphs at most, one page) commentary on how
your results fit in with the system in Schütze & Wexler (1996).
Of course, the data you analyzed was a small subset of the data that
went into the Schütze & Wexler paper, so you can expect that
it will fit in pretty well.
|