Course Info
Syllabus
Blog
Bibliography
Internet Resources
Labs
Home

CAS LX 522 Syntax I
Lab 2: CHILDES and Excel


Before you begin

Install CHILDES. If you are working in the CAS 330 lab, CHILDES should be installed on the machines already. If you are working at home, you will need to install it. See the Installation notes page for tips.

If you are working in CAS 330, you may also want to see the notes about saving your work in the CAS 330 lab as well.

On this page, you will find information on:


In-class exercises

The tasks: Download the Bloom 1970 corpus, for the transcripts of Peter, and look at the case of subject pronouns in file 7.

Background: There's a short PowerPoint presentation about Excel here and what we're about to do here that might help you see how everything fits together (or the PDF handout version)

To get Peter's transcripts:

  • Go to the CHILDES page.
  • Click on "Database"
  • Click on "English-USA" (under "zipped transcripts")
  • Download "Bloom70.zip" and un-zip it so that you have a "Bloom70" folder.
  • The "Bloom70" folder should contain 2 folders, "eric" and "peter". It's "peter" we will be working with, specifically peter07.cha.
  • If you are having trouble getting it, you can download the local copy of the peter07.cha file.
    • Bloom, L., Hood, L., & Lightbown, P. (1974). Imitation in language development: If, when and why. Cognitive Psychology 6:380-420.
    • Bloom, L., Lightbown, P., & Hood, L. (1975). Structure and variation in child language. Monographs of the Society for Research in Child Development 40 (Serial no. 160).

Step one: Start CLAN and set the Working directory

  • You should know how to do this from the previous lab. But it is the first step. Make sure that the peter07.cha file is in your Working directory.

Step two: Create the search file

  • We're going to be looking at the case of the (first person) subject pronouns, and we need to create a file with a list of the things we want to look for in the transcripts.
  • Open TextEdit (Mac) or Notepad (Windows) and create a new file (New under the File menu) if one is not already open. If you are using TextEdit on a Mac, choose Make plain text under the Format menu.
    • CLAN has a built-in text editor, but it is more trouble than it is worth, if you can even get it to work at all for this purpose. I found that on the current Mac version, anyway, the Unicode (UTF8) option cannot be disabled, which means every file you edit has "@UTF8" at the top, which then confuses the CLAN utilities. Maybe it will work better in Windows, but there are other, easier, options anyway. To wit:
  • Enter the pronouns we want to look for in the editor window, each on its own line (local copy):
    I
    I'*
    me
    me'*
    my
    my'*
  • Then choose Save from the File menu to save the file. Call the file prons1.txt and save it into your Working directory.

Step three: Execute the COMBO command to find the pronouns

  • Type this into the CLAN command window
    combo +t*CHI +w2 -w2 +s@prons1.txt peter07.cha > prons1found.txt
  • This means:
    • combo: search the transcript (using the combo command)
    • +t*CHI: restrict the search to just the child utterances.
    • +w2: show us two lines after each "hit".
    • -w2: show us two lines before each "hit".
    • +s@prons1.txt: Look in the prons1.txt file (in the Working directory) for a list of things that we want to search for.
    • peter07.cha: Search the file peter07.cha (in the Working directory)
    • > prons1found.txt: saves the result in a file called prons1found.txt.
  • After you've done this, you should have a file called prons1found.txt in your Working directory.
  • It will look like the local copy of prons1found.txt.

Step four: Get the data into Excel

  • Start Excel.
  • Choose Open... from the File menu and select prons1found.txt (you may need to change the "Files of type:" option to "text files").
    • Do not drag prons1found.txt onto Excel to open the file. This (at least on the Mac) will skip the "import" step described next and cause you trouble later on (because tab will have been treated as a delimiter, see the next purple comment).
  • Click Next on the import box ("delimited" should be chosen), then uncheck all of the "delimiters" and click Next, then click Finish. The file should appear as an Excel spreadsheet.
    • It's important that you follow these instructions pretty closely. Be sure that none of the "delimiters" are checked. This will probably mean that you need to uncheck "tab". If you don't, you'll run into trouble, but it won't be obvious until a few steps down the road.
  • Choose Save As... from the File menu, change the Format to "Microsoft Excel workbook" and save the file to your working directory as prons1found.xls.
    • If you're working on the take-home exercise and working with nina13.cha, and you are having trouble importing the file, you can download the prons3found.xls spreadsheet that I ended up with after this step.

Step five: Create "sort" fields to allow us to distill out the child utterances alone (without context).

  • Change the name of the worksheet to raw by double-clicking on the prons1found.txt tab at the bottom of the worksheet and typing raw in its place.
  • Next, we create some cells for sorting on.
  • Select column A and Insert two columns (choose Column from the Insert menu, twice).
  • At the top of the new column A (in cell A1), type "starcount", and at the top of the new column B (in cell B1), type "utt". These are our labels.
  • Select the cell in column A that's in the same row as the first row of dashes before the transcript data (should be row 7). Enter the following into that cell (when you hit Return, it should show "0"):

    =((LEFT(C7,1)="*")+A6)*(1-(LEFT(C7,3)="---"))

    What it does: It looks at the transcript line, which is in the current row (7) in column C, and checks to see if it starts with "---", and if so, the formula will evaluate to zero. If not, the formula will copy the number above it (A6) and if the transcript line starts with a "*", it will add one. The effect of this is that it will keep a running tally of the number of transcript lines that start with a "*" since that last line of dashes. Using this, we will always be able to identify the child's utterance as being the fourth such row.

    How it does it: The formula LEFT(C7,1) yields the first character (1 character long, starting from the left) in cell C7. The formula (LEFT(C7,1)="*") yields 1 if the first character in cell C7 is an asterisk, and 0 otherwise. A6 is the cell just above our formula; at the outset, there is nothing there, so it will be treated as zero. So, ((LEFT(C7,1)="*")+A6) means "take the number just above, and add one to it if C7 starts with an asterisk." The second half of the formula looks to see if the leftmost 3 characters of C7 is "---" (because a line of hyphens separates each result CLAN gave us). Again, we get 1 from (LEFT(C7,3)="---") if C7 starts with three hyphens, and 0 if it doesn't. Subtracting it from 1 inverts the result (1-1 is 0, 1-0 is 1). By multiplying them together, we reset the number to zero each time a line starts with "---" (because (1-(LEFT(C7,3)="---") will be zero), and we add one to the number above each time a line starts with "*". If the line starts with something other than "*" or "---", then the number above will simply be copied.

  • Then select the cell in column B that's in the same row as the row of dashes and enter the following into that cell:
    =B6+(A7=4)

    What it does: It either copies the number above it (B6) or adds one to that number (if the starcount is at 4). The effect of this is that we count the number of found utterances.

    How it does it: Just like before, we use the fact that (A7=4) will yield 1 if the condition is true, and 0 if the condition is false. So, in the fourth line, that is, when the count in the A column has reached 4, (A7=4) will be 1 and so it will add one to the number above it. When the count in the A column is anything other than 4, the number above is simply copied (because we add zero to it, since (A7=4) yields zero). NOTE: The reason we are looking at line 4 is that when we ran the combo command in CLAN we chose to see the two lines preceding the "hit" (-w2), and each hit starts with a line like *** File "peter07.cha": line 5011. So, if you choose another "window size", you need to replace the number 4 in this formula with the number two greater than your -w option.

  • Now, select the two cells together (A7 and B7), choose Copy from the Edit menu, select all of the cells in columns A and B between row 8 and the last row for which there is transcript data (down around 690 in this file), and then press Enter. This will copy those formulas into all of the cells you had selected. There should now be numbers in columns A and B.
  • Insert two more columns before the transcript utterances (click on column C, choose Insert > Column twice). We will use these columns later for the purpose of marking utterances for exclusion. In row 1, put the labels hilite for column C and exclude for column D.
  • There is a local copy of the spreadsheet as it looks after this step here: prons1found-step5.xls.

Step six: Create a worksheet with just the context-free child utterances

  • Create a new worksheet by choosing Worksheet from the Insert menu. Name it data.
  • Put the label rawline in A1.
  • Select A2 and enter the following into that cell. It will probably say 11 once you've entered it.

    =MATCH(ROW(A1),raw!$B$1:$B$800,0)

    What it does: It looks in the "raw" worksheet and finds the number of the first row in which the current number appears in the "utt" column. This will be the child's utterance.

    How it does it: The MATCH function will search a range of cells. You provide it with the value to search for, the range of cells to search, and a "type" of search to do. We have set the type of search to "0" which means "exact match". We are searching the range raw!$B$1:$B$800, that is, cells from the raw worksheet (raw!...) from B1 to B800 (...$B$1:$B$800). The reason the $ is there before each column and row reference is just for insurance, really. Normally, if you copy a formula from one column to another or one row to another, row references will be translated. This means if you copy a formula that refers to B1 one column to the right, you will find that the new formula refers to C1. If you then copy the formula down one row, it will refer to C2. But a formula referring to $B$1 will always refer to B1 no matter where you copy it; $ means "don't translate". The value we are searching for is ROW(A1). This seems kind of dumb, since the ROW function just returns the row of a given cell, and ROW(A1) is very obviously 1. However, when we copy that down to lower rows, the A1 will be translated to A2, A3, A4, etc., because there is no $ there. The result is that the formula will be looking for 1 in the first row, 2 in the second row, 3 in the third row, and so forth. The ultimate result of the MATCH function is a number that tells us how far down the search range our number was. So, if we are looking for 4 in raw!$B$1:$B$800 and it is in B33, the formula will return 33, since our search range starts at B1. I picked 800 because it's more than the number of rows in the raw worksheet.

  • Select B2 and enter the following into that cell:
    =INDIRECT("raw!R"&A2&"C5",FALSE)

    What it does: This formula displays the child utterance found by the MATCH function in the previous formula, so we can see it.

    How it does it: This formula gets the value of a cell, but one which we refer to indirectly. We construct the "address" of the cell by putting together a string like "raw!R1C1" (which would refer to the "raw" workbook, row 1, column 1, a.k.a. cell A1). The reason it is indirect is that we will use the row number we just found in the cell in the A column, so that we can put all of the child examples together without their context. The child examples in the raw worksheet are in column 5. The FALSE at the end of the INDIRECT command tells Excel that the reference will look like "R1C1" instead of like "A1".

    • Note: if you do this and find that all you see in each cell in column B is just "CHI:" then you forgot to uncheck "tab" in the delimiters box back in Step 4. You can salvage the situation by doing step 4 again (open the CHILDES output file, run through the import procedure again but this time unchecking "tab"), selecting column A (click on the "A" above the column), selecting Edit>Copy from the menu bar, going back to the spreadsheet you were working on, clicking on the "raw" tab at the bottom to bring back the "raw" worksheet, and then selecting column E (where the utterances are), and choosing Edit>Paste from the menu bar. This should copy the utterances into your "raw" spreadhseet the way they were supposed to be. You may need to force Excel to recalculate at this point, which I think can be accomplished by either pressing F9 or Command-=.
  • Select cells A2 and B2, copy them, and then paste them into the cells below. You want to fill up column A and B down to around row 100 or so.
  • You'll see that at line 97, both columns start to read "#N/A". That's because there are only 95 child utterances found, so it can't find the row of the 96th utterance. You can delete the rows from 97 down to where you stopped pasting in the previous substep.
  • There is a local copy of the spreadsheet as it looks after this step here: prons1found-step6.xls.

Step seven: Tag each line for subject/irrelevant

  • Now, we want to weed out the non-subjects, and while we're at it we may as well classify the subjects as I, me, or my. Select column B and insert four columns, and then label them I, me, my, and other.
    • If Excel asks you if you are managing a list and offers to take you to the "list manager", politely decline.
  • Select columns A-E (all at once, click on the A at the top and drag rightward) and make them smaller by using Format > Column > Autofit selection.
  • You may also want to split the screen horizontally so that you can always see the top row; you do this by dragging the small handle on the right edge of the screen at the top, down just a little. This control is very hard to find, actually, but it is just above the top of the vertical scrollbar on the right side of your worksheet.
  • Go through the examples and decide if the pronoun is a subject pronoun or not. If it is, put a 1 in the I, me, or my column, as appropriate. If it is a possessive or object pronoun, or if it is completely unclear, mark 1 in the other column. If a line contains two pronouns, they should each be represented in the number (i.e. enter "2" under my if there are two "my" subjects), but you should not count immediate repetitions (on the same line); those should count as just one. This will provide us with a rough cut of the data, so we can concentrate just on the subjects.

Step eight: Colorize the worksheet to help restrict our attention to the relevant data.

  • This is perhaps not crucial, but it's somewhat nifty. We'll make it so that all rows that have a subject relevant to the analysis are highlighted. What makes line "relevant" is if there is at least one subject that is either "me", "my", or "I".
  • To do this, put the cursor on the row number for row 2, and click and drag to select all of the rows with data in them (rows 2 through 96). Then choose Conditional Formatting from the Format menu.
    • Note: It is actually important that you start at row 2 and drag down. Don't start at the bottom and drag up or this won't work. The 2s in the formula we are about to enter are evaluated relative to where the cursor is at the time you enter the formula.
  • In the window that comes up, choose Formula Is (rather than Cell Value Is) and enter the following:
    =(SUM($B2:$D2)>0)
    What it does: apply the formatting you are going to choose anytime this formula evaluates to true -- that is, whenever the cell in column E of the current row is 1. Notice that the cell references are written like $B2. The cell references are evaluated to where the cursor is (even though many cells are selected, the cursor still is somewhere), so assuming the cursor is in row 2, $B2 will refer to column B in the current row (in every cell that this formatting condition is evaluated in). The reason we need to "lock" this to column B using $ is that otherwise the next cell over would check columns C-E to see if it should be highlighted, and the one after that would check D-F, and so forth. We want every cell, no matter what column it is in, to be highlighted if and only if there is a number in columns B-D of the current row.
  • Click Format..., move to the Patterns pane, and choose a bright background color (perhaps yellow) to put behind all of the utterances we want to highlight. When you've picked it, click OK. You'll see an example of what it will look like.
  • Then click OK, and the relevant child utterances in your worksheet should now be highlighted.

Step nine: Tag the verb forms

  • Now, we'll look at the verb forms for all of the utterances that have relevant subjects. Insert three more columns between "irr" and the transcript data (click on the column that has the transcript data and then choose Column from the Insert menu, three times), and label them fin, nonfin and ambig. You may also want to Format > Column > Autofit selection again to make them the proper width.
  • Now go through the relevant (highlighted) utterances and mark them as either having an unambiguously finite verb, an unambiguously nonfinite verb, or as having a verb which is ambiguous with respect to finiteness. You will find that they are nearly all ambiguous.

Step ten: Isolate the child utterances with unambiguous verb forms and relevant subjects.

  • In order to isolate the data points we care about, we'll make a column that will be 1 whenever there is both an unambiguous verb form (there is a number in column F or column G) and a relevant subject (there is a number in columns B through D), and whenever it hasn't been marked for exclusion on the raw worksheet.
  • Insert a two new columns before the utterances and label the first one exclude and the second one include.
  • In the exclude column, we want to simply copy the value of the exclude column from the raw worksheet, so go to the cell in row 2 and enter the following formula (presuming that your exclude column on the raw worksheet is column D):

    =INDIRECT("raw!R"&A2&"C4", FALSE)

  • In the include column, go to the cell in row 2 and enter the following formula (presuming your exclude column in the data worksheet is column I), to implement the behavior described at the beginning of this step:
    =(SUM($B2:$D2)>0)*(SUM($F2:$G2)>0)*NOT($I2)
  • Copy the cell down through the column as before.
  • There is a local copy of the spreadsheet as it looks after this step (sans numbers) here: prons1found-step10.xls.

Step eleven: Create a list of relevant examples with context restored

  • Now, we're going to use conditional formatting again to highlight the lines in the raw worksheet that are relevant for the data analysis (they have a relevant subject, they have an unambiguous verb form). We will then look at each of those utterances in context so that we can exclude repetitions or other utterances that from the context should be excluded.
  • Switch to the raw worksheet.
  • In the hilite column, we want to simply copy the value of the include column from the data worksheet. So, in the hilite column, go to the cell in row 7 (the first line of dashes) and enter the following formula (presuming your include column in the raw worksheet is column J, the 10th column):
    =IF((B6=B7),0,INDIRECT("data!R"&(1+B7)&"C10", FALSE))
    What this does: The IF function takes a condition, a value to return if the condition is true, and a value to return if the condition is false. The condition we are checking is whether the utterance count is the same as in the line above (recall that it switches on each "hit"), and the utterance count is in column B. So (B6=B7) will be true if the utterance count has not changed. If it has not changed, we want the cell to be zero. If it has changed, then we want to copy the value of the include column from the data worksheet, so we use the INDIRECT function, reading from the row corresponding to this utterance (which happens to be one higher than the current "utterance number") and column 10.
  • Copy the cell down through the column as before. You can do this by selecting the cell you just entered the formula into, Edit > Copy, selecting the cell just below it, then scrolling down with the scroll bar to the end of the data, holding down Shift while selecting the cell in the hilite column at the bottom of the data, and then pressing Enter.
  • Now, we'll apply conditional formatting. Select row 7, scroll to the bottom of the data using the scroll bar, and then hold down Shift while selecting the last row of data.
  • Format > Conditional Formatting.... Choose Formula Is (instead of Cell Value Is) and enter the following:
    =$C7
    What this does: This just checks to see if there is anything other than zero in the hilite column, and, if so, highlights it.
  • Click Format..., move to the Patterns pane, and choose a bright background color (like yellow). Click OK to accept the highlighting choice, and then click OK to apply the formatting.

Step twelve: Assess the highlighted examples and exclude those that should be excluded

  • The reason that we did that last part is that it is important to be sure that the examples that are left are appropriate for counting. In particular you want to look to be sure that you don't have immediate repetitions of either the adults or the child himself, since those should probably not be counted as separate utterances. This is the function of the exclude column on the raw worksheet.
  • Go through the highlighted examples on the raw worksheet, and if you find an example which should be omitted, put a 1 in the exclude column. The highlighting should disappear. We will make reference to that later when we tabulate the results.
    • As it turned out, when I went through this myself I didn't find any to omit, but it's worth having this capability.

Step thirteen: Tabulate the results.

  • Let's make a table to see what we got.
  • Insert a new worksheet, call it analysis.
  • Make a small 3x2 table with columns nom, acc, gen, and rows finite and nonfinite.
  • In the nom/finite cell, enter the following. As far as Excel is concerned, this is a "multiple-value" formula, so you will need to enter this by pressing Command-Enter (Mac) Ctrl-Shift-Enter (Windows) after typing the formula. (It's not obvious that it would be a "multiple-value" formula, since it returns only a single value, but nevertheless, this is the way it is).
    =SUM(IF((data!$F$2:$F$400 > 0)*(data!$J$2:$J$400 > 0),data!$B$2:$B$400))
    What it does: This one is somewhat hard to decipher, but here's what you need to know: the fin column in the data worksheet is referred to by data!$F$2:$F$400, the include column is represented by data!$J$2:$J$400, and the I (nominative) column is represented by data!$B$2:$B$400. What it will do is, if both the fin column and the include column have numbers in them, then it will add the number from the I (nominative) column to the total. The number 400 was chosen only because it is greater than the number of rows with relevant data in the data worksheet.
  • Great, that's one cell (nominative, finite), but we need to finish the table. To do this, copy this cell into the other two cells in the finite row. Then change $B (which stood for nominative subjects) in the formula to $C (for the acc column), or $D (for the gen column). Then, copy the three cells in the finite row to the three cells in the nonfinite row, and change $F (which stood for finite) in the formula to $G (for nonfinite). Don't forget to use Command-Enter or Ctrl-Shift-Enter at the end of each.

Step fourteen: Ponder.

  • For what it's worth, the 9 utterances that I decided I should be counting in the example spreadsheet (with all of these steps completed, downloadable here if you want to look at it), were those listed below. You may have made decisions different from those I made and have come up with slightly different numbers, though.
    • Finite, nominative:
      • this is # (1)I'll show you # (2)I'll show you .
      • <(1)I'll &sh xxx> [//] (2)I'll show it to you .
      • (1)I'm do it .
      • (1)I'm a play the [*] toys .
      • ok # (1)I did it .
      • (1)I got it .
      • (1)I'm gonna get a button .
    • Nonfinite, genitive:
      • (1)my writing .
      • (1)my writing . (this second one's questionable, maybe should have been discarded as a repetition)

Take-home exercises

Your task: Repeat the in-class exercise except now, instead of looking for first-person pronouns, check Nina's use of third person pronouns in the transcript nina13.

The data for Nina is in Suppes.zip, available from CHILDES. You want to look at nina13.cha.

Your search file should look something like this:
he
he'*
his
him
him'*
she
she'*
her
her'*

If it's helpful, you can download the ninapronouns.txt file and the nina13.cha file here. (To save the files on your local computer, try right-clicking or Command-clicking on the link and choosing something like "Save file as..." or "Download file..."). You may also want/need to skip the CLAN steps altogether because of the way the computers are set up in the CAS 330 lab. If so, you can download prons3found.txt from here and start from basically step 4 in the above instructions.

What I want you turn in is:

  • The spreadsheet you created
  • The counts you got for
    • nominative pronouns (he, she) as subjects of unambiguously finite verbs
    • nominative pronouns (he, she) as subjects of unambiguously non-finite verbs
    • accusative pronouns (him, her) as subjects of unambiguously finite verbs
    • accusative pronouns (him, her) as subjects of unambiguously non-finite verbs
    • genitive pronouns (his, her) as subjects of unambiguously finite verbs
    • genitive pronouns (his, her) as subjects of unambiguously non-finite verbs
  • A short (couple of paragraphs at most, one page) commentary on how your results fit in with the system in Schütze & Wexler (1996). Of course, the data you analyzed was a small subset of the data that went into the Schütze & Wexler paper, so you can expect that it will fit in pretty well.