
CAS LX 522 Syntax I
Lab 4: Praat and the GLA lab


This page has three parts: introductory notes, in-class exercises, and take-home exercises.


Introductory notes.

We'll be exploring learnability in OT using the ranking software included in Praat.

There is a "competitor" called OTSoft, a program that can also do ranking/learning in Optimality Theory. Currently, it is PC-only, although I have contacted Bruce Hayes about the possibility of porting it to REALbasic (which would allow it to run on either Mac or Windows machines). I have not had time to finish this porting project (I originally started it 2 years ago!), and Praat is for the moment more refined. Feel free to explore OTSoft anyway, of course.


In-class exercises

The task: Run Praat on (basically) Anttila's data from Finnish plurals.

First step: Get Praat

  • It should be installed on the computers in the lab. If it isn't there (or if you aren't there and you want to do the take-home part of the lab), go to the Praat page and download it.
    • The Windows download results in a "self-extracting archive" which you can double-click to produce the Praat application. The Mac download results in a "Stuffit" archive, which you should be able to double-click and decompress. There is currently a note on the Mac archive that it won't decompress with Stuffit Expander 8, but I had no trouble with Stuffit Expander 8.0.2.
    • Also note: The Praat program is very frequently updated. If you download the program yourself, you will almost certainly have a newer version than the one installed on the lab machines. As of the date of class, it was updated 13 days ago. This shouldn't matter (much, I hope).

Second step: Get the data on Finnish plurals.

These files come from Paul Boersma's GLA page. What we want is Anttila's data, Anttila_data.txt (or the local copy). Right-click (or Ctrl-click) on that link and choose Save Target As... or Download Link to Disk. You might want to make a folder on the desktop (or on the file server if you've connected your account) to hold your files. Wherever you put it, just make sure you know where it is.

Third step: Start Praat.

  • Double-click on the funny mouth+ear icon

Fourth step: Create the NoCoda grammar

  • In the Objects window, choose New>Optimality Theory>Create NoCoda Grammar.

Fifth step: Load Anttila data.

  • From the Read menu, choose Read from file... and find the Anttila_data.txt file you downloaded in step 2.
  • Choose OTGrammar Anttila and click Edit to see an overview of the grammar. Check to be sure that the ranking values are all at 100 (the initial state).

Sixth step: Create the "primary linguistic data".

  • In the Objects window, choose PairDistribution Anttila. This is a survey of how frequent each pair of input and output is in the corpus that Anttila studied. Praat can take this and create a set of input and output strings that match the statistical properties of Anttila's original corpus.
  • Now, let's learn. Click on OTGrammar Anttila, then control-click (Windows) or command-click (Mac) on PairDistribution Anttila so that both are selected.
  • Click Learn... We'll use the defaults, 10000 iterations per plasticity, starting at a plasticity of 1 and going down 0.1 each round for four rounds (so, a total of 40000 data points considered).

And we'll do some other stuff...


Take-home exercises

Your task: Try the GLA on the French NRF data we talked about in class last week. The paper that discusses the theoretical background (in terms of Reynolds-style floating constraints) is in PDF form here as Legendre-etal-NRFs.pdf. It might be useful to at least skim this paper. I will mark things in red that you should write down/print out for handing in with the lab writeup. There's a summary at the end of this web page.

Step 1: Consider the problem; work out the input distribution that the "kid" will "hear."

  • According to Table 4 in the Legendre et al. paper, adults used on average 31% non-present forms and 37% non-3sg forms.
  • Assume that tense and agreement are fully independent (that is, the likelihood of seeing a non-present form does not depend on whether the form is also non-3sg, and vice-versa).
  • Given this, determine what the percentage of each of the following should be in adult speech (you don't need to hand this in, but you'll copy these numbers into a table later).
    • non-present 3sg forms?
      • The likelihood of non-present 3sg forms is the likelihood of a non-present form (31%) times the likelihood of a 3sg form (63%). And so forth.
    • non-present non-3sg forms?
    • present 3sg forms?
    • present non-3sg forms?
    • NRFs? (non-finite root form; a root infinitive)
  • The assumption behind our predictions is that kids intend to use non-present 3sg forms, for example, as often as adults do. Yet sometimes, functional projections will be missing, and the resulting form that the child produces will be a default form (present tense if TP is missing, 3sg if AgrP is missing).
  • So, for every input, which will have tense (either non-present, or present) and agreement (either non-3sg, or 3sg), there are four possible output forms:
    • TP and AgrP (the tense form in the input, the agr form in the input)
    • TP only (the tense form in the input, and default 3sg)
    • AgrP only (default present tense, and the agr form in the input)
    • neither TP nor AgrP (NRF form)
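The independence calculation in Step 1 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the lab files; the function name `joint_distribution` is made up for this example, and the two rates are the ones quoted above from Table 4.

```python
# Adult rates quoted above from Table 4 of Legendre et al.
P_NONPRES = 0.31  # proportion of non-present forms
P_NON3SG = 0.37   # proportion of non-3sg forms


def joint_distribution(p_nonpres, p_non3sg):
    """Return the likelihood of each agreement-tense combination,
    assuming tense and agreement are fully independent."""
    tense = {"nonpres": p_nonpres, "pres": 1 - p_nonpres}
    agr = {"non3sg": p_non3sg, "3sg": 1 - p_non3sg}
    return {f"{a}-{t}": pa * pt for a, pa in agr.items() for t, pt in tense.items()}


# Print each combination as a percentage with one decimal place.
for form, p in joint_distribution(P_NONPRES, P_NON3SG).items():
    print(f"{form}: {100 * p:.1f}%")
```

Because the four combinations exhaust the possibilities, the four likelihoods should add up to 100%, which is a quick way to check a hand calculation.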

Step 2a: Creating the french.praat.txt file: Get the skeleton file and look at how it works.

  • Download this skeleton file skeleton.praat.txt (right-click, or control-click and save to disk).
  • Make a copy and call the copy french.praat.txt.
  • Open the file in a text editor (WordPad, SimpleText, BBEdit).
  • Look at how the constraints work. There are four constraints:
    • ParseA
    • ParseT
    • *F
    • *F2
  • Each is ranked somewhere -- higher numbers correspond to higher-ranked (more important) constraints.
  • How are they ranked initially?
  • Why are they ranked that way?

Step 2b: Creating the french.praat.txt file: Entering tableaux

  • The skeleton file has one input. We need more. The one I've given you a model of is "non3sg-nonpres". What the other ones are should be pretty obvious (given that one is "non3sg-nonpres").
    • What these represent are the different possible things the child could be trying to say. So, the one I gave you represents the situation when the child is trying to say something in the non-present with a non-3sg subject (e.g., 1sg past).
  • For each input (things the child could be trying to say), there are four possible outcomes (given at the end of step 1, above).
  • Looking at the input I provided, notice that after the label of the input (e.g., "non3sg-nonpres"), the number of outcomes is given (i.e. 4). This is the number of different ways a kid might actually end up producing an intended 'non3sg-nonpres'.
  • Underneath the number, there is one line for each candidate (each possible output). The first thing in the line is how it comes out (e.g., "non3sg-pres") followed by as many numbers as there are constraints. Each number after the name represents how many violations this candidate incurs according to the constraints (in the order they were listed earlier in the file). In English, what you should get from the example I provided is:
    • Non3sg-nonpres satisfies both ParseT and ParseA, violates *F twice, and *F2 once.
    • Non3sg-pres satisfies ParseA but violates ParseT, violates *F, but satisfies *F2. (Note: pres here means the default; TP is missing for this candidate).
  • To do: Create the missing inputs. They should be modeled on input 1 but for the other possible things (other than non3sg-nonpres) that a kid might try to say.
    • To get you started: Make a copy of the existing input (from the line that says "non3sg-nonpres" 4 down to the fourth candidate below it) below the existing input. Let's start with "non3sg-pres" (where the kid is trying to say something in the present tense with a non3sg subject), so change the "non3sg-nonpres" in the copy you just made to "non3sg-pres". Now, notice that because the kid is trying to use present tense, whenever TP appears in the output candidate, the form that surfaces will be in present tense. The first and third candidates are ones in which TP surfaces (note they have zero violations for ParseT, the second number after the candidate name), so those candidates should faithfully reflect the tense of the input. Practically speaking, this means you need to change the output form for those two candidates to read "pres" and not "nonpres". You should end up with the first two lines reading "non3sg-pres" (the first is faithfully representing tense, the second is a default, but they both sound the same), the third line reading "3sg-pres", and the fourth line reading "nrf". You can follow this same procedure and logic for the other inputs. Below is the spelled-out version of the lines for this second input just described:
      "non3sg-pres" 4
         "non3sg-pres" 0 0 2 1
         "non3sg-pres" 0 1 1 0
         "3sg-pres" 1 0 1 0
         "nrf" 1 1 0 0
      
  • When you're done, you also need to make sure that the 1 in the line above the inputs (that used to read "1 inputs" in the skeleton file), is changed to reflect the total number of inputs you have when you are finished.
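If you want to check your hand-edited tableaux, the procedure in Step 2b can be sketched in Python. This is a hypothetical helper, not part of the skeleton file: the names `STRUCTURES`, `surface_form`, and `tableau` are invented here, and the sketch assumes that every input uses the same four violation profiles as the model input (in the order ParseA, ParseT, *F, *F2).

```python
# Violation profiles copied from the model input, in constraint order
# ParseA, ParseT, *F, *F2. Assumption: the same profiles hold for every input.
STRUCTURES = [
    ("TP and AgrP", (0, 0, 2, 1)),
    ("AgrP only", (0, 1, 1, 0)),
    ("TP only", (1, 0, 1, 0)),
    ("neither", (1, 1, 0, 0)),
]


def surface_form(agr, tense, structure):
    """Return how an intended (agr, tense) comes out for a given structure.
    Missing AgrP gives default 3sg; missing TP gives default present."""
    if structure == "neither":
        return "nrf"
    out_agr = agr if structure in ("TP and AgrP", "AgrP only") else "3sg"
    out_tense = tense if structure in ("TP and AgrP", "TP only") else "pres"
    return f"{out_agr}-{out_tense}"


def tableau(agr, tense):
    """Return the lines of one input block in the layout shown above."""
    lines = [f'"{agr}-{tense}" {len(STRUCTURES)}']
    for structure, violations in STRUCTURES:
        marks = " ".join(str(v) for v in violations)
        lines.append(f'   "{surface_form(agr, tense, structure)}" {marks}')
    return lines


print("\n".join(tableau("non3sg", "pres")))
```

Calling `tableau("non3sg", "pres")` reproduces the spelled-out second input above, so the same call with other arguments gives something to compare your remaining inputs against.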

Step 2c: Creating the french.praat.txt file: Corpus distribution

  • Now, we need to provide the model of the adult speech.
  • I've provided a "pair distribution" for you, but without the right weights (right now they all read "100").
  • These indicate how often the input (first string) is pronounced like the output (second string) in adult speech.
  • We'll assume no variation in the adult language, so the only way an input comes out is faithfully. Hence, we have only one option per input form.
  • However, there is a difference in how often people say 3sg-nonpres and non3sg-pres (we figured that out above, in step 1). Change the "100"s in the skeleton file to reflect the proportions we expect to see for each one. So, if you found that 19.5% of the time adults will use 3sg-nonpres forms, change the "100" by that example to "195". And so forth for the other three. After changing the first one as just described, that part of the file should look like this:
    "PairDistribution" "French"
    4
    
    "3sg-pres" "3sg-pres" 100
    "3sg-nonpres" "3sg-nonpres" 195
    "non3sg-pres" "non3sg-pres" 100
    "non3sg-nonpres" "non3sg-nonpres" 100
    
  • After changing the other three numbers, you should have a file that you can load into Praat.
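The conversion from a percentage to a PairDistribution weight is just a multiplication by ten (19.5% becomes 195). A minimal Python sketch of that step, using an invented helper name `pair_line`, might look like this:

```python
def pair_line(intended, produced, percent):
    """Format one PairDistribution line, scaling a percentage to a
    whole-number weight in tenths of a percent (19.5 -> 195)."""
    weight = round(percent * 10)
    return f'"{intended}" "{produced}" {weight}'


print(pair_line("3sg-nonpres", "3sg-nonpres", 19.5))
```

The scale itself is a convenience: what the learner sees is the relative size of the weights, so the point is only to use the same scale on every line.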

Step 3: Load it and run the learning algorithm.

  • Go into Praat and use "Read">"Read from file..." in the Objects window and load your file. It should load without errors (if not, go back and check to make sure you didn't make any typos).
    • If you have a lot of trouble getting your file to load, email it to me and maybe I can help troubleshoot.
  • First, make sure that both the OTGrammar French and PairDistribution French are selected.
    • They should be after you load your file, but if they aren't, you should be able to click on one and then hold down shift and click on the other one to get both selected.
  • You should see two buttons, "Learn..." and "Get Fraction correct..."
  • Click "Get Fraction correct..." and let it run for 100000 iterations at 2.0 noise (the defaults) and make a note of what percentage of the corpus it could predict. The number should be quite low.
  • Now, click "Learn..." and run it with the default values (4 plasticities, 100000 iterations per plasticity, 2.0 noise, 0.1 decrement per plasticity, 1 chew, respect local rankings, strategy "symmetric all").
  • Get Fraction correct... again and note the improvement. It should be dramatic.

Step 4a: Simulate stage 4b: figure out the target percentages

  • Now, we're going to simulate child stage 4b, and run the learning algorithm on it to see what ranking it comes up with.
  • To accomplish what we want to do, we need to edit the PairDistribution at the bottom of the file so that it reflects what the kids at stage 4b are doing (rather than what the adults produce).
  • Let's try to figure out the numbers to put in there.
  • According to the Legendre et al. paper, adults use non-3sg 37% of the time (thus 3sg 63% of the time), and non-present 31% of the time (thus present 69% of the time). We will assume that what kids mean to say has these same proportions, but sometimes they end up producing defaults (3sg, present, or NRFs). At stage 4b, kids are producing 15% non-3sg and 15% non-present, so not quite the adult level. Plus, they are producing 18% NRFs.
  • So, for agreement the target is 37% non-3sg. 18% of those come out as NRFs, 41% (15/37) come out as non-3sg, and the remaining 41% come out as default 3sg. For tense the target is 31% non-present. 18% of those come out as NRFs, 48% (15/31) come out as non-present, and the remaining 34% come out as default present. This reduces to the following likelihoods for finding each of TP and AgrP:
    Structure              Likelihood
    TP and AgrP             7%
    TP only                41%
    AgrP only              34%
    neither TP nor AgrP    18%
  • And, now, let's figure out how often kids (and adults) are aiming for each of the four possible combinations of non-3sg/3sg and nonpresent/present. Draw a table like this:
    Intended form       Calculation    Likelihood
    non3sg & nonpres    37% x 31%      11%
    non3sg & pres       37% x 69%      26%
    3sg & nonpres
    3sg & pres
  • Fill in the missing cells in the table. These numbers may seem familiar; you worked them out earlier.
  • Now, we can combine the two to figure out the percentages for each of the inputs (from the second table) of each form (from the first table). Draw four tables; I've drawn two of them below. The asterisks will be explained shortly. In the first table below, "non3sg & nonpres" is the input. This is what the kid is trying to say. The kid tries to say "non3sg & nonpres" 11% of the time according to the second table above. There are four possible outcomes when the kid tries to say something (the structure has TP only, AgrP only, both, or neither). From the first table above, the likelihood of having TP and AgrP together is 7%. So the overall likelihood of this input surfacing with this candidate chosen is 7% x 11%, or 0.7%.
    Input               Input %    Structure        Structure %    Overall
    non3sg & nonpres    11%        TP&AgrP          7%             0.7%
                                   TP only          41%            4.5%
                                   AgrP only        34%            3.7%
                                   neither (NRF)    18%            2.0%

    non3sg & pres       26%        TP&AgrP *        7%             1.8%
                                   TP only          41%            10.7%
                                   AgrP only *      34%            8.8%
                                   neither (NRF)    18%            4.7%
  • Draw the other two tables.
  • Now, consider what will happen if the TP is missing in the second table. That will result in default present tense. But having TP also results in present tense. So, the outputs of the first and third possibilities (TP&AgrP and AgrP only) for non3sg & present are the same. I indicated this by marking them with asterisks. We'll need to know how often an input of non3sg & present comes out as non3sg present, so we'll need to add the percentages for the two lines when we tell Praat how many to expect.
    • So we expect 10.6% (1.8% + 8.8%) of the kid's verbs to be attempts at non3sg-present which come out as non3sg-present. (And we expect 10.7% of the kid's verbs to be attempts at non3sg-present which come out as 3sg-present, and 4.7% of the kid's verbs to be attempts at non3sg-present which come out as NRFs).
  • Put the asterisks in the other two tables and add up the percentages (from the last column) for all asterisked rows in each table. We'll need these sums later.
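The combining step in Step 4a (multiply the input likelihood by each structure likelihood, then add together rows that sound the same) can be sketched in Python. The names `STRUCTURE_PCT`, `surface_form`, and `outcome_weights` are invented for this illustration. The sketch rounds each row to one decimal place before adding, which is an assumption about how the figures in the tables were obtained.

```python
# Structure likelihoods (in percent) from the first table in Step 4a.
STRUCTURE_PCT = {"TP and AgrP": 7, "TP only": 41, "AgrP only": 34, "neither": 18}


def surface_form(agr, tense, structure):
    """Missing AgrP gives default 3sg; missing TP gives default present."""
    if structure == "neither":
        return "nrf"
    out_agr = agr if structure in ("TP and AgrP", "AgrP only") else "3sg"
    out_tense = tense if structure in ("TP and AgrP", "TP only") else "pres"
    return f"{out_agr}-{out_tense}"


def outcome_weights(agr, tense, input_pct):
    """Return a weight in tenths of a percent for each surface form,
    adding together rows that come out sounding the same."""
    totals = {}
    for structure, pct in STRUCTURE_PCT.items():
        # e.g. 26% x 7% = 1.82%, i.e. 18.2 tenths of a percent, rounded to 18
        row = round(input_pct * pct / 10)
        form = surface_form(agr, tense, structure)
        totals[form] = totals.get(form, 0) + row
    return totals


print(outcome_weights("non3sg", "pres", 26))
```

For the non3sg & pres input this gives 106 for non3sg-pres, 107 for 3sg-pres, and 47 for nrf, matching the 10.6%, 10.7%, and 4.7% worked out above. A different rounding convention can shift the last digit, so small differences from hand-computed figures are to be expected.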

Step 4b: Simulate stage 4b: create french4b.praat.txt.

  • Make a copy of your french.praat.txt file and name the copy something like french4b.praat.txt.
  • We're going to alter the PairDistribution by adding the options that kids seem to have (but adults don't). To do this, we need to add some more lines.
    • In particular, any time the "intent" (the underlying form) has either non-3sg or non-present, we need to copy that line so there are two possibilities (either they get it or they get a default).
    • Change the PairDistribution lines at the bottom of the file so they look like this (although we'll be changing the numbers for each):
        "3sg-pres" "3sg-pres" 100
        "3sg-pres" "nrf" 100
        "3sg-nonpres" "3sg-nonpres" 100
        "3sg-nonpres" "3sg-pres" 100
        "3sg-nonpres" "nrf" 100
        "non3sg-pres" "non3sg-pres" 100
        "non3sg-pres" "3sg-pres" 100
        "non3sg-pres" "nrf" 100
        "non3sg-nonpres" "non3sg-nonpres" 100
        "non3sg-nonpres" "3sg-nonpres" 100
        "non3sg-nonpres" "non3sg-pres" 100
        "non3sg-nonpres" "nrf" 100
    • Now, we need to fill in the numbers from our last four tables.
    • This should be pretty self-explanatory. Just change the 100s to numbers corresponding to your percentages from the tables. From the numbers I provided, you would put in the numbers below. Don't forget to take the "asterisked" lines from the tables above into account -- so the first number below is 106 because the sum of "TP & AgrP" and "AgrP alone" in the table corresponding to "non3sg-present" added up to 10.6%:
        "non3sg-pres" "non3sg-pres" 106
        "non3sg-pres" "3sg-pres" 107
        "non3sg-pres" "nrf" 47
        "non3sg-nonpres" "non3sg-nonpres" 7
        "non3sg-nonpres" "3sg-nonpres" 45
        "non3sg-nonpres" "non3sg-pres" 37
        "non3sg-nonpres" "nrf" 20
    • Be sure to change the number above these lines (it used to be 4) to 12, since we now have 12 pairs.
      • Double-check: If you add up all of the numbers at the end of the lines, it should total very nearly 1000, representing 100%. At the precision I was working with, I came up with 998.
  • Load your french4b.praat.txt file into Praat (the same way you loaded the previous one) and do the same thing we did before.
    • First, check the percent correct at the outset.
    • Then, learn... with the default settings.
    • Then, check the percent correct after learning.
    • Select just the OTGrammar object for French 4b, which should give you, among other things, a button called "Edit". Click Edit and take a look at the rankings. Write down the ranking values Praat came up with for the four constraints.
  • Praat doesn't do a very good job of predicting this stage, does it? Look at what it figured out. There's a statistical term for its level of performance.
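The double-check in Step 4b (the weights should total very nearly 1000) is easy to get wrong by hand, so here is a small Python sketch of it. The function name `total_weight` is invented for this example, and the sketch assumes each pair sits on its own line as two quoted strings followed by a number, as in the listings above. The sample lines are the seven given in Step 4b, so the total printed here is only a partial sum, not the full 12-pair total.

```python
import shlex


def total_weight(lines):
    """Add up the trailing weight on every line that has the shape
    "intended" "produced" weight; other lines are skipped."""
    total = 0.0
    for line in lines:
        fields = shlex.split(line)  # handles the double-quoted strings
        if len(fields) == 3:
            total += float(fields[2])
    return total


sample = [
    '"non3sg-pres" "non3sg-pres" 106',
    '"non3sg-pres" "3sg-pres" 107',
    '"non3sg-pres" "nrf" 47',
    '"non3sg-nonpres" "non3sg-nonpres" 7',
    '"non3sg-nonpres" "3sg-nonpres" 45',
    '"non3sg-nonpres" "non3sg-pres" 37',
    '"non3sg-nonpres" "nrf" 20',
]
print(total_weight(sample))
```

Run on all 12 lines of your own file, the total should come out close to 1000.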

Step 5: What went wrong?

  • Consider how Legendre et al. predict the percentages for stage 4b. There, we use a ranking scheme like this:
    Fixed:      *F2 >> *F
    Floating:   ParseT
                ParseA
  • This means that ParseT can sometimes outrank *F2, sometimes be ranked between *F2 and *F, and sometimes be ranked below *F. The same for ParseA. You can consult the paper for the details.
  • What differs between Boersma's system and Legendre et al.'s is that Boersma's constraints are evaluated with noise at particular places on a number line. Boersma's constraints are drawn as normal distributions (bell curves) on a horizontal line. The more overlap between the distributions, the more likely two constraints might switch rankings on any given evaluation.
  • There is only one level of noise (the "2.0" level you enter in the "learn" box, for example). This noise governs the evaluation of all constraints.
  • That's the key to the reason why Boersma's system can't learn this grammar while Legendre et al.'s grammar matches the corpus very closely.
  • The hard question: Explain why Boersma's system can't find a solution.
  • Followup: What might you have to do to make it possible for Boersma's system to find a solution? What might you try?
    • I can think of two answers. One is obvious given what I said above, but another thing to consider is the assumptions about the grammar itself. How much of this depends on our four constraints?
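To get a feel for how noisy evaluation behaves, here is a rough Python sketch of the idea described in Step 5. It is not Praat's implementation: it assumes the noise is Gaussian with the noise value as its standard deviation, added independently to each ranking value at every evaluation, and the names `evaluate` and `output_frequencies` are invented here. The candidates are the four outcomes for the input non3sg-nonpres, with violations in the order ParseA, ParseT, *F, *F2.

```python
import random

CONSTRAINTS = ["ParseA", "ParseT", "*F", "*F2"]

# Candidates for the input "non3sg-nonpres" with their violation counts.
CANDIDATES = {
    "non3sg-nonpres": (0, 0, 2, 1),  # TP and AgrP
    "non3sg-pres": (0, 1, 1, 0),     # AgrP only
    "3sg-nonpres": (1, 0, 1, 0),     # TP only
    "nrf": (1, 1, 0, 0),             # neither
}


def evaluate(ranking, noise, rng):
    """Pick one winner: perturb each ranking value, order the constraints
    from highest to lowest, and compare violations in that order."""
    noisy = [ranking[c] + rng.gauss(0, noise) for c in CONSTRAINTS]
    order = sorted(range(len(CONSTRAINTS)), key=lambda i: -noisy[i])
    return min(CANDIDATES, key=lambda cand: tuple(CANDIDATES[cand][i] for i in order))


def output_frequencies(ranking, noise=2.0, trials=10000, seed=0):
    """Estimate how often each candidate wins over many evaluations."""
    rng = random.Random(seed)
    counts = dict.fromkeys(CANDIDATES, 0)
    for _ in range(trials):
        counts[evaluate(ranking, noise, rng)] += 1
    return {cand: n / trials for cand, n in counts.items()}


# All constraints at 100 (the initial state): the ranking changes freely.
print(output_frequencies({c: 100.0 for c in CONSTRAINTS}))
```

Trying different ranking values (for example, the ones Praat reports in the Edit window) shows how the spacing between constraints relative to the single noise level controls how often each outcome appears.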

Step 6: Turn it in.

  • What I want you to turn in is (a printout or a file):
    • Your files for the adults and for stage 4b (french.praat.txt, french4b.praat.txt) so I can see the numbers you used in the PairDistribution.
    • The percent correct Praat gave you before and after running the adult and the stage 4b grammars.
    • The rankings Praat gave you after learning for both the adult and stage 4b grammars (one number for each of the four constraints, for each grammar); this number comes from the Edit window you get clicking Edit when the grammar is selected.
    • The five tables you filled in from Step 4a.
    • Your answer to the "hard question" in Step 5.