
Lab 4: Praat and the GLA lab
This page has three parts:
Introductory notes.
We'll be exploring learnability in OT using the ranking software included
in Praat.
There is a "competitor" called OTSoft,
which is a program that can also do ranking/learning in Optimality Theory.
Currently, it is PC-only, although I have contacted Bruce Hayes about
the possibility of porting it to REALbasic
(which would allow it to run on either Mac or Windows machines). I have not
had time to finish the port (I originally started it 2 years ago!), and
Praat is for the moment more refined. Feel free to explore OTSoft anyway,
of course.
In-class exercises
The task: Run Praat on (basically) Anttila's data from Finnish plurals.
First step: Get Praat
- It should be installed on the computers in the lab. If it isn't there (or if you aren't there and you want to do the take-home part of the lab), go to the Praat page and download it.
- The Windows download results in a "self-extracting
archive" which you can double-click to produce the Praat application.
The Mac download results in a "Stuffit" archive, which you
should be able to double-click and decompress. There is currently a note on the Mac archive that it won't decompress with Stuffit Expander 8, but I had no trouble with Stuffit Expander 8.0.2.
- Also note: The Praat program is very frequently updated. If you download the program yourself, you will almost certainly have a newer version than the one installed on the lab machines. As of the date of class, it was updated 13 days ago. This shouldn't matter (much, I hope).
Second step: Get the data on Finnish plurals.
These files come from Paul
Boersma's GLA page. What we want is Anttila's data, Anttila_data.txt (or the local copy).
Right-click (or Ctrl-click) on that link, choose Save Target As... or
Download Link to Disk. You might want to make a folder on the desktop (or on the file server if you've connected your account) to hold your files. Wherever you put it, just make sure you know where it is.
Third step: Start Praat.
- Double-click on the funny mouth+ear icon
Fourth step: Create the NoCoda grammar
- In the Objects window, choose New>Optimality Theory>Create NoCoda
Grammar.
Fifth step: Load Anttila data.
- From the Read menu, choose Read from file... and find the Anttila_data.txt
file you downloaded in step 2.
- Choose OTGrammar Anttila and click Edit to see an overview of the
grammar. Check to be sure that the ranking values are all at 100 (the
initial state).
Sixth step: Create the "primary linguistic data".
- In the Objects window, choose PairDistribution Anttila. This is a
survey of how frequent each pair of input and output is in the corpus
that Anttila studied. Praat can take this and create a set of input
and output strings that match the statistical properties of Anttila's
original corpus.
- Now, let's learn. Click on OTGrammar Anttila, then control-click (Windows)
or command-click (Mac) on PairDistribution Anttila so that both are
selected.
- Click Learn... We'll use the defaults, 10000 iterations per plasticity,
starting at a plasticity of 1 and going down 0.1 each round for four
rounds (so, a total of 40000 data points considered).
And we'll do some other stuff...
Take-home exercises
Your task: Try the GLA on the French NRF data we talked
about in class last week. The paper that discusses the theoretical background
(in terms of Reynolds-style floating constraints) is in PDF form here
as Legendre-etal-NRFs.pdf.
It might be useful to at least skim this paper. I will mark things in
red that you should write
down/print out for handing in with the lab writeup. There's a summary
at the end of this web page.
Step 1: Consider the problem; work out the input distribution that the "kid" will "hear."
- According to Table 4 in the Legendre et al. paper, adults used on
average 31% non-present forms and 37% non-3sg forms.
- Assume that tense and agreement are fully independent (that is, the
likelihood of seeing a non-present form does not depend on whether the
form is also non-3sg, and vice-versa).
- Given this, determine what the percentage of
each of the following should be in adult speech (you don't need to hand
this in, but you'll copy these numbers into a table later).
- non-present 3sg forms?
- The likelihood of non-present 3sg forms is the likelihood
of a non-present form (31%) times the likelihood of a 3sg form
(63%). And so forth.
- non-present non-3sg forms?
- present 3sg forms?
- present non-3sg forms?
- NRFs? (non-finite root form; a root
infinitive)
- The assumption behind our predictions is that kids intend
to use non-present 3sg forms, for example, as often as adults do. Yet
sometimes, functional projections will be missing, and the resulting
form that the child produces will be a default form (present tense if
TP is missing, 3sg if AgrP is missing).
- So, for every input, which will have tense (either non-present, or
present) and agreement (either non-3sg, or 3sg), there are four possible
output forms:
- TP and AgrP (the tense form in the input, the agr form in the
input)
- TP only (the tense form in the input, and default 3sg)
- AgrP only (default present tense, and the agr form in the input)
- neither TP nor AgrP (NRF form)
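The independence assumption in this step can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the two rates are the ones quoted above, and the labels (e.g. "non3sg-nonpres") simply follow the naming used in the skeleton file.

```python
# Adult rates quoted in this step (from Table 4 of the Legendre et al. paper).
P_NONPRES = 0.31  # non-present forms
P_NON3SG = 0.37   # non-3sg forms


def joint_distribution(p_nonpres, p_non3sg):
    """Probability of each agreement/tense combination, assuming independence."""
    tense = {"nonpres": p_nonpres, "pres": 1 - p_nonpres}
    agr = {"non3sg": p_non3sg, "3sg": 1 - p_non3sg}
    # Under independence, each joint probability is the product of the two rates.
    return {f"{a}-{t}": pa * pt for a, pa in agr.items() for t, pt in tense.items()}


dist = joint_distribution(P_NONPRES, P_NON3SG)
for label, p in dist.items():
    print(f"{label}: {p:.1%}")
```

The four values should sum to 100%, which is a quick way to check your own hand calculation.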
Step 2a: Creating the french.praat.txt file: Get the skeleton file and look
at how it works.
- Download this skeleton file skeleton.praat.txt
(right-click, or control-click and save to disk).
- Make a copy and call the copy french.praat.txt.
- Open the file in a text editor (WordPad, SimpleText, BBEdit).
- Look at how the constraints work. There are four constraints:
- Each is ranked somewhere -- higher numbers correspond to higher-ranked
(more important) constraints.
- How are they ranked initially?
- Why are they ranked that way?
Step 2b: Creating the french.praat.txt file: Entering tableaux
- The skeleton file has one input. We need more. The one I've given you
a model of is "non3sg-nonpres". What the other ones are should
be pretty obvious (given that one is "non3sg-nonpres").
- What these represent are the different possible things the child
could be trying to say. So, the one I gave you represents the situation
when the child is trying to say something in the non-present with
a non-3sg subject (e.g., 1sg past).
- For each input (things the child could be trying to say), there are
four possible outcomes (given at the end of step 1, above).
- Looking at the input I provided, notice that after the label of the
input (e.g., "non3sg-nonpres"), the number of outcomes is
given (i.e. 4). This is the number of different ways a kid might actually
end up producing an intended 'non3sg-nonpres'.
- Underneath the number, there is one line for each candidate (each possible output). The first
thing in the line is how it comes out (e.g., "non3sg-pres")
followed by as many numbers as there are constraints. Each number after
the name represents how many violations this candidate incurs according
to the constraints (in the order they were listed earlier in the file).
In English, what you should get from the example I provided is:
- Non3sg-nonpres satisfies both ParseT and ParseA, violates *F twice,
and *F2 once.
- Non3sg-pres satisfies ParseA but violates ParseT, violates *F,
but satisfies *F2. (Note: pres here means the default; TP is missing
for this candidate).
- To do: Create the missing inputs.
They should be modeled on input 1 but for the other possible things
(other than non3sg-nonpres) that a kid might try to say.
- When you're done, you also need to make sure that the 1 in the line
above the inputs (that used to read "1 inputs" in the skeleton
file), is changed to reflect the total number of inputs you have when
you are finished.
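To see what the violation counts in a tableau do, here is a minimal Python sketch of strict-ranking evaluation. It uses only the two candidates described in English above; the two rankings at the bottom are hypothetical and chosen purely for illustration, and none of this is Praat's own file format or code.

```python
# Violation counts for the two candidates described in the handout
# (input: non3sg-nonpres). The other two candidates are left out here.
CANDIDATES = {
    "non3sg-nonpres": {"ParseT": 0, "ParseA": 0, "*F": 2, "*F2": 1},
    "non3sg-pres":    {"ParseT": 1, "ParseA": 0, "*F": 1, "*F2": 0},
}


def winner(candidates, ranking):
    """Return the candidate with the best violation profile.

    `ranking` lists constraint names from highest- to lowest-ranked.
    Comparing violation tuples lexicographically implements strict domination:
    a single violation of a higher constraint outweighs any number lower down.
    """
    return min(candidates, key=lambda name: tuple(candidates[name][c] for c in ranking))


# Two hypothetical rankings, for illustration only.
faithfulness_first = ["ParseT", "ParseA", "*F2", "*F"]
markedness_first = ["*F2", "*F", "ParseT", "ParseA"]

print(winner(CANDIDATES, faithfulness_first))
print(winner(CANDIDATES, markedness_first))
```

With the Parse constraints on top, the fully inflected candidate wins; with *F2 on top, the candidate lacking TP wins instead.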
Step 2c: Creating the french.praat.txt file: Corpus distribution
Step 3: Load it and run the learning algorithm.
- Go into Praat and use "Read">"Read from file..."
in the Object window and load your file. It should load without errors
(if not, go back and check to make sure you didn't make any typos).
- If you have a lot of trouble getting your file to load, email
it to me and maybe I can help troubleshoot.
- First, make sure that both the OTGrammar French and PairDistribution
French are selected.
- They should be after you load your file, but if they aren't, you
should be able to click on one and then hold down shift and click
on the other one to get both selected.
- You should see two buttons, "Learn..." and "Get Fraction
correct..."
- Click "Get Fraction correct..." and let it run for 100000
iterations at 2.0 noise (the defaults) and make
a note of what percentage of the corpus it could predict. The
number should be quite low.
- Now, click "Learn..." and run it with the default values
(4 plasticities, 100000 iterations per plasticity, 2.0 noise, 0.1 decrement
per plasticity, 1 chew, respect local rankings, strategy "symmetric
all").
- Get Fraction correct... again and note the
improvement. It should be dramatic.
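What "Learn..." does on each data point can be sketched roughly as follows. This is a simplified, hedged illustration of one error-driven update in the style of the Gradual Learning Algorithm, assuming that the "symmetric all" strategy promotes every constraint that prefers the form the learner heard and demotes every constraint that prefers the form the learner produced. It is not Praat's actual code, and the starting ranking values are hypothetical.

```python
def gla_update(ranking, heard, produced, plasticity):
    """Return new ranking values after one error-driven update.

    ranking:  constraint name -> ranking value
    heard:    violation counts of the form in the data (the target)
    produced: violation counts of the learner's own (wrong) output
    """
    updated = dict(ranking)
    for c in ranking:
        if heard[c] > produced[c]:
            # This constraint prefers the learner's wrong output: demote it.
            updated[c] -= plasticity
        elif produced[c] > heard[c]:
            # This constraint prefers the heard form: promote it.
            updated[c] += plasticity
    return updated


# Hypothetical starting point: all constraints at the same ranking value.
ranking = {"ParseT": 100.0, "ParseA": 100.0, "*F": 100.0, "*F2": 100.0}
heard = {"ParseT": 0, "ParseA": 0, "*F": 2, "*F2": 1}     # non3sg-nonpres
produced = {"ParseT": 1, "ParseA": 0, "*F": 1, "*F2": 0}  # non3sg-pres

new_ranking = gla_update(ranking, heard, produced, plasticity=1.0)
print(new_ranking)
```

A larger plasticity moves the ranking values further on each error, which is why learning typically starts with a large plasticity and finishes with a small one.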
Step 4a: Simulate stage 4b: figure out the target percentages
- Now, we're going to simulate child stage 4b, and run the learning
algorithm on it to see what ranking it comes up with.
- To accomplish what we want to do, we need to edit the PairDistribution
at the bottom of the file so that it reflects what the kids at stage
4b are doing (rather than what the adults produce).
- Let's try to figure out the numbers to put in there.
- According to the Legendre et al. paper, adults use non-3sg 37% of the
time (thus 3sg 63% of the time), and non-present 31% of the time (thus
present 69% of the time). We will assume that what kids mean
to say has these same proportions, but sometimes they end up producing
defaults (3sg, present, or NRFs). At stage 4b, kids are producing 15%
non-3sg and 15% non-present, so not quite the adult level. Plus, they
are producing 18% NRFs.
- So, for agreement the target is 37% non-3sg. 18% of those come out
as NRFs, 41% (15/37) come out as non-3sg, and the remaining 41% come
out as default 3sg. For tense the target is 31% non-present. 18% of
those come out as NRFs, 48% (15/31) come out as non-present, and the
remaining 34% come out as default present. This reduces to the following
likelihoods for finding each of TP and AgrP:
| TP and AgrP | 7% |
| TP only | 41% |
| AgrP only | 34% |
| neither TP nor AgrP | 18% |
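The arithmetic behind this table can be reproduced with a short Python sketch. One assumption is labeled in the code: the 7% for "TP and AgrP" is read here as whatever remains after the other three outcomes are subtracted from 100%, which matches the numbers in the table but is an interpretation of "reduces to".

```python
ADULT_NON3SG, ADULT_NONPRES = 37, 31  # what kids are assumed to intend (percent)
CHILD_NON3SG, CHILD_NONPRES = 15, 15  # what kids produce at stage 4b (percent)
NRF = 18                              # NRF rate at stage 4b (percent)


def default_rate(target, produced, nrf):
    """Percent of targets that surface as the default form."""
    realized = round(100 * produced / target)  # e.g. 15/37 -> 41
    return 100 - nrf - realized


# A missing AgrP gives default 3sg, so the structure has TP only.
tp_only = default_rate(ADULT_NON3SG, CHILD_NON3SG, NRF)
# A missing TP gives default present, so the structure has AgrP only.
agrp_only = default_rate(ADULT_NONPRES, CHILD_NONPRES, NRF)
# Assumption: "TP and AgrP" is the remainder once the other three are removed.
both = 100 - tp_only - agrp_only - NRF

print(both, tp_only, agrp_only, NRF)
```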
- And, now, let's figure out how often kids (and adults) are aiming for each of the four possible combinations of non-3sg/3sg and nonpresent/present.
Draw a table like this:
| non3sg & nonpres | 37% x 31% | 11% |
| non3sg & pres | 37% x 69% | 26% |
| 3sg & nonpres | | |
| 3sg & pres | | |
- Fill in the missing cells in the table. These numbers may seem familiar; you worked them out earlier.
- Now, we can combine the two to figure out the percentages for each
of the inputs (from the second table) of each form (from the first
table). Draw four tables; I've drawn two of them below. The asterisks
will be explained shortly. In the first table below, "non3sg & nonpres" is the input. This is what the kid is trying to say. The kid tries to say "non3sg & nonpres" 11% of the time according to the second table above. There are four possible outcomes when the kid tries to say something (the structure has TP only, AgrP only, both, or neither). From the first table above, the likelihood of having TP and AgrP together is 7%. So the overall likelihood of this input surfacing with this candidate chosen is 7% x 11%, or 0.7%.
| non3sg & nonpres | 11% | TP&AgrP | 7% | 0.7% |
| | | TP only | 41% | 4.5% |
| | | AgrP only | 34% | 3.7% |
| | | neither (NRF) | 18% | 2.0% |

| non3sg & pres | 26% | TP&AgrP * | 7% | 1.8% |
| | | TP only | 41% | 10.7% |
| | | AgrP only * | 34% | 8.8% |
| | | neither (NRF) | 18% | 4.7% |
- Draw the other two tables.
- Now, consider what will happen if the TP is missing in the second
table. That will result in default present tense. But having
TP also results in present tense. So, the output of both the first and
third possibilities (TP&AgrP and AgrP only) for non3sg & present
are the same. I indicated this by marking them with asterisks. We'll
need to know how often an input of non3sg & present comes out as
non3sg present, so we'll need to add the percentages for the two lines
when we tell Praat how many to expect.
- So we expect 10.6% (1.8% + 8.8%) of the kid's verbs to be attempts
at non3sg-present which come out as non3sg-present. (And we expect
10.7% of the kid's verbs to be attempts at non3sg-present which
come out as 3sg-present, and 4.7% of the kid's verbs to be attempts
at non3sg-present which come out as NRFs).
- Put the asterisks in the other two tables and
add up the percentages (from the last column) for all asterisked rows in each table;
we'll need this later.
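The bookkeeping for the "non3sg & pres" table, including the merging of asterisked rows, can be sketched in Python as below. The helper names are made up for this illustration. Working in whole percents and rounding each row to a tenth of a percent before adding mirrors how the numbers in the table were combined, and the results come out directly as counts per 1000.

```python
INPUT_SHARE = 26  # percent of intended forms that are non3sg & pres
OUTCOME = {"TP&AgrP": 7, "TP only": 41, "AgrP only": 34, "neither": 18}


def surface_form(agr, tense, outcome):
    """What the child actually says, given which projections are present."""
    if outcome == "neither":
        return "nrf"
    if outcome == "TP only":    # AgrP missing: agreement defaults to 3sg
        agr = "3sg"
    if outcome == "AgrP only":  # TP missing: tense defaults to present
        tense = "pres"
    return f"{agr}-{tense}"


def per_thousand(input_share, outcome_share):
    """Percent x percent, rounded half up to a tenth of a percent (count per 1000)."""
    return (input_share * outcome_share + 5) // 10


totals = {}
for outcome, share in OUTCOME.items():
    form = surface_form("non3sg", "pres", outcome)
    # Rows with the same surface form (the asterisked ones) are added together.
    totals[form] = totals.get(form, 0) + per_thousand(INPUT_SHARE, share)

print(totals)
```

The same loop can be reused for the other inputs by changing `INPUT_SHARE` and the first two arguments of `surface_form`.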
Step 4b: Simulate stage 4b: create french4b.praat.txt.
- Make a copy of your french.praat.txt file and
name the copy something like french4b.praat.txt.
- We're going to alter the PairDistribution by adding the options that
kids seem to have (but adults don't). To do this, we need to add some
more lines.
- In particular, any time the "intent" (the underlying form) has
either non-3sg or non-present, we need to copy that line so there
are two possibilities (either they get it or they get a default).
- Change the PairDistribution lines at the
bottom of the file so they look like this (although we'll be changing
the numbers for each):
- "3sg-pres" "3sg-pres" 100
"3sg-pres" "nrf" 100
"3sg-nonpres" "3sg-nonpres" 100
"3sg-nonpres" "3sg-pres" 100
"3sg-nonpres" "nrf" 100
"non3sg-pres" "non3sg-pres" 100
"non3sg-pres" "3sg-pres" 100
"non3sg-pres" "nrf" 100
"non3sg-nonpres" "non3sg-nonpres" 100
"non3sg-nonpres" "3sg-nonpres" 100
"non3sg-nonpres" "non3sg-pres" 100
"non3sg-nonpres" "nrf" 100
- Now, we need to fill in the numbers from our last four tables.
- This should be pretty self-explanatory. Just change
the 100s to numbers corresponding to your percentages from the tables.
From the numbers I provided, you would put in the numbers below.
Don't forget to take the "asterisked" lines from the tables
above into account -- so the first number below is 106 because the
sum of "TP & AgrP" and "AgrP alone" in the
table corresponding to "non3sg-present" added up to 10.6%:
- "non3sg-pres" "non3sg-pres" 106
"non3sg-pres" "3sg-pres" 107
"non3sg-pres" "nrf" 47
"non3sg-nonpres" "non3sg-nonpres" 7
"non3sg-nonpres" "3sg-nonpres" 45
"non3sg-nonpres" "non3sg-pres" 37
"non3sg-nonpres" "nrf" 20
- Be sure to change the number above
these lines (it used to be 4) to 12, since we now have 12 pairs.
- Double-check: If you add up all of the numbers at
the end of the lines, it should total very nearly 1000, representing
100%. At the precision I was working with, I came up with 998.
- Load your french4b.praat.txt file into Praat (the same way you loaded
the previous one) and do the same thing we did before.
- First, check the percent correct at the outset.
- Then, click "Learn..." with the default settings.
- Then, check the percent correct after learning.
- Select just the OTGrammar object for French 4b, which should give
you, among other things, a button called "Edit". Click Edit
and take a look at the rankings. Write down
the ranking values Praat came up with for the four constraints.
- Praat doesn't do a very good job at predicting this stage, does it?
Look at what it figured out. There's a statistical term for its level
of performance.
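The double-check on the PairDistribution numbers from this step can be automated with a small Python sketch. It reads only the pair lines themselves (two quoted labels and a number), not a complete Praat file, and the seven lines shown are just the ones given above for the non-3sg inputs.

```python
import shlex

# The seven pair lines given in this step (non-3sg inputs only).
PAIR_LINES = '''
"non3sg-pres" "non3sg-pres" 106
"non3sg-pres" "3sg-pres" 107
"non3sg-pres" "nrf" 47
"non3sg-nonpres" "non3sg-nonpres" 7
"non3sg-nonpres" "3sg-nonpres" 45
"non3sg-nonpres" "non3sg-pres" 37
"non3sg-nonpres" "nrf" 20
'''


def parse_pairs(text):
    """Turn pair lines into (intended, produced, weight) tuples."""
    pairs = []
    for line in text.strip().splitlines():
        intended, produced, weight = shlex.split(line)  # shlex strips the quotes
        pairs.append((intended, produced, int(weight)))
    return pairs


pairs = parse_pairs(PAIR_LINES)
total = sum(weight for _, _, weight in pairs)

# Subtotal per intended form, to compare against the input percentages.
by_input = {}
for intended, _, weight in pairs:
    by_input[intended] = by_input.get(intended, 0) + weight

print(total, by_input)
```

These seven lines alone should come to roughly 370 (the 37% of intended forms that are non-3sg); once the five 3sg lines are added, the grand total should be very nearly 1000.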
Step 5: What went wrong?
- Consider how Legendre et al. predict the percentages for stage 4b.
There, we use a ranking scheme like this:
| Fixed: |
|
*F2 |
>> |
*F |
| Floating: |
ParseT |
|
| |
ParseA |
|
- This means that ParseT can sometimes outrank *F2, sometimes be ranked
between *F2 and *F, and sometimes be ranked below *F. The same for ParseA.
You can consult the paper for the details.
- What differs between Boersma's system and Legendre et al.'s is that
Boersma's constraints are evaluated with noise at particular places
on a number line. Boersma's constraints are drawn as normal distributions
(bell curves) on a horizontal line. The more overlap between the distributions,
the more likely two constraints might switch rankings on any given evaluation.
- There is only one level of noise (the "2.0" level you enter
in the "learn" box, for example). This noise governs the evaluation
of all constraints.
- That's the key to the reason why Boersma's system can't learn this
grammar while Legendre et al.'s grammar matches the corpus very closely.
- The hard question: Explain why Boersma's system
can't find a solution.
- Followup: What might you have to do to make it possible for Boersma's
system to find a solution? What might you try?
- I can think of two answers. One is obvious given what I said above,
but another thing to consider is the assumptions about the grammar
itself. How much of this depends on our four constraints?
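The noisy evaluation described in this step can be sketched in Python. This is a minimal illustration, assuming that each constraint's ranking value receives independent Gaussian noise with the same standard deviation at every evaluation; the ranking values 100 and 98 are hypothetical.

```python
import math
import random


def swap_probability(higher, lower, noise):
    """Chance that the lower-ranked constraint outranks the higher one at one evaluation.

    The difference of two independent Gaussian noises has standard deviation
    noise * sqrt(2), so the swap chance is the upper tail of a normal
    distribution at (higher - lower) / (noise * sqrt(2)).
    """
    z = (higher - lower) / (noise * math.sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2))


def simulate(higher, lower, noise, trials=100_000, seed=0):
    """Estimate the same chance by adding noise to both values many times."""
    rng = random.Random(seed)
    swaps = sum(rng.gauss(lower, noise) > rng.gauss(higher, noise) for _ in range(trials))
    return swaps / trials


print(swap_probability(100, 98, 2.0))
print(simulate(100, 98, 2.0))
print(swap_probability(100, 90, 2.0))
```

Because the single noise level applies to every constraint, the only thing that changes how often two constraints swap is the distance between their ranking values.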
Step 6: Turn it in.
- What I want you to turn in is (a printout or a file):
- Your files for the adults and for stage 4b (french.praat.txt, french4b.praat.txt)
so I can see the numbers you used in the PairDistribution.
- The percent correct Praat gave you before and after running the
adult and the stage 4b grammars.
- The rankings Praat gave you after learning for both the adult and
stage 4b grammars (one number for each of the four constraints, for
each grammar); this number comes from the Edit window you get clicking
Edit when the grammar is selected.
- The five tables you filled in from Step 4a.
- Your answer to the "hard question" in Step 5.