CAS LX 522 Syntax I
Lab 8: Perl and data analysis


Introduction

When we finished up our work with Linger, we ended up with some data files. We'd like to know what they tell us, what we can conclude about how people's reading times were affected by the differences in the experimental items.

One way to figure that out is to run the data through Lingalyzer, but what does Lingalyzer actually do with it?

In the last two weeks of the semester, we're going to try to tackle the problem of analyzing the data without Lingalyzer, to try to get a feel for how the data is actually analyzed, and also to get a feel for how one might go about this type of problem.

Useful links

The data files from our Linger experiment are here: binding3data-take2.tar.gz. You'll want to have those somewhere handy.

The official perl site is http://www.perl.org/, and you should be able to download perl from there if it isn't already installed on your computer. Note that acs.bu.edu has it installed as well, so you should be able to do much of what we're doing now over telnet/ssh.

Presentation about basic perl

Gregory Garretson's presentation about the basics of perl (also available 6-up)

Exercises and homework

Below are a number of tasks, each describing a small perl program to write that computes averages. They are arranged in increasing order of difficulty, with the hardest one last. Depending on how much experience you've had with programming, you may find that you reach your limit before you get through them all. So, do as many as you can without spending an unreasonable amount of time on them. When you are through, email your programs to me, and that is the lab.

Averages and arrays

It is very common to want to find the average of a list of numbers. Perl can help.

Type in this much of the program. Understand what it does so far.

#!/bin/perl -w
@list_of_numbers = (1, 2, 3, 5, 7, 11, 13, 17);

What we want to do is come up with an average of those numbers. How would you compute the average? Do that on paper or with a calculator. See what you get. (Hint: It should be 7.375.) Now, tell perl to do exactly the same thing: Add up all of the numbers, divide the sum by the number of numbers.

Think about what the program should do abstractly first. We want the sum of all of the numbers, and then we want to divide by the number of numbers. An easy way to do this is to use a scalar variable to keep track of the sum so far (we might name it $running_sum). Then, for each number in the list, we add that number to our running sum, and in the end, the running sum is in fact the sum of all of the numbers. We might write that in "pseudo-code" like this:

  • set @list_of_numbers array to a list of numbers.
  • set $running_sum to 0.
  • set $divisor to 0.
  • for each number in @list_of_numbers,
    • add it to $running_sum.
    • add 1 to $divisor.
    • continue through the list until we've worked with every number in @list_of_numbers.
  • display $running_sum / $divisor.

This is not perl; these are just English instructions, designed to make it easier to see the structure of the program we'll write.

Your task: turn this pseudo-code into perl. Try to use foreach and ++ in your program. The first two lines of the program were already given above.

To get started: Open a text document just as we did in class (in TextEdit on the Mac, making sure that it is in "plain text" mode, or in NotePad on Windows, or emacs or something similar on a Linux machine or on acs.bu.edu). Type the first two lines from above.

Notes: To add $number to $running_sum, you can type either of the following (they are equivalent):

$running_sum = $running_sum + $number;
$running_sum += $number;

Similarly, to add one to $divisor, you can type either of the following (they are equivalent):

$divisor = $divisor + 1;
$divisor ++;
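To see that the long and short forms really do behave the same way, here is a minimal sketch that runs both side by side inside a foreach loop. The list (2, 4, 6) and the variable names $sum_long, $sum_short, $count_long, and $count_short are made up for this illustration; they are not part of the lab program.

```perl
# A made-up list, just for this illustration.
@sample = (2, 4, 6);

# Two running sums and two counters, one per notation.
$sum_long = 0;
$sum_short = 0;
$count_long = 0;
$count_short = 0;

foreach $number (@sample) {
    $sum_long = $sum_long + $number;   # long form
    $sum_short += $number;             # short form, same effect
    $count_long = $count_long + 1;     # long form
    $count_short++;                    # short form, same effect
}

print "sums: $sum_long $sum_short\n";        # sums: 12 12
print "counts: $count_long $count_short\n";  # counts: 3 3
```

Both columns come out the same, so which form you use is a matter of taste.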

Getting data from the keyboard

Once you have your code above working, and telling you that the numbers in the list average to 7.375 (meaning, the program works and gives you the right result), try to change your program to let you type in the numbers from the keyboard.

Remember that to get input from the keyboard, you can use

$entry = <STDIN>;
chomp ($entry);

The plan for this program is to build the @list_of_numbers array from what you enter, rather than specifying it directly in the program. We will make the assumption that we want to average 8 numbers.

Specifically, what we'll want to do is:

  • set @list_of_numbers to be an empty array.
  • start a loop that will iterate 8 times
    • get a value from the keyboard.
    • add the number the user typed to @list_of_numbers.
  • once all the numbers are entered, the program is just the same as in task 1.

What you are going to write here basically replaces the line in your previous program that read @list_of_numbers = (1, 2, 3, 5, 7, 11, 13, 17);, but the rest of the program remains as it was in task 1.

To make a loop that runs 8 times, we want to use for. Specifically, you want something like:

for ($i = 1; $i <= 8; $i++) {
    # code in your loop
}

There are two ways to add the number entered to @list_of_numbers. If you use the for code above, the scalar $i will hold a value that tells us how many times we have started the loop, so you can explicitly say that the $i'th element of @list_of_numbers is the number entered, like this:

$list_of_numbers[$i-1] = $entry;

Remember that when we are specifying elements in an array like @list_of_numbers by using a direct reference to the position in brackets, the counting starts at zero. So the first element in the list is $list_of_numbers[0], the second is $list_of_numbers[1], and so forth. So, in the loop where $i counts from 1 to 8, $list_of_numbers[$i-1] is the element in the array in the $i-1'th place, counting the first one as zero. That is, over the course of all 8 iterations, numbers will be assigned to the 8 entries from $list_of_numbers[0] to $list_of_numbers[7].

A second way to add the number is by using push, which will just add a number to the end of the array, like this:

push @list_of_numbers, $entry;

Either way will work; they both come to the same result.
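As a quick check that the two approaches agree, the sketch below fills one array by position and another with push, and then prints both. To keep it self-contained, the values come from a made-up array @entries instead of the keyboard, and the loop runs 3 times rather than 8; the names @by_index and @by_push are also invented for this illustration.

```perl
# Stand-ins for values that would normally be typed in.
@entries = (4, 8, 15);

@by_index = ();
@by_push = ();

for ($i = 1; $i <= 3; $i++) {
    $entry = $entries[$i-1];

    # First way: assign to a specific position (counting from zero).
    $by_index[$i-1] = $entry;

    # Second way: add to the end of the array.
    push @by_push, $entry;
}

print "by index: @by_index\n";   # by index: 4 8 15
print "by push: @by_push\n";     # by push: 4 8 15
```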

Your task: Put the pieces given above together to make a program that gets 8 numbers and averages them. Try it out on the numbers 1, 2, 3, 5, 7, 11, 13, 17 to make sure that you come up with 7.375.

Loading data from a file

Now, let's take it one step further. Create a new file with the following contents and save it as numbers.txt in the same folder as your perl scripts.

1
2
3
5
7
11
13
17

Don't put any extra lines at the end, just type each number and hit return, until you've entered all 8 numbers, and then save.

We'll change our program so that now, instead of asking the user to type in the 8 numbers, it will read them from the file we just created.

If you look at Gregory's slides, near the end, there's a slide entitled "Opening files" that has the following code on it:

$input_file = "bigoldfile.dat";
open (INPUT, "$input_file");
while ($line = <INPUT>) {
	print "$line";
}
close(INPUT);

This is what we want to base our code on. Change the file name so that it opens numbers.txt instead of bigoldfile.dat. Extra fancy option: ask the user for the file name, so that you type in the name of the file you want to find the average for.

We want to do more than simply print the contents of the file (which is what the code above will do). Instead, we want to add the value retrieved from each line of the file to our @list_of_numbers. For this, the easiest approach is to use the push function, rather than assigning the value directly with something like $list_of_numbers[$i] (see the comments in the previous task about push).

Your task: Put the pieces above together in your program so that it loads the numbers from numbers.txt and prints the average.

Hashes and records

Suppose that you want to keep track of your scores on three quizzes and a midterm from a certain class. You could do this with an array, that might look something like this:

@scores = (83, 92, 85, 88);

If we have an array like this, we can easily find the average score, based on the tasks we've already done. However, if we want to know how we did on the midterm, we have to know which of those four scores represents the midterm. Perhaps it is the fourth one. In that case, $scores[3] would be the midterm score, 88. This can get a little bit hard to keep track of, however.

Another option is to keep the scores in a hash, which is like an array but with names for each element. More technically, a key is the name of the element, and the value is its value. So, we could define a hash like this instead:

%scores = ("quiz1" => 83, "quiz2" => 92, "quiz3" => 85, "midterm" => 88);

If we do this, then finding the midterm score is easier: it is $scores{'midterm'}. In general, to find the value in a hash for a given key, you refer to it like $hash{'key'}.
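Here is a minimal sketch of looking values up by key, using the %scores hash defined above. The variable names $midterm and $second_quiz are just for this illustration.

```perl
%scores = ("quiz1" => 83, "quiz2" => 92, "quiz3" => 85, "midterm" => 88);

# Look up values by the name of the element rather than by position.
$midterm = $scores{'midterm'};
$second_quiz = $scores{'quiz2'};

print "midterm: $midterm\n";    # midterm: 88
print "quiz2: $second_quiz\n";  # quiz2: 92
```

Notice that the hash as a whole is written with %, but a single value pulled out of it is a scalar, so it is written with $ and curly braces.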

Now, suppose you want to keep track of these scores for four different students, Alfred, Beatrice, Chloe, and Daniel. We want to have a list of scores for each student. That is, we want one hash like %scores above for Alfred, another for Beatrice, and so forth. We could just do this:

%a_scores = ("quiz1" => 83, "quiz2" => 92, "quiz3" => 85, "midterm" => 88);
%b_scores = ("quiz1" => 71, "quiz2" => 95, "quiz3" => 65, "midterm" => 90);
%c_scores = ("quiz1" => 80, "quiz2" => 91, "quiz3" => 88, "midterm" => 75);
%d_scores = ("quiz1" => 70, "quiz2" => 97, "quiz3" => 75, "midterm" => 81);

So, to find Chloe's quiz2 score, we check $c_scores{'quiz2'}.

But we can also make a list containing these lists. So, suppose we have a hash like

%scores = ("Alfred" => {"quiz1" => 83, "midterm" => 88},
                "Beatrice" => {"quiz1" => 71,  "midterm" => 90},
                "Chloe" => {"quiz1" => 80, "midterm" => 75},
                "Daniel" => {"quiz1" => 70, "midterm" => 81}
               );

That's a hash of hashes. It allows us to find Chloe's quiz1 score like this: $scores{'Chloe'}->{'quiz1'}.

Warning: There is a certain amount of "magic" here, unless you are up to the task of grappling with the concept of value vs. reference. Specifically, this affects where you use the -> arrow and when you don't, and when you use curly braces and when you don't. Rules of thumb: If you are treating a hash as a value (for example, as above, where the key is Alfred and the value is a hash), you put the hash in curly braces. If you defined the hash with curly braces, then you need to use the -> arrow to refer to a key within the hash. Technically, this is because we are not actually putting the hash in the list, but rather a pointer to the hash, and the arrow is used to refer to keys in a hash being pointed to. So $scores{'Chloe'} is not actually a hash, but a pointer to a hash, so to get the value of 'quiz1', we need to use the arrow to locate the hash first. You will not be tested on this.
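If you are curious, the following sketch makes the pointer visible. It uses a cut-down version of the hash with only Chloe in it, and the names $pointer, $with_arrow, $via_pointer, and $no_arrow are made up for this illustration. The built-in ref function reports what kind of thing a pointer points to.

```perl
# A cut-down version of the hash of hashes, with one student.
%scores = ("Chloe" => {"quiz1" => 80, "midterm" => 75});

# The value stored under 'Chloe' is a pointer to a hash, not a hash itself.
$pointer = $scores{'Chloe'};
print ref($pointer), "\n";           # HASH

# The arrow follows the pointer to the hash and then looks up the key.
$with_arrow = $scores{'Chloe'}->{'quiz1'};
$via_pointer = $pointer->{'quiz1'};

# Between two sets of braces, perl lets you leave the arrow out.
$no_arrow = $scores{'Chloe'}{'quiz1'};

print "$with_arrow $via_pointer $no_arrow\n";   # 80 80 80
```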

Here's what we want to do. This is the hardest task. First, we want to set up the %scores hash as above, so the first thing in your program for this part will be the hash of hashes code given just above. What we actually want our program to do is to find the average score on quiz1 and the average score on the midterm.

Again, it's useful to think of how you would proceed to figure this out manually: You'd add up each student's score for quiz1, and then divide by the number of students. So, we want a loop that goes through each element in %scores (there is one per student), and keeps track of the running total for quiz1 and the running total for the midterm.

To go through the list of students, we can use foreach again, but this time, because we're going through a hash, we want to do it like this:

foreach $student (keys %scores) {
    # your code here
}

Then, inside the loop, $student will be the name of the student we are currently considering. To find the current student's score on quiz 1, we do the following:

$quiz1_score = $scores{$student}->{'quiz1'};

Your task: put together the code from the previous tasks with this one to find the average score for quiz1 and for the midterm. To do this, you'll want to keep a running total of each. You could also keep track of the number of students, but a quicker way to find that number (which you can do once the loop is finished) is:

$number_of_students = scalar(keys %scores);
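To see what scalar(keys ...) gives you, here is a small sketch that applies it to the simpler one-student %scores hash from earlier, which has four keys. The name $number_of_scores is made up for this illustration.

```perl
%scores = ("quiz1" => 83, "quiz2" => 92, "quiz3" => 85, "midterm" => 88);

# keys gives the list of names in the hash; scalar turns that list into a count.
$number_of_scores = scalar(keys %scores);

print "number of scores: $number_of_scores\n";   # number of scores: 4
```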