©2009 2010 2011 2012 Jennifer Harlow, Dominic Lee and Raazesh Sainudiin.
Creative Commons Attribution-Noncommercial-Share Alike 3.0
A statistic is any measureable function of the data: $T(x): \mathbb{X} \rightarrow \mathbb{T}$.
Thus, a statistic $T$ is also an RV that takes values in the space $\mathbb{T}$.
When $x \in \mathbb{X}$ is the observed data, $T(x)=t$ is the observed statistic of the observed data $x$.
Let's go back to our New Zealand lotto data.
We showed that for New Zealand lotto (40 balls in the machine, numbered $1, 2, \ldots, 40$), the number on the first ball out of the machine can be modelled as a $de Moivre (\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$ RV.
We have the New Zealand Lotto results for 1114 draws, from 1 August 1987 to 10 November 2008 (retrieved from the NZ lotto web site: http://lotto.nzpages.co.nz/previousresults.html ).
We can think of this data as $x$, the realisation of a random vector $X = (X_1, X_2,\ldots, X_{1114})$ where $X_1, X_2,\ldots, X_{1114} \overset{IID}{\thicksim} de Moivre(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$
The data space is every possible sequence of ball numbers that we could have got in these 1114 draws. $\mathbb{X} = \{1, 2, \ldots, 40\}^{1114}$. There are $40^{1114}$ possible sequences and our data is just one of these $40^{1114}$ possible points in the data space.
We will use our hidden function that enables us to get the ball one data in a list. Evaluate the cell below to get the data and confirm that we have data for 1114 draws.
|
|
From a given sequence of RVs $X_1, X_2, \ldots, X_n$ (or a random vector $X = (X_1, X_2, \ldots, X_n)$ we can obtain another RV called the sample mean (technically, the $n$-sample mean):
$$T_n((X_1, X_2, \ldots, X_n)) = \bar{X_n}((X_1, X_2, \ldots, X_n)) :=\frac{1}{n}\displaystyle\sum_{i=1}^nX_i$$
We write $\bar{X_n}((X_1, X_2, \ldots, X_n))$ as $\bar{X}$,
and its realisation $\bar{X_n}((X_1, X_2, \ldots, X_n))$ as $\bar{x_n}$.
By the properties of expectations that we have seen before,
$$E(\bar{X_n}) = E \left(\frac{1}{n}\sum_{i=1}^nX_i \right) = \frac{1}{n}E\left(\sum_{i=1}^nX_i\right) = \frac{1}{n}\sum_{i=1}^nE(X_i)$$
And because every $X_i$ is identically distributed with the same expectation, say $E(X_1)$, then
$$E(\bar{X_n}) = \frac{1}{n}\sum_{i=1}^nE(X_i)= \frac{1}{n} \times n \times E(X_1) = E(X_1)$$
We have already met the Sage pylab module which provides arrays and some useful statistical capabilities for our Lotto data. To be able to use these capabilities, it is easiest to convert our Lotto Data into a pylab.array.
|
Now we can find the sample mean of our Lotto data using the mean method of pylab arrays.
|
|
If we go back to our definition of the sample mean $\bar{X_n}$
$$T_n((X_1, X_2, \ldots, X_n)) = \bar{X_n}((X_1, X_2, \ldots, X_n)) :=\frac{1}{n}\displaystyle\sum_{i=1}^nX_i$$
we can see that we calculate it by summing all the data values $X_i$ and then dividing the sum by the number of data values $n$. Instead of using the pylab magic to give us the mean, we could explicitly calculate it for ourselves using the Sage functions sum() and len().
We can use sum() to add up (find the summation or sum of) the values in the list (this will give us $\displaystyle\sum_{i=1}^nX_i$), and we can use len() to find the length of the list which is the number of data values $n$.
|
|
What has happened here?
If you ask Sage to divide one Sage.rings.integer by another, you will get the answer as a Sage.rings.rational.
|
To convert from a Sage.rings.rational to a floating point real number, we can use the conversion function RR().
|
|
|
From a given sequence of RVs $X_1, X_2, \ldots, X_n$ (or a random vector $X = (X_1, X_2, \ldots, X_n)$ we can obtain another RV called the sample variance.
$$T_n((X_1, X_2, \ldots, X_n)) =S_n^2((X_1, X_2, \ldots, X_n)) :=\frac{1}{n}\displaystyle\sum_{i=1}^n(X_i - \bar{X_n})^2$$
Sometimes, we divide by $n-1$ instead of $n$.
The sample variance is a measure of spread from the sample.
We write $S_n^2((X_1, X_2, \ldots, X_n))$ as $S_n^2$,
and its realisation $S_n^2((x_1, x_2, \ldots, x_n))$ as $s_n^2$
Suppose that every $X_i$ is identically distributed with the same variance, say $V(X_1)$. Then, using the definition of variance and the properties of expectations summarized by the following equalities:
try to show that
$$E(S_n^2) = \frac{n-1}{n} V(X_1)$$
Here we use the var method of the pylab array to find the variance of the Lotto data.
|
|
|
Similarly, we can look at a sample standard deviation = $\sqrt{\text{sample variance}}$
|
|
|
Double check that the sample standard deviation is indeed the square root of the sample variance
|
|
Suppose that $X_1, \,X_2, \,\ldots \,X_n$ is a sequence of RVs.
Then, the $n$-sample order statistic $X_{([n])}$ is:
$$X_{([n])}((X_1, X_2, \ldots, X_n)) := (X_{(1)}, X_{(2)}, \ldots, X_{(n)})$$
such that
$$X_{(1)} \le X_{(2)} \le \, \ldots \, \le X_{(n)}$$
For brevity, we write $X_{([n])}((X_1, X_2, \ldots, X_n))$ as $X_{[n]}$
and its realisation $X_{([n])}((x_1, x_2, \ldots, x_n))$ as $x_{([n])} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$.
Thus, we simply sort the data to get the order statistic.
For example, if we have the outcome of 3 $Bernoulli$ trials $x = (0, 1, 0)$,
then $x_{([3])} = (0, 0, 1)$.
For the Lotto data, we could easily sort our data using the sort method of a list, but watch out! This will sort our list in-place, which means that listBallOne will now be in the new, sorted, order. Very often, we want to keep our data in the original order. What if we wanted to know the result of the first draw, or the last draw? If we only have the sorted list we have lost this original ordering.
We have can do two things. First, we can use the sorted() function. This will return a new, sorted, copy of the list.
|
We suggest that you use sorted() to make a sorted copy of a list. It is quicker than the copy-and-sort two-step process and you are less likely to make a mistake and sort the original list instead of the copy.
|
The $k$th order statistic of a sample is the $k$th smallest value in the sample. Order statistics that may be of particular interest include the smallest (minimum) and largest (maximum) values in the sample.
The pylab.array version of our data provides us with an easy way to get at the sample minimum and sample maximum statistics.
|
|
With lots of discrete data like this, we may be interested in the frequency of each value, or the number of times we get a particular value. This can be formalized as the following process of adding indicator functions of the data points $(X_1,X_2,\ldots,X_n)$: \[ \sum_{i=1}^n \mathbf{1}_{\{X_i\}}(x) \]
We have seen how we can use the Sage dictionary to give us a mapping from ball numbers in the list to the count of the number of times each ball number comes up. In the last worksheet we wrote a function to do this. Let us fully understand this function now.
|
|
Let us flex our Sage muscles by implementing another way to do the frequencies from a data list. We have learned about sorting a list and counting how many times a particular value occurs in a list (remember count: if we use myList.count(0) we will get the number of times 0 occurs in myList). We can now write a new implementation of the function to make a dictionary associating ball numbers with their frequencies. This function, unlike the one above, is not quite returning the $\sum_{i=1}^n \mathbf{1}_{\{X_i\}}(x)$.
|
What do you need to note here?
|
{1: 29, 2: 28, 3: 31, 4: 24, 5: 27, 6: 30, 7: 31, 8: 32, 9: 22, 10: 27, 11: 25, 12: 28, 13: 29, 14: 22, 15: 22, 16: 27, 17: 20, 18: 28, 19: 28, 20: 32, 21: 30, 22: 30, 23: 27, 24: 36, 25: 37, 26: 22, 27: 26, 28: 24, 29: 24, 30: 33, 31: 28, 32: 27, 33: 28, 34: 27, 35: 34, 36: 24, 37: 26, 38: 30, 39: 22, 40: 37} {1: 29, 2: 28, 3: 31, 4: 24, 5: 27, 6: 30, 7: 31, 8: 32, 9: 22, 10: 27, 11: 25, 12: 28, 13: 29, 14: 22, 15: 22, 16: 27, 17: 20, 18: 28, 19: 28, 20: 32, 21: 30, 22: 30, 23: 27, 24: 36, 25: 37, 26: 22, 27: 26, 28: 24, 29: 24, 30: 33, 31: 28, 32: 27, 33: 28, 34: 27, 35: 34, 36: 24, 37: 26, 38: 30, 39: 22, 40: 37} |
{1: 29, 2: 28, 3: 31, 4: 24, 5: 27, 6: 30, 7: 31, 8: 32, 9: 22, 10: 27, 11: 25, 12: 28, 13: 29, 14: 22, 15: 22, 16: 27, 17: 20, 18: 28, 19: 28, 20: 32, 21: 30, 22: 30, 23: 27, 24: 36, 25: 37, 26: 22, 27: 26, 28: 24, 29: 24, 30: 33, 31: 28, 32: 27, 33: 28, 34: 27, 35: 34, 36: 24, 37: 26, 38: 30, 39: 22, 40: 37} {1: 29, 2: 28, 3: 31, 4: 24, 5: 27, 6: 30, 7: 31, 8: 32, 9: 22, 10: 27, 11: 25, 12: 28, 13: 29, 14: 22, 15: 22, 16: 27, 17: 20, 18: 28, 19: 28, 20: 32, 21: 30, 22: 30, 23: 27, 24: 36, 25: 37, 26: 22, 27: 26, 28: 24, 29: 24, 30: 33, 31: 28, 32: 27, 33: 28, 34: 27, 35: 34, 36: 24, 37: 26, 38: 30, 39: 22, 40: 37} |
|
The empirical mass function (EMF) of a sequence of observed data $x_1, x_2, \ldots x_n$ is the sum of the following indicator functions:
$$EMF((x_1, x_2, \ldots, x_n)) = \frac{1}{n}\displaystyle\sum_{i=1}^{n}\mathbf{1}_{\{x_i\}}(x)$$
What is this doing? This is adding normalized indicator functions of the data points $(X_1,X_2,\ldots,X_n)$: \[ \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{X_i\}}(x) \]
The normalization part is the division by $n$, the total number of data points.
For example if we have $x$ as the outcome of three $Bernoulli$ trials (like tossing a fair coin three times to see if we get heads), say $x = (1, 0, 1)$, then
$$EMF((x_1, x_2, \ldots, x_n)) = \frac{1}{3}(\mathbf{1}_{\{1\}}(x) + \mathbf{1}_{\{0\}}(x) + \mathbf{1}_{\{1\}}(x))$$
We can plot this by showing the height of a point as the relative frequency of occurences of that value. The relative frequency for a number is the count (frequency) for that number divided by the sample size (the sample size is the total number of trials). The relative frequency of some data value $x_i$ appears in our $EMF$ formula as $ \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{X_i\}}(x) $
Using our prepackaged and hidden makeEMF() function we can do this plot for this $Bernoulli$ trial data.
[(0, 0.333333333333333), (1, 0.666666666666667)] [(0, 0.333333333333333), (1, 0.666666666666667)] |
|
|
The dotted blue lines just help us to see what $x$ value is associated with each point in the plot.
In the last worksheet, we showed an empirical mass function for the Lotto data.
|
|
Another extremely important statistics of the observed data is called the empirical distribution function (EDF). The EDF is the empirical or data-based distribution function (DF) just like the empirical mass function (EMF) is the empirical or data-based probability mass function (PMF). This can be formalized as the following process of adding indicator functions of the half-lines beginning at the data points $[X_1,+\infty),[X_2,+\infty),\ldots,[X_n,+\infty)$: \[ \widehat{F}_n (x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x) \]
This formula is saying that we are looking at the cumulative relative frequency for each of the values: not just how many data points had that value (normalized by $n$) but how many data points had that value or a smaller one (normalized by $n$).
Let's have a look at this for the simple of $Bernoulli$ example. We have bundled up the techniques we use in the hidden function called makeEDF so that you can just concentrate on the data for now.
[(0, 0.33333333333333331), (1, 1.0)] [(0, 0.33333333333333331), (1, 1.0)] |
|
The two different values a $Bernoulli$ RV can take (in order from smallest to largest) are 0, 1. What is the cumulative relative frequency of the number of times we got a 0? It is $\frac{1}{3}$: we got 0 once and we had three datapoints in total, and there are no values that the RV can take smaller than 0. What is the cumulative relative frequency of the number of times we got a 1? We got a 1 twice so its relative frequency is $\frac{2}{3}$, which is what we showed with the EMF. However, for the cumulative relative frequency we add need to add to this the relative frequency of all values less than 1 (only zero in this case) so the cumulative relative frequency of 1 is $\frac{1}{3} + \frac{2}{3} = 1$.
We can plot this.
|
|
We showed last week how we can do the same for the Lotto data. This EDF plot has many more points on it because the first ball can have 40 possible numbers on it (1, 2, 3, $\ldots$, 40)
|
|
You try
The list comprehension is a programming concept that allows us to represent lists ( or sets, or tuples or other sequential containers) that are obtained by transforming each element of some given list and selecting a subset of them.
We have already talked about for loops. Sometimes when you are working with a list and a for loop seems to be the obvious way to create another list from the first list, a list comprehension is the easiest way to do it.
You can think of a list comprehension as a short specialised syntax which does a for loop on a list and creates another list.
We will illustrate this by staring with a simple list.
|
Suppose that we want a list that contains all the elements in firstList multiplied by 2. We can use a list comprehension to do this.
Remember that a list comprehension can be thought of as a short specialised syntax which does a for loop on a list and creates another list. The syntax is:
[ some_expression_applied_to_each_element_in_list for element in list].
Note the [ ] parentheses around the whole statement -- this is what makes the result into a list.
We will use a list comprehension to take each value in firstList, transform it (multiply by 2 in this case) and put the transformed value into a new list, secondList.
|
Note also that although you might intuitively think that you could get a list containing every element in firstList multiplied by 2 with an expression like 2 * firstList, you don't. Doing this will replicate the elements in firstList.
|
We don't even have to explicitly create the first list - we can wrap making the list we are doing the list comprehension on into the list comprehension itself. Here we are creating a list of the powers of 2, $[2^0, 2^1, 2^4, \ldots, 2^{11}]$
|
We can be choosy about what goes into the new list, selecting elements which elements to transform and put into the new list, by adding an if statement at the end. For instance, we can get the set of even numbers in the set {0,1,...,9} as follows.
|
|
|
|
Let us do a more interesting comprehension where we transform each element x in the list [0, 2, ..., 8] by 2^x and then select the transformation of only those elements whose square, i.e., x^2, is divisible by 3. We see this list comprehension step-by-step below.
|
|
|
We used our hidden function, makeEMF, to help us to associate each of the numbers that appeared on Lotto ball one with the relative frequencies of its occurence(i.e., the number of times that number appeared divided by the total number of observations, 1114 in this case). We used a list comprehension to do this. First we made the dictionary that associated numbers with their counts. Then we summed up the counts to tell us how many data points there were in total.
|
Then we used a list comprehension to make a list of relative frequencies: for each frequency we divided it by the total frequency. The reason that we used RR(totalCounts) is to make sure that Sage evaluates the result as a floating point real. The RR() function converts from other numerical types to the Sage floating point real type.
|
Try doing a list comprehension to get a list $[2^3, 4^3, 6^3, 8^3]$
|
|
|
You try
Now we are ready to play with some more real data - earthquakes. First of all we will use a file of data from before the earthquake. This is already attached to this worksheet (you will learn more about attaching data shortly...). For the moment, just evaluate this cell so that you have assigned the filename to the variable named myFilename.
|
Both Python and Sage provide lots of different ways of reading data from files. In this case we are going to use the same function to get the data as we used for the Lotto data. This is one we wrote ourselves. It is a little bit more complicated than you need for this course so again we have hidden it away so that you can just concentrate on the data itself.
|
How many rows and columns have we got?
|
What does the array look like? When Sage is asked to display large arrays, it will automatically truncate the output - what you see here is the first and last three columns of the first and last three rows. The data that is not shown is indicated with the . . . characters.
|
The getData(...) function was a general function, designed just to get data from a file. It did not have to know anything about the format of the earthquake data file. When we want to process the array we have made, however, we do have to know about our data. If we want to extract the magnitude data, for instance, we have to know which column it is in. We made a function to take the column containing the magnitude data and return us a list of the magnitude data omitting all the nan (not a number) values. You can look at this function in the Optional Exploration part of the worksheet if you want to, but it is not essential for this course to understand it.
First have a look at the documentation for our function:
And now get the data:
|
When you evaluate the cell below, you will probably find that the length of the list is less than the number of rows in the array. This is because the makeMagList function misses out any nan (not a number) values in the array. If you look at your data in Excel you will see that many rows do not have magnitude data. Note that 'no data' is very different to 'magnitude = 0'!
|
We can use the pylab.hist function to make a histogram of the data. We will be doing more on histograms later. For the moment, just evaluate and admire your earthquake magntitudes histogram.
![]() |
We can also do things like finding the maximum magnitude in the data set we brought in.
|
And the minimum.
|
If we were interested in finding out more about the row of data that had the largest magnitude, we could do have a look at the whole row from the complete array. You don't need to know how to do this for this course, but if you are interested, this code extracts all rows where the value in the 12th column is greater than 7.0
|
|
We also made a more complicated function to get the latitude and longitude coordinates for each earthquake so that we can plot them on a scatter plot. Again, you can have a look at it in the Optional Exploration section if you really want to, but you do not need to be able to write it for this course!
One thing important thing though is that we learned last year that some of the latitude and longitude data shows that earthquakes are well outside New Zealand's land or oceanic territory. Therefore we included a way to specify that we only want to get data that is within certain latitude and longitude ranges. Here we ask for data with latitude between -50 and -30, and longitude between 150 and 200 (inclusive) . We also ask only for data with magnitude $\ge$ 2.0
Use the function on our array myData.
11957 11957 |
The variable named listCoords has now been assigned to a list of three lists: one list for latitudes, one for longitudes, and one for magnitudes.
Now we are going to use the magnitudes to make a list of colours. The elif keyword means 'else-if' and is a way of making a more complex conditional statement checking multiple conditions (it is not essential material for this course, but is useful here).
|
Now we can plot the latitude/longitude coordinates as a scatter plot, with the colour of the point obtained from the list of colours we just created. We will do more on plotting later in the course.
![]() |
Try changing the minimum magnitude we want to show to see what effect this has on your plot (the minimum magnitude is the last argument passed to the makeCoordList function, for example if we use makeCoordList(myData, -50, -30, 150, 200, 3.0) then 3.0 is the minimum magnitude we will get.
|
|
|
You try
So far, we have made actually getting the data very easy for you. For this, you are going to do a little more work.
First we are going to download some New Zealand earthquake data from the GeoNet Quake Search catalog site. Please follow the instructions below carefully:
Look at the data in Excel. Where it is practical to do so, it is always useful to be able to actually have a look at your datafile before you start trying to process it. Sometimes, if the file is huge, this may not be easy, but somehow you need to know things like how many header lines there are at the top (the lines before the data starts), what is the layout of the file, how tidy it all is ...
Now you need to actually upload your data file to the SAGE Notebook server so that you can access it from your worksheet. Please follow the instructions to do this carefully:
DATA+'earthquakes_1July2009_19Mar2010.csv'
(copy it by selecting the text with your mouse and then copying).'earthquakes_1July2009_19Mar2010.csv'
with the text you copied just now ( including the two quote marks ' ')Now if you click evaluate for the cell below, the datafile will be assigned to a variable named newFilename.
|
Make the list of magnitudes, as we did before
|
You could check the minimum and maximums, like we did before:
|
|
Make a histogram:
|
Make the coordinates lists, like we did before (again, excluding data which is well outside New Zealand and setting the minimum magnitude to 2)
|
|
|
|
|
You try
We are going to be using functions quite a lot from now on, so we will get a bit more formal about how we write them.
You can think of functions as reusable procedures for doing useful things that you might want to have available in lots of other, more specialised, sections of code. The beauty of a function is that it is a way of generalising your program instructions: When you want to use a function you pass it the value or values you want it to use and (usually - see below) get back an answer which has been calculated using those values.
In Sage we can define our own functions. We can write our own reusable, general procedures to do useful, frequently needed, things for us.
Think of writing some code to sort out who out of your classmates can join your basketball team. It is a very politically incorrect heightist team, unfortunately, so only people 1.5m tall or more can join, and also they have to be available on Tuesday nights. You've got this information in the form of records for each person (note the use of tuples for the records).
In the cell below we create records for 12 classmates.
|
Each classmate has a record encoded as a tuple containing 3 pieces of information (name, height in metres, and availability on Tuesday evenings) about the classmate it represents.
Now we want to find out who should be in our team. We can index into the tuple (remember the indexing operator [ ])to get at the values in the first, second and third positions:
and we could 'hard-code' the whole thing:
|
As you can see, that is repetitive to code, boring, and very inflexible. In fact, whenever you see repeated lines in your code that seem to be doing the same basic thing with different values, it is a good signal that it is time to write a function.
Here is an expanded version of the information we gave you in an earlier worksheet about defining a function:
The function name is preceded by the keyword def for definition.
It is recommended to make sure that the function name gives some very concise indication of what it does.
After the function name you put the function parameters within a pair of parentheses ( ).
The function parameters are variable names that are used within the body of the function, doing with them whatever it is you want the function to do. We'll come back to them in a minute....
After we have defined the function by its name followed by its parameters in parentheses, we end the line with a colon : before continuing to the next line.
def myFunctionWithAGoodName(parameters):
Now, we are ready to write the content of the function. This must be indented and it is customary to leave 4 white spaces (as we have with all our other indentations). SAGE does this for you automatically if you have ended the line above with a colon :.
It is a matter of courteous programming practice to have a docstring for your function, i.e., comments on what the function does. The whole docstring is enclosed in a set of three single quotes (e.g., '''dostring'''). The docstring is returned when we ask SAGE for help on the function. The general format is to have one single line, ending in a full stop (period) which gives a concise description of the function, followed by a blank line, followed by more detail.
def myFunctionWithAGoodName(parameters):
'''This is the consise summary.
Followed by more detail ...'''
Inside the body of the function we do whatever it is we want to do. Most of the time, but not always, we want the function to tell us the result of these calculations, i.e. we want the function to return something when it has finished. We do this with the keyword return followed by the expression to be returned.
In the SAGE Notebook you can define a function in one cell, evaluaate it, and then use the function in other cells further down. More generally, if you want to put more code in the same cell, below the body of the function (i.e., code that is not part of the function body) you have to move the start of the line backwards from the indentation that indicated the function body, so that it aligns with the def keyword that started the function definition.
Here is an example function to print out a line which states whether a classmate is in the team by testing whether the height in the classmate record is $\ge$ 1.5m and the availability on Tuesday nights in the classmate record is true.
|
That is a good start but... What if we used our function for all our classmates and found that we did not get enough for a team. Maybe then we'd have to lower our minimum height, say to 1.4m? But then we would have to write a new function, with the minimum height of 1.4m embedded itside it. More repeated code - arrghghg. Instead, we could rewrite the function above to generalise it by making the minimum height to test against a parameter of the function. This means that the function itself does not have the minimum height to use embedded inside it, but instead you can tell the function what minimum height to use.
|
Now you need to really think about what is going on here. The function has parameters classmate and minHeight. These are named in the parentheses following the def keyword and the function name printInTeamSecondAttempt. The code inside the body of the function uses these names in its calculations.
When we define the parameters, we are defining some variable names that the function knows about and can use. These are called local variables: they are local to (specific to) the function. This is what makes the function an efficient way of dealing with different classmates and different values for minHeight.
When our code says printInTeamSecondAttempt(myClassmate1, mh)we are calling the function and passing it the arguments myClassmate1 and mh. Inside the function it will assign the record myClassmate1 to a local variable classmate (its first parameter name) and it will assign the value assigned to mh to a local variable minHeight (its second parameter name).
On the next line, when our code says printInTeamSecondAttempt(myClassmate2, mh)we are calling the function again and this time passing it the arguments myClassmate2 and mh. Inside the function it will now assign the record myClassmate2 to a local variable classmate and it will assign the value assigned to mh to a local variable minHeight (its second parameter name) to .
|
Here is an example where we are passing a different value for the minimum height to the function:
|
You'll notice that the printInTeamSecondAttempt(...) function does not actually return anything, it just prints out a line. It has given us a bit more flexibility, but what if we don't want to print things out? We might want to use this function in the middle of a huge program that deals with lots and lots of classmate tuples to make lots of different teams. Thousands of lines of output are not going to be helpful.
What about changing the function to return something? We could return a boolean value, ie true or false, to say whether the classmate tuple passed as the argument to the function meets our criteria.
|
Lets look at the this alternative. Functions that return boolean values are often given a name that starts with is... It's just a kind of convention and it means that you, as a programmer, can just see a function name like isInTeam(...) and, without even looking at the docstring, be pretty certain that it is going to return a boolean value, i.e. true or false.
Here we can see the keyword return, followed by the expression for the value we want to return. In this case we are returning the value the local variable named retValue is assigned to.
Note that this function uses a kind of shortcut to do the if.. else calculation that printInTeam() did. It specifies a default value for what it is going to return (we have assigned this to the variable name retValue). The default is false. The function body only changes default to true if the team membership criteria are satisfied. Doing this is a good way of programming for two reasons: First, in this case it saves writing the 'else' block and, in general, shorter code is faster and easier to understand. Secondly, it makes sure that we return a safe default value no matter how complex our function then gets.
In the next cell we call the function isInTeam and pass it the arguments myClassmate1 and mh.
Try changing the value that the variable mh is assigned to.
|
isInTeam is a function.
|
See what has happened to your lovely documentation string - it is now the 'help' file for the function!
When we call isInTeam we will get back a value. The function call gives a value whose type is the type of the value returned by the function (in this case, it is a boolean type).
|
The function call is returning a value (true or false).
Our new function also does not give us the names, so why is it better than printInTeam()? Because, by returning a boolean value, it is more flexible than a function that just prints something out.
Lets create the kind of situation we might have if we had lots of classmate record tuples to deal with. We'd have them in some kind of collection, say a list (maybe an array, but we'll use list here):
|
classmateList is a list of tuples, each tuple containing 3 pieces of information (name, height in metres, and availability on Tuesday evenings) about the classmate it represents. Now say we wanted the same kind of print out that we had before. We can use a for loop on the list, like this:
|
The if clause of the for loop is calling the function isInTeam(...) and using the boolean value that is returned to decide whether to execute the if block code (print cm[0], "is in the team") or the else block code (print cm[0], "cannot be in the team").
cm is the name we have used for the loop variable, and each value (i.e., tuple) in the list is assigned to the variable cm in turn. cm[0] indexes into the tuple the loop variable cm is currently assigned to and gets the name of the person that tuple represents.
But we can also use the same function in lots of different ways. What about using it to actually make our team?
|
We can then ask Sage to tell us what our team is:>/span>
|
You may also have seen that we could make this code shorter by just using a list comprehension:
|
What if we wanted to make lots of different teams? Write a function to do it!
|
The function makeTeam(...) has two parameters, clist and minValue. It uses a list comprehension to make the team list. The list comprehension calls the function isInTeam(...), as described above for the for loop.
Note that because the list comprehension allows us to make the team list with a single expression, the function body only has to have the return keyword and the list comprehension expression that makes the list to return (if you used the for loop method the function body would have to first make the list using the for loop and then return the list). Most functions do have more than one line in the function body, but occasionally a one-line function is all you need.
Try out the function.
|
Lets think about processing file of earthquake data. There are things we want to do to the file (open it, read the data, turn the data into some kind of array, return the array). When we say "read the data" and "turn the data into some kind of array", what do we mean by "the data"? Well, the data in the file consists of lines, one for each earthquake record. At the top of the file is a line which is the column headings.
|
|
The file consists of separate lines (the '\n' character is a new line character, signifying the end of a line). The first line is the column headings. The lines after that are the records for each earthquake. You can see that although Excel will display the file to you in rows and columns, in fact what we really have is lines with commas to split up the separate components of the line.
So to "read the data" we want to take each line after the header line and turn it into a row in our 2-dimensional array of data. To do this we need to split up the line, and the signal for where to split is the comma. We also want to strip off the \n\ newline character at the end of each line. We are going to do this for each line (after the header line) in the file. This is a situation where we might first write a function to process a single line of data. In the cell below we have written a simple version of the way that we could process a line from our earthquakes file. Note how general this function is: we could use it on almost any file provided that it had lines that needed to be split up.
|
Now when we want to deal with the whole file can use our processLineSimple function. You are not expected to write a function like this, but it will help your understanding of functions if you look at how it uses the processLinesSimple function we have just written to process each record in the file. In words, this function does the following:
|
Now you can use the getDataSimple function to turn some of the earthquake data into an array. We are restricting the data to the first five records so that you see all of those records. Notice that some values are nan. That stands for not a number and indicates that those value were missing in the original data file (you can confirm that by looking at the data in Excel).
|
Just to demonstrate how flexible these function are, here is another file of data. It is a record of an informal experiment a family did with their five cats to try to see which of two cat foods their cats liked. They timed how long it took each cat to each a meal of each of the two brands of cat food and recorded the results in the file CatFoodTime.txt. Have a look at the file.
|
|
Now use exactly the same getDataSimple function to get the cat food data:
|
|
You should now understand a bit better what we said about functions. They allow us to write flexible, reusable code, i.e. to avoid repetition and build larger blocks from smaller blocks. There is one other use for functions even if you don't think that you'll want to use the function in lots of different circumstances. This is just being able to break down your code into manageble blocks (functions), each of which fulfils a clear, self-contained purpose. This is described as modular programming. This kind of code is much easier to read and understand and maintain. You will also almost inevitably find that you reuse key blocks of code more than you thought you would when you first wrote it, so it's useful to make nice functions right from the start. Function parameters allow us to pass arguments to functions, and the function usually passes back, or returns, the result of its calculations. You can nest function calls within functions - that is what building up from your blocks is all about.
Think about how you could use a similar function to take lots of tuple records on people and calculate statistics like the proportion of a class who are over a certain height. Then think a bit wider - if you have some sort of data, you can use a function to do some analysis on the data and then summarise the results.
Only do this if you have done all the You Try sections above and are happy that you understand everything. This section is optional.
If you are interested in using files with Sage (or Python), have a look at the Python documentation about reading and writing files.
|
|
When Sage reads in a csv file, as we read in the earthquake data, it treats all the values as strings. This is good in some ways -- our data could be a mixture of string columns and number columns and it gives us the chance to decide how to deal with the data that we expect. Our getData function was written just to handle numeric values - we did not expect anything that could not be converted to a number. Sage provides easy ways of converting from strings of numerical characters to number types, so we can just convert the columns of strings we know should represent numbers into number types. But, what if something has gone wrong in the file and something we expect to be a number contains an non-numeric character? This can happen easily, for lots of reasons. One of the most obvious is that there was simply no data in the original file. This comes through as an empty string or ' ' (that is two single quote marks with nothing between them). Therefore, we really want to use a safe way of converting from a string to a number -- one which will fail 'gracefully' rather than just fall over and die.... If you have spent half an hour of computer processing time reading in a huge file you are not going to be pleased when the whole thing fails at this stage because of one missing value or corrupted number!
An example of the kind of function we need to safely convert from strings to float types is shown below. Again, you don't have to worry about the details. What you do have to think about is that real-world computational statistics is a lot more than just pressing buttons with a computer package to analyse perfect data with no effort: you have to think about your data and and you have to think about how to deal with problems from basics.
|
This function gets magnitude data from the array, using a list comprehension to skip not-a-number (nan) values. nan's typically arise when the original data was just missing, ie that line of the file did not have a value for magnitude. Note that it assumes a knowledge of the structure of the data (which column the magnitudes are in). This is not very flexible or robust and could have been done much better - can you think how?
|
Here is a more complicated function to get the latitude and longitude coordinates for each earthquake so that we can plot them on a scatter plot. You very definitely don't have to worry about what the following function is doing, although we have provided more comments than would be usual in normal programming. Note again that the function assumes a knowledge of the data structure (which columns the latitudes and longitudes are in). Note also that we want to test the data to make sure that we have both latitude and longitude. This function could be implemented much more efficiently but it does it's job ...
|
|
|
|
|
|
|