# Non-parametric Estimation and Testing

## Monte Carlo Methods

©2009 2010 2011 2012 Jennifer Harlow, Dominic Lee and Raazesh Sainudiin.

## Inference and Estimation: The Big Picture

The Big Picture is about inference and estimation, and especially inference and estimation problems where computational techniques are helpful.

 Point estimation Set estimation Parametric MLE of finitely many parametersdone Confidence intervals, via the central limit theorem Non-parametric (infinite-dimensional parameter space) about to see ... about to see ... One/Many-dimensional Integrals (finite-dimensional) coming up ... coming up ...

So far we have seen parametric models, for example

• $X_1, X_2, \ldots, X_n \overset{IID}{\sim} Bernoulli (\theta)$, $\theta \in [0,1]$
• $X_1, X_2, \ldots, X_n \overset{IID}{\sim} Exponential (\lambda)$, $\lambda \in (0,\infty)$
• $X_1, X_2, \ldots, X_n \overset{IID}{\sim} Normal(\mu^*, \sigma)$, $\mu \in \mathbb{R}$, $\sigma \in (0,\infty)$

In all these cases the parameter space (the space within which the parameter(s) can take values) is finite dimensional:

• for the $Bernoulli$, $\theta \in [0,1] \subseteq \mathbb{R}^1$
• for the $Exponential$, $\lambda \in (0, \infty) \subseteq \mathbb{R}^1$
• for the $Normal$, $\mu \in \mathbb{R}^1$, $\sigma \in (0,\infty) \subseteq \mathbb{R}^1$, so $(\mu, \sigma) \subseteq \mathbb{R}^2$

For parametric experiments, we can use the maximum likelihood principle and estimate the parameters using the Maximum Likelihood Estimator (MLE).

## Non-parametric estimation

Suppose we don't know what the distribution function (DF) is?  We are not trying to estimate some fixed but unknown parameter $\theta^*$ for some RV we are assuming to be $Bernoulli(\theta^*)$, we are trying to estimate the DF itself.  In real life, data does not come neatly labeled "I am a realisation of a $Bernoulli$ RV", or "I am a realisation of an $Exponential$ RV": an important part of inference and estimation is to make inferences about the DF itself from our observations. ##### Observations from some unknown process

Consider the following non-parametric product experiment:

$X_1, X_2, \ldots, X_n\ \overset{IID}{\sim} F^* \in \{\text{all DFs}\}$

We want to produce a point estimate for $F^*$, which is a allowed to be any DF ("lives in the space of all DFs"), i.e., $F^* \in \{\text{all DFs}\}$

$\{\text{all DFs}\}$ is infinite dimensional We have already seen an estimate, made using the data, of a distribution function:  the empirical or data-based distribution function (or empirical cumulative distribution function). This can be formalized as the following process of adding indicator functions of the half-lines beginning at the data points $[X_1,+\infty),[X_2,+\infty),\ldots,[X_n,+\infty)$:

$\widehat{F}_n (x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x)$

where

$\text{where }\mathbf{1}_{[X_i,+\infty)}(x) := \begin{cases} & 1 \quad \text{ if } X_i \leq x \\ & 0 \quad \text{ if }X_i > x \end{cases}$

We can remind ourselves of this for a small sample of $de\,Moivre(k=5)$ RVs:

deMs=[randint(1,5) for i in range(20)] # randint can be used to uniformly sample integers in a specified range deMs
sortedUniqueValues = sorted(list(set(deMs))) freqs = [deMs.count(i) for i in sortedUniqueValues] from pylab import cumsum cumFreqs = list(cumsum(freqs)) # cumRelFreqs = [ZZ(i)/len(deMs) for i in cumFreqs] # get cumulative relative frequencies as rationals zip(sortedUniqueValues, cumRelFreqs)
show(ecdfPlot(deMs), figsize=[6,3]) # use hidden ecdfPlot function to plot We can use the empirical cumulative distribution function $\widehat{F}_n$ for our non-parametric estimate because this kind of estimation is possible in infinite-dimensional contexts due to the following two theorems:

• Glivenko-Cantelli Theorem ("Fundamental Theorem of Statistics")
• Dvoretsky-Kiefer-Wolfowitz (DKW) Inequality

### Glivenko-Cantelli Theorem

Let $X_1, X_2, \ldots, X_n \overset{IID}{\sim} F^* \in \{\text{all DFs}\}$

and the empirical distribution function (EDF) is $\widehat{F}_n(x) := \displaystyle\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x)$, then

$\sup_x { | \widehat{F}_n(x) - F^*(x) | } \overset{P}{\rightarrow} 0$

Remember that the EDF is a statistic of the data, a statistic is an RV, and (from our work the convergence of random variables), $\overset{P}{\rightarrow}$ means "converges in probability".  The proof is beyond the scope of this course, but we can gain an appreciation of what it means by looking at what happens to the ECDF for $n$ simulations from:

• $de\,Moivre(1/5,1/5,1/5,1/5,1/5)$ and
• $Uniform(0,1)$ as $n$ increases:
@interact def _(n=(10,(0..200))): '''Interactive function to plot ecdf for obs from de Moirve (5).''' if (n > 0): us = [randint(1,5) for i in range(n)] p=ecdfPlot(us) # use hidden ecdfPlot function to plot #p+=line([(-0.2,0),(0,0),(1,1),(1.2,1)],linestyle=':') p.show(figsize=[3,3])

## Click to the left again to hide and once more to show the dynamic interactive window

@interact def _(n=(10,(0..200))): '''Interactive function to plot ecdf for obs from Uniform(0,1).''' if (n > 0): us = [random() for i in range(n)] p=ecdfPlot(us) # use hidden ecdfPlot function to plot p+=line([(-0.2,0),(0,0),(1,1),(1.2,1)],linestyle='-') p.show(figsize=[5,4],aspect_ratio=1)

## Click to the left again to hide and once more to show the dynamic interactive window

It is clear, that as $n$ increases, the ECDF $\widehat{F}_n$ gets closer and closer to the true DF $F^*$,   $\displaystyle\sup_x { | \widehat{F}_n(x) - F^*(x) | } \overset{P}{\rightarrow} 0$.

This will hold no matter what the (possibly unknown) $F^*$ is.  Thus, $\widehat{F}_n$ is a point estimate of $F^*$.

We need to add the DKW Inequality be able to get confidence sets or a 'confidence band' that traps $F^*$ with high probability.

### Dvoretsky-Kiefer-Wolfowitz (DKW) Inequality

Let $X_1, X_2, \ldots, X_n \overset{IID}{\sim} F^* \in \{\text{all DFs}\}$

and the empirical distribution function (EDF) is $\widehat{F}_n(x) := \displaystyle\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x)$,

then, for any $\varepsilon > 0$,

$P\left( \sup_x { | \widehat{F}_n(x) - F^*(x) | > \varepsilon }\right) \leq 2 \exp(-2n\varepsilon^2)$

We can use this inequality to get a $1-\alpha$ confidence band $C_n(x) := \left[\underline{C}_n(x), \overline{C}_n(x)\right]$ about our point estimate $\widehat{F}_n$ of our possibly unknown $F^*$ such that the $F^*$ is 'trapped' by the band with probability at least $1-\varepsilon$.

$\begin{eqnarray} \underline{C}_{\, n}(x) &=& \max \{ \widehat{F}_n(x)-\varepsilon_n, 0 \}, \notag \\ \overline{C}_{\, n}(x) &=& \min \{ \widehat{F}_n(x)+\varepsilon_n, 1 \}, \notag \\ \varepsilon_n &=& \sqrt{ \frac{1}{2n} \log \left( \frac{2}{\alpha}\right)} \\ \end{eqnarray}$

and

$$P\left(\underline{C}_n(x) \leq F^*(x) \leq \overline{C}_n(x)\right) \geq 1-\alpha$$

You try in class

Try this out for a simple sample from the $Uniform(0,1)$, which you can generate using random.   First we  will just make the point estimate for $F^*$, the EDF $\widehat{F}_n$

n=10 uniformSample = [random() for i in range(n)] uniformSample

In one of the assessments, you did a question that took you through the steps for getting the list of points that you would plot for an empirical  distribution function (EDF).  We will do exactly the same thing here.

First we find the unique values in the sample, in order from smallest to largest, and get the frequency with which each unique value occurs:

sortedUniqueValuesUniform = sorted(list(set(uniformSample))) sortedUniqueValuesUniform
freqsUniform = [uniformSample.count(i) for i in sortedUniqueValuesUniform] freqsUniform

Then we accumulate the frequences to get the cumulative frequencies:

from pylab import cumsum cumFreqsUniform = list(cumsum(freqsUniform)) # accumulate cumFreqsUniform

And the the relative cumlative frequencies:

cumRelFreqsUniform = [ZZ(i)/len(uniformSample) for i in cumFreqsUniform] # cumulative rel freqs as rationals cumRelFreqsUniform

And finally zip these up with the sorted unique values to get a list of points we can plot:

ecdfPointsUniform = zip(sortedUniqueValuesUniform, cumRelFreqsUniform) ecdfPointsUniform

You are not expected to know how to actually create the ECDF plot, so here is a function that you can just use to do it:

%auto # ECDF plot given a list of points to plot def ecdfPointsPlot(listOfPoints, colour='grey', lines_only=False): '''Returns an empirical probability mass function plot from a list of points to plot. Param listOfPoints is the list of points to plot. Param colour is used for plotting the lines, defaulting to grey. Param lines_only controls wether only lines are plotted (true) or points are added (false, the default value). Returns an ecdf plot graphic.''' ecdfP = point((0,0), pointsize="0") if not lines_only: ecdfP = point(listOfPoints, rgbcolor = "red", faceted = false, pointsize="20") for k in range(len(listOfPoints)): x, kheight = listOfPoints[k] # unpack tuple previous_x = 0 previous_height = 0 if k > 0: previous_x, previous_height = listOfPoints[k-1] # unpack previous tuple ecdfP += line([(previous_x, previous_height),(x, previous_height)], rgbcolor=colour) ecdfP += line([(x, previous_height),(x, kheight)], rgbcolor=colour, linestyle=":") if not lines_only: ecdfP += points((x, previous_height),rgbcolor = "white", faceted = true, pointsize="20") # padding max_index = len(listOfPoints)-1 ecdfP += line([(listOfPoints-0.2, 0),(listOfPoints, 0)], rgbcolor=colour) ecdfP += line([(listOfPoints[max_index], listOfPoints[max_index]),(listOfPoints[max_index]+0.2, listOfPoints[max_index])],rgbcolor=colour) return ecdfP

This makes the plot of the $\widehat{F}_{10}$, the point estimate for $F^*$ for these $n=10$ simulated samples.

show(ecdfPointsPlot(ecdfPointsUniform), figsize=[6,3]) What about adding those confidence bands?  You will do essentially the same thing, but adjusting for the required $\varepsilon$.  First we need to decide on an $\alpha$ and calculate the $\varepsilon$ corresponding to this alpha.  Here is some of our code to calculate the $\varepsilon$ corresponding to $\alpha=0.05$ (95% confidence bands), using a hidden function calcEpsilon

alpha = 0.05 epsilon = calcEpsilon(alpha, n) epsilon

See if you can write your own code to do this calculation, $\varepsilon_n = \sqrt{ \frac{1}{2n} \log \left( \frac{2}{\alpha}\right)}$. For completeness, do the whole thing:assign the value 0.05 to a variable named alpha, and then use this and the variable called n that we have already declared to calculate a value for $\varepsilon$.  Call the variable to which you assign the value for $\varepsilon$  epsilon so that it replaces the value we calculated in the cell above (you should get the same value as us!).

Now we need to use this to adjust the EDF plot.  In the two cells below we first of all do the adjustment for $\underline{C}_{\,n}(x) =\max \{ \widehat{F}_n(x)-\varepsilon_n, 0 \}$, and then use zip again to get the points to actually plot for the lower boundary of the 95% confidence band.

Now we need to use this to adjust the EDF plot. In the two cells below we first of all do the adjustment for $\overline{C}_{\,n}(x) =\min \{ \widehat{F}_n(x)+\varepsilon_n, 1 \}$, and then use zip again to get the points to actually plot for the lower boundary of the 95% confidence band.

cumRelFreqsUniformLower = [max(crf - epsilon, 0) for crf in cumRelFreqsUniform] # heights for the lower band cumRelFreqsUniformLower
ecdfPointsUniformLower = zip(sortedUniqueValuesUniform, cumRelFreqsUniformLower) ecdfPointsUniformLower

We carefully gave our ecdfPointsPlot function the flexibility to be able to plot bands, by having a colour parameter (which defaults to 'grey') and a lines_only parameter (which defaults to false).  Here we can plot the lower bound of the confidence interval by by adding ecdfPointsPlot(ecdfPointsUniformLower, colour='green', lines_only=true) to the previous plot:

pointEstimate = ecdfPointsPlot(ecdfPointsUniform) lowerBound = ecdfPointsPlot(ecdfPointsUniformLower, colour='green', lines_only=true) show(pointEstimate + lowerBound, figsize=[6,3]) Now you try writing the code to create the list of points needed for plotting the upper band $\overline{C}_{\,n}(x) =\min \{ \widehat{F}_n(x)+\varepsilon_n, 1 \}$.  You will need to first of all get the upper heights (call them say cumRelFreqsUniformUpper) and then zip them up with the sortedUniqueValuesUniform to get the points to plot.

Once you have got done this you can add them to the plot by altering the code below:

pointEstimate = ecdfPointsPlot(ecdfPointsUniform) lowerBound = ecdfPointsPlot(ecdfPointsUniformLower,colour='green', lines_only=true) show(pointEstimate + lowerBound, figsize=[6,3]) If we are doing lots of collections of EDF points we may as well define a function to do it, rather than repeating the same code again and again.  We use an offset parameter to give us the flexibility to use this to make points for confidence bands as well.

%auto def makeEDFPoints(myDataList, offset=0): '''Make a list empirical distribution plotting points from from a data list. Param myDataList, list of data to make ecdf from. Param offset is an offset to adjust the edf by, used for doing confidence bands. Return list of tuples comprising (data value, cumulative relative frequency(with offset)) ordered by data value.''' sortedUniqueValues = sorted(list(set(myDataList))) freqs = [myDataList.count(i) for i in sortedUniqueValues] from pylab import cumsum cumFreqs = list(cumsum(freqs)) cumRelFreqs = [ZZ(i)/len(myDataList) for i in cumFreqs] # get cumulative relative frequencies as rationals if offset > 0: # an upper band cumRelFreqs = [min(i+offset ,1) for i in cumRelFreqs] if offset < 0: # a lower band cumRelFreqs = [max(i+offset, 0) for i in cumRelFreqs] return zip(sortedUniqueValues, cumRelFreqs)

#### (end of You Try)

Now we will try looking at the Earthquakes data we have used before to get a confidence band around an EDF for that.  We start by bringing in the data and using our own interQuakeTimes function to calculate the times between earthquakes in seconds:

myFilename = 'earthquakes_1July2009_19Mar2010.csv' myData = getData(myFilename,headerlines=1,sep=',') interQuakesSecs = interQuakeTimes(myData, -50, -30, 150, 200)
len(interQuakesSecs)

There is a lot of data here, so let's use an interactive plot to do the non-parametric DF estimation just for some of the last data:

@interact def _(takeLast=(500,(0..min(len(interQuakesSecs),1999))), alpha=(0.05)): '''Interactive function to plot the edf estimate and confidence bands for inter earthquake times.''' if takeLast > 0 and alpha > 0 and alpha < 1: lastInterQuakesSecs = interQuakesSecs[len(interQuakesSecs)-takeLast:len(interQuakesSecs)] interQuakePoints = makeEDFPoints(lastInterQuakesSecs) p=ecdfPointsPlot(interQuakePoints, lines_only=true) epQuakes = calcEpsilon(alpha, len(lastInterQuakesSecs)) interQuakePointsLower = makeEDFPoints(lastInterQuakesSecs, offset=-epQuakes) lowerQuakesBound = ecdfPointsPlot(interQuakePointsLower, colour='green', lines_only=true) interQuakePointsUpper = makeEDFPoints(lastInterQuakesSecs, offset=epQuakes) upperQuakesBound = ecdfPointsPlot(interQuakePointsUpper, colour='green', lines_only=true) show(p + lowerQuakesBound + upperQuakesBound, figsize=[6,3]) else: print "check your input values"

## Click to the left again to hide and once more to show the dynamic interactive window

What if we are not interested in estimating $F^*$ itself, but we are interested in scientificially investigating whether two distributions are the same.  For example, perhaps, whether the distribution of earthquake magnitudes was the same in April as it was in March.  Then, we should attempt to reject a falsifiable hypothesis ...

## Hypothesis Testing

A formal definition of hypothesis testing is beyond our current scope.  Here we will look in particular at a non-parametric hypothesis test called a permutation test.  First, a quick review:

The outcomes of a hypothesis test, in general, are:

 'true state of nature' Do not reject $H_0$ Reject $H_0$ $H_0$ is true OK Type I error $H_0$ is false Type II error OK

So, we want a small probability that we reject  $H_0$ when $H_0$ is true (minimise Type I error).  Similarly, we want to minimise the probability that we fail to reject $H_0$ when $H_0$ is false (type II error).

The P-value is one way to conduct a desirable hypothesis test.  The scale of the evidence against $H_0$ is stated in terms of the P-value.  The following interpretation of P-values is commonly used:

• P-value $\in (0, 0.01]$: Very strong evidence against $H_0$
• P-value $\in (0.01, 0.05]$: Strong evidence against $H_0$
• P-value $\in (0.05, 0.1]$: Weak evidence against $H_0$
• P-value $\in (0.1, 1]$: Little or no evidence against $H_0$

### Permutation Testing

A Permuation Test is a non-parametric exact method for testing whether two distributions are the same based on samples from each of them.

What do we mean by "non-parametric exact"?  It is non-parametric because we do not impose any parametric assumptions.  It is exact because it works for any sample size.

Formally, we suppose that: $X_1,X_2,\ldots,X_m \overset{IID}{\sim} F^* \quad \text{and} \quad X_{m+1}, X_{m+2},\ldots,X_{m+n} \overset{IID}{\sim} G^* \enspace ,$ are two sets of independent samples where the possibly unknown DFs $F^*,\,G^* \in \{ \text{all DFs} \}$.

(Notice that we have written it so that the subscripts on the $X$s run from 1 to $m+n$.)

Now, consider the following hypothesis test: $H_0: F^*=G^* \quad \text{versus} \quad H_1: F^* \neq G^* \enspace .$

Our test statistic uses the observations in both both samples.  We want a test statistic that is a sensible one for the test, i.e., will be large when when $F^*$ is 'too different' from $G^*$

So, let our test statistic $T(X_1,\ldots,X_m,X_{m+1},\ldots,X_{m+n})$ be say: $T:=T(X_1,\ldots,X_m,X_{m+1},\ldots,X_{m+n})= \text{abs} \left( \frac{1}{m} \sum_{i=1}^m X_i - \frac{1}{n} \sum_{i=m+1}^n X_i \right) \enspace .$

(In words, we have chosen a test statistic that is the absolute value of the difference in the sample means.   Note the limitation of this:  if $F^*$ and $G^*$ have the same mean but different variances, our test statistic $T$ will not be large.)

Then the idea of a permutation test is as follows:

1. Let $N:=m+n$ be the pooled sample size and consider all $N!$ permutations of the observed data $x_{obs}:=(x_1,x_2,\ldots,x_m,x_{m+1},x_{m+2},\ldots,x_{m+n})$.

2. For each permutation of the data compute the statistic $T(\text{permuted data } x)$ and denote these $N!$ values of $T$ by $t_1,t_2,\ldots,t_{N!}$.

3. Under $H_0: X_1,\ldots,X_m,X_{m+1},\ldots,X_{m+n} \overset{IID}{\sim}F^*=G^*$, each of the permutations of $x= (x_1,x_2,\ldots,x_m,x_{m+1},x_{m+2},\ldots,x_{m+n})$ has the same joint probability $\prod_{i=1}^{m+n} f(x_i)$  -- $f(x_i)$ is the density function corresponding to $F^*=G^*$, $f(x_i)=dF(x_i)=dG(x_i)$.
4. Therefore, the transformation of the data by our statistic $T$ also has the same probability over the values of $T$, namely $\{t_1,t_2,\ldots,t_{N!}\}$. Let $\mathbf{P}_0$ be this permutation distribution under the null hypothesis. $\mathbf{P}_0$ is discrete and uniform over $\{t_1,t_2,\ldots,t_{N!}\}$.

5. Let $t_{obs} := T(x_{obs})$ be the observed value of the test statistic.

6. Assuming we reject $H_0$ when $T$ is large, the P-value = $\mathbf{P}_0 \left( T \geq t_{obs} \right)$

Saying that $\mathbf{P}_0$ is discrete and uniform over $\{t_1, t_2, \ldots, t_{N!}\}$ says that each possible permutation has an equal probabability of occuring (under the null hypothesis).  There are $N!$ possible permutations and so the probability of any individual permutation is $\frac{1}{N!}$

$\text{P-value} = \mathbf{P}_0 \left( T \geq t_{obs} \right) = \frac{1}{N!} \left( \sum_{j=1}^{N!} \mathbf{1} (t_j \geq t_{obs}) \right), \qquad \mathbf{1} (t_j \geq t_{obs}) = \begin{cases} 1 & \text{if } \quad t_j \geq t_{obs} \\ 0 & \text{otherwise} \end{cases}$

This will make more sense if we look at some real data.

### Permutation Testing with Shell Data

In 2008, Guo Yaozong and Chen Shun collected data on the diameters of coarse venus shells from New Brighton beach for a course project.  They recorded the diameters for two samples of shells, one from each side of the New Brighton Pier.  The data is given in the following two cells.

%auto leftSide = [52, 54, 60, 60, 54, 47, 57, 58, 61, 57, 50, 60, 60, 60, 62, 44, 55, 58, 55, 60, 59, 65, 59, 63, 51, 61, 62, 61, 60, 61, 65, 43, 59, 58, 67, 56, 64, 47, 64, 60, 55, 58, 41, 53, 61, 60, 49, 48, 47, 42, 50, 58, 48, 59, 55, 59, 50, 47, 47, 33, 51, 61, 61, 52, 62, 64, 64, 47, 58, 58, 61, 50, 55, 47, 39, 59, 64, 63, 63, 62, 64, 61, 50, 62, 61, 65, 62, 66, 60, 59, 58, 58, 60, 59, 61, 55, 55, 62, 51, 61, 49, 52, 59, 60, 66, 50, 59, 64, 64, 62, 60, 65, 44, 58, 63]
%auto rightSide = [58, 54, 60, 55, 56, 44, 60, 52, 57, 58, 61, 66, 56, 59, 49, 48, 69, 66, 49, 72, 49, 50, 59, 59, 59, 66, 62, 44, 49, 40, 59, 55, 61, 51, 62, 52, 63, 39, 63, 52, 62, 49, 48, 65, 68, 45, 63, 58, 55, 56, 55, 57, 34, 64, 66, 54, 65, 61, 56, 57, 59, 58, 62, 58, 40, 43, 62, 59, 64, 64, 65, 65, 59, 64, 63, 65, 62, 61, 47, 59, 63, 44, 43, 59, 67, 64, 60, 62, 64, 65, 59, 55, 38, 57, 61, 52, 61, 61, 60, 34, 62, 64, 58, 39, 63, 47, 55, 54, 48, 60, 55, 60, 65, 41, 61, 59, 65, 50, 54, 60, 48, 51, 68, 52, 51, 61, 57, 49, 51, 62, 63, 59, 62, 54, 59, 46, 64, 49, 61]
len(leftSide); len(rightSide)

$(115 + 139)!$ is a very big number.   Lets start small, and take a subselection of the shell data to demonstrate the permutation test concept:  the first two shells from the left of the pier and the first one from the right:

rightSub = [52, 54] leftSub =  totalSample = rightSub + leftSub totalSample

So now we are testing the hypotheses

$\begin{array}{lcl}H_0&:& X_1,X_2,X_3 \overset{IID}{\sim} F^*=G^* \\H_1&:&X_1, X_2 \overset{IID}{\sim} F^*, \,\,X_3 \overset{IID}{\sim} G^*, F^* \neq G^*\end{array}$

With the test statistic$\begin{array}{lcl}T(X_1,X_2,X_3) &=& \text{abs} \left(\displaystyle\frac{1}{2}\displaystyle\sum_{i=1}^2X_i - \displaystyle\frac{1}{1}\displaystyle\sum_{i=2+1}^3X_i\right) \\ &=&\text{abs}\left(\displaystyle\frac{X_1+ X_2}{2} - \displaystyle\frac{X_3}{1}\right)\end{array}$

Our observed data $x_{obs} = (x_1, x_2, x_3) = (52, 54, 58)$

and the realisation of the test statistic for this data is $t_{obs} = \text{abs}\left(\displaystyle\frac{52+54}{2} - \frac{58}{1}\right) = \text{abs}\left(53 - 58\right) = \text{abs}(-5) = 5$

Now we need to tabulate the permutations and their probabilities.  There are 3! = 6 possible permutataions of three items.  For larger samples, you could use the factorial function to calculate this:

factorial(3)

We said that under the null hypotheses (the samples have the same DF) each permutation is equally likely, so each permutation has probability $\displaystyle\frac{1}{6}$.

There is a way in Python (the language under the hood in Sage), to get all the permuations of a sequence:

list(permutations(totalSample))

We can tabulate the permuations, their probabilities, and the value of the test statistic that would be associated with that permutation:

 Permutation $t$ $\mathbf{P}_0(T=t)$ (52, 54, 58) 5 $\frac{1}{6}$ (52, 58, 54) 1 $\frac{1}{6}$ (54, 52, 58) 5 $\frac{1}{6}$ (54, 58, 52) 4 $\frac{1}{6}$ (58, 52, 54) 1 $\frac{1}{6}$ (58, 54, 52) 4 $\frac{1}{6}$
allPerms = list(permutations(totalSample)) for p in allPerms: t = abs((p + p)/2 - p/1) print p, " has t = ", t

To calculate the P-value for our test statistic $t_{obs} = 5$, we need to look at how many permutations would give rise to test statistics that are at least as big, and add up their probabilities.

$\begin{array}{lcl}\text{P-value} &=& \mathbf{P}_0(T \geq t_{obs}) \\&=&\mathbf{P}_0(T \geq 5)\\&=&\frac{1}{6} + \frac {1}{6} \\&=&\frac{2}{6}\\ &=&\frac{1}{3} \\ &\approx & 0.333\end{array}$

We could write ourselves a little bit of code to do this in Sage.  As you can see, we could easily improve this to make it more flexible so that we could use it for different numbers of samples, but it will do for now.

allPerms = list(permutations(totalSample)) pProb = 1/len(allPerms) pValue = 0 tobs = 5 for p in allPerms: t = abs((p + p)/2 - p/1) if t >= tobs: pValue = pValue + pProb pValue

This means that there is little or no evidence against the null hypothesis (that the shell diameter observations are from the same DF).

#### Pooled sample size

The lowest possible P-value for a pooled sample of size $N=m+n$ is $\displaystyle\frac{1}{N!}$.  Can you see why this is?

So with our small sub-samples the smallest possible P-value would be $\frac{1}{6} \approx 0.167$.  If we are looking for P-value $\leq 0.01$ to constitute very strong evidence against $H_0$, then we have to have a large enough pooled sample for this to be possible.  Since $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$, it is good to have $N \geq 5$

You try in class

Try copying and pasting our code and then adapting it to deal with a sub-sample (52, 54, 60) from the left of the pier and (58, 54) from the right side of the pier.

rightSub = [52, 54, 60] leftSub = [58, 54] totalSample = rightSub + leftSub totalSample

You will have to think about:

• calculating the value of the test statistic for the observed data and for all the permuations of the total sample
• calculating the probability of each permutation
• calculating the P-value by adding the probabilities for the permutations with test statistics at least as large as the observed value of the test statistic

(add more cells if you need them)

#### (end of You Try)

We can use the sample function and the Python method for making permutations to experiment with a larger sample, say 5 of each.

n, m = 5, 5 leftSub = sample(leftSide, n) rightSub = sample(rightSide,m) totalSample = leftSub + rightSub leftSub; rightSub; totalSample
tobs = abs(mean(leftSub) - mean(rightSub)) tobs

We have met sample briefly already:  it is part of the Python random module and it does exactly what you would expect from the name: it samples a specified number of elements randomly from a sequence.

help(sample)
#define a helper function for calculating the tstat from a permutation def tForPerm(perm, samplesize1, samplesize2): '''Calculates the t statistic for a permutation of data given the sample sizes to split the permuation into. Param perm is the permutation of data to be split into the two samples. Param samplesize1, samplesize2 are the two sample sizes. Returns the absolute value of the difference in the means of the two samples split out from perm.''' sample1 = [perm[i] for i in range(samplesize1)] sample2 = [perm[samplesize1+j] for j in range(samplesize2)] return abs(mean(sample1) - mean(sample2))
allPerms = list(permutations(totalSample)) pProb = 1/len(allPerms) pValue = 0 tobs = abs(mean(leftSub) - mean(rightSub)) for p in allPerms: t = tForPerm(p, n, m) if t >= tobs: pValue = pValue + pProb pValue
n+m
factorial(n+m) # how many permutations is it checking

As you can see from the length of time it takes to do the calculation for $(5+5)! = 10!$ permutations, we will be here a long time if we try to this on all of both shell data sets.  Monte Carlo methods to the rescue:  we can use Monte Carlo integration to calculate an approximate P-value, and this will be our next topic.

You try

Try working out the P-value for a sub-sample (58, 63) from the left of the pier and (61) from the right (the two last values in the left-side data set and the last value in the right-side one).  Do it as you would if given a similar question in the exam: you choose how much you want to use Sage to help and how much you do just with pen and paper.