©2009 2010 2011 2012 Jennifer Harlow, Dominic Lee and Raazesh Sainudiin.
Creative Commons Attribution-Noncommercial-Share Alike 3.0
The Big Picture is about inference and estimation, and especially inference and estimation problems where computational techniques are helpful.
|  | Point estimation | Set estimation |
|---|---|---|
| Parametric | MLE of finitely many parameters | Confidence intervals |
| Non-parametric | about to see ... | about to see ... |
| One/Many-dimensional Integrals | coming up ... | coming up ... |
So far we have seen parametric models, for example the $Bernoulli(\theta^*)$ and $Exponential(\lambda^*)$ models. In all these cases the parameter space (the space within which the parameter(s) can take values) is finite-dimensional.
For parametric experiments, we can use the maximum likelihood principle and estimate the parameters using the Maximum Likelihood Estimator (MLE).
What if we don't know what the distribution function (DF) is? Then we are not trying to estimate some fixed but unknown parameter $\theta^*$ for some RV we are assuming to be $Bernoulli(\theta^*)$; we are trying to estimate the DF itself. In real life, data does not come neatly labeled "I am a realisation of a $Bernoulli$ RV" or "I am a realisation of an $Exponential$ RV": an important part of inference and estimation is to make inferences about the DF itself from our observations.
Consider the following non-parametric product experiment:
\[X_1, X_2, \ldots, X_n\ \overset{IID}{\sim} F^* \in \{\text{all DFs}\}\]
We want to produce a point estimate for $F^*$, which is allowed to be any DF ("lives in the space of all DFs"), i.e., $F^* \in \{\text{all DFs}\}$.
$\{\text{all DFs}\}$ is infinite-dimensional.
We have already seen an estimate, made using the data, of a distribution function: the empirical or data-based distribution function (or empirical cumulative distribution function). This can be formalized as the following process of adding indicator functions of the half-lines beginning at the data points $[X_1,+\infty),[X_2,+\infty),\ldots,[X_n,+\infty)$:
\[ \widehat{F}_n (x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x) \]
where
\[\mathbf{1}_{[X_i,+\infty)}(x) := \begin{cases} 1 & \text{ if } X_i \leq x \\ 0 & \text{ if } X_i > x \end{cases}\]
We can remind ourselves of this for a small sample of $de\,Moivre(k=5)$ RVs:
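Here is a minimal sketch of how such a picture could be made (the names deMoivreSample and ecdf, and the use of randint to simulate $de\,Moivre(k=5)$ draws, are our own choices rather than the worksheet's hidden code):

```
# A sketch: simulate a small de Moivre(k=5) sample, i.e. equally likely
# values in {1,...,5}, and plot its ECDF evaluated on a grid of x values.
from random import randint

n = 10
deMoivreSample = [randint(1, 5) for _ in range(n)]

def ecdf(sample, x):
    '''Proportion of the sample that is <= x.'''
    return sum(1 for xi in sample if xi <= x)/len(sample)

# evaluate on a grid over [0, 6] and show the resulting step-like points
grid = [j/10 for j in range(0, 61)]
show(points([(x, ecdf(deMoivreSample, x)) for x in grid], size=5))
```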
We can use the empirical cumulative distribution function $\widehat{F}_n$ for our non-parametric estimate because this kind of estimation is possible in infinite-dimensional contexts due to the following two theorems:
Glivenko–Cantelli Theorem. Let $X_1, X_2, \ldots, X_n \overset{IID}{\sim} F^* \in \{\text{all DFs}\}$
and let the empirical distribution function (EDF) be $\widehat{F}_n(x) := \displaystyle\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x)$. Then
\[ \sup_x { | \widehat{F}_n(x) - F^*(x) | } \overset{P}{\rightarrow} 0 \enspace . \]
Remember that the EDF is a statistic of the data, a statistic is an RV, and (from our work on the convergence of random variables) $\overset{P}{\rightarrow}$ means "converges in probability". The proof is beyond the scope of this course, but we can gain an appreciation of what the theorem means by looking at what happens to the ECDF as we increase the number of simulated samples $n$.
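As a rough numerical illustration (our own sketch, not the interactive cell: it uses $Uniform(0,1)$ samples, for which the true DF is $F^*(x) = x$ on $[0,1]$, and the name supDistanceUniform is our choice):

```
# Sup-distance between the ECDF of a Uniform(0,1) sample and the true DF
# F*(x) = x; the supremum is attained just before or at a jump of the ECDF.
from random import random

def supDistanceUniform(sample):
    xs = sorted(sample)
    n = len(xs)
    # at the i-th order statistic the ECDF jumps from (i-1)/n to i/n
    return max(max(abs(i/n - xs[i-1]), abs((i-1)/n - xs[i-1]))
               for i in range(1, n+1))

# watch the sup-distance shrink as n grows
[(n, supDistanceUniform([random() for _ in range(n)])) for n in [10, 100, 1000]]
```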
It is clear that, as $n$ increases, the ECDF $\widehat{F}_n$ gets closer and closer to the true DF $F^*$: $\displaystyle\sup_x { | \widehat{F}_n(x) - F^*(x) | } \overset{P}{\rightarrow} 0$.
This will hold no matter what the (possibly unknown) $F^*$ is. Thus, $\widehat{F}_n$ is a point estimate of $F^*$.
We need to add the DKW Inequality to be able to get confidence sets, or a 'confidence band', that traps $F^*$ with high probability.
Dvoretzky–Kiefer–Wolfowitz (DKW) Inequality. Let $X_1, X_2, \ldots, X_n \overset{IID}{\sim} F^* \in \{\text{all DFs}\}$
and let the empirical distribution function (EDF) be $\widehat{F}_n(x) := \displaystyle\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x)$.
Then, for any $\varepsilon > 0$,
\[ P\left( \sup_x | \widehat{F}_n(x) - F^*(x) | > \varepsilon \right) \leq 2 \exp(-2n\varepsilon^2) \enspace . \]
We can use this inequality to get a $1-\alpha$ confidence band $C_n(x) := \left[\underline{C}_n(x), \overline{C}_n(x)\right]$ about our point estimate $\widehat{F}_n$ of our possibly unknown $F^*$ such that $F^*$ is 'trapped' by the band with probability at least $1-\alpha$.
\[\begin{array}{lcl} \underline{C}_{\, n}(x) &=& \max \{ \widehat{F}_n(x)-\varepsilon_n,\, 0 \} \\ \overline{C}_{\, n}(x) &=& \min \{ \widehat{F}_n(x)+\varepsilon_n,\, 1 \} \\ \varepsilon_n &=& \sqrt{ \dfrac{1}{2n} \log \left( \dfrac{2}{\alpha}\right)} \end{array}\]
and
$$P\left(\underline{C}_n(x) \leq F^*(x) \leq \overline{C}_n(x)\right) \geq 1-\alpha$$
You try in class
Try this out for a simple sample from the $Uniform(0,1)$, which you can generate using random. First we will just make the point estimate for $F^*$, the EDF $\widehat{F}_n$.
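Here is one way this first cell could look (the name sampleUniform is our choice):

```
# Simulate n = 10 observations from the Uniform(0,1) distribution.
from random import random

n = 10
sampleUniform = [random() for _ in range(n)]
sampleUniform
```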
In one of the assessments, you did a question that took you through the steps for getting the list of points that you would plot for an empirical distribution function (EDF). We will do exactly the same thing here.
First we find the unique values in the sample, in order from smallest to largest, and get the frequency with which each unique value occurs:
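One possible version of this step (sortedUniqueValuesUniform is the name the worksheet itself uses later; freqsUniform is our choice):

```
# Sorted unique sample values and the frequency of each one.
sortedUniqueValuesUniform = sorted(set(sampleUniform))
freqsUniform = [sampleUniform.count(v) for v in sortedUniqueValuesUniform]
freqsUniform
```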
Then we accumulate the frequencies to get the cumulative frequencies:
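A sketch of the accumulation step (cumFreqsUniform is our choice of name):

```
# Running totals of the frequencies.
cumFreqsUniform = []
runningTotal = 0
for f in freqsUniform:
    runningTotal += f
    cumFreqsUniform.append(runningTotal)
cumFreqsUniform
```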
And then the relative cumulative frequencies:
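Dividing each cumulative frequency by the sample size $n$ gives the heights of the EDF (cumRelFreqsUniform is our choice of name, mirroring the cumRelFreqsUniformUpper used for the upper band later):

```
# Relative cumulative frequencies: cumulative frequency / sample size.
cumRelFreqsUniform = [cf/n for cf in cumFreqsUniform]
cumRelFreqsUniform
```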
And finally zip these up with the sorted unique values to get a list of points we can plot:
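For example (ecdfPointsUniform is our choice of name, mirroring the ecdfPointsUniformLower used for the lower band later):

```
# Pair each sorted unique value with its EDF height.
ecdfPointsUniform = list(zip(sortedUniqueValuesUniform, cumRelFreqsUniform))
ecdfPointsUniform
```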
You are not expected to know how to actually create the ECDF plot, so here is a function that you can just use to do it:
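The sketch below is a stand-in for that function rather than the worksheet's actual ecdfPointsPlot; it keeps the colour and lines_only parameters described further down and simply draws a step plot through a list of (value, height) points:

```
# NOT the worksheet's hidden function -- a minimal stand-in for it.
def ecdfPointsPlot(listOfPoints, colour='grey', lines_only=false):
    '''Return a Sage graphics object: a step plot through listOfPoints.'''
    ecdfPlot = Graphics()
    pts = sorted(listOfPoints)
    for i in range(len(pts)):
        x, h = pts[i]
        nextX = pts[i+1][0] if i+1 < len(pts) else x + 0.1
        ecdfPlot += line([(x, h), (nextX, h)], color=colour)  # horizontal step
        if not lines_only:
            ecdfPlot += points([(x, h)], color=colour, size=20)
    return ecdfPlot

show(ecdfPointsPlot(ecdfPointsUniform))
```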
This makes the plot of $\widehat{F}_{10}$, the point estimate of $F^*$ for these $n=10$ simulated samples.
(Plot: the ECDF $\widehat{F}_{10}$ for the $n=10$ simulated $Uniform(0,1)$ samples.)
What about adding those confidence bands? You will do essentially the same thing, but adjusting for the required $\varepsilon$. First we need to decide on an $\alpha$ and calculate the $\varepsilon$ corresponding to this $\alpha$. Here is some of our code to calculate the $\varepsilon$ corresponding to $\alpha=0.05$ (95% confidence bands), using a hidden function calcEpsilon:
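The hidden calcEpsilon presumably implements the formula for $\varepsilon_n$ given above; a stand-in could be:

```
# epsilon_n = sqrt(log(2/alpha) / (2n)), from the DKW-based band above.
def calcEpsilon(alphaE, nE):
    return sqrt((1/(2*nE))*log(2/alphaE))

epsilon = calcEpsilon(0.05, n)   # alpha = 0.05, n = 10 samples
epsilon
```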
See if you can write your own code to do this calculation, $\varepsilon_n = \sqrt{ \frac{1}{2n} \log \left( \frac{2}{\alpha}\right)}$. For completeness, do the whole thing: assign the value 0.05 to a variable named alpha, and then use this and the variable called n that we have already declared to calculate a value for $\varepsilon$. Call the variable to which you assign the value for $\varepsilon$ epsilon so that it replaces the value we calculated in the cell above (you should get the same value as us!).
Now we need to use this to adjust the EDF plot. In the two cells below we first of all do the adjustment for $\underline{C}_{\,n}(x) =\max \{ \widehat{F}_n(x)-\varepsilon_n, 0 \}$, and then use zip again to get the points to actually plot for the lower boundary of the 95% confidence band.
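A sketch of those two cells (ecdfPointsUniformLower is the name used in the plotting call quoted below; cumRelFreqsUniformLower is our choice for the adjusted heights):

```
# Lower band: subtract epsilon from each EDF height, clamp at 0, and zip
# with the sorted unique values to get plottable points.
cumRelFreqsUniformLower = [max(crf - epsilon, 0) for crf in cumRelFreqsUniform]
ecdfPointsUniformLower = list(zip(sortedUniqueValuesUniform, cumRelFreqsUniformLower))
ecdfPointsUniformLower
```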
We carefully gave our ecdfPointsPlot function the flexibility to be able to plot bands, by having a colour parameter (which defaults to 'grey') and a lines_only parameter (which defaults to false). Here we can plot the lower bound of the confidence band by adding ecdfPointsPlot(ecdfPointsUniformLower, colour='green', lines_only=true) to the previous plot:
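Using the names from the sketches above, the plotting call might look like this:

```
# Overlay the green lower-band boundary on the grey EDF point estimate.
pointEstimatePlot = ecdfPointsPlot(ecdfPointsUniform)
lowerBoundPlot = ecdfPointsPlot(ecdfPointsUniformLower, colour='green', lines_only=true)
show(pointEstimatePlot + lowerBoundPlot)
```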
(Plot: the ECDF $\widehat{F}_{10}$ in grey with the lower boundary of the 95% confidence band in green.)
Now you try writing the code to create the list of points needed for plotting the upper band $\overline{C}_{\,n}(x) =\min \{ \widehat{F}_n(x)+\varepsilon_n, 1 \}$. You will need to first of all get the upper heights (call them say cumRelFreqsUniformUpper) and then zip them up with the sortedUniqueValuesUniform to get the points to plot.
Once you have done this you can add them to the plot by altering the plotting code.
If we are doing lots of collections of EDF points we may as well define a function to do it, rather than repeating the same code again and again. We use an offset parameter to give us the flexibility to use this to make points for confidence bands as well.
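Our guess at such a helper (the name makeEDFPoints and its exact signature are assumptions on our part):

```
# EDF points for any data list, with every height shifted by offset and
# clamped to [0, 1]; offset = 0 gives the EDF itself, offset = -epsilon
# or +epsilon gives the lower or upper confidence band.
def makeEDFPoints(myDataList, offset=0):
    sortedUniqueValues = sorted(set(myDataList))
    freqs = [myDataList.count(v) for v in sortedUniqueValues]
    cumFreqs = []
    total = 0
    for f in freqs:
        total += f
        cumFreqs.append(total)
    n = len(myDataList)
    heights = [min(max(cf/n + offset, 0), 1) for cf in cumFreqs]
    return list(zip(sortedUniqueValues, heights))
```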
Now we will try looking at the Earthquakes data we have used before to get a confidence band around an EDF for that. We start by bringing in the data and using our own interQuakeTimes function to calculate the times between earthquakes in seconds.
There is a lot of data here, so let's use an interactive plot to do the non-parametric DF estimation just for some of the most recent data.
What if we are not interested in estimating $F^*$ itself, but rather in scientifically investigating whether two distributions are the same? For example, perhaps, whether the distribution of earthquake magnitudes was the same in April as it was in March. Then we should attempt to reject a falsifiable hypothesis ...
A formal definition of hypothesis testing is beyond our current scope. Here we will look in particular at a non-parametric hypothesis test called a permutation test. First, a quick review:
The outcomes of a hypothesis test, in general, are:
| 'true state of nature' | Do not reject $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ is true | OK | Type I error |
| $H_0$ is false | Type II error | OK |
So, we want a small probability of rejecting $H_0$ when $H_0$ is true (i.e., we want to minimise the probability of a Type I error). Similarly, we want to minimise the probability of failing to reject $H_0$ when $H_0$ is false (a Type II error).
The P-value is one way to conduct a desirable hypothesis test. The scale of the evidence against $H_0$ is stated in terms of the P-value. The following interpretation of P-values is commonly used:

| P-value | Evidence |
|---|---|
| $< 0.01$ | very strong evidence against $H_0$ |
| $0.01$ – $0.05$ | strong evidence against $H_0$ |
| $0.05$ – $0.10$ | weak evidence against $H_0$ |
| $> 0.10$ | little or no evidence against $H_0$ |
A Permutation Test is a non-parametric exact method for testing whether two distributions are the same, based on samples from each of them.
What do we mean by "non-parametric exact"? It is non-parametric because we do not impose any parametric assumptions. It is exact because it works for any sample size.
Formally, we suppose that: \[ X_1,X_2,\ldots,X_m \overset{IID}{\sim} F^* \quad \text{and} \quad X_{m+1}, X_{m+2},\ldots,X_{m+n} \overset{IID}{\sim} G^* \enspace , \] are two sets of independent samples where the possibly unknown DFs $F^*,\,G^* \in \{ \text{all DFs} \}$.
(Notice that we have written it so that the subscripts on the $X$s run from 1 to $m+n$.)
Now, consider the following hypothesis test: \[ H_0: F^*=G^* \quad \text{versus} \quad H_1: F^* \neq G^* \enspace . \]
Our test statistic uses the observations in both samples. We want a test statistic that is a sensible one for the test, i.e., one that will be large when $F^*$ is 'too different' from $G^*$.
So, let our test statistic $T(X_1,\ldots,X_m,X_{m+1},\ldots,X_{m+n})$ be, say: \[ T:=T(X_1,\ldots,X_m,X_{m+1},\ldots,X_{m+n})= \text{abs} \left( \frac{1}{m} \sum_{i=1}^m X_i - \frac{1}{n} \sum_{i=m+1}^{m+n} X_i \right) \enspace . \]
(In words, we have chosen a test statistic that is the absolute value of the difference in the sample means. Note the limitation of this: if $F^*$ and $G^*$ have the same mean but different variances, our test statistic $T$ will not be large.)
Then the idea of a permutation test is as follows:

1. Pool the two samples and compute the observed value of the test statistic, $t_{obs} := T(x_1,\ldots,x_m,x_{m+1},\ldots,x_{m+n})$, from the observed data.
2. Under $H_0: F^*=G^*$, all $N := m+n$ observations come from the same DF, so every one of the $N!$ permutations (re-orderings) of the pooled data is equally likely. Computing the test statistic for each permutation gives values $t_1, t_2, \ldots, t_{N!}$, and the null distribution $\mathbf{P}_0$ of $T$ is discrete and uniform over $\{t_1, t_2, \ldots, t_{N!}\}$.
3. The P-value of the test is the probability, under $\mathbf{P}_0$, of getting a test statistic at least as large as $t_{obs}$.
Saying that $\mathbf{P}_0$ is discrete and uniform over $\{t_1, t_2, \ldots, t_{N!}\}$ means that each possible permutation has an equal probability of occurring (under the null hypothesis). There are $N!$ possible permutations and so the probability of any individual permutation is $\frac{1}{N!}$.
\[ \text{P-value} = \mathbf{P}_0 \left( T \geq t_{obs} \right) = \frac{1}{N!} \left( \sum_{j=1}^{N!} \mathbf{1} (t_j \geq t_{obs}) \right), \qquad \mathbf{1} (t_j \geq t_{obs}) = \begin{cases} 1 & \text{if } \quad t_j \geq t_{obs} \\ 0 & \text{otherwise} \end{cases}\]
This will make more sense if we look at some real data.
In 2008, Guo Yaozong and Chen Shun collected data on the diameters of coarse venus shells from New Brighton beach for a course project. They recorded the diameters for two samples of shells, one from each side of the New Brighton Pier. The data is given in the following two cells.
(Data cells: the left-of-pier and right-of-pier shell diameter samples.)
$(115 + 139)!$ is a very big number. Let's start small, and take a sub-sample of the shell data to demonstrate the permutation test concept: the first two shells from the left of the pier and the first one from the right.
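From the values used below, the sub-samples are (the variable names are our own):

```
# First two shell diameters from the left of the pier, first one from the right.
leftSub = [52, 54]
rightSub = [58]
```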
So now we are testing the hypotheses
\[\begin{array}{lcl}H_0&:& X_1,X_2,X_3 \overset{IID}{\sim} F^*=G^* \\H_1&:&X_1, X_2 \overset{IID}{\sim} F^*, \,\,X_3 \overset{IID}{\sim} G^*, F^* \neq G^*\end{array}\]
With the test statistic\[\begin{array}{lcl}T(X_1,X_2,X_3) &=& \text{abs} \left(\displaystyle\frac{1}{2}\displaystyle\sum_{i=1}^2X_i - \displaystyle\frac{1}{1}\displaystyle\sum_{i=2+1}^3X_i\right) \\ &=&\text{abs}\left(\displaystyle\frac{X_1+ X_2}{2} - \displaystyle\frac{X_3}{1}\right)\end{array}\]
Our observed data is $x_{obs} = (x_1, x_2, x_3) = (52, 54, 58)$,
and the realisation of the test statistic for this data is $t_{obs} = \text{abs}\left(\displaystyle\frac{52+54}{2} - \frac{58}{1}\right) = \text{abs}\left(53 - 58\right) = \text{abs}(-5) = 5$
Now we need to tabulate the permutations and their probabilities. There are $3! = 6$ possible permutations of three items. For larger samples, you could use the factorial function to calculate this:
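For example, using Sage's built-in factorial:

```
# Number of possible orderings of three items.
factorial(3)   # 6
```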
We said that under the null hypothesis (the samples have the same DF) each permutation is equally likely, so each permutation has probability $\displaystyle\frac{1}{6}$.
There is a way in Python (the language under the hood in Sage) to get all the permutations of a sequence:
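A sketch of such a cell, using itertools.permutations from the Python standard library:

```
# All 3! = 6 orderings of the three observed diameters.
from itertools import permutations

list(permutations([52, 54, 58]))
```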
We can tabulate the permutations, their probabilities, and the value of the test statistic that would be associated with each permutation:
| Permutation | $t$ | $\mathbf{P}_0(T=t)$ |
|---|---|---|
| (52, 54, 58) | 5 | $\frac{1}{6}$ |
| (52, 58, 54) | 1 | $\frac{1}{6}$ |
| (54, 52, 58) | 5 | $\frac{1}{6}$ |
| (54, 58, 52) | 4 | $\frac{1}{6}$ |
| (58, 52, 54) | 1 | $\frac{1}{6}$ |
| (58, 54, 52) | 4 | $\frac{1}{6}$ |
To calculate the P-value for our test statistic $t_{obs} = 5$, we need to look at how many permutations would give rise to test statistics that are at least as big, and add up their probabilities.
\[\begin{array}{lcl}\text{P-value} &=& \mathbf{P}_0(T \geq t_{obs}) \\ &=& \mathbf{P}_0(T \geq 5)\\ &=& \frac{1}{6} + \frac{1}{6} \\ &=& \frac{2}{6}\\ &=& \frac{1}{3} \\ &\approx & 0.333\end{array}\]
We could write ourselves a little bit of code to do this in Sage. As you can see, we could easily improve this to make it more flexible so that we could use it for different numbers of samples, but it will do for now.
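A sketch of the kind of code meant here (not the authors' exact cell), brute-forcing the P-value for the two-left/one-right sub-sample:

```
# Permutation-test P-value for samples of sizes 2 and 1.
from itertools import permutations

def tStat(x1, x2, x3):
    '''Absolute difference between the mean of the first two values and the third.'''
    return abs((x1 + x2)/2 - x3)

allObs = [52, 54, 58]
tObs = tStat(*allObs)                                  # observed statistic: 5
tValues = [tStat(*p) for p in permutations(allObs)]    # one t per permutation
pValue = sum(1 for t in tValues if t >= tObs)/len(tValues)
pValue                                                 # 2/6 = 1/3
```

This agrees with the hand calculation above.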
This means that there is little or no evidence against the null hypothesis (that the shell diameter observations are from the same DF).
The lowest possible P-value for a pooled sample of size $N=m+n$ is $\displaystyle\frac{1}{N!}$. Can you see why this is?
So with our small sub-samples the smallest possible P-value would be $\frac{1}{6} \approx 0.167$. If we are looking for a P-value $\leq 0.01$ to constitute very strong evidence against $H_0$, then we have to have a large enough pooled sample for this to be possible. Since $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$, it is good to have $N \geq 5$.
You try in class
Try copying and pasting our code and then adapting it to deal with a sub-sample (52, 54, 60) from the left of the pier and (58, 54) from the right side of the pier.
You will have to think about how the test statistic changes with sub-samples of sizes 3 and 2, and how many permutations ($5! = 120$) you now have to work through.
(add more cells if you need them)
We can use the sample function and the Python function for making permutations to experiment with a larger sample, say 5 of each.
We have met sample briefly already: it is part of the Python random module and it does exactly what you would expect from the name: it samples a specified number of elements randomly from a sequence.
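For example (the numbers here are arbitrary):

```
# Choose 5 distinct elements at random from the sequence 1, 2, ..., 10.
from random import sample

sample(range(1, 11), 5)
```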
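Here is a flexible sketch of our own (not the authors' cells) that computes the exact permutation-test P-value for any two small samples; with 5 shells from each side there are $10! = 3628800$ permutations to work through, which is why it takes a noticeable amount of time:

```
# Exact permutation-test P-value for two samples, using the absolute
# difference of sample means as the test statistic. Only feasible for
# small pooled samples, since it visits all (m+n)! permutations.
from itertools import permutations

def permTestPValue(sampleA, sampleB):
    m, n = len(sampleA), len(sampleB)
    pooled = list(sampleA) + list(sampleB)
    def tStat(seq):
        return abs(sum(seq[:m])/m - sum(seq[m:])/n)
    tObs = tStat(pooled)
    tValues = [tStat(p) for p in permutations(pooled)]
    return sum(1 for t in tValues if t >= tObs)/len(tValues)

# e.g. reproducing the small example above:
permTestPValue([52, 54], [58])   # 1/3
```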
As you can see from the length of time it takes to do the calculation for $(5+5)! = 10!$ permutations, we will be here a long time if we try to do this on all of both shell data sets. Monte Carlo methods to the rescue: we can use Monte Carlo integration to calculate an approximate P-value, and this will be our next topic.
You try
Try working out the P-value for a sub-sample (58, 63) from the left of the pier and (61) from the right (the two last values in the left-side data set and the last value in the right-side one). Do it as you would if given a similar question in the exam: you choose how much you want to use Sage to help and how much you do just with pen and paper.