Hypothesis testing and bootstrapping
This tutorial demonstrates some of the many statistical tests that
R can perform. It is impossible to give an exhaustive list of
such testing functionality, but we hope not only to provide several
examples but also to elucidate some of the logic of statistical hypothesis
tests with these examples.
T tests
In the
exploratory data analysis and regression
tutorial, we used exploratory techniques to identify 92 stars from the
Hipparcos
data set that are associated with the Hyades. We did this based on the values
of right ascension, declination, proper motion of right ascension, and proper motion of declination. We then excluded one additional
star with a large error of parallax measurement:
hip = read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
header=T,fill=T)
attach(hip)
filter1= (RA>50 & RA<100 &
DE>0 & DE<25)
filter2=(pmRA>90 & pmRA<130 &
pmDE>-60 & pmDE< -10)
filter = filter1 & filter2 & (e_Plx<5)
sum(filter)
In this section of the tutorial, we will compare these Hyades stars with the remaining stars in
the Hipparcos dataset on the basis of the color (B minus V) variable.
That is, we are comparing the groups in the boxplot below:
color=B.V
boxplot(color~filter,notch=T)
For ease of notation, we define vectors H and nH (for
"Hyades" and "not Hyades") that
contain the data values for the two groups.
H=color[filter]
nH=color[!filter & !is.na(color)]
In the definition of nH above, we needed to exclude the NA values.
A two-sample t-test may now be performed with a single line:
t.test(H,nH)
Because it is instructive and quite easy, we may obtain the same results
without resorting to the
t.test
function. First, we calculate the variances of the sample means
for each group:
v1=var(H)/92
v2=var(nH)/2586
c(var(H),var(nH))
The t statistic is based on the standardized difference between the two
sample means. Because the two samples are assumed independent, the
variance of this difference equals the sum of the individual variances
(i.e., v1+v2). Nearly always in a two-sample t-test, we wish to test
the null hypothesis that the true difference in means equals zero. Thus,
standardizing the difference in means involves subtracting zero and then
dividing by the square root of the variance:
tstat=(mean(H)-mean(nH))/sqrt(v1+v2)
tstat
To test the null hypothesis, this t statistic is compared to a t distribution. In a Welch
test, we assume that the variances of the two populations are not necessarily equal, and
the degrees of freedom of the t distribution are computed using the so-called
Satterthwaite approximation:
(v1 + v2)^2 / (v1^2/91 + v2^2/2585)
The two-sided p-value may now be determined by using the
cumulative distribution function of the t distribution,
which is given by the
pt function. Doubling the tail probability beyond the absolute value of the statistic gives the two-sided p-value:
2*pt(-abs(tstat),97.534)
Incidentally, one of the assumptions of the t-test, namely that each of the
two underlying populations is normally distributed, is almost certainly not true
in this example. However, because of the central limit
theorem, the t-test is robust against violations of this assumption; even if the
populations are not roughly normally distributed, the sample means are.
In this particular example, the Welch test is probably not necessary, since
the sample variances are so close that an assumption of equal variances is warranted.
Thus, we might conduct a slightly more restrictive t-test that assumes equal population
variances. Without going into the details here,
we merely present the R output:
t.test(H,nH,var.equal=T)
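For reference, here is a sketch of the equal-variance calculation done by hand; it should reproduce the t.test output above up to rounding (the names sp2 and tstat.eq are introduced here only for illustration):
sp2 = ((92-1)*var(H) + (2586-1)*var(nH)) / (92 + 2586 - 2)  # pooled variance estimate
tstat.eq = (mean(H) - mean(nH)) / sqrt(sp2*(1/92 + 1/2586)) # pooled two-sample t statistic
2*pt(-abs(tstat.eq), df = 92 + 2586 - 2)                    # two-sided p-value on 2676 degrees of freedom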
Permutation tests
Let's look at another example in which a t-test might be warranted.
The globular
cluster luminosity dataset gives measurements for two different
galaxies, Milky Way galaxy (MWG) and Andromeda galaxy (M31). The t-test
gives us a very simplistic way of comparing these galaxies, namely, by
the mean luminosity of their globular clusters. Because the apparent magnitudes
for M31 are offset by the distance modulus of 24.44, the t-test should compare
the difference in sample means with 24.44 instead of the usual zero:
gc = read.csv("http://astrostatistics.psu.edu/datasets/glob_clus.csv")
t.test(lum~gal, data=gc, mu=24.44)
As before, the t-statistic here (1.6254) is accompanied by a p-value (0.1074).
This p-value may be interpreted as follows: IF the two samples really are representative
samples from populations whose means differ by 24.44, THEN the probability of
obtaining a t-statistic at least as far from zero as 1.6254 would be
roughly 0.1074. Since most people don't consider 0.1074 to be a very small
probability, the conclusion here is that there is not strong statistical
evidence of a difference in true means other than 24.44.
The p-value of 0.1074 is an approximation
calculated under certain distributional assumptions
on the populations. An alternative method for calculating a p-value corresponding
to the t-statistic of 1.6254 is called a permutation test. Conceptually, the
permutation test follows directly from the definition (given in the preceding
paragraph) of a p-value. Let us assume that the luminosities
for the MWG globular clusters are distributed exactly like the luminosities
for the M31 globular clusters, except that the latter are all shifted down by
24.44. Under this assumption (called the null hypothesis), we may add 24.44 to
each of the M31 luminosities and then the combined samples are equivalent to one
large sample from the common distribution. In other words, if the null hypothesis
is true and we randomly
permute the labels (81 instances of MWG and 360 of M31), then the original sample
is no different than the permuted sample.
By definition, then, the p-value should be equal to the proportion of all
possible t-statistics resulting from label permutations that are at least
as extreme as the observed t-statistic (1.6254). Since there are far too
many such permutations (roughly 10^90)
for us to make this exact calculation, we will have to settle
for a random sample of permutations:
tlist=NULL
lum=gc[,"lum"]
gal=gc[,"gal"]
lum[gal=="M31"]=lum[gal=="M31"]-24.44
for(i in 1:5000) {
  s=sample(441,81)                          # choose a sample
  newtstat = t.test(lum[s], lum[-s])$stat
  tlist=c(tlist, newtstat)                  # add t-stat to list
}
Note: The above code is not built for speed!
By definition, the p-value is the probability of obtaining a test statistic more extreme than the
observed test statistic under the null hypothesis. Let's take a look at the null distribution of
the t statistic we just calculated, along with the observed value:
hist(tlist)
abline(v=c(-1,1)*1.6254,lty=2,col=2)
sum(abs(tlist)>=1.6254)/5000
This p-value is quite close to the 0.1074 obtained earlier.
Empirical distribution functions
Suppose we are curious about whether a given sample comes from a particular
distribution. For instance, how normal is the random sample 'tlist' of t statistics
obtained under the null hypothesis in the previous example? How normal (say) are the
colors of 'H' and 'nH'?
A simple yet very powerful graphical device is called a Q-Q plot, in which some quantiles
of the sample are plotted against the same quantiles of whatever distribution we have in mind.
If a roughly straight line results, this suggests that the fit is good. Roughly, a pth quantile of
a distribution is a value such that a proportion p of the distribution lies below that value.
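As a quick illustration, sample quantiles can be computed directly with the quantile function; for instance, the quartiles of the Hyades colors:
quantile(H, probs=c(.25,.5,.75))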
A Q-Q plot for normality is so common that there is a separate function,
qqnorm, that implements it. (Normality
is vital in statistics not merely because many common populations
are normally distributed --
which is actually not true in astronomy -- but because the central limit
theorem guarantees the
approximate normality of sample means.)
par(mfrow=c(2,2))
qqnorm(tlist,main="Null luminosity t statistics")
abline(0,1,col=2)
qqnorm(H,main="Hyades")
qqnorm(nH,main="non-Hyades")
Not surprisingly, the tlist variable appears very nearly normally distributed (more precisely,
it is nearly standard normal, as evidenced by the proximity of the Q-Q plot to the line
x=y, shown in red). As for H and nH, the distribution of B minus V exhibits moderate non-normality in each
case.
In the bottom right corner of the plotting window, let's reconstruct the Q-Q plot from scratch for
tlist. This is instructive because the same technique may be applied to any comparison distribution,
not just normal. If we consider the 5000 entries of tlist in increasing order, let's call the ith
value the ((2i-1)/10000)th quantile for all i from 1 to 5000. We merely graph this point against the
corresponding quantile of standard normal:
plot(qnorm((2*(1:5000)-1)/10000),sort(tlist))
par(mfrow=c(1,1)) # reset plotting window
Related to the Q-Q plot is a distribution function called the empirical
(cumulative) distribution function, or
EDF. (In fact, the EDF is almost the same as a Q-Q plot against a uniform distribution.)
The EDF is, by definition, the cumulative distribution function for the
discrete distribution represented by the sample itself -- that is, the distribution that puts mass
1/n on each of the n sample points. We may graph the EDF using the
ecdf
function:
plot(ecdf(tlist))
While it is generally very difficult to interpret the EDF directly, it is possible
to compare an EDF to a theoretical cumulative distribution function or to another EDF.
Among the statistical tests that implement such a comparison is the Kolmogorov-Smirnov
test, which is implemented by the R function
ks.test.
ks.test(tlist,"pnorm")
ks.test(H,nH)
Whereas the first result above gives a surprisingly small p-value,
the second result is not surprising; we already saw that H and nH have
statistically significantly different means. However, if we center each, we obtain
ks.test(H-mean(H),nH-mean(nH))
In other words, the Kolmogorov-Smirnov test finds no statistically significant evidence that
the distribution of B.V for the Hyades stars is anything other than a shifted version of
the distribution of B.V for the other stars. We can perform the same test for
the luminosities of globular clusters from both the Milky Way and Andromeda:
lum.mwg=gc[gal=="MWG","lum"]
lum.m31=gc[gal=="M31","lum"]
ks.test(lum.mwg,
lum.m31-24.44)
The above test does give a statistically significant difference. The
K-S test tests whether the M31 dataset, shifted by 24.44,
comes from the same distribution as the MWG
dataset. The t-test, on the other hand, only tests whether these
distributions have the same mean.
Chi-squared tests for categorical data
We begin with a plot very similar to one seen in the
exploratory data analysis and regression
tutorial:
bvcat=cut(color, breaks=c(-Inf,.5,.75,1,Inf))
boxplot(Vmag~bvcat, varwidth=T,
ylim=c(max(Vmag),min(Vmag)),
xlab=expression("B minus V"),
ylab=expression("V magnitude"),
cex.lab=1.4, cex.axis=.8)
The cut values for bvcat are based roughly on the quartiles of the B minus V variable.
We have created, albeit artificially, a second categorical
variable ("filter", the Hyades indicator, is the first). Here is
a summary of the dataset based only on these two variables:
table(bvcat,filter)
Note that the Vmag variable is irrelevant in the table above.
To perform a chi-squared test of the null hypothesis that the true population
proportions falling in the four categories are the same for both the Hyades and
non-Hyades stars, use the
chisq.test
function:
chisq.test(bvcat,filter)
Since we already know these two groups differ with respect to the B.V variable,
the result of this test is not too surprising. But it does give a qualitatively different
way to compare these two distributions than simply comparing their means.
The p-value produced above is based on the fact that the chi-squared statistic is
approximately distributed like a true chi-squared distribution (on 3 degrees of freedom,
in this case) if the null hypothesis is true. However, it is possible to obtain
exact p-values, if one wishes to calculate the chi-squared statistic for all possible
tables of counts with the same row and column sums as the given table. Since this is rarely
practical computationally, the exact p-value may be approximated using a Monte Carlo method
(just as we did earlier for the permutation test).
Such a method is implemented in the
chisq.test
function:
chisq.test(bvcat,filter,sim=T,B=50000)
The two different p-values we just generated are numerically similar but based
on entirely different mathematics. The difference may be summed up as follows:
The first method produces the exact value of an approximate p-value, whereas the second
method produces an approximation to the exact p-value!
The test above is usually called a chi-squared test of homogeneity. If we observe
only one sample, but we wish to test whether the categories occur in some pre-specified
proportions, a similar test (and the same R function) may be applied. In this case, the test
is usually called a chi-squared test of goodness-of-fit.
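As a sketch of that goodness-of-fit version (the proportions below are invented purely for illustration), we could test whether the four color categories each contain a quarter of all the stars:
chisq.test(table(bvcat), p=c(.25,.25,.25,.25))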
Nonparametric bootstrapping of regression standard errors
Let us consider a linear model for the relationship
between DE and pmDE among the 92 Hyades stars:
x=DE[filter]
y=pmDE[filter]
plot(x,y,pch=20)
model1=lm(y ~ x)
abline(model1,lwd=2,col=2)
The red line on the plot is the usual least-squares line, for which estimation is
easy and asymptotic theory gives easy-to-calculate standard errors for the
coefficients:
summary(model1)$coef
However, suppose we wish to use a resistant regression method such as
lqs.
library(MASS)
model2=lqs(y ~ x)
abline(model2,lwd=2,col=3)
model2
In this case, it is not so easy to obtain standard errors for the coefficients.
Thus, we will turn to bootstrapping. In a standard, or nonparametric, bootstrap,
we repeatedly draw samples of size 92 from the empirical distribution of the data, which
in this case consist of the (DE, pmDE) pairs. We use
lqs
to fit a line to each sample, then compute the sample covariance of the resulting
coefficient vectors. The procedure works like this:
model2B=NULL
for (i in 1:200) {
s=sample(92,replace=T)
model2B=rbind(model2B,
lqs(y[s]~x[s])$coef)
}
We may now find the sample covariance matrix for model2B. The (marginal) standard errors
of the coefficients are obtained as the square roots of the diagonal entries of this matrix:
cov(model2B)
se=sqrt(diag(cov(model2B)))
se
The logic of the bootstrap procedure is that we are estimating an approximation
of the true standard errors. The approximation involves replacing the true distribution
of the data (unknown) with the empirical distribution of the data. This approximation
may be estimated with arbitrary accuracy by a Monte Carlo approach, since the empirical
distribution of the data is known and in principle
we may sample from it as many times as we wish. In other words, as the bootstrap sample
size increases, we get a better estimate of the true value of the approximation.
On the other hand, the quality of this approximation depends on the original sample size (92, in
our example) and
there is nothing we can do to change it.
An alternative way to generate a bootstrap sample
in this example is by generating a new value of each response variable (y) by adding the predicted
value from the original lqs model to a randomly selected residual from the original set of residuals.
Thus, we resample not the entire bivariate structure but merely the residuals. As an exercise, you
might try implementing this approach in R. Note that this approach is not a good idea if you have reason
to believe that the distribution of residuals is not the same for all points. For instance, if there is
heteroscedasticity or if the residuals contain structure not explained by the model, this residual resampling
approach is not warranted.
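For reference, here is one possible sketch of that residual-resampling exercise (the names fits, resids, and model2C are ours, and this is certainly not the only way to code it):
fits = fitted(model2)        # predicted values from the original lqs fit
resids = residuals(model2)   # residuals from the original lqs fit
model2C = NULL
for (i in 1:200) {
  ynew = fits + sample(resids, 92, replace=T)   # new response = fitted value + resampled residual
  model2C = rbind(model2C, lqs(ynew ~ x)$coef)
}
sqrt(diag(cov(model2C)))     # bootstrap standard errors under residual resampling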
Using the boot package in R
There is a
boot
package in R that contains many functions relevant to bootstrapping.
As a quick example, we will show here how to obtain the same kind of bootstrap estimates
obtained above (for the lqs model of pmDE regressed on DE for the Hyades stars).
library(boot)
mystat=function(a,b)
lqs(a[b,2]~a[b,1])$coef
model2B.2 = boot(cbind(x,y),
mystat, 200)
names(model2B.2)
As explained in the help file, the
boot
function requires as input a function that accepts as arguments the whole dataset
and an index that references an observation from that dataset. This is why we defined
the mystat function above.
To see output similar to that obtained earlier for the model2B object, look in
model2B.2$t:
cov(model2B.2$t)
sqrt(diag(cov(model2B.2$t)))
Compare with the output provided by
print.boot and the plot produced by
plot.boot:
model2B.2
plot(model2B.2)
Another related function, for producing bootstrap confidence intervals, is
boot.ci.
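For instance (assuming the percentile method is acceptable here), a 95% interval for the slope, which is the second coefficient returned by mystat:
boot.ci(model2B.2, type="perc", index=2)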
Parametric bootstrapping of regression standard errors
We now return to the regression problem studied earlier.
Sometimes, resampling is done from a theoretical distribution rather than
from the original sample. For instance, if simple linear regression is applied to the
regression of pmDE on DE, we obtain a parametric estimate of the distribution of the residuals,
namely, normal with mean zero and standard deviation estimated from the regression:
summary(model1)
Remember that model1 was defined above as lm(y~x). We observe a residual standard
error of 4.449.
A parametric bootstrap scheme proceeds by simulating a new set of pmDE (or y) values using the
model
y = 21.9 - 3.007*x + 4.449*rnorm(92)
Then, we refit a linear model using y as the new response,
obtaining slightly different values of the regression coefficients.
If this is repeated, we obtain an approximation of the joint distribution of the regression coefficients
for this model.
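Here is a minimal sketch of this scheme, taking the coefficients and residual standard error from model1 rather than typing them in (the names bhat, sig, and model1B are ours):
bhat = coef(model1)             # intercept and slope from the least-squares fit
sig = summary(model1)$sigma     # residual standard error (about 4.449)
model1B = NULL
for (i in 1:200) {
  ynew = bhat[1] + bhat[2]*x + sig*rnorm(92)    # simulate new responses from the fitted normal model
  model1B = rbind(model1B, lm(ynew ~ x)$coef)
}
cov(model1B)                    # approximate joint covariance of the regression coefficients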
Naturally, the same approach could be tried with other regression methods such as those discussed
in the
EDA and regression tutorial,
but careful thought should be
given to the parametric model used to generate the new residuals. In the normal case discussed here,
the parametric bootstrap is simple, but it is really not necessary
because standard linear regression already gives a very good approximation to the joint distribution
of the regression coefficients when errors are homoscedastic and normal.
One possible use of this method is in a model that assumes the absolute residuals are exponentially
distributed, in which case least absolute deviation regression as discussed in the
EDA and regression tutorial can be justified. The reader is encouraged to
implement a parametric bootstrap using the
rq
function found in the "quantreg" package.
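One hedged sketch of that exercise, assuming double-exponential (Laplace) errors so that the absolute residuals are exponential with scale estimated by their mean (the names model3, b.scale, and model3B are ours):
library(quantreg)
model3 = rq(y ~ x)                     # least absolute deviation fit (tau = 0.5 by default)
b.scale = mean(abs(resid(model3)))     # estimated scale of the exponential absolute residuals
model3B = NULL
for (i in 1:200) {
  # a random sign times an exponential magnitude gives double-exponential errors
  ynew = fitted(model3) + sample(c(-1,1), 92, replace=T) * rexp(92, rate=1/b.scale)
  model3B = rbind(model3B, rq(ynew ~ x)$coef)
}
sqrt(diag(cov(model3B)))               # parametric bootstrap standard errors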
Kolmogorov-Smirnov: Bootstrapped p-values
Earlier, we generated 5000 samples from the null
distribution of the t-statistic for testing the null hypothesis that
the distribution of MWG luminosities is the same as the distribution
of M31 luminosities, shifted by 24.44. Let's do a similar thing for the
color comparison between Hyades and non-Hyades:
tlist2=NULL
all=c(H,nH)
for(i in 1:5000) {
s=sample(length(all),92) # choose a sample of 92 of the 2678 combined values
tlist2=c(tlist2, t.test(all[s],all[-s],
var.eq=T)$stat) # add t-stat to list
}
Let's look at two different ways of assessing whether the values in
the tlist2 vector appear to be a random sample from a standard normal
distribution. The first is graphical, and the second uses the
Kolmogorov-Smirnov test.
plot(qnorm((2*(1:5000)-1)/10000),
sort(tlist2))
abline(0,1,col=2)
ks.test(tlist2,"pnorm")
The graphical method does not show any major deviation from standard
normality, but this graphical "test" is better able to detect departures
from an overall normal shape than to detect, say, a shift of the mean away from
zero. The following procedures illustrate this fact:
plot(qnorm((2*(1:5000)-1)/10000),
sort(tlist2)-mean(tlist2))
abline(0,1,col=2)
ks.test(tlist2-mean(tlist2),"pnorm")
The plot shows no discernible difference from before, but we see a vast
difference in the new p-value returned by the Kolmogorov-Smirnov test (!!).
However, let us now consider the p-value returned by the last use of ks.test
above. It is not quite valid because the theoretical null distribution
against which we are testing depends upon an estimate (the mean) derived from the data.
To get a more accurate p-value, we may use a bootstrap approach.
First, obtain the Kolmogorov-Smirnov test statistic from the test above:
obs.ksstat=ks.test(tlist2-mean(tlist2),
"pnorm")$stat
Now we'll generate a new bunch of these statistics under the null hypothesis
that tlist2 really represents a random sample from some normal distribution
with variance 1 and unknown mean:
random.ksstat=NULL
for(i in 1:1000) {
x=rnorm(5000)
random.ksstat=c(random.ksstat,
ks.test(x,pnorm,mean=mean(x))$stat)
}
Now let's look at a histogram of the test statistics and estimate a p-value:
hist(random.ksstat,nclass=40)
abline(v=obs.ksstat,lty=2,col=2)
mean(random.ksstat>=obs.ksstat)
Note that the approximate p-value above is smaller than the original
p-value reported by ks.test(tlist2-mean(tlist2), "pnorm") earlier,
though it is still not small enough to provide strong evidence that the
tlist2 sample is not normal with unit variance.
The bootstrap procedure above relied on repeatedly drawing new samples.
Since these samples were drawn from a theoretical population (in this case, a normal
distribution with parameters that may be estimated from the data), it
is considered a parametric bootstrap procedure.
In a nonparametric bootstrap procedure, the resamples are taken from
the empirical distribution of the data (that is, from a distribution that
places mass 1/n on each of the n observed values). It is possible to
implement a nonparametric bootstrap procedure to calculate a p-value
for the Kolmogorov-Smirnov test here, but to do so is a bit tricky.
The "straightforward" method for obtaining the null distribution
of K-S statistics would be repeated usage of the following:
s=sample(5000,replace=T)
ks.test(tlist2[s],pnorm,mean=mean(tlist2[s]))$stat
However, this statistic has a bias that must be corrected. To make this
correction, it is necessary to rewrite the ks.test function itself. There is
no way to use the output of the ks.test function to produce the bias-corrected
statistics.