Likelihood computations and random numbers
It is often necessary to simulate random numbers in R. There are
many functions available to accomplish this and related tasks
such as evaluating the density, distribution function, and quantile function
of a distribution.
Maximum likelihood estimation in a simple case
Let's reconsider the flux measurements in the gamma ray burst dataset.
grb <- read.table("http://astrostatistics.psu.edu/datasets/GRB_afterglow.dat",
   header=T, skip=1)
flux <- grb[,2]
hist(flux)
The histogram suggests that the univariate distribution has roughly
the shape of an exponential distribution (we'll speak more about what this
means later). Let us now replot these data in another, particularly common, way
that is also suggestive of an exponential distribution.
As a first step, let us calculate something akin to the (x,y) coordinates
of the empirical distribution function -- the function that has a jump
of size 1/n at every one of the sorted data points.
n <- length(flux)
xx=sort(flux)
yy=(1:n)/n
We could now obtain the empirical cdf by connecting the (xx,yy) points using
a stairstep pattern. However, we'll look at these points slightly differently.
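As a brief aside, the stairstep version itself is easy to draw in R, either by setting
type="s" in plot or by using the built-in ecdf function; a minimal sketch:
plot(xx, yy, type="s", xlab="flux", ylab="F(flux)") # stairstep empirical CDF
plot(ecdf(flux)) # essentially the same plot, via the built-in function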
The exponential distribution has a distribution function given by
F(x) = 1-exp(-x/mu) for positive x,
where mu>0 is a scalar parameter equal to the mean of the
distribution. This implies among other things that log(1-F(x)) = -x/mu is
a linear function of x in which the slope is the negative
reciprocal of the mean. Let us then look for the characteristic linear
pattern if we plot log(1-F(x)) against x using the empirical distribution
function for F:
plot(xx, log(1-yy+1/n),
xlab="flux",
ylab="log(1-F(flux))")
You may recall from the EDA and regression
tutorial that this plot looks like the plot of time vs. flux that
we produced as part of a regression analysis. This is only a coincidence;
the two plots are fundamentally different. The plot seen here is
a univariate plot, whereas the time vs. flux plot was a
bivariate plot; we are ignoring the time variable here.
The plot certainly looks linear, so let us proceed on the assumption that
the flux data are a sample from an exponential distribution with unknown
parameter mu.
The overriding question of this section is this: How shall we estimate
mu?
As mentioned above, mu is equal to the mean of this population. For a quick
refresher on some probability theory, let us recall why this is so:
The first step in going from the distribution function F(x) = 1 - exp(-x/mu)
to the mean, or expectation, is to obtain the density function by
differentiating: f(x) = exp(-x/mu)/mu. Notice that we typically use F(x)
to denote the distribution function and f(x) to denote the density function.
Next, we integrate x*f(x) over the interval 0 to infinity, which gives the
mean, mu.
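If you would like to verify this integral numerically rather than by calculus, R's
integrate function can do so for any particular value of mu; here is a quick sketch,
with mu.test=30 as an arbitrary value chosen only for illustration:
mu.test <- 30 # an arbitrary value of mu, for illustration only
f <- function(x) exp(-x/mu.test)/mu.test # the exponential density
integrate(function(x) x*f(x), 0, Inf) # should return (approximately) mu.test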
Since mu is the population mean, it is intuitively appealing to simply
estimate mu using the sample mean. This method, in which we match the
population moments to the sample moments and then solve for the parameter
estimators, is called the method of moments. Though it is a well-known
procedure, we focus instead on a much more widely used method (for good
reason) called maximum likelihood estimation.
The first step in maximum likelihood estimation is to write down the likelihood
function, which is nothing but the joint density of the dataset viewed as
a function of the parameters. Next, we typically take the log, giving what
is commonly called the loglikelihood function. Remember that all logs are
natural logs unless specified otherwise.
The loglikelihood function in this case is
(with apologies for the awkward notation)
l(mu) = -n log(mu) - x1/mu - ... - xn/mu
A bit of calculus reveals that l(mu) is therefore maximized at the sample
mean. Thus, the sample mean is not only the method of moments estimator
in this case but the maximum likelihood estimate as well.
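As a quick numerical check on this bit of calculus (not needed for the analysis), we can
code the loglikelihood directly and maximize it with R's optimize function; the name expll
and the search interval below are our own arbitrary choices, the interval simply being wide
enough to contain the sample mean:
expll <- function(mu) -n*log(mu) - sum(flux)/mu # l(mu) as written above
optimize(expll, interval=c(1,200), maximum=TRUE) # numerical maximizer of l(mu)
mean(flux) # should agree closely with the maximizer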
In practice, however, it is sometimes the case that the linear-looking plot
produced earlier is used to estimate mu. As we remarked, the negative
reciprocal of the slope should give mu, so there is a temptation to fit
a straight line using, say, least-squares regression, then use the
resulting slope to estimate mu.
mean (flux) # This is the MLE
lm(log(1-yy+1/n) ~ xx)
1/.03025 # An alternative estimator (minus the reciprocal of the fitted slope)
There is a possible third method that I am told is sometimes used for some
kinds of distributions. We start with a histogram, which may be viewed as
a rough approximation of the density:
h <- hist(flux)
All of the information used to produce the histogram is now stored in the
h object, including the midpoints of the bins and their heights on a density
scale (i.e., a scale such that the total area of the histogram equals one).
To see how to use this information, note that the logarithm of the density
function is log f(x) = -log(mu) - x/mu, which is a linear function of x. Thus,
plotting the logarithm of the density against x might be expected to give a line.
counts <- h$counts
dens <- h$density[counts>0]
midpts <- h$mids[counts>0]
plot(midpts, log(dens))
When using linear regression to estimate the slope of the linear pattern just
produced, I am told that it is standard to weight each point by the number of
observations it represents, which is roughly proportional to the reciprocal of the
variance of the estimated proportion of points falling in that bin.
We can obtain both the weighted and unweighted versions here. We can then
obtain an estimate of mu using either the intercept, which is -log(mu), or
the slope, which is -1/mu:
m1 <- lm(log(dens) ~ midpts)
m2 <- lm(log(dens) ~ midpts,
weights=counts[counts>0])
exp(-m1$coef[1]) # This is one estimate
-1/m1$coef[2] # This is another
exp(-m2$coef[1]) # Yet another
-1/m2$coef[2] # And another
We have thus produced no fewer than six distinct estimators of mu (seven, if the MLE and
the method of moments estimator, which coincide in this case, are counted separately).
How should we choose one of them?
There are a couple of ways to answer this question. One is to appeal to
statistical theory. The method of maximum likelihood estimation
is backed by a vast statistical literature that shows it has certain properties
that may be considered optimal. The method of moments is also
a well-established method, but arguably with less general theory behind it than
the method of maximum likelihood. The regression-based methods, on the
other hand, are all essentially ad hoc.
A second way to choose among estimators is to run a simulation study in which
we repeatedly simulate datasets (whose parameters are then known to us) and
test the estimators to see which seems to perform best. In order to
do this, we will need to be able to generate random numbers, which is the next
topic in this tutorial.
Generating random numbers in R
First, semantics: "Random numbers" does not refer solely to uniform
numbers between 0 and 1, though this is what "random numbers" means
in some contexts. We are mostly interested in generating non-uniform
random numbers here.
R handles many common distributions easily. To see a list, type
help.search("distribution",package="stats")
Let's consider the well-known normal distribution as an example:
?Normal
The four functions 'rnorm', 'dnorm',
'pnorm', and 'qnorm' give random normals, the normal density (sometimes called the differential
distribution function), the normal cumulative distribution function (CDF), and
the inverse of the normal CDF (also called the quantile function), respectively. Almost all
of the other distributions have similar sets of four functions. The 'r' versions are
rbeta, rbinom, rcauchy, rchisq, rexp, rf, rgamma, rgeom, rhyper, rlogis, rlnorm,
rmultinom, rnbinom, rnorm, rpois, rsignrank, rt, runif, rweibull, and rwilcox
(there is no rtukey because generally only ptukey and qtukey are needed).
As an example, suppose we wish
to simulate a vector of 10
independent, standard (i.e., mean 0 and standard deviation 1) normal random variables. We use the
rnorm function for this
purpose, and its defaults are mean=0 and standard deviation=1. Thus, we may simply type
rnorm(10)
Suppose we wish to simulate a large number of normal random variables with mean 10 and
standard deviation 3, then check a histogram against
two normal density functions, one based on the true parameters and one based on estimates, to
see how it looks. We'll use 'col=2, lty=2, lwd=3' to make the curve based on the true parameters
red (color=2), dashed (line type=2), and wider than normal (line width=3). Also note that
we are requesting 100 bins in the histogram (nclass=100) and putting it on the same vertical scale
as the density functions (freq=FALSE).
z=rnorm(200000, mean=10, sd=3)
hist(z,freq=FALSE,nclass=100)
x=seq(min(z),max(z),len=200)
lines(x,dnorm(x, mean=10, sd=3),col=2, lty=2, lwd=3)
lines(x,dnorm(x,mean=mean(z),sd=sqrt(var(z))))
We can find out what proportion of the deviates lie outside 3 standard deviations from the true mean,
a common cutoff used by physical scientists. We can also see the true theoretical proportion:
sum(abs((z-10)/3)>3)/length(z)
2*pnorm(-3)
In the first line above, we are using
sum
to count the number of TRUE's in the logical vector (abs((z-10)/3)>3).
This works because logical values are coerced to 0's and 1's when necessary.
The function
dnorm
has a closed form: With mean=0 and sd=1, dnorm(x)
equals exp(-x^2/2)/sqrt(2*pi). By contrast,
the CDF, given by
pnorm,
has no closed form and must be numerically approximated. By
definition, pnorm(x) equals the integral of dnorm(t) as t ranges from minus infinity to x. To find a p-value
(i.e., the probability of observing a statistic more extreme than the one actually observed), we use
pnorm; to construct a confidence interval (i.e., a range of reasonable values for the true parameter),
we use the inverse,
qnorm.
pnorm(1:3)-pnorm(-(1:3))
qnorm(c(.05,.95))
The first line above summarizes the well-known 68, 95, 99.7 rule for normal distributions (these are the approximate
proportions lying within 1, 2, and 3 standard deviations from the mean). The second
line gives the critical values used to construct a 90% confidence interval for a parameter when its
estimator is approximately normally distributed.
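For instance, a minimal sketch of a 90% confidence interval for the mean of the z values
simulated earlier, assuming the sample mean is approximately normally distributed:
mean(z) + qnorm(c(.05,.95))*sd(z)/sqrt(length(z))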
Let us now briefly consider an example of a discrete distribution,
which means a distribution on a finite or countably
infinite set (as opposed to a continuous distribution like the normal).
The
Poisson
distribution, which has a single real-valued parameter lambda, puts all of
its probability mass on the nonnegative integers. A Poisson distribution, often used
to model data consisting of counts, has mean
and variance both equal to lambda.
k=0:10
dpois(k,lambda=2.5) # or equivalently,
exp(-2.5)*2.5^k/factorial(k)
Next, simulate some Poisson variables:
x=rpois(10000,lambda=2.5)
table(x)
mean(x)
var(x)
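The empirical frequencies from the simulation should agree closely with the theoretical
Poisson probabilities; one way to place them side by side (just a quick sketch, reusing
the vector k defined above):
obs <- table(factor(x, levels=k))/length(x) # empirical frequencies for k = 0,...,10
round(rbind(observed=obs, expected=dpois(k, lambda=2.5)), 4)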
The power-law or Pareto distribution
A commonly used distribution in astrophysics is the power-law distribution,
more commonly known in the statistics literature as the
Pareto distribution. There are no built-in R functions for dealing with this
distribution, but because it is an extremely simple distribution it is easy
to write such functions.
The density function for the Pareto is f(x) = a*b^a / x^(a+1) for x > b.
Here, a and b are fixed positive parameters, where
b is the minimum possible value.
As an example, consider the log N = -1.5 * log S
relationship, where S is the apparent brightness and N
is the number of standard candles randomly located in transparent space. Thus, a Pareto
distribution with a = 1.5 is a reasonable, if simplistic,
model for the brightness of observed standard
candles in space. The b parameter merely reflects the choice of units of measurement.
As another example, consider the Salpeter function, the simple but widely
known expression of the initial mass function (IMF), in which the mass of
a randomly selected newly formed star has a Pareto distribution with parameter
a=1.35.
It turns out that a Pareto random variable is simply b*exp(X), where
X is an
exponential
random variable with rate=a (i.e., with mean=1/a).
However, rather than exploiting this simple relationship,
we wish to build functions for the Pareto distribution from
scratch. Our default values, which may be changed by the user, will be a=0.5 and b=1.
dpareto=function(x, a=0.5, b=1) a*b^a/x^(a+1)
Next, we integrate the density function to obtain the distribution function,
which is
F(x) = 1-(b/x)^a for x >= b (and naturally F(x) = 0 for x < b):
ppareto=function(x, a=0.5, b=1) (x > b)*(1-(b/x)^a)
Note that (x > b) in the above function
is coerced to numeric, either 0 or 1.
Inverting the distribution function gives the quantile function. The following
simplistic function is wrong unless 0<u<1, so a better-designed function should do some
error-checking.
qpareto=function(u, a=0.5, b=1) b/(1-u)^(1/a)
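For instance, a slightly more defensive version (merely a sketch of one possible design;
the name qpareto2 is ours) might look like this:
qpareto2=function(u, a=0.5, b=1) {
  if (any(u <= 0 | u >= 1)) stop("u must lie strictly between 0 and 1")
  b/(1-u)^(1/a)
}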
Finally, to simulate random Pareto random variables, we use the fact that whenever the quantile function
is applied to a uniform random variable, the result is a random variable with the desired distribution:
rpareto=function(n, a=0.5, b=1) qpareto(runif(n),a,b)
Creating functions in R, as illustrated above, is a common procedure. Note that each of the arguments of
a function may be given a default value, which is used whenever the user calls the function without specifying
the value of this parameter. Also note that each of the above functions consists of only a single line; however,
longer functions may be created by enclosing them inside curly braces { }.
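As a quick sanity check on these new functions, recall the relationship noted earlier:
b*exp(X) has a Pareto distribution with parameters a and b when X is exponential with
rate a. The following sketch compares the two simulation methods with a quantile-quantile
plot of the logged values; if the functions are correct, the points should lie close to
the line y=x:
set.seed(1) # optional, for reproducibility
z1=rpareto(10000, a=0.5, b=1)
z2=exp(rexp(10000, rate=0.5)) # b*exp(X) with b=1 and X exponential with rate a=0.5
qqplot(log(z1), log(z2))
abline(0, 1, col=2)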
A few simple plots
The commands below create plots related to the four functions just created.
par(mfrow=c(2,2))
x=seq(1,50,len=200)
plot(x,dpareto(x),type="l")
plot(x,ppareto(x),type="l",lty=2)
u=seq(.005,.9,len=200)
plot(u,qpareto(u),type="l",col=3)
z=rpareto(200)
dotchart(log10(z), main="200 random logged Pareto deviates",
cex.main=.7)
par(mfrow=c(1,1))
The above commands illustrate some of the many plotting capabilities of R.
The par
function sets many graphical
parameters, for instance, 'mfrow=c(2,2)', which divides the plotting window into a matrix
of plots, set here to two rows and two columns. In the
plot
commands, 'type' is set here to
"l" for a line plot; other common options are "p" for points (the default),
"b" for connected dots, and "n" for nothing (to create
axes only). Other options used:
'lty' sets the line type (1=solid, 2=dashed, etc.), 'col' sets color (1=black, 2=red,
3=green, etc.), 'main' puts a string into the plot title, and 'cex.main' sets the text magnification.
Type
'? par'
to see a list of the many plotting parameters and options.
A simulation study
Let us posit that the random variable X has a Pareto distribution with parameters
a=1.35 and b=1. We will simulate multiple datasets with this property and
then apply several different estimation methods to each. To simplify matters,
we will assume that the value of b is known, so that the goal is estimation
of a.
To evaluate the
estimators, we will look at their mean squared error (MSE), which is just
what it sounds like: The average of the squared distances from the estimates
to the true parameter value of a=1.35.
To illustrate the estimators we'll evaluate, let's start by simulating a
single dataset of size 100:
d=rpareto(100, a=1.35)
Here are the estimators we'll consider:
- The maximum likelihood estimator. Since the density with b=1 is
given by f(x) = a/x^(a+1), the loglikelihood function is
l(a) = n log(a) - (a+1)(log x1 + log x2 + ... + log xn).
The maximizer may be found using calculus to equal n/(log x1 + ... + log xn).
For our dataset, this may be found as follows:
1/mean(log(d))
We used the sum of logarithms above where we could have used
the equivalent mathematical expression given by the log of the product.
Sometimes the former method gives more numerically stable answers for
very large samples, though in this case "100/log(prod(d))" gives exactly
the same answer.
- The method of moments estimator. By integrating, we find that
the mean of the Pareto distribution with b=1 is equal to a/(a-1). (This
fact requires that a be greater than 1.) Setting a/(a-1) equal to the sample
mean and solving for a gives 1/(1-1/samplemean) as the estimator.
1/(1-1/mean(d))
- What we'll call the EDF (empirical distribution function) estimator.
Since
log(1-F(x)) equals -a log(x) when b=1, by plotting the sorted values of log(d)
against log(n/n), log((n-1)/n), ..., log(1/n), we should observe roughly a
straight line. We may then use least-squares regression to find the slope
of the line, which is our estimate of -a:
lsd <- log(sort(d))
lseq <- log((100:1)/100)
plot(lsd, lseq)
tmp=lm(lseq ~ lsd)
abline(tmp,col=2)
-tmp$coef[2]
- What we'll call the unweighted histogram estimator. Since log f(x) equals
log(a) - (a+1) log(x) when b=1, if we plot the values of log(d) against
histogram-based estimates of the log-density function, we should observe
roughly a straight line with slope -(a+1) and intercept log(a). Let's use
only the slope, since that is the feature that is most
often the focus of a plot that is supposed to illustrate a power-law
relationship.
hd <- hist(d,nclass=20,plot=F)
counts <- hd$counts
ldens <- log(hd$density[counts>0])
lmidpts <- log(hd$mids[counts>0])
plot(lmidpts, ldens)
tmp=lm(ldens~lmidpts)
abline(tmp,col=2)
-1-as.numeric(tmp$coef[2])
- What we'll call the weighted histogram estimator. Exactly the
same as the unweighted histogram estimator, but we'll estimate the slope using
weighted least squares instead of ordinary least squares. The weights should
be proportional to the bin counts.
plot(lmidpts, ldens)
tmp=lm(ldens~lmidpts,
weights=counts[counts>0])
abline(tmp,col=2)
-1-as.numeric(tmp$coef[2])
Now let's write a single function that will take a vector of data as its
argument, then return all five of these estimators.
five <- function(d) {
lsd <- log(sort(d))
n <- length(d)
lseq <- log((n:1)/n)
m1=lm(lseq ~ lsd)$coef
hd <- hist(d,nclass=n/5,plot=F)
counts <- hd$counts
ldens <- log(hd$density[counts>0])
lmidpts <- log(hd$mids[counts>0])
m2 <- lm(ldens~lmidpts)$coef
m3 <- lm(ldens~lmidpts,
weights=counts[counts>0])$coef
out <- c(max.lik=1/mean(log(d)),
meth.mom=1/(1-1/mean(d)),
EDF=-as.numeric(m1[2]), unwt.hist=-1-as.numeric(m2[2]),
wt.hist=-1-as.numeric(m3[2]))
return(out)
}
The very last line of the function body, "return(out)", specifies the value that will be
returned. (Since an R function returns the value of its final expression, we could also
have written simply "out".)
Let's test this function on our dataset:
five(d)
There is no good way to compare these estimators based on a single
sample like this. We now need to simulate multiple samples. Let's begin by
taking n=100.
n.100 <- NULL
for(i in 1:250) {
dd <- rpareto(100, a=1.35)
n.100 <- rbind(n.100, five(dd))
}
Now we can get estimates of the biases of the estimators
(their expectations minus
the true parameter) and their variances. Note that we'll use the
biased formula for the variance (i.e., the one that uses n instead of n-1
in the denominator) for a technical reason explained below.
bias.100 <- apply(n.100,2,mean) - 1.35
var.100 <- apply(n.100,2,var) * (249/250)
It is a mathematical identity that the mean squared error (MSE) equals the
square of the bias plus the variance, as we may check numerically for (say)
the first column of n.100. However, the identity only works if we use
the biased formula for the variance, which is why we used the multiplier
(249/250) above.
mean((n.100[,1]-1.35)^2)
bias.100[1]^2 + var.100[1]
Thus, we can construct the MSEs and view the results as follows:
mse.100 <- bias.100^2 + var.100
rbind(bias.100, var.100, mse.100)
Finally, let's repeat the whole experiment using samples of size 200.
n.200 <- NULL
for(i in 1:250) {
dd <- rpareto(200, a=1.35)
n.200 <- rbind(n.200, five(dd))
}
bias.200 <- apply(n.200,2,mean) - 1.35
var.200 <- apply(n.200,2,var) * (249/250)
mse.200 <- bias.200^2 + var.200
rbind(bias.200, var.200, mse.200)
EM algorithms
The class of algorithms called EM algorithms is enormously important
in statistics. There are many, many different specific algorithms that
can be called EM algorithms, but they have this in common: They seek to
iteratively maximize a likelihood function in a situation in which
the data may be thought of as incompletely observed.
The name "EM algorithm" has its genesis in a seminal 1977 paper by Dempster,
Laird, and Rubin in
the Journal of the Royal Statistical Society, Series B. Many distinct algorithms
published prior to 1977 were examples of EM, including the Lucy-Richardson
algorithm for image deconvolution that is apparently quite well known
in astronomy. The major contribution of Dempster et al was to unify these
algorithms and prove certain facts about them. Interesting historical note:
They even "proved" an untrue fact that was refuted six years (!) later (even
thirty years ago, publications in statistics churned through the pipeline
at a snail's pace).
We'll derive a simple EM algorithm on a toy example: We'll pretend that some
of the gamma ray burst flux measurements were right-censored, as follows:
cflux <- flux
cflux[flux>=60] <- 60
n <- length(cflux)
yy=(1:n)/n
plot(sort(cflux),log(1-yy+1/n))
The situation may be viewed like this: The complete dataset is a set
of n observations from an exponential distribution with unknown mean mu, say,
X1, ..., Xn. What we observe, however, is
Z1, ..., Zn, where Zi
is defined as min(Xi, 60).
The loglikelihood for the observed data is as follows:
m <- sum(flux>=60)
s <- sum(cflux)
loglik <- function(mu)
-(n-m)*log(mu)-s/mu
As it turns out, this loglikelihood function can be maximized explicitly:
mle <- s/(n-m)
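If you like, this closed-form answer can be confirmed numerically with optimize; the
search interval below is an arbitrary but suitably wide choice:
optimize(loglik, interval=c(1,200), maximum=TRUE) # maximizer should be close to mle
mle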
However, we will construct an EM algorithm anyway for two reasons: First,
it is instructive to see how the EM operates. Second, not all censoring
problems admit a closed-form solution like this one does!
We start by writing down the complete-data loglikelihood for the sample.
This is straightforward because the complete data are simply a random sample
from an exponential distribution with mean mu. Next, we pick a starting
value of mu, say mu0. Then comes the tricky part: We take the conditional
expectation of the complete data loglikelihood, conditional on the observed
data and assuming that mu0 is the correct parameter. (This will have to
happen on the chalkboard!) The result is a function of both mu and mu0,
and construction of this function is called the E (expectation) step.
Next, we maximize this function over mu. The result of the maximization
becomes our next iterate, mu1, and the process repeats.
Let's start with mu0=20. Carrying out the calculations described above, we find that the
conditional expectation of a censored observation, given that it exceeds 60, is 60 + mu0
(by the memoryless property of the exponential), so the E step replaces the complete-data
sum by s + m*mu0, and the M step then gives the updated estimate (s + m*mu0)/n. This
yields the following iterative scheme:
mu <- 20
loglik(mu)
mu <- s/n + m*mu/n; loglik(mu)
mu <- s/n + m*mu/n; loglik(mu)
# repeat the last line a few times
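One way to automate the repetition is a simple loop that stops once the iterates stabilize
(just a sketch; the tolerance of 1e-8 is an arbitrary choice):
mu <- 20
repeat {
  mu.new <- s/n + m*mu/n
  if (abs(mu.new - mu) < 1e-8) break
  mu <- mu.new
}
mu # the converged value
loglik(mu)
mle # compare with the closed-form MLE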
Notice that the value of the (observed data) loglikelihood increases at each iteration.
This is the fundamental property of any EM algorithm! In fact, it is very
helpful when debugging computer code, since there must be a bug somewhere
if the loglikelihood is ever observed to decrease.
Notice also that the value of mu has converged to the true maximum likelihood
estimator after a few iterations.