Linear least squares and its uncertainties
from collections import namedtuple
import numpy as np
import numpy.linalg
import matplotlib
matplotlib.rcParams['savefig.dpi'] = 144
matplotlib.rcParams["image.composite_image"]=False
%matplotlib inline
import matplotlib.pyplot as plt
Setting up a simple test problem
The goal is to demonstrate techniques of error estimation for linear least squares, so let's set up a problem about as simple as possible while still presenting the difficulties we might hope these techniques solve.
The problem will be a simple straight-line fit to data. I'm not even going to bother fooling with the best-fit values; the true slope and intercept will be zero, and the x values are placed symmetrically about zero, so the slope/intercept parameterization should yield uncorrelated errors on the fitted parameters.
The goal of the fitting is to find \(x\) that minimizes \[ \left\|Ax-b\right\|^2_u, \] that is, the sum of squares of the residuals \(r\) divided by (our estimate of) the uncertainties \(u\). We will be interested in what happens when our estimate of the uncertainties differs from the true uncertainties, so here I make the distinction.
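Written out, the weighted norm is (with \(u_i\) the uncertainty assigned to the \(i\)th measurement) \[ \left\|Ax-b\right\|^2_u = \sum_i \frac{\left((Ax)_i-b_i\right)^2}{u_i^2}, \] so points with small claimed uncertainties count for more in the fit.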
xs = np.linspace(-1,1,100)
true_uncerts = np.abs(xs)
claimed_uncerts = 0.35*np.ones_like(xs)
def make_fake_data(xs, true_uncerts):
return true_uncerts*np.random.randn(len(xs))
ys = make_fake_data(xs, true_uncerts)
plt.errorbar(xs,ys,true_uncerts, linestyle='none')
The fitter
Here's a function to do the fitting. The actual calculation I defer to numpy's lstsq function, which uses the singular value decomposition under the hood and is supposed to be somewhat robust to degeneracies in the problem. (Not as robust as one could wish, unfortunately.) Because several of the error estimates are most easily calculated from intermediate results of the fit, I calculate them here as well, though I will only explain them below. Because we have quite a few values to return, I use a named tuple to return them.
(A brief aside: because scipy has only recently adopted the use of named return values, they don't want to break existing code that simply assigns the return values, as in x, chi2, rk, s = lstsq(...), and named tuples satisfy this criterion. But I believe that in some cases they have objects that behave a little more like python's function call signatures: the named tuple is extended by more arguments accessible only by name. This would be nice here, but I don't know where to find the machinery to do it easily. Even the named tuple machinery is kind of cumbersome. I could write some, it wouldn't be long, but... I'll go down that rabbit hole another time.)
The basic fitting procedure can be carried out by matrix algebra. If we first leave aside the possibility of uncertainties, we get \[ x = (A^TA)^{-1} A^T b. \] Of course, numerically, matrix inversion is a troublesome operation, particularly when you have nearly-degenerate problems; the SVD provides a way to manage the treatment of such matrices, and the degree of near-degeneracy is summarized in the rank (rk) and singular-value (s) returns from numpy.linalg.lstsq. In such a numerically-messy case, though, the covariance matrix is going to become completely horrible, so we're going to ignore that possibility when we're looking at error estimates. But in any case, we can work with the "pseudo-inverse", which provides some of the same resistance to numerical headaches (and is also based on the SVD). Anyway, incorporating uncertainties simply amounts to scaling the rows of \(A\) and the elements of \(b\) by dividing by the claimed uncertainties.
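Concretely, if \(\tilde A\) and \(\tilde b\) denote the scaled versions, with \(\tilde A_{ij} = A_{ij}/u_i\) and \(\tilde b_i = b_i/u_i\), the weighted solution is just the unweighted formula applied to the scaled quantities: \[ x = (\tilde A^T \tilde A)^{-1} \tilde A^T \tilde b, \] which is exactly what passing A/uncerts[:,None] and b/uncerts to lstsq computes below.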
linear_least_squares_result = namedtuple("linear_least_squares_result",
["names","x","chi2","rk","s","res","n","cov","cov_scaled","leverage","cov_robust"])
def linear_least_squares(Acols, b, uncerts):
names = sorted(Acols.keys())
A = np.array([Acols[n] for n in names]).T
n = len(b)
    Ascaled = A/uncerts[:,None]
    x, chi2, rk, s = numpy.linalg.lstsq(Ascaled, b/uncerts)
    res = b-np.dot(A,x)
    cov = np.linalg.pinv(np.dot(Ascaled.T,Ascaled))
    # rescale so that the reduced chi-squared would be one
    cov_scaled = cov*chi2/(n-len(names))
    leverage = np.diag(np.dot(Ascaled,np.dot(cov,Ascaled.T)))
    # HC3, the MacKinnon and White estimator; the residuals entering
    # the sandwich are those of the scaled problem
    res_scaled = res/uncerts
    cov_robust = np.dot(np.dot(
        cov,
        np.dot(Ascaled.T,Ascaled*(res_scaled/(1-leverage))[:,None]**2)),
        cov)
return linear_least_squares_result(names=names,
x=x, chi2=chi2, rk=rk, s=s, n=n,
res=res, cov=cov, cov_scaled=cov_scaled,
leverage=leverage, cov_robust=cov_robust)
Acols = {"constant": np.ones(len(xs)),
"linear": xs}
res = linear_least_squares(Acols, ys, true_uncerts)
print res.x
Standard linear least squares
This just uses the standard formula for the covariance matrix of a linear least squares fit: \[ (A^T A)^{-1}, \] for an \(A\) with appropriately scaled rows. So it should work as long as it's provided with the correct uncertainties for the input values.
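Where does this come from? The fit is a linear function of the (scaled) data, \(x = (A^TA)^{-1}A^Tb\), so if the scaled measurements are independent with unit variance, propagating their covariance through that linear map gives \[ \text{Cov}(x) = (A^TA)^{-1}A^T\,\text{Cov}(b)\,A\,(A^TA)^{-1} = (A^TA)^{-1}. \]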
res = linear_least_squares(Acols, ys, true_uncerts)
print res.names
print res.cov
true_cov = res.cov
Let's quickly check that, statistically. That is, we want to know, if we had the same true values but many different instances of inaccurate measurements, what would the distribution of fit parameters be? It would be multivariate normal, described by a covariance matrix. We want our fitting algorithm to return that matrix, or an estimate of it. We can test whether it's right by actually creating many fake data sets, fitting them, and taking the covariance matrix of the fitted parameters. This should converge (with error shrinking like \(1/\sqrt{n}\) in the number of fake data sets) towards the true covariance matrix.
A note about how I structure the code: because the convergence requires many fake data sets, the more the better, I use three IPython blocks: one to set up the problem, one to run a bunch of simulations, and one to compute the results. They are written so I can rerun the middle block as many times as I have patience for and watch the computed results converge. If this gets slow, there are tricks to parallelize the middle block, possibly even running it in the background while I evaluate partial results.
def run_fake():
ys = make_fake_data(xs, true_uncerts)
res = linear_least_squares(Acols, ys, true_uncerts)
return res.x
fake_xs = []
for i in range(100000):
fake_xs.append(run_fake())
print len(fake_xs)
fx = np.array(fake_xs)
print "simulation:"
print np.cov(fx.T)
print "claimed by linear_least_squares:"
print true_cov
Well, they don't agree perfectly. The off-diagonal entries, which analytically should be exactly zero, look relatively bad. But remember that these Monte Carlo estimates converge only like \(1/\sqrt{n}\) in the number of fake data sets, and n here is not huge. So it looks like the claimed errors are okay.
Dealing with wrong error estimates
Of course, we just saw the best-case scenario: the errors were normal and we knew exactly their sizes. In general we don't always know our uncertainties this well. What can be done about this? That is, suppose we put in our best guess for the experimental uncertainties and do the fitting. Is there any way to produce more pessimistic error estimates that might be okay even if our estimated uncertainties are wrong?
There's one commonly-used method: compute the reduced \(\chi^2\) of the fit and rescale the uncertainties until that reduced \(\chi^2\) is one. This correctly compensates for the situation where all your uncertainties are under- or over-estimated by the same factor. Of course, it suffers from the severe drawback that it quietly absorbs errors that are due to unmodelled structure (for example if these data had a quadratic part) into inflated errors on model parameters. If your model is wrong - that is, missing some important effect - then inflating your error bars is not at all a satisfactory solution, and it leaves you without much in the way of quality-of-fit estimates.
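In terms of the quantities the fitter already returns, this rescaling is (with \(p\) the number of fitted parameters, two in our case) \[ C_{\text{scaled}} = \frac{\chi^2}{n-p}\,(A^TA)^{-1}, \] which is what the cov_scaled field holds.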
Nevertheless, what if we are confident that our model correctly describes all features of the data but that our uncertainties are all wrong by differing amounts? This situation - uncertainties that are different from each other - is called "heteroskedasticity", and there is a fairly substantial statistical literature about how to deal with it.
First I'm going to repeat the trick above with the fake data to determine the "true" distribution of fitted parameters. It's going to be different from (worse than) the distribution above, because the fitting procedure is using the wrong weights in the least-squares fit.
def run_fake_het():
ys = make_fake_data(xs, true_uncerts)
res = linear_least_squares(Acols, ys, claimed_uncerts)
return res.x
fake_het_xs = []
for i in range(100000):
fake_het_xs.append(run_fake_het())
print len(fake_het_xs)
fhx = np.array(fake_het_xs)
cov_het = np.cov(fhx.T)
print "Observed covariance when using wrong uncertainties:"
print cov_het
print "Observed covariance when using the right uncertainties:"
print true_cov
Because we've weighted the data points uniformly, rather than based on their true uncertainty, we get an estimator that does a worse job than it could. I don't show it here, but it will still converge to the right values eventually if you keep adding data points; it will just do so much more slowly than a fit using the right weights. So it pays to get your uncertainties as close to right as possible even if you're not going to trust them for error estimation.
Of course, if you make your covariance estimate with the wrong uncertainties, the covariance comes out badly wrong, in fact much too small in this case, and not even scaling so that the reduced \(\chi^2\) is one solves the problem:
res = linear_least_squares(Acols, ys, claimed_uncerts)
print "Claimed covariance with the wrong uncertainties:"
print res.cov
print "Claimed covariance with the wrong uncertainties, scaled:"
print res.cov_scaled
print "Observed covariance when using wrong uncertainties:"
print cov_het
Bootstrap
The most direct approach to dealing with heteroskedasticity - and actually a pretty broad range of other kinds of statistical challenge - is the bootstrap method. In this method, fake data sets - even more fake than the ones above - are created from the observed data itself, by drawing data points from the observed set with replacement. Then the fitting method is applied repeatedly to these fake data sets to produce a large collection of fitted parameters, and their distribution is described. This has the advantage over the method above that one need know nothing about the process used to generate the true data; in particular, one need not know the true uncertainties. Annoyingly, it does take as many simulations as you have patience for (as opposed to the lovely matrix calculations you can do in one shot), but it's at least simple to implement.
def run_boot():
# draw from original data with replacement
ix = np.random.randint(len(xs), size=len(xs))
xs_l = xs[ix]
ys_l = ys[ix]
us_l = claimed_uncerts[ix]
res = linear_least_squares({"constant":np.ones_like(xs_l), "linear":xs_l}, ys_l, us_l)
return res.x
boots = []
for i in range(100000):
boots.append(run_boot())
print len(boots)
ba = np.array(boots)
print "Bootstrap:"
print np.cov(ba.T)
print '"True":'
print cov_het
The bootstrap is doing a surprisingly bad job here. It might do better if we had more data points - a thousand rather than a hundred. Most of these techniques work only "asymptotically", that is, they converge towards being right as you supply more data points. The bootstrap is confusing because there are two numbers that go into it - the number of bootstrap trials and the number of data points. A small number of bootstrap trials has an obvious negative effect on the result, but even as the number of bootstrap trials goes to infinity, the number of data points still limits how well it can work.
There are smarter ways to apply the bootstrap to a linear least-squares problem; for example, you could generate your fake data sets by adding residuals back onto the best-fit model, with the residuals drawn with replacement from the set of observed residuals. This avoids the problem that some data points have more impact than others on the fitted parameters; a sketch of this approach follows.
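This is only a sketch - I haven't used it in the analysis above, and the names base and run_boot_residual are just for illustration - but a residual bootstrap, reusing linear_least_squares and the data defined earlier, might look something like this:
# residual bootstrap: keep the xs fixed and resample the residuals
base = linear_least_squares(Acols, ys, claimed_uncerts)
model_ys = ys - base.res   # best-fit model evaluated at the xs
def run_boot_residual():
    # draw residuals (not data points) with replacement
    ix = np.random.randint(len(xs), size=len(xs))
    ys_l = model_ys + base.res[ix]
    return linear_least_squares(Acols, ys_l, claimed_uncerts).x
boot_res_xs = [run_boot_residual() for i in range(10000)]
print np.cov(np.array(boot_res_xs).T)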
This impact of individual data points on the fitted model is, by the way, what the "leverage" measures; there are derived measures to quantify the impact of data points on the fitted parameters more directly. In case you're curious, here's a plot of the leverage of these data points. The leverage is computed from the projection matrix (so called because it takes a data set to the best-fit model evaluated at those points) \[ H = A(A^TA)^{-1}A^T \] by picking out the diagonal entries \[ h_i = \left[H\right]_{ii}. \]
res = linear_least_squares(Acols, ys, true_uncerts)
plt.plot(xs, res.leverage, ".")
plt.ylabel("leverage")
plt.ylim(ymin=0)
plt.title("Using correct weights")
plt.figure()
res = linear_least_squares(Acols, ys, claimed_uncerts)
plt.plot(xs, res.leverage, ".")
plt.ylim(ymin=0)
plt.ylabel("leverage")
plt.title("Using bogus weights")
Note that when the correct weights are used, the points nearest zero have a great influence on the fitted model in their vicinity - their uncertainties are so small (and therefore their weights are so high) they're going to pull the best-fit line strongly towards them. With uniform weights, the endpoints have more pull just because of their influence on the slope.
Robust standard errors
Fortunately, linear least squares is a very well-studied problem, and there's a nice matrix formula that is supposed to produce a very similar sort of calculation. Fundamentally what it tries to do is use the residual from each data point to estimate the uncertainty that data point should have had. Of course, this does not converge to a sensible estimate of the individual uncertainties, since no matter how many data points you have, you have just as many uncertainties to estimate. But the uncertainties on your fit parameters are a finite set of modest size, and there's hope that an appropriate calculation could converge to those uncertainties. There are a few slightly different formulae for calculating this; I'm using the MacKinnon and White estimator, described as HC3 in Long and Ervin 1999. In terms of the (scaled) design matrix \(A\), the scaled residuals \(r_i\), and the leverages \(h_i\), it is \[ (A^TA)^{-1} A^T \text{diag}\left(\frac{r_i^2}{(1-h_i)^2}\right) A\, (A^TA)^{-1}. \]
res = linear_least_squares(Acols, ys, claimed_uncerts)
print "Robust standard errors:"
print res.cov_robust
print '"True", approximated:'
print cov_het
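As a cross-check - not part of the analysis above, and assuming the statsmodels package is available - the same HC3 sandwich can be computed by a library implementation applied to the pre-scaled problem; it should agree with cov_robust up to numerical details.
import statsmodels.api as sm
# build the same design matrix, columns in sorted-name order ("constant", "linear")
A = np.array([Acols[n] for n in sorted(Acols.keys())]).T
# ordinary least squares on the scaled problem, asking for HC3 robust errors
sm_fit = sm.OLS(ys/claimed_uncerts, A/claimed_uncerts[:,None]).fit(cov_type="HC3")
print sm_fit.cov_params()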