Testing statistical tests

Often one wants to ask something like "is there a periodic signal in this data?" or "is there a correlation between these two data sets?". Usually there is some way to calculate a number that represents how much signal or how much correlation there is. But of course there is always noise and uncertainty in the data, so it's possible that the apparent signal or correlation is actually just noise. For such a number to be useful, then, it must come with an estimate of how likely it would be to get such a value if there were no signal, only noise. Such an estimate is often called a p-value.

I should say, before I go on, that there are a number of things a p-value isn't: it's not (one minus) a probability that the signal is real. It's not a measure of the strength or importance of the signal. It's not even enough to tell you whether you should believe the signal is real: if you look at a hundred thousand data sets, you should expect to find many with p<0.01 even if there's nothing there. But this has been better discussed elsewhere, and it's not my topic today.
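The many-data-sets point is easy to check with a quick simulation (a sketch, not part of the original argument; it relies only on the fact that under the null hypothesis p-values are uniformly distributed):

```python
import numpy as np

np.random.seed(0)

# Under the null hypothesis, p-values are uniform on [0, 1], so screening
# a hundred thousand pure-noise data sets at p<0.01 yields on the order of
# a thousand "detections", every one of them spurious.
p_values = np.random.uniform(size=100000)
hits = np.sum(p_values < 0.01)
print(hits)
```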

Today I'm addressing a specific issue: suppose that you have implemented some new way of quantifying the presence of periodic signals, or correlations, or whatever, and your implementation returns a p-value. How do you make sure that p-value is actually correct?

Of course, you can (and should!) go over your code carefully, with a critical eye. But evaluating the p-value often involves evaluating some arcane special function. How can you be sure that you implemented it correctly, or for that matter, that you got the math right in the first place? It turns out there's a nice way to test your code as a black box.

For our example, let's take a well-known test: the chi-squared test. As I'm going to use it here, we have m measurements, each with measurement uncertainty one, and we want to know whether the underlying quantity is the same in all of them. The test works by constructing the quantity chi-squared, the sum of the squared differences of the measurements from their mean:

import numpy as np

def chi_squared(values):
    return np.sum((values - np.mean(values))**2)

This quantity has a known distribution, naturally enough called the chi-squared distribution. It is parameterized by a number called the "degrees of freedom", which in our case is m-1, the number of measurements minus one because we subtracted the mean. For my p-value, I will just ask what the probability is that random noise would produce a chi-squared value larger than what we observed. (Note that we might well want to consider unusual any data set for which the chi-squared value is unnaturally small as well as ones where the chi-squared is unnaturally large, but that can only happen if we have overestimated the uncertainties on our measurements, so I'll leave it aside.)
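The degrees-of-freedom claim can be sanity-checked by simulation (a sketch; the sample size and seed are arbitrary choices):

```python
import numpy as np

def chi_squared(values):
    return np.sum((values - np.mean(values))**2)

np.random.seed(0)
m = 10
samples = [chi_squared(np.random.randn(m)) for i in range(100000)]

# The chi-squared distribution with m-1 degrees of freedom has mean m-1,
# so for m = 10 the sample mean should land very close to 9.
print(np.mean(samples))
```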

import scipy.stats

def chi_squared_p(m, c):
    return scipy.stats.chi2(m - 1).sf(c)

For a distribution in scipy.stats, the "sf" method evaluates the "survival function", that is, what fraction of the distribution's values are larger than the given value.
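Using sf rather than computing 1 - cdf matters in practice: the two agree in the bulk of the distribution, but far out in the tail, where interesting p-values live, the subtraction loses all precision to floating-point cancellation. A minimal illustration:

```python
import scipy.stats

dist = scipy.stats.chi2(9)

# Mathematically sf(c) == 1 - cdf(c), but sf is computed directly,
# so it keeps precision deep in the tail.
print(dist.sf(9.0), 1 - dist.cdf(9.0))  # agree in the bulk
print(dist.sf(100.0))                   # a tiny but meaningful tail probability
print(1 - dist.cdf(100.0))              # precision lost to cancellation
```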

So now we have a test and a p-value calculation. How are we to check that we implemented it correctly? Well, let's pick an m and a p-value we're interested in, say p0 = 0.05. Let's also pick a number of repetitions, N. We will generate N fake data sets in which there is nothing but noise, and see how many times our statistic reports p<p0. This should be something close to p0*N. How close? Well, the statistic should behave like flipping an unfair coin with probability p0, so we can use the binomial distribution to find limits that should contain the number of false positives 98% of the time.

def test_p(N, p0, m):
    false_positives = 0
    for i in range(N):
        if chi_squared_p(m, chi_squared(np.random.randn(m))) < p0:
            false_positives += 1

    assert scipy.stats.binom(N, p0).cdf(false_positives) > 0.01
    assert scipy.stats.binom(N, p0).sf(false_positives) > 0.01

There we have a test; the more trials N you let it run, the tighter the constraints it puts on the p-value. Unfortunately, it fails about 2% of the time even when everything is fine, and it's random, so if it does fail you can't simply rerun it in the debugger to investigate (the next run will draw different values). There are ways around both problems. The first can be mitigated by running the test once and, if it fails, rerunning it: the combined test then reports a spurious failure only 0.04% of the time, while a real problem in the algorithm will probably still show up. The second can be solved by seeding the random number generator every time the test runs. In Python, decorators provide a handy way to avoid writing this boilerplate for every test; I have two decorators, seed() and double_check, that do this. I'll omit their code here because it has some nastiness to work around limitations in nosetests (but you can find them, along with a more detailed example of these techniques, in this post).
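A minimal sketch of what such decorators might look like (the bodies are my assumptions, without the nosetests workarounds; note that seed() must be stacked outside double_check, so the rerun continues from the advanced RNG state and sees fresh random numbers):

```python
import functools
import numpy as np

def seed(s=0):
    """Seed numpy's RNG before the test runs, so failures are reproducible."""
    def decorator(test):
        @functools.wraps(test)
        def wrapper(*args, **kwargs):
            np.random.seed(s)
            return test(*args, **kwargs)
        return wrapper
    return decorator

def double_check(test):
    """Rerun a failed test once; only two failures in a row count as failure."""
    @functools.wraps(test)
    def wrapper(*args, **kwargs):
        try:
            return test(*args, **kwargs)
        except AssertionError:
            return test(*args, **kwargs)
    return wrapper

@seed(0)
@double_check
def test_coin():
    # Deliberately flaky: a fair coin showing fewer than 40 heads in
    # 100 flips (roughly a 2% chance) makes this assertion fail.
    assert np.sum(np.random.randn(100) > 0) >= 40
```

With this stacking, every invocation of test_coin() starts from the same seed, so any failure can be reproduced exactly, and a single unlucky draw is retried once before being reported.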

This test is nice, but a little limited: it only tests one p-value, p0. Since the statistical test returns a p-value every time it runs, surely we can do better?

Well, since the p-value is supposed to be the probability that a random data set will exceed the value obtained in a particular test, if we generate lots of noise samples, the fraction with p-value less than p0 should be roughly p0, for every p0 - in other words, the p-values returned should be uniformly distributed on [0, 1]. So we can use a statistical test for uniformity to check whether they're coming out right! One such test, available in scipy, is the Kolmogorov-Smirnov test. That gives the following scheme:

def test_ks(N, m):
    pvalues = [chi_squared_p(m, chi_squared(np.random.randn(m))) for i in range(N)]

    D, p = scipy.stats.kstest(pvalues, lambda x: x)
    assert p > 0.01

This does have the drawback that the K-S test is known to be most sensitive in the middle of the distribution, while what we care about is actually the tail. A minor modification can help, at the cost of some speed:

def test_ks_tail(N, m, p0):
    pvalues = []
    while len(pvalues) < N:
        pv = chi_squared_p(m, chi_squared(np.random.randn(m)))
        if pv < p0:
            # conditional on falling below p0, the p-value should be
            # uniform on [0, p0], so rescale it to [0, 1]
            pvalues.append(pv / p0)

    D, p = scipy.stats.kstest(pvalues, lambda x: x)
    assert p > 0.01

This has the embarrassing problem that it's too good: it works for the chi-squared statistic I describe, but when I applied the same scheme to the K-S test itself, it tended to fail, because the p-value the K-S test returns is actually rather approximate.

Finally, there's one essential thing to check: how stringent are these tests? If we return the wrong p-value, do they pick it up? A quick check can be done simply by modifying the chi-squared calculator to jigger the value a bit, then checking that the tests fail. With a thousand trials, the tests pass just fine if I return 1.01*chi-squared; I have to increase the factor to something like 1.05*chi-squared to start getting failures.
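Such a stringency check might be sketched as follows (the 1.05 inflation factor matches the level described above; the seed and trial count are arbitrary choices):

```python
import numpy as np
import scipy.stats

def chi_squared_broken(values, fudge=1.05):
    # Deliberately inflate the statistic to see whether the tests notice.
    return fudge * np.sum((values - np.mean(values))**2)

def chi_squared_p(m, c):
    return scipy.stats.chi2(m - 1).sf(c)

def false_positive_fraction(N, p0, m):
    np.random.seed(0)
    flagged = sum(chi_squared_p(m, chi_squared_broken(np.random.randn(m))) < p0
                  for i in range(N))
    return flagged / N

# With a 5% inflation, the observed false-positive rate comes out
# noticeably above p0, so the binomial-limit test should start failing.
print(false_positive_fraction(10000, 0.05, 10))
```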

This brings us to the elephant in the room: the power of a statistical test. The p-value is really only half the answer; it tells us how likely we are to get an apparent signal when there's nothing there. It tells us nothing about whether we would actually see a signal if it were there; for that, you need a different quantity, the power of the test.
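Power can be estimated with the same simulation machinery: inject a known signal into the fake data and count how often the test fires. A sketch (the step-shaped signal and all parameter values here are arbitrary choices for illustration):

```python
import numpy as np
import scipy.stats

def chi_squared(values):
    return np.sum((values - np.mean(values))**2)

def chi_squared_p(m, c):
    return scipy.stats.chi2(m - 1).sf(c)

def estimated_power(N, p0, m, offset):
    """Fraction of signal-bearing fake data sets detected at level p0."""
    np.random.seed(0)
    detections = 0
    for i in range(N):
        values = np.random.randn(m)
        values[:m // 2] += offset  # inject a step: half the points offset
        if chi_squared_p(m, chi_squared(values)) < p0:
            detections += 1
    return detections / N

# With no signal the detection rate is just the false-positive rate, p0;
# as the injected offset grows, the power climbs toward one.
print(estimated_power(2000, 0.05, 10, offset=0.0))
print(estimated_power(2000, 0.05, 10, offset=2.0))
```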
