Lighthouse in the Sky: Curve Fitting part 4: Validating Bayesian Code

In my previous post, I wrote a tool to use Bayesian inference to ask whether a collection of photons represented a pulsed signal, and if so, what its parameters were. It gave plausible answers, but knowing my own fallibility, I really want some more careful test before trusting its output.

I previously talked about how to implement such a correctness test in a frequentist setting. But in a Bayesian setting, what one gets are probability distributions describing our knowledge about the hypotheses - how can a program test those?

After racking my brains for a while trying to find a way to do this, I asked about it on the scipy-user mailing list, and received a few useful suggestions, the most valuable of which was to send a polite email to Professor Tom Loredo. I did, and he replied immediately with a very helpful email, and some links to other people's work on the subject.

It turns out that, in the context of Bayesian fitting, the issue is called "calibration". The key realization is that if you select a model according to your prior distribution, generate a data set, and then do your fitting to generate posterior distributions, then your "true" parameters that you used to generate the data set look just as if they had been drawn from the posterior distribution.

This makes a certain kind of sense: after all, your posterior distribution is supposed to describe the distribution of models that might have generated your data set.

So how do you turn this into a test? After all, if you just get one sample from a distribution, it's pretty hard to say anything very definite, especially when the tails - the most unlikely extreme parameters - are not necessarily very well modeled. If you try to generate another sample, you pick a different model and must then generate a different data set, so you get a new point but a totally different posterior distribution. So now you have two points, drawn from two different distributions, and your situation has not necessarily improved.

There's a trick to let you work with a large sample: you transform them to all have the same distribution. You can do this because you know the posterior distribution you're working with; in my case I have its values sampled evenly. So I can construct the cumulative distribution function and use its inverse to get the posterior probability of a value less than the true model. This should be distributed uniformly no matter the shape of the posterior.

In fact, I'd rather use a slightly different trick: instead of looking at the probability of a value less than the true model, I'll use the probability of a value more extreme than the true model. Essentially I'll combine both tails. This has a more direct bearing on the question of confidence intervals, and still results in a uniform distribution as I repeat the process many times.

Once I have a long list of many values that should be distributed uniformly, there are any number of tests I can use; I'll use the handy but not particularly good Kolmogorov-Smirnov test. My unit testing code now looks like this:


def test_credible_interval():

    M = 100
    N = 50

    ps = []
    for i in range(M):
        f, p = np.random.rand(2)

        events = bayespf.generate(f, p, N)

        phases, fractions, r, P = bayespf.infer(events)

        frac_pdf = np.average(r,axis=0)
        fi = np.searchsorted(fractions, f)
        p = np.sum(frac_pdf[:fi])/np.sum(frac_pdf)
        p = 2*min(p, 1-p)

        ps.append(p)

    assert scipy.stats.kstest(ps,lambda x: x) > 0.01

Note that I don't bother testing the posterior for pulse phase. This is just laziness.

In a real Bayesian problem, there would usually be many more parameters, and I would probably not be able to evaluate the posterior on a grid like this. I'd probably be using some sort of Monte Carlo method, which would return a list of samples drawn from the posterior instead. But there are reasonable approaches in this setting too.

So far so good. But what about the probability that the signal is pulsed? In some sense hypothesis testing is just a special case of parameter fitting, but the discreteness of the "parameter" poses a problem: there's no way we can hope for a nice uniform distribution when only two values are possible. Professor Loredo very kindly sent me an unpublished note in which he addresses this problem, showing that if your code accepts a hypothesis whose posterior probability is P_crit, then a correct piece of code will be right - have chosen the correct hypothesis - with probability at least P_crit. Unfortunately P_crit is only a lower limit on the probability of getting things right; but there isn't really a way around this: suppose my code were working with a million photons a run. Then it would, in almost every case, give a probability very close to zero or one. There would be very, very few errors, no matter what value P_crit you set.

The fact that P_crit is only a lower limit does mean that this doesn't allow you to catch code that is too conservative: if your code makes fewer errors than its probabilities claim it should, this test has no way to tell that that's what's happening.

One must also choose P_crit carefully. In my example, if I choose P_crit=0.5, then code that simply chose randomly, or returned the same value all the time, would pass: after all, my priors specify equal probabilities for each model, and with P_crit=0.5 the code only needs to be right half the time. On the other hand, with P_crit really high, the code will almost never be wrong, although it only takes a few errors to fail, but it will almost never actually come to a conclusion, so you'll wait forever for evidence. There should be some fairly well-defined optimum value of P_crit, and I need to think more about what it is.

In any case, having selected P_crit and the other parameters, I can write a unit test for the hypothesis testing component:


def test_pulsed_probability():
    np.random.seed(0)
    p_accept = 0.75
    M = 50
    N = 30

    accepted = 0
    correct = 0
    total = 0
    while accepted<M:
        if np.random.rand()>0.5:
            f, p = np.random.rand(2)
            pulsed = True
        else:
            f, p = 0., 0.
            pulsed = False

        events = bayespf.generate(f, p, N)

        phases, fractions, r, P = bayespf.infer(events, n_phase=100, n_frac=101)

        if P>=p_accept:
            accepted += 1
            if pulsed:
                correct += 1
        if P<1-p_accept:
            accepted += 1
            if not pulsed:
                correct += 1
        total += 1
    assert 0.01<scipy.stats.binom(M,p_accept).cdf(correct)

As a final note, I'd like to thank Professor Loredo for his help, but note that any errors in the above code or description are entirely mine.

8 comments:

roban said...: Just saw this on arxiv (0911.2150) and thought you might be interested given you work (which I read about in your posts to the scipy list).

http://arxiv.org/abs/0911.2150; November 12, 2009 at 6:53 PM
Unknown said...: That is an interesting article. I avoid the problems they're talking about by sticking to a two-dimensional problem and using ultra-simple numerical integration, but that's only possible because I use such a simple toy problem. But their procedure looks almost perfect - you could plug it right in to a pymc model, possibly even simply using the results of a simulation you'd already run.; November 13, 2009 at 8:54 AM
Joshua Stults said...: So does your unit test fail 1% of the time?; November 17, 2009 at 1:45 AM
Unknown said...: @jstults: To be honest I haven't run it a hundred times. But I know that the hypothesis testing one will probably not fail that often - it's only testing a lower limit, and that lower limit is not very strict. The credible regions test should probably fail that often, though.

I usually use two decorators to work around this issue: one called "seed" that seeds the random number generator before running the test, and a second called double_check that reruns the test (once) if it fails.; November 17, 2009 at 1:58 AM
Joshua Stults said...: Hi Anne,

You seem to be really familiar with scipy, I wonder if you could recommend a tool for fitting a regression with uncertain predictor and response? I think you could define a model and use pymc (which I've never used, only browsed the docs), but I thought maybe it's a common enough problem that there's already something coded up in one of scipy's packages? Thanks for any tips.; November 28, 2009 at 7:15 PM
Unknown said...: @jstults: If I understand you correctly, you're looking to fit a model to data where both independent and dependent variables are uncertain. If you're fitting a linear model, I think OLS in scipy will do that for you. If you want some more complicated nonlinear model, I don't know of any general algorithm short of treating the errors on the independent variables as additional parameters to fit for.; November 28, 2009 at 11:58 PM
Joshua Stults said...: I thought ordinary least squares was for fixed x, but then read the wikipedia article and it claims that a normally distributed predictor leads to the same result as for a fixed one (I don't remember that from any of my math classes).

Your reply got me to look further into the docs for OLS and I found scipy.odr which I think will do what I want.

Thanks.; November 29, 2009 at 12:31 AM
James Philips said...: My open source project Python Equations uses Python, Scipy and Numpy - and already has ODR coded for use. It may be of some use to you.

http://code.google.com/p/pythonequations/downloads/list

James Phillips
http://zunzun.com
zunzun@zunzun.com; December 22, 2009 at 7:18 PM

Lighthouse in the Sky

Curve Fitting part 4: Validating Bayesian Code

8 comments:

Search This Blog

Popular Posts

Labels

Blog Archive

Other Blogs

Lighthouse in the Sky

Curve Fitting part 4: Validating Bayesian Code

8 comments:

Search This Blog

Popular Posts

Labels

Blog Archive

Other Blogs

Subscribe To