Fitting a Model to Data¶
In class we talked about two key ingredients for writing a computer recipe for optimizing a model to match data. We need a goodness-of-fit metric, and we need a way to explore the range of parameters that are out there.
Linear Least Squares¶
For some classes of model fitting problems (“linear least squares”
problems), there are some tools that can identify the best-fit
parameters almost instantaneously. For example, np.polyfit
does some
magical matrix math to instantly fit a polynomial to (x,y) data.
General Fitting Problems¶
You won’t always have a tidy “linear” problem, so it’s useful to understand how fitting might work more abstractly. Here are a few examples that help visualize the exploration and optimization process.
In [ ]:
import numpy as np, matplotlib.pyplot as plt
import henrietta as hsl
First, let’s generate a simulated transit. It’s useful to know the parameters we injected, so we can see how good our fits look in the end.
In [ ]:
# create a simulated light curve
actual = dict(period=1.58, radius=0.116, a=15.2, b=0.385)
lc = hsl.simulate_transit_data(tmin=-0.1, duration=0.2, cadence=1/60/24., N=1e7, **actual)
lc.scatter();
We’re going to use some of the astropy.modeling tools. You don’t need to know much about them now, but the basic step we need to take is to set up a “model” object that we can play with. We wrote a wrapper to help with this:
In [ ]:
model = hsl.setup_transit_model(period=1.58,
radius=[0.0, 0.3],
a=[3.0, 50.0],
b=0.385,
baseline=1.0)
If you give the setup_transit_model
function one value for a
parameter, it will treat that parameter as fixed. If you give it a range
of parameters, it will treat that parameter as unknown but within that
range. All of these are stored inside of our model object.
In [ ]:
print(model.param_names)
print(model.parameters)
print(model.fixed)
print(model.bounds)
An astropy model
can store parameters inside them, but it can also
be called as a function to generate a model curve:
In [ ]:
x = lc.time
y = lc.flux
plt.plot(x, y, '.k')
plt.plot(x, model(x));
Hmmmm…that wasn’t a great match. To play with fitting our model to improve its parameters, we need to define a goodness of fit function. Let’s try a common one:
In [ ]:
def sumofsquares(residuals):
'''
This calculates a goodness-of-fit from an array of residuals.
(a lower value implies a better fit)
Parameters
----------
residuals : `~numpy.ndarray`
Array of values for (data-model)/sigma
Returns
-------
gof : float
A single goodness-of-fit metric
(in this case, sum of squares)
'''
return np.sum(residuals**2)
With this, we can set up a “fitter”, which will help us figure out good
values for our parameters. By setting the goodness
keyword, we can
tell the fitter what function to use for calculating the goodness of fit
from the residuals. The fitter will try to minimize the goodness
values, so make sure you’re aware of that in your definition of the
goodness
function. (You could make a reasonable argument we should
be calling this “badness” instead. That’s fair.)
Exploration #1: Guess and Check¶
Let’s start with an inefficient, but conceptually simple, algorithm. We’ll just try random sets of parameters, and see which one looks best. First, we create a “fitter” object, that handles this exploration.
In [ ]:
fitter = hsl.VisualizedGuessNCheckFitter(goodness=sumofsquares)
Next, we can run this fitter to see what happens. It will return the best model, but it will also make an animation that shows us the process it took to get there.
In [ ]:
bestmodel = fitter.visualize(model, x, y)
This will save an animation out to the file
guessncheck-sumofsquares.mp4
. Open it up to take a look!
Exploration #2: Simplex¶
Our guessing and checking method seems a little inefficient. Here’s an
example of the simplex
method for minimizing functions (in this case
our goodness of fit). What’s going on with this one?
In [ ]:
faster = hsl.VisualizedSimplexFitter(goodness=sumofsquares)
bestmodel = faster.visualize(model, x, y)
simplex
will do a better job of fitting some parameters than others.
Play around with which parameters you allow to vary in
setup_transit_model
, and by what ranges. Does simplex
do a good
job of fitting all of them? Based on your animations, do have any ideas
why or what not?