🧮❤️ Models + Data
We work so hard to gather sensitive observations, calibrate them carefully, and make precise measurements from them to construct reliable datasets. How do we go from those data to learning about objects or processes?
What's the size of that planet, the distance to that supernova, the age of that cluster, the temperature of that gas? Sometimes it's one obvious measurement, but often we need to fit theoretical models to real data to find out.
This notebook provides a very brief, whirlwind perspective on how to fit models to data. We make lots of assumptions and hide lots of caveats, but we try to show general principles for finding good models.
import matplotlib.pyplot as plt, numpy as np
plt.rcParams['figure.dpi'] = 300
plt.rcParams['figure.figsize'] = (8,3)
from IPython.display import Image
from astropy.io import ascii
The Data
Let's play with a historically important dataset: the 25 pulsation periods and apparent magnitudes that went into the first "Leavitt Law" period-luminosity relationship for Cepheid variable stars (Leavitt & Pickering 1912).
Image('leavitt-text.png')
table = ascii.read('leavitt-periods.csv')
table
| star_number | max_magnitude | min_magnitude | epoch | period |
|---|---|---|---|---|
| int64 | float64 | float64 | float64 | float64 |
| 1505 | 14.8 | 16.1 | 0.02 | 1.25336 |
| 1436 | 14.8 | 16.4 | 0.02 | 1.6637 |
| 1446 | 14.8 | 16.4 | 1.38 | 1.762 |
| 1506 | 15.1 | 16.3 | 1.08 | 1.87502 |
| 1413 | 14.7 | 15.6 | 0.35 | 2.17352 |
| 1460 | 14.4 | 15.7 | 0.0 | 2.913 |
| 1422 | 14.7 | 15.9 | 0.6 | 3.501 |
| 842 | 14.6 | 16.1 | 2.61 | 4.2897 |
| 1425 | 14.3 | 15.3 | 2.8 | 4.547 |
| ... | ... | ... | ... | ... |
| 1374 | 13.9 | 15.2 | 6.0 | 8.397 |
| 818 | 13.6 | 14.7 | 4.0 | 10.336 |
| 1610 | 13.4 | 14.6 | 11.0 | 11.645 |
| 1365 | 13.8 | 14.8 | 9.6 | 12.417 |
| 1351 | 13.5 | 14.4 | 4.0 | 13.08 |
| 827 | 13.4 | 14.3 | 11.6 | 13.47 |
| 822 | 13.0 | 14.6 | 13.0 | 16.75 |
| 823 | 12.2 | 14.1 | 2.9 | 31.94 |
| 824 | 11.4 | 12.8 | 4.0 | 65.8 |
| 821 | 11.2 | 12.1 | 97.0 | 127.0 |
Let's simplify these into "x" and "y" quantities we'll try to visualize and fit. For fitting, we'll say that the independent variable $x$ is $\log_{10}$(pulsation period) and the dependent variable $y$ is the star's apparent magnitude.
# independent variable: log10 of the pulsation period (in days)
log_period = np.log10(table['period'])
# dependent variable: approximate each star's typical brightness
# as the midpoint between its minimum and maximum magnitudes
magnitude = 0.5*(table['min_magnitude'] + table['max_magnitude'])
# assume a constant measurement uncertainty (in magnitudes) for every star
magnitude_uncertainty = 0.2
Let's plot those data, with no model yet. (We'll put the plotting in a function so we can reuse it later.)
def plot_data():
'''
Plot the data by itself.
'''
plt.errorbar(log_period, magnitude, magnitude_uncertainty,
linewidth=0, elinewidth=1, color='black', marker='o')
plt.ylim(16.5, 11)
plt.xlabel('log$_{10}$[Pulsation Period (in days)]'); plt.ylabel('Photographic Apparent Magnitude');
plot_data()
The Model
Let's start with a very simple model: a straight line. $$ y_{\rm model}(x) = a + bx $$ The model has two parameters, $a$ for the intercept and $b$ for the slope. We can evaluate this model with a short function:
def model(x, a, b):
'''
This function takes the log10(period) as an independent variable
and two parameters (a = intercept, b = slope), and it returns
model values for the apparent magnitude of the star.
'''
return a + b*x
Let's plot the model, for some set of parameters, with no data. (We'll put the plotting in a function so we can reuse it later.)
def plot_model(a=13, b=1):
'''
Plot the model by itself.
'''
grid_of_log_periods = np.linspace(0, 2.2)
model_magnitudes = model(grid_of_log_periods, a, b)
plt.plot(grid_of_log_periods, model_magnitudes, color='cornflowerblue', linewidth=3)
plt.ylim(16.5, 11)
plt.xlabel('log$_{10}$[Pulsation Period (in days)]'); plt.ylabel('Photographic Apparent Magnitude');
plot_model()
Comparing Model and Data
Our eyes are pretty good at figuring out what good and bad models would look like. Let's start by plotting the data and the model together, and seeing how they look.
def plot_data_and_model(a=13, b=1):
'''
This plots both the data and the model (and gives it a title).
'''
plot_data()
plot_model(a, b)
plt.title(r'$y_{model} = a + bx$' + f' = {a:.2f}{b:+.2f}$x$')
One way to figure out reasonable parameters for our model is to fiddle with them by hand.
plot_data_and_model(a=13, b=1)
But fiddling by hand is slow, subjective, and tells us nothing about how uncertain our parameters are, so it's not a great general method.
Fortunately, we can write down a function describing the probability density of the parameters $p(a,b)$. With this function, we can determine:
- the values of $a$ and $b$ that maximize the PDF (the "maximum likelihood estimates"). We might think of these as our "best-fit" values.
- the range of values that fall within some probability confidence interval of those best-fit values. We could use this range to summarize the uncertainty on those parameters.
Let's write a function to calculate the $\chi^2$ ("chi-squared"): $$ \chi^2 = \sum_i \left( \frac{y_i - y_{\rm model}(x_i)}{\sigma_i} \right)^2 $$ For data with independent Gaussian uncertainties $\sigma_i$, the probability density is $p(a,b) \propto e^{-\chi^2/2}$, so the parameters that minimize $\chi^2$ also maximize the probability.
def calculate_chisq(a, b):
'''
Return the value of chi-squared for given a and b.
'''
y = magnitude
ymodel = model(log_period, a, b)
sigma = magnitude_uncertainty
return np.sum(((y-ymodel)/sigma)**2)
Let's redefine our plotting function to also show the $\chi^2$ values now too:
def plot_data_and_model(a=13, b=1):
'''
This plots both the data and the model (and gives it a title).
'''
plot_data()
plot_model(a, b)
chisq = calculate_chisq(a,b)
plt.title(r'$y_{model} = a + bx$' + rf' = {a:.2f}{b:+.2f}$x$ | $\chi^2$={chisq:.1f}')
Now we can see that $\chi^2$ is a useful quantitative figure of merit: it gets smaller as our model gets closer to the data!
plot_data_and_model(a=16, b=-2)
Let's make a map of $\chi^2(a,b)$ over lots of values of $a$ and $b$.
res = 100
a_explore = np.linspace(17,15,res)
b_explore = np.linspace(-3, -1, res)
chisq = np.zeros((res,res))
# loop through a 2D grid of a and b
for i, a_test in enumerate(a_explore):
for j, b_test in enumerate(b_explore):
chisq[i,j] = calculate_chisq(a_test, b_test)
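The double loop above is easy to read, but it can get slow for big grids or datasets. Here's a sketch of a vectorized equivalent using NumPy broadcasting (the helper names like chisq_vectorized are just for illustration); it builds a (res, res, n_data) array of residuals and sums over the data axis:
# vectorized equivalent of the loop above, via NumPy broadcasting
a_grid = a_explore[:, np.newaxis, np.newaxis]  # shape (res, 1, 1)
b_grid = b_explore[np.newaxis, :, np.newaxis]  # shape (1, res, 1)
# residuals has shape (res, res, n_data); sum over the data axis
residuals = (magnitude - (a_grid + b_grid*log_period))/magnitude_uncertainty
chisq_vectorized = np.sum(residuals**2, axis=-1)
# this should match the loop-based grid
assert np.allclose(chisq, chisq_vectorized)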
What values of $a$ and $b$ minimize our chisq array?
# let's pull out where the chi-squared is minimized
best = np.argmin(chisq)
# convert a 1D index to 2d indices
i_best, j_best = np.unravel_index(best, chisq.shape)
# pull out the best-fit values of a and b from the grid
a_best, b_best = a_explore[i_best], b_explore[j_best]
# print out that best-fit
print(f'a={a_best:.3}, b={b_best:.3} give the lowest chi-squared and highest log(likelihood)')
a=16.2, b=-2.03 give the lowest chi-squared and highest log(likelihood)
imkw = dict(extent=[np.min(b_explore), np.max(b_explore), np.max(a_explore), np.min(a_explore)], aspect='auto', origin='lower')
fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12,3))
# imshow the chi-squared surface (proportional to the negative log of the probability)
plt.sca(ax[0]); plt.imshow(chisq, **imkw); plt.title(r'$\chi^2(a,b)$')
kw = dict(alpha=0.3, color='white')
plt.axhline(a_best, **kw); plt.axvline(b_best, **kw);
plt.colorbar(label=r'$\chi^2(a,b)$'); plt.xlabel('b (slope)'); plt.ylabel('a (intercept)')
# imshow the actual probability = exp(-chi-squared/2)
plt.sca(ax[1]); plt.imshow(np.exp(-chisq/2), **imkw); plt.title(r'$p(a,b)\propto e^{-\chi^2/2}$')
plt.axhline(a_best, **kw); plt.axvline(b_best, **kw);
plt.colorbar(label='$p(a,b)$'); plt.xlabel('b (slope)'); plt.ylabel('a (intercept)');
The maximum likelihood values of $a$ and $b$ from our grid search seemed to give a pretty good match to the data. (Indeed, they give the best match possible, as defined by the assumptions of our model.)
plot_data_and_model(a_best, b_best)
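We can also use the probability grid to estimate uncertainties on the parameters (the second bullet point from earlier). Here's a minimal sketch, assuming flat priors and a grid fine and wide enough to contain essentially all the probability: normalize $p(a,b) \propto e^{-\chi^2/2}$, marginalize over one parameter at a time, and summarize each 1D distribution with a mean and standard deviation.
# normalize exp(-chisq/2) into probabilities (flat priors assumed);
# subtracting chisq.min() avoids underflow and cancels in the normalization
prob = np.exp(-(chisq - chisq.min())/2)
prob /= np.sum(prob)
# marginalize: rows correspond to a, columns to b
p_a = np.sum(prob, axis=1)
p_b = np.sum(prob, axis=0)
# mean and standard deviation of each 1D marginal distribution
a_mean = np.sum(a_explore*p_a)
a_sigma = np.sqrt(np.sum((a_explore - a_mean)**2*p_a))
b_mean = np.sum(b_explore*p_b)
b_sigma = np.sqrt(np.sum((b_explore - b_mean)**2*p_b))
print(f'a = {a_mean:.2f} ± {a_sigma:.2f}, b = {b_mean:.2f} ± {b_sigma:.2f}')
These 1D widths hide the correlation between $a$ and $b$ that is visible in the 2D map; for strongly correlated parameters, quoting the full covariance (or the 2D contours) is more honest.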
Note: There are much more efficient ways of finding the parameter set that minimizes $\chi^2$ and maximizes the log of the probability than this big grid search. In Python, check out tools like np.polyfit, scipy.optimize, astropy.modeling, and others.
For example, let's fit a first-order polynomial (= a line) to these data using the np.polyfit tool. It's secretly doing some clever linear algebra to minimize $\chi^2$ very quickly, via linear least-squares fitting. (Because our uncertainties are all identical here, the unweighted fit below matches the weighted one; for varying uncertainties, pass weights through np.polyfit's w keyword.)
b_polyfit, a_polyfit = np.polyfit(log_period, magnitude, 1)
plot_data_and_model(a_polyfit, b_polyfit)
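To peek under the hood of that clever linear algebra: for any model that is linear in its parameters, the $\chi^2$-minimizing parameters solve the weighted least-squares normal equations $(A^T W A)\,\theta = A^T W y$, where $A$ is the design matrix and $W$ holds the weights $1/\sigma_i^2$. Here's a sketch (the _lstsq names are just for illustration; with our uniform uncertainties the weights don't change the answer, but the same code handles varying $\sigma_i$):
# design matrix: a column of ones (for a) and a column of x (for b)
A = np.vstack([np.ones_like(log_period), log_period]).T
# weights = 1/sigma^2 for each data point
weights = np.ones(len(log_period))/magnitude_uncertainty**2
# solve the normal equations (A^T W A) theta = A^T W y
AtWA = A.T @ (weights[:, np.newaxis]*A)
AtWy = A.T @ (weights*magnitude)
a_lstsq, b_lstsq = np.linalg.solve(AtWA, AtWy)
print(f'a={a_lstsq:.3f}, b={b_lstsq:.3f}')
As a bonus, the inverse of $A^T W A$ is the covariance matrix of the parameters, which is where formal uncertainties on $a$ and $b$ come from.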
In general, if you want to fit a model to some data, a good starting place is to figure out how to write down the likelihood of your data and then find the parameters that maximize that likelihood!
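As a closing sketch of that advice (assuming scipy is installed): write down the log-likelihood for independent Gaussian uncertainties, then hand its negative to a numerical optimizer. For our straight-line model this is equivalent to minimizing $\chi^2$, but the same pattern works for likelihoods that aren't Gaussian and models that aren't linear.
from scipy.optimize import minimize

def negative_log_likelihood(parameters):
    '''
    Return the negative log-likelihood of the data for
    parameters = (a, b), assuming independent Gaussian
    uncertainties (up to a constant that doesn't depend
    on the parameters).
    '''
    a, b = parameters
    residuals = (magnitude - model(log_period, a, b))/magnitude_uncertainty
    return 0.5*np.sum(residuals**2)

# start from a rough guess and let the optimizer do the work
result = minimize(negative_log_likelihood, x0=[13, 1])
a_mle, b_mle = result.x
print(f'a={a_mle:.3f}, b={b_mle:.3f}')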