Binning Data to a New Grid

Often we have some y values that correspond to a particular grid of x values, and we want to resample them onto a different grid of x values. Interpolation is one way to do this, but it won't necessarily provide reasonable averages over wiggly features. There are lots of different features we might hope for in a resampling routine, but one common one is that we'd like to get the same answer when integrating between two x limits, whether we're using the original or the resampled grid of quantities. One way to ensure such integrals are conserved is to calculate the cumulative distribution function (= the integral up to a particular limit) of the original arrays, interpolate it onto the new grid, and differentiate; Diamond-Lowe et al. (2020) provide a literature example of this algorithm being used for exoplanet transmission spectrum observations.
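To make the cumulative-integral idea concrete, here is a minimal numpy-only sketch. It is an illustration of the general technique described above, not a copy of how chromatic implements it; the helper name `resample_conserving_integral` and the choice to describe each grid by its bin edges are assumptions made for this example.

```python
import numpy as np


def resample_conserving_integral(x_edges, y, new_edges):
    """Resample bin-averaged y values from x_edges onto new_edges.

    Hypothetical helper for illustration: y[i] is treated as the average
    value between x_edges[i] and x_edges[i + 1].
    """
    widths = np.diff(x_edges)
    # cumulative integral evaluated at every original bin edge (starts at 0)
    cumulative = np.concatenate([[0.0], np.cumsum(y * widths)])
    # interpolate the cumulative integral onto the new bin edges
    new_cumulative = np.interp(new_edges, x_edges, cumulative)
    # differentiate: integral inside each new bin divided by its width
    return np.diff(new_cumulative) / np.diff(new_edges)


# a fine input grid and a coarser output grid covering the same range
x_edges = np.linspace(1, 5, 101)
x_centers = 0.5 * (x_edges[:-1] + x_edges[1:])
y = x_centers**2
new_edges = np.linspace(1, 5, 9)
new_y = resample_conserving_integral(x_edges, y, new_edges)

# the integral over the full range should match on both grids
original_integral = np.sum(y * np.diff(x_edges))
resampled_integral = np.sum(new_y * np.diff(new_edges))
```

Because the new values are differences of the same cumulative integral, the two totals agree to floating-point precision as long as both grids span the same x range.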

In chromatic, the tools for this are the bintogrid and bintoR functions, which we demonstrate below.

In [1]:

from chromatic import bintogrid, bintoR, version
import numpy as np, matplotlib.pyplot as plt

plt.matplotlib.rcParams["figure.figsize"] = (8, 3)
plt.matplotlib.rcParams["figure.dpi"] = 300

In [2]:

version()

Out[2]:

'0.4.14'

How do we bin some input arrays?

Let's create a fake input dataset with an input grid that is uniformly spaced in x.

In [3]:

N = 100
x = np.linspace(1, 5, N)
y = x**2

Then, let's use bintogrid to bin these input arrays onto a new grid with wider spacing. The function returns a dictionary that contains:

x = the center of the output grid
y = the resampled value on the output grid
x_edge_lower = the lower edges of the output grid
x_edge_upper = the upper edges of the output grid
N_unbinned/N_binned = the approximate number of input bins that contributed to each output bin

In [4]:

binned = bintogrid(x, y, dx=0.5)
list(binned.keys())

Out[4]:

['x', 'x_edge_lower', 'x_edge_upper', 'y', 'N_unbinned/N_binned']

Let's compare the results on a plot. The resampled values line up very neatly with the input values.

In [5]:

plt.scatter(x, y, alpha=0.5, label="input")
plt.scatter(binned["x"], binned["y"], s=100, alpha=0.5, label="output")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(frameon=False);

This code should work similarly even if the input arrays are non-uniform in x, which can be a nice way to arrange a heterogeneous dataset into something easier to work with.

In [6]:

x = np.sort(np.random.uniform(1, 5, N))
y = x**2
binned = bintogrid(x, y, dx=0.5)
plt.scatter(x, y, alpha=0.5, label="input")
plt.scatter(binned["x"], binned["y"], s=100, alpha=0.5, label="output")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(frameon=False);

The bintoR function is a wrapper around bintogrid that provides a quick way to bin onto a logarithmic grid. In spectroscopy, it's common to want to work with wavelengths $\lambda$ that are spaced according to a constant value of $R = \lambda/\Delta \lambda = 1 / \Delta [\ln \lambda]$. This quantity $R$ is often called the spectral resolution.

In [7]:

binned = bintoR(x, y, R=10)
plt.scatter(x, y, alpha=0.5, label="input")
plt.scatter(binned["x"], binned["y"], s=100, alpha=0.5, label="output")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(frameon=False);

This may seem like a weird way to define a new output grid, but its usefulness becomes apparent when we plot on logarithmic axes. It's uniform in log space!

In [8]:

plt.scatter(x, y, alpha=0.5, label="input")
plt.scatter(binned["x"], binned["y"], s=100, alpha=0.5, label="output")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(frameon=False)
plt.xscale("log");
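
To see why a constant-R grid is uniform in log space, here is a small numpy-only sketch that builds one by stepping in $\ln x$ by $1/R$. The helper `constant_R_edges` is hypothetical and written only for this illustration; bintoR may place its bins differently (for example, by centering the first bin on the first value of x).

```python
import numpy as np


def constant_R_edges(x_min, x_max, R):
    """Bin edges with a constant step of 1/R in ln(x) (illustrative helper)."""
    # number of whole bins of width 1/R in ln(x) that fit between the limits
    n_bins = int(np.floor(np.log(x_max / x_min) * R))
    log_edges = np.log(x_min) + np.arange(n_bins + 1) / R
    return np.exp(log_edges)


edges = constant_R_edges(1.0, 5.0, R=10)
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric centers of the bins
widths = np.diff(edges)

# the ratio x/dx is the same for every bin, and close to the requested R
effective_R = centers / widths
```

The bin widths grow in proportion to x, so the bins look uneven on a linear axis but evenly spaced on a logarithmic one.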

How do we bin with uncertainties?

Any real measurements probably have uncertainties associated with them. Both bintogrid and bintoR will try their best to propagate uncertainties, using inverse-variance weighting and the corresponding maximum-likelihood estimate for the binned uncertainty.
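
As a reminder of what inverse-variance weighting means, here is a minimal numpy-only sketch for the points falling inside a single output bin. It shows the standard textbook formulas rather than chromatic's internal code, and the helper name `inverse_variance_mean` is an assumption made for this example.

```python
import numpy as np


def inverse_variance_mean(y, sigma):
    """Weighted mean and its uncertainty, with weights 1/sigma**2."""
    y = np.asarray(y, dtype=float)
    weights = 1.0 / np.asarray(sigma, dtype=float) ** 2
    mean = np.sum(weights * y) / np.sum(weights)
    # maximum-likelihood uncertainty on the weighted mean
    mean_sigma = np.sqrt(1.0 / np.sum(weights))
    return mean, mean_sigma


# four measurements in one bin, each with the same uncertainty of 5
y_in_bin = np.array([9.0, 11.0, 10.0, 10.0])
sigma_in_bin = np.array([5.0, 5.0, 5.0, 5.0])
mean, mean_sigma = inverse_variance_mean(y_in_bin, sigma_in_bin)
```

With equal input uncertainties this reduces to the ordinary mean, and the binned uncertainty shrinks by the square root of the number of points averaged together.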

Let's look at an example similar to the one before, but with some uncertainties associated with each input y value.

In [9]:

x = np.linspace(1, 5, N)
uncertainty = np.ones_like(x) * 5
y = np.random.normal(x**2, uncertainty)

Let's resample it to a new grid, providing the known uncertainties on the original points. Notice that the result now also includes an uncertainty key.

In [10]:

binned = bintogrid(x, y, uncertainty, dx=0.5)
list(binned.keys())

Out[10]:

['x',
 'x_edge_lower',
 'x_edge_upper',
 'y',
 'uncertainty',
 'N_unbinned/N_binned']

When we plot the input and output values, we can see that the typical output uncertainties are smaller than the typical input uncertainties, because we've effectively averaged together a few data points and therefore decreased the uncertainty for the new values.

In [11]:

kw = dict(linewidth=0, elinewidth=1, marker="o", alpha=0.5, markeredgecolor="none")
plt.errorbar(x, y, uncertainty, label="input", **kw)
plt.errorbar(
    binned["x"], binned["y"], binned["uncertainty"], label="output", markersize=10, **kw
)
plt.xlabel("x")
plt.ylabel("y")
plt.legend(frameon=False);

When we bin onto a logarithmic grid with bintoR, we can see that the uncertainties are typically smaller for the higher values of x, where more input points are getting averaged together to make each output point.

In [12]:

binned = bintoR(x, y, uncertainty, R=10)
plt.errorbar(x, y, uncertainty, label="input", **kw)
plt.errorbar(
    binned["x"], binned["y"], binned["uncertainty"], label="output", markersize=10, **kw
)
plt.xlabel("x")
plt.ylabel("y")
plt.legend(frameon=False)
plt.xscale("log");
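
The trend toward smaller uncertainties at larger x can be estimated without running the binning at all, just by counting how many uniformly spaced input points land in each logarithmic bin. The sketch below uses numpy only; the way the bin edges are built is an assumption for illustration and need not match the exact edges bintoR chooses.

```python
import numpy as np

# same kind of input grid as above: 100 points uniformly spaced in x
x = np.linspace(1, 5, 100)
input_uncertainty = 5.0
R = 10

# illustrative constant-R bin edges starting at the first x value
n_bins = int(np.floor(np.log(x.max() / x.min()) * R))
edges = x.min() * np.exp(np.arange(n_bins + 1) / R)

# number of input points in each logarithmic bin
counts, _ = np.histogram(x, bins=edges)

# rough expectation for the binned uncertainty when averaging equal errors
predicted_uncertainty = input_uncertainty / np.sqrt(counts)
```

Bins at larger x are wider in linear units, so they collect more input points and end up with smaller predicted uncertainties.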

How do we customize the output grid?

There are a few different options you can use to specify the exact output grid you would like.

For bintogrid, the options are:

nx = the number of adjacent input points that should be binned together to create the output grid (for example, "bin every 3 points together")
dx = the spacing for a linearly-uniform output grid
newx = a custom output grid, referring to the centers of the new bins
newx_edges = a custom output grid, referring to the edges of the new bins. The left and right edges of the bins will be, respectively, newx_edges[:-1] and newx_edges[1:], so the size of the output array will be len(newx_edges) - 1
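
To illustrate the bookkeeping behind the nx and newx_edges options, here is a numpy-only sketch. It uses simple unweighted averages rather than chromatic's integral-conserving resampling, so treat it as a picture of how the grids are shaped, not as a reproduction of bintogrid's output.

```python
import numpy as np

x = np.linspace(1, 5, 100)
y = x**2

# "bin every nx points together": group adjacent points and average them
nx = 3
n_full = (len(x) // nx) * nx  # drop leftover points that don't fill a group
y_every_nx = y[:n_full].reshape(-1, nx).mean(axis=1)

# custom bin edges: N edges define N - 1 bins
newx_edges = np.array([1.0, 2.0, 3.0, 5.0])
left, right = newx_edges[:-1], newx_edges[1:]
counts, _ = np.histogram(x, bins=newx_edges)
sums, _ = np.histogram(x, bins=newx_edges, weights=y)
binned_y = sums / counts  # plain mean of the points inside each bin
binned_x = 0.5 * (left + right)  # bin centers
```

Note that the bins defined by newx_edges do not need to have equal widths; here the last bin is twice as wide as the first two.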

For bintoR, the options are:

R = the spectral resolution R=x/dx for a logarithmically-uniform output grid
xlim = a two-element list indicating the min and max values of x for the new logarithmic output grid. If not supplied, this will center the first output bin on the first value of x