Nonlinear regression¶

This document was prepared at Caltech with support financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from an Jupyter notebook. You can download the notebook here.

In this tutorial, we will learn how to obtain a maximum likelihood estimate (MLE) for parameter values for a model describing x-y data, where x is an independent variable and y is measured.

The data set¶

The data set comes from a classic paper by Driever and Nüsslein-Volhard90183-3) in which they identified bicoid as the first morphogen, a chemical species that determines cell fate in a concentration-dependent manner. The data set is available here. Let's load it in and take a look.

In [2]:

df = pd.read_csv(
    "wt_bcd_driever_fig_3A.csv",
    header=None,
    comment="#",
    names=["AP length", "signal"],
)

x = df["AP length"]
y = df["signal"]

p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=200,
    x_axis_label="AP length",
    y_axis_label="signal (a.u)",
    y_range=[0, 1],
)

p.circle(x, y)

bokeh.io.show(p)

Here, the "AP length" is the fractional distance along the anterior-posterior axis of the developing embryo.

We can work out theoretically that the Bicoid concentration profile should be related to the distance $x$ along the AP-axis according to

\begin{align} c = c_0\mathrm{e}^{-x/\lambda}, \end{align}

where $\lambda$ is the length scale of the gradient. However, the immunostaining procedure contributes some background signal, so we can model the signal $y$ as a function of $x$ as

\begin{align} y = b + y_0\mathrm{e}^{-x/\lambda}, \end{align}

where $b$ is the background signal.

Performing a curve fit¶

In class, we worked out that finding a point estimate for the parameters $b$, $y_0$, and $\lambda$ amounts to solving the optimization problem

\begin{align} \text{arg max} \sum_i \left(y_i - y(x_i;b, y_0, \lambda)\right)^2, \end{align}

where $(x_i, y_i)$ are the measured data and

\begin{align} y(x_i;b, y_0, \lambda) = b + y_0\mathrm{e}^{-x_i/\lambda}. \end{align}

This optimization problem is automatically solved using the convenient scipy.optimize.curve_fit() function. This function takes a set of data and returns the parameters of the model function that best describe the data. Its first input is the model function; next is the x-data (in our case position along the AP axis); the third input is the y-data (the measured signal). The last piece that we will need to address is a best guess for our parameters. curve_fit() then solves the above optimization problem. For more on the scipy.optimize.curve_fit() function, please read the documentation.

When we run scipy.optimize.curve_fit(), we will get two arrays out. The first array contains the optimal set of parameters for the curve fit. The second array is a 2d array with the covariance of the optimal parameters. For this course, we will ignore the second array and will only use the first array.

In [3]:

# Define linear function
def bcd_gradient(x, b, y0, lam):
    return b + y0 * np.exp(-x / lam)

# Best guess at parameters
b_guess = 0.2
y0_guess = 0.8
lam_guess = 0.3
p0 = (b_guess, y0_guess, lam_guess)

# Compute the curve fit (Guess is unit slope and zero intercept)
popt, _ = scipy.optimize.curve_fit(bcd_gradient, x, y, p0=p0)

# Parse the results
b, y0, lam = popt

# Print the results
print('b =', b)
print('y0 =', y0)
print('λ =', lam)

b = 0.17423298452853714
y0 = 0.7676215050510354
λ = 0.18712915664655308

So, we obtained $\lambda = 0.19$ as the length scale of the gradient.

We can also plot the best-fit curve on top of the data. To do so, we generate a set of x-values to make a smooth line and evaluate the theoretical curve at those points.

Computing environment¶

In [5]:

%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,jupyterlab

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 8.1.1

numpy     : 1.21.2
scipy     : 1.7.3
pandas    : 1.4.1
bokeh     : 2.4.2
jupyterlab: 3.3.2