(c) 2023 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook, which can be downloaded here
In this tutorial, we will use our list of bead loss times to get an estimate for the characteristic cleavage time by RAG (for the concentration of our given experiment). We begin by importing our usual modules.
import numpy as np
import scipy.optimize
import iqplot
import bokeh.io
bokeh.io.output_notebook()
And, of course, we need to load in the data set we saved from the Image Processing III tutorial. In the event that you didn't complete that tutorial, you can download the the bead loss times here.
# Data from image processing
t_lost = np.loadtxt('t_lost_2016.csv')
m = int(np.loadtxt('number_of_uncut_2016.csv'))
t = np.loadtxt('time_points_2016.csv')
n = len(t_lost)
t_end = t[-1]
We assume that cleavage is a Poisson process. A Poisson process is a sequence of events (called arrivals) that occur over time such that the amount of time between any two arrivals is independent of all other inter-arrival times. This means that cleavage is a stochastic event that has no memory, nor knowledge of the future.
This is essentially our physical model for cleavage. We do not know the molecular details, only that the cleavage events are rare and random.
The amount of time to wait for an event in a Poisson process is exponentially distributed. We will not derive that fact here, but just state it. That means that the probability of observing a bead being lost between time and , where is a differential time difference, is , where
The function is called a probability density function, or PDF, and has units of inverse time. The parameter , which has units of inverse time, and characterizes how rapid cleavage is. It is proportional to the chemical rate constant from mass action kinetics that you may be accustomed to. Specifically, the constant of proportionality contains the concentrations of enzyme, DNA, and volume of the system.
The problem with this argument is that beads may be lost to processes other than cleavage by RAG protein. We do a control experiment without RAG to estimate the rate of bead loss in the absence of RAG, the rate constant of which we will define as . So, now we have two Poisson processes that could arrive and a bead could be lost. I.e., the probability density function for bead loss due to non-RAG-mediated bead loss is
Since we are observing loss, possibly due to either process, we really want is the probability distribution of bead loss due to either RAG cleavage or other processes. This is a more general problem: how is the waiting time for the arrival of one of two different Poisson processes distributed? To derive this, we first write down the probability that a Poisson process has not arrived before time . That is, we want .
Now, if the two processes, which we will label with and , are independent of each other, the probability that neither have arrived before time is
But this is the expression for the probability that a single Poisson process with rate does not arrive by time . So, the distribution for the arrival time of either Poisson process in time is that of a single Poisson process with time constant . So, we have
For convenience going forward, we will define to contain both the rate of cleavage and that of auto-detachment, knowing that we can subtract off the rate of auto-detachment obtained from the control experiment at the end of the analysis. So, we will use
If we observe an exponential distribution of cleavage times, we have strong evidence that cleavage is well-described by a Poisson process; it is just a stochastic memoryless process. If, however, the cleavage times are not exponentially distributed, we have to think hard about what might be at play in the dynamics of RAG-mediated cleavage.
We have the PDF describing bead loss at time , but we want the probability of our observed bead loss time. We need to be more precise in what we mean here. We take images at time points , , ..., . We can observe a bead is missing at any of these times points except . Let be the time where we observe a given bead is missing. What is the probability of this observation? It is the probability that the bead was lost between time and time .
If we have the same time interval between each image we take, then for all , and this expression is
Note that we may not always have the same time interval between frames because of loss of focus or other issues, we will make this assumption going forward.
Since the bead loss events are all independent, we can write down the probability of observing all of our lost beads as the product of these probabilities. Let be the time at which we observe bead going missing. Let be a list of the time points at which we observe each of of the beads we lost. Then,
where is the time interval between the image where bead is first absent and the last image where it was present. The last equality in the above expression holds if we have constant time intervals between images. We note that
where is the average of the times where we observed bead loss. As a result,
where again the last equality holds if we have constant time intervals between images.
We only run our movie for time . (In your experiment, we typically set hours.) So there may be beads that could be cleaved if we waited a while. These are data, too! The probability that a given bead is not cleaved before , which means that we observe it intact in the final frame of the movie, is
If there are such beads, then the probability of observing our cleaved beads and our uncleaved beads in the experiment is
where again the last equality holds if we have constant time intervals between images.
To get an estimate for , the rate of cleavage (which, recall, is actually the rate of cleavage plus the rate of non-enzyme-mediated bead loss), we choose the value of such that the expression for is maximized. This is called a maximum likelihood estimate.
The value of that maximizes is the same that maximized , and the logarithm is much easier to work with.
This is maximal where its derivative vanishes, that is where
In the case where we have evenly spaced time points, this is
To solve for the MLE, we need find the value of for which
We now consider an approximate solution for the special case where we have evenly spaced time points so that and rapid image acquisition so that . In this case,
so that
Solving for gives
So, if we waited a long time such that we had essentially no beads remaining, the . In other words, the rate is best estimated by the inverse of the average time of cleavage.
We can compute the approximate MLE for using the data.
beta_mle_approx = 1 / (np.mean(t_lost) + m / n * t_end)
print(beta_mle_approx)
0.013887675779544067
The characteristic cutting time is , which we now compute.
1 / beta_mle_approx
72.0062893081761
The characteristic cut time is about 72 minutes.
Note that in this approximation, we effectively assumed that even though we have discrete sampling times, we are able to measure the exact time of losing a bead. This is clear by forgoing integrating over the interval from time to for a bead loss observed at time , and instead directly using the probability density function such that
The logarithm of this is
Differentiating with respect to and setting to zero gives
Solving for gives
the same result as formally making the approximation above.
We now demonstrate how to compute the MLE exactly. Our task is to write a function that computes the function of we are trying to find the root of, namely
We first compute some convenient quantities from the data.
# Array of Δt's
delta_t = np.concatenate(((0,), np.diff(t)))
# Index of time points where cut
i = np.array([np.where(t == tl)[0][0] for tl in t_lost])
# The Δtij's
delta_t_ij = delta_t[i]
# Mean observation time
t_mean = np.mean(t_lost)
Now, we can define the function .
def root_fun(beta):
exp_val = np.exp(beta * delta_t_ij)
ret_val = -m * t_end - n * t_mean
ret_val += np.sum(delta_t_ij * exp_val / (exp_val - 1))
return ret_val
Now, we are left to find the MLE using SciPy's nifty solver scipy.optimize.fsolve()
. We need to pass in the function whose root we are trying to find and also our initial guess at a root. We already have an approximate value for , so we will use that.
beta_mle = scipy.optimize.fsolve(root_fun, beta_mle_approx)[0]
print(beta_mle)
0.013937036406431463
It's exactly the same as our approximate MLE, varying at the fourth decimal point. So, we are not too far off by using the approximate value for the MLE.
Now to check the model, we will plot the ECDF of the measured data along with the CDF of the theoretical Exponential distribution.
p = iqplot.ecdf(
np.concatenate((t_lost, np.nan * np.ones(m))),
q="time of bead loss (min)",
y_range=[-0.025, 1.025],
)
# Theoretical curve
t_theor = np.linspace(0, 70, 200)
cdf = 1 - np.exp(-beta_mle * t_theor)
p.line(t_theor, cdf, line_width=2, line_color='orange')
bokeh.io.show(p)
We see that the agreement is quite poor. This could be due to a few reasons. The chemical process of cutting beads may not actually be well described by a Poisson process. This is very interesting! It would mean that there may be some more complex chemical dynamics at play, which we might want to explore with some more sophisticated experimentation.
Another possibility is that we made mistakes in determining how many beads were uncleaved. We filtered out stuck beads in our image processing protocol, but it is possible that there are beads with multiple DNA tethers, or those that may be stuck but moving on the surface.
If we lack confidence in our ability to detect tethered, but uncut beads, we could try another model, still assuming cutting is a Poisson process, but only taking into account cleaved beads.
The probability density function of bead loss times that we can observe is
We need to normalize this distribution. Let be the constant or proportionality. Then,
This gives , and our probability distribution of cleavage times is
So, for our set of observations, , we have a probability density function of
Writing this out, we have
since all observed bead loss times are less than . We will directly use this function to find the MLE for , as we showed in the previous model that the error in not explicitly taking into account the time interval of observation is small.
We wish to find the value of the parameter that leads to maximal . We proceed as before by computing , differentiating, and setting to zero.
Differentiating with respect to gives
So, we are left to find the value of for which
We can again use scipy.optimize.fsolve()
to find the value of the MLE. We will use as our initial guess.
def alt_root_fun(beta):
exp_val = np.exp(-beta * t_end)
return 1 / beta - t_mean - t_end * exp_val / (1 - exp_val)
# Find the MLE!
alt_beta_mle = scipy.optimize.fsolve(alt_root_fun, 1 / t_mean)[0]
# Report result
print(alt_beta_mle)
0.041951259352945
We can also get the time scale.
print(1 / alt_beta_mle)
23.837186664333107
This is much shorter, about 24 minutes. This might be expected, as we could have over-assigned beads that we did not lose.
Now that we have a maximum likelihood estimate for , we can compute the theoretical CDF and see how it looks compared to our measured ECDF. The theoretical CDF is
We can code this up and plot it along with our empirical CDF.
p = iqplot.ecdf(t_lost, q="time of bead loss (min)")
# Compute theoretical CDF
t_theor = np.linspace(0, t_end, 400)
cdf = (1 - np.exp(-alt_beta_mle * t_theor)) / (1 - np.exp(-alt_beta_mle * t_end))
p.line(t_theor, cdf, line_width=2, line_color='orange')
bokeh.io.show(p)
This looks much better, suggesting that perhaps we did misidentify beads we did not lose. It's a pity, since that limits the data we can use. This motivates us to collect images as long as possible.