(c) 2023 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook, which can be downloaded here.
In this tutorial, we will step through an image processing pipeline for analysis of the movies of beads we take in the single molecule kinetics experiment. Many image processing tasks are constructed as pipelines, a set of well-defined image processing operations done in succession. Deciding on the steps in your pipeline is as important as implementing it.
In the in-class tutorial, we will discuss options for a pipeline together. As we will see, processing the images present their own unique challenges. The steps in the pipeline can be somewhat complicated, so I have written code to do many of the steps for you. So, in this module, you will focus less on implementation and more on what we are doing to the images. We will rely on the modules I wrote for you to do the analysis.
As you work up your own data, be sure to read the doc strings of the functions. You might need to tweak parameters for your data sets. Now that we have the module, we can import it and get to work on analyzing our images!
import os
import numpy as np
import skimage.io
import colorcet
import bi1x
# This is the single molecule kinetics module
from bi1x import smk
import iqplot
import bokeh.plotting
import bokeh.io
# URL of notebook for interactive plotting
notebook_url = 'localhost:8888'
bokeh.io.output_notebook()
We will work with two data sets. The first comes from Soichi Hirokawa from 2016 when we were first developing this experiment for Bi 1x. It is stored in the practice_2016
directory. The second comes from a trial run by Russel Swift and Fred He and is in a directory labeled practice_2023
.
The images for the 2016 data set were acquired using 80 nM RAG with the consensus (wild type) sequence, and the images for the 2023 data set were acquired using XXXX nM RAG.
We will first analyze the 2016 data set to demonstrate the pipeline and then proceed to apply the same pipeline to the 2023 data set.
As an added wrinkle, it is common that the autofocus on the microscopes will get the wrong focal plane and show dark beads instead of light beads (think: why would this happen?). For the 2016 data set, this is not a problem, but is present in the 2023 data set.
So, let's load the images and take a look. We will load all of the images at the outset and store them in a list. We will use the handy os.path.join()
function to get the file names. While we're at it, we will also specify the inter-frame time, which is half a minute.
# Interframe time
dt = 0.5 # minutes
# Directory containing images
img_dir = "practice_2016"
# Get list of files, sorted so they are in the right order
flist = list(sorted(os.listdir(img_dir)))
# Load in TIFF files into collection of images
ic = [
skimage.io.imread(os.path.join(img_dir, fname))
for fname in flist
if fname[-4:] == ".tif"
]
Now, let's take a look at an image.
gray_mapper = lambda: bokeh.models.LinearColorMapper(colorcet.gray)
bokeh.io.show(bi1x.viz.imshow(ic[0], color_mapper=gray_mapper()))
It is kind of hard to see any beads, so let's zoom in to the top left corner.
inds = np.s_[:200, :200]
bokeh.io.show(bi1x.viz.imshow(ic[0][inds], color_mapper=gray_mapper()))
We see the beads, but also plenty of noise and occasional schmutz. Our goal is to identify the beads and determine when they are lost.
In class, we will develop a pipeline to do the analysis. I present the pipeline here. We will then work through the pipeline, step by step, with comments about what we are doing along the way.
Our strategy for detecting beads loss events is as follows. We will detect where all beads are in the first frame. We then draw a small box, called a region of interest, or ROI, around each bead. We then count how many beads (usually one or zero) that are inside the ROI over time. When an ROI has zero beads for four out of five successive frames (with the first of those five having zero beads), we say the bead was lost.
To implement this strategy, we need to specify the size of our ROI. We specify the radius as the maximum radius (in pixels) a tethered bead would be expected to diffuse. For our beads with the 40× objectives, this is typically an 11×11 ROI. So, we specify the ROI radius as 5.
roi_radius = 5
The pipeline I propose we use for analyzing the images
We have already done this. It is important to have an idea how much RAM you have to make sure the images in the stack of images will all fit in the RAM.
Sometimes frames go out of focus because the autofocus fails. For example, frame 63 is out of focus. Let's look at it.
bokeh.io.show(
bokeh.layouts.row(
bi1x.viz.imshow(ic[0][inds], color_mapper=gray_mapper(), frame_height=200),
bi1x.viz.imshow(ic[63][inds], color_mapper=gray_mapper(), frame_height=200),
)
)
Not good! We would not detect any beads in that frame, so we definitely want to exclude it. We can automatically detect bad frames by looking at the coefficient of variation (the standard deviation divided by the mean) of the pixel values. A high coefficient of variation is indicative of being in focus, since there are sharp white pixels and black pixels. A low coefficient of variation means we have poor focus. (In fact, the autofocus algorithms on the Olympus microscopes use the coefficient of variation to find the best focal plane.)
The function smk.good_frames()
gives the indices in the list of images of good frames. That is, the frames where the coefficient of variation is within some fraction of the mean coefficient of variation. Using the bad_frames
kwarg, we can also specify which frames we know to be bad. In this case, we do not know about any a priori.
# Find good frames
gf, covs = smk.good_frames(ic, 0.95, return_covs=True)
Let's check how we did with frame selection, plotting the coefficient of variation as a solid line and putting dots on good frames.
p = bokeh.plotting.figure(
frame_width=500,
frame_height=200,
x_axis_label='frame number',
y_axis_label='coefficient of variation',
)
p.line(np.arange(len(covs)), covs)
p.circle(gf, covs[gf], color='orange')
bokeh.io.show(p)
We see that frames 63 and 72 are bad. We should only include frames we know are good. To make subsequent analysis easier, we will make a new image collection with only the good frames and a set of time points with the good times points.
# Set up time points and new list of images with good time frames
ic = [ic[i] for i in gf]
t = (np.array(gf, dtype=float) - gf[0]) * dt
While we try to prevent it, if the stage drifts in the - plane, we have the problem that beads drift out of their ROI, but were not cleaved. To correct for this, we perform drift correction. To do this, we use cross-correlation of images, a more advanced topic in image processing that I will not discuss in detail here. We use the smk.drift_correct()
function to do this, which gives a new list of images that have been drift corrected. It also gives us the maximum shift of any given image. This is later used to make sure any beads close to the image boundary are not included because they could drift off of the entirely.
This calculation is the longest of the pipeline.
ic_dc, max_shift = smk.drift_correct(ic)
Next, we want to filter the images so that it is easier to detect peaks in intensity that correspond to beads. The filtering we do is fairly standard for bead tracking algorithms. First, we perform a mean filter using a structuring element that is a bit larger than a typical bead size. Our beads are roughly three pixels across, so we use a 5 5 square structuring element. We then perform a gentle Gaussian blur (usually with pixel) to smooth out the peaks. We subtract the mean filtered image from the blurred image, setting all negative pixel values to zero. This sharply enhances the peaks, as we'll see below.
We do not need to activate the neg_im
keyword argument for this data set, since the beads are all in focus and white.
We will now filter all of the images and store them in a list. This is the second slowest step.
ic_filt = [smk.preprocess(im, sigma=1, selem_width=5, neg_im=False) for im in ic_dc]
Let's take a look at what this filtering did.
bokeh.io.show(
bokeh.layouts.row(
bi1x.viz.imshow(ic[0][inds], color_mapper=gray_mapper(), frame_height=200),
bi1x.viz.imshow(ic_filt[0][inds], color_mapper=gray_mapper(), frame_height=200),
)
)
Much sharper beads, and easier to detect!
A peak in the image corresponds to a bead. Fortunately, there is a function to find peaks in images built in to scikit-image, skimage.feature.peak_local_max()
. Under the hood, the smk
module uses this to get the peaks. We will use the smk.get_all_peaks()
function to use the filtered images to identify bead positions. The result is a list of binary images. In each of these images, a pixel is True
if it corresponds to a peak and False
otherwise.
# Find peaks in filtered images
peaks = smk.get_all_peaks(ic_filt, border=max_shift)
This is simple. We simply take each peak in the first image in our stack and draw a box around it. This specifies the ROI. Conveniently, the smk.peak_rois()
function does this and returns the ROIs as NumPy slice
objects. This allows us to get the ROI out of an image, say im
, as im[roi]
.
# Get ROIs from first frame
rois = smk.peak_rois(peaks[0], roi_radius)
Some of the ROIs might contain two beads because the beads are very close (or stuck) together. Others might have a falsely-identified peak. There are two ways we filter the ROIs. First, we can check to see if the total intensity of the pixels in the ROI is reasonable. If it is too high, we might have two beads or junk in the ROI. If it is too low, we might have a falsely identified bead. Second, we could just check for the number of peaks in the ROI and delete those that have two or more.
We will first filter our ROIs based on intensity. If the intensity in each frame is beyond one standard deviation of the mean intensity, we will not consider it. Then, we will further filter those, keeping only those with a single peak. We will print how many ROIs we have as we go along to see how many are filtered out.
print(len(rois))
rois = smk.filter_rois_peaks(peaks[0], rois)
print(len(rois))
rois = smk.filter_rois_intensity(ic_filt[0], rois, thresh_std=(0.5, 1))
print(len(rois))
2790 2590 1850
Our next step is to filter out ROIs containing beads that are merely stuck to the surface, but not tethered to DNA. Because they have no DNA tethering them to the surface, such beads cannot be released upon cutting of the tether.
To determine if a bead is stuck, we will compute the variance in the position of the beads over time. We expect tethered beads to wiggle around, so if the variance is very low, this suggests that the bead is stuck to the surface. We therefore compute the variance (in units of interpixel distance) in peak position for each bead. To compute the variance, let , where and are the - and -coordinates of a given bead. The variance is then
where averages are denoted with overbars and are taken over all of the frames that contain a bead for a given ROI.
bead_variances = smk.positional_variance(peaks, rois)
Now that we have computed the positional variance of beads in all of the ROIs, let's take a look. We will plot both an ECDF and a histogram of the positional variance for all of our beads.
p_ecdf = iqplot.ecdf(
bead_variances,
x_axis_label="variance (sq. pixels)",
frame_height=150,
frame_width=300,
)
p_hist = iqplot.histogram(
bead_variances,
rug=False,
x_axis_label="variance (sq. pixels)",
frame_height=150,
frame_width=300,
x_range=p_ecdf.x_range,
)
bokeh.io.show(bokeh.layouts.column(p_ecdf, p_hist))
From the inflection point in the ECDF and the clearly divided peaks in the histogram, it is clear that roughly half of the beads have low variance and are therefore probably stuck. The remaining a free to diffuse and are therefore likely tethered. The inflection point appears to be at about a variance of 1.5 square interpixel distances. To make sure we are a bit above this, we will filter out all ROIs containing beads with positional variance below that value.
# Variance cutoff
var_cutoff = 1.8
rois = [roi for roi, var in zip(rois, bead_variances) if var >= var_cutoff]
# Total number of beads we consider
n_beads_2016 = len(rois)
print(n_beads_2016)
604
To find the number of peaks in each ROI, we can simply slice the ROI out of the peaks image and then sum the pixels. Remember, True pixels in the binary peaks image correspond to a peak (bead). The function smk.n_peaks_in_rois()
does this for all ROIs for each time point. The output is a NumPy array where entry i,j
is the number of peaks in ROI i
in frame j
.
# Find number of peaks in each ROI
n_peaks = smk.n_peaks_in_rois(peaks, rois)
Now that we've found the peaks, let's do a quick sanity check to see how many ROIs have zero, one, two, etc., beads in them.
bokeh.io.show(iqplot.spike(n_peaks.ravel()))
So, the vast majority of ROIs have one or zero beads in them, as we would expect from good segmentation.
Now, we'll compute the time where the beads are lost. To do this, we must define what it is to lose a bead. An ROI is considered to have lost its bead in frame i
if it has no peaks in frame i
and in frames i
to i + n_frames
more than a fraction frac
of frames have no peaks. We can scan through the image stack for each ROI and compute this.
The function smk.all_bead_loss()
does this and returns the time at which the bead was lost. It also returns the indices of the ROI corresponding to the loss events.
# Get bead loss times
t_lost_2016, beads_lost_2016 = smk.all_bead_loss(t, n_peaks, n_frames=5, frac=0.75)
Now that we have the bead loss times, we can visualize the results. Often, people plot histograms of bead loss times to visualize the distributions. I strongly advise against this because it introduces binning bias. You can instead plot your data without any binning or estimation as an empirical cumulative distribution function.
It is a little tricky in this case, because we also want to include beads we did not lose.
p = iqplot.ecdf(
np.concatenate((t_lost_2016, np.nan * np.ones(n_beads_2016 - len(t_lost_2016)))),
q="time of bead loss (min)",
y_range=[-0.025, 1.025],
)
bokeh.io.show(p)
Finally, we need to save the times where we lost our beads. This is the input to the next phase of analysis when we do maximum likelihood estimation to determine the characteristic time scale of bead loss.
We will use the np.savetxt()
function to save our data to a CSV file.
# Save time of lost beads
np.savetxt('t_lost_2016.csv', t_lost_2016, fmt='%.2f')
# Save number of uncut beads
np.savetxt('number_of_uncut_2016.csv', np.array([n_beads_2016 - len(t_lost_2016)]))
# Save time points
np.savetxt('time_points_2016.csv', t)
It is useful to occasionally manually verify that your image processing is working correctly. We can build an interactive Bokeh app to check.
Before we use the app, let's check for the first four beads that were lost so we can check those specifically.
beads_lost_2016[:4]
array([1, 2, 3, 5])
Now, let's look at the app. To show the app, we need to specify the notebook URL, which can be referenced from navigation bar on your browser. Usually, the notebook URL is 'localhost:8888'
. We set this in the first cell of this notebook.
bokeh.io.show(
smk.bead_checker(ic_filt, rois, roi_radius, peaks), notebook_url=notebook_url
)
This is interactive, so we can't show the results in the static rendering. But we find all four of these loss events were legit.
Now, we proceed to analyze the 2023 data. The pipeline is exactly the same, except we need to account for black beads.
This experiment was done with the wild type consensus sequence with 50 nM RAG and 400 nM HMGB1.
We will work through the pipeline momentarily, but let's first load in the images and take a look.
# Directory containing images
img_dir = "practice_2023"
# Get list of files, sorted so they are in the right order
flist = list(sorted(os.listdir(img_dir)))
ic = [
skimage.io.imread(os.path.join(img_dir, fname))
for fname in flist
if fname[-4:] == ".tif"
]
# View the upper left corner of an image
bokeh.io.show(bi1x.viz.imshow(ic[0][inds], color_mapper=gray_mapper()))
It is clear that we are in a focal plane immediately below the bead; we will need to activate the neg_im
keyword argument when preprocessing the images.
We again check for focus issues.
# Find good frames
gf, covs = smk.good_frames(ic, 0.95, return_covs=True)
p = bokeh.plotting.figure(
frame_width=500,
frame_height=200,
x_axis_label='frame number',
y_axis_label='coefficient of variation',
)
p.line(np.arange(len(covs)), covs)
p.circle(gf, covs[gf], color='orange')
bokeh.io.show(p)
This is different than the previous sample. Here, the coefficient of variation steadily degrades as the microscope slowly goes out of focus until it falls off for the last few frames. Let's take a quick look at the last frame to see what happened.
bokeh.io.show(
bokeh.layouts.row(
bi1x.viz.imshow(ic[0][inds], color_mapper=gray_mapper(), frame_height=200),
bi1x.viz.imshow(ic[-1][inds], color_mapper=gray_mapper(), frame_height=200),
)
)
It is not too awful, but definitely falling out of focus.
We now preprocess with filtering and drift correction.
# Drift correct
ic_dc, max_shift = smk.drift_correct(ic)
# Filter, taking into account out-of-focus beads
ic_filt = [smk.preprocess(im, sigma=1, selem_width=5, neg_im=True) for im in ic_dc]
# Take a look at filtered image
p1 = bi1x.viz.imshow(ic[0][inds], color_mapper=gray_mapper(), frame_height=200)
p2 = bi1x.viz.imshow(ic_filt[0][inds], color_mapper=gray_mapper(), frame_height=200)
p1.x_range = p2.x_range
p1.y_range = p2.y_range
bokeh.io.show(bokeh.layouts.row(p1, p2))
It's generally decent, but the diffraction patterns from being out of focus are a hindrance. Let us proceed with the rest of the pipeline.
# Find peaks in filtered images
peaks = smk.get_all_peaks(ic_filt, border=max_shift)
# Get ROIs from first frame
rois = smk.peak_rois(peaks[0], roi_radius)
# Filter the ROIS
rois = smk.filter_rois_peaks(peaks[0], rois)
rois = smk.filter_rois_intensity(ic_filt[0], rois, thresh_std=(0.5, 1))
# Compute positional variance to determine stuck beads
bead_variances = smk.positional_variance(peaks, rois)
# Take a look at bead variances
p_ecdf = iqplot.ecdf(
bead_variances,
x_axis_label="variance (sq. pixels)",
frame_height=150,
frame_width=300,
)
p_hist = iqplot.histogram(
bead_variances,
rug=False,
x_axis_label="variance (sq. pixels)",
frame_height=150,
frame_width=300,
x_range=p_ecdf.x_range,
)
bokeh.io.show(bokeh.layouts.column(p_ecdf, p_hist))
This is a different setup than the 2016 sample data. The cutoff here is about 3 square interpixel distances.
# Variance cutoff
var_cutoff = 3.0
# Only keep ROIs above variance cutoff
rois = [roi for roi, var in zip(rois, bead_variances) if var >= var_cutoff]
# Total number of beads considered
n_beads_2023 = len(rois)
# Find number of peaks in each ROI
n_peaks = smk.n_peaks_in_rois(peaks, rois)
# Check: Do most ROIs have zero of one peaks?
bokeh.io.show(iqplot.spike(n_peaks.ravel()))
Indeed, most ROIs have either zero or one peak, so our segmentation seems to be working ok.
Finally, let's compute the loss times.
# Get bead loss times
t_lost_2023, beads_lost_2023 = smk.all_bead_loss(t, n_peaks, n_frames=5, frac=0.75)
# Display ECDF with loss times
p = iqplot.ecdf(
np.concatenate((t_lost_2016, np.nan * np.ones(n_beads_2016 - len(t_lost_2016)))),
q="time of bead loss (min)",
legend_label="2016",
)
p = iqplot.ecdf(
np.concatenate((t_lost_2023, np.nan * np.ones(n_beads_2023 - len(t_lost_2023)))),
marker_kwargs=dict(color="orange"),
fill_kwargs=dict(fill_color="orange"),
legend_label="2023",
p=p,
)
bokeh.io.show(p)
There is a departure in the curves, likely due to different batches of the enzyme. The 2023 enzyme was at higher concentration, but was an older batch, which means there may have been some degradation.