Bioinformatics for Bi 1x

This tutorial was generated from an Jupyter notebook. You can download the notebook here.

We'll begin by importing the modules we need. Importantly, we will import scikit-bio (skbio), a new and rapidly developing Python toolkit for bioinformatics. It is currently in early stages of development, so you should check for changes to the API if you use it in the future. Furthermore, it is also only available for Mac OS X and Linux. For some of our analysis, we will use the older Biopython, which allows us access to some convenient sequence parsing methods. Both scikit-bio and Biopython can be installed using conda. To do this, do the following at the command line.

conda install biopython scikit-bio
In [1]:
import glob
import itertools
import os

# NumPy 
import numpy as np

# Bio modules
import Bio.SeqIO

# Comment this out if you do not have scikit-bio installed.
import skbio

# Plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2, 
      'axes.labelsize': 18, 
      'axes.titlesize': 18, 
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style('darkgrid', rc=rc)

Analysis of sequences from the Luria-Delbrück and V(D)J experiments

We will begin by analyzing the sequences from our VDJ cloning. You received the sequences and chromatograms in a ZIP file. Upon unzipping, the chromatogram files are in the directory VDJ sequencing. These files have a .ab1 suffix. They are in a format developed by Applied Biosystems (abi format). These files contain both the sequences and a measure of the quality of the sequences, defined by the Phred quality score. The Phred quality, $Q$ is defined as

\begin{align} Q = -10 \log_{10}p, \end{align}

where $p$ is the probability of a given base being wrong.

A more widely used file format is FASTQ. The Wikipedia page for the FASTQ format gives a good description. Since FASTQ files are much easier to work with, we will use Biopython to convert the abi files to FASTQ. (skbio does not read abi files). In doing this, we will see how we can use the os and glob modules to access files and process them.

First, let's see what's in the directory.

In [2]:
# Specify directory with VDJ chromatograms
ab1_vdj_dir = 'vdj_sequencing/VDJ_chromatograms/'

# Get a list of files in that directory
dir_list = os.listdir(ab1_vdj_dir)


Lets also look at the Luria Delbruck folder. For each group, you performed a colony PCR using a colony from the wild-type and the MSH2 strains. For each, we sequences using primers Ext_int, F1, F2, R2, and R3, working our way along the CAN1 sequence.

In [3]:
# Specify directory with VDJ chromatograms
ab1_ld_dir = 'luria_delbruck_sequencing/'

# Get a list of files in that directory
dir_list = os.listdir(ab1_ld_dir)


We now want to convert each of these into FASTQ format. We will make a new directory called vdj_2016_fastq and ld_2016_fastq to keep these files.

In [4]:
# Useful function to make directories (and pass if dir already exists)
def mkdir_p(path):
    except OSError:
        if os.path.isdir(path):
        else: raise

Now that we can safely and conveniently make directories, let's start parsing the V(D)J sequences.

In [5]:
# Specify FASTQ directory
fastq_vdj_dir = 'vdj_2016_fastq'
# Get all the .ab1 files in the directory (glob module is convenient!)
listing = glob.glob(os.path.join(ab1_vdj_dir, '*.ab1'))

# Go through each file in the chromatogram directory and convert it to FASTQ
for fname in listing:
    # Get the prefix of the file
    prefix = os.path.splitext(os.path.split(fname)[-1])[0]
    # Make the name of the output FASTQ file
    fastq_fname = os.path.join(fastq_vdj_dir, prefix + '.fastq')
    # Use Biopython to convert file format
    n_converted = Bio.SeqIO.convert(fname, 'abi', fastq_fname, 'fastq')

And we'll do the same for the Luria-Delbrück sequences.

In [6]:
# Specify FASTQ directory
fastq_ld_dir = 'ld_2016_fastq'
# Get all the .ab1 files in the directory (glob module is convenient!)
listing = glob.glob(os.path.join(ab1_ld_dir, '*.ab1'))

# Go through each file in the chromatogram directory and convert it to FASTQ
for fname in listing:
    # Get the prefix of the file
    prefix = os.path.splitext(os.path.split(fname)[-1])[0]
    # Make the name of the output FASTQ file
    fastq_fname = os.path.join(fastq_ld_dir, prefix + '.fastq')
    # Use Biopython to convert file format
    n_converted = Bio.SeqIO.convert(fname, 'abi', fastq_fname, 'fastq')

Lets take a quick look at one of our V(D)J sequences. In particular, lets plot the quality score:

In [7]:
#load sequence file using BioPython
in_file = 'vdj_2016_fastq/CMNS_VDJ.fastq'
for seq_record in Bio.SeqIO.parse(in_file, "fastq"):
    seq = seq_record

#plot quality score
plt.xlabel('Position (bp)', fontsize=20)
_ = plt.ylabel('PHRED Score', fontsize=20)

We find that the quality of the sequenced nucleotides (using standard Sanger sequencing) is poor early on and then again after about 700 bp, the quality drops again. When comparing sequences, keep in mind that the low-quality positions will be less reliable and it may be suitable to ignore them completely.

For our V(D)J cloning, all we want to know is whether our sequence contains the desired mutation. While we will be using Benchling to perform more sophisticated assembly and alignment, it will be insightful for us to code a simple aligner to scan our sequencing read against the known sequence of our plasmid.

To keep things simple, the following function turns any sequence of A, C, G, T, or N into a 4xL array. Each column corresponds to the nucleotide position, while the rows correspond to the nucleotide identity. For example, if the first letter is an 'A', the first column will be a 1 in row 1 and zero otherwise. In situations where we have sequence quality information, the position will be re-weighted by the base call accuracy, as calculated from the PHRED score.

In [8]:
def seq2mat(seq, qual = None):
    seq = String of letters (A,C,G,T, and N)
    qual = 1-d array containing PHRED scores for each 
    Converts the sequence of length L into a 4xL
    array, where row 1-4 corresponds to A, C, G, and T
    respectively. If no quality scores are provided, 
    each letter is given weight of 1. Otherwise, each position
    is weighted according to quality score (<=1).
    seq_dict = {'A':0,'C':1,'G':2,'T':3, 'N':np.array([0,1,2,3])}
    mat = np.zeros((4,len(seq)))
    for i,bp in enumerate(seq):
        if qual==None:
            mat[seq_dict[bp],i] = 1
            mat[seq_dict[bp],i] = 1.0 -10.0**(-qual[i]/10.0)
    return mat

Lets load in our sequence file and the plasmid template sequence we will compare to:

In [9]:
#load sequence file again using BioPython
seq_in_file = 'vdj_2016_fastq/CMNS_VDJ.fastq'
for seq_record in Bio.SeqIO.parse(seq_in_file, "fastq"):
    seq = seq_record
#convert template seq to mat
seq_mat = seq2mat(seq.seq, qual=seq.letter_annotations['phred_quality'])

#load our template file again using BioPython
template_in_file = 'vdj_sequencing/pZE12_template.fasta'
for seq_record in Bio.SeqIO.parse(template_in_file, "fasta"):
    template = seq_record
#convert template seq to mat
template_mat = seq2mat(template.seq)

Now, lets move our sequence array along the plasmid template array and perform position-by-position multiplication. Whenever a sequence nucleotide does not match with the template, we will be a product equal to zero and hence, a low alignment score.

In [10]:
# Create blank array to place our alignment scores
seq_scan = np.zeros(len(template_mat[1])-len(seq_mat[1]))

# Perform scan
for i in range(0,len(seq_scan)):
    seq_scan[i] = np.sum(np.multiply(seq_mat,template_mat[:,i:i+len(seq_mat[1])]))

Lets take a quick look at what we found by plotting our alignment scores from our scan. The maximum alignment score (a perfect match) should be equal to the length of the sequence we are aligning. Since we weighted each sequence position by its PHRED score, this might be slightly less.

In [11]:
#Plot alignment scores divided by length of the sequence we are aligning.
_ = plt.plot(seq_scan/len(seq_mat[1]))
_ = plt.xlabel('Scan Position (bp)', fontsize=20)
_ = plt.ylabel('Alignment Score (a.u.)', fontsize=20)

It looks like we found a match! Lets take a look at whether our coloning worked and we inserted the correct mutation. Since the sequence read is roughly 800bp in length, lets just zoom in on the region where the 12 RSS is; roughly 200bp from our sequencing primer.

In [12]:
# region of interest from primer site
bp_start = 200 #bp
bp_len = 50 #bp ; length of region we'll look at

with sns.axes_style('white'):
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(15, 9))

    # Generate a custom diverging colormap with seaborn
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Lets relabel our axis
    yticks = ['A','C','G','T']
    xticks = template.seq[i+bp_start:i+bp_start+bp_len]

    # Position of best sequence alignment
    i = seq_scan.argmax()

    # Draw a 'heatmap' for the region of interest in our sequence red. 
    # We will use seaborn's heatmap tool. Dark blue indicates a mismatch
    sns.heatmap(seq_mat[:,bp_start:bp_start+bp_len], cmap=cmap, vmax=.3,
                linewidths=.5, cbar_kws={"shrink": .5}, ax=ax,
                square=True, cbar=False, alpha =1)

    # Overlay our template. Light red should indicate the WT sequence where
    # a mutation has occured.
    sns.heatmap(seq2mat(template.seq[i+bp_start:i+bp_start+bp_len]), vmax=.3,
                linewidths=.5, cbar_kws={"shrink": .5}, ax=ax,
                square=True, cbar=False, alpha = 0.5, yticklabels=yticks,

    # Lets rotate the text labels
    _ = plt.yticks(rotation=0) 
    _ = plt.xticks(rotation=0) 

Well, that worked quite well! Without too much effort we found out that our cloning worked and we have inserted the desired mutation. Now we're ready to move on and perform our tethered particle motion (TPM) experiment.

However, in general, the assembly of genomic sequences is not a trivial process and there has been a lot of effort made to allow us to generate whole-genome assemblies from large sets smaller DNA reads. So we should take caution and consider instances when the above approach might fail. Consider what would happen if we had an unexpected insertion or deletion? How about if we had used the reverse complement of our sequence?

Hence for more complicated situations, and in particular where we need to assemble sequences without any reference template, we would have to consider more complex algorithms.

The longest common subsequence

One important algorithm associated with DNA assemblies is to compute the length of the longest common subsequence (LCS) between two sequences. Deletions and insertions are common throughout evolution and the LCS give a good indication of how similar two sequences are, even if these insertions or deletions are present. Here, a subsequence refers to a sequence that can be generated by deleting entries in the original sequence, but leaving the order unchanged.For example, the subsequences of length three of ATGC are:


All subsequences contain one deletion. There are four deletions we could choose, so there are four subsequences. The LCS between two sequences is the subsequence(s) that both share that are the longest.

Finding all subsequences

To implement this strategy, we need a function to generate all subsequences of a given length. Its call signature will be

allsubseqs(seq, n)

and it will return a set of all of the subsequences. A set is a built-in Python data type that is like a list, but only contains the unique elements. It is unordered and immutable.

It turns out that we don't need to write this function! The itertools.combinations() function in the standard library gives this functionality. We don't need to test this (since the standard library is trustworthy; carefully unit tested), so we can just use it.

In [13]:
def allsubseqs(seq, n):
    """Return a set containing all subsequences of length n"""
    return set(itertools.combinations(seq, n))

Let's give it a try.

In [14]:
allsubseqs('abcd', 3)
{('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'), ('b', 'c', 'd')}

So now we could consider writing an algorithm where we start by first comparing two sets of subsequences of length n and ask whether we found any matches. If not, we would then repeat by comparing subsequences of length n-1, and so on until we found the best match. As you might expect, the number of possible subsequences grows quite dramatically with sequence length and requires substantially more computation or clever algorithms than our simple approach above.

Unless you are in the business of coming up with novel algorithms, many of the algorithms you use in your work in biology will be those developed by someone else. For our purposes today, we will make use of the built-in alignment tools provided by Benchling to align our multiple sequence reads across the CAN1 gene in yeast. If you're interested in seeing how to code up an assembly tool that searches for the LCS and applies an approach using dynamic programming to reduce computation time, check out this other tutorial on algorithmic complexity.

Merging files

Before we go ahead and use Benchling it will be convenient to merge our Luria Delbruck sequences into a single FASTQ file. This is easily done with scikit-bio or with Biopython. We will save the merged FASTQ files to a fresh directory, since these are the files we'll use to do assembly with Benchling.

First, we'll get the listings, prefixes, etc., we need to merge the files.

In [15]:
# Make a new directory for merged, trimmed FASTQ files
merged_dir = os.path.join(fastq_ld_dir, 'merged')

# Get all FASTQ files
listing = glob.glob(os.path.join(fastq_ld_dir, '*.fastq'))

# Get all prefixes of files (this is anything that comes before underscore)
prefixes = []
for fname in listing:
    file_prefix = os.path.split(fname)[-1]
    prefix = file_prefix[:file_prefix.find('_')]
    if prefix not in prefixes:

Now, we'll merge the files using scikit-bio.

In [16]:
# Loop through prefixes and merge files
for prefix in prefixes:
    listing = glob.glob(os.path.join(fastq_ld_dir, prefix + '*.fastq'))
    out_file = os.path.join(merged_dir, prefix + '.fastq')
    with open(out_file, 'w') as f:
        for fname in listing:
            seqs =, format='fastq', variant='sanger')
            for seq in seqs:
                seq.write(f, format='fastq', variant='sanger')

If you don't have scikit-bio, you can alternatively do it with BioPython.

In [17]:
# Loop through prefixes and merge files
for prefix in prefixes:
    listing = glob.glob(os.path.join(fastq_ld_dir, prefix + '*.fastq'))
    out_file = os.path.join(merged_dir, prefix + '_ld.fastq')
    with open(out_file, 'w') as f:
        for fname in listing:
            seqs = Bio.SeqIO.parse(fname, 'fastq')
            for seq in seqs:
                Bio.SeqIO.write(seq, f, 'fastq')

Assembling the sequences and finding mutations using Benchling

Now that we have our set of reads contained in a single FASTQ file, we can assemble them into a contiguous gene. We will assmble them with respect to the reference wilde type CAN1. To get the sequence, we can query the Yeast Genome Database for CAN1. We can then download a FASTA (.fsa) file containing the sequence.

To assemble the sequences using Benchling, do the following.

  1. Create a new project folder (let's call this 'LD_sequence_assembly') using the '+' button in the left panel.
  2. In the new project folder you can then click the '+' button in the Inventory panel (again in the left panel). Select 'Import Sequences'. You can now import your FASTA file of the CAN1 gene.
  3. Perform the assembly: Select the CAN1 sequence that you imported. On the right most panel, there are several icons with different tools. Hover over them and click the 'Align' tool (5th one down). Select 'CREATE NEW ALIGNMENT'. Select 'CHOOSE FILE(s)' and find your merged FASTQ file of sequences. The default choice of aligner is fine for our assembly. Finally select 'CREATE ALIGNMENT'.
  4. After the alignment is complete, you should see whether the colonies you selected have any mutations. Nucleotide positions that are highlighted red are mismatches between the sequence and the CAN1 template. Keep in mind that sequences near the start of a sequencing run and near the end will have low quality scores and be unreliable (that's why we've performed 5 staggered sequencing runs). Use the alignment to find discrepencies between the wild type CAN1 gene and those that you sequenced from your colonies that survived Canavanine exposure.

You should only consider a mismatch between your sequences and the reference sequence to be a mutation if all (and at least two) of the sequences covering the region of the reference sequence show the same discrepancy. A snapshot (using print screen or similar) is sufficient when you submit your assignment.

You can repeat the same procedure with your V(D)J cloning sequence to double-check that you have the desired mutation.