Analysis of sequencing data from the antibiotic resistance experiment¶

(c) 2022–2025 Liana Merk, Gil Sharon, and Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with support financial support from the Donna and Benjamin M. Rosen Bioengineering Center.


In [1]:
import os
import fnmatch
import pandas as pd
import numpy as np

from qiime2 import Visualization

In the cell below, you should specify the specifics of your analysis. After that, you will mostly just want to run the cells through to the bottom of this notebook. At the end, you will have QIIME artifacts readily available for analysis.

In [2]:
# Put the full path to your home working directory
home_dir = "/home/ec2-user/sequencing_analysis/"

# Subdirectory in home_dir containing all of the reads (probably 'soil')
data_dir = "soil/"

# Whether or not to do pre-processing (this takes a looong time)
# Only change to True if you have not done it yet.
pre_process = False

QIIME2 has two object types, artifacts (with suffix .qza), which are used to store and access data, and visualizations (with suffix .qzv), which is are used, you guessed it, to visualize the results. We will create two directories to store these files.

In [3]:
# Full path to the directory with reads
data_dir = os.path.join(home_dir, data_dir)

# These are artifact and visualization directories located within the home directory.
artifact_dir = os.path.join(home_dir, 'qiime_artifacts/')
visualization_dir = os.path.join(home_dir, 'qiime_visualizations/')

# If these do not exist, make them
if not os.path.exists(artifact_dir):
    os.makedirs(artifact_dir)
    
if not os.path.exists(visualization_dir):
    os.makedirs(visualization_dir)

Purpose¶

In this experiment, we are analyzing the microbial content of soil samples. We extracted the DNA using a Qiagen PowerSoil kit. We amplified the V4-V5 region of 16S, attaching Illumina adapters for a subsequent MiSeq run. The reads were demultiplexed by Laragen, our sequencing provider, and are stored in the soil/ directory. Here, we will use QIIME2 as our analytics platform. We will denoise using DADA2 and trim using the primer length. We will then perform taxonomic classification. Using these results, we will delve into some of the highest colonizing populations. Further, we seek to understand how the microbial content of soil varies with the local environment form which they were collected.

Pre-Processing¶

Creating the Manifest File¶

First, QIIME2 needs to know which sequencing files came from which samples. In our experiment, we performed paired end sequencing, where we sequenced both ends of the amplified V4 region of 16S. We will search through the directory of sequnces, which is stored in data_dir, and gather the prefixes. Then, we will add two columns, one which records the forward path and one with the reverse read path.

In [4]:
if pre_process:
    with open(os.path.join(home_dir, 'manifest.txt'), 'w') as f:
        # Add formatting rows
        f.write('sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n')

        # list all files and write the forward reads into the manifest file
        list_of_files = os.listdir(data_dir)  
        forward = '*L001_R1_001.fastq.gz'

        for file in list_of_files:  
            if fnmatch.fnmatch(file, forward):
                line = str(file.split("_")[0]
                           + f'\t{data_dir}'
                           + str(file.split("L001")[0])
                           + 'L001_R1_001.fastq.gz'
                           + f'\t{data_dir}'
                           + str(file.split("L001")[0])
                           + 'L001_R2_001.fastq.gz\n')
                f.write(line)

    pd.set_option('display.max_colwidth', None)           
    pd.read_csv(os.path.join(home_dir, 'manifest.txt'), sep='\t')

Importing the sequences¶

Now that we have the manifest, we and use QIIME2 to import the sequences and store them as a QIIME2 artifact. QIIME2 is a command line tool, so we need to run it from the command line of the instance. However, we do not need to directly do this from the terminal, and we should in fact not do it from the terminal. Rather, we can make repeated command line executions from the Jupyter notebook. A command that is preceded with a ! (pronounced "bang") asks Jupyter to run what follows on the command line. We use QIIME2's import tool.

In [5]:
if pre_process:
    !qiime tools import \
      --type 'SampleData[PairedEndSequencesWithQuality]' \
      --input-path $home_dir'manifest.txt' \
      --output-path $artifact_dir'pair-end-demux.qza' \
      --input-format PairedEndFastqManifestPhred33V2

Now we can create a visualization of the demultiplexed samples to investigate sequencing quality.

In [6]:
# Next, create a visualization of demuliplexed samples with quality
if pre_process:
    !qiime demux summarize \
      --i-data $artifact_dir'pair-end-demux.qza' \
      --o-visualization $visualization_dir'pair-end-demux.qzv'

To look at the visualization, we use QIIME2's Visualization.load() function. A quick and easy way to see the visualizations is to drag and drop files to QIIME's convenient web-based viewer.

In [7]:
Visualization.load(os.path.join(visualization_dir, 'pair-end-demux.qzv'))
Out[7]:
No description has been provided for this image

Trimming and Denoising¶

We see the quality is quite good throughout the whole read, up to base 151. This is because our sequencing provider already did some filtering of good reads for us. For our denoising procedure, we will truncate to base 150 for both the forward and reverse reads. Note we do not need to trim, since the reads we got already have the primers trimmed, thanks to our sequencing provider.

Note that in the below, I am using four threads for the denoising. You should only select the number of threads that correspond to the number of CPUs you have available on your machine.

In [8]:
# Use Dada2 to denoise the sample. This removes spurious reads that are more likely to be sequencing errors than novel colonies
if pre_process:
    !qiime dada2 denoise-paired \
        --i-demultiplexed-seqs $artifact_dir'pair-end-demux.qza' \
        --p-trunc-len-f 150 \
        --p-trunc-len-r 150 \
        --p-n-threads 4 \
        --o-representative-sequences $artifact_dir'rep-seqs.qza' \
        --o-table $artifact_dir'OTU_table.qza' \
        --o-denoising-stats $artifact_dir'stats-dada2.qza'

And we can again make a visualization.

In [9]:
if pre_process:
    !qiime metadata tabulate \
      --m-input-file $artifact_dir'stats-dada2.qza' \
      --o-visualization $visualization_dir'stats-dada2.qzv'

Adding metadata¶

Each sample has its own brewing conditions, and we need to feed this information in such that qiime can associate them to their given sample. This information was created on google sheets and then validated to be in the correct format using the Keemei tool.

In [10]:
metadata = os.path.join(home_dir, 'metadata.tsv')
df = pd.read_csv(metadata, sep='\t')

# Display dataframe
df
Out[10]:
sample-id team location estimated-depth-inches pH shallow
0 1 ABORW Orange Walk 4.5 4.5 0
1 2 RA Watson Laboratories 3.4 4.9 0
2 3 AC NO DATA NO DATA NO DATA NO DATA
3 4 LFCG NO DATA NO DATA NO DATA NO DATA
4 5 JP Oak Tree near Arms 4.7 4 0
5 6 MLRB Dabney Garden 4 4.6 0
6 7 NK West Lilly Pond 2 7 1
7 8 ES NO DATA NO DATA NO DATA NO DATA
8 9 F4CMR NO DATA NO DATA NO DATA NO DATA
9 10 GC4 Chen Garden 5.5 6 0
10 11 EY Tournament Park 2 4 1
11 12 K Tournament Park 3 4 0
12 13 SECS2 Mud patch from middle of turtle pond 1.5 6 1
13 14 SECS1 Under tree next to turtle pond 3.5 6 0
14 15 MR NO DATA NO DATA NO DATA NO DATA
15 16 BMRC Bush in between chemical physics building and ... 1.57 4 1
16 17 ZA Tournament park by trash bin on left 1.8 4.5 1
17 18 CL Tournament park by trash bin on left 1.8 4.5 1
18 19 ABOLW Olive Walk 4 4.6 0

Feature Table Creation¶

Now let's generate an OTU table, a table of operational taxonomic units. We will go ahead and make a visualization in the same code cell.

In [11]:
# Gives us a table with each feature and its abundance
!qiime feature-table summarize \
  --i-table $artifact_dir'OTU_table.qza' \
  --o-visualization $visualization_dir'OTU_table.qzv' \
  --m-sample-metadata-file $metadata

# Relates the features to the sequences
!qiime feature-table tabulate-seqs \
  --i-data $artifact_dir'rep-seqs.qza' \
  --o-visualization $visualization_dir'rep-seqs.qzv'
R[write to console]: Warning:
R[write to console]:  ‘timedatectl’ indicates the non-existent timezone name ‘n/a’

R[write to console]: Warning:
R[write to console]:  Your system is mis-configured: ‘/etc/localtime’ is not a symlink

R[write to console]: Warning:
R[write to console]:  It is strongly recommended to set envionment variable TZ to ‘Etc/UCT’ (or equivalent)

Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/OTU_table.qzv
R[write to console]: Warning:
R[write to console]:  ‘timedatectl’ indicates the non-existent timezone name ‘n/a’

R[write to console]: Warning:
R[write to console]:  Your system is mis-configured: ‘/etc/localtime’ is not a symlink

R[write to console]: Warning:
R[write to console]:  It is strongly recommended to set envionment variable TZ to ‘Etc/UCT’ (or equivalent)

Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/rep-seqs.qzv

Taxonomic analysis¶

There are several ways to do taxonomic analysis. We will use a Native Bayes classifier using the Green Genes reference.

Using Native Bayes classifier¶

We can then use a GreenGenes reference taxonomy to make our taxonometric assigment. Note there is no classifier for 515F/926R primers, so we can test using the 515F/806R classifiers. First, we need to fetch the classifier. We can fetch the lightweight one from the moving pictures tutorial (gg-13-8-99-515-806-nb-classifier.qza) and also a more complete classifier (2022.10.backbone.full-length.nb.sklearn-1.4.2.qza)

In [12]:
if not os.path.exists(os.path.join(artifact_dir, "gg-13-8-99-515-806-nb-classifier.qza")):
    !wget \
      -O $artifact_dir"gg-13-8-99-515-806-nb-classifier.qza" \
      "https://moving-pictures-tutorial.readthedocs.io/en/latest/data/moving-pictures/gg-13-8-99-515-806-nb-classifier.qza"

if not os.path.exists(os.path.join(artifact_dir, "2022.10.backbone.full-length.nb.sklearn-1.4.2.qza")):
    !wget \
      -O $artifact_dir"2022.10.backbone.full-length.nb.sklearn-1.4.2.qza" \
      "http://ftp.microbio.me/greengenes_release/2022.10/sklearn-1.4.2-compatible-nb-classifiers/2022.10.backbone.full-length.nb.sklearn-1.4.2.qza"

# We'll use the smaller classifier to save disk space
classifier = 'gg-13-8-99-515-806-nb-classifier.qza'

Next, we use the native Bayes classifier to generate the taxonomy.

In [13]:
!qiime feature-classifier classify-sklearn \
  --i-classifier $artifact_dir$classifier \
  --i-reads $artifact_dir'rep-seqs.qza' \
  --o-classification $artifact_dir'taxonomy.qza'
R[write to console]: Warning:
R[write to console]:  ‘timedatectl’ indicates the non-existent timezone name ‘n/a’

R[write to console]: Warning:
R[write to console]:  Your system is mis-configured: ‘/etc/localtime’ is not a symlink

R[write to console]: Warning:
R[write to console]:  It is strongly recommended to set envionment variable TZ to ‘Etc/UCT’ (or equivalent)

Saved FeatureData[Taxonomy] to: /home/ec2-user/sequencing_analysis/qiime_artifacts/taxonomy.qza

Finally, we can make visualizations of the taxonomy, first as a text-based table with taxa and confidence, and then as interactive bar graphs of relative abundance.

In [14]:
!qiime metadata tabulate \
  --m-input-file $artifact_dir'taxonomy.qza' \
  --o-visualization $visualization_dir'taxonomy.qzv'

!qiime taxa barplot \
  --i-table $artifact_dir'OTU_table.qza' \
  --i-taxonomy $artifact_dir'taxonomy.qza' \
  --m-metadata-file $metadata \
  --o-visualization $visualization_dir'taxa-bar-plots.qzv'
R[write to console]: Warning:
R[write to console]:  ‘timedatectl’ indicates the non-existent timezone name ‘n/a’

R[write to console]: Warning:
R[write to console]:  Your system is mis-configured: ‘/etc/localtime’ is not a symlink

R[write to console]: Warning:
R[write to console]:  It is strongly recommended to set envionment variable TZ to ‘Etc/UCT’ (or equivalent)

Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/taxonomy.qzv
R[write to console]: Warning:
R[write to console]:  ‘timedatectl’ indicates the non-existent timezone name ‘n/a’

R[write to console]: Warning:
R[write to console]:  Your system is mis-configured: ‘/etc/localtime’ is not a symlink

R[write to console]: Warning:
R[write to console]:  It is strongly recommended to set envionment variable TZ to ‘Etc/UCT’ (or equivalent)

Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/taxa-bar-plots.qzv

We can take a peek at the bar graph to see what we have.

In [15]:
Visualization.load(f'{visualization_dir}taxa-bar-plots.qzv')
Out[15]:
No description has been provided for this image

Onwards and Upwards¶

Great! Now we have a denoised OTU table and representative sequences. From here, you can filter sequences, generate phylogenetic trees, perform principle coordinate analysis, and analyze diversity. There is a lot to explore!

Any time you run a QIIME command, give a short description of what you are generating and why you chose specific parameters you did (especially for things like filtering, cutoffs, etc.). With QIIME, it's temping to just punch everything in and see what comes out, but your analysis should be mindful and well explained.

To use these file names in subsequent uses of QIIME, you can use a $ as we did above with the mapping file. For example, to make a new OTU table with only samples collected at a depth of less than three inches, do the following. (Note that I never overwrite a file from the analysis pipeline. It is imperative that you take care not to overwrite anything; always make a new file name.)

In [16]:
otu_table = os.path.join(artifact_dir, 'OTU_table.qza')
output_file = os.path.join(artifact_dir, 'OTU_table_filtered.qza')

# Generate OTU table, filtering to include only shallow samples
!qiime feature-table filter-samples \
  --i-table $otu_table \
  --m-metadata-file $metadata \
  --p-where "[shallow]='1'" \
  --o-filtered-table $output_file

# Gives us a table with each feature and its abundance
!qiime feature-table summarize \
  --i-table $output_file \
  --o-visualization $visualization_dir'OTU_table_filtered_shallow.qzv' \
  --m-sample-metadata-file $metadata
R[write to console]: Warning:
R[write to console]:  ‘timedatectl’ indicates the non-existent timezone name ‘n/a’

R[write to console]: Warning:
R[write to console]:  Your system is mis-configured: ‘/etc/localtime’ is not a symlink

R[write to console]: Warning:
R[write to console]:  It is strongly recommended to set envionment variable TZ to ‘Etc/UCT’ (or equivalent)

Saved FeatureTable[Frequency] to: /home/ec2-user/sequencing_analysis/qiime_artifacts/OTU_table_filtered.qza
R[write to console]: Warning:
R[write to console]:  ‘timedatectl’ indicates the non-existent timezone name ‘n/a’

R[write to console]: Warning:
R[write to console]:  Your system is mis-configured: ‘/etc/localtime’ is not a symlink

R[write to console]: Warning:
R[write to console]:  It is strongly recommended to set envionment variable TZ to ‘Etc/UCT’ (or equivalent)

Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/OTU_table_filtered_shallow.qzv

Let's take a look at the OTU table.

In [17]:
Visualization.load(f'{visualization_dir}OTU_table_filtered_shallow.qzv')
Out[17]:
No description has been provided for this image

Now, in the Feature Detail tab, you can see the most abundant features and how many samples they were present in. If you want to find out what these are directly from here, you can look at the representative sequences file. Let's say we are curious what the top feature is. We would search for it in the sequence table below, and can click on the sequence to BLAST it. Note you can can also download the fasta file from the representative sequences (Download your sequences as a raw FASTA file) and search through the file using a text editor or a Python package called BioSeqIO.

In [18]:
Visualization.load(f'{visualization_dir}rep-seqs.qzv')
Out[18]:
No description has been provided for this image

Happy exploring!

Computing Environment¶

In [19]:
%load_ext watermark
%watermark -v -p pandas,jupyterlab,qiime2
Python implementation: CPython
Python version       : 3.10.14
IPython version      : 8.21.0

pandas    : 2.2.2
jupyterlab: 4.4.2
qiime2    : 2025.4.0