(c) 2022 Liana Merk, Gil Sharon, and Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with support financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
import os
import fnmatch
import pandas as pd
import numpy as np
from qiime2 import Visualization
In the cell below, you should specify the specifics of your analysis. After that, you will mostly just want to run the cells through to the bottom of this notebook. At the end, you will have QIIME artifacts readily available for analysis.
# Put the full path to your home working directory
home_dir = "/home/ec2-user/sequencing_analysis/"
# Subdirectory in home_dir containing all of the reads (probably 'soil')
data_dir = "soil/"
# Whether or not to do pre-processing (this takes a looong time)
pre_process = True
QIIME2 has two object types, artifacts (with suffix .qza
), which are used to store and access data, and visualizations (with suffix .qzv
), which is are used, you guessed it, to visualize the results. We will create two directories to store these files.
# Full path to the directory with reads
data_dir = os.path.join(home_dir, data_dir)
# These are artifact and visualization directories located within the home directory.
artifact_dir = os.path.join(home_dir, 'qiime_artifacts/')
visualization_dir = os.path.join(home_dir, 'qiime_visualizations/')
# If these do not exist, make them
if not os.path.exists(artifact_dir):
os.makedirs(artifact_dir)
if not os.path.exists(visualization_dir):
os.makedirs(visualization_dir)
In this experiment, we are analyzing the microbial content of soil samples. We extracted the DNA using a Qiagen PowerSoil kit. We amplified the V4-V5 region of 16S, attaching Illumina adapters for a subsequent MiSeq run. The reads were demultiplexed by Laragen, our sequencing provider, and are stored in the soil/
directory. Here, we will use QIIME2 as our analytics platform. We will denoise using DADA2 and trim using the primer length. We will then perform taxonomic classification. Using these results, we will delve into some of the highest colonizing populations. Further, we seek to understand how the microbial content of soil varies with the local environment form which they were collected.
First, QIIME2 needs to know which sequencing files came from which samples. In our experiment, we performed paired end sequencing, where we sequenced both ends of the amplified V4 region of 16S. We will search through the directory of sequnces, which is stored in data_dir
, and gather the prefixes. Then, we will add two columns, one which records the forward path and one with the reverse read path.
if pre_process:
with open(os.path.join(home_dir, 'manifest.txt'), 'w') as f:
# Add formatting rows
f.write('sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n')
# list all files and write the forward reads into the manifest file
list_of_files = os.listdir(data_dir)
forward = '*L001_R1_001.fastq.gz'
for file in list_of_files:
if fnmatch.fnmatch(file, forward):
line = str(file.split("_")[0]
+ f'\t{data_dir}'
+ str(file.split("L001")[0])
+ 'L001_R1_001.fastq.gz'
+ f'\t{data_dir}'
+ str(file.split("L001")[0])
+ 'L001_R2_001.fastq.gz\n')
f.write(line)
pd.set_option('display.max_colwidth', None)
pd.read_csv(os.path.join(home_dir, 'manifest.txt'), sep='\t')
Now that we have the manifest, we and use QIIME2 to import the sequences and store them as a QIIME2 artifact. QIIME2 is a command line tool, so we need to run it from the command line of the instance. However, we do not need to directly do this from the terminal, and we should in fact not do it from the terminal. Rather, we can make repeated command line executions from the Jupyter notebook. A command that is preceded with a !
(pronounced "bang") asks Jupyter to run what follows on the command line. We use QIIME2
's import
tool.
if pre_process:
!qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path $home_dir'manifest.txt' \
--output-path $artifact_dir'pair-end-demux.qza' \
--input-format PairedEndFastqManifestPhred33V2
Imported /home/ec2-user/sequencing_analysis/manifest.txt as PairedEndFastqManifestPhred33V2 to /home/ec2-user/sequencing_analysis/qiime_artifacts/pair-end-demux.qza
Now we can create a visualization of the demultiplexed samples to investigate sequencing quality.
# Next, create a visualization of demuliplexed samples with quality
if pre_process:
!qiime demux summarize \
--i-data $artifact_dir'pair-end-demux.qza' \
--o-visualization $visualization_dir'pair-end-demux.qzv'
Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/pair-end-demux.qzv
To look at the visualization, we use QIIME2's Visualization.load()
function. A quick and easy way to see the visualizations is to drag and drop files to QIIME's convenient web-based viewer.
Visualization.load(os.path.join(visualization_dir, 'pair-end-demux.qzv'))
We see the quality is quite good until the mid-200's. We decide to truncate at 250 for the forward and 200 for the reverse. Note we do not need to trim, since the reads we got already have the primers trimmed, thanks to Laragen.
# Use Dada2 to denoise the sample. This removes spurious reads that are more likely to be sequencing errors than novel colonies
if pre_process:
!qiime dada2 denoise-paired \
--i-demultiplexed-seqs $artifact_dir'pair-end-demux.qza' \
--p-trunc-len-f 250 \
--p-trunc-len-r 200 \
--p-n-threads 4 \
--o-representative-sequences $artifact_dir'rep-seqs.qza' \
--o-table $artifact_dir'OTU_table.qza' \
--o-denoising-stats $artifact_dir'stats-dada2.qza'
Saved FeatureTable[Frequency] to: /home/ec2-user/sequencing_analysis/qiime_artifacts/OTU_table.qza Saved FeatureData[Sequence] to: /home/ec2-user/sequencing_analysis/qiime_artifacts/rep-seqs.qza Saved SampleData[DADA2Stats] to: /home/ec2-user/sequencing_analysis/qiime_artifacts/stats-dada2.qza
And we can again make a visualization.
if pre_process:
!qiime metadata tabulate \
--m-input-file $artifact_dir'stats-dada2.qza' \
--o-visualization $visualization_dir'stats-dada2.qzv'
Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/stats-dada2.qzv
Each sample has its own brewing conditions, and we need to feed this information in such that qiime can associate them to their given sample. This information was created on google sheets and then validated to be in the correct format using the Keemei tool.
metadata = os.path.join(home_dir, 'metadata.tsv')
df = pd.read_csv(metadata, sep='\t')
# Display dataframe
df
sample-id | team | location | estimated-depth (cm) | pH | |
---|---|---|---|---|---|
0 | JPJL-s | Jason-Juni | flowers outside Gates-Thomas Laboratory | 7.0 | 5.5 |
1 | EKSK-s | Sulekha-Eli | Olive tree between Linde and Gates | 20.0 | 4.5 |
2 | MDSS-s | Marama-Sydney | A tree root outside of red door | 6.3 | 5.5 |
3 | ATMT-s | Matthew-Andrea | From the large tree outsite Venerable | 7.0 | 5.5 |
4 | TLSP-s | Sophie-Trinity | from the orange trees near Dabney | 6.0 | 4.0 |
5 | SSKV-s | Kodie-Sam | turtle pond | 6.0 | 5.5 |
6 | APSC-s | Alex-Sophie | roses by Kerckhoff | 16.0 | 4.5 |
7 | VJMJY-s | Vanessa-Jen | Big tree outside Beckman | 11.0 | 5.0 |
8 | SDOP-s | Sophie_D-Oliver | Plants behind Marks/Braun | 10.0 | 5.0 |
9 | HTSW-s | Sophia_Wu-Haruna | South side of Kerchoff, near the tree | 6.0 | 5.0 |
10 | PSZY-s | Pranay-Zitian | Plants behind Chen near the bridge | 20.0 | 4.5 |
11 | MDLCP-s | Maya-Cristian | Flower bush next to Bechtel and parking lot | 15.0 | 5.5 |
12 | RJ-s | Rashi | Special secret location | 10.0 | 5.0 |
Now let's generate an OTU table, a table of operational taxonomic units. We will go ahead and make a visualization in the same code cell.
# Gives us a table with each feature and its abundance
!qiime feature-table summarize \
--i-table $artifact_dir'OTU_table.qza' \
--o-visualization $visualization_dir'OTU_table.qzv' \
--m-sample-metadata-file $metadata
# Relates the features to the sequences
!qiime feature-table tabulate-seqs \
--i-data $artifact_dir'rep-seqs.qza' \
--o-visualization $visualization_dir'rep-seqs.qzv'
Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/OTU_table.qzv Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/rep-seqs.qzv
There are several ways to do taxonomic analysis. We will present two methods, a Native Bayes classifier using the Green Genes reference, and a fragment insertion method.
We can then use a GreenGenes reference taxonomy to make our taxonometric assigment. Note there is no classifier for 515F/926R primers, so we can test using the 515F/806R classifiers. First, we need to fetch the classifier.
if not os.path.exists(os.path.join(artifact_dir, "gg-13-8-99-515-806-nb-classifier.qza")):
!wget \
-O $artifact_dir"gg-13-8-99-515-806-nb-classifier.qza" \
"https://data.qiime2.org/2022.2/common/gg-13-8-99-515-806-nb-classifier.qza"
Next, we use the native Bayes classifier to generate the taxonomy.
!qiime feature-classifier classify-sklearn \
--i-classifier $artifact_dir'gg-13-8-99-515-806-nb-classifier.qza' \
--i-reads $artifact_dir'rep-seqs.qza' \
--o-classification $artifact_dir'taxonomy.qza'
Saved FeatureData[Taxonomy] to: /home/ec2-user/sequencing_analysis/qiime_artifacts/taxonomy.qza
Finally, we can make visualizations of the taxonomy, first as a text-based table with taxa and confidence, and then as interactive bar graphs of relative abundance.
!qiime metadata tabulate \
--m-input-file $artifact_dir'taxonomy.qza' \
--o-visualization $visualization_dir'taxonomy.qzv'
!qiime taxa barplot \
--i-table $artifact_dir'OTU_table.qza' \
--i-taxonomy $artifact_dir'taxonomy.qza' \
--m-metadata-file $metadata \
--o-visualization $visualization_dir'taxa-bar-plots.qzv'
Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/taxonomy.qzv Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/taxa-bar-plots.qzv
We can take a peak at the bar graph to see what we have.
Visualization.load(f'{visualization_dir}taxa-bar-plots.qzv')
We will also try a fragment insertion method (SEPP) implemented in QIIME2. The same method will also help up assign taxonomy to the reads. If you want to read more, see this paper from Rob Knight's group. As before, we need to fetch the classifier.
if not os.path.exists(os.path.join(artifact_dir, "sepp-refs-gg-13-8.qza")):
!wget \
-O $artifact_dir"sepp-refs-gg-13-8.qza" \
"https://data.qiime2.org/2019.10/common/sepp-refs-gg-13-8.qza"
To do the classification, the calculation is rather involved. It will probably take an hour or more to run, so we will not do it here. The command to do it is:
!qiime fragment-insertion sepp \
--i-representative-sequences $artifact_dir'rep-seqs.qza' \
--i-reference-database $artifact_dir'sepp-refs-gg-13-8.qza' \
--o-tree $artifact_dir'insertion-tree.qza' \
--o-placements $artifact_dir'insertion-placements.qza'
Great! Now we have a denoised OTU table and representative sequences. From here, you can filter sequences, generate phylogenetic trees, perform principle coordinate analysis, and analyze diversity. There is a lot to explore!
Any time you run a QIIME command, give a short description of what you are generating and why you chose specific parameters you did (especially for things like filtering, cutoffs, etc.). With QIIME, it's temping to just punch everything in and see what comes out, but your analysis should be mindful and well explained.
To use these file names in subsequent uses of QIIME, you can use a $
as we did above with the mapping file. For example, to make a new OTU table with only sample collected near trees, do the following. (Note that I never overwrite a file from the analysis pipeline. It is imperative that you take care not to overwrite anything; always make a new file name.)
otu_table = os.path.join(artifact_dir, 'OTU_table.qza')
rep_seq = os.path.join(artifact_dir, 'rep-seqs.qza')
output_file = os.path.join(artifact_dir, 'OTU_table_filtered_trees.qza')
# Update metadata file with trees
metadata_with_trees = os.path.join(home_dir, "metadata_with_trees.tsv")
df["tree"] = df["location"].str.contains("tree").astype(int)
df.to_csv(metadata_with_trees, sep='\t', index=False)
# Generate OTU table, filtering to include only trees
!qiime feature-table filter-samples \
--i-table $otu_table \
--m-metadata-file $metadata_with_trees \
--p-where "[tree]='1'" \
--o-filtered-table $output_file
# Gives us a table with each feature and its abundance
!qiime feature-table summarize \
--i-table $output_file \
--o-visualization $visualization_dir'OTU_table_filtered_trees.qzv' \
--m-sample-metadata-file $metadata_with_trees
Saved FeatureTable[Frequency] to: /home/ec2-user/sequencing_analysis/qiime_artifacts/OTU_table_filtered_trees.qza Saved Visualization to: /home/ec2-user/sequencing_analysis/qiime_visualizations/OTU_table_filtered_trees.qzv
Let's take a look at the OTU table.
Visualization.load(f'{visualization_dir}OTU_table_filtered_trees.qzv')
Now, in the Feature Detail
tab, you can see the most abundant features and how many samples they were present in. If you want to find out what these are directly from here, you can look at the representative sequences file. Let's say we are curious what the top feature is. We would search for it in the sequence table below, and can click on the sequence to BLAST it. Note you can can also download the fasta file from the representative sequences (Download your sequences as a raw FASTA file
) and search through the file using a text editor or a Python package called BioSeqIO.
Visualization.load(f'{visualization_dir}rep-seqs.qzv')
Happy Exploring!
%load_ext watermark
%watermark -v -p pandas,jupyterlab,qiime2
Python implementation: CPython Python version : 3.8.13 IPython version : 8.2.0 pandas : 1.2.5 jupyterlab: 3.3.2 qiime2 : 2022.2.0