H3ABioNet

Pan African Bioinformatics Network for H3Africa

Standard operating procedure for 16S rRNA diversity analysis

Introduction

The genes encoding the RNA component of the small subunit of ribosomes, commonly known as the 16S rRNA in bacteria and archaea, are among the most conserved across all kingdoms of life. Nevertheless, they contain regions that are less evolutionarily constrained and whose sequences are indicative of their phylogeny. Amplification of these genomic regions by PCR from an environmental sample and subsequent sequencing of a sufficiently large number of individual amplicons enables the analysis of the diversity of clades in the sample and a rough estimate of their relative abundance. The analytical process is known as “16S rDNA diversity analysis”, and is the focus of the present SOP.

The procedure and tools are only recommendations and it is up to the user to evaluate what works best for their needs.

Schematic workflow of the analysis

2 16S-rRNA-Diversity-Analysis

  

 


Definition of terms used

16S rRNA gene The gene that is responsible for the coding of the 16S ribosomal RNA. The gene is used in constructing phylogenies.
Barcodes Short nucleotide sequences added onto the ends of the DNA fragments that are to be sequenced. It allows for indexing of samples, so multiple DNA libraries can be mixed together into one sequencing lane.
Variable region 16S rRNA gene sequences contain hypervariable regions that can provide clade-specific signature sequences useful for bacterial identification.
Demultiplex This is a process of binning reads based on barcodes, primarily used to split them amongst samples.
OTU An operational taxonomic unit is an operational definition of a species or group of species often used when only DNA sequence data is available.
Rarefaction analysis Rarefaction is a process used to estimate the true diversity of a sample by extracting random subsets of sequences. The analysis estimates diversity from subsets of different sizes and extrapolates the resulting rarefaction curve to an infinite number of sequences. Rarefaction is also used to determine whether the sequencing depth achieved (number of reads per sample) is sufficient to capture the diversity within a sample. A plot is generated showing increase in number of species (or other diversity metrics) as the number of sequence reads increase. A curve that is reaching asymptote indicates that no further diversity would be expected if sequencing depth was increased.
Phred quality score or Q score Measure of sequencing accuracy. Logarithmically related to the probability that a base is called incorrectly by a sequencer. Example: a Phred score of 30 (Q30) means that the probability of an incorrect call for that base is 1 in 1000 and the base call accuracy is 99.9%. The base call accuracy for a base with Q10 is 90%, Q20 is 99%, Q30 is 99.9%, Q40 is 99.99%, and Q50 = 99.999%. 
Adapter Platform specific nucleotide sequence added to the ends of DNA molecule to facilitate sequencing e.g. in Illumina the adapter facilitates binding to the complementary target sequences mobilised on the flow cell.
Chimera PCR artefact. Chimeras are potentially formed during PCR when incompletely extended DNA fragments from different templates anneal due to closely related sequences generating recombinants between starting templates. Chimeras can greatly impact estimates of diversity (generally overestimate).
Alpha diversity Diversity within a single sample. Diversity can be characterised using the number of different species (richness), the abundance of each species (evenness), with indices that combine richness and evenness, and with divergence-based methods (phylogenetic diversity).
Beta diversity Comparison of diversity between samples.
Unifrac Beta diversity distance metric based on the phylogenetic distance between the members of communities/samples. Unifrac captures the total amount of evolution that is unique to each sample.

Phase 1: Preprocessing of reads

It is essential to preprocess raw reads before subjecting them for downstream analysis. The preprocessing includes the removal of low quality bases, ambiguous bases and adapter sequences, the stitching together of paired reads, and the detection of chimeric reads. Sequencing errors, reads with ambiguous bases and chimeras can all cause the appearance of spurious OTUs if they are not removed.

Input: raw reads (multiplexed or demultiplexed)

Output: high quality reads ready for OTU picking

QC plots and stats

The first step in the data preprocessing is to check the quality of bases in all the reads. Once we understand the quality spectrum of the reads, we can decide on the parameters for trimming low quality bases. If the raw reads are not demultiplexed, demultiplexing should be performed before proceeding to the next step. Illumina’s CASAVA or QIIME tools can perform the demultiplexing task.

Software: FASTQC, PRINSEQ, SolexaQA

Trim and Filter reads

At the 3’ end of reads there are often adaptor sequences left from library preparation. These adaptor bases need to be removed, and low quality bases need to be trimmed off. Any of the indicated programs can be used for this. Bokulich et al. 2013 recommend a minimum phred quality score of 3 to trim low quality bases at the ends of the reads. Jeraldo et al. 2014 (in review) recommend trimming the 3’ end of the reads with a moving average score of 15, with a window size of 4 bases and to removal of any reads shorter than 75% of the original read length. It is also recommended that reads containing ambiguous bases (N) be discarded.

Software: Trimmomatic, PRINSEQ, SolexaQA

Paired read stitching

When the combined length of reads sequenced from both ends of DNA fragments is longer than the size of the fragment, there is an overlap between the paired reads. The read pairs can be stitched together based on the overlap information, thus generating a single sequence. During the read stitching process, higher quality bases can be selected thus improving the quality of stitched reads. PANDASeq does not do well when the overlap is almost the entire read. PEAR works well for all lengths of overlaps between the paired reads.

Software: PEAR, PANDASeq, FLASH, UPARSE merge

Chimera detection

Chimeras are artifacts of PCR. These are formed during PCR cycles by the joining of two or more different parent DNA templates. If these chimeras are not removed they may be recognized as novel sequences during the alignment process and therefore mislead interpretation. At present there are no tools that can remove chimeras completely without throwing away non-chimeric sequences. Of the various tools available, UCHIME was found to perform better than ChimeraSlayer, which was the best program to detect chimeras before UCHIME was developed (Edgar et al. 2011). To remove chimeras from 454 sequences Perseus can also be used (Quince et al. 2011).

Software: UCHIME, ChimeraSlayer, Perseus


Phase 2: OTU picking, classification and phylogenetic tree generation

During this phase reads are processed so that comparisons between samples can be made. The first step is to cluster reads based on similarity into OTUs and to select a representative sequence for each OTU. Each OTU is then classified by comparison to a reference database and a phylogeny inference is made based on sequence alignment and the construction of a phylogenetic tree.

Input: high quality reads

Output: OTUs, representative sequences, OTU table with classification and abundance of each OTU, heatmap, sequence alignment and phylogenetic tree

OTU picking

OTU picking is the clustering of the preprocessed reads into OTUs. The clusters are formed based on sequence identity. The identity threshold can be defined by the user. Sequences that are more than 97% identical are conventionally assumed to be derived from the same bacterial species/OTU. Other identity percentages can be used, depending on the granularity of the desired clusters and the known divergence in 16S sequences of the OTUs of interest. Three approaches for OTU picking exist. 1) de novo OTU picking groups sequences based on levels of pairwise sequence identity; 2) closed reference OTU picking aligns and groups sequences relative to a reference database, and sequences that are not >97% identical to a known reference are discarded 3) open-reference OTU picking starts with alignment to a reference database, but if the read does not match a known sequence it is not discarded but sent for de novo OTU picking. After the sequences have been clustered into OTUs and counted to estimate OTU abundance, a representative sequence is picked for each OTU. Each OTU is therefore represented by a single sequence and this will speed up downstream analysis. There are multiple choices to select a representative sequence. It can be the first sequence, the longest sequence, the seed sequence used in OTU picking, the most abundant sequence or a random sequence.

Software: UPARSE, QIIME

Classification

Here a taxonomic identity is assigned to each representative sequence. The taxonomies are pulled from a reference set. There are three main reference databases with aligned, validated and annotated 16S rRNA genes: GreenGenes, Ribosomal Database Project (RDP) and Silva. Each of these databases has strengths and weaknesses that need to be taken into consideration, and all are in common use. Several methods for assigning taxonomy against these reference databases exist, including UCLUST, the RDP classifier and RTAX.

Databases: SILVA, GreenGenes, RDP

Software: UCLUST, the RDP classifier, RTAX

Alignment

To understand the evolutionary relationships between the sequences in the sample and to perform a diversity analysis, it is necessary to generate a phylogenetic tree of the OTUs. The first step in generating the tree is to generate a multiple alignment of the representative OTU sequences. PyNAST aligns the sequences to a template alignment of reference 16S sequences. Infernal makes use of a Hidden Markov Model that also incorporates secondary structure information.

Software: PyNAST, INFERNAL

Create phylogenetic tree

The phylogenetic tree represents the relationship between the sequences in terms of the evolutionary distance from a common ancestor. In downstream analysis this tree is used for example in calculating the UniFrac distances.

Software: FastTree

An alternate option to most of the steps mentioned in phase 1 and phase 2 is to run IM-TORNADO (Jeraldo et al. 2014 in review). IM-TORNADO is an integrated pipeline that takes demultiplexed reads and trims low quality bases, does paired read stitching, removes chimeras and generates OTU table, phylogenetic tree and assigns taxonomy. Unique feature of IM-TORNADO is that it can analyze paired reads that do not overlap. Non-overlapping paired reads are typically generated from non-overlapping variable regions of 16S rRNA. Such studies try to utilize information in two variable regions instead of one variable region as in any standard 16S rRNA study to define OTUs.


Phase 3: Measure diversity and other statistical analysis

OTU information (number of OTUs, abundance of OTUs) and the phylogenetic tree generated from the phase 2 is utilized to estimate diversity within and between samples. Additional statistical analysis to test the significance of the diversities can also be done.

Input: classified OTU table with abundance, phylogenetic tree and sample metadata

Output: alpha and beta diversity metrics, distance matrix, results from statistical tests, rarefaction plots, PCoA plots, heatmaps

Determine alpha diversity

Alpha diversity is a measure of diversity within a sample. It gives an indication of richness and/or evenness of species present in a sample. The accuracy of the measured diversity is mainly affected by the sequencing depth (number of reads per sample). Sequencing depth must be high enough to capture the true diversity within a sample. Samples with higher number of reads would show higher diversity than samples with lower number of reads. Rarefaction analysis is therefore required to understand the actual diversity within a sample and to determine if your sequencing effort is sufficient and if the total diversity within the sample has been captured. Mothur as well as QIIME have tools to generate multiple rarefactions and then measure alpha diversity on the rarefied OTU tables. Several popular alpha diversity measures are available both in Mothur and QIIME: Shannon index, chao1, observed species, and phylogenetic diversity whole tree.

Software: mothur, QIIME

Determine beta diversity

Beta diversity is a measure of diversity between samples. One of the most commonly used metrics is the Unifrac distance that compares samples using phylogenetic information. An all-by all or pairwise matrix of the beta diversity metrics between all the samples in the study is generated and can be visualized in different ways such as a tree, graph, network etc. Mothur and QIIME have several tools to generate distance metrics, phylogenetic trees and PCoA plots.

Software: UNIFRAC for distance metrics, mothur, QIIME

Other statistical analysis

Additional statistical tests between samples or groups of samples can be done in QIIME. For alpha diversity a parametric or non-parametric t-test can be performed on a rarefied number of sequences. For beta diversity the Mantel, partial Mantel and Mantel correlogram matrix correlation can be used to compare distance matrices. Multivariate analyses are also available for testing significance between the distance matrix and other factors. The statistical methods available are: adonis, ANOSIM, BEST, Moran’s I, MRPP, PERMANOVA, PERMDISP, and db-RDA. Native methods in R and other R packages such as phyloseq and ade4 can also be considered for these types of analyses.

Software: QIIME, R packages (phyloseq, ade4)


References

  • Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, et al. (2013) Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods 10: 57–59. doi: 10.1038/nmeth.2276
  • Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R (2011) UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27: 2194-2200. doi: 10.1093/bioinformatics/btr381
  • Quince C, Lanzen A, Curtis TP, Davenport RJ, Hall N, Head IM, Read LF, Sloan WT: Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods 2009, 6:639-641.

Appendix


Practise Datasets

The input datasets, metadata and a set of questions to guide tha analysis can be accessed here.


Resource Requirements

Resource requirements for running the 16S rRNA diversity analysis SOP
Data set size Step Minimum hardware requirements Scalability Approx. time frame Biggest bottleneck Notes
13GB (.fq, 150 samples, 11 milion paired reads)
Quality control, read merging Nr cores = 1
RAM = 4GB
Storage = Size of input X 3
- Run time proporsional to size (samples/coverage)
- Required storage increases proportional to size (samples/coverage)
Less than an hour per run
- QC with FastQC
- Trimming with Trimmomatic
- Merging with PEAR
The QC process is most efficient if it can be parallised.
- It might be necessary to run analysis pipeline several times using different quality trimming/filtering approaches, different algorithms and algorithm settings and using different databases.
de novo OTU picking (UPARSE), classification, alignment, phylogenetic tree generation Nr cores = 1
RAM = 8GB
Storage = Size of input x 1
- Run time proporsional to size (samples/coverage)
- Required storage increases proportional to size (samples/coverage)
Less than two/three hours per run
- De novo OTU picking and chimera detection with UPARSE
- Classification with UCLUST
- Alignment with PyNAST
- Phylogenetic tree generation with FastTree
De novo OTU picking runs in serial
Further statistical analysis e.g. alpha and beta diverstity Nr cores = 1
RAM = 8GB
Storage = Size of input x 1
- Run time proporsional to size (samples/coverage)
- Required storage increases proportional to size (samples/coverage)
Less than an hour per run
- summarize_taxa.py, alpha_rarefaction.py, beta_diversity_through_plots.py, make_distance_histograms.py, make_emperor.py with QIIME
No resource bottleneck

 

List of H3ABioNet servers with suffcient resources for doing the 16S rRNA diversity analysis SOP
Option Model Cores RAM Disk capacity Sufficient for this analysis H3ABioNet nodes with a server
Option A Dell C6145 64 256GB 10TB Yes RUBi, CPGR, ICIPE, IPT, CBIO
Option C Dell C6145 24 128GB 10TB Yes MUHAS, NMIMR, KCCR, UDSM
Uganda HP 1 HP DL360e 16 64GB 3.6TB Yes UVRI
Uganda HP 2 HP DL385p 16 128GB 6TB Yes UVRI
Tanzania Dell 1 Dell 8 144GB 12TB Yes MDH