H3ABioNet

Pan African Bioinformatics Network for H3Africa

SOP for NGS variant calling in human whole exomes

This Standard Operating Procedure describes the technical steps for variant calling in exome data as part of the H3ABioNet Node Accreditation process. The workflow schematic below gives an overview of the steps.

This SOP was created by Dr. Liudmila Sergeevna Mainzer and Prof. C. Victor Jongeneel of the HPCBio facility at the University of Illinois at Urbana–Champaign, and all copyright resides with them.


NGS Exome Variant Calling Workflow Schematic

[Figure: NGS-Exome-Variant-Calling workflow schematic]

Introduction

This SOP briefly outlines the essential steps (see the Workflow Schematic) in the process of making genetic variant calls, and recommends tools that have gained community acceptance for this purpose. It is assumed that the data consist of samples from a single tissue (not tissue comparison studies). Recommended coverage for acceptable quality of calls in a research setting is around 30x-50x for whole genome and 70x-100x for exome sequencing, but lower coverage is discussed as well.

The procedures outlined below are recommendations to the H3ABioNet groups planning to do variant calling on human genome data, and are not meant to be prescriptive. Our goal is to help the groups set up their procedures and workflows, and to provide an overview of the main steps involved and the tools that can be used to implement them.

Important note

The Genome Analysis Toolkit (GATK) distributed by the Broad Institute of Harvard and MIT (see http://www.broadinstitute.org/gatk/) is a commonly used framework and toolbox for many of the tasks described below. The licensing model for GATK has recently changed to a mixed closed/open source model, with provisions that any organization operating on a fee-for-service model may be required to acquire a commercial license from a GATK distributor (Appistry in the US). This may or may not apply to the organizations seeking accreditation from the H3ABioNet network. While we recommend GATK tools for many of the tasks, we also try to provide alternatives for those organizations that cannot or do not wish to acquire GATK licenses.


Definition of Terms used


The table below (partly borrowed from GATK, 2013) provides definitions of the basic terms used throughout this SOP:

Adapters

Short nucleotide sequences added onto the ends of the DNA fragments that are to be sequenced (Illumina whitepaper 2010; Rodriguez-Ezpeleta et al. 2012, Chapter 2; Tufts 2011).

Functions:

  1. Permit binding to the flow cell;
  2. Allow for PCR enrichment of adaptor-ligated DNA only;
  3. Allow for indexing or “barcoding” of samples, so multiple DNA libraries can be mixed together into one sequencing lane.
Genetic variant

  1. Single nucleotide polymorphism (SNP)
  2. Small (<10 nt) insertions/deletions (indels)
  3. Copy number variations (CNVs; outside the scope of this SOP)
Lane

The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.

Library

A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes. Also see Tufts (2013).

NGS

Next-generation sequencing.
Sample

A single individual, such as the human CEPH sample NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source.

Here we treat samples as independent individuals whose genome sequence we are attempting to determine. From this perspective, tumor / normal samples are different despite coming from the same individual.

SNP

Single nucleotide polymorphism. Types:

  • In a non-coding region
  • In a coding region
    • Synonymous
    • Nonsynonymous
      1. missense
      2. nonsense


A recent publication by Pabinger et al. (2013) provides a good discussion of the common tools and approaches for variant calling. Also see the older Nielsen et al. (2011).

Phase I: Preprocessing of the raw reads

The following steps prepare reads for analysis and must be performed in sequence.

Step 1.1: Adaptor trimming

Sequencing facilities usually produce read files in FASTQ format (Fastq 2013), which contain a base sequence and a quality score for each base in a read. Usually the adaptor sequences have already been removed from the reads, but sometimes bits of adapters are left behind, anywhere from 20% to 90% of the adaptor length. These remnants need to be removed from the reads. This can be done with your own script based on a sliding-window algorithm, or with one of a number of tools: FASTX-Toolkit (fastx_clipper), Bioconductor (ShortRead package), Flexbar (Dodt et al. 2012), as well as several tools listed on BioScholar (http://bioscholar.com/genomics/tools-remove-adapter-sequences-next-generation-sequencing-data/).

Selection of the tool to use depends on the amount of adaptor sequence leftover in the data. This can be assessed manually by grepping for parts of known adaptor sequences on the command line.
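As an illustration, such a check followed by clipping with fastx_clipper might look like the sketch below (a minimal example: the adaptor fragment shown is the common Illumina TruSeq adaptor prefix, file names are placeholders, and -Q33 assumes Sanger/Illumina 1.8+ quality encoding):

    # Rough count of reads containing the first 13 bases of the adaptor
    grep -c "AGATCGGAAGAGC" sample.fastq

    # Clip the adaptor with FASTX-Toolkit; -l 20 discards reads shorter than
    # 20 bases after clipping, -n keeps reads that contain N bases
    fastx_clipper -Q33 -a AGATCGGAAGAGC -l 20 -n -v -i sample.fastq -o sample.clipped.fastq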

Step 1.2: Quality trimming

Once the adaptors have been trimmed, it is useful to inspect reads in bulk. Frequently the quality tends to drop off toward one end of the read; PrinSeq will show that very nicely (Schmieder and Edwards 2011). These low-quality read ends can then be trimmed, if desired, using FASTX-Toolkit (fastq_quality_trimmer), PrinSeq, or SolexaQA (Cox et al. 2010).
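A minimal sketch with fastq_quality_trimmer (the thresholds are illustrative and should be tuned to your data; -Q33 again assumes Sanger-encoded qualities):

    # Trim bases with quality below 20 from the read ends, then discard
    # any read shorter than 20 bases after trimming
    fastq_quality_trimmer -Q33 -t 20 -l 20 -i sample.clipped.fastq -o sample.trimmed.fastq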

Step 1.3: Removal of very short reads

Once the adaptor remnants and low-quality ends have been trimmed, some reads may end up being very short (e.g. <20 bases). These short reads are likely to align to multiple (wrong) locations on the reference, introducing noise into the variant calls. They can be removed using PrinSeq or a simple in-house Perl script. This step goes faster if de-duplication is performed beforehand. The minimum acceptable read length should be chosen based on the length of the sequencing fragment: longer for longer fragments, shorter for shorter ones; it is a matter of some experimentation with the data.

The three preprocessing steps above can be parallelized by chunking the initial fastq file (hundreds of millions of reads, up to 50 GB of hard disk space per file) into several files that can be processed simultaneously. The results can then be combined.
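With PrinSeq, the length filter for this step might look like the following sketch (the 20-base cutoff is an example, and file names are placeholders):

    # Keep only reads of at least 20 bases; -out_bad null discards the rest
    prinseq-lite.pl -fastq sample.trimmed.fastq -min_len 20 -out_good sample.filtered -out_bad null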

Step 1.4: De-duplication

The presence of duplicate reads in a sequencing project is a notorious problem. The causes are discussed in a blog post by Eric Vallabh Minikel (2012). Duplicately sequenced molecules should not be counted as additional evidence for or against a putative variant, and must be removed prior to the analysis. A number of tools can help with that: CLC Bio provides a plugin (CLCBio 2013), Picard provides MarkDuplicates, and FASTX-Toolkit has fastx_collapser for this purpose. De-duplication can also be performed by a simple in-house Perl script.
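For the Picard route, a hedged sketch follows (note that MarkDuplicates operates on an aligned, coordinate-sorted BAM, so in practice it runs after the alignment step of Phase II; newer Picard releases bundle all tools into a single picard.jar, while older releases shipped a separate MarkDuplicates.jar):

    # Mark and remove duplicate read pairs; the metrics file reports the
    # duplication rate, which is worth recording for QC
    java -jar picard.jar MarkDuplicates \
        INPUT=sample.sorted.bam \
        OUTPUT=sample.dedup.bam \
        METRICS_FILE=sample.dup_metrics.txt \
        REMOVE_DUPLICATES=true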

Steps 1.3 and 1.4 reduce the size of the dataset to be used in the next steps, which will finish faster as a result.


Phase II: Initial variant discovery

Analysis proceeds through the following sequential steps.

Step 2.1: Alignment

Reads need to be aligned to the reference genome in order to identify the similar and polymorphic regions in the sample. Currently hg19 is the standard reference. A number of aligners can perform this task; they are reviewed in Li and Homer (2010). BWA (Li and Durbin 2009) and Bowtie 2 (Langmead and Salzberg 2012) have become the trusted tools for Illumina data, because they are accurate, fast, well supported, and open source. Performance comparison data for these two aligners can be found at SEQanswers (http://seqanswers.com/forums/showthread.php?t=15200) and SourceForge (http://lh3lh3.users.sourceforge.net/alnROC.shtml). Longer reads (>1000 bp) can be aligned well with the PacBio BLASR aligner (Chaisson and Tesler 2012) or BLAT (BlatFAQ 2013).

The output file is usually in the binary BAM format, still taking tens of gigabytes of hard disk space (http://genome.ucsc.edu/goldenPath/help/bam.html). This step tends to be I/O intensive, so it is useful to place the reference onto an SSD, as opposed to an HDD, to speed up the process. The alignment can be easily parallelized by chunking the data into subsets of reads and aligning each subset independently, then combining the results.
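A hedged sketch of a paired-end BWA run (the aln/sampe commands match the older BWA workflow referenced in the resource tables below; file names are placeholders, and the samtools sort syntax shown is the older 0.1.x form, where newer versions use samtools sort -o):

    # Index the reference once; -a bwtsw is required for mammalian-sized genomes
    bwa index -a bwtsw hg19.fa

    # Align each end of the pair (-t sets the number of threads)
    bwa aln -t 4 hg19.fa sample_R1.fastq > sample_R1.sai
    bwa aln -t 4 hg19.fa sample_R2.fastq > sample_R2.sai

    # Pair the alignments and produce a sorted, indexed BAM
    bwa sampe hg19.fa sample_R1.sai sample_R2.sai sample_R1.fastq sample_R2.fastq \
        | samtools view -bS - \
        | samtools sort - sample.sorted
    samtools index sample.sorted.bam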

Step 2.2: Artifact removal (local realignment around indels)

Here is a quote from the GATK document on best practices in variant detection (GATK 2013): “Reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs. We look for the most consistent placement of the reads with respect to the indel in order to clean up these artifacts. <...skip...> There are two types of realignment, which constitute a vastly different amount of processing power required:

  • Realignment only at known sites, which is very efficient, can operate with little coverage (1x per lane genome wide) but can only realign reads at known indels.
  • Fully local realignment uses mismatching bases to determine if a site should be realigned, and relies on sufficient coverage to discover the correct indel allele in the reads for alignment. It is much slower (involves SW step) but can discover new indel sites in the reads. If you have a database of known indels (for human, this database is extensive) then at this stage you would also include these indels during realignment, which vastly improves sensitivity, specificity, and speed.”

Realignment can be accomplished using the GATK realignment walkers (IndelRealigner 2013), as sketched below. Alternatives include Dindel (for Illumina only) and SRMA (for Illumina and ABI SOLiD).
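A minimal sketch of the two-stage GATK realignment (GATK 2.x/3.x command line; the known-indels file name follows the GATK resource bundle and is a placeholder):

    # Stage 1: find intervals that are candidates for realignment
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
        -R hg19.fa -I sample.dedup.bam \
        -known Mills_and_1000G_gold_standard.indels.hg19.vcf \
        -o realign.intervals

    # Stage 2: realign reads over those intervals only
    java -jar GenomeAnalysisTK.jar -T IndelRealigner \
        -R hg19.fa -I sample.dedup.bam \
        -targetIntervals realign.intervals \
        -known Mills_and_1000G_gold_standard.indels.hg19.vcf \
        -o sample.realigned.bam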

Step 2.3: Base quality score recalibration

Here is another excerpt from the GATK document on best practices in variant detection (GATK 2013): “The per-base estimate of error known as the base quality score is the foundation upon which all statistical calling algorithms are based. We've found that the estimates provided by the sequencing machines are often inaccurate, and worse, biased. Through recalibration an empirically accurate error model is assigned to the bases to create an analysis-ready bam file.” The problem is also demonstrated in a Nature publication (DePristo et al. 2011). The quality score recalibration can be performed using GATK BQSR (the BaseRecalibrator walker). Bioconductor's ReQON is an alternative tool (Cabanski et al. 2012).
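A hedged sketch of BQSR with a GATK 2.x/3.x command line (the dbSNP file name follows the GATK resource bundle and is a placeholder):

    # Model systematic error from mismatches at sites not in dbSNP
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R hg19.fa -I sample.realigned.bam \
        -knownSites dbsnp_137.hg19.vcf \
        -o recal.table

    # Write a new BAM with recalibrated base quality scores
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R hg19.fa -I sample.realigned.bam \
        -BQSR recal.table \
        -o sample.recal.bam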

Step 2.4: Calling the variants

There is no single “best” approach to capture all the genetic variations. Pabinger et al. (2013) suggest using a consensus of results from three tools:

  1. CRISP (Bansal 2010),
  2. HaplotypeCaller or UnifiedGenotyper from GATK, and
  3. mpileup from SAMtools (a tutorial is available on the ANGUS site, Michigan State University: http://ged.msu.edu/angus/tutorials-2012/snp_tutorial.html).

Only those variants that have a high confidence score should be kept:

  • Minimum Q30 for deep coverage data (>10x per sample) and
  • Minimum Q4 (if <=100 samples) or Q10 (if >100 samples) for shallower coverage.

The variant calls are usually produced in the form of VCF files (10-20 MB of hard disk space per file), with a diff-type format: http://www.1000genomes.org/wiki/Analysis/vcf4.0
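As an illustration, hedged sketches for two of the three callers (GATK 2.x/3.x and the older samtools/bcftools 0.1.x syntax; the Q30 threshold echoes the deep-coverage recommendation above, and file names are placeholders):

    # GATK UnifiedGenotyper: call SNPs and indels, emitting calls at Q30 or above
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R hg19.fa -I sample.recal.bam \
        --dbsnp dbsnp_137.hg19.vcf \
        -glm BOTH -stand_call_conf 30 \
        -o sample.gatk.vcf

    # SAMtools route: genotype likelihoods from mpileup, calls from bcftools
    # (newer samtools/bcftools releases replace 'view -vcg' with 'call -mv')
    samtools mpileup -ugf hg19.fa sample.recal.bam | bcftools view -vcg - > sample.mpileup.vcf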

Step 2.5: Statistical filtering

The raw VCF files frequently contain many sites that are not really genetic variants, but rather machine artifacts that make the site statistically non-reference. GATK designed a tool, the Variant Quality Score Recalibrator (VQSR), to separate the false-positive machine artifacts from the true-positive genetic variants. A tutorial is posted on the GATK forums: http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr

Note that this step performs variant quality score recalibration, whereas step 2.3 above performed base quality score recalibration; the two are very different.
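A hedged two-stage VQSR sketch (GATK 2.x/3.x; SNP mode only, with an abbreviated resource list: real runs typically add the Omni and 1000G training sets, and the bundle file names here are placeholders):

    # Stage 1: build the recalibration model from training resources
    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
        -R hg19.fa -input sample.gatk.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf \
        -an QD -an MQ -an FS -an ReadPosRankSum -mode SNP \
        -recalFile snp.recal -tranchesFile snp.tranches

    # Stage 2: apply the model, keeping the 99% sensitivity tranche
    java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
        -R hg19.fa -input sample.gatk.vcf \
        -recalFile snp.recal -tranchesFile snp.tranches \
        --ts_filter_level 99.0 -mode SNP \
        -o sample.recalibrated.vcf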


Phase III: Variant annotation and prioritization

This phase serves to select those variants that are of particular interest, depending on the research problem at hand. The methods are specific to the problem, so we cannot elaborate on them here; we only provide a list of commonly used tools below.

  • Mendelian disease linked variants: VAR-MD, KGGSeq, FamSeq (Peng et al. 2013).
  • Predicting the deleteriousness of a non-synonymous single nucleotide variant: dbNSFP, HuVariome, SeattleSeq, ANNOVAR, VAAST, snpEff.
  • Identifying variants within regulatory regions: RegulomeDB (Boyle et al. 2012).

If it is desirable to perform all processing in Galaxy, then the workflow will be limited to the following tools that can be integrated into Galaxy:

  • bioconductor,
  • bowtie,
  • bwa,
  • fastx,
  • flexbar,
  • samtools.
The uses of these tools in the workflow are explicated in the workflow schematic.
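As one concrete example of annotation, a hedged ANNOVAR sketch (table_annovar.pl is ANNOVAR's summary driver; the database names and versions given to -protocol are placeholders to be checked against the current ANNOVAR documentation):

    # Annotate a VCF against RefSeq genes (gene-based, 'g') and a
    # dbNSFP-derived deleteriousness table (filter-based, 'f')
    table_annovar.pl sample.recalibrated.vcf humandb/ -buildver hg19 \
        -out sample.anno -remove \
        -protocol refGene,ljb26_all -operation g,f \
        -vcfinput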

References

  • Bansal V (2010) A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics 26(12):i318-i324.
  • BlatFAQ (2013) http://genome.ucsc.edu/FAQ/FAQblat.html
  • Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, Karczewski KJ, Park J, Hitz BC, Weng S, Cherry JM, Snyder M (2012) Annotation of functional variation in personal genomes using RegulomeDB. Genome Research 22(9):1790-1797. PMID: 22955989.
  • Cabanski CR, Cavin K, Bizon C, Wilkerson MD, Parker JS, Wilhelmsen KC, Perou CM, Marron JS, Hayes DN (2012) ReQON: a Bioconductor package for recalibrating quality scores from next-generation sequencing data. BMC Bioinformatics 13:221.
  • Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13:238.
  • Cox MP, Peterson DA, Biggs PJ (2010) SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11:485.
  • CLCBio (2013) http://www.clcbio.com/clc-plugin/duplicate-mapped-reads-removal/
  • DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5):491-498.
  • Dodt M, Roehr JT, Ahmed R, Dieterich C (2012) Flexbar: flexible barcode and adapter processing for next-generation sequencing platforms. MDPI Biology 1(3):895-905.
  • Fastq (2013) http://en.wikipedia.org/wiki/FASTQ_format
  • GATK (2013) Best Practice Variant Detection with the GATK v4, for release 2.0. http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0
  • Illumina whitepaper (2010) http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
  • IndelRealigner (2013) http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_IndelRealigner.html
  • Langmead B, Salzberg S (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357-359.
  • Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25:1754-1760.
  • Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11(5):473-483.
  • Peng G, Fan Y, Palculict TB, Shen P, Ruteshouser EC, Chi A, Davis RW, Huff V, Scharfe C, Wang W (2013) Rare variant detection using family-based sequencing analysis. Proceedings of the National Academy of Sciences 110(10):3985-3990.
  • Minikel EV (2012) http://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in-next-generation-sequencing/
  • Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12:443-451.
  • Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z (2013) A survey of tools for variant analysis of next-generation genome sequencing data. Briefings in Bioinformatics (21 January 2013).
  • Rodriguez-Ezpeleta N, Hackenberg M, Aransay AM (2012) Bioinformatics for high throughput sequencing. Springer.
  • Schmieder R, Edwards R (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics 27:863-864.
  • Tufts (2011) http://genomics.med.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf
  • Tufts (2013) http://genomics.med.tufts.edu/home/faq

Practice Datasets

A folder containing the input and output datasets as well as a README file can be accessed here.


Resource Requirements

Resource requirements for running the human variant calling SOP on full genome data

Data set size: 170 GB (50x human genome, .fq.gz, 1.5 billion reads). For each step, the minimum hardware requirements, scalability, approximate time frame, biggest bottleneck and notes are listed below.

Step: Quality control

  • Minimum hardware requirements: 1 core, 4 GB RAM, storage equal to the size of the input.
  • Scalability: run time and required storage increase proportionally with data size (samples/coverage).
  • Approx. time frame: a few hours per run (QC with FastQC, trimming with Trimmomatic).
  • Biggest bottleneck: the QC process is most efficient if it can be parallelized.
  • Notes: the GATK resource bundle and reference take 30 GB. Storage requirements during processing are 10x the size of the input dataset, e.g. for 10 full genomes 17 TB is required. The intermediate output can be reduced by piping output/input between tools or by cleaning up intermediate output while processing.

Step: Alignment of reads

  • Minimum hardware requirements: 1 core, 4 GB RAM, storage 5x the size of the input.
  • Scalability: run time and required storage increase proportionally with data size. An alignment requires 4 GB of RAM (the human reference genome is loaded into memory); an alignment can be chunked among cores to reduce processing time, and the number of concurrent alignments is proportional to the number of cores and the amount of RAM available.
  • Approx. time frame: BWA mem plus sambamba (mark duplicates) = 8 hours.
  • Biggest bottleneck: the process involves many reads and writes to disk, therefore access to fast disk makes it more efficient.

Step: Variant calling with GATK

  • Minimum hardware requirements: 1 core, 4 GB RAM, storage 5x the size of the input.
  • Scalability: as for alignment; in addition, a GATK walker can be chunked among cores to reduce processing time.
  • Approx. time frame: extracting alignments belonging to each chromosome = 1 hour; indexing alignments on each chromosome < 10 min; adding read group info with Picard = 1 hour; sorting the result by coordinate = 1 hour; then, on each chromosome: local realignment = 2 hours, quality score recalibration = 3 hours, variant calling with the UnifiedGenotyper = 3 hours.
  • Biggest bottleneck: the process involves many reads and writes to disk, therefore access to fast disk makes it more efficient.

 

Resource requirements for running the human variant calling SOP on exome data

Data set size: 7 GB (45x exome, .fq.gz). For each step, the minimum hardware requirements, scalability, approximate time frame, biggest bottleneck and notes are listed below.

Step: Quality control

  • Minimum hardware requirements: 1 core, 4 GB RAM, storage equal to the size of the input plus 100 MB.
  • Scalability: run time and required storage increase proportionally with data size (samples/coverage).
  • Approx. time frame: an hour per run (QC with FastQC, trimming with Trimmomatic).
  • Biggest bottleneck: the QC process is most efficient if it can be parallelized.
  • Notes: the GATK resource bundle and reference take 30 GB. Storage requirements during processing are 10x the size of the input dataset, e.g. for 10 exomes 700 GB is required. The intermediate output can be reduced by piping output/input between tools or by cleaning up intermediate output while processing.

Step: Alignment of reads

  • Minimum hardware requirements: 1 core, 4 GB RAM, storage 5x the size of the input.
  • Scalability: run time and required storage increase proportionally with data size. An alignment requires 4 GB of RAM (the human reference genome is loaded into memory); an alignment can be chunked among cores to reduce processing time, and the number of concurrent alignments is proportional to the number of cores and the amount of RAM available.
  • Approx. time frame: with the input reads broken into 8 chunks, on each chunk: bwa aln < 10 min, bwa sampe < 20 min, SAM-to-BAM conversion < 10 min; joining the chunks: novosort (sort, merge, add RG tags) < 10 min, samtools index of the merged BAM < 10 min, Picard MarkDuplicates = 1 hour, samtools index of the duplicate-marked BAM < 10 min.
  • Biggest bottleneck: the process involves many reads and writes to disk, therefore access to fast disk makes it more efficient.

Step: Variant calling with GATK

  • Minimum hardware requirements: 1 core, 4 GB RAM, storage 5x the size of the input.
  • Scalability: as for alignment; in addition, a GATK walker can be chunked among cores to reduce processing time.
  • Approx. time frame: extracting alignments belonging to each chromosome < 10 min; indexing alignments on each chromosome < 10 min; adding read group info with Picard < 10 min; sorting the result by coordinate < 10 min; then, on each chromosome: local realignment < 30 min, quality score recalibration = 30 min, variant calling with the UnifiedGenotyper = 40 min.
  • Biggest bottleneck: the process involves many reads and writes to disk, therefore access to fast disk makes it more efficient.

 

List of H3ABioNet servers with sufficient resources for running the human variant calling SOP

  • Option A: Dell C6145, 64 cores, 256 GB RAM, 10 TB disk capacity. Sufficient for this analysis: yes. Nodes with such a server: Rhodes, CPGR, ICIPE, IPT, CBIO.
  • Option C: Dell C6145, 24 cores, 128 GB RAM, 10 TB disk capacity. Sufficient for this analysis: yes. Nodes: MUHAS, NMIMR, KCCR, UDSM.
  • Uganda HP 1: HP DL360e, 16 cores, 64 GB RAM, 3.6 TB disk capacity. Sufficient for this analysis: yes. Node: UVRI.
  • Uganda HP 2: HP DL385p, 16 cores, 128 GB RAM, 6 TB disk capacity. Sufficient for this analysis: yes. Node: UVRI.
  • Tanzania Dell 1: Dell, 8 cores, 144 GB RAM, 12 TB disk capacity. Sufficient for this analysis: yes. Node: MDH.