The advent of next-generation sequencing (NGS), which produces millions to billions of sequence reads at high speed, has greatly accelerated genomics research. Currently available NGS platforms include Illumina, Ion Torrent/Life Technologies, 454/Roche, Pacific Biosciences, Oxford Nanopore, and GenapSys. They produce reads of roughly 100-10,000 bp in length, enabling sufficient coverage of the genome at a lower cost. Faced with this enormous amount of sequence data, how do we best handle it? And what are the most appropriate computational methods and analysis tools for this purpose? In this review, we focus on the bioinformatics pipeline of whole exome sequencing (WES).
Whole exome sequencing is a genomic technique for sequencing the exome (the protein-coding regions of all genes). It is widely used in basic and applied research, especially in the study of Mendelian diseases. For more background on WES, see the article "Principle and Workflow of Whole Exome Sequencing". A typical WES analysis workflow includes the following steps: raw data quality control, preprocessing, sequence alignment, post-alignment processing, variant calling, variant annotation, and variant filtration and prioritization. Each step is discussed below.
Figure 1. A general framework of WES data analysis (Bao et al. 2014).
Raw data quality control
Sequence data are generally stored in one of two standard formats: FASTQ and FASTA. Unlike FASTA, FASTQ files store a Phred-scaled quality score for each base, which makes it possible to measure sequence quality. FASTQ is therefore widely accepted as the standard format for NGS raw data. Multiple tools have been developed to assess the quality of NGS raw data, such as FastQC, FastQ Screen, FASTX-Toolkit, and NGS QC Toolkit.
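To make the Phred-scaled scores concrete, here is a minimal Python sketch of how a FASTQ quality string can be decoded. It assumes the common Phred+33 encoding and unwrapped four-line records; the function names are illustrative and do not come from any of the tools listed above.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumption: Phred+33 encoding (each character's ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Convert a Phred score Q into an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumption: every record spans exactly four lines (no line wrapping).
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between records
        sequence = next(it).strip()
        next(it)  # the '+' separator line
        quality = next(it).strip()
        yield header.strip()[1:], sequence, quality
```

Under this encoding, a score of 20 corresponds to an estimated 1% chance that the base call is wrong, and a score of 30 to 0.1%.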
Read QC parameters:
- Base quality score distribution
- Sequence quality score distribution
- Read length distribution
- GC content distribution
- Sequence duplication level
- PCR amplification issues
- K-mer bias
- Over-represented sequences
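As a rough illustration of how some of these parameters are derived, the sketch below computes a handful of them from reads held in memory. It is a simplified stand-in rather than the method used by FastQC or the other tools. It assumes uppercase, non-empty reads with Phred+33 qualities, and the name `qc_summary` is hypothetical.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def qc_summary(reads):
    """Summarise basic QC metrics for a list of (sequence, quality) pairs.

    Assumes every read is non-empty and uppercase.
    """
    if not reads:
        raise ValueError("no reads supplied")

    # Read length distribution: length -> number of reads
    length_distribution = Counter(len(seq) for seq, _ in reads)

    # GC content per read (fraction of G and C bases)
    gc_content = [(seq.count("G") + seq.count("C")) / len(seq) for seq, _ in reads]

    # Sequence quality: mean Phred score of each read
    mean_read_quality = [
        sum(ord(c) - PHRED_OFFSET for c in qual) / len(qual) for _, qual in reads
    ]

    # Base quality: mean Phred score at each read position
    longest = max(len(qual) for _, qual in reads)
    totals, counts = [0] * longest, [0] * longest
    for _, qual in reads:
        for i, c in enumerate(qual):
            totals[i] += ord(c) - PHRED_OFFSET
            counts[i] += 1
    per_position_quality = [t / n for t, n in zip(totals, counts)]

    # Duplication level: fraction of reads that repeat an earlier sequence
    seq_counts = Counter(seq for seq, _ in reads)
    duplication_level = sum(n - 1 for n in seq_counts.values()) / len(reads)

    return {
        "length_distribution": length_distribution,
        "gc_content": gc_content,
        "mean_read_quality": mean_read_quality,
        "per_position_quality": per_position_quality,
        "duplication_level": duplication_level,
        "over_represented": seq_counts.most_common(3),
    }
```

Dedicated QC tools work on streamed, subsampled data and add statistical tests for each module, so this sketch only shows the shape of the calculations.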
With a comprehensive read QC report (which generally covers the parameters above), researchers can determine whether data preprocessing is necessary. Preprocessing steps generally involve 3' end adapter removal, filtering of low-quality or redundant reads, and trimming of undesired sequences. Several tools can be used for data preprocessing, such as Cutadapt and Trimmomatic. PRINSEQ and QC3 can perform both quality control and preprocessing.
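The preprocessing steps described above can be sketched in a few lines of Python. This is a simplified illustration, not how Cutadapt or Trimmomatic are implemented: it matches the adapter exactly (real tools tolerate mismatches), and the function names and default thresholds are assumptions chosen for the example.

```python
def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Remove a 3' adapter, either a full copy or a partial copy at the read end.

    Assumption: exact matching only.
    """
    idx = seq.find(adapter)
    if idx == -1:
        # Look for the longest adapter prefix that the read ends with
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                idx = len(seq) - k
                break
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Strip bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def preprocess(reads, adapter, min_q=20, min_len=20):
    """Trim adapters and low-quality tails, then drop short or redundant reads."""
    seen, kept = set(), []
    for seq, qual in reads:
        seq, qual = trim_adapter(seq, qual, adapter)
        seq, qual = trim_low_quality_tail(seq, qual, min_q)
        if len(seq) < min_len or seq in seen:
            continue  # too short after trimming, or an exact duplicate
        seen.add(seq)
        kept.append((seq, qual))
    return kept
```

Adapter trimming runs before quality trimming here so that adapter bases with high quality scores are still removed; production tools let the user configure the order and the error tolerance.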
Sequence alignment
Several algorithms are used for short-read mapping, including the Burrows-Wheeler transform (BWT) and the Smith-Waterman (SW) algorithm. Bowtie2 and BWA are two popular short-read alignment tools that implement the BWT. MOSAIK, SHRiMP2, and Novoalign are important short-read alignment tools that implement the SW algorithm for increased accuracy. Additionally, multithreading and MPI implementations allow a significant reduction in runtime. Of the tools mentioned above, Bowtie2 stands out for its fast running time, high sensitivity, and high accuracy.
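To show why the BWT is useful for read mapping, the following toy Python sketch builds a BWT-based index of a short reference and finds exact matches with backward search. It is a teaching example only: it uses a naive suffix sort and full occurrence tables, whereas Bowtie2 and BWA use compressed indexes and also handle mismatches and gaps. The function names are hypothetical, and the reference is assumed not to contain the `$` sentinel character.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy BWT index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel; sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first[c]: number of characters in the text that sort before c
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, first, occ


def find_exact(pattern, index):
    """Return sorted start positions of exact matches, using backward search."""
    suffix_array, first, occ = index
    if not pattern:
        return []
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])
```

Each character of the read narrows the range of matching suffixes in constant time, so the search cost depends on the read length rather than on the size of the reference.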
Post-alignment processing
After read mapping, the aligned reads are post-processed to remove undesired reads or alignments, such as reads exceeding a defined size and PCR duplicates. Tools such as Picard MarkDuplicates and SAMtools can distinguish PCR duplicates from reads derived from distinct DNA fragments. The second step is to improve the quality of gapped alignments via indel realignment. Some aligners (such as Novoalign) and variant callers (such as GATK HaplotypeCaller) already incorporate improved alignment around indels. After indel realignment, base quality score recalibration (BQSR, performed with BaseRecalibrator from the GATK suite) is recommended to improve the accuracy of base quality scores prior to variant calling.
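The idea behind duplicate marking can be sketched as follows. This simplified Python example treats single-end reads that share a chromosome, 5' start position, and strand as copies of one fragment and keeps the read with the highest summed base quality. The `AlignedRead` type is hypothetical, and real tools such as Picard MarkDuplicates also account for read pairs, clipping, and library information.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class AlignedRead(NamedTuple):
    """Hypothetical minimal record for an aligned single-end read."""
    name: str
    chrom: str
    pos: int                      # 5' alignment start
    strand: str                   # "+" or "-"
    base_qualities: Tuple[int, ...]


def mark_duplicates(reads):
    """Return the names of reads flagged as PCR duplicates."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality as representative
        best = max(group, key=lambda r: sum(r.base_qualities))
        duplicates.update(r.name for r in group if r is not best)
    return duplicates
```

Flagging rather than deleting duplicates mirrors common practice, since downstream tools can then decide whether to ignore the flagged reads.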
Variant calling
Variant analysis aims to detect different types of genomic variants, such as single-nucleotide polymorphisms (SNPs), single-nucleotide variants (SNVs), insertions and deletions (indels), copy number variants (CNVs), and larger structural variants (SVs). Especially in cancer studies, it is vital to distinguish somatic from germline variants. Somatic variants are present only in somatic cells and are tissue-specific, while germline variants are inherited mutations that are present in the germ cells and are linked to the patient's family history. Variant calling is used to identify SNVs and short indels in exome samples. The common variant calling tools are listed in Table 1. Some studies have evaluated these variant callers. Liu et al. recommended GATK, and Bao et al. recommended a combination of Novoalign and FreeBayes.
Table 1. The common variant calling tools.
|Germline variant calling|GATK, SAMtools, FreeBayes, Atlas2|
|Somatic variant detection|GATK, SAMtools mpileup, Isaac variant caller, deepSNV, Strelka, MutationSeq, MuTect, QuadGT, Seurat, Shimmer, SolSNP, jointSNVMix, SomaticSniper, VarScan2, Virmid|
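The somatic versus germline distinction can be illustrated with a deliberately naive Python sketch that compares allele fractions in a matched tumor and normal sample. The thresholds and the function name are assumptions made for this example. The callers in Table 1 rely on statistical models rather than fixed cutoffs.

```python
def allele_fraction(alt_reads, depth):
    """Fraction of reads at a site that support the alternate allele."""
    return alt_reads / depth if depth else 0.0


def classify_variant(tumor_alt, tumor_depth, normal_alt, normal_depth,
                     min_depth=10, min_tumor_af=0.05,
                     max_normal_af=0.02, min_germline_af=0.2):
    """Label a site as germline, somatic, uncertain, or low_coverage.

    All thresholds are illustrative assumptions, not recommended values.
    """
    if tumor_depth < min_depth or normal_depth < min_depth:
        return "low_coverage"
    tumor_af = allele_fraction(tumor_alt, tumor_depth)
    normal_af = allele_fraction(normal_alt, normal_depth)
    if normal_af >= min_germline_af:
        return "germline"   # clearly present in the normal sample
    if tumor_af >= min_tumor_af and normal_af <= max_normal_af:
        return "somatic"    # supported in tumor, essentially absent in normal
    return "uncertain"
```

A variant seen at a heterozygous-like fraction in the normal sample is treated as inherited, while one supported only in the tumor is treated as a somatic candidate.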
Variant annotation
After variants are identified, they need to be annotated to better understand disease pathogenesis. Variant annotation generally involves information about genomic coordinates, gene position, and mutation type. Many studies focus on the non-synonymous SNVs and indels in the exome, which account for 85% of known disease-causing mutations in Mendelian disorders and a large share of mutations in complex diseases.
Besides the basic annotation, many databases can provide additional information about the variants. ANNOVAR is a powerful tool that combines over 4,000 public databases for variant annotation, such as dbSNP, 1000 Genomes, and NCI-60 human tumor cell line panel exome sequencing data. This tool can be used to report minor allele frequency (MAF), predictions of deleteriousness, conservation of the mutated site, experimental evidence for disease variants, and prediction scores from GERP, PolyPhen, and other programs. Other common databases include OncoMD, OMIM, SNPedia, and personal genome variants.
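As an illustration of the "mutation type" part of annotation, the Python sketch below classifies a coding SNV by translating the reference and alternate codons with the standard genetic code, and classifies an indel by whether it preserves the reading frame. The function names and labels are illustrative and are not taken from any annotation tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon, alt_codon):
    """Classify a single-codon change as synonymous or a non-synonymous subtype."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gain"
    if ref_aa == "*":
        return "stop_loss"
    return "missense"


def classify_indel(ref_allele, alt_allele):
    """Classify a coding indel by its effect on the reading frame."""
    if (len(alt_allele) - len(ref_allele)) % 3 == 0:
        return "in_frame_indel"
    return "frameshift_indel"
```

For example, `classify_coding_snv("GAG", "GTG")` reports a missense change (glutamate to valine), while `classify_coding_snv("GAA", "GAG")` is synonymous because both codons encode glutamate.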
Variant filtration and prioritization
WES can generate thousands of variant candidates. Variant prioritization reduces this number to a short, ranked list of candidate mutations for further experimental validation. Variant prioritization involves three steps: 1) removal of less reliable variant calls; 2) depletion of common variants (on the assumption that rare variants are more likely to cause disease); 3) prioritization of variants relative to the disease using discovery-based and hypothesis-based approaches. The available tools for variant filtration and prioritization include VAAST2, VarSifter, KGGSeq, PLINK/SEQ, SPRING, GUI tool, Gnome, and Ingenuity Variant Analysis.
In the next few years, whole exome sequencing may be adopted as a routine clinical procedure for disease treatment, and many healthcare facilities already provide genetic testing based on NGS technologies such as WES. The next challenge will be managing data that contain millions of genomic variants and integrating those variants with clinical records and patient information.
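The three prioritization steps can be sketched in Python as a filter followed by a ranking. The dictionary keys (`qual`, `depth`, `maf`, `gene`, `effect`), the thresholds, and the scoring rule are all assumptions made for illustration; none of the tools listed above is implemented this way as far as this sketch is concerned.

```python
DAMAGING_EFFECTS = {"missense", "stop_gain", "stop_loss", "frameshift"}


def prioritize(variants, min_qual=30.0, min_depth=10, max_maf=0.01,
               candidate_genes=frozenset()):
    """Filter and rank variants given as dicts with hypothetical keys.

    Expected keys: "qual", "depth", "maf" (None if unknown), "gene", "effect".
    """
    # Step 1: remove less reliable calls
    reliable = [
        v for v in variants
        if v["qual"] >= min_qual and v["depth"] >= min_depth
    ]

    # Step 2: deplete common variants (an unknown MAF is treated as rare)
    rare = [v for v in reliable if v["maf"] is None or v["maf"] <= max_maf]

    # Step 3: rank by hypothesis (candidate gene), then predicted effect, then quality
    def score(v):
        return (v["gene"] in candidate_genes,
                v["effect"] in DAMAGING_EFFECTS,
                v["qual"])

    return sorted(rare, key=score, reverse=True)
```

Passing a set of genes as `candidate_genes` corresponds to a hypothesis-based approach, while leaving it empty ranks purely on predicted effect and call quality, which is closer to a discovery-based approach.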
References
- Bao R, Huang L, Andrade J, et al. Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer Informatics, 2014, 13: CIN.S13779.
- Meena N, Mathur P, Medicherla K M, et al. A bioinformatics pipeline for whole exome sequencing: overview of the processing and steps from raw data to downstream analysis. bioRxiv, 2017: 201145.
- Xu H, DiCarlo J, Satya R V, et al. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics, 2014, 15: 244.