The Methods of Whole Genome Sequencing

Overview of Whole Genome Sequencing

The genome of an organism contains its entire genetic information. Whole genome sequencing can comprehensively and accurately analyze entire genomes, decoding the information they contain and revealing the complexity and diversity of the genome. Its emergence is a revolutionary advance across the life sciences. Whole genome sequencing can detect many types of variants, including single-nucleotide variants, insertions/deletions, copy number changes, and large-scale structural variants. It is classified as de novo sequencing or resequencing depending on whether a reference genome is available. When a reference genome exists, genome assembly becomes easier and faster.

  1. Two Classic Approaches for Sequencing Large Genomes

In the early 1980s, Sanger completed the whole genome sequence of bacteriophage lambda using the shotgun method, which was subsequently applied to larger viral genomes, organelle DNA, and bacterial genomes. Shotgun sequencing is the classic strategy for whole genome sequencing and provides the technical basis for large-scale sequencing. The approach first randomly breaks the complete target sequence into small fragments, sequences them separately, and then assembles them into a consensus sequence using the overlaps between fragments. There are two main variants: hierarchical shotgun sequencing (the clone-by-clone method) and whole genome shotgun sequencing.
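The overlap-based assembly idea described above can be sketched in a few lines. This is a minimal greedy illustration under simplifying assumptions (error-free fragments, a single strand, no repeats); the function names `overlap` and `greedy_assemble` are hypothetical and do not come from any particular assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
```

Real assemblers must also handle sequencing errors, reverse complements, and repeats, which is why repetitive genomes are hard to assemble with this strategy.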

  • Clone-by-clone sequencing

This method was adopted by the Human Genome Project (HGP) consortium. It generates high-density maps, which makes genome assembly easier. It generally includes four steps: preparation of a BAC clone library, clone fingerprinting, BAC clone sequencing, and sequence assembly. However, the method is time-consuming and costly, so it is seldom used today.

Figure 1. Steps involved in the clone-by-clone sequencing.

  • Whole genome shotgun sequencing

Whole genome shotgun sequencing (WGS) generally involves six steps: isolation of genomic DNA, random fragmentation, size selection by electrophoresis, library construction, paired-end (PE) sequencing, and genome assembly. DNA fragments of two size classes, long inserts (2-2.5 kb) and short inserts (0.5-1.2 kb), are selected from the agarose gel. Long inserts are cloned in phage or cosmid vectors, while short inserts are cloned in plasmid vectors. The short-insert clone library is sequenced from both ends. Because large numbers of clones are sequenced, each region of the genome is covered more than 10 times on average. Long-insert clones are used to improve the efficiency of genome assembly.
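The "more than 10 times" coverage figure can be related to the number of reads with a short calculation. The sketch below assumes reads are placed uniformly at random (the standard Poisson approximation), and the genome size and read length in the example are hypothetical values chosen for illustration.

```python
import math


def mean_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Average number of times each base is sequenced: C = N * L / G."""
    return n_reads * read_len / genome_size


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases with zero reads under a Poisson model: exp(-C)."""
    return math.exp(-coverage)


def reads_needed(coverage: float, read_len: int, genome_size: int) -> int:
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(coverage * genome_size / read_len)


# Hypothetical example: a 5 Mb bacterial genome sequenced with 500 bp reads.
print(mean_coverage(100_000, 500, 5_000_000))
print(reads_needed(10, 500, 5_000_000))
print(fraction_uncovered(10))
```

Under these assumptions, 10-fold coverage leaves only a very small expected fraction of the genome unsequenced, which is one reason a redundancy of this order is chosen.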


Advantages:

  • Does not require genome maps
  • Less time-consuming
  • Lower cost

Disadvantages:

  • Assembly of eukaryotic genomes is difficult because of abundant repetitive sequences
  • The resulting genome sequence is less accurate

  2. NGS Accelerates WGS

Unlike clone-based library approaches, next-generation sequencing (NGS) platforms use a dramatically simplified method of library construction, which has simplified and accelerated whole genome shotgun sequencing. In general, genomic DNA is first randomly fragmented by sonication or nebulization, and the fragments are then ligated to a platform-specific set of double-stranded adapters to generate a shotgun library. These library fragments are subsequently amplified in situ by hybridization and extension from complementary adapters that are covalently attached to the surface of a glass microfluidic cell or a small bead, depending on the sequencing platform. NGS instruments typically use a microfluidic device to hold the amplified fragments of the shotgun library, and an imaging step collects data from the fragments as they are sequenced.

We will take the Illumina sequencer as an example to illustrate the workflow of WGS based on high-throughput sequencing.

  • Construction of Sequencing Library

Genomic DNA is first extracted and then randomly fragmented into pieces a few hundred bases long or shorter, and specific adapters are added to both ends. If the transcriptome is being sequenced, library construction is somewhat more involved: either the RNA is fragmented, reverse transcribed into cDNA, and ligated to adapters, or the RNA is first reverse transcribed into cDNA, which is then fragmented and ligated to adapters. The fragment size (insert size) affects downstream data analysis and can be chosen as needed. For genome sequencing, several different insert sizes are usually used to provide more information for assembly.

  • Surface Attachment and Bridge Amplification

The Solexa sequencing reaction is carried out in a glass flow cell, which is subdivided into 8 lanes. The inner surface of each lane carries immobilized single-stranded adapter oligonucleotides. Adapter-ligated DNA fragments are denatured into single strands and hybridize to the primers on the lane surface, forming bridge-like structures for subsequent amplification.

  • Denaturation and Complete Amplification

Unlabeled dNTPs and a standard Taq polymerase are added for solid-phase bridge PCR amplification, which converts each single-stranded bridge into a double-stranded bridge. Denaturation then releases the complementary single strands, which anchor to the nearby solid surface. Through repeated cycles, millions of clusters of double-stranded DNA are generated on the surface of the flow cell.

  • Single Base Extension and Sequencing

Four fluorescently labeled dNTPs, DNA polymerase, and adapter primers are added to the flow cell. As each cluster extends its complementary strand, every incorporated labeled dNTP emits a characteristic fluorescent signal. The sequencer captures these signals, and computer software converts the optical signals into base calls, yielding the sequence of the fragment. Read length is limited by several factors that cause signal decay, such as incomplete cleavage of the fluorescent labels. As read length increases, the error rate also increases.

  • Data Analysis

This step is not strictly part of the sequencing process, but the preceding steps only become meaningful through it. The raw data consist of reads only tens of bases long. Bioinformatics tools assemble these short reads into contigs, or even into a scaffold of the entire genome. Alternatively, the reads are aligned to an existing reference genome, or to the genome of a closely related species, and analyzed further to obtain biologically meaningful results.

  3. Application of Third-generation Sequencing in Whole Genome Sequencing

Although next-generation sequencing has enabled population-scale analyses of small variants, it remains difficult to identify larger structural variations with short reads. Furthermore, de novo assemblies built from next-generation sequencing data are often of lower quality than those produced with older and more expensive methods. Single-molecule sequencing technologies can overcome these difficulties: their long reads support assemblies that can span nearly entire chromosome arms, and they are largely insensitive to GC content. Third-generation sequencing technologies have been used to produce highly accurate de novo and reference assemblies for microorganisms, plants, animals, and humans, enabling new insights into evolution and sequence diversity.



Principles and Workflow of Whole Genome Bisulfite Sequencing

Principles of whole genome bisulfite sequencing

Epigenetic studies have confirmed that DNA methylation of specific gene regions plays an important role in chromosome conformation and the regulation of gene expression. Methylation of cytosine at the C5 position (5meC) is a common epigenetic mark in many eukaryotes and occurs widely at CpG or CpHpG sites (H = A, T, or C). There are three main approaches to profiling DNA methylation: endonuclease digestion, affinity enrichment, and bisulfite conversion (Table 1). Almost all sequence-specific DNA methylation analyses require a methylation-dependent treatment before amplification or hybridization so that the methylation information is retained. Molecular biology techniques such as next-generation sequencing (NGS) are then used to detect 5meC residues.

Table 1. Main principles of NGS-based methylation analysis.

  | Enzyme digestion | Affinity enrichment | Sodium bisulfite
Principles | Some restriction enzymes, such as HpaII and SmaI, are inhibited by 5meC in the CpG context. | Antibodies specific for 5meC, or methyl-binding proteins with affinity for methylated DNA, are used to enrich methylated fragments for profiling. | Sodium bisulfite chemically converts unmethylated cytosine into uracil, enabling methylation detection.
Method examples | Methyl-seq, MCA-seq, HELP, MSCC | MeDIP-seq, MIRA | RRBS, WGBS, BSPP

MCA: methylated CpG island amplification; HELP: HpaII tiny fragment enrichment by ligation-mediated PCR; MSCC: methylation-sensitive cut counting; MeDIP-seq: methylated DNA immunoprecipitation sequencing; MIRA: methylated CpG island recovery assay; RRBS: reduced representation bisulfite sequencing; WGBS: whole genome bisulfite sequencing; BSPP: bisulfite padlock probes.

Bisulfite conversion spurred a revolution in genome methylation analysis in the 1990s. Bisulfite converts unmethylated cytosines in the genome into uracils, which are replaced by thymines during PCR amplification, whereas methylated cytosines remain unchanged. Methylated and unmethylated cytosines can therefore be distinguished by counting the cytosines and thymines at each position after sequencing (Figure 1). Whole genome bisulfite sequencing (WGBS), a method of great significance in this field, combines bisulfite treatment with next- or third-generation sequencing (mostly shotgun sequencing) to study DNA methylation at the genome level.

Figure 1. Bisulfite conversion and PCR amplification prior to DNA sequencing.
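The conversion logic can be mimicked in silico. The following is a minimal sketch that assumes complete conversion, no sequencing errors, and reads from the original top strand only; `bisulfite_convert` and `call_methylation` are hypothetical helper names used for illustration.

```python
def bisulfite_convert(seq: str, methylated_positions: set[int]) -> str:
    """Sequence as read after bisulfite treatment and PCR.

    Unmethylated C is deaminated to U and read as T; methylated C stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference: str, converted: str) -> dict[int, bool]:
    """Compare a converted read with the reference at every reference C."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), converted.upper())):
        if ref_base == "C":
            if read_base == "C":
                calls[i] = True    # protected from conversion: methylated
            elif read_base == "T":
                calls[i] = False   # converted: unmethylated
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(call_methylation(reference, read))
```

In real data, each position is covered by many reads, so the C and T counts at a site are aggregated rather than taken from a single read.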

Advantages of whole genome bisulfite sequencing

  • Making genome-wide methylation profiling possible at a single-base level.
  • Assessing the methylation status of almost every CpG locus, including intergenic “gene deserts”, partial methylation domains, and remote regulatory elements.
  • Revealing absolute DNA methylation levels and the sequence context of methylation.

Workflow of whole genome bisulfite sequencing

In short, the basic steps of whole genome bisulfite sequencing (WGBS) include DNA extraction, bisulfite conversion, library preparation, sequencing, and bioinformatics analysis. Here we use Illumina HiSeq as our example to illustrate the workflow of WGBS.

Figure 2. The workflow of whole genome bisulfite sequencing (Khanna et al. 2013).

  • DNA Extraction

First, approximately 1-5 mg of tissue collected from humans, animals, plants, or microorganisms is prepared for DNA extraction. In general, samples for whole genome bisulfite sequencing should meet the following four criteria.

  1. Eukaryotes;
  2. Hypomethylation (as shown in Figure 3, studies have shown that as the number of CpG sites in a region increases, WGBS sequencing coverage of that region begins to decrease);
  3. A reference genome assembled to at least the scaffold level;
  4. Relatively complete genome annotations.

A suitable kit is then used to extract high-purity, high-molecular-weight DNA. The extracted DNA should have a mass of no less than 5 μg, a concentration of no less than 50 ng/μL, and an OD260/280 of 1.8 to 2.0.
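The DNA input requirements stated above can be written as a simple acceptance check. This is only a sketch of those three thresholds; the function name `passes_wgbs_input_qc` is hypothetical.

```python
def passes_wgbs_input_qc(mass_ug: float, conc_ng_per_ul: float, od260_280: float) -> bool:
    """Check extracted DNA against the input thresholds quoted in the text."""
    return (
        mass_ug >= 5.0               # at least 5 micrograms of DNA
        and conc_ng_per_ul >= 50.0   # at least 50 ng per microliter
        and 1.8 <= od260_280 <= 2.0  # purity ratio within 1.8 to 2.0
    )


print(passes_wgbs_input_qc(6.2, 80.0, 1.85))
print(passes_wgbs_input_qc(6.2, 80.0, 2.15))
```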

Figure 3. Conventional WGBS technology has low coverage of methylation sites (Raine et al. 2016)

  • Bisulfite Conversion

Bisulfite conversion is considered the “gold standard” for DNA methylation analysis; its principle is shown in Figure 4. However, bisulfite-induced DNA degradation may deplete genomic regions enriched for unmethylated cytosines. It is therefore important to assess the amount of DNA degradation under the chosen reaction conditions and to consider how it affects the desired amplicon. Olova et al. (2018) found that DNA degradation is pronounced in bisulfite conversion protocols that use high denaturation temperatures or high bisulfite molarity. Several kits are commercially available (Table 2).

Figure 4. Bisulfite-mediated deamination of cytosine (Hayatsu et al. 2004).

Table 2. Bisulfite conversion protocols and parameters.

Kit | Denaturation | Conversion temperature | Incubation time
Zymo EZ DNA Methylation Lightning Kit | Heat-based; 99 °C / Alkaline-based; 37 °C | 65 °C | 90 minutes
EpiTect Bisulfite Kit (Qiagen) | Heat-based; 99 °C | 55 °C | 10 hours
EZ DNA Methylation Kit (Zymo Research) | Alkaline-based; 37 °C | 50 °C | 12-16 hours

  • Library Preparation

Taking the EpiGnome™ Methyl-Seq Kit (Epicentre) as an example (Figure 5), bisulfite-treated single-stranded DNA is random-primed using a polymerase capable of reading uracil nucleotides to synthesize DNA containing a specific sequence tag. The 3′ end of the newly synthesized DNA strand is then selectively tagged with a second specific sequence, yielding a di-tagged DNA molecule with known sequence tags at its 5′ and 3′ ends. Illumina P7 and P5 adapters are subsequently added by PCR at the 5′ and 3′ ends prior to sequencing.

Figure 5. Workflow for the EpiGnome™ Methyl-Seq Kit.

  • Sequencing

HiSeq sequencing technology, based on sequencing-by-synthesis (SBS), is widely applied for WGBS. Bridge amplification on the flow cell produces an array of clusters, each derived from a single molecule. Reversible terminator chemistry allows only one fluorescently labeled base to be incorporated per cycle; a laser then excites the fluorophore, and the emitted light is captured to read the base. A paired-end 150 bp strategy is typically used in WGBS to sequence bisulfite-treated DNA libraries with 250-300 bp inserts. In addition to Illumina HiSeq, PacBio SMRT, Nanopore, Roche 454, and other Illumina platforms are also commonly used for this purpose.

  • Data Analysis

A series of analyses can be performed on the sequencing results. The five main types are listed in Table 3. In addition, methylation density analysis, differentially methylated region (DMR) analysis, DMR annotation and enrichment analysis (GO/KEGG), and clustering analysis can also be performed. Common bioinformatic resources for WGBS include BDPC, CpGcluster, CpGFinder, Epinexus, MethTools, mPod, QUMA, and the TCGA Data Portal.

Table 3. Main types of WGBS data analysis.

Type | Details
Alignment against reference genome | Tools such as SOAP are used to align the reads to the reference genome sequence, allowing C-C matches and C-T mismatches. Only aligned reads are used for the analysis of methylation information.
mC calling | Determine mC positions throughout the genome. mC ratios are computed taking read quality and multi-locus mapping probabilities into account. Low-probability alignments, which are unreliable, are discarded.
Sequencing depth and coverage analysis | A plot of the relationship between coverage and sequencing depth shows whether methylation can be called with a given degree of confidence at specific base positions.
Methylation level analysis | The methylation level of each methylated C is calculated as 100 × (number of reads supporting mC) / (total number of reads covering the site). The genome-wide average methylation level reflects the overall characteristics of the genomic methylation profile.
Global trends of methylome | The proportions of the CG, CHG, and CHH contexts among methylated C bases reflect, to some extent, the characteristics of the whole genome methylation map of a given species.
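Two of the analyses in Table 3 reduce to short computations: the per-site methylation level and the CG/CHG/CHH sequence context of a cytosine. The sketch below assumes forward-strand cytosines only and treats the level as the percentage of covering reads that support methylation; both function names are hypothetical.

```python
def methylation_level(methylated_reads: int, total_reads: int) -> float:
    """Per-site methylation level in percent: 100 * mC reads / total reads."""
    if total_reads <= 0:
        raise ValueError("site has no coverage")
    return 100.0 * methylated_reads / total_reads


def cytosine_context(reference: str, pos: int) -> str:
    """Classify the forward-strand cytosine at pos as CG, CHG, or CHH (H = A, C, or T)."""
    ref = reference.upper()
    if ref[pos] != "C":
        raise ValueError("position is not a cytosine")
    downstream = ref[pos + 1:pos + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) == 2 and downstream[0] in "ACT":
        if downstream[1] == "G":
            return "CHG"
        if downstream[1] in "ACT":
            return "CHH"
    return "unknown"  # too close to the sequence end, or ambiguous bases


print(methylation_level(15, 20))
print(cytosine_context("ACGT", 1), cytosine_context("ACAGT", 1), cytosine_context("ACATT", 1))
```

Cytosines on the reverse strand would be handled the same way after reverse-complementing the reference.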



  1. Fraga, M. F., Esteller, M. (2002). Dna methylation: a profile of methods and applications. Biotechniques,33(3), 636-49.
  2. Green, R. E., Krause, J., Briggs, A. W., Maricic, T., Stenzel, U., et al. (2010). A Draft Sequence of the Neandertal Genome. Science, 328(5979), 710–722.
  3. Hayatsu, H., Negishi, K., & Shiraishi, M. (2004). DNA methylation analysis: speedup of bisulfite-mediated deamination of cytosine in the genomic sequencing procedure. Proceedings of the Japan Academy,80(4), 189-194.
  4. Herman, J. G., Graff, J. R., Myöhänen, S., Nelkin, B. D., & Baylin, S. B. (1996). Methylation-specific pcr: a novel pcr assay for methylation status of cpg islands. Proceedings of the National Academy of Sciences of the United States of America,93(18), 9821-9826.
  5. Ji, L., Sasaki, T., Sun, X., Ma, P., Lewis, Z. A., & Schmitz, R. J. (2014). Methylated dna is over-represented in whole-genome bisulfite sequencing data. Front Genet,5(5), 341.
  6. Khanna, A., Czyz, A., & Syed, F. (2013). EpiGnome™ Methyl-Seq Kit: a novel post-bisulfite conversion library prep method for methylation analysis. Nature Methods, 10(10).
  7. Laird, P. W. (2003). The power and the promise of DNA methylation markers. Nature Reviews Cancer, 3(4), 253–266. doi:10.1038/nrc1045
  8. Laura-Jayne, G., Mark, Q. T., Lisa, O., Jonathan, P., Neil, H., & Anthony, H. (2015). A genome-wide survey of dna methylation in hexaploid wheat. Genome Biology,16(1), 273.
  9. Lin Liu, Ni Hu, Bo Wang, Minfeng Chen, Juan Wang, & Zhijian Tian, et al. (2011). A brief utilization report on the illumina hiseq 2000 sequencer. Mycology,2(3), 169-191.
  10. Meissner, A., Gnirke, A., Bell, G. W., Ramsahoye, B., Lander, E. S., & Jaenisch, R. (2005). Reduced representation bisulfite sequencing for comparative high-resolution dna methylation analysis. Nucleic Acids Research,33(18), 5868-77.
  11. Meyer, M., Kircher, M., Gansauge, M. T., Li, H., Racimo, F., & Mallick, S., et al. (2012). A high coverage genome sequence from an archaic denisovan individual. Science,338(6104), 222-6.
  12. Olova, N., Krueger, F., Andrews, S., Oxley, D., Berrens, R. V., & Branco, M. R., et al. (2018). Comparison of whole-genome bisulfite sequencing library preparation strategies identifies sources of biases affecting dna methylation data. Genome Biology,19(1), 33.
  13. Raine, A., Manlig, E., Wahlberg, P., Syvänen, A. C., & Nordlund, J. (2016). Splinted ligation adapter tagging (splat), a novel library preparation method for whole genome bisulphite sequencing. Nucleic Acids Research,45(6), e36.
  14. Ziller, M. J., Müller, F., Liao, J., Zhang, Y., Gu, H., & Bock, C.,et al. (2011). Genomic distribution and inter-sample variation of non-cpg methylation across human cell types. Plos Genetics, 7(12), e1002389.

Overview of Metatranscriptomic Sequencing: Principles, Workflow, and Applications

What is metatranscriptomic sequencing?

Metatranscriptomic sequencing provides direct access to the transcriptomes of both culturable and non-culturable microbes through large-scale, high-throughput sequencing of the transcripts of entire microbial communities in specific environmental samples. It offers an opportunity to randomly sequence mRNAs from the community as a unit in order to understand the regulation of complex processes in microbial communities. Studying the metatranscriptome with next-generation sequencing yields gene expression profiles of whole microbial populations, providing new insights into poorly understood biological systems and overcoming the technical limitations of isolating individual bacteria.

Challenges of metatranscriptomic sequencing

Although current metatranscriptomic techniques are promising, several obstacles still limit their large-scale application. First, much of the harvested RNA is ribosomal RNA (rRNA), and its dominant abundance can dramatically reduce the coverage of mRNA, which is the main focus of transcriptomic studies. Efforts have therefore been made to remove rRNA effectively. Second, mRNA is notoriously unstable, which compromises the integrity of the sample before sequencing. Third, differentiating between host and microbial RNA can be challenging, although commercial enrichment kits are available. The separation can also be done in silico if a reference genome is available for the host, as in the work of Perez-Losada et al., who considered the impact of host–pathogen interactions on the human airway microbiome. Finally, transcriptome reference databases are limited in their coverage.

Workflow of metatranscriptomic sequencing

Briefly, total RNA is first extracted from the sample and its quality is assessed. Qualified RNA then undergoes fragment size selection, library construction, and library quality control. Qualified libraries are sequenced, mainly on Illumina platforms, and the raw sequencing data are used for bioinformatics analysis.

The metatranscriptomic library preparation process is shown in Figure 2. There are two main strategies for mRNA enrichment: removing rRNA by hybridization with 16S and 23S rRNA probes, or depleting rRNA with a 5′-exonuclease. The first strand of cDNA is then synthesized by reverse transcriptase using random hexamers, and the second strand is synthesized by a DNA polymerase. Finally, sequencing adapters are attached to the cDNA, either by PCR or by ligation.

Figure 2. The metatranscriptomic sequencing library preparation process (Peimbert M, et al. 2016)

The bioinformatics analysis of metatranscriptomic sequencing data consists of the following steps: filtering the reads, choosing between alignment to reference sequences and de novo assembly, annotation, statistical analysis, and depositing the raw, assembled, and annotated datasets.

The applications of metatranscriptomics

  • Human health

Symbiotic bacteria (normal flora) play a key role in protecting us from pathogens, but under certain conditions they can overcome protective host responses and trigger pathological effects. Microbial population analysis can be used as an indicator of an individual’s health status and as a powerful tool for the prevention, diagnosis and treatment of specific diseases.

  • Assessment of microbiome–immune interactions

The effects of the microbiota on the mucosal immune system are thought to be key to its influence on host physiology. A study of toll-like receptor 5 (TLR5) knockout (KO) mice is an interesting example of using metatranscriptomics to complement metagenomic and 16S rRNA characterization of microbiome–immune interactions. Metatranscriptomic analysis showed that flagellar motor-related gene expression was up-regulated in TLR5 KO mice compared with wild-type mice. In this model, TLR5 recognition of flagellin leads to the production of anti-flagellin antibodies, which down-regulates various bacterial flagellar motor genes and thereby restrains the microbiota. Deletion of TLR5 reduces the production of anti-flagellin antibodies, leading to up-regulation of bacterial flagellar motor genes and increasing the ability of bacteria in the gastrointestinal environment to breach the mucosal barrier.

  • Studying microbiome small noncoding RNAs

The bacterial transcriptome includes small non-coding RNAs (sRNAs), which are typically between 50 and 500 nucleotides long and are involved in gene regulation. They regulate the translation or stability of a transcript by interacting with the 5′-untranslated region (UTR) of the target mRNA. sRNAs regulate important bacterial processes, such as iron metabolism, virulence, and quorum sensing, and allow bacteria to adapt quickly to changing environments. Next-generation sequencing has accelerated their identification in various bacteria, such as Salmonella and Bacillus subtilis, and also offers an opportunity to study bacterial sRNAs at the community level. For example, metatranscriptomic analysis of bacteria from different ocean depths suggests that sRNAs have a potential role in niche adaptation. Metatranscriptomics of the active human gut microbiota has identified a number of sRNAs, although their roles in the gut microbiota have not yet been elucidated.

  • Drug discovery

Hundreds of drugs used today are derived from bacterial compounds. The study of metagenomes or metatranscriptomes in microbial communities offers new opportunities to explore innovative sources for drug discovery that are inaccessible today due to technical limitations in the isolation of these non-culturable microorganisms.

  • Agriculture

Microbial communities living on and around plants play a vital role in supplying the nutrients needed for plant growth. In addition, the presence of specific microbial communities keeps crops healthy and productive. Metagenomics and metatranscriptomics provide an opportunity to explore how soil microbial populations contribute to healthier and higher-yielding crops.

  • Ecology

Microorganisms are able to remove a wide variety of natural and synthetic harmful substances and convert them into compounds that are harmless to humans and the environment. Understanding how these microbial communities degrade harmful chemicals could provide new solutions for remediating and monitoring environmental pollution and for improving drinking water purification.

  • Food industry

Metagenomics and metatranscriptomics methods can be used to improve food quality, function and safety, and provide information related to metabolic activities of microbial communities.



  1. Peimbert M, Alcaraz L D. A Hitchhiker’s Guide to Metatranscriptomic sequencing [M]// Field Guidelines for Genetic Experimental Designs in High-Throughput Sequencing. Springer International Publishing, 2016.
  2. Vanessa A P, Huang W, Victoria S U, et al. Metagenomics, Metatranscriptomic sequencing, and Metabolomics Approaches for Microbiome Analysis: [J]. Evolutionary Bioinformatics Online, 2016, 12(Supple 1):5-16.
  3. Dick G. Metatranscriptomic sequencing [M]// Genomic Approaches in Earth and Environmental Sci
  4. Maurice C F, Haiser H J, Turnbaugh P J. Xenobiotics shape the physiology and gene expression of the active human gut microbiome[J]. Cell, 2013, 152(1-2):39-50.
  5. Warnecke F, Hess M. A perspective: Metatranscriptomics as a tool for the discovery of novel biocatalysts[J]. Journal of Biotechnology, 2009, 142(1):91-95.
  6. Jorth P, Turner K H, Gumus P, et al. Metatranscriptomics of the Human Oral Microbiome during Health and Disease[J]. Mbio, 2014, 5(2): e01012.
  7. O’Malley M A. Metatranscriptomics[M]. Springer New York, 2013.
  8. Cao Y, Fanning S, Proos S, et al. A Review on the Applications of Next Generation Sequencing Technologies as Applied to Food-Related Microbiome Studies[J]. Frontiers in Microbiology, 2017, 8:1829.
  9. Bashiardes S, Zilberman-Schapira G, Elinav E. Use of Metatranscriptomics in Microbiome Research[J]. Bioinformatics & Biology Insights, 2016, 10(10):19-25.


Bioinformatics Workflow of Whole Exome Sequencing

The advent of next-generation sequencing (NGS), which produces millions to billions of sequence reads at high speed, has greatly accelerated genomics research. Currently available NGS platforms include Illumina, Ion Torrent/Life Technologies, 454/Roche, Pacific Biosciences, Nanopore, and GenapSys. They produce reads of 100-10,000 bp in length, enabling sufficient coverage of the genome at a lower cost. Faced with this enormous amount of sequence data, how do we best handle it, and what are the most appropriate computational methods and analysis tools? This review focuses on the bioinformatics pipeline of whole exome sequencing (WES).

Whole exome sequencing is a genomic technique for sequencing the exome, the protein-coding regions of all genes. It is widely used in basic and applied research, especially in the study of Mendelian diseases. For more background on WES, see the article on the principle and workflow of whole exome sequencing. A typical WES analysis workflow includes the following steps: raw data quality control, preprocessing, sequence alignment, post-alignment processing, variant calling, variant annotation, and variant filtration and prioritization. Each is discussed below.

Figure 1. A general framework of WES data analysis (Bao et al. 2014).

Raw data quality control

Sequence data are generally stored in two standard formats: FASTQ and FASTA. FASTQ files also store Phred-scaled base quality scores, which measure the reliability of each base call, and FASTQ is therefore widely accepted as the standard format for NGS raw data. Multiple tools have been developed to assess the quality of NGS raw data, such as FastQC, FastQ Screen, FASTX-Toolkit, and NGS QC Toolkit.
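As an illustration of how FASTQ stores quality, the sketch below reads four-line records and decodes Phred scores. It assumes the common Phred+33 ASCII encoding and complete, unwrapped four-line records; the function names are hypothetical and not taken from any of the tools listed above.

```python
from typing import Iterable, Iterator


def parse_fastq(lines: Iterable[str]) -> Iterator[tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line
        qual = next(it).rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header}")
        yield header[1:], seq, qual


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


record = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
for read_id, seq, qual in parse_fastq(record):
    print(read_id, seq, phred_scores(qual))
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1,000.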

Read QC parameters:

  1. Base quality score distribution
  2. Sequence quality score distribution
  3. Read length distribution
  4. GC content distribution
  5. Sequence duplication level
  6. PCR amplification artifacts
  7. K-mer bias
  8. Over-represented sequences
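Two of the parameters listed above, GC content and sequence duplication level, are simple enough to compute directly. The sketch below uses simplified definitions (exact-sequence duplicates, per-read GC percentage) that are assumptions for illustration and may differ from how tools such as FastQC define them.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a read."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def duplication_level(reads: list[str]) -> float:
    """Percentage of reads that are exact copies of an earlier read."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    duplicates = sum(n - 1 for n in counts.values())
    return 100.0 * duplicates / len(reads)


reads = ["GGCCAATT", "GGCCAATT", "ATATATAT", "GCGCGCGC"]
print([gc_content(r) for r in reads])
print(duplication_level(reads))
```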

Data preprocessing

With a comprehensive read QC report, which generally covers the parameters above, researchers can determine whether data preprocessing is necessary. Preprocessing steps generally involve 3′ end adapter removal, filtering of low-quality or redundant reads, and trimming of undesired sequences. Several tools can be used for data preprocessing, such as Cutadapt and Trimmomatic. PRINSEQ and QC3 can perform both quality control and preprocessing.
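The two most common preprocessing operations, 3′ adapter removal and 3′ quality trimming, can be sketched as follows. This simplified version requires exact adapter matches and uses a plain per-base quality cutoff, which are assumptions for illustration; dedicated tools such as Cutadapt and Trimmomatic tolerate mismatches and use more refined trimming rules.

```python
def trim_adapter(seq: str, qual: str, adapter: str, min_overlap: int = 3) -> tuple[str, str]:
    """Remove a 3' adapter: a full copy anywhere, or a partial copy at the read end."""
    idx = seq.find(adapter)
    if idx == -1:
        # Look for the longest adapter prefix that the read ends with.
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                idx = len(seq) - k
                break
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Drop trailing bases whose Phred score (Phred+33 assumed) is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


adapter = "AGATCGGAAGAGC"  # example adapter sequence, used here only for illustration
print(trim_adapter("ACGTACGTAGATCGG", "I" * 15, adapter))
print(trim_low_quality_3prime("ACGTAC", "IIII#!"))
```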

Sequence alignment

Algorithms for short read mapping include the Burrows-Wheeler Transform (BWT) and the Smith-Waterman (SW) algorithm. Bowtie2 and BWA are two popular short read aligners that implement the BWT. MOSAIK, SHRiMP2, and Novoalign are important short read aligners that implement the SW algorithm for increased accuracy. In addition, multithreading and MPI implementations significantly reduce runtime. Of the tools mentioned above, Bowtie2 stands out for its fast running time, high sensitivity, and high accuracy.
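To make the BWT idea concrete, the sketch below builds the transform of a small reference and uses backward search to locate exact matches of a read. It is a teaching example with naive suffix sorting and exact matching only, and the function names are hypothetical; it does not reflect how Bowtie2 or BWA are actually implemented, and those tools also handle mismatches and gaps.

```python
def bwt(text: str) -> tuple[str, list[int]]:
    """Return the BWT and suffix array of text + '$' (naive sort, small inputs only)."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa), sa


def build_index(bwt_str: str) -> tuple[dict[str, int], dict[str, list[int]]]:
    """first[c]: number of characters smaller than c; occ[c][i]: count of c in bwt_str[:i]."""
    alphabet = sorted(set(bwt_str))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] * (len(bwt_str) + 1) for c in alphabet}
    for i, ch in enumerate(bwt_str):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ


def find(pattern: str, text: str) -> list[int]:
    """Exact-match positions of pattern in text via backward search."""
    b, sa = bwt(text)
    first, occ = build_index(b)
    lo, hi = 0, len(b)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


print(bwt("BANANA")[0])
print(find("ANA", "BANANA"))
```

The search cost depends on the read length rather than the reference length, which is what makes BWT-based indexes attractive for mapping millions of short reads.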

Post-alignment processing

After read mapping, the aligned reads are post-processed to remove undesired reads or alignments, such as reads exceeding a defined size and PCR duplicates. Tools such as Picard MarkDuplicates and SAMtools can distinguish PCR duplicates from true DNA material. The second step is to improve the quality of gapped alignments via indel realignment. Some aligners (such as Novoalign) and variant callers (such as GATK HaplotypeCaller) include their own indel alignment improvement. After indel realignment, base quality score recalibration (BQSR, performed with BaseRecalibrator from the GATK suite) is recommended to improve the accuracy of base quality scores prior to variant calling.
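The idea behind duplicate marking can be sketched in a few lines: reads that map to the same chromosome, start position, and strand are treated as copies of one original fragment, and only the best-quality copy is kept. This is a simplified illustration under assumed field names (`chrom`, `pos`, `strand`, `qual_sum`). Real tools also account for paired-end mates, soft clipping, and library information.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing (chrom, pos, strand) form one group. Within a group,
    the read with the highest summed base quality is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


# Hypothetical aligned reads, for illustration only.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 300},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 350},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "qual_sum": 200},
]
print(mark_duplicates(reads))
```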

Variant calling

Variant analysis detects different types of genomic variants, such as SNPs, SNVs, indels, CNVs, and larger SVs, and is especially important in cancer studies. It is vital to distinguish somatic from germline variants. Somatic variants are present only in somatic cells and are tissue-specific, while germline variants are inherited mutations present in the germ cells and are linked to the patient’s family history. In exome samples, variant calling is used to identify SNPs and short indels. The common variant calling tools are listed in Table 1. Some studies have evaluated these variant callers. Liu et al. recommended GATK, and Bao et al. recommended a combination of Novoalign and FreeBayes.

Table 1. The common variant calling tools.

Variant calling | Tools
Germline variant calling | GATK, SAMtools, FreeBayes, Atlas2
Somatic variant detection | GATK, SAMtools mpileup, Isaac Variant Caller, deepSNV, Strelka, MutationSeq, MuTect, QuadGT, Seurat, Shimmer, SolSNP, JointSNVMix, SomaticSniper, VarScan2, Virmid
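As a rough intuition for what a variant caller does at a single position, the sketch below counts the bases observed in the reads covering a site and reports a variant when a non-reference base is frequent enough. This is a deliberately naive, threshold-based toy. The callers in Table 1 use probabilistic models of sequencing error, mapping quality, and genotype, and the thresholds and function name here are assumptions.

```python
from collections import Counter


def call_variant(ref_base, pileup_bases, min_depth=10, min_alt_fraction=0.2):
    """Return (alt_base, alt_fraction) if a variant is supported, else None."""
    depth = len(pileup_bases)
    if depth < min_depth:
        return None  # too few reads to make a call
    counts = Counter(base.upper() for base in pileup_bases)
    alt_counts = {b: c for b, c in counts.items() if b != ref_base.upper()}
    if not alt_counts:
        return None  # every read agrees with the reference
    alt_base, alt_count = max(alt_counts.items(), key=lambda item: item[1])
    fraction = alt_count / depth
    if fraction < min_alt_fraction:
        return None  # likely sequencing error rather than a real variant
    return alt_base, fraction


# Ten reads cover a reference "A"; four of them show "G".
print(call_variant("A", "AAAAAAGGGG"))
```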

Variant annotation

After variants are identified, they need to be annotated to better understand disease pathogenesis. Variant annotation generally involves information about genomic coordinates, gene position, and mutation type. Many studies focus on the non-synonymous SNVs and indels in the exome, which account for 85% of known disease-causing mutations in Mendelian disorders and a large share of mutations in complex diseases.

Besides the basic annotation, many databases can provide additional information about the variants. ANNOVAR is a powerful tool that combines over 4,000 public databases for variant annotation, such as dbSNP, 1000 Genomes, and the NCI-60 human tumor cell line panel exome sequencing data. This tool can be used to report minor allele frequency (MAF), predict deleteriousness, indicate conservation of the mutated site, provide experimental evidence for disease variants, and report prediction scores from GERP, PolyPhen, and other programs. Other common databases include OncoMD, OMIM, SNPedia, 1000 Genomes, dbSNP, and personal genome variants.

Variant filtration and prioritization

WES can generate thousands of variant candidates. Variant prioritization reduces this number to a short, prioritized list of candidate mutations for further experimental validation. It involves three steps: 1) removal of less reliable variant calls; 2) depletion of common variants (on the assumption that rare variants are more likely to cause disease); 3) prioritization of variants relative to the disease using discovery-based and hypothesis-based approaches. The available tools for variant filtration and prioritization include VAAST2, VarSifter, KGGseq, PLINK/SEQ, SPRING, GUI tool, gNOME, and Ingenuity Variant Analysis.
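The three prioritization steps can be sketched as a small filter-and-sort routine. The field names (`qual`, `maf`, `gene`), the thresholds, and the candidate gene list are all hypothetical and stand in for whatever an annotation step actually produces. The tools listed above apply far richer evidence than this.

```python
def prioritize(variants, min_qual=30.0, max_maf=0.01, candidate_genes=()):
    """Filter and rank annotated variants in three steps."""
    # Step 1: remove less reliable calls.
    reliable = [v for v in variants if v["qual"] >= min_qual]
    # Step 2: deplete common variants (keep rare ones).
    rare = [v for v in reliable if v["maf"] <= max_maf]
    # Step 3: hypothesis-based ranking, candidate genes first,
    # then by descending call quality.
    genes = set(candidate_genes)
    return sorted(rare, key=lambda v: (v["gene"] not in genes, -v["qual"]))


# Hypothetical annotated variants, for illustration only.
variants = [
    {"id": "v1", "gene": "BRCA2", "qual": 50, "maf": 0.001},
    {"id": "v2", "gene": "TTN", "qual": 99, "maf": 0.0005},
    {"id": "v3", "gene": "BRCA2", "qual": 10, "maf": 0.001},
    {"id": "v4", "gene": "CFTR", "qual": 80, "maf": 0.2},
]
ranked = prioritize(variants, candidate_genes=["BRCA2"])
print([v["id"] for v in ranked])
```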


In the next few years, whole exome sequencing may be adopted as a routine clinical procedure for disease treatment. Many healthcare facilities already provide genetic testing using NGS technologies such as WES. The next challenge will be managing data comprising millions of genomic variants and integrating genomic variants, clinical records, and patient information.


  1. Bao R, Huang L, Andrade J, et al. Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer informatics, 2014, 13: CIN.S13779.
  2. Meena N, Mathur P, Medicherla K M, et al. A bioinformatics pipeline for whole exome sequencing: overview of the processing and steps from raw data to downstream analysis. bioRxiv, 2017: 201145.
  3. Xu H, DiCarlo J, Satya R V, et al. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC genomics, 2014, 15: 244.


Applications of RNA-Seq

What is RNA-Seq?

Regulation of gene expression is fundamental to linking genotypes with phenotypes. RNAs shape complex gene expression networks that drive biological processes. An in-depth understanding of the mechanisms that govern these networks is vital for the treatment of complex diseases such as cancer. Hybridization-based microarrays allow simultaneous monitoring of the expression levels of annotated genes in cell populations. However, genome-wide sequencing approaches have proved to provide more valuable insights into transcriptomes. Next- and third-generation sequencing platforms allow the rapid and cost-effective generation of massive amounts of sequence data. RNA profiling with these high-throughput sequencing technologies is known as RNA-seq.

What are the applications of RNA-Seq?

Since RNA-seq is quantitative, it is useful for determining RNA expression levels. In addition to this basic function, RNA-seq can be used for differential gene expression analysis, variant detection and allele-specific expression, small RNA profiling, characterization of alternative splicing patterns, systems biology, and single-cell RNA-seq.

Figure 1. Overview of the typical RNA-seq analysis pipeline (Han et al. 2015).

  • Differential gene expression

An important application of RNA-seq is the comparison of transcriptomes across different developmental stages, treatments, or disease conditions. This analysis, known as differential gene expression analysis, requires identification of genes along with their isoforms and precise assessment of their expression levels. It helps to characterize functional elements of the genome and uncover the biological mechanisms of development and disease.

The common tools for differential gene expression analysis include Cuffdiff, DESeq, DESeq2, edgeR, PoissonSeq, limma-voom, and MISO.
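At its simplest, a differential expression comparison normalizes read counts for sequencing depth and then compares the two conditions gene by gene. The sketch below uses counts per million (CPM) and a log2 fold change with a pseudocount. It is only an illustration of the idea: the tools listed above fit statistical models across replicates and report significance, which this sketch does not attempt. The gene names and counts are made up.

```python
import math


def cpm(counts):
    """Scale raw read counts to counts per million mapped reads."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Per-gene log2(treated / control) on CPM values.

    Both inputs must contain the same genes. The pseudocount avoids
    division by zero for unexpressed genes.
    """
    c, t = cpm(control), cpm(treated)
    return {
        gene: math.log2((t[gene] + pseudocount) / (c[gene] + pseudocount))
        for gene in control
    }


# Hypothetical raw counts for two genes in two conditions.
control = {"geneA": 100, "geneB": 900}
treated = {"geneA": 400, "geneB": 600}
for gene, lfc in log2_fold_change(control, treated).items():
    print(f"{gene}: log2FC = {lfc:+.2f}")
```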

  • Variants detection and allele-specific expression

RNA-seq allows identification of variants and allele-specific expression. Single-nucleotide polymorphisms (SNPs) are variations in a single nucleotide at a specific position in the genome, and they may lead to allele-specific expression (ASE). ASE means that one of the two alleles is highly transcribed into mRNA while the other is transcribed at a low level or not at all. Recent studies have also associated ASE with susceptibility to a number of human diseases. RNA-seq and whole-genome DNA sequencing (WGS) allow identification of common disease variants, including SNPs and ASE.

The common tools used for variant detection are GATK, ANNOVAR, SNPiR, and SNiPlay3.
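One simple way to test for allele-specific expression at a heterozygous SNP is to ask whether the read counts for the two alleles deviate from the 50:50 ratio expected under balanced expression. The sketch below uses an exact two-sided binomial test written with the standard library. It is an assumed, minimal approach. Real ASE analyses also correct for reference mapping bias and overdispersion, which are ignored here.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that ties are counted despite rounding.
    total = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical read counts at one heterozygous SNP.
ref_reads, alt_reads = 18, 2
p_value = binomial_two_sided_p(alt_reads, ref_reads + alt_reads)
print(f"allelic ratio = {ref_reads / (ref_reads + alt_reads):.2f}, p = {p_value:.2e}")
```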

  • Small RNA profiling

Small RNA species generally involve microRNA (miRNA), small interfering RNA (siRNA), and piwi-interacting RNA (piRNA), as well as other types of small RNA, such as small nucleolar RNA (snoRNA) and small nuclear RNA (snRNA). Small RNAs play a role in gene silencing and post-transcriptional regulation of gene expression. Small RNAs have been demonstrated to be involved in biological processes, including development, cell proliferation and differentiation, and apoptosis. Most initial small RNA discovery studies used pyrosequencing, and subsequently, other NGS platforms with higher throughput, which resulted in genome-wide surveys and the discovery of an increasing number of small RNA species. Common bioinformatic tools for small RNA sequencing data are shown in Table 1.

Table 1. sRNA-seq web application comparison (Rahman et al. 2018).

Tools compared: Oasis 2, omiRas, mirTools 2.0, MAGI, Chimira, and sRNAtoolbox.

Features compared: FASTQ compression; miRNA modifications and edits; novel miRNA database; infection and cross-species analysis; non-model organism; differential expression; multivariate differential expression; novel miRNA target prediction; pathway/GO analysis; batch job submission (API); genome browser.

  • Characterization of alternative splicing patterns

Alternative splicing patterns are important for understanding development and human disease, since altered splicing contributes to development, cell differentiation, and disease. RNA-seq is a powerful tool for characterizing alternative splicing patterns. Paired-end sequencing provides sequence information from both ends of each fragment, which allows splicing patterns to be detected without prior knowledge of transcript annotations. PacBio SMRT sequencing allows examination of splicing patterns and transcript connectivity in an unbiased and genome-scale manner by generating full-length transcript sequences.

The common tools for characterization of alternative splicing patterns include TopHat, MapSplice, SpliceMap, SplitSeek, GEM mapper, SpliceR, SplicingCompass, GIMMPS, MATS, and rMATS.

Figure 2. RNA-seq for detection of alternative splicing events (Ozsolak and Milos 2011).

  • Systems biology

Creating lists of differentially expressed (DE) genes is not the final step of RNA-seq analysis. Further biological insight into an experimental system can be gained by looking at the expression changes of sets of genes. This process, known as systems biology, is based on the understanding that the whole is greater than the sum of its parts. Pathway analysis and co-expression network analysis are two important components.

Table 2. The tools for pathway analysis and co-expression network analysis using RNA-seq data.

Category | Tool | Description
Pathway analysis | GSEA | A knowledge-based approach for interpreting genome-wide expression profiles.
Pathway analysis | GSVA | A non-parametric, unsupervised method for estimating variation of gene set enrichment across the samples of an expression data set.
Pathway analysis | SeqGSEA | Provides methods for gene set enrichment analysis by integrating differential expression and splicing.
Pathway analysis | GAGE | Generally Applicable Gene-set Enrichment, a method for gene set and pathway analysis.
Pathway analysis | SPIA | Identifies the pathways most relevant to the condition.
Pathway analysis | TAPPA | A Java-based tool for identification of phenotype-associated genetic pathways.
Pathway analysis | DEAP | Identifies important regulatory patterns from differential expression data.
Pathway analysis | GSAASeqSP | Identifies pathways or gene sets significantly associated with a disease or phenotype.
Co-expression network | GSCA | Helps researchers make discoveries by using massive amounts of publicly available gene expression data.
Co-expression network | DICER | Detects differentially co-expressed gene sets by using a novel probabilistic score for differential correlation.
Co-expression network | WGCNA | A powerful method to isolate co-expressed groups of genes from microarray or RNA-seq data.
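A basic form of pathway analysis asks whether a pathway contains more DE genes than expected by chance. The sketch below is an over-representation test based on the hypergeometric distribution. Note that this is a simpler technique than the rank-based enrichment methods used by tools such as GSEA, and it is shown only to illustrate the principle. The function name and the example numbers are assumptions.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_de, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_background: genes measured in the experiment
    n_pathway:    background genes that belong to the pathway
    n_de:         differentially expressed genes
    n_overlap:    DE genes that belong to the pathway
    """
    total = comb(n_background, n_de)
    upper = min(n_pathway, n_de)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_de - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical experiment: 20000 genes, a 100-gene pathway,
# 500 DE genes, 12 of which fall in the pathway.
print(f"p = {enrichment_p_value(20000, 100, 500, 12):.3g}")
```

By chance alone, about 2.5 of the 500 DE genes would be expected in a 100-gene pathway here (500 × 100 / 20000), so an overlap of 12 suggests enrichment.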

  • Single-cell RNA-seq

Single-cell RNA-seq offers opportunities to dissect the interplay between intrinsic cellular processes and extrinsic stimuli in cell fate determination. It also contributes to a better understanding of how an ‘outlier cell’ may determine the outcome of an infection. In addition, because a majority of living cells cannot be cultivated in vitro, single-cell RNA-seq may discover novel species or regulatory processes of biotechnological or medical relevance. The workflow of single-cell RNA-seq generally involves the following steps: single-cell isolation, cDNA library construction, RNA-seq, and bioinformatics analysis (Figure 3).

Figure 3. The general workflow of single-cell RNA-seq.

Applications of single-cell RNA-seq

  • Stem cell differentiation
  • Embryogenesis
  • Whole-tissue analysis
  • Single-cell RNA-seq for whole-organism studies
  • Disease biology and treatment


  1. Ozsolak F, Milos P M. RNA sequencing: advances, challenges and opportunities. Nature reviews genetics, 2011, 12(2): 87.
  2. Rahman R U, Gautam A, Bethune J, et al. Oasis 2: improved online analysis of small RNA-seq data. BMC bioinformatics, 2018, 19(1): 54.
  3. Han Y, Gao S, Muegge K, et al. Advanced applications of RNA sequencing and challenges. Bioinformatics and biology insights, 2015, 9: BBI.S28991.
  4. Saliba A E, Westermann A J, Gorski S A, et al. Single-cell RNA-seq: advances and future challenges. Nucleic acids research, 2014, 42(14): 8845-8860.

Why Whole Genome Sequencing (WGS) Is Still Not Broadly Used for Individuals


In recent years, with the further development of high-throughput sequencing technology, the cost of sequencing has continued to decrease, and whole-exome sequencing (WES) has been increasingly applied to genetic disease detection, which has improved the diagnosis rate of diseases.

The Question

However, this raises a question: is whole-genome sequencing (WGS) currently suitable for clinical application? It is likely that whole-genome sequencing will subsume genetic testing for individual genes or even panels of genes, replacing individual genotyping assays with a comprehensive assessment of genetic variation.

  1. Doctors are too overburdened to analyze and explain so many variants of uncertain significance (VUS), and laboratory data analysis is disconnected from clinical practice. Whether undiagnosed cases will be reanalyzed and clinical phenotypes re-collected has not yet been determined.
  2. The cost of whole genome sequencing is high, and little of the resulting information can currently be interpreted. WGS is still at the scientific research stage, and it is too early for clinical application.
  3. For clinical applications, diagnosis should prioritize accuracy, turnaround time, and cost. Scientific research must use research funding!
  4. At present, the cost of WGS is still high, and sequencing, analysis, and interpretation are too time-consuming. The information useful to patients is similar to that obtained from exome sequencing.
  5. For single-gene diseases, the combination of WES with aCGH/SNP-array/CMA already meets most requirements. Compared with WES, WGS does have wider coverage, but it detects too many variants, such as deep variants in non-coding regions and a large number of small deletions/duplications ranging from hundreds of bp to the kb level. These variants are difficult to interpret. WGS is not fundamentally different from WES. The most important thing at present is not to expand the genomic range of detection, but to expand the types of variants that can be accurately detected, such as polynucleotide repeat expansions. Rather than using NGS-based WGS clinically, it is better to wait for third-generation sequencing to mature.
  6. Currently, and at least for the near future, I personally think that WGS is not suitable for clinical applications. Reason 1: cost. The cost of a single WGS is roughly equal to the current cost of testing a family trio, but the positive rate has not increased significantly (data show 40% for WES and 42% for WGS), while the cost of analysis has increased significantly. Reason 2: the lack of an available reference database. Even if more deep intronic sites are detected, there is no way to judge their pathogenicity. Although WGS is superior to WES in detecting CNVs and SVs, low-cost detection methods are available as alternatives.

A Comparison Study of Whole Genome Sequencing (WGS) in Clinical Setting


In recent years, with the further development of high-throughput sequencing technology, the cost of sequencing has continued to decrease, and whole-exome sequencing (WES) has been increasingly applied to genetic disease detection, which has improved the diagnosis rate of diseases. However, this raises a question: is whole-genome sequencing (WGS) currently suitable for clinical application?

The study

On March 22nd, Genet Med. published an article online (PMID: 29565419) entitled Whole-genome sequencing offers additional but limited clinical utility compared with reanalysis of whole-exome sequencing.

Few previous studies have compared WGS and WES in terms of detection rate for genetic diseases. After screening, a total of 108 patients were enrolled for WGS analysis. Their gene chip and WES tests had both been negative, and their clinical data and previous raw sequencing data had been fully preserved. WGS yielded positive results in 10 cases (9%), uncertain results in 5 cases, and negative results in 93 cases.

The authors analyzed the reasons for the 10 positive WGS results, which fall into three categories:

(1) The state of knowledge at the time of testing: although WES also detected the mutation sites in the 1st, 2nd, and 3rd cases, they were not reported as pathogenic, mainly because the correlation between the pathogenic gene and the clinical phenotype had not yet been established at the time of detection;

(2) The influence of structural variation and non-coding region variation: this applies to the 4th, 5th, and 6th cases;

(3) The impact of the sequencing platform: the 7th, 8th, 9th, and 10th cases belong to this category. The mutation sites were detected by WES on the Illumina platform.

In summary, of the 10 cases that were previously negative by WES, 7 were detectable by both WES reanalysis and WGS, and 3 were detected by WGS because they involved structural variation or non-coding region variation.

Why Whole Genome Sequencing (WGS) Is Important for Clinical Applications?

  1. Whole genome sequencing (WGS) has a broad spectrum of clinical applications, especially for unexplained clinical conditions such as children with developmental delay and intellectual disability. If chromosomal microarray analysis (CMA), next-generation sequencing (NGS), and whole exome sequencing (WES) are unable to provide a diagnosis, WGS could be another option.
  2. Because WGS provides uniform coverage, 30X coverage is generally considered sufficient. Because it does not depend on capture reagents, WGS is also easier to standardize in the wet lab, which saves some cost.

As for price, market competition in WGS is fierce, which helps reduce cost. So, I think it is very likely that WGS will become mainstream in the near future.

Another benefit of WGS is its uniform coverage of mtDNA. Theoretically, it could solve the difficulty of detecting large CNVs and heteroplasmy (partial heterogeneity) in mtDNA.

  3. Although WGS is not yet suitable for routine clinical application, trials could tentatively begin in some “pilot” units.

Compared with WES, WGS can find non-coding/intronic variants, CNV/SV, skip the need for capture, etc. The difficulty lies in the cost of interpretation and sequencing. As the cost of sequencing decreases, the superiority of WGS will become more apparent. Therefore, the application of WGS in the clinic is only a matter of time.

However, the best practice for WGS is still a question for colleagues and experts to study and explore together.
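The 30X coverage figure mentioned above can be related to the amount of sequencing required with the usual approximation: mean coverage = number of reads × read length / genome size. The sketch below applies it, assuming a human genome size of about 3.1 Gb and 150 bp reads. Both values are illustrative assumptions rather than figures from this article.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest whole number of reads that reaches the target mean coverage."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # ceiling division on integers


GENOME_SIZE = 3_100_000_000  # assumed approximate human genome size in bp
READ_LENGTH = 150            # assumed read length in bp

n = reads_needed(30, READ_LENGTH, GENOME_SIZE)
print(f"{n:,} reads give {mean_coverage(n, READ_LENGTH, GENOME_SIZE):.1f}X")
```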

Outsourcing Plastic Molding And Mold Making In China, Trust But Verify

Nearly every single plastic molding company in the US and Europe has or is considering sending work to China, no surprise here. The incentives are very real, as are the pressures. Not only are the financial matters pressing, but some customers actually demand a China presence.

Considering the fact that China has become the world’s second largest economy, passing Germany and Japan, the potential for growth is huge, to put it mildly.

Most people recall the very poor quality of Chinese products just a few years ago. Some products are still of very low quality and it seems that you actually get what you pay for in many cases.

On the other hand, the concept of built-in quality seems to be sinking into the national mentality, albeit very slowly. Some areas, such as Hong Kong, have a much better tradition of adopting European quality standards.

When Ronald Reagan was president, he was deeply involved with the arms race with the Soviet Union. One of his favorite phrases was a translation of a Russian proverb: “Trust but verify.” This became his mantra when dealing with Mikhail Gorbachev concerning the INF treaty.

This would be a good mantra for anyone doing plastic molding in China: “Trust but verify.” It seems that the mold makers and molders, and maybe others as well, have a tendency to do what you pay for when you are present, and then cut corners when you are not present.

Without attempting to sound condescending or judgmental, this just is the case. Of course there are countless exceptions, nevertheless, it is still advisable to trust but verify.

A real-life case in point is the fact that American companies usually insist on brand name mold components in their injection molds. Nobody wants a low-grade, soft ejector pin in their mold, for example. So, most people insist on PCS, DME or Progressive ejector pins.

Oddly, after a few thousand shots, the pins bend, break, pit and flake. Yet the pin has PCS etched right into the steel, so how could this be? Simple enough, it was made in a little shop that makes one pin for every company known and just etches whatever name is required. They don’t care if the steel is not H13, just so it works for a while and they make their money.

Anyone who has traveled in developing countries knows about this sort of thing. It happens all the time with just about anything that can be copied or pirated. I once bought a Disney movie before it was in the theaters! You can buy passports, driver’s licenses, birth certificates and anything else you want.

Once you build a working relationship with a Chinese supplier, you would think that you are set and no longer need to verify. Wrong. If that were the case, every mold that came in would be right, made using proper techniques, and have documented sizes and materials.

That just is not the case, unfortunately, but it doesn’t seem to make much difference to the accounting department in some companies. The mold is so inexpensive that you can just re-work it and still make money. Don’t ask the mold maker about this though.


Mechanical Design of Biomedical Products Using Plastics

Biomedical products typically have physical requirements that differ in some respects from other products. Those requirements usually center on the need for materials and configurations that are compatible with the human body. Not only are such products regulated by FDA requirements, but they must also be able to withstand multiple sterilization cycles involving high temperatures or the use of solvents, or both.

To design parts for the biomedical industry, it is necessary to understand the properties of biomedically safe materials and the constraints on processing those materials to produce sound and economical parts. Not all injection molding factories have both the capability and the experience to mold these materials. As an example, parts have been designed and molded both domestically and abroad using Lexan HP2NR and Lexan HPX4. Both are FDA-approved, biocompatibility-tested (FDA USP Class VI/ISO 10993) plastics.

Lexan HP2NR is a clear polycarbonate plastic that is autoclavable at 121 °C for a handful of cycles. As an example, this material is used in a lens for a skin care treatment product. The molding resource has been able to mold this material at almost defect-free levels over the past 2 years. Lexan HPX4 is a siloxane copolymer. It performs better in the autoclave at 121 °C (a few dozen cycles, again depending on in-mold stress, the morpholine level in the autoclave, etc.). It has a slight haze in its natural state. An example of a biomedical application of this material is a part for an oral device used by sleep apnea patients, which is colored with an FDA-approved dye to a gray Pantone 430C when molded. After molding, the parts go through a thermal press process that creates 300+ features necessary for retention of the epoxy applied by the user. The parts are thoroughly cleaned in an isopropyl alcohol solution, heat dried, and then bagged and boxed for shipment.

In addition to understanding the issues relating to the materials employed in designing and producing biomedical products it is also necessary to have a good grasp on ergonomic principles and the ability to apply those principles in design. Ergonomics is defined as the study of designing equipment and devices that fit the human body, its movements, and its cognitive abilities. It is always good to consider ergonomics in product design, but in the biomedical arena it is usually critical to the success of the product.

In summary, a successful biomedical product development should be characterized by carefully considered selection of materials and the capability to properly process those materials. Additionally, biomedical product development should also consider a strong dedication to ergonomic principles.


A Review of Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life


Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life was first published in Cell Host & Microbe in 2015. Authors include Fredrik Bäckhed and Jovanna Dahlgren.

Experiment Design

Sample: intestinal microbes of 98 mothers and newborn babies (mostly Swedish)

Sequencing strategy: using metagenomic sequencing, a total of 1.52Tb of data, an average of 3.99Gb/sample

Analysis Procedures

  1. Based on the metagenomic data, a gene catalog was established at each time point by de novo assembly, and the KEGG database was used to generate gene functional annotations.
  2. Contigs were binned according to their abundance across samples, and 4356 genomes (>0.9 Mb) were obtained by co-assembly. These assembled genomes were supplemented with 1147 genomes from NCBI.
  3. All genomes were subsequently clustered to obtain 690 unique metagenomic OTUs (MetaOTUs), which are roughly equivalent to species-level classification.

Analysis Content

The phyla Firmicutes and Bacteroidetes were the most abundant among all detected microorganisms, followed by Actinobacteria and Proteobacteria. According to the species annotations of the metagenomic data, a total of 373 MetaOTUs were annotated to known species, and the remaining 317 represented new species related to known species. Most of the MetaOTUs obtained from newborns were also found in mothers, and their abundance gradually increased. In Figure 1, the red area marks novel MetaOTUs, the outer circle shows annotation at the phylum level, the inner circle shows annotation at the genus level, and the middle circle represents the abundance of each MetaOTU in the different samples.

Figure 1. MetaOTUs phylogenetic tree

PCoA based on unweighted UniFrac distances showed that the samples clustered according to age. The 12-month-old infants were most similar to the mothers, because their intestinal microflora structure had stabilized.

With increasing age, the alpha diversity of the infant intestinal flora gradually increased, while the beta diversity gradually decreased, indicating that the microbial community became more complex and the differences between communities became smaller.
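To illustrate the two notions of diversity, the sketch below computes the Shannon index (a common alpha diversity measure, describing complexity within one sample) and the Bray-Curtis dissimilarity (a simple beta diversity measure, describing the difference between two samples). These measures are chosen here for illustration and are not necessarily the ones used in the study, which reports UniFrac distances for its PCoA. The abundance vectors are made up.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 to 1)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Hypothetical taxon abundances (same taxon order in every sample).
newborn = [90, 10, 0, 0]
infant_12m = [30, 30, 20, 20]
mother = [25, 30, 25, 20]

print(f"alpha: newborn {shannon_index(newborn):.2f}, "
      f"12 months {shannon_index(infant_12m):.2f}")
print(f"beta vs mother: newborn {bray_curtis(newborn, mother):.2f}, "
      f"12 months {bray_curtis(infant_12m, mother):.2f}")
```

In this toy data the 12-month sample has a higher Shannon index than the newborn sample and a smaller dissimilarity to the mother, which mirrors the trend described above.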

Next, the authors compared the gut microbiota structure of neonates born by C-section with that of neonates born vaginally. The result was consistent with the PCoA results: as age increases, the bacterial composition approaches that of the mothers. However, because C-section infants do not pass through the maternal birth canal, they acquire few maternal microorganisms at birth. Compared with vaginally born infants, their intestinal microorganisms become established more slowly, and some of the flora is missing.

Figure 2. A comparison of the gut microbiota structure of neonates born by C-section and vaginally

The metagenomic analysis also reveals how energy utilization by the infant intestinal flora changes over time. The functional capacity of the fecal flora increases during the first year after birth, and phosphotransferase system (PTS) genes related to carbohydrate uptake are abundant in the neonatal intestinal flora.

The gut flora of newborns and 4-month-old infants is enriched in genes that digest the sugars in breast milk, which are the main source of energy at that stage. The β-glucose-specific transporter is most abundant in infants at 4 months and 12 months of age. The intestinal flora of 12-month-old infants is enriched in genes that break down polysaccharides and starch, which is associated with an increase in Bacteroides thetaiotaomicron, a species that has all the enzymes involved in polysaccharide digestion.

Figure 3. KO pathway

Bacteria in the gut of vaginally born newborns include Enterococcus, Escherichia/Shigella, Streptococcus, and Rothia, indicating a relatively oxygen-rich intestinal environment. The gut flora at 4 months is characterized by Bifidobacterium, Lactobacillus, Collinsella, Granulicatella, and Veillonella, indicating a gradual decrease in intestinal oxygen concentration and an increase in the ability to produce and utilize lactic acid. The diet at this time is mainly breast milk.

The gut flora at 12 months includes the bacteria found in newborns and 4-month-old infants (as listed above), as well as bacteria present only at 12 months, such as the genus Eichhornia.

Figure 4. Characteristics of intestinal flora in different periods of caesarean section


As an important research tool, metagenomics can yield a great deal of high-value information in studies of microbial populations. It is of great significance for further research on microbe-related metabolism and immunity.
