A simple introduction to de novo sequencing
In de novo sequencing, the sequencing reads are spliced and assembled through bioinformatics analysis to obtain a genome sequence map of the species. With this method, a species can be sequenced without any prior genetic information. At present, de novo sequencing services are widely used to analyze the genome sequence, genetic composition and evolutionary characteristics of unknown species from scratch.
Genome de novo sequencing refers to sequencing the genome of a species whose genome sequence is unknown and for which no closely related reference species is available. Libraries of DNA fragments of different lengths are constructed and sequenced, and biologists then use bioinformatics methods to splice, assemble and annotate the reads, finally obtaining a complete genome sequence map.
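The splicing and assembly step described above can be illustrated with a toy example. The following is a minimal sketch, assuming error-free reads and a simple greedy strategy that repeatedly merges the two reads with the longest suffix-prefix overlap. The function names and the min_len parameter are hypothetical, and real assemblers use far more elaborate methods to cope with sequencing errors and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of a that is a prefix of b."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge reads by their longest overlaps (toy illustration only)."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps of at least min_len
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]
    print(greedy_assemble(reads))  # ['ATGGCGTACGGAT']
```

In this sketch, three overlapping reads are merged into a single contig without consulting any reference sequence, which is the core idea of assembling a genome from scratch.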
The following picture shows a protein sequence.
Differences between de novo sequencing and re-sequencing
Re-sequencing refers to sequencing the genomes of different individuals of a species, or of different tissues of one individual, when a reference genome for that species is already known. Scientists can thus find the differences between individuals or tissue cells at the genome level. In this way, a large number of single nucleotide polymorphisms (SNPs), insertions and deletions (InDels), structural variations (SVs) and copy number variations (CNVs) can be detected, revealing the genetic characteristics of biological populations.
For example, a virus has a high mutation rate, so it can be sequenced with a de novo sequencing service instead of re-sequencing.
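To make the contrast concrete, the comparison against a known reference that re-sequencing relies on can be sketched as follows. This is a simplified illustration, assuming the sample sequence has already been aligned to the reference and has the same length, so only single-base substitutions are reported. The function name find_snps is hypothetical, and detecting InDels, SVs or CNVs requires real alignment and variant-calling tools.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch assumes pre-aligned sequences of equal length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


if __name__ == "__main__":
    reference = "ATGGCGTACG"
    sample = "ATGACGTACC"
    print(find_snps(reference, sample))  # [(4, 'G', 'A'), (10, 'G', 'C')]
```

This kind of comparison is only possible when a reference exists. When no reference is available, the reads have to be assembled from scratch instead.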
Method of de novo sequencing
In recent years, high-throughput proteomics based on mass spectrometry has developed rapidly, and identifying proteins from tandem mass spectra is a basic and important step in its data processing. De novo sequencing by mass spectrometry can be applied to both protein de novo sequencing and peptide de novo sequencing.
Two analysis methods are commonly used to identify proteins by mass spectrometry: database search and de novo sequencing. Database search matches the experimental mass spectra against theoretical fragmentation spectra generated from protein sequences in a database. This is the main method of protein identification, and it depends strongly on protein sequence databases. De novo sequencing, in contrast, is not affected by erroneous information in a protein sequence database, so biologists can still analyze tandem mass spectrometry data when the database is incomplete. The technique interprets tandem mass spectrometry data directly, without using any protein sequence database information. Unlike database search, de novo sequencing can analyze tandem mass spectra from new species or from any species whose genome has not been sequenced before, so database search cannot replace it.
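The idea of reading a sequence directly from a spectrum can be sketched in a few lines. The example below is a minimal illustration, assuming an idealized, complete ladder of singly charged b-ion peaks with no noise: the mass difference between consecutive peaks is matched to a table of monoisotopic amino acid residue masses. The function names and the tolerance value are hypothetical, and practical de novo sequencing algorithms must also handle missing peaks, noise and several ion types.

```python
PROTON = 1.007276  # mass of a proton in Da

# Monoisotopic residue masses in Da. Isoleucine has the same mass as
# leucine, so "L" stands for either residue in this sketch.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def match_residue(delta: float, tol: float = 0.02) -> str:
    """Return the residue whose mass is closest to delta, or '?' if none fits."""
    best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - delta))
    return best if abs(RESIDUE_MASS[best] - delta) <= tol else "?"


def sequence_from_b_ladder(peaks: list[float], tol: float = 0.02) -> str:
    """Read a peptide sequence from the gaps between consecutive ladder peaks."""
    ladder = sorted(peaks)
    return "".join(match_residue(hi - lo, tol) for lo, hi in zip(ladder, ladder[1:]))


def b_ladder(peptide: str) -> list[float]:
    """Build an idealized b-ion ladder (starting at the bare proton) for testing."""
    masses = [PROTON]
    for aa in peptide:
        masses.append(masses[-1] + RESIDUE_MASS[aa])
    return masses


if __name__ == "__main__":
    print(sequence_from_b_ladder(b_ladder("GASP")))  # GASP
```

No sequence database is consulted at any point in this sketch. The only reference data are the residue masses, which is why the approach also works for species that have not been sequenced before.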
Figure: A schematic diagram of protein de novo sequencing
De novo sequencing plays an indispensable role in data analysis. Because whole-genome sequencing and protein sequence databases are still incomplete, analyzing unknown protein sequences in this way is often the better choice. However, the method has not been widely adopted because it demands high data quality and a large amount of computation. Other methods and research ideas have emerged in recent years, but none has yet achieved ideal results. All in all, de novo sequencing still has a long way to go.