Viruses are the simplest organisms. A complete virus particle consists of a coat protein and the genomic DNA or RNA it encloses; in some viruses the coat is surrounded by an envelope derived from host cell membrane that contains glycoproteins encoded by viral genes. A virus cannot replicate independently: it must enter a host cell and use the cell's enzymes and organelles to replicate. The function of the coat protein (or envelope) is to recognize and invade specific host cells and to protect the viral genome from destruction by nucleases.
- Viral genomes vary greatly in size. Compared with those of bacteria or eukaryotic cells, viral genomes are very small, but they also differ widely from one virus to another. For example, hepatitis B virus DNA is only about 3 kb, carries little information, and encodes only 4 proteins. The poxvirus genome is about 300 kb and can encode hundreds of proteins, including not only the enzymes involved in viral replication but even enzymes for nucleotide metabolism, so poxvirus is much less dependent on the host than hepatitis B virus.
- The viral genome may consist of DNA or RNA. Each virus particle contains only one type of nucleic acid, either DNA or RNA; the two generally do not coexist in the same particle. The DNA or RNA constituting the viral genome may be single-stranded or double-stranded, and may be a closed circular molecule or a linear molecule. For example, papillomavirus has a closed circular double-stranded DNA genome, adenovirus has a linear double-stranded DNA genome, poliovirus is a single-stranded RNA virus, and the reovirus genome is double-stranded RNA. In general, most DNA viruses have double-stranded DNA genomes, while most RNA viruses have single-stranded RNA genomes.
- The genomes of most RNA viruses consist of a single continuous RNA strand, but some are made up of several separate RNA segments. The genomic RNA of influenza virus is segmented and consists of eight RNA molecules, each carrying the information to encode a protein; the reovirus genome is segmented double-stranded RNA with a total of 10 double-stranded RNA segments, each of which also encodes a protein. At present, no viral genome composed of segmented DNA molecules has been found.
- Gene overlap means that the same DNA segment can encode two or even three protein molecules. Outside viruses, this phenomenon has been found only in mitochondrial and plasmid DNA, so it can be regarded as a structural feature of viral genomes. It allows a small genome to carry more genetic information. Overlapping genes were discovered by Sanger in 1977 while studying ΦX174, a single-stranded DNA virus whose host is Escherichia coli, which makes it a phage. After infecting E. coli it synthesizes 11 protein molecules with a total molecular weight of about 250,000, equivalent to the information contained in 6078 nucleotides. The viral DNA itself has only 5375 nucleotides, enough to encode proteins with a total molecular weight of about 200,000. Sanger could not resolve this contradiction for a long time, until it became clear that some of the 11 genes of ΦX174 overlap. Overlapping genes occur in several forms:
(1) One gene is completely inside another gene. For example, genes A and B are two different genes, and B is contained in gene A. Similarly, gene E is within gene D.
(2) Partial overlap. For example, gene K overlaps with a part of genes A and C.
(3) The two genes overlap by only one base. For example, the last base of the stop codon of gene D is the first base of the start codon of gene J (as in TAATG).
Although overlapping genes share most of their DNA sequence, the proteins produced are usually different because the mRNA is translated in a different reading frame. Some overlapping genes use the same reading frame but different start sites. For example, in the SV40 genome there is a 122-base overlap among the genes for the three coat proteins VP1, VP2 and VP3, but their codons are not identical. The small t antigen gene, by contrast, lies completely within the large T antigen gene, and the two share a common start codon.
- Most of a viral genome encodes protein; only a very small part is untranslated, in contrast to the redundancy of eukaryotic DNA. For example, the untranslated part of ΦX174 accounts for only 217 of 5375 nucleotides, and that of G4 DNA for 282 of 5577, about 5% or less in each case. Untranslated DNA sequences are usually control sequences for gene expression. For example, the sequence between the H gene and the A gene of ΦX174 (3906-3973), 67 bases in total, contains control elements such as an RNA polymerase binding site, a transcription termination signal, and a ribosome binding site. Papillomaviruses are a group of viruses that infect humans and animals; their genome is about 8.0 kb, of which about 1.0 kb is untranslated, and this region likewise regulates the expression of the other genes.
- Genes for functionally related proteins (or for rRNA) in viral genomes tend to cluster at one or a few specific sites, forming a functional unit or transcription unit. They can be transcribed together into a single molecule carrying several coding regions, called polycistronic mRNA, which is then processed into the template mRNAs for the individual proteins. For example, the adenovirus late genes encode 12 coat proteins of the virus. The late genes are transcribed from one promoter into a polycistronic mRNA, which is then processed into the individual mRNAs encoding the various coat proteins, all of which are functionally related. Likewise, the D, E, J, F, G and H genes of the ΦX174 genome are transcribed into the same mRNA and then translated into the individual proteins: J, F, G and H encode coat proteins, the D protein is involved in virus assembly, and the E protein is responsible for lysis of the bacterium, so these genes are also functionally related.
- Except for retroviruses, all viral genomes are haploid: each gene appears only once in the virus particle. Retrovirus particles carry two copies of the genome.
- Phage (bacterial virus) genes are continuous, whereas the genes of eukaryotic viruses are discontinuous and contain introns. Except for positive-strand RNA viruses, the genes of eukaryotic viruses are first transcribed into mRNA precursors, which are processed to remove the introns and yield mature mRNA. More interestingly, in some eukaryotic viruses an intron, or part of one, is an intron for one gene but an exon for another. This is the case for the early genes of SV40 and polyomavirus. The early genes of SV40, the large T and small t antigen genes, both start at position 5146 and run counterclockwise; the large T antigen gene ends at 2676 and the small t antigen gene at 4624. The 346 bp fragment from 4900 to 4555 is an intron of the large T antigen gene, yet the sequence from 4900 to 4624 within this intron codes for the small t antigen. Similarly, in polyomavirus, the intron of the large T antigen gene contains coding sequence for the middle T and small t antigens.
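The single-base overlap described above (a stop codon and a start codon sharing one base, as in TAATG) can be illustrated with a minimal Python sketch. The 17-base sequence `TOY` is a hypothetical toy example, not a fragment of ΦX174, and `translate` is an illustrative helper written for this sketch; only the standard genetic code table is taken as given.

```python
# Standard genetic code, packed in TCAG order for the DNA alphabet.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str, start: int) -> str:
    """Translate seq from index `start`, stopping at the first stop codon."""
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        residue = CODON_TABLE[seq[pos:pos + 3]]
        if residue == "*":  # stop codon reached
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical toy sequence: a first gene ends with the stop codon TAA,
# and the final A of that stop codon is the A of the next gene's ATG.
TOY = "ATGGCTTAATGAAATGA"

first_protein = translate(TOY, 0)   # read from the first ATG
second_protein = translate(TOY, 8)  # read from the ATG inside "TAATG"

print(first_protein, second_protein)
print("frames:", 0 % 3, 8 % 3)      # the two genes use different frames
```

In this toy case the two reads share exactly one base (index 8) and fall in different reading frames, so the same stretch of DNA yields two unrelated peptides.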
Bovine papillomavirus genome structure and function
Papillomaviruses are DNA viruses that infect the skin and mucous membranes of humans and animals and cause papilloma lesions. They belong to the papovavirus family and are divided by host into bovine papillomavirus (BPV), human papillomavirus (HPV), and so on. The papillomavirus genomes discovered so far have similar structures, so BPV is used here as an example to illustrate their structure and function. BPV DNA is 7945 bp long and forms a closed circular supercoiled molecule that can bind histones to form nucleosomes in host cells. The first base, G, of the single HpaI restriction site in BPV DNA is defined as position 1, and bases are numbered in the 5′→3′ direction. DNA sequence analysis shows that all open reading frames (ORFs) lie on one DNA strand and that the genes overlap one another. The BPV genome is divided into a coding region and a non-coding region (NCR), and the coding region is further divided, according to the function of the encoded proteins, into an early transcriptional region (E region) and a late transcriptional region (L region).
- Non-coding region (NCR): The non-coding region, also known as the upstream regulatory region (URR) or the long control region (LCR), lies between the stop codon of the late gene L1 and the first start codon of the early gene E6. Its length differs among papillomaviruses; in BPV it is about 1.0 kb. The NCR contains promoter sequences that initiate transcription and expression of the early genes. It also contains enhancer sequences that can be activated by the early gene product E2 protein, further promoting early gene expression. The enhancer sequences of the BPV NCR have been identified as TTGGCGGNNG and the palindromic sequence ATCGGTGCACCGAT. These structural features show that the main function of the NCR is to regulate BPV gene expression.
- Early transcriptional region (early gene region, E region): The E region of BPV contains eight open reading frames (ORFs): E6, E7, E8, E1, E2, E3, E4 and E5. The E6, E7 and E1 genes partially overlap, E8 lies completely within E1, E3 and E4 are both contained in E2, and E5 partially overlaps E2. The protein encoded by the E2 ORF can bind the enhancer in the NCR to raise or lower the expression of the early genes.
In addition, the E2 ORF product acts together with the E1 ORF product to keep papillomavirus DNA in a free (episomal) state, not integrated into the host cell chromosome. The proteins encoded by the E6 and E7 ORFs may be oncogenic: they can transform host cells into malignant tumor cells. The mechanism by which E6 and E7 induce transformation is not yet clear, but two explanations have been proposed. (1) The amino acid sequences of E6 and E7 contain Cys-x-x-Cys repeats, a motif considered characteristic of intracellular nucleic acid binding proteins, so E6 and E7 are thought to be DNA-binding proteins that regulate gene activity, affect the proliferation and differentiation of host cells, and drive uncontrolled growth that forms tumors. (2) Two proteins with molecular weights of 53 kDa and 106 kDa, called p53 and p106, have been found in normal cells, and the loss or inactivation of either often leads to cell malignancy. Studies have found that the papillomavirus E6 and E7 proteins can bind to p53 and p106, respectively, and inactivate them, which may also be a mechanism by which E6 and E7 cause cell malignancy.
- Late transcriptional region (late gene region, L region): The L region contains two ORFs, L1 and L2, which encode the coat proteins of papillomavirus; L1 is the major coat protein and L2 the minor coat protein.
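In DNA, a palindrome is a sequence that reads the same on both strands, that is, one equal to its own reverse complement. The short Python sketch below applies that definition to the two enhancer sequences quoted for the BPV NCR. The helper names `reverse_complement` and `is_palindromic` are illustrative, and treating N (any base) as its own complement is an assumption made for this sketch.

```python
# Complement table for DNA; N (any base) is treated as its own complement.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True when a sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


for site in ("ATCGGTGCACCGAT", "TTGGCGGNNG"):
    print(site, reverse_complement(site), is_palindromic(site))
```

By this test the 14-base sequence ATCGGTGCACCGAT is palindromic, while TTGGCGGNNG on its own is not.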
Genomic structure and function of RNA phage
The best-studied E. coli RNA phages are MS2, R17, f2 and Qβ. Their genomes are small, ranging from 3,600 to 4,200 nucleotides, and contain four genes. MS2, R17 and f2 have almost the same genomic structure. Two of the four genes encode structural proteins of the phage. One is the gene for protein A, which is 1178 nucleotides long. Protein A (called the maturation protein) enables the phage to recognize the host and allows its RNA genome to enter the host bacterium; each phage particle typically carries only one molecule of protein A. The other structural gene is 399 nucleotides long and encodes the coat protein that forms the virus particle, each particle containing 180 coat protein molecules. The rest of the genome encodes an RNA replicase and a lysis protein. The gene for the lysis protein partially overlaps the coat protein and replicase genes, but it is read in a different frame from the coat protein gene. The MS2, R17 and f2 genomes contain many secondary structures, and the intramolecular base pairing may help protect the RNA from RNase degradation. In addition, there are untranslated sequences at the 5′ and 3′ ends of the coding region, which also help stabilize the RNA molecule. The genome of the other RNA phage, Qβ, is slightly larger than those above. It has no separate lysis protein gene; instead, its structural protein A2 (the maturation protein) also carries out the lysis function. Qβ also encodes another coat protein, A1.
Structural features and functions of the hepatitis B virus genome
The genomic DNA of hepatitis B virus (HBV) has a very unusual structure: it is a circular, partially double-stranded molecule about 3.2 kb long. About two-thirds of it is double-stranded and one-third single-stranded, which means the two DNA strands are unequal in length. The 5′ end of the long strand is not covalently joined to its 3′ end but is covalently linked to a protein. The 5′ end of the long strand is complementary to the short strand over 250-300 base pairs. The long strand is the negative strand and the short strand is the positive strand. The length of the short strand varies from virus to virus and is generally about 1.6-2.8 kb, roughly two-thirds of the long strand. The gap left by the short strand can be filled in by a DNA polymerase in the virus particle. HBV is the smallest double-stranded DNA virus currently known to infect humans. To replicate in cells, the virus packs as much genetic information as possible into a small genome, so the HBV genome is particularly compact and makes full use of its genetic material.
The genome contains many overlapping gene sequences. There are four open reading frames in the HBV genome, which encode the viral nucleocapsid (C) and envelope (S) proteins, the viral replicase (polymerase), and protein X, which appears to be involved in viral gene expression. Two small ORFs in front of the S gene are in the same reading frame as the S ORF, so translation can read through them to produce two S protein-related antigens, which are also present on the surface of the virus particle. They are called pre-S1 and pre-S2. Similarly, a short ORF in front of the C ORF, called pre-C, encodes a larger C protein-related antigen. All of these ORFs are on the negative-strand DNA (the long strand). The S gene lies entirely within the polymerase gene, the X gene overlaps both the polymerase gene and the C gene, and the C gene overlaps the polymerase gene. More recently, Miller et al. found two further ORFs in the HBV genome, ORF-5 and ORF-6. Both overlap the X gene, and ORF-6 is encoded not by the negative strand but by the positive strand. The function of these two ORFs is currently unclear.
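The kinds of overlap mentioned in this article (one gene entirely inside another, two genes sharing part of their sequence, or two genes sharing a single base) can be told apart from gene coordinates alone. The sketch below is a minimal illustration under stated assumptions: coordinates are inclusive 1-based positions on a linearized map, a circular genome would need extra handling at the origin, and every coordinate in the example is hypothetical, not an actual HBV map position.

```python
from typing import Tuple

Interval = Tuple[int, int]  # inclusive (start, end), with start <= end


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def classify_overlap(a: Interval, b: Interval) -> str:
    """Label the relationship between two genes by their coordinates."""
    shared = overlap_length(a, b)
    if shared == 0:
        return "none"
    if shared == a[1] - a[0] + 1 or shared == b[1] - b[0] + 1:
        return "nested"       # one gene lies entirely within the other
    if shared == 1:
        return "single-base"  # e.g. a stop and a start codon share one base
    return "partial"


# Hypothetical coordinates chosen only to exercise each case.
genes = {
    "P": (1, 2500),
    "S": (800, 1480),
    "X": (2300, 2760),
    "C": (2700, 3200),
}

print(classify_overlap(genes["S"], genes["P"]))  # S inside P
print(classify_overlap(genes["X"], genes["P"]))  # X shares part of P
print(classify_overlap(genes["S"], genes["C"]))  # no shared positions
print(classify_overlap((1, 9), (9, 17)))         # one shared base
```

Working on coordinates rather than sequence keeps the sketch independent of reading frame; whether two overlapping genes are read in the same frame would be a separate check on their start positions modulo 3.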
Regulatory sequences are located inside genes, which is another way HBV economizes on genetic material. The sequences involved in HBV genome replication are the short direct repeat sequences (DR1 and DR2) and a U5-like sequence (named for its similarity to the U5 sequence at the end of the retroviral genome). DR1 and the U5-like sequence are located in the pre-C ORF and mark the start site for synthesis of the long DNA strand. DR2 is located where the polymerase gene and the X gene overlap and is the start site for synthesis of the short DNA strand.
Four kinds of signal sequences are involved in HBV gene expression: (1) promoters, (2) an enhancer, (3) a polyA addition signal, and (4) a glucocorticoid responsive element (GRE). Because the genes of the HBV genome are transcribed into three separate HBV mRNA transcripts, there should be at least three RNA polymerase II promoters in the viral genome, each close to the 5′ end of its transcript. Although the sequences of these promoters are not known, they evidently lie within protein-coding sequence. The enhancer (ENH) is located in the polymerase gene, the polyA addition signal in the C ORF, and the GRE in the S ORF and the polymerase gene. The GRE is a DNA segment that is bound by the hormone receptor; when bound, it increases the transcription level of the gene it regulates.
The GRE has many features of an enhancer: (1) it acts in cis, (2) it acts in both transcriptional orientations, and (3) it can function at varying distances from the genes it regulates.
As the above shows, the HBV genome is tightly structured and efficiently organized to a degree that is rare among known viruses. HBV DNA is unusual not only in its structure but also in its replication. When HBV DNA enters the host cell, it is first converted into complete closed circular double-stranded DNA, and the negative strand then serves as the template for synthesis of a full-length “+” strand RNA (called pregenomic RNA). The “+” strand RNA is packaged into immature core-like particles together with a DNA polymerase and a protein. Inside the particle, reverse transcriptase uses the “+” strand RNA as a template to synthesize “-” strand DNA. The specific mechanism is unclear; it may resemble the replication of adenovirus DNA, because a protein is also covalently bound to the 5′ end of the “-” strand DNA. The “+” strand DNA is then synthesized by extension from an RNA primer, with the negative-strand DNA as the template, and during this process the core-like particles become mature virus particles. At this point synthesis of the positive-strand DNA is still incomplete, which is why the two DNA strands of the viral genome differ in length.