Quality control method based on PacBio full-length transcriptome sequencing data

A quality control method for PacBio full-length transcriptome sequencing data. It addresses chimeric sequences that escape filtering because their primer sequences are not correctly identified, which lowers the accuracy of downstream analysis, and thereby improves the accuracy of transcriptome results.

Active Publication Date: 2022-03-18
BIOMARKER TECH

AI Technical Summary

Problems solved by technology

In the full-length sequence recognition step, some chimeric sequences are filtered out by checking whether a primer sequence occurs in the middle of the sequence (see Figure 1), but others escape filtering because the primer sequence cannot be correctly identified.
In particular, when the sequenced species has no reference genome, possible chimeric sequences cannot be identified from alignments against a reference.
These unrecognized chimeric sequences are retained in the final transcriptome and substantially reduce the accuracy of subsequent transcriptome-related analyses.
To improve the accuracy of transcriptome sequencing data, the chimeric sequences that the prior art cannot identify must also be removed, but no such method has been reported so far.
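
The primer-based check described above can be illustrated with a minimal Python sketch. It is not the IsoSeq implementation: the function names, the `end_margin` parameter, and the primer `EXAMPLE_PRIMER` are hypothetical placeholders, and the search uses exact string matching only. That limitation is the point of the example: a primer copy carrying even one sequencing error goes undetected, so the chimeric read passes the filter.

```python
# Hypothetical sketch of an "internal primer" check; names and values are assumptions.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Placeholder primer for illustration only, not a primer named in the source.
EXAMPLE_PRIMER = "ACGTTGCAAGGT"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_internal_primer(seq: str, primer: str, end_margin: int = 50) -> int:
    """Return the 0-based start of an exact primer hit in the read interior.

    The first and last `end_margin` bases are skipped, because primers are
    expected at the read ends of a genuine full-length sequence. Both the
    primer and its reverse complement are searched. Returns -1 if no hit.
    """
    lo = end_margin
    hi = len(seq) - end_margin
    if hi - lo < len(primer):
        return -1  # read too short to contain an interior primer copy
    for probe in (primer, reverse_complement(primer)):
        pos = seq.find(probe, lo, hi)
        if pos != -1:
            return pos
    return -1


def is_chimera_candidate(seq: str, primer: str = EXAMPLE_PRIMER) -> bool:
    """Flag a read as a possible chimera if a primer sits in its interior."""
    return find_internal_primer(seq, primer) != -1
```

A read built as `"G" * 60 + EXAMPLE_PRIMER + "C" * 60` is flagged, while the same read with one substituted base inside the primer is not, which is the failure mode the additional filters are meant to cover.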



Examples


Embodiment 1

[0030] The sequencing data in this example comprise 23G of PacBio full-length transcriptome sequencing data from pine, plus Illumina sequencing data from 3 biological replicates of pine samples, with no less than 6G of data per replicate.

[0031] The data were analyzed according to the quality control method of the present invention to filter possible chimeric sequences and obtain the final transcriptome. The specific method is:

[0032] (1) Use the IsoSeq analysis pipeline to obtain high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data;

[0033] (2) Based on the Illumina sequencing data, use proovread to correct the low-quality consensus full-length sequences, and filter out corrected sequences whose average accuracy is less than 0.99;

[0034] (3) Merge the high-quality consensus full-length sequences with the low-quality ones that meet the condition after correction, and count t...
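
The accuracy threshold in step (2) can be sketched in Python as follows. The source does not state how average accuracy is computed, so this sketch assumes that per-base Phred-scaled quality values are available for each corrected sequence and converts them with the standard relation accuracy = 1 - 10^(-Q/10). The function names and the dictionary layout are hypothetical.

```python
# Hypothetical sketch: filter corrected sequences by mean per-base accuracy.
# Assumption: each record carries Phred quality scores; the source does not
# specify how "average accuracy" is actually derived.
from typing import Dict, List


def mean_accuracy(phred_scores: List[int]) -> float:
    """Mean per-base accuracy, using accuracy = 1 - 10 ** (-Q / 10)."""
    if not phred_scores:
        return 0.0
    total = sum(1.0 - 10.0 ** (-q / 10.0) for q in phred_scores)
    return total / len(phred_scores)


def filter_by_accuracy(
    records: Dict[str, List[int]], min_accuracy: float = 0.99
) -> Dict[str, List[int]]:
    """Keep records whose mean accuracy is at least `min_accuracy`.

    Sequences with an average accuracy below 0.99 are dropped, matching
    the threshold given in step (2).
    """
    return {
        name: quals
        for name, quals in records.items()
        if mean_accuracy(quals) >= min_accuracy
    }
```

For example, a sequence whose bases are all Q30 has a mean accuracy of 0.999 and is kept, while one that mixes Q10 and Q30 bases averages about 0.95 and is removed.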

Embodiment 2

[0039] The sequencing data in this example comprise 21.88G of PacBio full-length transcriptome sequencing data from 1 mixed lemon sample, plus Illumina sequencing data from the 3 separate samples in the mixed sample (3 biological replicates for each sample), with no less than 6G of data per replicate.

[0040] The data were analyzed according to the quality control method of the present invention to filter possible chimeric sequences and obtain the final transcriptome. The specific method is:

[0041] (1) Use the IsoSeq analysis pipeline to obtain high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data;

[0042] (2) Based on the Illumina sequencing data, use proovread to correct the low-quality consensus full-length sequences, and filter out corrected sequences whose average accuracy is less than 0.99;

[0043] (3) Merge the high-quality consensus full-length sequences with the low-quality ones that meet the condition after correction, and count the lengths ...


Abstract

The present invention provides a quality control method based on PacBio full-length transcriptome sequencing data, comprising the steps of: 1) using the IsoSeq analysis pipeline to obtain high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data; 2) correcting the low-quality consensus full-length sequences based on Illumina sequencing data, and filtering out sequences that still fail to meet the high-quality standard; 3) merging the high-quality consensus full-length sequences with the low-quality ones that meet the condition after correction, and filtering according to the following criteria: remove overly long sequences generated by sequence chimerism; remove consensus full-length sequences whose self-alignment results show palindromic sequences; remove sequences to which other consensus full-length sequences align at multiple positions. Filtering the chimeric sequences that may exist among the consensus full-length sequences by multiple criteria reduces the proportion of false positives in the final transcriptome and improves the accuracy of subsequent transcriptome-related analyses.
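
Two of the filtering criteria in step 3 can be sketched in Python under stated assumptions. The source detects palindromes from self-alignment results; the first function below swaps in a simpler k-mer heuristic (the fraction of a read's k-mers that also occur in its own reverse complement), which is not the source's method. The second function reflects one reading of the multi-position criterion: a target sequence is flagged when a single other sequence aligns to it at two or more disjoint positions. All names, the tuple layout of `hits`, and the thresholds `k` and `min_fraction` are hypothetical.

```python
# Hypothetical sketches of two chimera filters; thresholds are assumptions.
from collections import defaultdict
from typing import Iterable, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def self_revcomp_kmer_fraction(seq: str, k: int = 15) -> float:
    """Fraction of distinct k-mers in `seq` also present in its reverse complement.

    A read of the form S + reverse_complement(S) equals its own reverse
    complement, so the fraction is 1.0; ordinary transcripts score near 0.
    """
    if len(seq) < k:
        return 0.0
    rc = reverse_complement(seq)
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    rc_kmers = {rc[i:i + k] for i in range(len(rc) - k + 1)}
    return len(kmers & rc_kmers) / len(kmers)


def is_palindromic_candidate(seq: str, k: int = 15, min_fraction: float = 0.5) -> bool:
    """Flag reads whose k-mer content is largely shared with their reverse complement."""
    return self_revcomp_kmer_fraction(seq, k) >= min_fraction


Hit = Tuple[str, str, int, int]  # (query, target, target_start, target_end), half-open


def targets_hit_at_multiple_positions(hits: Iterable[Hit]) -> Set[str]:
    """Return targets to which one other sequence aligns at disjoint positions."""
    by_pair = defaultdict(list)
    for query, target, start, end in hits:
        if query != target:  # ignore self-alignments
            by_pair[(query, target)].append((start, end))
    flagged = set()
    for (_query, target), intervals in by_pair.items():
        intervals.sort()
        last_end = intervals[0][1]
        for start, end in intervals[1:]:
            if start >= last_end:  # disjoint from every earlier interval
                flagged.add(target)
                break
            last_end = max(last_end, end)
    return flagged
```

The length criterion is left out of the sketch because the source does not state the cutoff used to call a sequence overly long.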

Description

Technical field

[0001] The invention relates to the technical field of bioinformatics, in particular to a quality control method based on PacBio full-length transcriptome sequencing data, which is used for filtering chimeric sequences in PacBio full-length transcriptome sequencing data.

Background art

[0002] The transcriptome is the link between the genetic information of the genome and the proteome that carries out biological functions. Regulation at the transcription level is the most important and most widely studied mode of regulation in organisms, and transcriptome research is one of the essential tools for understanding life processes. Transcriptome sequencing can sequence the transcriptome of a sample at any time point or under any condition, dynamically reflect the gene transcription level, simultaneously identify and quantify rare transcripts and normal transcripts, and provide sample-specific transcript sequence structure information.

[0003] However, sequencing t...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G16B20/30
Inventor: 郑洪坤, 许国路, 杨春鹤, 张雪川
Owner: BIOMARKER TECH