
Processing method for nucleic acid third generation sequencing raw data and application thereof

A processing method for raw third-generation sequencing data, applied in special data processing applications, electrical digital data processing, sequence analysis, etc. It addresses the short read length of second-generation sequencing, the high single-base error rate of third-generation data, and the resulting heavy interference with assembly software. The two data types are used to complement and correct each other, which reduces the single-base error rate and improves accuracy.

Active Publication Date: 2018-09-25
BGI TECH SOLUTIONS

AI Technical Summary

Problems solved by technology

Although next-generation sequencing can quickly produce large amounts of data, it has a serious shortcoming: the read length is too short. The most important indicators for genome assembly are N50 and genome completeness. Because the reads are short, and most short-read assembly algorithms are based on the de Bruijn graph, the biggest challenge in assembly is resolving highly repetitive and highly heterozygous genomes.
[0007] However, the biggest problem with third-generation data is its extremely high single-base error rate, which can reach 15%.
The errors are mainly indels, which are randomly distributed and can be corrected to some extent by bioinformatics methods. The third-generation assembly pipelines released to date, such as SMRT, Falcon, PBcR, and Canu, all include a self-correction function for third-generation data, which can reduce the error rate roughly fivefold, from 15% to about 3%. Even so, a 3% error rate still interferes considerably with assembly software based on the OLC (Overlap-Layout-Consensus) algorithm.
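To give a sense of scale for the 15% and 3% figures above, the following minimal sketch estimates how many erroneous bases a read carries and how likely a short window is to be error-free. It assumes errors are independent and uniformly distributed, which is a simplification; the read length of 10,000 bases and the window of 100 bases are illustrative values, not figures from the patent.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of erroneous bases in one read (independent errors assumed)."""
    return read_length * error_rate


def error_free_probability(window: int, error_rate: float) -> float:
    """Probability that a window of `window` consecutive bases has no error."""
    return (1.0 - error_rate) ** window


if __name__ == "__main__":
    # Illustrative values only: a 10 kb read and a 100-base window.
    for rate in (0.15, 0.03):
        errors = expected_errors(10_000, rate)
        clean = error_free_probability(100, rate)
        print(f"rate={rate:.2f}  expected errors={errors:.0f}  "
              f"P(100-base window error-free)={clean:.2e}")
```

Under these assumptions, even at a 3% error rate fewer than one in twenty 100-base windows is error-free, which is consistent with the statement that the remaining errors still disturb overlap-based assembly.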



Examples


Embodiment

[0068] In this example, human chromosome 22, with an effective reference sequence size of 35 Mb and an actual size of 51 Mb, is used as the analysis object, and the processing method for raw third-generation nucleic acid sequencing data of the present application is used for error correction. The details are as follows:

[0069] 1. Data processing

[0070] 1) The second-generation data and the third-generation data of human samples were each obtained from NCBI for testing.

[0071] 2) Using blasr, a global alignment software for long single-molecule real-time sequencing reads, the third-generation data were aligned to the human genome hg19, and the third-generation long reads on chromosome 22 were extracted. The resulting data volume is 3.14 Gb, with an N50 of 11.46 kb. This is the third-generation long-sequence data of step (b) in the processing method of the present application.
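The two summary figures in this step, N50 and data volume, can be computed from a list of read lengths. The sketch below is a generic illustration, not the tooling used in the patent; the function names are hypothetical. The fold-coverage calculation is derived arithmetic that assumes "Gb" and "Mb" denote gigabases and megabases and uses the 35 Mb effective reference size stated in [0068].

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50: the largest length L such that reads of length >= L
    together contain at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty positive input


def fold_coverage(total_bases: float, reference_size: float) -> float:
    """Average sequencing depth: total sequenced bases over reference size."""
    return total_bases / reference_size


if __name__ == "__main__":
    # Toy read lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6]))
    # 3.14 Gb of reads over the 35 Mb effective reference of chromosome 22.
    print(round(fold_coverage(3.14e9, 35e6), 1))
```

Under these assumptions, the extracted chromosome 22 reads correspond to roughly 90-fold average depth over the effective reference.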

[0072] It should be noted that the third-generation data in this example are directly compar...


Abstract

The invention discloses a method for processing raw nucleic acid third-generation sequencing data and an application thereof. The method comprises aligning second-generation short-read data to third-generation self-corrected data, obtaining the single-base coverage depth of the third-generation self-corrected data from the alignment result, masking regions whose single-base coverage depth is below a threshold as N, and using second-generation gap-filling software to fill the masked N regions, yielding third-generation sequencing data with a low single-base error rate. By aligning the second-generation short-read data to the third-generation long-read data and using second-generation gap-filling software to fill in the masked low-depth regions, the method effectively reduces the single-base error rate of the third-generation sequencing data and improves sequencing quality.
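The masking step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the per-base depth is assumed to be already available as a list of integers (for example from a depth report on the short-read alignment), the function names are hypothetical, and the threshold value in the usage example is arbitrary because the abstract does not state one. The gap-filling itself is performed by external second-generation software and is not shown.

```python
from typing import List, Sequence, Tuple


def mask_low_depth(sequence: str, depth: Sequence[int], threshold: int) -> str:
    """Replace every base whose short-read coverage depth is below
    `threshold` with 'N'; bases at or above the threshold are kept."""
    if len(sequence) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    return "".join(
        base if d >= threshold else "N" for base, d in zip(sequence, depth)
    )


def masked_regions(masked: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of consecutive 'N' bases,
    i.e. the regions a gap-filling tool would be asked to fill."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, base in enumerate(masked):
        if base == "N":
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(masked)))
    return regions


if __name__ == "__main__":
    read = "ACGTACGT"
    depth = [5, 5, 1, 0, 5, 5, 2, 5]  # toy per-base depths
    masked = mask_low_depth(read, depth, threshold=3)  # threshold is arbitrary
    print(masked)
    print(masked_regions(masked))
```

Note that any N already present in the input read is reported by `masked_regions` as well, which matches the intent of handing every unresolved stretch to the gap-filling step.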

Description

Technical field

[0001] The present application relates to the field of nucleic acid sequencing data processing, in particular to a processing method for raw nucleic acid third-generation sequencing data and an application thereof.

Background art

[0002] With the maturation and spread of next-generation sequencing (NGS), the cost of sequencing has fallen greatly. For example, the next-generation sequencer Hiseq2500 can produce 600 Gb of data in one run, which is equivalent to 200 times the human genome. Although next-generation sequencing can quickly produce large amounts of data, it has a serious shortcoming: the read length is too short. The most important indicators for genome assembly are N50 and genome completeness. Because the reads are short, and most short-read assembly algorithms are based on the de Bruijn graph, the biggest challenge en...


Application Information

IPC(8): G06F19/22
CPC: G16B30/00
Inventors: 刘亚斌, 邓天全, 贺丽娟, 杨林峰, 高强
Owner: BGI TECH SOLUTIONS