
High-throughput sequencing data-based genome de novo assembly method

A genome assembly method based on sequencing data, applied in the fields of electrical digital data processing, special-purpose data processing applications, and instruments. It addresses the problems of a high proportion of repetitive sequences, reduced assembly quality, and increased assembly difficulty.

Active Publication Date: 2014-12-24
BIOMARKER TECH

AI Technical Summary

Problems solved by technology

For the short sequences used in assembly, general-purpose assembly software performs only simple filtering and error correction and applies no secondary processing to these raw short sequences. This largely limits the upper bound on the k-mer size that can be used when building the De Bruijn graph.
Therefore, genome assembly methods that do not process the short sequences must use a relatively small k-mer size, which produces more branches during De Bruijn graph construction, greatly increases the complexity of the graph, and thereby reduces assembly quality.
[0004] In addition, a major feature of animal and plant genomes is their high proportion of repetitive sequences. Repetitive sequences generate a large number of alternative paths and branches during genome assembly, which increases the difficulty of assembly.
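
The link between k-mer size, repeats, and branching can be illustrated with a minimal sketch. It is not the patented method: the toy genome, the read length, and the helper names (`kmers`, `de_bruijn`, `branch_count`) are assumptions made for illustration. The toy genome contains the 4 bp repeat "ACGT" in two different contexts, and the sketch counts branching nodes for a small and a larger k.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branch_count(graph):
    """Count nodes that have more than one distinct successor."""
    return sum(1 for successors in graph.values() if len(successors) > 1)

# Hypothetical toy genome; "ACGT" occurs twice, followed by different bases.
genome = "TTACGTCCGGACGTAA"
# Error-free 10 bp reads tiled at every position (an idealised assumption).
reads = [genome[i:i + 10] for i in range(len(genome) - 9)]

for k in (4, 6):
    print(k, branch_count(de_bruijn(reads, k)))
# k=4 leaves one branching node (at "CGT"); k=6 spans the repeat and leaves none.
```

In this toy case the branch disappears once the (k-1)-mer nodes are longer than the repeat, which is why a method that permits a larger k can yield a simpler graph.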



Examples


Embodiment 1

[0064] Example 1: Escherichia coli (E. coli) genome assembly

[0065] 1) Test data introduction

[0066] The test data were downloaded from the SRA (Sequence Read Archive, formerly Short Read Archive) database of NCBI (National Center for Biotechnology Information). The SRA database website is www.ncbi.nlm.nih.gov/sra, and the accession number of the data is SRX016044. The details of the test data are as follows:

[0067] Upload date: 2009-05-22;

[0068] Library size: 180bp;

[0069] Total sequencing volume: 2.1G;

[0070] Predicted genome sequencing depth: 456.5x.
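
The predicted depth can be checked with simple arithmetic: depth is the total number of sequenced bases divided by the genome size. The sketch below assumes that "2.1G" means 2.1 gigabases and that the E. coli genome is roughly 4.6 Mb; the genome size is not stated in the text above and is an assumption here, as is the helper name `sequencing_depth`.

```python
def sequencing_depth(total_bases, genome_size):
    """Average number of times each genome position is covered by sequenced bases."""
    return total_bases / genome_size

total_bases = 2.1e9          # "2.1G", read here as 2.1 gigabases (assumption)
assumed_genome_size = 4.6e6  # approximate E. coli genome size (assumption)

depth = sequencing_depth(total_bases, assumed_genome_size)
print(f"{depth:.1f}x")  # 456.5x
```

Under these assumptions the result agrees with the 456.5x figure reported for this dataset.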

[0071] 2) Evaluation method

[0072] A total of 7 assembly programs were tested and compared. The main parameters of each program were swept, and the best assembly result from each program was selected for comparison and evaluation. The detailed assembly parameters behind the best result of each program are as follows:

[0073] GNOVO (the inventive method) as...
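
The evaluation procedure described in this example (traverse each program's main parameters, then keep its best result) can be sketched as a small grid search. The text does not say which metric defines the "best" result; contig N50 is used here purely as an assumption. `fake_assemble`, its parameters `k` and `min_cov`, and the contig lengths it returns are hypothetical stand-ins for a real assembler run.

```python
from itertools import product

def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def best_run(assemble, grid):
    """Try every parameter combination in grid; keep the first one with the highest N50."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = n50(assemble(**params))
        if best is None or score > best[0]:
            best = (score, params)
    return best

def fake_assemble(k, min_cov):
    """Hypothetical stand-in for an assembler: returns made-up contig lengths."""
    return [k * 100, k * 50, 1000 // min_cov]

grid = {"k": (21, 31), "min_cov": (2, 5)}  # illustrative values only
print(best_run(fake_assemble, grid))
```

In practice `assemble` would wrap a call to the real assembler and parse its contig lengths, and the scoring function could be swapped for whatever metric the comparison actually uses.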

Embodiment 2

[0105] Example 2: Streptomyces roseosporus (S. roseosporus) genome assembly

[0106] 1) Test data introduction

[0107] The test data were downloaded from NCBI's SRA database. The SRA database website is www.ncbi.nlm.nih.gov/sra, and the accession numbers of the data are SRX026747 and SRX016085.

[0108] a) The details of the test data SRX026747 are as follows:

[0109] Upload date: 2010-08-06;

[0110] Library size: 180bp;

[0111] Total sequencing volume: 10.7G;

[0112] Predicted genome sequencing depth: 1389.6X.

[0113] b) The details of the test data SRX016085 are as follows:

[0114] Upload date: 2009-09-20;

[0115] Library size: 4kb;

[0116] Total sequencing volume: 3.5G;

[0117] Predicted genome sequencing depth: 454.5X.

[0118] 2) Evaluation method

[0119] A total of 5 assembly programs were tested and compared here. The main parameters of each program were swept, and the best assembly result from each program was selected for comparison...

Embodiment 3

[0131] Example 3: Neurospora crassa (N. crassa) genome assembly

[0132] 1) Test data introduction

[0133] The test data were downloaded from NCBI's SRA database. The SRA database website is www.ncbi.nlm.nih.gov/sra, and the accession number of the data is SRX030834.

[0134] a) The details of the test data SRX030834 are as follows:

[0135] Upload date: 2010-11-11;

[0136] Library size: 180bp;

[0137] Total sequencing volume: 5.5G;

[0138] Predicted genome sequencing depth: 148.3X.

[0139] 2) Evaluation method

[0140] A total of 6 assembly programs were tested and compared here. The main parameters of each program were swept, and the best assembly result from each program was selected for comparison and evaluation. The detailed assembly parameters behind the best result of each program are as follows:

[0141] GNOVO assembly parameters are: k1=25, k2=95, m1=5, m2=2, other parameters are default parameters (detailed evaluation...


Abstract

The invention provides a de novo genome assembly method based on high-throughput sequencing data, comprising the following steps: (1) building a De Bruijn graph from the high-throughput sequencing data, then performing sequencing error correction and super-read assembly on the basis of the corrected De Bruijn graph; (2) using the super reads to assemble primary contigs; (3) selecting local primary contigs and reads, assembling them locally, and merging all local assembly results; (4) ordering the contigs with a sub-graph segmentation algorithm and a simulated annealing algorithm to obtain the final scaffolds. Correcting the De Bruijn graph removes errors introduced by high-throughput sequencing and so improves data accuracy; constructing super reads increases the effective read length and markedly increases contig length; local assembly greatly improves the handling of repetitive sequences.
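
Step (4) names simulated annealing for ordering contigs but this abstract gives no cost function or move set. The sketch below is therefore only an illustration of the general technique, not the patented algorithm: it assumes a hypothetical `links` table of pairwise support weights between contigs, a cost that penalises linked contigs placed far apart, and a swap move. The sub-graph segmentation step is not modelled.

```python
import math
import random

def order_cost(order, links):
    """Sum of link weight times distance in the ordering (lower is better)."""
    pos = {contig: i for i, contig in enumerate(order)}
    return sum(w * abs(pos[a] - pos[b]) for (a, b), w in links.items())

def anneal_order(contigs, links, steps=5000, t_start=5.0, t_end=0.01, seed=0):
    """Search for a low-cost contig ordering by simulated annealing with swap moves."""
    rng = random.Random(seed)
    current = list(contigs)
    current_cost = order_cost(current, links)
    best, best_cost = current[:], current_cost
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = order_cost(current, links)
        delta = new_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost = new_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return best, best_cost

# Hypothetical links: c1-c2-c3-c4-c5 should end up adjacent in a chain.
links = {("c1", "c2"): 3, ("c2", "c3"): 3, ("c3", "c4"): 3, ("c4", "c5"): 3}
start = ["c3", "c1", "c5", "c2", "c4"]
order, cost = anneal_order(start, links)
print(order, cost)
```

Because the best ordering seen so far is tracked separately, the returned cost is never worse than that of the starting order. A real scaffolder would also need contig orientation and gap-size estimates, which this sketch leaves out.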

Description

Technical field

[0001] The invention relates to a genome assembly method, and in particular to a de novo genome assembly method based on short sequencing reads.

Background

[0002] With the rapid development of second-generation sequencing technology and the rapid decline in sequencing costs, de novo genome sequencing is increasingly favored by researchers. However, reconstructing a genome from a large volume of short-read data remains a major challenge, and the most critical step is contig assembly. De Bruijn graph construction is the core of graph-based assembly and of the current mainstream de novo assembly algorithms: it builds an Eulerian graph from the overlap information between k-mers and is the cornerstone of contig construction. The present invention is therefore also based on the De Bruijn graph.

[0003] The current contigs assembly algorithms only construct th...


Application Information

IPC(8): G06F19/18
Inventor: 郑洪坤, 刘敏
Owner: BIOMARKER TECH