A scaffolding method based on the statistical characteristics of the double-ended read insert size

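A technique applied in the field of bioinformatics that uses the statistical characteristics of the paired-end read insert size. It addresses weaknesses of existing scaffolding methods, such as wrongly removing high-confidence edges and ignoring the differing characteristics of long nodes and short nodes.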

Active Publication Date: 2018-10-16
CENT SOUTH UNIV

AI Technical Summary

Problems solved by technology

This strategy treats high-confidence and low-confidence edges alike, ignores the constraint that high-confidence edges place on other edges, and can easily remove some high-confidence edges.
[0009] (3) When traversing the scaffold graph to extract paths and generate scaffolds, the optimization goal is often to maximize the number of paired-end reads consistent with the entire path, while the differing characteristics of long nodes and short nodes are ignored.
[0010] These problems prevent existing scaffolding methods from achieving more satisfactory results.



Examples


Embodiment Construction

[0093] As shown in Figure 1, the method is implemented as follows:

[0094] 1. Preprocessing

[0095] This method assumes that the two reads of a paired-end read have opposite orientations, and that the right read lies to the right of the left read with respect to the 5' to 3' direction of the left read.
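The orientation assumption above can be illustrated with a small check. This is only a sketch of one possible reading of the assumption, for two reads aligned to the same contig: the function name `pair_is_consistent`, the use of leftmost alignment coordinates, and the `"+"`/`"-"` strand labels are hypothetical and do not come from the patent text.

```python
def pair_is_consistent(left_pos, left_strand, right_pos, right_strand):
    """Return True if a read pair on one contig matches the stated assumption.

    Positions are leftmost alignment coordinates on the contig (assumption).
    Strands are "+" (forward) or "-" (reverse).
    """
    # The two reads must be in opposite orientations.
    if left_strand == right_strand:
        return False
    # On the forward strand the left read's 5'->3' direction runs toward
    # higher coordinates, so the right read must sit at a higher position.
    if left_strand == "+":
        return right_pos > left_pos
    # On the reverse strand the direction is flipped.
    return right_pos < left_pos
```

Pairs that fail such a check would be candidates for the noise that preprocessing removes, although the source does not state how inconsistent pairs are handled.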

[0096] The contig file and the alignment result file are used as input data. In the alignment result file, repeated regions and sequencing errors often give a read multiple alignment positions. For each read, this method retains only the alignment position with the highest alignment score and removes all other, non-optimal alignment positions. From the alignment positions of all reads, the read coverage of each position on each contig can be obtained, that is, how many aligned reads cover a given base position. At the same time, the average ...
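The two preprocessing steps described in this paragraph, keeping the best-scoring alignment per read and counting per-base read coverage, might be sketched as follows. The record layout (dictionaries with `read_id`, `contig`, `start`, `length`, `score`) and the function names are assumptions made for illustration; the patent does not specify a file format or a tie-breaking rule, so here the first alignment seen wins a tie.

```python
def keep_best_alignments(alignments):
    """Keep, for each read, only its highest-scoring alignment.

    `alignments` is an iterable of dicts with the hypothetical keys
    read_id, contig, start (0-based), length and score.
    """
    best = {}
    for aln in alignments:
        current = best.get(aln["read_id"])
        # Strict '>' means the first alignment seen wins a tie (assumption).
        if current is None or aln["score"] > current["score"]:
            best[aln["read_id"]] = aln
    return list(best.values())


def read_coverage(alignments, contig_lengths):
    """Count how many aligned reads cover each base position of each contig."""
    coverage = {name: [0] * length for name, length in contig_lengths.items()}
    for aln in alignments:
        cov = coverage[aln["contig"]]
        # Clip the alignment to the contig boundaries before counting.
        start = max(aln["start"], 0)
        end = min(aln["start"] + aln["length"], len(cov))
        for pos in range(start, end):
            cov[pos] += 1
    return coverage


example = [
    {"read_id": "r1", "contig": "c1", "start": 0, "length": 3, "score": 10},
    {"read_id": "r1", "contig": "c1", "start": 5, "length": 3, "score": 7},
    {"read_id": "r2", "contig": "c1", "start": 2, "length": 3, "score": 9},
]
filtered = keep_best_alignments(example)
cov = read_coverage(filtered, {"c1": 8})
```

In this toy input the lower-scoring second alignment of `r1` is dropped, so only positions 0 to 4 of `c1` receive coverage.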


Abstract

The invention discloses a scaffolding method based on the statistical characteristics of the paired-end read insert size. First, the contigs and the paired-end read alignments are preprocessed to remove noise. A scaffold graph is then constructed in which each node represents a contig: the expected number of paired-end reads aligned between two nodes is estimated from the insert size distribution, and the actual number of paired-end reads aligned between the two nodes is compared with this expected value to determine whether an edge exists between them and what its weight is. Next, iteration and linear programming are used to resolve possible conflicts in the scaffold graph. Finally, a breadth-first traversal of the scaffold graph determines the scaffolds. The method is simple and easy to use, gives good scaffolding results on different real data sets, and achieves higher accuracy than other scaffolding methods.
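The abstract does not give the formula for the expected number of paired-end reads between two nodes. As a rough illustration only, the sketch below assumes a normal insert size distribution with mean `mu` and standard deviation `sigma`, read pairs starting uniformly along the genome at a rate of `pairs_per_base`, and a known gap between the two contigs. The functions `expected_links` and `edge_weight`, and the rule of capping the weight at 1, are hypothetical and are not taken from the patent.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def expected_links(len_a, len_b, gap, read_len, mu, sigma, pairs_per_base):
    """Expected number of read pairs with the left read on contig A and the
    right read on contig B (all modelling choices here are assumptions).

    Contig A occupies [0, len_a) and contig B occupies
    [len_a + gap, len_a + gap + len_b). A pair starting at s with insert
    size I covers the fragment [s, s + I).
    """
    b_start = len_a + gap
    b_end = b_start + len_b
    total = 0.0
    # The left read [s, s + read_len) must lie entirely on contig A.
    for s in range(0, len_a - read_len + 1):
        # The right read [s + I - read_len, s + I) must lie entirely on B.
        lo = b_start + read_len - s
        hi = b_end - s
        if hi > lo:
            total += normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    return total * pairs_per_base


def edge_weight(observed, expected):
    """Hypothetical weight: ratio of observed to expected links, capped at 1."""
    if expected <= 0:
        return 0.0
    return min(observed / expected, 1.0)
```

Under these assumptions the expected count shrinks as the gap grows, so a small observed count between distant contigs is penalized less than the same count between adjacent ones.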

Description

Technical field

[0001] The invention relates to the field of bioinformatics, in particular to a scaffolding method based on the statistical characteristics of the paired-end read insert size.

Background

[0002] A genome generally refers to all coding and non-coding deoxyribonucleic acid (DNA) sequences, which are composed of four bases: adenine (A), thymine (T), cytosine (C) and guanine (G). A genome sequence is therefore a string over the four characters A, T, G and C. Real genome sequences also contain the character N, which indicates that the base at that position cannot be determined. Genome sequencing yields a large number of short base sequences (reads) from the genome sequence. The reads obtained from genome sequencing generally have relatively short lengths. Sequence assembly methods use these short reads to restore the complete original genome sequence. With the rapid de...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F19/00
CPC: G16B40/00
Inventor: 王建新, 罗军伟, 李敏, 段桂华
Owner: CENT SOUTH UNIV