Rapid division method for big data gene sequencing files

A rapid division method for big data gene sequencing files, applied in the field of high-performance computing. It addresses the problem that ordinary segmentation affects bwa results and causes inconsistent alignment results, and it improves division speed, reduces the number of disk reads and writes, and eliminates alignment errors.

Active Publication Date: 2020-06-23
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

Ordinary segmentation methods do not take this into consideration, which affects the results of bwa and easily causes inconsistencies in the alignment results.



Embodiment Construction

[0041] Before describing this method in detail, the format of the fastq file is briefly introduced. A fastq file is a text file in which each sequence occupies four lines: the first line is the name of the sequence, the second line is the base sequence, the third line is description information, and the fourth line is the quality scores of the sequence. Sequences are not all the same length. Sequencing files are either single-end or paired-end. A single-end sequencing data set consists of one file; a paired-end data set consists of a pair of files, and the sequences in the two files correspond to each other.
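The four-line layout described above can be illustrated with a minimal Python sketch. The names `FastqRecord` and `read_fastq` and the sample reads are hypothetical and chosen only for illustration; they are not part of the patent text.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One fastq sequence: four consecutive lines of the file."""
    name: str         # line 1: name information
    bases: str        # line 2: base sequence
    description: str  # line 3: description information
    quality: str      # line 4: quality scores


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an open text handle, four lines at a time."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # clean end of file
        if not lines[3]:
            raise ValueError("truncated fastq record")
        name, bases, description, quality = (ln.rstrip("\n") for ln in lines)
        yield FastqRecord(name, bases, description, quality)


# Hypothetical sample data: two reads of different lengths.
sample = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n"
records = list(read_fastq(io.StringIO(sample)))
for rec in records:
    print(rec.name, len(rec.bases))
```

Because reads differ in length, a record cannot be located by a fixed byte stride, which is why dividing such a file by size needs some care.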

[0042] According to an embodiment of the present invention, the block-based partitioning method is described with reference to Figure 1. It includes the following steps.

[0043] Step 101: Set the size of the file block. Preferably, the value is between 1M and 100M.
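As a rough illustration of what choosing a block size implies, the sketch below computes logical (offset, length) ranges over a file of a given size without writing any sub-files. The function name `block_ranges`, the interpretation of "M" as MiB, and the 250 MiB example size are assumptions made for this sketch, not details taken from the patent.

```python
from typing import List, Tuple

MIB = 1024 * 1024  # assumption: "1M" is read here as 1 MiB


def block_ranges(file_size: int, block_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover the file in fixed-size blocks.

    Only byte ranges are computed; no data is copied and no sub-file is created.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (start, min(block_size, file_size - start))
        for start in range(0, file_size, block_size)
    ]


# Hypothetical example: a 250 MiB file divided into 100 MiB blocks.
ranges = block_ranges(250 * MIB, 100 * MIB)
for offset, length in ranges:
    print(offset // MIB, length // MIB)
```

Raw byte offsets like these generally fall in the middle of a sequence, so a real partitioner would still have to adjust each boundary to the start of a four-line record. That adjustment is not shown here.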

[0044] The inventor's r...


Abstract

The invention relates to the field of high-performance computing, and in particular to a rapid division method for big data gene sequencing files. In a multi-node genetic analysis process, the sequencing file does not need to be physically split and no sub-files are generated. A flexible division scheme is provided according to the subsequent analysis program, so that the load on each node is more balanced, hard disk reads and writes are reduced, and division efficiency is improved.

Description

Technical field

[0001] The invention relates to the field of high-performance computing, in particular to a method for rapid segmentation of big data gene sequencing files.

Background

[0002] With the rapid development of the general health field, genetic analysis technology plays an increasingly important role. Gene sequencers produce a large number of sequencing files, and the most commonly used sequencing file format is the fastq format. Each sequencing file can range from a few gigabytes to tens or hundreds of gigabytes. Processing this volume of data quickly has become the bottleneck of gene analysis.

[0003] Because sequencing files are so large, analyzing and processing them on a single node takes a lot of time, so multiple nodes are required for parallel computing to reduce the time for gene analysis. This requires dividing the sequencing files, each node only processes a part of the sequencing files, and finally combining the proc...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B40/00, G16B30/00
CPC: G16B40/00, G16B30/00
Inventor: 张中海, 谭光明, 张春明, 姚二林
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI