
Gene variation data distributed storage method and architecture

A distributed storage method and architecture for genetic variation data. It addresses the high data maintenance cost, poor scalability, and large data-flow delay of existing schemes, and achieves high batch processing efficiency, good random read performance, and reduced data redundancy.

Active Publication Date: 2018-09-21
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

The genome-wide association analysis scenario requires both low-latency random read performance and efficient batch read and write performance. An inappropriate storage architecture may lead to problems such as low efficiency, complex models, and poor scalability. A suitable storage architecture is therefore needed to improve the efficiency of genome-wide association analysis.
[0003] The storage scheme based on the Hadoop Distributed File System (HDFS) stores variant call files (VCF files) as blocks on multiple nodes. It scales well and responds efficiently to batch analysis tasks, but it cannot provide low-latency random data access, nor does it support data update operations.
The HBase-based storage scheme stores VCF files as key-value pairs. HBase is a distributed database that can easily be expanded to multiple nodes, and it supports low-latency random reads and writes. However, because HBase is a column-family store that holds key-value pairs, its scan overhead is relatively large, and it cannot support efficient batch analysis.
The hybrid architecture based on HDFS + HBase can achieve both low-latency random reads and writes and efficient batch analysis, but its model is complex, its data maintenance cost is high, and there is a long delay between the time data is generated and the time it becomes available for batch analysis.
In addition, there are genotype query tools, such as gqt, that build bitmap indexes on top of VCF files to speed up retrieval. Such a tool covers only part of the functionality the scenario requires, so more complex queries need several tools to be combined. Most of these tools also run on a single node and therefore scale poorly.


Embodiment Construction

[0048] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not constitute a conflict with each other.

[0049] As shown in Figure 1, the distributed storage method for genetic variation data provided by the present invention includes the following steps:

[0050] S1. Preprocess the VCF file: strip the VCF header, vertically split the VCF file into two parts (metadata information and sample genotype information), and further vertically split the sample genotype data into more parts according to th...
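The vertical split in step S1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `split_vcf`, the `samples_per_part` parameter, and the choice to group sample columns into fixed-size chunks are all assumptions made for illustration, since the source text is cut off before it states the actual splitting criterion. The sketch relies only on the standard VCF layout, in which nine fixed columns (CHROM through FORMAT) precede the per-sample genotype columns.

```python
# Illustrative sketch only: names and the chunking rule are assumptions,
# not taken from the patent text.
N_FIXED = 9  # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def split_vcf(lines, samples_per_part=2):
    """Split VCF lines into a metadata part and several genotype parts.

    Returns (sample_names, metadata_rows, genotype_parts), where each
    genotype part is a list of ((chrom, pos), [genotypes...]) rows holding
    a contiguous group of sample columns.
    """
    sample_names = []
    metadata_rows = []
    genotype_parts = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # drop the VCF meta-information header lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The column header line names the samples.
            sample_names = fields[N_FIXED:]
            n_parts = -(-len(sample_names) // samples_per_part)  # ceiling
            genotype_parts = [[] for _ in range(n_parts)]
            continue
        key = (fields[0], int(fields[1]))  # (chromosome, position)
        metadata_rows.append(fields[:N_FIXED])
        genotypes = fields[N_FIXED:]
        for p, part in enumerate(genotype_parts):
            start = p * samples_per_part
            part.append((key, genotypes[start:start + samples_per_part]))
    return sample_names, metadata_rows, genotype_parts


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1",
]
names, meta, parts = split_vcf(example, samples_per_part=2)
```

Each genotype part keeps the (chromosome, position) key, so a row in any part can still be joined back to its metadata row after the parts are stored separately.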


Abstract

The invention discloses a distributed storage method and architecture for gene variation data. The method comprises a distributed data storage process, a distributed bitmap index creation process, and a distributed query retrieval process; the architecture comprises a distributed column storage module, a distributed bitmap index module, and a query retrieval module. A new column storage engine, Kudu, is adopted for distributed data storage, and distributed local bitmap indexes are built for the samples. This effectively solves the low random data access performance of the existing HDFS scheme and the poor batch analysis performance of the HBase scheme, simplifies the storage architecture model, and removes the limitation that genotype query tools must be combined with multiple other tools. The distributed local bitmap index scheme also provides high concurrency and improves scalability.
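The idea of a per-partition ("local") bitmap index over sample genotypes can be sketched in a few lines of Python. This is a hedged illustration, not the architecture described in the patent: it does not use Kudu, the bitmaps are plain Python integers, and the names `build_bitmap`, `build_local_index`, `rows_of`, and `query` are hypothetical. Each partition indexes only its own rows; a query is evaluated independently on every partition and the hits are concatenated, which is what allows the partitions to be searched concurrently.

```python
# Illustrative sketch only: all names are hypothetical, and Python integers
# stand in for the bitmap structures a real system would use.
from collections import defaultdict


def build_bitmap(genotypes):
    """Map each genotype value to a bitmap whose bit i is set for row i."""
    bitmaps = defaultdict(int)
    for row, gt in enumerate(genotypes):
        bitmaps[gt] |= 1 << row
    return dict(bitmaps)


def build_local_index(partition):
    """partition: {sample: [genotype per local row]} -> {sample: {gt: bitmap}}."""
    return {sample: build_bitmap(gts) for sample, gts in partition.items()}


def rows_of(bitmap):
    """Return the row numbers of the set bits in a non-negative bitmap."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


def query(local_indexes, conditions):
    """Find rows where every sample in `conditions` has the given genotype.

    Returns (partition_id, local_row) pairs; each partition is evaluated
    on its own, so the loop body could run in parallel.
    """
    hits = []
    for pid, index in enumerate(local_indexes):
        bitmaps = [index.get(s, {}).get(gt, 0) for s, gt in conditions.items()]
        if not bitmaps:
            continue  # no conditions given: nothing to match
        combined = bitmaps[0]
        for b in bitmaps[1:]:
            combined &= b  # bitwise AND intersects the row sets
        hits.extend((pid, row) for row in rows_of(combined))
    return hits


partitions = [
    {"S1": ["0/1", "0/0", "0/1"], "S2": ["1/1", "1/1", "0/0"]},
    {"S1": ["0/1"], "S2": ["1/1"]},
]
local_indexes = [build_local_index(p) for p in partitions]
result = query(local_indexes, {"S1": "0/1", "S2": "1/1"})
```

Because a conjunctive query reduces to bitwise AND over precomputed bitmaps, no genotype column has to be scanned at query time.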

Description

Technical field

[0001] The present invention relates to the field of big data storage, in particular to a method and architecture for distributed storage of genetic variation data based on columnar storage and bitmap indexing.

Background technique

[0002] With the rapid development of gene sequencing technology and the urgent need for personalized medicine, genome-wide association analysis has become an increasingly popular research field. Genome-wide association analysis relies on large-scale genetic variation detection data. These data belong to the typical big data category. The data organization, indexing, and expansion methods of different storage architectures will have a great impact on data retrieval and analysis. The genome-wide association analysis scenario requires both low-latency random read performance and efficient batch read and write performance. An inappropriate storage architecture may lead to problems such as low efficiency, complex models, and low scala...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/28, G06F19/24
CPC: G16B40/00, G16B50/00
Inventor: 董守斌, 王博, 董守玲, 袁华
Owner: SOUTH CHINA UNIV OF TECH