Compact next-generation sequencing datasets and efficient sequencing processing using them

A compact representation for gene sequencing data, applied in the field of gene analysis, that addresses the problems of increased cost and high computing cost while preserving compatibility.

Inactive Publication Date: 2018-05-29
KONINKLIJKE PHILIPS NV

AI Technical Summary

Problems solved by technology

[0009] The combination of the size of large genomic datasets and the rapidly decreasing cost of performing NGS means that genetic data storage is a major part of the total cost of sequencing applications, and that share is expected to continue growing as sequencing becomes cheaper and produces larger datasets.
Furthermore, large raw read datasets translate into higher computational costs for downstream processing (such as alignment).


Embodiment Construction

[0025] Disclosed herein is a method for formatting raw read data including base quality scores in a manner that allows for a substantial reduction in file size while preserving most of the useful information. As discussed earlier, in the regular FASTQ format, a read occupies slightly more than 2L_seq (ASCII) characters, where L_seq is the number of bases in the read. Other existing text-based storage formats that store base sequences and corresponding base quality scores occupy a considerable amount of storage. For example, the Qseq format also stores base sequences and quality scores, but arranges them in a single line of text. The FASTA format cuts this storage roughly in half, but it does so by discarding all base quality score information. Alternatively, one can convert a text-formatted read entry to a non-text format (e.g., a binary format in which two bits encode a base and the phred score is represented by a binary integer value). However, most downstream processing components...
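The storage comparison in paragraph [0025] can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the FASTQ and FASTA layouts assume newline-terminated lines with a one-character separator line, and the packed layout assumes one 8-bit integer per quality score and no ambiguous bases, none of which is specified in the text above.

```python
import math


def fastq_chars(read_len: int, header_len: int = 0) -> int:
    """Characters in one FASTQ record: '@header', bases, '+', qualities.

    Each of the four lines is assumed to end with a single newline.
    """
    header_line = 1 + header_len + 1   # '@' + header + newline
    base_line = read_len + 1           # bases + newline
    separator_line = 2                 # '+' + newline
    quality_line = read_len + 1        # one quality character per base + newline
    return header_line + base_line + separator_line + quality_line


def fasta_chars(read_len: int, header_len: int = 0) -> int:
    """Characters in one FASTA record: '>header' and bases, no qualities."""
    return (1 + header_len + 1) + (read_len + 1)


def packed_bytes(read_len: int, bits_per_quality: int = 8) -> int:
    """Bytes for a binary record: 2 bits per base plus an integer per quality.

    Assumes only A, C, G, T occur (an 'N' would not fit in 2 bits).
    """
    total_bits = 2 * read_len + bits_per_quality * read_len
    return math.ceil(total_bits / 8)


if __name__ == "__main__":
    for length in (100, 150, 250):
        print(length, fastq_chars(length), fasta_chars(length), packed_bytes(length))
```

For a 100-base read with an empty header this gives 206 characters for FASTQ (slightly more than 2L_seq), 103 for FASTA (roughly half), and 125 bytes for the assumed packed layout.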

Abstract

A method operates on genetic sequencing reads acquired by processing base sequences collected from a tissue sample. A compact textual representation of each genetic sequencing read is generated. The compact textual representation includes: (1) a text string representing the base sequence, and (2) a base quality text field identifying the longest subsequence of the base sequence for which the base quality scores of the bases of the subsequence satisfy a base quality score threshold. The compact textual representation of the genetic sequencing read is stored in a raw read memory. To provide flexibility, the base quality text field may identify the longest subsequence for each of two or more different base quality score thresholds. During read alignment, offset boundaries for genetic sequencing reads can be efficiently selected using the contents of the base quality text field.
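A minimal sketch of how such a base quality text field might be derived from a FASTQ-style quality string follows. It is not the patented encoding: the function names, the `Q<threshold>:<start>-<end>` field layout, the tab separator, the Phred+33 offset, and the reading of "subsequence" as a contiguous run of bases are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def longest_run_at_or_above(scores: Sequence[int], threshold: int) -> Tuple[int, int]:
    """Return (start, length) of the longest contiguous run with score >= threshold.

    Returns (0, 0) if no base meets the threshold; the earliest run wins ties.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None
    return best_start, best_len


def compact_record(bases: str, quality: str, thresholds: Sequence[int] = (20, 30)) -> str:
    """Build a hypothetical compact line: bases, a tab, then one field per threshold.

    Each field is 'Q<threshold>:<start>-<end>' with a 0-based, end-exclusive range.
    """
    scores = phred_scores(quality)
    fields = []
    for threshold in thresholds:
        start, length = longest_run_at_or_above(scores, threshold)
        fields.append(f"Q{threshold}:{start}-{start + length}")
    return bases + "\t" + ";".join(fields)


if __name__ == "__main__":
    # 'I' decodes to 40, '5' to 20 and '#' to 2 under the Phred+33 assumption.
    print(compact_record("ACGTACGT", "II5I#III"))
```

With the sample read, the scores are 40, 40, 20, 40, 2, 40, 40, 40, so the threshold-20 field covers bases 0 to 4 and the threshold-30 field covers bases 5 to 8. An aligner could use such ranges to pick offset boundaries without re-scanning per-base scores, which is the use the abstract describes.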

Description

Technical field

[0001] The following relates to the field of genetic analysis, and to applications of the same in medical fields, including oncology, veterinary medicine, and so on.

Background

[0002] Efficient gene-sequencing systems, sometimes called "next-generation sequencing" (NGS) systems, are capable of rapidly and essentially automatically sequencing entire genomes. Although NGS accuracy is sufficient for clinical applications and is expected to improve as the technology matures, existing NGS systems sometimes exhibit lower read accuracy.

[0003] To assess read accuracy (or reliability), a base quality score is typically calculated for each base of a read. In the case of Sanger sequencing, the phred quality score is calculated from the spectrogram data by calculating parameters such as peak shape and resolution for the sequenced bases, and comparing these values to an empirically built lookup table. Phred scores are generally considere...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F19/22, C12Q1/68, G16B30/10, G16B30/20
CPC: G16B30/00, G16B30/10, G16B30/20
Inventor: S. Kumar, R. Singh, B. Chakrabarti
Owner: KONINKLIJKE PHILIPS NV