
Long sequence data dimensionality reduction method used for approximate query

A dimensionality reduction technology for long sequence data, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the high complexity of distance similarity measurement and the infeasibly high cost of constructing the embedding space, resolves performance bottlenecks, and is expected to yield clear economic and social benefits.

Status: Inactive
Publication Date: 2008-06-11
PEKING UNIV

AI Technical Summary

Problems solved by technology

However, because the distance similarity measure in the original sequence space is complex, constructing the embedding space is too costly to be feasible.
[0010] From the above analysis, it can be seen that existing methods supporting similarity queries on long sequence data cannot adequately solve the technical problems faced in current practical applications. The invention therefore provides a dimensionality reduction method to support similarity queries on long sequence data.



Examples


Embodiment Construction

[0026] The present invention addresses similarity measurement between long sequences by proposing a dimensionality reduction method for sequence data based on the principle of sequence embedding. The method transforms long sequences from an ordered sequence space into a set space represented by multisets, and derives the dimensionality reduction from the proposed principle of multiset principal component distance convergence. Based on this principle of long sequence dimensionality reduction, an index structure is defined and an efficient approximate query algorithm is given.
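To make the move from an ordered sequence space to a multiset space concrete, here is a minimal Python sketch. It is not the patent's sequence embedding tree, whose construction is not given in this excerpt. As a stand-in, it assumes overlapping q-grams as the multiset elements and an L1 distance between multisets; the names `qgram_multiset` and `multiset_l1` are hypothetical.

```python
from collections import Counter


def qgram_multiset(seq: str, q: int = 2) -> Counter:
    """Return the multiset (bag) of overlapping q-grams of seq.

    Assumption: q-grams stand in for the elements extracted from the
    sequence embedding tree, which this excerpt does not specify.
    """
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))


def multiset_l1(a: Counter, b: Counter) -> int:
    """L1 distance between two multisets: sum of per-element count gaps."""
    return sum(abs(a[k] - b[k]) for k in a.keys() | b.keys())


if __name__ == "__main__":
    x = qgram_multiset("GATTACA")
    y = qgram_multiset("GATTTACA")
    # Positional order is discarded; only element counts are compared.
    print(multiset_l1(x, y))
```

Comparing counts instead of aligned positions is what makes such a representation cheap to evaluate on long sequences, at the cost of losing order information.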

[0027] The method first uses sequence embedding to convert an input long sequence into a sequence embedding tree (Sequence Embedding Tree); then, from the sequence embedding tree, extract characters composed ...



Abstract

A long sequence data dimensionality reduction method oriented to similarity queries is provided. Sequence embedding is used to convert sequence data into an embedding tree, from which a plurality of sets are extracted. A plurality of corresponding principal components are then extracted from the embedding tree and the sets, and on this basis a sequence data dimensionality reduction principle based on distance convergence is proposed. Based on the properties of the dimensionality reduction, an index structure for sequence similarity queries, the SEM-tree, is constructed, and an efficient similarity query method for long sequence data is proposed based on this index structure and on the principle of double bounds on sequence distance (a maximum upper bound and a minimum bound). The invention can be widely applied to similarity queries over long sequence data, such as finding targets by similarity search in massive Internet text data, or performing similarity queries and analysis on gene fragments in large-scale genetic data. The invention is expected to yield clear economic and social benefits.
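The double-bound idea in the abstract can be illustrated with a generic filter-and-refine range query. The sketch below is a linear scan and does not implement the SEM-tree or the patent's own bounds, neither of which is specified here. As assumptions, it uses edit distance as the sequence distance, the character bag distance as a lower bound, and positional mismatches plus the length difference as an upper bound. All function names are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def lower_bound(a: str, b: str) -> int:
    """Bag distance on character multisets; never exceeds edit distance."""
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def upper_bound(a: str, b: str) -> int:
    """Mismatches over the common prefix length plus the length gap.

    Substituting every mismatch and then inserting or deleting the tail
    is one valid edit script, so its cost bounds edit distance from above.
    """
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def range_query(db: list[str], query: str, radius: int) -> list[str]:
    """Return all sequences in db within `radius` edits of `query`."""
    hits = []
    for s in db:
        if lower_bound(s, query) > radius:
            continue  # pruned: cannot be within the radius
        if upper_bound(s, query) <= radius:
            hits.append(s)  # accepted without computing the exact distance
        elif edit_distance(s, query) <= radius:
            hits.append(s)  # undecided by the bounds, verified exactly
    return hits


if __name__ == "__main__":
    db = ["GATTACA", "GATTTACA", "CCCCCCC", "GATTACC"]
    print(range_query(db, "GATTACA", radius=1))
```

The expensive exact distance is computed only for candidates that neither bound can decide, which is the point of maintaining both a lower and an upper bound.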

Description

Technical field

[0001] The invention relates to a long sequence data dimensionality reduction method for approximate query, and in particular to a sequence embedding (Sequential Embedding) method that supports efficient approximate queries on long sequences. It belongs to the technical field of computer data management.

Background technique

[0002] Long sequence data exists in many scientific research and application fields, such as text data on the Internet and gene sequence data in biology. Approximate queries over such long sequence data are in wide demand, for example searching Internet text data, exploring gene similarity in gene databases, and querying sequences in the multimedia field. Approximate query over long sequence data has become a topic of interest in international data management research.

[0003] At present, there are some methods and technologies that support the similarity query of sequen...


Application Information

IPC(8): G06F17/30
Inventors: 宋国杰 (Song Guojie), 谢昆青 (Xie Kunqing)
Owner PEKING UNIV