Fast recognition algorithm of similarity data in big data set

A similarity data recognition algorithm, applied in the field of big data processing, addresses the problems of traditional methods, which occupy a large amount of CPU time and memory space, increase time and space overhead, and reduce the performance of similarity data recognition. The algorithm keeps the calculation cost fixed while ensuring the validity and accuracy of recognition.

Active Publication Date: 2014-09-03
广州摩翼信息科技有限公司

AI Technical Summary

Problems solved by technology

Traditional similarity data recognition algorithms consume a lot of CPU time and memory when computing data digests, and they also cause a large amount of disk I/O.
These disk accesses are random, which seriously degrades the performance of similarity data identification.
In addition, the computational overhead of traditional similarity data recognition algorithms grows with the size of the dataset.
[0005] 2. Shorten the time for similarity data identification: on large datasets, traditional similarity data identification algorithms need a long time to identify similar data, which directly leads to serious delays.
[0007] Although the typical similarity algorithms Shingle and Simhash can effectively identify similar data, their time and space overhead is very large on large datasets. In particular, the overhead of both algorithms increases many times over as the size of the data file increases.
Therefore, these two typical similarity algorithms cannot effectively meet the above challenges on large datasets.
[0008] The traditional sampling similarity algorithm has a short recognition time and a fixed overhead that does not grow with the length of the data file, but it is very sensitive to modifications of the data file content.
For example, modifying a single byte of the data file content can cause the traditional sampling similarity algorithm to fail.
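
One possible mechanism behind this sensitivity can be illustrated with a minimal Python sketch. It assumes a naive sampler that spaces its blocks evenly using the raw file length; the function name `naive_offsets` and the chosen sizes are hypothetical and do not come from the patent text.

```python
def naive_offsets(file_len: int, block_len: int, n_blocks: int) -> list[int]:
    """Evenly spaced sample offsets computed from the RAW file length.

    Hypothetical baseline for illustration only; not the patented method.
    """
    # Space left over after reserving room for the sampled blocks,
    # shared equally between the gaps that separate them.
    gap = (file_len - block_len * n_blocks) // (n_blocks - 1)
    return [i * (block_len + gap) for i in range(n_blocks)]


before = naive_offsets(1002, block_len=10, n_blocks=4)
after = naive_offsets(1003, block_len=10, n_blocks=4)  # one byte appended
print(before)  # [0, 330, 660, 990]
print(after)   # [0, 331, 662, 993]
```

In this example a one-byte change in length moves every offset except the first, so the sampled blocks no longer line up with those taken from the original file, even though the two files are almost identical.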

Embodiment

[0036] As shown in Figure 1, the algorithm of the present invention proceeds through the following steps:

[0037] (1) Correct the data file length. Before judging the similarity of a data file, first obtain its length. Divide the length by a position influence factor, take the integer quotient, and multiply it by the position influence factor again; the resulting product is used as the corrected data file length. This correction prevents similarity identification from failing when a modification to the data file shifts the positions of the sampled data.
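
The length correction in step (1) can be sketched in a few lines of Python. This is a minimal sketch that assumes the quotient is an integer (floor) quotient, since otherwise dividing and multiplying by the same factor would leave the length unchanged. The function name `corrected_length` and the factor value 64 are hypothetical choices for illustration.

```python
def corrected_length(file_len: int, position_factor: int) -> int:
    """Round file_len down to the nearest multiple of position_factor.

    Assumes integer division; the factor value is a tuning parameter
    that the source does not specify.
    """
    if position_factor <= 0:
        raise ValueError("position_factor must be positive")
    return (file_len // position_factor) * position_factor


print(corrected_length(1002, 64))  # 960
print(corrected_length(1003, 64))  # 960
print(corrected_length(1024, 64))  # 1024
```

Under this assumption, files whose lengths differ by less than the position influence factor and fall within the same multiple share one corrected length, so sampling positions derived from it stay the same after a small modification.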

[0038] (2) Calculate the distance between the sampled data blocks. Subtract the product of the sampled data length and the number of sampled data blocks from the corrected data file length, and then divide the resulting difference by the number of sampled data blocks minus...

Abstract

A fast recognition algorithm for similarity data in a big data set comprises the following steps: correcting the length of a data file; calculating the distance between sampling data blocks; calculating the positions of the sampling data blocks; extracting the data blocks; extracting one data block at the head of the data file and one at the tail; calculating the characteristic values of the extracted data blocks; and judging the similarity of data through set operations. With this algorithm, the time and space overhead does not increase as the data file grows. Correcting the data file length with a position influence factor avoids recognition failures caused by position offsets of the sampling data blocks, and the information retrieval method ensures the effectiveness and accuracy of similarity data recognition.
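
The steps listed in the abstract can be strung together in a rough Python sketch. Several details are assumptions rather than facts from the source: the spacing formula (the patent text for that step is truncated here, so evenly spaced blocks are assumed), SHA-256 as the "characteristic value", the Jaccard ratio as the set operation, and all names and default parameters (`sample_fingerprints`, `similarity`, `block_len`, `n_blocks`, `factor`).

```python
import hashlib


def corrected_length(file_len: int, factor: int) -> int:
    # Round down to a multiple of the position influence factor
    # (integer division assumed).
    return (file_len // factor) * factor


def sample_fingerprints(data: bytes, block_len: int = 8,
                        n_blocks: int = 4, factor: int = 64) -> set[str]:
    """Hash a fixed number of sampled blocks plus the head and tail blocks."""
    length = corrected_length(len(data), factor)
    if n_blocks < 2 or length < block_len * n_blocks:
        # Too small to sample: fall back to hashing the whole input.
        return {hashlib.sha256(data).hexdigest()}
    # ASSUMPTION: evenly spaced blocks; the source's exact divisor is elided.
    gap = (length - block_len * n_blocks) // (n_blocks - 1)
    offsets = [i * (block_len + gap) for i in range(n_blocks)]
    blocks = [data[o:o + block_len] for o in offsets]
    blocks.append(data[:block_len])    # one block at the head of the file
    blocks.append(data[-block_len:])   # one block at the tail of the file
    return {hashlib.sha256(b).hexdigest() for b in blocks}


def similarity(a: bytes, b: bytes) -> float:
    """Jaccard ratio of the two fingerprint sets (assumed set operation)."""
    fa, fb = sample_fingerprints(a), sample_fingerprints(b)
    return len(fa & fb) / len(fa | fb)


original = bytes(range(256)) * 8   # 2048 bytes
modified = original + b"x"         # one byte appended
print(similarity(original, original))
print(similarity(original, modified))
```

Because only a fixed number of fixed-size blocks is hashed, the work per file in this sketch does not depend on the file size. In the usage example the appended byte changes only the tail block, so most fingerprints still match.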

Description

Technical field

[0001] The invention relates to the technical field of big data processing, and in particular to a fast recognition algorithm for similarity data in a big data set.

Background art

[0002] In 2013, IDC predicted that the total amount of global data would reach 4 ZB by 2014, an increase of 50% over 2012. IBM uses the 4Vs (volume, variety, value, and veracity) to describe the characteristics of these data, which shows directly how complex they are. For example, they include large amounts of structured, semi-structured, and unstructured data. Because of these complex characteristics, existing data processing methods still face many unsolved problems. Among these methods, file similarity plays a very important role, for example in cluster analysis in data mining, plagiarism detection, remote file backup, identification of similar data in the file system, identification of hot d...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/90, G06F18/22
Inventors: 邓玉辉, 周永涛
Owner: 广州摩翼信息科技有限公司