
A method and system for deduplication of data

A data deduplication technique based on key data files, applied in the field of real-time massive data processing. It addresses heavy memory usage, high hardware investment, and low resource utilization, with the stated effects of improving utilization, increasing data processing capacity, and reducing investment.

Active Publication Date: 2021-10-08
SHENZHEN IDREAMSKY TECH
Cites: 13; Cited by: 0

AI Technical Summary

Problems solved by technology

[0004] It can be seen that existing data deduplication methods and systems, when handling real-time massive data, consume a large amount of memory, require high hardware investment, and achieve low resource utilization.



Embodiment Construction

[0043] Various embodiments of the present invention are further described below with reference to Figures 1 to 7.

[0044] First, the practical basis for data deduplication in the present invention is explained:

[0045] When processing real-time massive data, whether for summation, retrieval, or statistics, removing duplicate data first reduces the data volume, speeds up processing, and yields more accurate results. However, deduplication requires comparing each piece of real-time data against a large set of non-repeated data to determine whether it is a duplicate. The non-repeated data must therefore occupy system memory for a long time, which raises memory occupancy, causes pending data to lag and pile up, slows down processing, and reduces the amo...
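The memory problem described in [0045] can be illustrated with a minimal sketch of the conventional approach, in which every non-repeated item stays in memory so that later items can be compared against it. This is not the method of the invention; the function name `deduplicate_stream` and the use of a Python `set` are assumptions made purely for illustration.

```python
def deduplicate_stream(records):
    """Yield each record the first time it appears (illustrative baseline only).

    The `seen` set holds every non-repeated record for the lifetime of the
    stream, so memory use grows with the number of distinct records.
    """
    seen = set()
    for record in records:
        if record not in seen:
            seen.add(record)
            yield record


if __name__ == "__main__":
    for item in deduplicate_stream(["a", "b", "a", "c", "b"]):
        print(item)
```

Because `seen` is never trimmed, a stream with many distinct records keeps all of them resident in memory, which is the occupancy problem the paragraph describes.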


Abstract

The invention discloses a data deduplication method and system. The method comprises the following steps: opening a key data file; inputting a real-time keyword; and calling a deduplication function to deduplicate the real-time keyword against the key data file. After the key data file is opened, the method further includes reading the key data file into memory by memory mapping and preliminarily classifying the deduplication keywords in the key data file. This reduces the memory footprint, avoids heavy memory consumption during deduplication, and optimizes the arrangement of data in the key data file, effectively improving memory utilization and deduplication efficiency. A data deduplication system using this method has the same advantages.
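The steps named in the abstract (open a key data file, map it into memory, classify keywords, then deduplicate a real-time keyword) can be sketched roughly as follows. The patent text available here does not disclose the file layout or the classification rule, so everything concrete below is an assumption: the fixed-size bucket layout, the constants `BUCKETS`, `SLOTS`, and `SLOT_SIZE`, the 8-byte SHA-256 fingerprint, and the function names `open_key_data_file`, `fingerprint`, and `is_duplicate` are all hypothetical.

```python
import hashlib
import mmap
import os
import struct

# Hypothetical layout: the file is an array of buckets, each holding a fixed
# number of 8-byte fingerprints. A slot value of 0 means "empty".
BUCKETS = 1024
SLOTS = 64
SLOT_SIZE = 8
FILE_SIZE = BUCKETS * SLOTS * SLOT_SIZE


def open_key_data_file(path):
    """Create the key data file if needed and return (file, memory map)."""
    if not os.path.exists(path) or os.path.getsize(path) != FILE_SIZE:
        with open(path, "wb") as f:
            f.truncate(FILE_SIZE)  # zero-filled, so every slot starts empty
    f = open(path, "r+b")
    return f, mmap.mmap(f.fileno(), FILE_SIZE)


def fingerprint(keyword):
    """Map a keyword to a non-zero 64-bit integer (assumed scheme)."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    value = struct.unpack("<Q", digest[:8])[0]
    return value or 1  # 0 is reserved for empty slots


def is_duplicate(mm, keyword):
    """Return True if the keyword was seen before; otherwise record it."""
    fp = fingerprint(keyword)
    # Assumed "preliminary classification": pick a bucket from the fingerprint.
    base = (fp % BUCKETS) * SLOTS * SLOT_SIZE
    for i in range(SLOTS):
        offset = base + i * SLOT_SIZE
        stored = struct.unpack_from("<Q", mm, offset)[0]
        if stored == fp:
            return True
        if stored == 0:
            struct.pack_into("<Q", mm, offset, fp)
            return False
    raise RuntimeError("bucket full; a real system would need overflow handling")
```

In this sketch the operating system pages the mapped file in and out on demand, so the process does not need to hold the whole key set in its own heap. Two distinct keywords with the same 64-bit fingerprint would be treated as duplicates, a trade-off of this illustrative layout rather than something stated in the source.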

Description

Technical field

[0001] The invention relates to the field of real-time massive data processing, in particular to a method and system for deduplication of data.

Background

[0002] In existing big data systems, the data statistics process needs to remove a large amount of duplicate data so that the statistical results are more accurate. In existing open-source streaming real-time computing frameworks (such as Storm and Spark Streaming), statistics such as real-time summation and counting are relatively easy to implement. For data deduplication, however, the real-time computing framework itself provides no corresponding implementation, and developers need to implement it themselves or use third-party systems (such as the key-value storage system Redis or the distributed storage systems HBase and Cassandra).

[0003] Currently, the non-duplicated data generated during the deduplication process of real-time data is stored in m...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F12/02, G06F3/06
CPC: G06F3/0608, G06F3/0643, G06F3/0652, G06F12/023, G06F2212/1044
Inventor: 刘荣远
Owner: SHENZHEN IDREAMSKY TECH