
Dynamic configurable rule-based data cleaning framework under big data background

A data cleaning framework in the field of big data technology, applicable to electronic digital data processing and special-purpose data processing applications. It addresses the problems that existing tools do not support online modification of rules, that hard-coded cleaning logic is difficult to execute and understand, and that these limitations obstruct data cleaning. Its effects are that rules can be modified, added, and deleted online, and that the approach rests on a solid foundation in compilation theory.

Inactive Publication Date: 2016-09-07
XINJIANG TECHN INST OF PHYSICS & CHEM CHINESE ACAD OF SCI
Cites: 1; Cited by: 30

AI Technical Summary

Problems solved by technology

[0005] Most existing research on data cleaning focuses on character data; the processing of numerical fields (for example, a numerical field that must fall within a certain interval), enumerated fields, Boolean fields, and other field types is not yet mature or practical. Most data cleaning tools and frameworks target a specific field. If users need to introduce new rules or reuse rules from other fields (for example, ID number rules are common to many fields), extending an existing solution or deploying it in their own system becomes very difficult. Some cleaning tools still implement detection and modification by hard coding, which gives the system poor scalability and flexibility: when the cleaning rules change, the cleaning code must be re-implemented. Hard-coded cleaning is also weakly descriptive, and cleaning with complex logic in particular is hard to execute and understand. Other cleaning tools rely on manual judgment during detection, cleaning, and modification. This approach is highly accurate when the amount of data is small, but it is powerless when the data is huge and comes from multiple sources.
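To make the contrast with hard-coded cleaning concrete, the following minimal sketch expresses interval, enumerated, and Boolean checks as data-like rule objects rather than as branches in cleaning code. It is an illustration only, not the patent's implementation: the `Rule` structure, the helper names, and the example fields (`age`, `gender`, `active`) are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Rule:
    """A declarative check on one field (hypothetical structure)."""
    name: str
    field: str
    check: Callable[[Any], bool]


def interval_rule(name: str, field: str, low: float, high: float) -> Rule:
    # Numerical rule: the value must be a number inside [low, high].
    # bool is excluded explicitly because it is a subclass of int.
    def check(value: Any) -> bool:
        return (isinstance(value, (int, float))
                and not isinstance(value, bool)
                and low <= value <= high)
    return Rule(name, field, check)


def enum_rule(name: str, field: str, allowed: Iterable[Any]) -> Rule:
    # Enumerated rule: the value must be one of a fixed set.
    allowed_set = frozenset(allowed)
    return Rule(name, field, lambda value: value in allowed_set)


def boolean_rule(name: str, field: str) -> Rule:
    # Boolean rule: the value must be a real bool, not a string like "yes".
    return Rule(name, field, lambda value: isinstance(value, bool))


def violations(record: Dict[str, Any], rules: List[Rule]) -> List[str]:
    """Return the names of the rules that the record breaks."""
    return [r.name for r in rules if not r.check(record.get(r.field))]


# Hypothetical rule set; the fields and bounds are illustrative assumptions.
RULES = [
    interval_rule("age_range", "age", 0, 150),
    enum_rule("gender_enum", "gender", ["F", "M"]),
    boolean_rule("active_flag", "active"),
]
```

Because each rule names the field it inspects, the same rule can be attached to a different data source by changing only the field name, which is the kind of reuse the paragraph above says hard-coded tools make difficult.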
[0006] The inventors of the present invention studied several existing rule-based data cleaning methods and summarize them as follows: 1) The NADEEF method proposed by Amr Ebaid et al. supports various forms of rules, but it cannot express complex logical operations between rules, and it does not handle some important data cleaning problems such as missing value filling. 2) The AszpClean method proposed by Li Junkui et al. achieves dynamic compilation of rules and zero configuration of rules, but it directly discards non-compliant data and does not implement data repair. More importantly, it supports only function-type rules and matches attributes to rules by hard coding, so it does not support online modification of rules and makes it difficult to reuse rules across fields. 3) Other traditional constraint-based methods represent rules with conditional functional dependencies, inclusion dependencies, and the like. These methods can identify which data is dirty, but they rarely indicate which specific attribute is wrong or how to fix it.
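The gaps listed above (no logical operations between rules, discarding instead of repairing, and not reporting which attribute is wrong) can be illustrated with a small sketch. It is not taken from the patent or from NADEEF or AszpClean; the combinators `all_of`, `any_of`, and `negate`, the `repair` function, and the example fields and fill value are hypothetical.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Record = Dict[str, Any]
Predicate = Callable[[Record], bool]


def all_of(*preds: Predicate) -> Predicate:
    # Logical AND over rules.
    return lambda rec: all(p(rec) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    # Logical OR over rules.
    return lambda rec: any(p(rec) for p in preds)


def negate(pred: Predicate) -> Predicate:
    # Logical NOT of a rule.
    return lambda rec: not pred(rec)


def present(field: str) -> Predicate:
    # True when the field exists and is not None.
    return lambda rec: rec.get(field) is not None


def in_range(field: str, low: float, high: float) -> Predicate:
    def check(rec: Record) -> bool:
        value = rec.get(field)
        return isinstance(value, (int, float)) and low <= value <= high
    return check


def repair(record: Record, field: str, is_valid: Predicate,
           fill_value: Any) -> Tuple[Record, Optional[str]]:
    """Return a repaired copy and the name of the faulty field (or None).

    The record is never discarded; an invalid or missing value is
    replaced by fill_value, which is an assumed default.
    """
    if is_valid(record):
        return dict(record), None
    fixed = dict(record)
    fixed[field] = fill_value
    return fixed, field


# Hypothetical composite rules built from the combinators above.
valid_temp = all_of(present("temp_c"), in_range("temp_c", -50, 60))
reachable = any_of(present("email"), present("phone"))
```

For example, `repair({"id": 1, "temp_c": None}, "temp_c", valid_temp, 20.0)` returns the record with `temp_c` filled in together with the name of the repaired attribute, rather than dropping the row.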



Examples


Embodiment Construction

[0036] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0037] For the dynamically configurable rule-based data cleaning framework of the present invention, a typical data preprocessing process is shown in Figure 1. The box on the left represents the original data set, which includes structured, semi-structured, and unstructured data. The middle box represents the two main tasks of data preprocessing: data transformation and data cleaning. The end result of data preprocessing is clean data.
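The two-stage flow described above (transformation first, then cleaning, with clean data as the output) can be sketched as a small pipeline. This is an assumption-laden illustration: the normalisation done in `transform` and the placeholder check `has_no_blanks` are invented for the example and are not the transformations or rules used by the framework.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def transform(raw: Record) -> Record:
    """Data transformation: normalise key names and trim string values."""
    return {
        key.strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in raw.items()
    }


def has_no_blanks(record: Record) -> bool:
    """Placeholder cleaning check: no field is None or an empty string."""
    return all(value is not None and value != "" for value in record.values())


def preprocess(
    raw_records: Iterable[Record],
    check: Callable[[Record], bool] = has_no_blanks,
) -> Tuple[List[Record], List[Record]]:
    """Run transformation, then cleaning; return (clean, flagged) records."""
    clean: List[Record] = []
    flagged: List[Record] = []
    for raw in raw_records:
        record = transform(raw)
        (clean if check(record) else flagged).append(record)
    return clean, flagged
```

Flagged records are returned rather than dropped so that a later repair step can still act on them.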

[0038] Figure 2 is a flow chart that gives an overview of the DRDCM method. It works as follows: effective rules are analyzed and extracted from the data source, then entered through the rule definition interface and stored in the rule library; every rule definition must conform to the rule template. Professor Zhou Aoying...
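The workflow above, in which rules enter a rule library through a definition interface and must satisfy a rule template, might be sketched as follows. The source does not give the template's fields, so `RULE_TEMPLATE`, the `RuleLibrary` class, and its method names are hypothetical and serve only to show template checking together with adding, modifying, and deleting rules at run time.

```python
from typing import Any, Dict, List

# Hypothetical rule template: the keys every rule definition must supply
# and the type each value must have. The real template is not given here.
RULE_TEMPLATE = {"name": str, "field": str, "kind": str, "params": dict}


class RuleLibrary:
    """In-memory rule store that accepts changes while the system runs."""

    def __init__(self) -> None:
        self._rules: Dict[str, Dict[str, Any]] = {}

    def upsert(self, definition: Dict[str, Any]) -> None:
        """Add a new rule or modify an existing one, after a template check."""
        for key, expected_type in RULE_TEMPLATE.items():
            if not isinstance(definition.get(key), expected_type):
                raise ValueError(
                    f"rule definition needs {key!r} of type {expected_type.__name__}"
                )
        self._rules[definition["name"]] = dict(definition)

    def delete(self, name: str) -> None:
        """Remove a rule; deleting an unknown rule is a no-op."""
        self._rules.pop(name, None)

    def rules_for(self, field: str) -> List[Dict[str, Any]]:
        """Return the rules currently attached to a field."""
        return [r for r in self._rules.values() if r["field"] == field]
```

Keying the store by rule name means that submitting a definition with an existing name replaces the old rule, so modification needs no code change or restart in this sketch.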


Abstract

The invention belongs to the field of big data processing and analysis and discloses a dynamically configurable rule-based data cleaning framework for big data settings. The framework adopts a new cross-domain, reusable, and configurable method that integrates data conversion, data checking, and data repair, improving both the descriptive power and the execution efficiency of the cleaning process. Experimental results on several real data sets indicate that the framework can seamlessly integrate dynamically configurable rules across multiple data sources and multiple application fields. It has also been implemented in several projects, which further verifies its effectiveness in real scenarios.

Description

Technical field

[0001] The invention belongs to the field of big data processing and analysis, and is a data cleaning framework based on dynamically configurable rules for use in a big data environment.

Background

[0002] Studies of the data of several well-known companies found that 25% of important data is flawed. A survey found that "dirty data" costs American companies about 600 billion US dollars in losses every year. A recent survey by Experian QAS Inc found that British companies lost £8 billion in 2011 because of quality problems with customer data. In fact, the market for data cleaning tools is growing at 17% per year, higher than the 7% average growth rate of other sectors of the IT industry. Although data cleaning research is constantly advancing, there is still no ready-made solution that can be directly used and deployed in different application fields without complex customization to automatically ...


Application Information

IPC(8): G06F17/30
CPC: G06F16/215
Inventor: 蒋同海, 朱会娟, 周喜, 程力, 赵凡, 马博
Owner XINJIANG TECHN INST OF PHYSICS & CHEM CHINESE ACAD OF SCI