An Unsupervised Automatic Data Cleaning Method

An unsupervised automatic data cleaning technology, applied in the field of data management. It addresses problems caused by erroneous data, such as increased delivery costs, wrong delivery of goods, and economic losses, and it saves labor costs while improving the accuracy and effectiveness of data cleaning.

Active Publication Date: 2022-03-01
SICHUAN CHANGHONG ELECTRIC CO LTD

AI Technical Summary

Problems solved by technology

In business, erroneous data can cause large financial losses.
For example, wrong customer information may cause purchased goods to be delivered incorrectly, which not only increases the company's delivery costs but also damages the company's image over the long term.
Among existing data cleaning methods, some require heavy manual participation in the cleaning process, such as providing cleaning suggestions or confirming repairs; others need no manual participation during cleaning but require the rules to be formulated in advance.
Existing data cleaning methods are therefore not applicable when the data rules are unknown or the labor costs are unaffordable.

Examples

Embodiment

[0049] As shown in Figure 1, the unsupervised automatic data cleaning method cleans data without predefined data quality patterns or rules and without manual intervention, while maintaining both the effectiveness and the efficiency of cleaning.

[0050] The method specifically includes the following steps:

[0051] S10. Data model learning:

[0052] To discover the hidden patterns or rules, the dependencies between attributes must be learned from the raw data, which may contain invalid data. Because invalid data may be present, absolute or strong dependencies between the attributes of a data table do not necessarily hold. The data model is obtained by finding the implicit non-absolute or relatively weak dependencies and expressing them in the form of a Bayesian network.
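As a rough illustration of what a "non-absolute" dependency looks like in dirty data, the sketch below scores attribute pairs by empirical mutual information and estimates a conditional probability table by counting. This is an assumed stand-in for illustration only, not the patent's structure learning procedure; the column names, the sample rows, and the helper names `mutual_information` and `conditional_table` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(rows, a, b):
    """Empirical mutual information (in bits) between columns a and b."""
    n = len(rows)
    count_a = Counter(r[a] for r in rows)
    count_b = Counter(r[b] for r in rows)
    count_ab = Counter((r[a], r[b]) for r in rows)
    return sum(
        (c / n) * log2((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )


def conditional_table(rows, parent, child):
    """Estimate P(child | parent) by counting co-occurrences."""
    joint = Counter((r[parent], r[child]) for r in rows)
    marginal = Counter(r[parent] for r in rows)
    return {(p, c): k / marginal[p] for (p, c), k in joint.items()}


# Hypothetical customer table; the fourth row is assumed to be dirty.
rows = [
    {"zip": "610000", "city": "Chengdu", "tier": "gold"},
    {"zip": "610000", "city": "Chengdu", "tier": "basic"},
    {"zip": "610000", "city": "Chengdu", "tier": "gold"},
    {"zip": "610000", "city": "Mianyang", "tier": "basic"},
    {"zip": "621000", "city": "Mianyang", "tier": "gold"},
    {"zip": "621000", "city": "Mianyang", "tier": "basic"},
]

mi_city = mutual_information(rows, "zip", "city")
mi_tier = mutual_information(rows, "zip", "tier")
cpt = conditional_table(rows, "zip", "city")
```

Here `zip` and `city` are clearly related even though one row violates the pattern, so the dependency is strong but not absolute, while `zip` and `tier` are independent in this sample. A Bayesian network would keep an edge for the first pair and store the estimated table as its parameters.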

[0053] The key sub-steps of this step are as follows:

[0054] S101. Evaluate and sample the data to be repaired;

[0055] S102. Learning the original data set or the sampled data set to obtain the struc...
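Step S101 above can be sketched as follows. The source does not say how the data are evaluated or sampled, so this example assumes that evaluation means measuring table size and per-column missing rates, and that sampling is a seeded uniform random sample. The function names, the `max_rows` threshold, and the sample data are all hypothetical.

```python
import random


def missing_rate(rows, columns):
    """Fraction of empty cells (None or "") per column."""
    n = len(rows)
    return {
        col: sum(1 for r in rows if r.get(col) is None or r.get(col) == "") / n
        for col in columns
    }


def sample_for_learning(rows, max_rows=1000, seed=0):
    """Return all rows if the table is small, otherwise a seeded random sample."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, max_rows)


# Hypothetical table: every fourth row has an empty city.
rows = [{"id": i, "city": "Chengdu" if i % 4 else ""} for i in range(20)]

rates = missing_rate(rows, ["id", "city"])
small = sample_for_learning(rows, max_rows=50)
sampled = sample_for_learning(rows, max_rows=8, seed=42)
```

Sampling only when the table exceeds a size threshold keeps small tables intact, and the fixed seed makes the learned model repeatable across runs.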

Abstract

The invention discloses an unsupervised automatic data cleaning method comprising the following steps: A. Data model learning: the dependencies between attributes are learned from the original data, which may contain invalid data, and a data model expressed in the form of a Bayesian network is obtained by finding the implicit non-absolute or relatively weak dependencies. B. Generation of data cleaning rules: after the complete data model of the original data or of a sample of the original data is obtained, the data cleaning rules are generated, specifically predicates and first-order predicate rules. C. A Markov logic network is generated from the predicates and first-order predicate rules produced in step B. D. Inference is performed on the Markov logic network generated in step C, and the data are cleaned according to the inference results. The method can effectively improve the data quality of a company's business systems without consuming large amounts of manpower and material resources, and it helps management make correct decisions.
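To make steps B through D more concrete, the sketch below derives weighted rules of the form `zip(t, p) => city(t, c)` from counted conditional probabilities and then applies them to repair cells. It is a deliberately simplified stand-in: real Markov logic network inference reasons jointly over all weighted formulas, whereas this example applies each rule greedily. The confidence threshold, the log-odds weighting, the function names, and the data are assumptions, not details from the patent.

```python
from collections import Counter
from math import log


def generate_rules(rows, parent, child, min_conf=0.6):
    """Emit weighted rules (parent == p) => (child == c) with confidence >= min_conf."""
    joint = Counter((r[parent], r[child]) for r in rows)
    marginal = Counter(r[parent] for r in rows)
    rules = []
    for (p, c), k in joint.items():
        if k / marginal[p] >= min_conf:
            # Smoothed log-odds of the rule holding; stays finite at confidence 1.0.
            weight = log((k + 1) / (marginal[p] - k + 1))
            rules.append((parent, p, child, c, weight))
    return rules


def clean(rows, rules):
    """Return repaired copies of rows; the input rows are not modified."""
    repaired = []
    for row in rows:
        fixed = dict(row)
        for parent, p, child, c, weight in rules:
            # With min_conf > 0.5 at most one rule exists per parent value,
            # so rules cannot conflict in this simplified setting.
            if weight > 0 and fixed[parent] == p and fixed[child] != c:
                fixed[child] = c
        repaired.append(fixed)
    return repaired


# Hypothetical customer table; the fourth row is assumed to be dirty.
rows = [
    {"zip": "610000", "city": "Chengdu"},
    {"zip": "610000", "city": "Chengdu"},
    {"zip": "610000", "city": "Chengdu"},
    {"zip": "610000", "city": "Mianyang"},
    {"zip": "621000", "city": "Mianyang"},
    {"zip": "621000", "city": "Mianyang"},
]

rules = generate_rules(rows, "zip", "city")
cleaned = clean(rows, rules)
```

The weights play the role that formula weights play in a Markov logic network: a rule supported by more of the data carries more weight, so a minority value that contradicts it is the one that gets repaired.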

Description

Technical field

[0001] The invention relates to the technical field of data management, in particular to an unsupervised automatic data cleaning method.

Background

[0002] Real-world data usually needs to be cleaned (data that needs cleaning is referred to below as dirty data) because it may contain inconsistent, noisy, incomplete, or duplicate values. In business, erroneous data can cause large financial losses. For example, wrong customer information may cause purchased goods to be delivered incorrectly, which not only increases the company's delivery costs but also damages the company's image over the long term.

[0003] Among existing data cleaning methods, some require heavy manual participation in the cleaning process, such as providing cleaning suggestions or confirming repairs; others need no manual participation during cleaning, but they need ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/215, G06K9/62
CPC: G06F18/295
Inventor: 李玲, 唐军, 吴纯彬, 于跃, 陈秋宇
Owner: SICHUAN CHANGHONG ELECTRIC CO LTD