Large-scale data abnormity detection method

A technology for large-scale data anomaly detection, applied in electrical digital data processing, special data processing applications, genetic models, etc. It addresses the poor detection performance of existing algorithms and their sensitivity to complex data, with the effects of improving anomaly detection performance, reducing the computing workload, and reducing the amount of data.

Inactive Publication Date: 2017-10-24
UNIV OF ELECTRONICS SCI & TECH OF CHINA
0 Cites, 27 Cited by

AI Technical Summary

Problems solved by technology

However, because the SCIFOREST algorithm was designed for and tested on experimental data only, in practical work its detection performance is poor on unbalanced, mixed, and high-dimensional large-scale data, and it is easily affected by complex data.

Examples

Embodiment Construction

[0039] As shown in Figure 1, the large-scale data anomaly detection method of the present invention includes:

[0040] A. Data preprocessing and feature extraction: Perform the necessary preprocessing on the original data, including data integration, data reduction, and data cleaning, to obtain the preprocessed data set and sample subsets. Then perform feature extraction on the preprocessed data, including:

[0041] A1. Data resampling: Balance the preprocessed samples according to a preset ratio of positive to negative classes, reducing the impact of negative samples on feature extraction;
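Step A1 can be illustrated with a minimal sketch. The source does not say how the preset ratio is enforced, so the following assumes random undersampling of the negative class, and assumes that label 1 marks the positive class and any other label the negative class. The function name `resample_to_ratio` and the parameter `neg_per_pos` are hypothetical, introduced only for illustration.

```python
import random


def resample_to_ratio(samples, labels, neg_per_pos=1.0, seed=0):
    """Randomly undersample negatives to at most neg_per_pos * (#positives).

    Assumptions (not from the source): label 1 is the positive class,
    every other label is negative, and balancing is done by dropping
    negatives rather than by synthesizing positives.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y != 1]

    # Number of negatives allowed by the preset ratio, capped at what exists.
    target_neg = min(len(neg_idx), int(round(neg_per_pos * len(pos_idx))))
    kept_neg = rng.sample(neg_idx, target_neg)

    # Preserve the original ordering of the retained samples.
    keep = sorted(pos_idx + kept_neg)
    return [samples[i] for i in keep], [labels[i] for i in keep]
```

With 10 positives, 90 negatives, and `neg_per_pos=2.0`, this sketch keeps all 10 positives and 20 randomly chosen negatives.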

[0042] A2. Information gain rate calculation: Calculate the information gain rate of each feature from the data of multiple sample subsets, and sort the results to form multiple feature sets. The information gain rate of a feature is calculated as follows:

[0043] Suppose the data set is D and the features are A_i (i = 1, ..., k). First calculate the entropy H...
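The formula in the source is truncated, so the sketch below does not reproduce the patent's exact expression. It assumes the standard C4.5-style definition for discrete features: the information gain H(D) - H(D | A_i) divided by the split information H(A_i). The function names `entropy`, `gain_ratio`, and `rank_features` are hypothetical; continuous features would first need to be binned.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain ratio of one discrete feature with respect to labels.

    Assumed definition: (H(D) - H(D | A)) / H(A).
    """
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)

    # Conditional entropy H(D | A): label entropy within each feature value,
    # weighted by how often that value occurs.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)

    # A feature with a single value carries no split information.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, labels):
    """Return feature names sorted by descending gain ratio."""
    scores = {name: gain_ratio(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Running `rank_features` on each sample subset and collecting the resulting orderings is one way to obtain the "multiple feature sets" that step A2 describes.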

Abstract

The invention relates to a large-scale data anomaly detection method comprising the following steps: A. data preprocessing and feature extraction are performed; B. hyperplanes are computed with twin support vector machines, and a hyperplane criterion function for partitioning the data space is constructed; C. an isolation tree is formed: the isolation tree is built using the partition criterion given by the twin support vector machine hyperplanes; D. an isolation forest is formed: step C is repeated to construct multiple isolation trees, which together form the isolation forest; and E. the isolation forest is traversed and the anomaly score is calculated: the data under test traverses the isolation forest, and the resulting anomaly score serves as the criterion for judging the degree of abnormality, from which the presence of abnormal data in the original data is determined. The method effectively reduces the volume of data to be examined and thus the computational workload, improves anomaly detection accuracy without significantly increasing running time, and greatly improves anomaly detection performance on high-dimensional data.
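Steps C to E can be sketched roughly as follows. This is not the patented method: the twin support vector machine hyperplane of step B is replaced here by a randomly drawn hyperplane, purely to keep the example short, and the scoring formula is the one commonly used for isolation forests, s(x) = 2^(-E[h(x)] / c(n)), which the source text shown here does not spell out. All function names and default parameters are hypothetical, and the sketch assumes at least two data points.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Step C (simplified): split recursively with a random hyperplane."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    normal = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    proj = [sum(w * x for w, x in zip(normal, p)) for p in points]
    lo, hi = min(proj), max(proj)
    if lo == hi:  # all points project to the same value, cannot split
        return {"size": len(points)}
    offset = rng.uniform(lo, hi)
    left = [p for p, v in zip(points, proj) if v < offset]
    right = [p for p, v in zip(points, proj) if v >= offset]
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def build_forest(data, n_trees=100, sample_size=256, seed=0):
    """Step D: repeat tree construction on random subsamples."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    max_depth = max(1, math.ceil(math.log2(size)))
    trees = [build_tree(rng.sample(data, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size


def path_length(x, node):
    """Depth at which x reaches a leaf, adjusted for unsplit leaf points."""
    depth = 0
    while "normal" in node:
        v = sum(w * xi for w, xi in zip(node["normal"], x))
        node = node["left"] if v < node["offset"] else node["right"]
        depth += 1
    return depth + c(node["size"])


def anomaly_score(x, trees, size):
    """Step E: scores near 1 suggest anomalies, well below 0.5 normal data."""
    mean_depth = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c(size))
```

A point that lies far from the bulk of the data tends to be separated after only a few splits, so its average path length is short and its score is higher than that of a point inside a dense cluster.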

Description

Technical field

[0001] The invention relates to a data mining method, in particular to a large-scale data anomaly detection method.

Background technique

[0002] Anomaly detection refers to discovering, by appropriate technical means, data objects that differ markedly from most other data. Generally speaking, such data make up a very small fraction compared with normal data. The objects of anomaly detection are called anomalies, isolated points, or outliers. Although these data are often hidden among normal data and cannot be found directly, important information may lie behind them, which gives them great research value. In 1980, Hawkins first defined an outlier as a value so significantly different from other values that it raises the suspicion that it was produced by a different, unknown mechanism. Since then, outliers have no longer been treated in data mining as mere noise, or as data to be discarded in the preprocessing stage. W...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06F17/30, G06N3/12
CPC: G06F16/215, G06N3/126, G06F18/2411, G06F18/24323
Inventors: 罗光春, 殷光强, 田玲, 闫科
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA