Weight clustering and under-sampling-based unbalanced data classification method

A data classification and under-sampling technique, applied in the computer field, that addresses the loss of majority-class sample information and the indeterminate degree of compensation. Its stated effects are improving classification accuracy, optimizing random under-sampling, and increasing computing overhead.

Status: Inactive; Publication Date: 2017-05-31
CENT SOUTH UNIV

AI Technical Summary

Problems solved by technology

In each round of Adaboost iteration, the RUSBoost algorithm under-samples by drawing samples at random from the majority class, so information carried by the majority class samples is lost.
Even though the Boosting method compensates for the lost information to some extent, the degree of compensation is random and cannot be determined.
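The information loss described above can be illustrated with a minimal sketch of one random under-sampling step. This is not RUSBoost itself: the function name `random_undersample`, the 1:1 target balance, and the toy 90/10 data are assumptions made for illustration only.

```python
import random

def random_undersample(majority, minority, rng):
    """Keep all minority samples and a uniform random subset of the
    majority class of the same size (assumed 1:1 target balance)."""
    k = min(len(minority), len(majority))
    kept_majority = rng.sample(majority, k)  # sampling without replacement
    return kept_majority + list(minority)

# Toy data with a 9:1 class ratio (hypothetical sample identifiers).
majority = [("maj", i) for i in range(90)]
minority = [("min", i) for i in range(10)]

# Two boosting rounds would each draw their own random subset.
round_1 = random_undersample(majority, minority, random.Random(0))
round_2 = random_undersample(majority, minority, random.Random(1))

# Majority samples that a round never sees: this is the lost information.
dropped_1 = set(majority) - set(round_1)
dropped_2 = set(majority) - set(round_2)
print(len(round_1), len(dropped_1), len(dropped_2))
```

Which majority samples end up in `dropped_1` or `dropped_2` depends only on the random draw, not on how informative those samples are, which is the weakness the text points out.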



Examples


Embodiment 1

[0078] In this implementation, the data used are 1000 manually generated two-dimensional data points, and the ratio of the majority class to the minority class is 9:1. Figure 2(a) shows general unbalanced data with a clear boundary between the majority class and the minority class. Figure 2(b) shows data in which the majority class overlaps with the minority class. Figure 2(c) shows imbalanced data in which the minority class is split into separate subsets. In the figures, 'x' marks a majority class sample and 'o' marks a minority class sample. Table 1 lists the algorithms used in the experimental comparison. In this implementation, for unbalanced data with each of these three distributions, the clustering method based on sample weight variance proposed by the present invention is compared experimentally with K-Means clustering (CEU) and hierarchical clustering (EHCU); the sample points of the same gray level...
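A data set of the kind described in this embodiment could be generated along the following lines. This is a sketch under stated assumptions: the source gives only the size (1000 points), the dimensionality (2) and the 9:1 ratio, so the Gaussian blobs, the centers, the spread and the function name `make_imbalanced_2d` are all hypothetical.

```python
import random

def make_imbalanced_2d(n_total=1000, ratio=9, seed=0,
                       majority_center=(0.0, 0.0),
                       minority_center=(4.0, 4.0),
                       spread=1.0):
    """Return (points, labels) with a majority:minority ratio of ratio:1.

    Labels follow the markers in the text: 'x' = majority, 'o' = minority.
    Centers and spread are illustrative assumptions, not from the source.
    """
    rng = random.Random(seed)
    n_minority = n_total // (ratio + 1)
    n_majority = n_total - n_minority

    def blob(center, n):
        # n points drawn from an isotropic Gaussian around `center`
        return [(rng.gauss(center[0], spread), rng.gauss(center[1], spread))
                for _ in range(n)]

    points = blob(majority_center, n_majority) + blob(minority_center, n_minority)
    labels = ["x"] * n_majority + ["o"] * n_minority
    return points, labels

points, labels = make_imbalanced_2d()
print(labels.count("x"), labels.count("o"))
```

With well-separated centers this resembles the clear-boundary case; moving `minority_center` toward `majority_center` would give an overlapping case, and drawing the minority class from several centers would give separated minority subsets.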

Embodiment 2

[0084] In this implementation, 22 groups of KEEL data with different practical application backgrounds are selected as experimental test data. Across the selected data sets, the ratio of majority class samples to minority class samples ranges from 9.09 to 128. For data with multiple categories, some categories are combined or only two categories are taken. To make the results more reliable, each validation on each data set is run 5 times and the AUC results are averaged. Table 2 shows the experimental results of each comparison algorithm and of the proposed algorithm on the 22 imbalanced data sets.
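The two quantities used in this paragraph, the class imbalance ratio and the AUC averaged over repeated runs, can be computed as in the sketch below. The pairwise (rank-based) AUC formula is standard; the function names and the five score lists are made-up illustrations, not results from the source.

```python
def imbalance_ratio(n_majority, n_minority):
    """Number of majority samples per minority sample."""
    return n_majority / n_minority

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 = minority (positive) class, 0 = majority (negative) class.
    Tied scores count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(runs, labels):
    """Average the AUC over several repeated runs on the same test labels."""
    return sum(auc(scores, labels) for scores in runs) / len(runs)

# Hypothetical classifier scores from 5 repeated runs on 6 test samples.
labels = [1, 0, 0, 0, 0, 0]
runs = [
    [0.9, 0.2, 0.3, 0.1, 0.4, 0.2],
    [0.6, 0.7, 0.3, 0.1, 0.4, 0.2],
    [0.8, 0.2, 0.3, 0.1, 0.4, 0.2],
    [0.5, 0.5, 0.3, 0.1, 0.4, 0.2],
    [0.7, 0.2, 0.3, 0.1, 0.4, 0.2],
]
print(mean_auc(runs, labels))
```

AUC is a common choice for imbalanced data because, unlike plain accuracy, it is not dominated by the majority class.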

[0085] Table 2 AUC index experimental results

[0086]

[0087] The results show that the algorithm proposed by the present invention has better performance than other algorithms on more data sets, and the comprehensive average AUC value is the largest. Algorithms improved by an ave...


Abstract

The classification of unbalanced data sets has become one of the most challenging problems in data mining. Because minority class samples are far fewer than majority class samples, conventional algorithms classify the minority class with low accuracy and poor generalization performance. Algorithm integration (ensemble learning) has become an important way of dealing with this problem, and integrated algorithms based on random under-sampling or on clustering can effectively improve classification performance. However, the former easily causes information loss, and the latter is computationally complex and difficult to popularize. The invention provides an improved integrated classification algorithm that combines weight clustering with under-sampling, specifically a weight clustering and under-sampling-based unbalanced data classification method. The algorithm divides the samples into clusters according to their weights, extracts from each cluster a certain proportion of majority class samples according to their weight values together with all minority class samples to form a balanced data set, and integrates the classifiers within the Adaboost algorithm framework, thereby improving the classification effect. Experimental results show that the algorithm is accurate, simple and highly stable.
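The sampling step outlined in the abstract can be sketched roughly as follows. This is an illustration, not the patented procedure: the abstract does not say how clusters are formed or which samples are taken from each cluster, so the equal-size split over sorted weights, the preference for the highest-weight samples, and the names `weight_cluster_undersample` and `n_clusters` are all assumptions.

```python
def weight_cluster_undersample(weights, labels, n_clusters=3, majority_label=0):
    """Return indices of a roughly balanced training subset.

    Assumptions (not from the source): majority samples are sorted by their
    current boosting weight and cut into `n_clusters` contiguous groups, and
    each group contributes its highest-weight samples in proportion to its size.
    """
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    if not majority or not minority:
        raise ValueError("both classes must be present")

    # 1. Cluster majority samples by weight (contiguous groups of sorted weights).
    ordered = sorted(majority, key=lambda i: weights[i])
    k = min(n_clusters, len(ordered))
    size, extra = divmod(len(ordered), k)
    clusters, start = [], 0
    for c in range(k):
        end = start + size + (1 if c < extra else 0)
        clusters.append(ordered[start:end])
        start = end

    # 2. Take the same proportion from every cluster, so that the selected
    #    majority samples roughly match the minority class in number.
    proportion = min(1.0, len(minority) / len(majority))
    selected = []
    for cluster in clusters:
        quota = max(1, round(proportion * len(cluster)))
        by_weight = sorted(cluster, key=lambda i: weights[i], reverse=True)
        selected.extend(by_weight[:quota])

    # 3. Add every minority sample.
    return sorted(selected + minority)

# Toy example: 90 majority samples (label 0), 10 minority samples (label 1).
labels = [0] * 90 + [1] * 10
weights = [(i % 9) + 1 for i in range(100)]  # hypothetical boosting weights
subset = weight_cluster_undersample(weights, labels)
print(len(subset))
```

In an Adaboost loop, a step like this would be repeated in every round with the updated sample weights, and a weak classifier would be trained on the returned subset. Because every weight cluster contributes samples, the selection is less arbitrary than a purely random draw.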

Description

Technical field

[0001] The invention belongs to the technical field of computers, and in particular relates to an unbalanced data set classification method based on Adaboost algorithm weight clustering and under-sampling.

Background technique

[0002] With the development of Internet technology, the type and quantity of information that people obtain are increasing rapidly. A large amount of data noise and more complex data release types will bring new challenges to our data analysis. Among them, the classification of imbalanced datasets has become one of the most challenging problems in data mining, which widely exists in medical diagnosis, credit evaluation and other fields. In unbalanced data, the number of samples of the majority class is far greater than that of the minority class. If ordinary machine learning methods and evaluation criteria are used, the minority class may be ignored or even directly treated as noise. Therefore, it is often difficult for ordinary mac...


Application Information

IPC(8): G06K9/62
CPC: G06F18/2148, G06F18/24
Inventors: 邓晓衡, 钟维坚, 任炬
Owner: CENT SOUTH UNIV