Weight clustering and under-sampling-based unbalanced data classification method

A data classification and under-sampling technology, applied in the computer field, that addresses the loss of majority class sample information and the uncertain degree of compensation, with the stated effects of improving classification accuracy, optimizing random under-sampling, and increasing computing overhead.

Inactive Publication Date: 2017-05-31

CENT SOUTH UNIV

0 Cites, 24 Cited by

AI Technical Summary

Problems solved by technology

In each round of Adaboost iteration, the RUSBoost algorithm under-samples by randomly extracting samples from the majority class, so the method loses information carried by the majority class samples.

Even though the Boosting method compensates for the lost information to a certain extent, the degree of compensation is random and cannot be determined.
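
The information loss described above can be illustrated with a minimal sketch of one round of random under-sampling. This is not the RUSBoost implementation itself; the function name `random_undersample`, the label convention (1 for the minority class) and the 900/100 split are assumptions chosen for illustration.

```python
import numpy as np

def random_undersample(y, minority_label=1, seed=0):
    """Keep every minority sample and an equally sized random draw of
    majority samples (hypothetical helper, not taken from the patent)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # The draw ignores sample weights, so which majority samples survive
    # in a given round is left to chance.
    kept = rng.choice(majority, size=len(minority), replace=False)
    return np.concatenate([minority, kept])

y = np.array([0] * 900 + [1] * 100)
idx = random_undersample(y)
dropped = 900 - int((y[idx] == 0).sum())
print(dropped)  # 800 majority samples are not used in this round
```

With a 9:1 ratio, most of the majority class is discarded in every round, and which samples are discarded depends only on the random draw.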

Examples

Embodiment 1

[0078] In this embodiment, the data are 1000 manually generated two-dimensional points, with a majority-to-minority class ratio of 9:1. Figure 2(a) shows general unbalanced data with a clear boundary between the majority class and the minority class. Figure 2(b) shows data in which the majority class overlaps with the minority class. Figure 2(c) shows unbalanced data in which the minority class is separated into subsets. In the figures, 'x' marks a majority class sample and 'o' marks a minority class sample. Table 1 lists the algorithms used in the experimental comparison. Under these three distributions of unbalanced data, the clustering method based on sample weight variance proposed by the present invention is compared experimentally with K-Means clustering (CEU) and hierarchical clustering (EHCU); the sample points of the same gray level...
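
A data set of this shape could be generated along the following lines. This is only a sketch: the patent does not state how its points were produced, so the Gaussian blobs, their centers and spreads, and the names `make_imbalanced`, `"separated"`, `"overlapping"` and `"split"` are all assumptions.

```python
import numpy as np

def make_imbalanced(kind, n=1000, ratio=9, seed=0):
    """Generate n two-dimensional points with a ratio:1 class imbalance.

    kind selects one of three hypothetical layouts that loosely mirror
    Figure 2(a)-(c): "separated", "overlapping" or "split".
    """
    rng = np.random.default_rng(seed)
    n_min = n // (ratio + 1)          # 100 minority points for n=1000, ratio=9
    n_maj = n - n_min                 # 900 majority points
    X_maj = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_maj, 2))

    if kind == "separated":           # clear boundary between the classes
        X_min = rng.normal([5.0, 5.0], 0.5, size=(n_min, 2))
    elif kind == "overlapping":       # minority sits inside the majority cloud
        X_min = rng.normal([1.0, 1.0], 0.5, size=(n_min, 2))
    elif kind == "split":             # minority forms two separate subsets
        half = n_min // 2
        X_min = np.vstack([
            rng.normal([5.0, 5.0], 0.5, size=(half, 2)),
            rng.normal([-5.0, 5.0], 0.5, size=(n_min - half, 2)),
        ])
    else:
        raise ValueError(f"unknown kind: {kind}")

    X = np.vstack([X_maj, X_min])
    y = np.concatenate([np.zeros(n_maj, dtype=int), np.ones(n_min, dtype=int)])
    return X, y

X, y = make_imbalanced("split")
print(X.shape, int(y.sum()))  # (1000, 2) 100
```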

Embodiment 2

[0084] In this embodiment, 22 groups of KEEL data with different practical application backgrounds are selected as experimental test data. In the selected data sets, the smallest ratio of majority class samples to minority class samples is 9.09, and the largest is 128. For data with multiple categories, some categories are combined or only two categories are taken. To make the results more reliable, 5 experiments are conducted for each validation of each data set and the AUC results are averaged. Table 2 shows the experimental results of each comparison algorithm and the proposed algorithm on the 22 unbalanced data sets.
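
The averaged AUC figure can be sketched as follows. The patent does not say how AUC was computed, so this uses the standard rank-based definition (the probability that a random minority sample scores higher than a random majority sample, with ties counted as one half). The helper names `auc_score` and `mean_auc` and the toy scores are assumptions, not values from Table 2.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count as half. Label 1 is assumed to be the minority class."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive against every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def mean_auc(runs):
    """Average AUC over repeated runs, each given as (y_true, scores)."""
    return float(np.mean([auc_score(y, s) for y, s in runs]))

# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```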

[0085] Table 2 AUC index experimental results

[0086]

[0087] The results show that the algorithm proposed by the present invention has better performance than other algorithms on more data sets, and the comprehensive average AUC value is the largest. Algorithms improved by an ave...

Abstract

The classification of unbalanced data sets has become one of the most challenging problems in data mining. Because minority class samples are far fewer than majority class samples, conventional algorithms classify the minority class with low accuracy and poor generalization performance. Ensemble algorithms have become an important way of dealing with the problem; ensembles based on random under-sampling and ensembles based on clustering can both effectively improve classification performance. However, the former easily causes information loss, and the latter is computationally complex and difficult to popularize. The invention provides an improved ensemble classification algorithm that combines weight clustering with under-sampling, specifically a weight clustering and under-sampling-based unbalanced data classification method. The algorithm divides the samples into clusters according to their weights, extracts a certain proportion of majority class samples from each cluster according to the sample weight values, combines them with all minority class samples to form a balanced data set, and integrates the classifiers within the Adaboost algorithm framework, so that the classification effect is improved. Experimental results show that the algorithm is accurate, simple and highly stable.
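
The sampling step in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the patented procedure: the source does not reproduce here how clusters are derived from the weights, so the sketch simply sorts majority samples by weight and cuts the order into nearly equal groups. The function name `weight_cluster_undersample`, the `n_clusters` parameter and the label convention (1 for the minority class) are hypothetical.

```python
import numpy as np

def weight_cluster_undersample(y, weights, n_clusters=3, seed=0):
    """Return indices of a roughly balanced subset: all minority samples
    plus weight-guided draws from each cluster of majority samples.

    Assumptions (not from the source): label 1 is the minority class,
    weights are strictly positive, and clusters are formed by sorting the
    majority samples by weight and splitting them into n_clusters groups.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    weights = np.asarray(weights, dtype=float)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y != 1)

    # Group majority samples that currently have similar weights.
    order = majority[np.argsort(weights[majority])]
    clusters = np.array_split(order, n_clusters)

    picked = []
    for cluster in clusters:
        if len(cluster) == 0:
            continue
        # Each cluster contributes in proportion to its size, so the total
        # majority draw is close to the number of minority samples.
        k = int(round(len(minority) * len(cluster) / len(majority)))
        k = min(max(k, 1), len(cluster))
        # Inside a cluster, samples with larger weight are more likely drawn.
        p = weights[cluster] / weights[cluster].sum()
        picked.append(rng.choice(cluster, size=k, replace=False, p=p))

    return np.concatenate([minority] + picked)

y = np.array([0] * 90 + [1] * 10)
w = np.full(100, 0.01)                 # uniform weights, as in a first boosting round
idx = weight_cluster_undersample(y, w)
print(int((y[idx] == 0).sum()), int((y[idx] == 1).sum()))  # 9 10
```

In an Adaboost-style loop, the weights would be updated after each weak classifier is trained, so the clusters and the drawn subset would change from round to round. Every weight range is represented in each round, in contrast to a purely random draw.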

Description

Technical field

[0001] The invention belongs to the technical field of computers, and in particular relates to an unbalanced data set classification method based on Adaboost algorithm weight clustering and under-sampling.

Background technique

[0002] With the development of Internet technology, the type and quantity of information that people obtain are increasing rapidly. A large amount of data noise and more complex data release types will bring new challenges to our data analysis. Among them, the classification of imbalanced datasets has become one of the most challenging problems in data mining, which widely exists in medical diagnosis, credit evaluation and other fields. In unbalanced data, the number of samples of the majority class is far greater than that of the minority class. If ordinary machine learning methods and evaluation criteria are used, the minority class may be ignored or even directly treated as noise. Therefore, it is often difficult for ordinary mac...

Application Information

IPC(8): G06K9/62

CPC: G06F18/2148, G06F18/24

Inventor: Deng Xiaoheng (邓晓衡), Zhong Weijian (钟维坚), Ren Ju (任炬)

Owner: CENT SOUTH UNIV
