
Unbalanced data set conversion method and system based on sampling and feature reduction

A technology for unbalanced data set conversion, applied in the fields of instruments, character and pattern recognition, and computer components. It addresses two problems: existing oversampling does not consider that different minority class samples differ in importance to the classifier, and irrelevant features make the classifier training process more complex. The intended effect is improved classification accuracy.

Status: Inactive; Publication Date: 2019-10-18
COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI
Cites: 0; Cited by: 4

AI Technical Summary

Problems solved by technology

[0004] 1. Existing methods for oversampling minority class samples treat all minority class samples equally, without considering that different minority class samples differ in importance to the classifier. 2. The features of the data set have a very important impact on classifier performance; if the features include many fields that have no effect on the classification result, they add considerable complexity to the training process of the classifier.



Examples


Embodiment 1

[0072] Embodiment 1 of the present invention provides an unbalanced data set conversion method based on sampling and feature reduction. As shown in Figure 1, the method includes the following steps:

[0073] S1: Obtain an unbalanced data set, which includes a majority class sample set and a minority class sample set;

[0074] S2: Sample the unbalanced data set to obtain a new unbalanced data set, including using the S-NKSMOTE algorithm to oversample the minority class sample set. Referring to Figure 2, specifically:

[0075] S21: Obtain the k nearest neighbor samples of a sample x in the minority class sample set;

[0076] Here, the k nearest neighbor samples are the k samples closest to the sample x in the kernel space; the value of k is configurable, for example 100 or 500;

[0077] S22: Compare the number of minority class samples among the k nearest neighbor samples with the number of majority class samples, when the number of minority class sa...
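
As an illustration of S21 and the counting part of S22, the following is a minimal sketch, not the patent's implementation. It assumes a Gaussian (RBF) kernel, which the text does not specify, and uses the standard identity that the squared distance in kernel space is K(a, a) - 2K(a, b) + K(b, b). The names `rbf_kernel`, `kernel_distance`, `neighbour_class_counts`, and the parameter `gamma` are hypothetical. What S22 does with the two counts is cut off in the text and is not reproduced here.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel. Assumed choice: the text does not name a kernel."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def kernel_distance(a, b, kernel=rbf_kernel):
    """Distance between a and b in the feature space induced by `kernel`."""
    squared = kernel(a, a) - 2.0 * kernel(a, b) + kernel(b, b)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors

def neighbour_class_counts(i, samples, labels, k, minority_label, kernel=rbf_kernel):
    """S21: find the k nearest neighbours of samples[i] in kernel space.
    S22 (counting only): return (minority count, majority count) among them."""
    ranked = sorted(
        (kernel_distance(samples[i], s, kernel), lab)
        for j, (s, lab) in enumerate(zip(samples, labels))
        if j != i  # a sample is not its own neighbour
    )
    nearest = ranked[:k]
    n_minority = sum(1 for _, lab in nearest if lab == minority_label)
    return n_minority, len(nearest) - n_minority

# Toy data: label 1 is the minority class. k=3 only suits this tiny example.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (0.2, 0.1)]
labels = [1, 0, 0, 0, 0, 1]
n_min, n_maj = neighbour_class_counts(0, samples, labels, k=3, minority_label=1)
```

With a real data set, k would be set much larger, in line with the values of 100 or 500 mentioned in [0076].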

Embodiment 2

[0094] Embodiment 2 of the present invention provides an unbalanced data set conversion method based on sampling and feature reduction. The method includes the following steps:

[0095] S1: Obtain an unbalanced data set, which includes a majority class sample set and a minority class sample set;

[0096] S2: Sample the unbalanced data set to obtain a new unbalanced data set. Referring to Figure 4, step S2 specifically includes:

[0097] S210: Acquire boundary sample sets of the majority class sample set and the minority class sample set;

[0098] Referring to Figure 5, step S210 specifically includes the following, where all distances mentioned below are distances in the kernel space:

[0099] S211: Calculate the distance between each majority class sample and its nearest minority class sample in the majority class sample set;

[0100] S212: Calculate the distance between each minority class sample and its nearest majority class sample in the minority ...
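
Steps S211 and S212 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the RBF kernel and the names `rbf_kernel`, `kernel_distance`, and `nearest_opposite_distances` are hypothetical. How the boundary sample sets are derived from these distances is cut off in the text, so the sketch stops at the distance computation.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel. Assumed choice: the text does not name a kernel."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def kernel_distance(a, b, kernel=rbf_kernel):
    """Distance in kernel space: sqrt(K(a,a) - 2K(a,b) + K(b,b))."""
    squared = kernel(a, a) - 2.0 * kernel(a, b) + kernel(b, b)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors

def nearest_opposite_distances(own, other, kernel=rbf_kernel):
    """For each sample in `own`, the kernel-space distance to its nearest
    sample in `other` (the opposite class)."""
    return [min(kernel_distance(s, o, kernel) for o in other) for s in own]

# Toy data for illustration only.
majority = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
minority = [(0.5, 0.0), (0.5, 1.0)]

d_major = nearest_opposite_distances(majority, minority)  # S211
d_minor = nearest_opposite_distances(minority, majority)  # S212
```

In this toy data the majority sample at (3.0, 3.0) lies far from every minority sample, so its entry in `d_major` is the largest.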

Embodiment 3

[0138] Embodiment 3 of the present invention provides an unbalanced data set conversion system based on sampling and feature reduction. As shown in Figure 9, the conversion system includes:

[0139] A data acquisition module 1 for obtaining an unbalanced data set, the unbalanced data set including a majority class sample set and a minority class sample set;

[0140] A sampling processing module 2 for sampling the unbalanced data set to obtain a new unbalanced data set;

[0141] A dimensionality reduction processing module 3 for performing dimensionality reduction on the new unbalanced data set and converting it into a new unbalanced data set with reduced features.

[0142] Continuing to refer to Figure 9, the sampling processing module 2 includes:

[0143] Boundary sample acquisition submodule 210: used to obtain the boundary sample set of the majority class sample set and the minority class sample set; wherein, the boundary sample acquisition submodu...
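
The three-module structure of [0139] to [0141] can be pictured as a simple pipeline. The sketch below is only one possible arrangement; the class `ConversionSystem` and all function names are hypothetical. The two processing functions are deliberate placeholders (naive duplication and dropping the last feature), not the sampling or feature reduction methods of the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Dataset = Tuple[List[Sample], List[int]]  # (feature vectors, class labels)

@dataclass
class ConversionSystem:
    acquire: Callable[[], Dataset]          # data acquisition module 1
    resample: Callable[[Dataset], Dataset]  # sampling processing module 2
    reduce: Callable[[Dataset], Dataset]    # dimensionality reduction processing module 3

    def convert(self) -> Dataset:
        """Run the three modules in order."""
        return self.reduce(self.resample(self.acquire()))

def toy_acquire() -> Dataset:
    X = [(0.0, 1.0, 5.0), (0.1, 0.9, 5.0), (0.2, 1.1, 5.0), (1.0, 0.0, 5.0)]
    return X, [0, 0, 0, 1]

def duplicate_minority(ds: Dataset) -> Dataset:
    # Placeholder only: naive duplication, NOT the patent's sampling method.
    X, y = ds
    counts = {lab: y.count(lab) for lab in set(y)}
    minority = min(counts, key=counts.get)
    pool = [x for x, lab in zip(X, y) if lab == minority]
    deficit = max(counts.values()) - counts[minority]
    extra = [pool[i % len(pool)] for i in range(deficit)]
    return X + extra, y + [minority] * len(extra)

def drop_last_feature(ds: Dataset) -> Dataset:
    # Placeholder only: NOT the patent's feature reduction method.
    X, y = ds
    return [tuple(x[:-1]) for x in X], y

system = ConversionSystem(toy_acquire, duplicate_minority, drop_last_feature)
X_new, y_new = system.convert()
```

Keeping each module behind a plain callable means a submodule such as the boundary sample acquisition submodule 210 could be swapped in without changing the pipeline itself.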


Abstract

The invention provides an unbalanced data set conversion method and system based on sampling and feature reduction. The method comprises the following steps: sampling the samples in an unbalanced data set so that the number of minority class samples is close to the number of majority class samples; sorting the features in descending order of their correlation with the category labels; deleting features one dimension at a time, starting from the last feature in that order; each time a feature is deleted, inputting the reduced sample data set into a random forest model and calculating the corresponding ACC (accuracy) value; and comparing all ACC values and selecting the feature dimension with the maximum ACC value as the target feature dimension of the feature reduction. Inputting the new unbalanced data set obtained by this conversion method into a multi-class SVM for training can significantly improve classification accuracy.
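
The feature reduction loop described in the abstract can be sketched as follows. This is a minimal sketch under several assumptions. Pearson correlation is used as the correlation measure, which the abstract does not specify. The evaluator is a dependency-free stand-in (leave-one-out 1-nearest-neighbour accuracy) rather than the random forest model named in the abstract; a random forest accuracy function could be passed as `evaluate` instead. The full feature set is also evaluated, and ties keep the larger set. All function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def rank_features(X, y):
    """Feature indices sorted by |correlation with the label|, largest first."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def loo_1nn_accuracy(X, y):
    """Stand-in evaluator (NOT the random forest of the abstract):
    leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    hits = 0
    for i, xi in enumerate(X):
        others = (k for k in range(len(X)) if k != i)
        j = min(others, key=lambda k: sum((a - b) ** 2 for a, b in zip(xi, X[k])))
        hits += y[j] == y[i]
    return hits / len(X)

def select_feature_dimension(X, y, evaluate=loo_1nn_accuracy):
    """Drop the lowest-ranked feature one at a time and keep the best-scoring set."""
    order = rank_features(X, y)
    best_acc, best_keep = -1.0, order
    for m in range(len(order), 0, -1):
        keep = order[:m]
        acc = evaluate([[row[j] for j in keep] for row in X], y)
        if acc > best_acc:  # strict comparison: ties keep the larger set
            best_acc, best_keep = acc, keep
    return best_keep, best_acc

# Toy data: feature 0 separates the classes, feature 1 is large-scale noise.
X = [[0.0, 5.0], [0.1, -5.0], [0.2, 4.0], [1.0, -4.0], [1.1, 5.0], [1.2, -5.0]]
y = [0, 0, 0, 1, 1, 1]
keep, acc = select_feature_dimension(X, y)
```

On this toy data the noisy feature hurts the stand-in classifier, so the loop keeps only feature 0.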

Description

Technical field

[0001] The invention belongs to the technical field of unbalanced data conversion, and in particular relates to an unbalanced data set conversion method and system based on sampling and feature reduction.

Background technique

[0002] An unbalanced data set conversion method reconstructs the data set at the data level when classifying an unbalanced data set, so as to reduce the degree of imbalance and improve classification accuracy. Unbalanced data set classification refers to the classification problem in which the classes have unequal numbers of samples. Taking binary classification as an example, the proportion of one class of samples is significantly higher than that of the other. The samples with the large proportion form the majority class sample set, and the samples with the small proportion form the minority class sample set. Unbalanced data arises widely in real life, such as risk intrusion detection, r...


Application Information

IPC(8): G06K9/62
CPC: G06F18/2411, G06F18/214
Inventors: 龙春, 魏金侠, 万巍, 赵静, 杨帆
Owner: COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI