Unsupervised feature selecting method based on conditional mutual information and K-means

A feature selection method using conditional mutual information technology, applied in computer parts, character and pattern recognition, instruments, etc. It addresses the problems of reduced classification accuracy, data imbalance, and inapplicability, with the effects of reducing redundancy, eliminating randomness, and increasing relevance.

Inactive Publication Date: 2017-03-15
NANJING UNIV OF INFORMATION SCI & TECH

AI Technical Summary

Problems solved by technology

[0005] Most traditional feature selection methods aim to improve classification accuracy without fully considering the distribution of the data samples. They generally pursue good learning performance on large classes and tend to ignore the learning performance on small classes.
To address data imbalance at the data level, the positive samples of the training set can be resampled before training so that positive and negative samples are balanced, and learning is then performed on the balanced set (Exploratory under-sampling for class-imbalance learning. Liu X Y, Wu J, Zhou Z H). However, this does not make use of all the data, which reduces classification accuracy.
At the algorithm level, a traditional feature selection algorithm can be modified according to the imbalanced class distribution of the data so that it adapts to such samples (New algorithm for feature selection in imbalanced problems: IM-IG. You Mingyu, Chen Yan, Li Guozheng). However, this method is limited to two-class imbalance problems and is not suitable for multi-class imbalance problems.
[0006] For filter-based feature selection, many supervised methods have been proposed, such as using mutual information to evaluate candidate features and selecting the top-ranked features as the input of a neural network classifier (Using mutual information for selecting features in supervised neural net learning. R. Battiti). However, this method ignores the redundancy between features, so many redundant features are selected, which hinders the performance of subsequent classifiers.
This method is also only suitable for data with class label information and cannot be used for unsupervised feature selection.
[0007] In the field of unsupervised feature selection, many methods have been proposed for text, but they cannot be applied directly to numerical data.
Some unsupervised feature selection methods do apply to numerical data. For example, an unsupervised filter feature selection algorithm for classification features is based on a one-pass clustering algorithm: it uses the importance of each feature in different clusters as the criterion and selects feature subsets according to how that importance changes (Research on unsupervised feature selection method for classification features. Wang Lianxi, Jiang Shengyi). Because this method uses only one clustering algorithm to partition the data, the clustering results are subject to randomness, and the accuracy of feature selection cannot be guaranteed.



Examples


Embodiment Construction

[0038] The implementation of the technical scheme is described in further detail below in conjunction with the accompanying drawings:

[0039] The unsupervised feature selection method based on conditional mutual information and K-means of the present invention is further described in detail with reference to the flow chart and an implementation case.

[0040] In this implementation case, conditional mutual information and the K-means algorithm are used to select features of an unlabeled data set. As shown in Figure 1, the method includes the following steps:

[0041] Step 10: perform K-means clustering on the unlabeled data set multiple times, with different K values and different cluster centers, and obtain each clustering result;

[0042] Step 101: the maximum number of clusters MAX and the minimum number of clusters MIN of the K-means algorithm are set in the input stage, and before each clustering, a number is randomly selected in the range of [...
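Step 10 and step 101 can be illustrated with a minimal Python sketch. It is not the patent's implementation. The function names `kmeans` and `repeated_kmeans`, the parameters `k_min`, `k_max`, `n_runs`, and `seed`, and the toy data are hypothetical. Because the text above is truncated before it states the sampling range, the sketch assumes that K is drawn uniformly from MIN to MAX inclusive and that the initial centers are randomly chosen data points.

```python
import random


def kmeans(points, k, rng, max_iter=100):
    """Lloyd's K-means with k randomly chosen data points as initial centers."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def repeated_kmeans(points, k_min, k_max, n_runs, seed=0):
    """Run K-means n_runs times, each with a freshly drawn K and fresh centers.

    ASSUMPTION: K is drawn uniformly from k_min..k_max inclusive; the source
    text is truncated before it states the exact range.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        k = rng.randint(k_min, k_max)
        results.append((k, kmeans(points, k, rng)))
    return results


# Toy unlabeled data set (hypothetical values).
data = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1), (9.1, 0.3)]
for k, labels in repeated_kmeans(data, k_min=2, k_max=3, n_runs=4):
    print(k, labels)
```

Each run yields one label vector. Varying both K and the starting centers is what lets later steps average out the randomness of any single clustering.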



Abstract

The invention provides an unsupervised feature selection method based on conditional mutual information and K-means. The unlabeled data are clustered multiple times using K-means with different initial conditions. On the basis of each clustering, the modularization metric value of every feature and the conditional mutual information between features are considered together, and a relevance-independence index between features is used to select a feature subset with high relevance and low redundancy. The feature subsets obtained from the different K-means clusterings are combined to obtain the final feature subset. The method is effective for unlabeled, imbalanced data sets, and the resulting feature subsets have high relevance and low redundancy.
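For readers unfamiliar with conditional mutual information, the sketch below estimates I(X; Y | Z) from counts for discrete variables, using the standard definition: the sum over (x, y, z) of p(x, y, z) · log2[ p(z) p(x, y, z) / (p(x, z) p(y, z)) ]. It is an illustration only. The function name is hypothetical, and treating X and Y as discretized features with Z as a cluster label vector is an assumption; this excerpt does not state which variables the patent conditions on.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(x, y, z):
    """Estimate I(X; Y | Z) in bits from three equal-length discrete sequences.

    ASSUMPTION: x and y are discretized feature columns and z is a label
    vector (for example cluster assignments); probabilities are plain
    empirical frequencies.
    """
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    cmi = 0.0
    for (xi, yi, zi), count in n_xyz.items():
        # p(z) p(x,y,z) / (p(x,z) p(y,z)) reduces to a ratio of raw counts.
        ratio = count * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)])
        cmi += (count / n) * log2(ratio)
    return cmi


labels = [0, 0, 0, 0]
# Identical features are fully redundant with each other.
print(conditional_mutual_information([0, 0, 1, 1], [0, 0, 1, 1], labels))  # 1.0
# Independent features share no information.
print(conditional_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], labels))  # 0.0
```

A large value between two features signals redundancy given the conditioning variable, which is the kind of evidence a selection criterion can use to drop one of them.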

Description

Technical field

[0001] The invention addresses the problem of feature selection in the field of machine learning, and specifically relates to a method for unsupervised feature selection on an unlabeled data set using conditional mutual information and the K-means algorithm.

Background technique

[0002] In practical applications of machine learning, the number of features is often large; some features may be irrelevant, and features may depend on one another. The more features there are, the longer it takes to analyze them and train the model, and the more likely the "curse of dimensionality" becomes: the model grows more complex and its generalization ability declines. Feature selection is therefore particularly important.

[0003] Feature selection, also known as feature subset selection or attribute selection, refers to selecting a subset of all features so that the constructed model performs better....


Application Information

IPC(8): G06K9/62
CPC: G06F18/23213
Inventor: 马廷淮, 邵文晔, 曹杰, 薛羽
Owner: NANJING UNIV OF INFORMATION SCI & TECH