Feature selection method based on information gain ratio

A technology combining information gain ratio and attribute selection, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problems of high time consumption and poor classification accuracy, achieving low time cost and improved classification performance.

Status: Inactive
Publication date: 2015-07-01
CHINA UNIV OF GEOSCIENCES (WUHAN)
AI Technical Summary

Problems solved by technology

Wang applied this method to text classification and found that simple attribute selection seriously degraded classification accuracy, so a method based on CFS attribute weighting was proposed, which achieved better classification performance.
[0017] Wrapper attribute selection methods achieve better classification performance but at a relatively large time cost, whereas filter attribute selection methods select attributes quickly but yield poorer classification accuracy.

Examples

Embodiment 1

[0045] The present invention provides an attribute selection method based on information gain ratio, which is used to obtain the best attribute subset from a training document set and comprises the following steps:

[0046] (1) For a known training document set D, any document d in D is represented as a word vector $d = \langle w_1, w_2, \ldots, w_m \rangle$, where $w_i$ is the i-th word in document d and m is the number of words in document d;

[0047] The information gain ratio of each attribute in the training document set D is calculated with the following formula:

[0048] $$\mathrm{GainRatio}(D, w_i) = \frac{\mathrm{Gain}(D, w_i)}{\mathrm{SplitInfo}(\ldots)}$$ ...
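The gain ratio computation in step (1) can be sketched in Python as follows. The formula in the source is truncated, so this sketch assumes the standard definition, in which the information gain is divided by the split information of the attribute. It further assumes that each word is treated as a binary attribute (present or absent in a document), that entropies use base-2 logarithms, and that documents are already tokenized into lists of words. The names `entropy`, `gain_ratio`, and `rank_by_gain_ratio` are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(docs, labels, word):
    """Gain ratio of `word`, treated as a binary attribute (assumption)."""
    # Split the class labels by whether the word occurs in the document.
    partitions = {True: [], False: []}
    for doc, label in zip(docs, labels):
        partitions[word in doc].append(label)
    total = len(labels)
    non_empty = [p for p in partitions.values() if p]
    # Gain(D, w) = H(D) - sum_v |D_v| / |D| * H(D_v)
    gain = entropy(labels) - sum(len(p) / total * entropy(p) for p in non_empty)
    # SplitInfo(D, w) = -sum_v |D_v| / |D| * log2(|D_v| / |D|)
    split_info = -sum(len(p) / total * math.log2(len(p) / total) for p in non_empty)
    # A word present in every document (or in none) carries no split information.
    return gain / split_info if split_info > 0 else 0.0


def rank_by_gain_ratio(docs, labels):
    """Return the vocabulary sorted by descending gain ratio."""
    vocab = sorted({w for doc in docs for w in doc})
    return sorted(vocab, key=lambda w: gain_ratio(docs, labels, w), reverse=True)
```

Calling `rank_by_gain_ratio(docs, labels)` on a list of tokenized documents and their class labels yields the attribute ordering that the later selection step would truncate. Words with equal gain ratio keep their alphabetical order, which is a choice of this sketch rather than something the source specifies.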

Embodiment 2

[0098] The present invention provides an attribute selection method based on information gain ratio, which is used to obtain the best attribute subset from a training document set and comprises the following steps:

[0099] (1) For a known training document set D, any document d in D is represented as a word vector $d = \langle w_1, w_2, \ldots, w_m \rangle$, where $w_i$ is the i-th word in document d and m is the number of words in document d;

[0100] The information gain ratio of each attribute in the training document set D is calculated with the following formula:

[0101] $$\mathrm{GainRatio}(D, w_i) = \frac{\mathrm{Gain}(D, w_i)}{\mathrm{SplitInfo}(\ldots)}$$ ...

Abstract

The invention provides a feature selection method based on information gain ratio. The method sorts the attributes by their information gain ratio, determines the number of attributes to select (expressed as a percentage) by running 5-fold cross-validation nine times, and finally builds a Naive Bayes text classifier on the selected attribute subset. The method is a hybrid attribute selection method that combines the advantages of the filter approach and the wrapper approach. Experimental results on a number of standard text classification datasets show that, in most cases, the method improves the classification accuracy of the Naive Bayes text classifier without incurring excessive time cost.
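The wrapper half of this hybrid procedure might look like the following Python sketch. It is a minimal illustration, not the patented implementation. The abstract says only that 5-fold cross-validation is run nine times; the sketch assumes the nine runs correspond to the candidate percentages 10%, 20%, ..., 90%. It also assumes that `ranked_words` is already sorted by descending gain ratio, that the classifier is a multinomial Naive Bayes with Laplace smoothing, and that folds are formed by taking every k-th document. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, features):
    """Fit multinomial Naive Bayes restricted to `features` (Laplace smoothing)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(w for w in doc if w in features)
    model = {}
    for c, n_docs in class_counts.items():
        total = sum(word_counts[c].values())
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {
            w: math.log((word_counts[c][w] + 1) / (total + len(features)))
            for w in features
        }
        model[c] = (log_prior, log_likelihood)
    return model


def predict_mnb(model, doc):
    """Return the class maximizing log p(c) + sum_i log p(w_i | c)."""
    def score(c):
        log_prior, log_likelihood = model[c]
        return log_prior + sum(log_likelihood[w] for w in doc if w in log_likelihood)
    return max(model, key=score)


def cv_accuracy(docs, labels, features, k=5):
    """Accuracy pooled over k folds; fold i holds out documents i, i + k, ..."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(docs)) if i % k != fold]
        test = [i for i in range(len(docs)) if i % k == fold]
        model = train_mnb([docs[i] for i in train], [labels[i] for i in train], features)
        correct += sum(predict_mnb(model, docs[i]) == labels[i] for i in test)
    return correct / len(docs)


def select_attributes(docs, labels, ranked_words, percentages=range(10, 100, 10), k=5):
    """Pick the top-p% of `ranked_words` with the best cross-validated accuracy."""
    best_acc, best_pct = -1.0, None
    for pct in percentages:  # nine candidates by default (assumption)
        n = max(1, len(ranked_words) * pct // 100)
        acc = cv_accuracy(docs, labels, set(ranked_words[:n]), k)
        if acc > best_acc:  # strict comparison keeps the smallest subset on ties
            best_acc, best_pct = acc, pct
    n_best = max(1, len(ranked_words) * best_pct // 100)
    return best_pct, set(ranked_words[:n_best])
```

After selection, the final classifier would be built with `train_mnb(docs, labels, selected)` on the full training set. Keeping the smallest percentage on ties is a design choice of this sketch, since the source does not say how ties are broken.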

Description

Technical field

[0001] The invention relates to an attribute selection method based on information gain ratio, and belongs to the technical field of artificial intelligence data mining and classification.

Background

[0002] The Naive Bayes text classifier is often used for text classification because of its simplicity and efficiency, but its attribute independence assumption, while making it efficient, harms its classification performance to some extent. Given a document d represented as a word vector $\langle w_1, w_2, \ldots, w_m \rangle$, Multinomial Naive Bayes (MNB), Complement Naive Bayes (CNB) and the combined model of both (OVA) classify document d using Equations 1, 2 and 3, respectively.

[0003] c(d) = arg max_{c ∈ C} [log p ...

Application Information

IPC(8): G06F17/30
Inventor: 蒋良孝, 张伦干, 李超群
Owner: CHINA UNIV OF GEOSCIENCES (WUHAN)