Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Feature vector and text classification technology, applied in special data processing applications, instruments, electronic digital data processing, and related fields. It addresses the problem that existing methods ignore the detailed distribution of feature words across classes, which biases the weight calculation. The result is a more reasonable and effective weight calculation and better classifier performance.

Active Publication Date: 2015-07-01
CENT SOUTH UNIV

AI Technical Summary

Problems solved by technology

Although the TF-RF method performs well on some two-class text classification problems, it has a major weakness in multi-class text classification: it merges all the other classes into a single negative class and ignores the detailed distribution of feature words across those classes, which biases the weight calculation.
In addition, the feature word weights calculated by supervised term weighting methods such as TF-RF depend on the category of the specific text, but the categories of new texts or test texts to be classified are unknown. When a text to be classified is represented as a feature vector, its weights must either be calculated with a traditional method such as TF-IDF or be calculated with TF-RF for every category in turn. The former requires collecting additional statistics during training, and the latter increases the amount of computation and the number of variables during classification or testing.
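The first problem above can be illustrated with a small sketch. It assumes the commonly cited definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-class and negative-class documents containing the term; this excerpt does not state the formula, and the per-class counts and helper names below are hypothetical.

```python
import math

def rf(pos_docs_with_term: int, neg_docs_with_term: int) -> float:
    """Relevance frequency as commonly defined: log2(2 + a / max(1, c))."""
    return math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))

def rf_for_class(doc_counts, positive):
    """Treat one class as positive and merge all other classes into one negative class."""
    a = doc_counts[positive]
    c = sum(doc_counts) - a
    return rf(a, c)

# Hypothetical numbers of documents containing a term in each of four classes.
concentrated = [30, 30, 0, 0]   # the other occurrences sit in a single class
spread = [30, 10, 10, 10]       # the other occurrences are spread over three classes

rf_concentrated = rf_for_class(concentrated, positive=0)
rf_spread = rf_for_class(spread, positive=0)

# Both distributions collapse to a = 30, c = 30, so the weights are identical.
print(rf_concentrated, rf_spread)
```

Because only the merged negative count enters the formula, two terms with quite different distributions over the other classes receive the same weight for class 0.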



Embodiment Construction

[0051] The specific embodiment of the present invention will be described in detail below with reference to the accompanying drawings and specific cases, and relevant experimental results will be provided. In order to highlight the novelty of the present invention, some technical details well known in the art will be omitted.

[0052] As shown in Figure 1 and Figure 2, the specific implementation steps for calculating feature word weights with the TF-IGM (term frequency-inverse gravity moment) method and performing text classification are as follows:

[0053] Step (1): generating text feature vectors;

[0054] Input the text set (including the training set and the test set), perform the following steps s1 to s4 in order based on the TF-IGM method, and generate the feature vector of each text document.

[0055] Step s1: text preprocessing;

[0056] Prepare a batch of pre-classified text sets, and divide them into training sets and test sets according to a certain ratio;...
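The split described in step s1 might look like the following minimal sketch. The 70/30 ratio, the per-class (stratified) shuffling, and the function name `split_by_ratio` are assumptions for illustration; the excerpt only says that the pre-classified texts are divided "according to a certain ratio".

```python
import random
from collections import defaultdict

def split_by_ratio(docs, train_ratio=0.7, seed=0):
    """Split (text, label) pairs into train and test sets, class by class."""
    by_label = defaultdict(list)
    for text, label in docs:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = int(round(len(items) * train_ratio))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Hypothetical toy corpus: 10 documents in each of two classes.
corpus = [(f"doc {i}", "sports" if i % 2 else "finance") for i in range(20)]
train_set, test_set = split_by_ratio(corpus, train_ratio=0.7)
print(len(train_set), len(test_set))
```

Splitting within each class keeps the class proportions of the training and test sets close to those of the full collection.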


Abstract

The invention discloses a method and a device for generating text characteristic vectors based on TF-IGM, as well as a method and a device for classifying texts. The concentration of each characteristic word's distribution across the different text classes is calculated by establishing an inverse gravity moment (IGM) model, and the weights of the characteristic words are calculated on that basis. The resulting weights reflect the importance of the characteristic words for text classification more realistically and thereby improve the performance of text classifiers. The device for generating the text characteristic vectors based on TF-IGM has several options that can be tuned according to the results of text classification performance tests, so that it can adapt to text data sets with different characteristics. Experiments on public English and Chinese corpora show that the TF-IGM method is superior to existing methods such as TF-IDF and TF-RF, and that it is particularly suitable for multi-class text classification with more than two classes.
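As a rough sketch of the idea in the abstract, the code below assumes one commonly described form of the IGM measure: a term's class-specific frequencies are sorted in descending order and the largest one is divided by the sum of each frequency multiplied by its rank. The combined weight tf * (1 + lam * igm) and the coefficient `lam` are likewise assumptions; this excerpt does not give the patent's exact formulas, and all names and counts are hypothetical.

```python
def igm(class_freqs):
    """Inverse gravity moment: f_1 / sum(f_r * r), frequencies sorted descending, r = 1..m."""
    ranked = sorted(class_freqs, reverse=True)
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    return ranked[0] / moment if moment > 0 else 0.0

def tf_igm(tf, class_freqs, lam=7.0):
    """Assumed weight form: term frequency scaled by (1 + lam * igm)."""
    return tf * (1.0 + lam * igm(class_freqs))

# Hypothetical per-class frequencies of four terms over four classes.
single_class = [60, 0, 0, 0]     # occurs in one class only
two_classes = [30, 30, 0, 0]     # split over two classes
spread = [30, 10, 10, 10]        # one dominant class, rest spread out
uniform = [15, 15, 15, 15]       # evenly distributed

for name, freqs in [("single_class", single_class), ("two_classes", two_classes),
                    ("spread", spread), ("uniform", uniform)]:
    print(name, igm(freqs), tf_igm(tf=3, class_freqs=freqs))
```

Under these assumptions the measure is 1 for a term confined to one class and shrinks as the term spreads more evenly, so it uses the full per-class distribution rather than a single merged count.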

Description

Technical field

[0001] The invention belongs to the technical field of text mining and machine learning, and in particular relates to a TF-IGM-based text feature vector generation method and device, and a text classification method and device.

Background art

[0002] With the wide application of computers and the continuous growth of the Internet, the number of electronic text documents has increased dramatically, so it is becoming more and more important to effectively organize, retrieve and mine massive text data. Automatic text classification is one of the widely used technical means. It often uses the vector space model (VSM) to represent text and then uses a supervised machine learning method for classification. By extracting a certain number of feature words from the text and calculating their weights, the VSM represents the text as a vector composed of the weight values of multiple feature words, called a feature vector. When generating text...


Application Information

IPC(8): G06F17/30
Inventor 龙军, 陈科文, 张祖平, 杨柳
Owner CENT SOUTH UNIV