
Text classification method for different subject topics

A text classification technique for subject topics, applicable to text database clustering/classification, unstructured text data retrieval, and special data processing applications. It aims to improve the average classification accuracy.

Inactive Publication Date: 2015-12-23
深圳市点通数据有限公司
Cites: 6; Cited by: 11

AI Technical Summary

Problems solved by technology

[0004] Current text classification methods achieve high accuracy on texts with distinctive features and large differences between categories, but they perform worse on texts from categories that resemble one another. School subject topics are an example: Mathematics, Language, Foreign Languages, Physics, Chemistry, Biology, Politics, History, and Geography. Science subjects are relatively easy to tell apart from liberal arts subjects, but subjects within the sciences, and within the liberal arts, resemble one another to some degree.
Methods based on dictionary vectors generally select feature words with statistical learning. These statistics usually consider only the individual words and ignore the associations between them. When words are instead represented as word vectors, the vectors do carry association information, but the length of texts varies so widely that it is hard to derive a uniform feature input for the classifier from the word vectors of a whole text. Deep learning schemes fix the text length, which inevitably loses information.

Examples

Embodiment Construction

[0020] To address the shortcomings of existing methods, this scheme designs a two-stage (secondary) classification method that, after selecting feature words, applies a classification strategy suited to each stage. To make the feature words in the dictionary as representative as possible, the scheme selects words with the chi-square test, a hypothesis test used in statistics for analyzing association. Because its model incorporates the document frequencies of the relevant documents, it is more reliable than counting word frequency alone. In addition, the chi-square test yields a separate set of feature words for each category, which is more targeted than the single pooled set of feature words obtained with information gain.
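The per-category word selection described in [0020] could be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names `chi_square` and `select_features`, the `top_k` cutoff, the toy documents, and the rule of keeping only words positively associated with a category are all assumptions made for the example. The statistic is the standard chi-square for a 2x2 table of document counts.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 word/category table of document counts.

    a: documents in the category that contain the word
    b: documents outside the category that contain the word
    c: documents in the category that lack the word
    d: documents outside the category that lack the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, top_k):
    """Return {category: up to top_k words ranked by chi-square for that category}."""
    doc_sets = [set(doc) for doc in docs]      # document frequency, not word frequency
    vocab = sorted(set().union(*doc_sets))
    n_docs = len(docs)
    result = {}
    for cat in sorted(set(labels)):
        n_cat = sum(1 for y in labels if y == cat)
        scored = []
        for word in vocab:
            a = sum(1 for s, y in zip(doc_sets, labels) if y == cat and word in s)
            b = sum(1 for s, y in zip(doc_sets, labels) if y != cat and word in s)
            c = n_cat - a
            d = n_docs - n_cat - b
            # Assumption: keep only words that are positively associated with
            # the category, since the statistic itself ignores direction.
            if a * d > b * c:
                scored.append((chi_square(a, b, c, d), word))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[cat] = [word for _, word in scored[:top_k]]
    return result


# Hypothetical tokenized documents, for illustration only.
docs = [
    ["force", "mass", "velocity"],
    ["force", "energy", "mass"],
    ["acid", "molecule", "reaction"],
    ["acid", "reaction", "energy"],
]
labels = ["physics", "physics", "chemistry", "chemistry"]
features = select_features(docs, labels, top_k=2)
```

In this toy data the word "energy" appears equally often in both categories, so it scores zero and is not selected for either one, which is the kind of non-discriminative word the per-category test is meant to filter out.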

[0021] Once the chi-square test has produced the feature words, a document can be represented as a vector over these feature words; the next question is how to classify it. Since the vocabular...

Abstract

The invention belongs to the technical field of data preprocessing and provides a text classification method for different subject topics. The method comprises the following steps: A, selecting words for each subject with a chi-square test to form that subject's feature vocabulary; B, classifying texts by subject with a Naive Bayes model over the selected feature words; and C, using a support vector machine to perform a secondary classification between the two categories to which the Naive Bayes model assigns the highest probabilities, which yields the classification result. The two-pass classification improves the average classification accuracy. The method is simple to implement and operate, and it improves accuracy in particular when distinguishing between adjacent, similar subjects.
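Steps B and C of the abstract can be illustrated with a small, dependency-free sketch. Everything named here is hypothetical: the helper functions, the toy corpus, and the `vocab` list (which stands in for the feature vocabulary produced by step A). The second stage uses a linear SVM fitted by plain subgradient descent on the hinge loss, and it is retrained for each query on the two candidate categories only; both are simplifications chosen to keep the example short, not details taken from the patent.

```python
import math
from collections import Counter


def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with Laplace smoothing over the feature words."""
    cats = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in cats}
    counts = {c: Counter() for c in cats}
    for doc, y in zip(docs, labels):
        counts[y].update(w for w in doc if w in vocab)
    loglik = {}
    for c in cats:
        total = sum(counts[c].values()) + len(vocab)
        loglik[c] = {w: math.log((counts[c][w] + 1) / total) for w in vocab}
    return prior, loglik


def nb_top_two(model, doc):
    """Return the two categories with the highest Naive Bayes log-posterior."""
    prior, loglik = model
    score = {c: prior[c] + sum(loglik[c][w] for w in doc if w in loglik[c])
             for c in prior}
    return sorted(score, key=lambda c: (-score[c], c))[:2]


def vectorize(doc, vocab):
    """Represent a document as counts of the feature words, in vocab order."""
    counts = Counter(doc)
    return [float(counts[w]) for w in vocab]


def train_linear_svm(xs, ys, epochs=200, lam=0.01, lr=0.1):
    """Linear SVM via subgradient descent on the L2-regularized hinge loss.

    ys must be +1 or -1. Deterministic: fixed sample order, no shuffling.
    """
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]      # regularization step
            if margin < 1:                             # hinge loss is active
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(doc, nb_model, docs, labels, vocab):
    """Stage 1: Naive Bayes picks two candidates. Stage 2: an SVM chooses one."""
    first, second = nb_top_two(nb_model, doc)
    train = [(vectorize(d, vocab), 1 if y == first else -1)
             for d, y in zip(docs, labels) if y in (first, second)]
    w, b = train_linear_svm([x for x, _ in train], [y for _, y in train])
    score = sum(wi * xi for wi, xi in zip(w, vectorize(doc, vocab))) + b
    return first if score >= 0 else second


# Hypothetical feature vocabulary and training corpus, for illustration only.
vocab = ["force", "mass", "acid", "reaction", "cell", "gene"]
docs = [["force", "mass"], ["force", "force", "mass"],
        ["acid", "reaction"], ["acid", "acid", "reaction"],
        ["cell", "gene"], ["cell", "gene", "gene"]]
labels = ["physics", "physics", "chemistry", "chemistry", "biology", "biology"]
nb_model = train_nb(docs, labels, vocab)
predicted = classify(["force", "mass", "acid"], nb_model, docs, labels, vocab)
```

A practical system would more likely train one binary SVM per pair of subjects ahead of time and look up the right one after the Naive Bayes stage, rather than retraining per query as this sketch does.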

Description

Technical field

[0001] The invention relates to data preprocessing technology, and in particular to a method for classifying texts of different subjects.

Background technique

[0002] With the explosive growth of text information on the Internet, the demand for text processing is becoming more urgent, and the required precision and accuracy keep rising. In document classification and information retrieval in particular, large batches of documents often need to be classified automatically.

[0003] Current text classification methods mainly comprise three stages: text representation, feature extraction, and text classification. Generally speaking, the main difference between text classification methods lies in how they represent the text. In terms of text representation, there are two main families of methods: those based on dictionary vectors and those based on deep learning. The former directly represents the t...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 罗登周贤华万享张玉志
Owner: 深圳市点通数据有限公司