
Method for text classification using diverse text features

A technology for sample features and text classification, applied in special data processing applications, instruments, electronic digital data processing, etc., that addresses problems such as the inability to mine the internal structures of a data set well.

Active Publication Date: 2018-10-16
NANJING UNIV
5 Cites, 9 Cited by

AI Technical Summary

Problems solved by technology

A single text feature representation cannot mine the variety of internal structures present in a data set, nor represent each of those structures as features.



Examples


Embodiment 1

[0091] In this embodiment, the WebKB data set (http://www.webkb.org/) is used as the experimental data set, and the improved Dec.k-Means algorithm is used to generate multi-dimensional text representations: ten sets of feature representations, each with 50 dimensions. Figure 1 shows the flow chart of the present invention when generating a text representation. The application process is as follows:

[0092] 1. Taking the WebKB dataset as input, the detailed information of the dataset is shown in Table 1:

[0093] Table 1

[0094] Number of samples in the training set

[0095] 2. Use the improved Dec.k-Means to generate m = 10 sets of feature representations for the training set and test set. In each set of feature representations, the dimension of the feature vector is k_1 = k_2 = ... = k_10 = 50. The specific steps are as follows:

[0096] (1) Use the bag of words model + TF-IDF weight to convert the training set and test set i...
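Step (1) names a bag-of-words model with TF-IDF weights. The following is a minimal sketch of that kind of conversion, not the patent's implementation: the function names `fit_idf` and `tfidf_vector`, the smoothed IDF formula, the L2 normalization, and the toy documents are all assumptions made for illustration. It fits the vocabulary and IDF weights on the training documents only and reuses them for the test documents.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn a vocabulary and smoothed IDF weights from tokenized training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vocab = sorted(doc_freq)
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_vector(tokens, vocab, idf):
    """Map one tokenized document to an L2-normalized TF-IDF vector over vocab."""
    # Terms that never appeared in the training documents are dropped.
    counts = Counter(t for t in tokens if t in idf)
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Hypothetical toy documents, already tokenized.
train = [["course", "homework", "exam"], ["faculty", "research", "course"]]
test = [["research", "exam", "unseen"]]

vocab, idf = fit_idf(train)
train_vecs = [tfidf_vector(doc, vocab, idf) for doc in train]
test_vecs = [tfidf_vector(doc, vocab, idf) for doc in test]
```

Fitting on the training set alone keeps the test vectors in the same feature space without leaking test-set statistics into the weights; whether the patent does the same is not stated in the visible text.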

Embodiment 2

[0116] In this embodiment, the AG's corpus of news articles data set, referred to as the AGNews data set (http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), is used as the experimental data set, and the improved Alter LDA algorithm is used to generate multi-dimensional text representations: ten sets of feature representations, each with 50 dimensions. The application process is as follows:

[0117] 1. Taking the AG News dataset as input, the details of the dataset are shown in Table 3:

[0118] Table 3

[0119] Number of samples in the training set

[0120] 2. Use Alter LDA to generate m = 10 sets of feature representations for the training set and test set. In each set of feature representations, the dimension of the feature vector is k_1 = k_2 = ... = k_10 = 50. The specific steps are as follows:

[0121] (1) Use the Latent Dirichlet Allocation (LDA) algorithm to obtain the topic distribution of words β^(1), the topic distribut...



Abstract

The invention discloses a method for text classification using diverse text features. The method comprises the following steps: using a single multi-dimensional text representation algorithm to generate a plurality of different sets of text feature representations, which yields a multi-dimensional text feature representation longitudinally; using a plurality of different text representation algorithms to generate multiple different sets of text feature representations, which yields a multi-dimensional text feature representation transversely; and combining the different feature representation vectors of each sample into a new feature vector for that sample, thereby obtaining a new feature representation of the data set. The method improves on existing text representation algorithms by using more text representations, each of lower dimension and with larger differences between them, to mine different internal structures of texts. This enhances text representation ability, improves the effect of tasks such as text categorization, and greatly reduces the dimension of the text features.
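The combining step in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patent's code: "combining" is read here as plain concatenation, and the function name `combine_representations` and the toy values are hypothetical. Each feature set is a list holding one vector per sample, with all sets listing the samples in the same order.

```python
def combine_representations(feature_sets):
    """Concatenate, for each sample, its vectors from every feature set."""
    n_samples = len(feature_sets[0])
    if any(len(fs) != n_samples for fs in feature_sets):
        raise ValueError("every feature set must cover the same samples")
    combined = []
    for i in range(n_samples):
        row = []
        for fs in feature_sets:
            row.extend(fs[i])  # append this set's vector for sample i
        combined.append(row)
    return combined


# Toy data: 3 feature sets, 2 samples, 2 dimensions per set (values are made up).
set_a = [[0.1, 0.9], [0.8, 0.2]]
set_b = [[0.5, 0.5], [0.3, 0.7]]
set_c = [[1.0, 0.0], [0.0, 1.0]]

combined = combine_representations([set_a, set_b, set_c])
print(len(combined), len(combined[0]))
```

Under this concatenation reading, the ten sets of 50-dimensional features described in the embodiments would give each sample a 500-dimensional combined vector.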

Description

Technical field

[0001] The invention belongs to the field of text representation, and in particular relates to a method for classifying text by using diversified text features.

Background art

[0002] In recent years, with the rapid development of computer technology and the Internet, human beings have entered the information age. Massive data, especially various text data, contain important information and great value. Reasonable sorting and summarization of these text data is conducive to better utilization of these large-scale text data. Text classification is a very effective method.

[0003] Text classification has always been a very important basic research direction in the field of machine learning and artificial intelligence, and it is also widely used in the industry. The effectiveness of text classification depends largely on the quality of text feature representation. Plain text that can be read by humans cannot be directly recognized and utilized by mach...


Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F40/247
Inventor: 黄书剑, 李念奇, 戴新宇, 张建兵, 尹存燕, 陈家骏
Owner NANJING UNIV