Method for text classification using diverse text features

A technology for sample features and text classification, applied in special data processing applications, instruments, and electronic digital data processing. It addresses problems such as the inability to mine the internal structures of a data set well.

Active Publication Date: 2018-10-16

NANJING UNIV


Problems solved by technology

A single text feature representation cannot mine the variety of internal structures present in a data set, nor can it produce a feature representation for each of those structures.

Examples

Embodiment 1

[0091] In this embodiment, the WebKB data set (http://www.webkb.org/) is used as the experimental data set, and the improved Dec.k-Means algorithm is used to generate multi-dimensional text representations. Ten sets of feature representations are generated, each of 50 dimensions. Figure 1 shows a flow chart of the present invention when generating a text representation. The application process is as follows:

[0092] 1. The WebKB data set is taken as input; its details are shown in Table 1:

[0093] Table 1

[0094] Number of samples in the training set

[0095] 2. Use the improved Dec.k-Means to generate m = 10 sets of feature representations for the training set and test set. In each set, the dimension of the feature vector is k_1 = k_2 = ... = k_10 = 50. The specific steps are as follows:

[0096] (1) Use the bag-of-words model with TF-IDF weighting to convert the training set and test set i...
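The bag-of-words plus TF-IDF conversion named in step (1) can be illustrated with a minimal sketch. This is not the patent's implementation: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, the function names `fit_idf` and `tfidf_vector`, and the toy documents are all assumptions chosen for illustration. The one point it tries to show is that the vocabulary and IDF weights are learned from the training set and then reused unchanged for the test set.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    vocab = sorted(doc_freq)
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1, so no weight is zero.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_vector(doc, vocab, idf):
    """Map one document to an L2-normalized TF-IDF vector over the training vocabulary."""
    counts = Counter(doc.lower().split())
    # Terms outside the training vocabulary are ignored.
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Hypothetical toy documents, not taken from WebKB.
train_docs = [
    "course homework syllabus",
    "faculty research publications",
    "course project research",
]
test_docs = ["research course staff"]

vocab, idf = fit_idf(train_docs)
X_train = [tfidf_vector(d, vocab, idf) for d in train_docs]
X_test = [tfidf_vector(d, vocab, idf) for d in test_docs]
```

In this sketch a term that occurs in fewer training documents (such as "homework") receives a larger IDF weight than one that occurs in more of them (such as "course"), and the unseen test word "staff" contributes nothing to the test vector.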

Embodiment 2

[0116] In this embodiment, the AG's corpus of news articles data set, referred to as the AG News data set (http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), is used as the experimental data set, and the improved Alter LDA algorithm is used to generate multi-dimensional text representations. Ten sets of feature representations are generated, each of 50 dimensions. The application process is as follows:

[0117] 1. The AG News data set is taken as input; its details are shown in Table 3:

[0118] Table 3

[0119] Number of samples in the training set

[0120] 2. Use Alter LDA to generate m = 10 sets of feature representations for the training set and test set. In each set, the dimension of the feature vector is k_1 = k_2 = ... = k_10 = 50. The specific steps are as follows:

[0121] (1) Use the Latent Dirichlet Allocation (LDA) algorithm to obtain the topic distribution of words β^(1), the topic distribut...

Abstract

The invention discloses a method for text classification using diverse text features. The method comprises: generating multiple sets of different text feature representations with a multi-dimensional text representation algorithm, which yields a multi-dimensional text feature representation longitudinally; generating multiple sets of different text feature representations with several different text representation algorithms, which yields a multi-dimensional text feature representation transversely; and combining the different feature representation vectors of each sample into a new feature vector for that sample, thereby obtaining a new feature representation of the data set. The method improves existing text representation algorithms: it proposes using more text representations, each of lower dimension and with larger differences between them, to mine different internal structures of texts. This enhances text representation ability, improves the effect of tasks such as text categorization, and greatly reduces the dimension of the text features.
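The combining step in the abstract can be sketched as follows, under the assumption that "combining" means concatenating each sample's vectors from every feature set; the abstract does not spell out the operation. The function name `combine_feature_sets` and the toy values are hypothetical.

```python
def combine_feature_sets(feature_sets):
    """Concatenate, per sample, the vectors from every feature set.

    feature_sets: list of m feature sets. Each set is a list holding one
    vector per sample, and all sets cover the same samples in the same order.
    """
    n_samples = len(feature_sets[0])
    if any(len(fs) != n_samples for fs in feature_sets):
        raise ValueError("every feature set must cover the same samples")
    combined = []
    for i in range(n_samples):
        row = []
        for fs in feature_sets:
            # Append this set's vector for sample i to the sample's new vector.
            row.extend(fs[i])
        combined.append(row)
    return combined


# Toy stand-in: m = 3 feature sets of dimension k = 2 over 2 samples.
set_a = [[0.1, 0.9], [0.8, 0.2]]
set_b = [[0.5, 0.5], [0.3, 0.7]]
set_c = [[1.0, 0.0], [0.0, 1.0]]

combined = combine_feature_sets([set_a, set_b, set_c])
```

Under the same concatenation assumption, the settings used in the embodiments (m = 10 sets of 50 dimensions each) would give every sample a 500-dimensional vector.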

Description

Technical field

[0001] The invention belongs to the field of text representation, and in particular relates to a method for classifying text by using diversified text features.

Background art

[0002] In recent years, with the rapid development of computer technology and the Internet, human beings have entered the information age. Massive data, especially text data of various kinds, contain important information and great value. Sorting and summarizing these text data sensibly helps make better use of them at scale, and text classification is a very effective way to do so.

[0003] Text classification has always been a very important basic research direction in the field of machine learning and artificial intelligence, and it is also widely used in industry. The effectiveness of text classification depends largely on the quality of text feature representation. Plain text that can be read by humans cannot be directly recognized and utilized by mach...

Application Information

IPC(8): G06F17/30, G06F17/27

CPC: G06F40/247

Inventor: 黄书剑, 李念奇, 戴新宇, 张建兵, 尹存燕, 陈家骏

Owner: NANJING UNIV
