Short text-oriented optimization classification method

A classification method for short text, applicable to special data processing applications, instruments, and electronic digital data processing. It addresses the difficulty of extracting semantic information, the need for a large external corpus, and limited gains in classification accuracy, and it aims to strengthen semantic representation, reduce the amount of computation, and improve precision.

Active Publication Date: 2019-07-02

长沙市智为信息技术有限公司

Cites: 15; Cited by: 16


Problems solved by technology

When this method is applied to short text classification, the following problems arise:

(1) When the vector space model (VSM) computes the semantic similarity between sentences, it ignores the effect of synonyms on that similarity.

(2) With a large amount of text data, representing text with VSM leads to a severe curse of dimensionality.

(3) Short texts are usually brief and contain many polysemous and noisy words. Traditional methods often extract too few effective features from them, so the semantic representation of the short text is weak, which hinders subsequent classification.
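Problems (1) and (2) can be illustrated with a minimal sketch of a bag-of-words VSM. The toy sentences and the helper names `bow_vector` and `cosine` are assumptions made for illustration and do not come from the patent. Two sentences with the same meaning but no shared words get a similarity of zero, and the vector dimension equals the vocabulary size, so it grows with the corpus.

```python
import math
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Represent a tokenized sentence as term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy sentences: s1 and s2 are synonymous but share no words.
s1 = ["cheap", "phone"]
s2 = ["inexpensive", "mobile"]
s3 = ["cheap", "phone", "deal"]

# One dimension per distinct word in the corpus.
vocabulary = sorted(set(s1) | set(s2) | set(s3))
v1, v2, v3 = (bow_vector(s, vocabulary) for s in (s1, s2, s3))

sim_synonyms = cosine(v1, v2)  # synonyms contribute nothing to the similarity
sim_overlap = cosine(v1, v3)   # only literal word overlap is rewarded
dimension = len(vocabulary)    # grows with every new distinct word
```

Here `sim_synonyms` is zero even though the two sentences mean the same thing, which is the gap that grouping similar feature words is meant to close.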

Methods based on semantic expansion require a large external corpus and add computational overhead, which aggravates the curse of dimensionality, so their application scenarios are often limited.

Methods based on word vector representation alone, however, yield only limited improvement in classification accuracy.

The main reason is that word representations obtained by traditional word embedding or TF-IDF capture only the semantic or statistical information in the current text corpus, while short texts are brief and contain many polysemous and noisy words. With few effective features available, it is difficult to extract enough semantic information.


Examples


Embodiment 1

[0054] This embodiment is a specific implementation of the feature-clustering-based optimized classification method for short texts. The method consists of six main steps:

[0055] Step 1. Obtain training data and preprocess it as follows:

[0056] A. The training data comes from the open source news corpora released by Fudan and Sogou Labs: more than 200,000 items in six categories (sports, Internet, economy, politics, art, and military);

[0057] B. Add a collected and curated dictionary of online vocabulary to improve the accuracy of subsequent word segmentation;

[0058] C. Remove stop words;

[0059] D. Perform word segmentation on the training data to complete the preprocessing.

[0060] Step 2. For the training data obtained in Step 1, traverse each feature word in the segmented data set and select the distinct feature words whose word frequency is greater than the set threshold to construct a ...
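The preprocessing in Step 1 of this embodiment can be sketched as follows. The excerpt does not name the segmentation tool, so this sketch substitutes forward maximum matching, a simple dictionary-based segmenter, to show why adding a user dictionary (Step 1.B) changes the result. The dictionaries, stop words, and sample sentence are hypothetical, and filtering stop words after segmentation is an assumption about how the two operations interact.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when no longer word matches.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


def preprocess(text, dictionary, stop_words):
    """Segment the text, then drop stop words (assumed order)."""
    return [token for token in segment(text, dictionary) if token not in stop_words]


# Hypothetical resources for illustration only.
base_dictionary = {"喜欢", "世界", "比赛"}   # like, world, match
user_dictionary = {"世界杯"}                 # World Cup, the added vocabulary
stop_words = {"我", "的", "看"}              # I, (particle), watch
text = "我喜欢看世界杯的比赛"                 # "I like watching World Cup matches"

without_user = preprocess(text, base_dictionary, stop_words)
with_user = preprocess(text, base_dictionary | user_dictionary, stop_words)
```

Without the user dictionary the term 世界杯 is split into 世界 and 杯, while with it the term survives as a single feature word.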

Embodiment 2

[0086] This embodiment is a specific implementation of the feature-clustering-based optimized classification method for short texts. The method consists of six main steps:

[0087] Step 1. Obtain training data and preprocess it as follows:

[0088] A. The training data comes from the China Mobile SMS data set, about 100,000 items in five categories: normal, marketing, advertising, credit card, and others;

[0089] B. Remove stop words;

[0090] C. Perform word segmentation on the training data to complete the preprocessing.

[0091] Step 2. For the training data obtained in Step 1, traverse each feature word in the segmented data set and select the distinct feature words whose word frequency is greater than the set threshold (set to 2 here) to construct the feature item set;

[0092] Step 3. Train a word vector model on the collected large-scale corpus. The specific steps are:

[0093] A. Collect an open source Chinese corpus from Wi...
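The feature selection in Step 2 of this embodiment can be sketched as below. The function name `select_feature_items` and the toy documents are hypothetical, and the sketch assumes that "word frequency" means the total number of occurrences across the whole data set, which the excerpt does not state explicitly.

```python
from collections import Counter


def select_feature_items(segmented_docs, min_frequency=2):
    """Return the distinct words whose corpus frequency is greater than min_frequency."""
    frequency = Counter(word for doc in segmented_docs for word in doc)
    return sorted(word for word, count in frequency.items() if count > min_frequency)


# Hypothetical segmented SMS messages.
docs = [
    ["credit", "card", "bill", "due"],
    ["credit", "card", "offer"],
    ["credit", "limit", "card", "offer"],
    ["meeting", "due", "tomorrow"],
]

# Threshold of 2, as in this embodiment: a word must occur at least 3 times.
features = select_feature_items(docs, min_frequency=2)
```

Words that occur exactly twice, such as "offer" and "due" in this toy data, fall at the threshold and are excluded, because the condition is strictly greater than the threshold.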


Abstract

The invention discloses a short text-oriented optimization classification method. The method comprises the following steps: 1, obtaining an original data set and preprocessing it; 2, selecting a feature item set from the preprocessed data set; 3, training a word vector tool on the collected large-scale corpora to obtain a word vector model; 4, representing each feature item in the feature item set as a word vector by using the word vector model, and performing primary clustering on the word vectors of the feature items to obtain a plurality of primary feature clusters; 5, performing two-stage loose clustering within each primary feature cluster to obtain a plurality of similar feature clusters; and 6, replacing the feature words obtained in step 4 with the similar feature clusters obtained in step 5, and then carrying out short text classification by using a classifier. Traditional short text classification mostly lacks semantic expression capability and has a high-dimensional feature space. The invention expresses the semantic information of short text better, reduces the dimension of the feature space, and improves the precision and efficiency of short text classification. It can be applied to short text classification tasks in various fields, such as spam short message classification and microblog topic classification.
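Steps 5 and 6 can be illustrated with a minimal sketch. This excerpt does not specify the clustering algorithms, so the greedy cosine-threshold grouping below is only an assumed stand-in for the loose clustering, and the primary clustering of step 4 is omitted. The two-dimensional word vectors, the threshold, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def loose_clusters(word_vectors, threshold=0.9):
    """Greedy grouping (assumed rule): a word joins the first cluster whose seed
    vector is at least `threshold` cosine-similar, else it starts a new cluster."""
    clusters = []  # list of (seed_vector, member_words)
    for word, vector in word_vectors.items():
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(word)
                break
        else:
            clusters.append((vector, [word]))
    return [members for _, members in clusters]


def build_replacement_map(clusters):
    """Map every feature word to the identifier of its cluster."""
    return {word: f"cluster_{index}"
            for index, members in enumerate(clusters)
            for word in members}


def to_cluster_features(tokens, replacement):
    """Replace feature words with cluster identifiers; drop non-feature words."""
    return [replacement[token] for token in tokens if token in replacement]


# Hypothetical word vectors; a real model would have far more dimensions.
word_vectors = {
    "cheap": (1.0, 0.1),
    "inexpensive": (0.9, 0.2),
    "phone": (0.1, 1.0),
    "mobile": (0.2, 0.9),
}

clusters = loose_clusters(word_vectors, threshold=0.9)
replacement = build_replacement_map(clusters)
features_a = to_cluster_features(["cheap", "phone"], replacement)
features_b = to_cluster_features(["inexpensive", "mobile"], replacement)
```

In this toy data, four feature words collapse into two cluster features, and the two synonymous sentences end up with identical feature lists, which matches the abstract's stated goals of a smaller feature space and better semantic expression.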

Description

Technical field

[0001] The invention belongs to the technical field of Chinese short text classification and relates to an optimized classification method for short texts, in particular for short texts from the Internet.

Background art

[0002] In the information age of exploding data, smart mobile terminals and the rapid development of Internet technology have led people to communicate more and more frequently on the mobile Internet, generating large amounts of data. Most of this data takes the form of short texts, such as Weibo posts and instant push news. Their content is concise and rich in meaning, which gives them high research value. How to classify these short texts automatically, to help understand the meaning they express, has therefore become a hot and difficult research topic in natural language processing and machine learning...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/27, G06K9/62

CPC: G06F40/289, G06F40/30, G06F18/22, G06F18/23213, G06F18/2411

Inventor: 尹垚, 李芳芳, 毛星亮, 施荣华, 石金晶, 胡超

Owner: 长沙市智为信息技术有限公司
