Micro-blog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet allocation) model

A classification method using lexical chain technology, applied in special data processing applications, instruments, electrical digital data processing, etc., which improves classification performance

Active Publication Date: 2018-11-30
ZHEJIANG UNIV OF TECH
2 Cites, 7 Cited by

AI Technical Summary

Problems solved by technology

In these studies the LDA model can achieve better results, but in microblog text classification it cannot solve the feature sparsity problem of microblog text.


Embodiment Construction

[0050] The technical solution of the present invention will be further described below in conjunction with the accompanying drawings.

[0051] The concrete implementation steps of the microblog classification method of the present invention, based on lexical chain feature extension and the LDA model, are as follows:

[0052] (1) Obtain, through appropriate channels such as Sina Weibo and Tencent Weibo, a certain amount of microblog text data covering multiple different microblog categories;

[0053] (2) Preprocess the acquired microblog text, mainly through text cleaning, Chinese word segmentation, and stop-word removal. First, use regular expressions to remove irrelevant noise such as empty text, emoji, account names, network links, and pictures; then use a word segmentation tool to perform Chinese word segmentation and part-of-speech tagging on the microblog texts, and remove frequently occurring words that carry little meaning, such as function words;

[0054] (3) outpu...
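
The preprocessing in step (2) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the three regular expressions, the function names `clean` and `preprocess`, and the `segment` callable are hypothetical and are not taken from the patent. `segment` stands in for whatever Chinese word segmentation tool is used; part-of-speech tagging and picture removal are omitted.

```python
import re

# Hypothetical patterns: the source names the kinds of noise to remove,
# but does not give the regular expressions themselves.
URL_RE = re.compile(r"https?://\S+")          # network links
MENTION_RE = re.compile(r"@[\w\-]+")          # account names such as @user
EMOJI_TAG_RE = re.compile(r"\[[^\[\]]{1,8}\]")  # bracketed emoticon tags such as [haha]


def clean(text):
    """Strip links, account names and emoticon tags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, EMOJI_TAG_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(posts, segment, stop_words):
    """Clean each post, segment it into words, and drop stop words.

    `segment` is any callable mapping a string to a list of words
    (an assumption; a Chinese segmenter would be plugged in here).
    Posts that are empty after cleaning or filtering are discarded.
    """
    docs = []
    for post in posts:
        cleaned = clean(post)
        if not cleaned:
            continue  # empty text is treated as noise
        tokens = [t for t in segment(cleaned) if t.strip() and t not in stop_words]
        if tokens:
            docs.append(tokens)
    return docs
```

Passing `str.split` as `segment` is enough to try the sketch on space-delimited text; for real microblog data a Chinese word segmenter would be supplied instead.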


Abstract

The invention discloses a micro-blog short text classification method based on lexical chain feature extension and an LDA (latent Dirichlet allocation) model. To address the short length, limited content, and sparse features of micro-blog text, a lexical chain feature extension method is provided. A basic lexical chain is generated from a Chinese thesaurus, and the micro-blog text is extended with it. The lexical chain covers the words recorded in the Chinese thesaurus and can also cover words not recorded there, and it is continually enriched as micro-blog text is extended. To address the high dimensionality and weak semantic features of the vector space model in micro-blog text classification, the micro-blog text is represented by the topic probability distribution of an LDA topic model, which effectively reduces the dimensionality of the similarity calculation and incorporates some semantic features. The method combines the advantages of lexical chain feature extension and the LDA model into a micro-blog classification method. Experimental results show that the method effectively improves the classification performance of micro-blog text.
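
The two ideas in the abstract can be illustrated with a short Python sketch. It is illustrative only and rests on several assumptions: lexical chains are modelled as plain lists of related words, the names `build_index`, `expand`, `cosine` and the `max_added` limit are hypothetical, and cosine similarity is assumed because the abstract does not name the similarity measure. The rule for enriching chains with words missing from the thesaurus is not given in this excerpt and is not modelled. Topic distributions are assumed to come from an already trained LDA model, which is not shown.

```python
import math


def build_index(chains):
    """Map each word to the position of the first chain that contains it."""
    index = {}
    for i, chain in enumerate(chains):
        for word in chain:
            index.setdefault(word, i)
    return index


def expand(tokens, chains, index, max_added=2):
    """Append up to `max_added` chain members for every token found in a chain.

    `max_added` is a hypothetical cap; the source does not state one.
    """
    expanded = list(tokens)
    seen = set(tokens)
    for tok in tokens:
        chain_id = index.get(tok)
        if chain_id is None:
            continue  # token is not covered by any chain
        added = 0
        for word in chains[chain_id]:
            if added >= max_added:
                break
            if word not in seen:
                expanded.append(word)
                seen.add(word)
                added += 1
    return expanded


def cosine(p, q):
    """Cosine similarity of two topic distributions of equal length K."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0
```

Comparing texts through `cosine` on K-dimensional topic vectors, rather than on vocabulary-sized word vectors, is what keeps the similarity calculation low-dimensional.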

Description

Technical field

[0001] The invention relates to a method for classifying microblog texts.

Background technique

[0002] The popularity and development of Weibo has promoted communication between people, but it has also brought great challenges. The widespread use of microblogs in daily life has led to explosive growth of information. The main characteristics of microblog texts are short length, limited content, and sparse features. Because of these characteristics, filtering out the parts that users are interested in from massive volumes of microblogs and classifying them has become both a research hotspot and a difficulty.

[0003] Microblog text classification methods fall into two main categories. The first is based on a large-scale corpus. This type of method addresses problems such as sparse microblog text features by using knowledge bases to expand concept semantics. Commonly used knowledge bases include WordNet, Wikipedia, and Synonyms Cilin. Using this method can mine the...

Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F40/247
Inventor 刘端阳, 刘坤, 沈国江, 刘志, 朱李楠, 杨曦, 阮中远
Owner ZHEJIANG UNIV OF TECH