
A Topic Modeling Method Based on Data Augmentation

A topic modeling technology, applied in digital data processing, special data processing applications, natural language data processing, etc., that addresses problems such as poor short-text feature expansion and selection and increased time cost.

Active Publication Date: 2020-03-17
HEFEI UNIV OF TECH

AI Technical Summary

Problems solved by technology

Although the three processing methods can alleviate the feature sparsity of short texts to a certain extent, they rely on overly strong assumptions, and the choice of data sources or external knowledge directly affects the expansion and selection of short-text features. These methods also add considerable extra time cost, which hinders the expansion and selection of short-text features on large-scale data.



Examples


Embodiment Construction

[0033] In this example, as shown in Figure 1, a topic modeling method based on data augmentation is carried out as follows:

[0034] Step 1. Obtain the document collection $D=\{D_1,\ldots,D_d,\ldots,D_{|D|}\}$, where $D_d$ denotes the $d$-th document, $1\le d\le |D|$. Assume the $d$-th document $D_d$ is composed of $|S|$ sentences; then its set of sentences is $S_d=\{S_{d,1},\ldots,S_{d,s},\ldots,S_{d,|S|}\}$, where $S_{d,s}$ denotes the $s$-th sentence of $D_d$, $1\le s\le |S|$. Assume $D_d$ is composed of $N_d$ words; then its set of words is $W_d=\{W_{d,1},\ldots,W_{d,j},\ldots,W_{d,N_d}\}$, where $W_{d,j}$ denotes the $j$-th word of $D_d$, $1\le j\le N_d$. All the words in the document collection $D$ then constitute the word collection $W=\{W_1,\ldots,W_i,\ldots,W_V\}$, where $W_i$ denotes the $i$-th word, $1\le i\le V$. The document set selected by the present invention is Sina Weibo data. Sina Weibo data consists of original posts published by Weibo users or content posted by other users. The characters of the pu...
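The representation in Step 1 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `build_representation`, the punctuation-based sentence splitting, and the whitespace tokenization are all assumptions made for illustration (real Weibo text would need a Chinese word segmenter).

```python
import re

def build_representation(raw_docs):
    """Return (sentences, words, vocab) for a list of raw document strings.

    sentences[d][s] plays the role of S_{d,s}, words[d][j] the role of
    W_{d,j}, and vocab the role of the word collection W_1..W_V.
    """
    sentences, words = [], []
    for text in raw_docs:
        # Assumption: sentences end at . ! ? or their full-width forms.
        sents = [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
        # Assumption: words are whitespace-separated, lowercased tokens.
        toks = [w.lower() for s in sents for w in s.split()]
        sentences.append(sents)
        words.append(toks)
    # The vocabulary is the set of distinct words over the whole collection.
    vocab = sorted({w for doc in words for w in doc})
    return sentences, words, vocab

# Hypothetical toy collection with |D| = 2 documents.
docs = [
    "Topic models are useful. Short texts are sparse!",
    "Sparse texts need augmentation.",
]
sentences, words, vocab = build_representation(docs)
```

Here `len(sentences[d])` corresponds to $|S|$ for document $d$, `len(words[d])` to $N_d$, and `len(vocab)` to $V$.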


Abstract

The invention discloses a topic modeling method based on data augmentation. The method comprises the following steps: step one, a document collection is acquired and represented; step two, the topics of the document collection D are extracted using a latent Dirichlet allocation (LDA) model, yielding K topic-word distributions and |D| document-topic distributions; step three, topic influence is assigned to the words; step four, data augmentation is carried out on each document; step five, the data-augmented topic model is constructed, and the final topic-word distributions are obtained. With this modeling method, document information can be fully used to carry out data augmentation when data are sparse, so that topic quality is improved.
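Step two can be sketched as follows. This is a minimal illustration of standard LDA fitted by collapsed Gibbs sampling, which is an assumption: the abstract does not say which inference procedure is used. The function name `lda_gibbs`, the hyperparameter defaults, and the toy data are likewise hypothetical. Steps three to five are not covered, since their details are not given in this excerpt.

```python
import random

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in range(V).
    Returns (phi, theta): K topic-word distributions and
    len(docs) document-topic distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total tokens per topic
    z = []                              # topic assignment of every token
    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][j]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][j] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Smoothed point estimates of the two families of distributions.
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
           for k in range(K)]
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    return phi, theta

# Hypothetical toy corpus: 4 documents over a vocabulary of V = 6 word ids.
docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 1], [4, 5, 3]]
phi, theta = lda_gibbs(docs, K=2, V=6)
```

`phi` holds the K topic-word distributions and `theta` the |D| document-topic distributions named in the abstract; each row sums to 1 by construction.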

Description

Technical field

[0001] The invention belongs to the field of data mining, and in particular relates to a topic modeling method based on data augmentation.

Background

[0002] With the development of social media and the mobile Internet, short texts such as Weibo posts and instant messages are flooding the Internet, making text content one of the most important elements of social networks. Analysis of short text content can help analyze user interests, detect emerging topics, identify interesting content, support real-time web search, etc. The current mainstream approach to analyzing text content is to use standard topic models, such as probabilistic latent semantic analysis and latent Dirichlet allocation, to mine normal-length text content, but this remains challenging for sparse short texts.

[0003] To address the sparsity of short-text features, there are mainly three processing methods that compensate for the limited information in short texts. One is to combine the characteristic...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/216, G06F16/36
CPC: G06F16/36, G06F40/216
Inventor: 刘业政, 朱婷婷, 孙见山, 姜元春, 孙春华, 杜非, 熊强
Owner: HEFEI UNIV OF TECH