Microblog hot topic discovery algorithm based on BTM and GloVe similarity linear fusion

A microblog hot topic discovery technique based on linear fusion, applied in computing, computer components, and character and pattern recognition. It addresses problems such as reduced algorithm efficiency, reduced topic clustering accuracy, and neglect of word semantic information.

Inactive. Publication Date: 2020-07-03
HEBEI UNIV OF ENG
0 Cites, 5 Cited by

AI Technical Summary

Problems solved by technology

[0029] (2) JS divergence
[0067] 1. TF-IDF relies on statistical information such as word frequency. It ignores the semantic information of words, which reduces the accuracy of topic clustering.
[0068] 2. The research object here is a large collection of short microblog texts. The document-word vector matrix built by the TF-IDF algorithm is therefore highly sparse, which reduces the efficiency of the algorithm.
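The sparsity problem in [0068] can be illustrated with a minimal sketch. It assumes a common TF-IDF variant (term frequency times log(N/df)), which the source does not specify, and the three toy documents are hypothetical. Because short texts share few words, most cells of the document-word matrix are zero.

```python
from collections import Counter
import math

def tfidf_matrix(docs):
    """Build a dense document-word TF-IDF matrix for tokenized documents.

    Assumption: weight = (count / doc length) * log(N / document frequency).
    """
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append([
            (tf[w] / len(d)) * math.log(n / df[w]) if w in tf else 0.0
            for w in vocab
        ])
    return vocab, rows

def sparsity(rows):
    """Fraction of matrix cells that are exactly zero."""
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for v in r if v == 0.0)
    return zeros / total

# Hypothetical preprocessed short texts with no shared vocabulary.
docs = [
    ["volunteer", "help", "elderly", "cross", "road"],
    ["explosion", "factory", "rescue"],
    ["team", "win", "final", "match"],
]
vocab, rows = tfidf_matrix(docs)
print(len(vocab), round(sparsity(rows), 3))
```

With only three short texts, two thirds of the matrix is already zero; the fraction grows as more short texts with distinct vocabulary are added.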

Examples

Embodiment 1

[0147] In a short Weibo news text, the headline is generally enclosed in a pair of # signs or in square brackets, and the remaining content is the body text. The headline summarizes the news content.

[0148] Definition 1 (title words and text words): Assume that the first 10 words of any preprocessed microblog short text form the title and the remaining words form the text. That is, if the column index $l_s$ of word $s$ satisfies $l_s < 10$, then $s$ is a title word; otherwise, $s$ is a text word.

Example 1

[0149] Example 1: "[Hongjialou Primary School volunteers went to the streets to help the elderly cross the road] Start from yourself, care about everyone, care about everything, and promote the spirit of Lei Feng." After preprocessing, the short Weibo news text becomes: "Hongjialou Primary School / volunteer / walking / street / helping / elderly / crossing / road / doing / caring / people / concerning / things / promoting / Lei Feng spirit". The first 6 words, from "Hongjialou Primary School" to "elderly", are title words, and the words that follow are text words.
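The split in Example 1 can be sketched as a position-based cut over the token list. The function name `split_title_text` and the `title_len` parameter are hypothetical; Definition 1 assumes a cutoff of 10 words, while Example 1 has six title words, so the cutoff is passed explicitly here.

```python
def split_title_text(tokens, title_len=10):
    """Split preprocessed tokens into title words and text words by position."""
    return tokens[:title_len], tokens[title_len:]

# Tokens from Example 1 after preprocessing.
tokens = (
    "Hongjialou Primary School/volunteer/walking/street/helping/elderly/"
    "crossing/road/doing/caring/people/concerning/things/promoting/"
    "Lei Feng spirit"
).split("/")

title_words, text_words = split_title_text(tokens, title_len=6)
print(title_words)
print(text_words)
```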

[0150] When computing the weight transfer amount of words, the WMD distance uses only the TF value. This is a coarse measure: some words appear frequently but contribute little to topic discovery, so counting TF alone cannot accurately reflect the differences between words. In addition, title words and text words are not equally important, so the position of words should also be...

Example 2

[0157] Example 2: "A data set contains 100 microblog short texts with 1000 words in total. The word 'explosion' appears in 9 short texts, 20 times in total: 15 times in titles and 5 times in body text." If only the TF value is used, weight_explosion = 20 / 1000 = 0.02. If the weight formula that fuses the positional contribution of words is used, then

[0158]

[0159] In a clustering algorithm, accurately computing the distance between a text and each cluster center is a key step in deciding which cluster the text belongs to, so the choice of distance function plays a decisive role in the clustering result.
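A minimal sketch of the assignment step described in [0159], with the distance function passed in as a parameter. The Euclidean and Manhattan distances and the toy points are stand-ins chosen for illustration, not the distance used by the invention. They show that the same text can land in a different cluster when only the distance function changes.

```python
def assign_to_clusters(points, centers, distance):
    """Assign each point to the index of its nearest center under `distance`."""
    labels = []
    for p in points:
        dists = [distance(p, c) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical 2-D vectors standing in for a text and two cluster centers.
points = [(0.0, 0.0)]
centers = [(3.0, 3.0), (5.0, 0.0)]

print(assign_to_clusters(points, centers, euclidean))  # nearest by Euclidean
print(assign_to_clusters(points, centers, manhattan))  # nearest by Manhattan
```

Under the Euclidean distance the point is closer to the first center (about 4.24 versus 5), while under the Manhattan distance it is closer to the second (5 versus 6).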

[0160] Definition 3 (distance function of fusion similarity): Given the text similarity $\mathrm{Dis}_{BTM}(d_i, d_j)$ based on BTM topic modeling and JS divergence, and text similarity Dis based on GloVe word vector modeling and improved WMD distance ...
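Definition 3 is truncated in this excerpt, so the following is only a sketch under stated assumptions. The JS divergence is the standard one over two topic distributions. The fusion is assumed to be a convex combination with a hypothetical weight `lam`, which is one natural reading of "linear fusion" in the title; the names `fused_distance` and `dis_wmd` are likewise hypothetical, and the WMD-based value is taken as a given number rather than computed.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def fused_distance(dis_btm, dis_wmd, lam=0.5):
    """Assumed linear fusion: lam * dis_btm + (1 - lam) * dis_wmd."""
    return lam * dis_btm + (1 - lam) * dis_wmd

# Hypothetical document-topic distributions for two texts d_i and d_j.
theta_i = [0.7, 0.2, 0.1]
theta_j = [0.1, 0.3, 0.6]

dis_btm = js_divergence(theta_i, theta_j)
dis_wmd = 0.35  # placeholder for a WMD-based distance computed elsewhere
print(fused_distance(dis_btm, dis_wmd, lam=0.5))
```

With natural logarithms the JS divergence is bounded by log 2, so in practice the two terms may need to be brought to a comparable scale before they are combined.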

Abstract

The invention relates to a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity. The algorithm comprises three stages: data acquisition and preprocessing, modeling of the acquired data, and clustering of the modeled data. It is proposed to address the problem that the distance function of the K-means algorithm influences the clustering result for microblog hot topics. The GloVe model trains only on the non-zero elements of the word-word co-occurrence matrix rather than on the whole sparse matrix when exploiting statistical information, which effectively alleviates the sparsity problem that the TF-IDF algorithm faces when constructing the document-word vector matrix. The GloVe model also combines the global matrix factorization method with the local context window method, so the trained word vectors carry more semantic information. This alleviates, to a certain extent, the problem of polysemy (one word having multiple meanings), which the BTM topic model does not handle well.
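The remark about non-zero elements can be sketched by building a word-word co-occurrence table that stores only the pairs that actually co-occur. The window size and the toy sentence are hypothetical; weighting a pair by 1/d, where d is the distance between the two words, follows the weighting described for GloVe. This is not the GloVe training procedure itself, only the sparse statistic it trains on.

```python
from collections import defaultdict

def cooccurrence(docs, window=2):
    """Symmetric word-word co-occurrence counts, keeping non-zero entries only."""
    counts = defaultdict(float)
    for tokens in docs:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                context = tokens[j]
                weight = 1.0 / (i - j)  # closer words contribute more
                counts[(word, context)] += weight
                counts[(context, word)] += weight
    return dict(counts)

# Hypothetical preprocessed short text.
docs = [["volunteer", "help", "elderly"]]
counts = cooccurrence(docs, window=2)

vocab_size = len({w for d in docs for w in d})
density = len(counts) / vocab_size ** 2
print(len(counts), vocab_size ** 2, round(density, 3))
```

On a realistic vocabulary the stored pairs are a small fraction of all possible pairs, which is why training over this table is cheaper than processing the full matrix.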

Description

Technical field

[0001] The invention relates to the technical field of topic discovery and tracking in natural language processing, in particular to a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity.

Background

[0002] With the rapid development of the traditional Internet and the mobile Internet, Weibo has flourished. Weibo lets users publish messages through web pages, external programs, and mobile phone clients in order to share them. Its short texts, timeliness, and interactivity have won public recognition, and it has gradually become an important tool for obtaining and publishing information. How to mine hot topics from massive, disordered microblog data has become an urgent problem.

[0003] In order to solve the above problems, there are many methods nowadays, mainly including "use TF-IDF to vectorize the microblog text set, and then use clustering algorithm to fi...

Application Information

IPC(8): G06F16/35, G06F16/34, G06F40/211, G06F40/289, G06F40/30, G06K9/62
CPC: G06F16/35, G06F16/345, G06F18/23213, G06F18/22
Inventor: 吴迪, 张梦甜, 生龙, 黄竹韵, 杨瑞欣, 孙雷
Owner: HEBEI UNIV OF ENG