Microblog hot topic discovery algorithm based on BTM and GloVe similarity linear fusion
A technique combining linear fusion with microblog hot-topic discovery, applied in computing, computer components, character and pattern recognition, and related fields. It addresses problems such as reduced algorithm efficiency, reduced topic-clustering accuracy, and neglect of word semantic information.
Examples
Embodiment 1
[0147] In a short Weibo news text, the news title is generally enclosed in a pair of # signs or in square brackets, and the rest of the content is the main text. The title summarizes the news content.
[0148] Definition 1 (title words and text words): Assume that the first 10 words of any preprocessed microblog short text form the title and the remaining words form the text. That is, if the column label l_s of word s satisfies l_s < 10, then s is a title word; otherwise, s is a text word.
[0149] Example 1: "[Hongjialou Primary School volunteers went to the streets to help the elderly cross the road] Start from yourself, care about everyone, care about everything, and promote the spirit of Lei Feng." After preprocessing, the short text of this Weibo news item is: "Hongjialou Primary School / volunteer / walking / street / helping / elderly / crossing / road / doing / caring / people / concerning / things / promoting / Lei Feng spirit". The first 6 words, from "Hongjialou Primary School" to "elderly", are title words, and the words that follow are text words.
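The positional split in Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `split_title_and_text` and the parameter `title_len` are hypothetical, and the column label is assumed to be a zero-based index. The cutoff defaults to 10 as in Definition 1, while the usage below passes 6 to match the split described in Example 1.

```python
def split_title_and_text(words, title_len=10):
    """Split a preprocessed microblog word list by position.

    A word whose zero-based index is below title_len is treated as a
    title word; every other word is treated as a text word.
    (Zero-based indexing is an assumption made for this sketch.)
    """
    title_words = [w for i, w in enumerate(words) if i < title_len]
    text_words = [w for i, w in enumerate(words) if i >= title_len]
    return title_words, text_words


# Preprocessed short text from Example 1, segmented with "/".
segmented = (
    "Hongjialou Primary School/volunteer/walking/street/helping/elderly/"
    "crossing/road/doing/caring/people/concerning/things/promoting/"
    "Lei Feng spirit"
)
words = segmented.split("/")

# Example 1 treats the first 6 words as the title, so the cutoff is explicit.
title_words, text_words = split_title_and_text(words, title_len=6)
print("title words:", title_words)
print("text words:", text_words)
```

Keeping the cutoff as a parameter makes it easy to compare the fixed 10-word rule with a shorter title such as the one in Example 1.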
[0150] When calculating the weight transfer amount of words, the WMD distance measures word weight by TF value alone. This is relatively coarse: some words appear frequently but contribute little to topic discovery, so counting only the TF value makes it difficult to accurately reflect the differences between words. At the same time, title words and text words are not equally important, so the position of words should also be...
[0157] Example 2: "A data set of 100 microblog short texts contains 1000 words in total. The word 'explosion' appears in 9 of the short texts, 20 times in total: 15 times in titles and 5 times in the text." If the TF value is used, weight_explosion = 20 / 1000 = 0.02. If the weight calculation formula that fuses the word position contribution degree is used, then
[0158]
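The plain TF baseline from Example 2 can be reproduced as follows. This sketch covers only the TF calculation; it does not implement the position-aware weight formula, which is not reproduced here. The function name `tf_weight` is hypothetical, and the title share printed at the end is only an illustration of the positional information that plain TF discards.

```python
def tf_weight(occurrences, total_words):
    """Plain term frequency: occurrences of a word over all words in the data set."""
    return occurrences / total_words


# Counts taken from Example 2 for the word "explosion".
title_count = 15     # occurrences in titles
text_count = 5       # occurrences in the text
total_words = 1000   # words in the whole data set

occurrences = title_count + text_count
weight = tf_weight(occurrences, total_words)
title_share = title_count / occurrences

print(f"TF weight of 'explosion': {weight:.2f}")
print(f"share of occurrences in titles: {title_share:.2f}")
```

TF gives the same value whether the 20 occurrences fall in titles or in the text, which is the limitation paragraph [0150] points out.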
[0159] For a clustering algorithm, accurately calculating the distance between a text and each cluster center is a very important step in determining which cluster the text belongs to, so the choice of distance function plays a decisive role in the clustering results.
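The assignment step described above can be sketched with a pluggable distance function. This is a generic illustration under stated assumptions: texts and cluster centers are represented as small numeric vectors, Euclidean distance is used only as a stand-in, and the names `assign_to_nearest_center` and `euclidean` are hypothetical.

```python
import math


def euclidean(a, b):
    """Stand-in distance for this sketch; any text distance could be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest_center(text, centers, distance):
    """Return the index of the cluster center closest to `text` under `distance`."""
    return min(range(len(centers)), key=lambda k: distance(text, centers[k]))


# Toy vectors standing in for text representations and cluster centers.
centers = [(0.9, 0.1), (0.2, 0.8)]
texts = [(0.8, 0.2), (0.1, 0.9), (0.6, 0.4)]

labels = [assign_to_nearest_center(t, centers, euclidean) for t in texts]
print(labels)
```

Because the distance is passed in as an argument, swapping in a different distance function changes the cluster assignments without touching the assignment logic.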
[0160] Definition 3 (distance function of fusion similarity): Given the text similarity Dis_BTM(d_i, d_j) based on BTM topic modeling and JS divergence, and the text similarity Dis based on GloVe word vector modeling and improved WMD distance ...
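A rough sketch of the two ingredients named in Definition 3 follows. The JS divergence is the standard definition applied to two topic distributions. The fusion step is an assumption: since the full definition is not reproduced here, a convex combination with a hypothetical weight `lam` is used, based only on the "linear fusion" wording of the title. The GloVe/WMD distance value is a placeholder number, not a computed result.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def fused_distance(dis_btm, dis_glove, lam=0.5):
    """ASSUMED linear fusion: lam * Dis_BTM + (1 - lam) * Dis_GloVe.

    The weight `lam` and this exact form are hypothetical, chosen for illustration.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * dis_btm + (1 - lam) * dis_glove


# Toy topic distributions for two texts d_i and d_j (hypothetical values).
theta_i = [0.7, 0.2, 0.1]
theta_j = [0.1, 0.3, 0.6]

dis_btm = js_divergence(theta_i, theta_j)
dis_glove = 0.4  # placeholder for a GloVe-based improved WMD distance

print(f"Dis_BTM  = {dis_btm:.4f}")
print(f"Dis_fuse = {fused_distance(dis_btm, dis_glove, lam=0.5):.4f}")
```

With natural logarithms the JS divergence lies between 0 and ln 2, so the two distances may need to be brought onto a comparable scale before they are combined.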