Barrage text clustering method based on feature extension and T-oBTM
A text clustering technology, applied to text database clustering/classification, unstructured text data retrieval, special data processing applications, etc., which can solve problems such as low algorithm efficiency, long model processing time, and complex topic-word pair and topic distributions.
Examples
Embodiment 1
[0047] The present invention proposes a barrage text clustering method based on feature expansion and T-oBTM, which comprises three stages: a network neologism processing stage, a topic modeling stage, and a text clustering stage. The specific method is as follows:
[0048] The first stage is network neologism processing, which includes text preprocessing. In this stage, a new word recognition algorithm based on weight-optimized mutual information and left-right information entropy is used to find network neologisms in the barrage text, and the network neologisms are added to the word segmentation lexicon. An external knowledge base is then used to obtain content related to the network neologisms, feature words related to the neologisms are obtained through analysis, and these feature words are used to expand the text features and obtain the corpus. The specific method of the network neologism processing stage is: Us...
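By way of illustration, the following is a minimal sketch of the feature-expansion step just described: barrage texts containing a recognized network neologism are expanded with feature words associated with that neologism. The dictionary mapping neologisms to feature words stands in for the external knowledge base lookup, which is not detailed in this excerpt; all names and example data are hypothetical.

```python
# Minimal sketch of the feature-expansion step (stage 1).
# The `neologism_features` mapping stands in for feature words obtained
# from an external knowledge base; it is a hypothetical placeholder.

def expand_features(tokenized_texts, neologism_features):
    """Append knowledge-base feature words to each text that contains a neologism."""
    expanded_corpus = []
    for tokens in tokenized_texts:
        extra = []
        for word in tokens:
            extra.extend(neologism_features.get(word, []))
        expanded_corpus.append(tokens + extra)
    return expanded_corpus

if __name__ == "__main__":
    # Hypothetical barrage texts already segmented with the updated lexicon.
    texts = [["this", "anchor", "yyds"], ["nice", "video"]]
    # Hypothetical feature words retrieved for the neologism "yyds".
    features = {"yyds": ["praise", "forever", "best"]}
    print(expand_features(texts, features))
```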
Embodiment 2
[0052] The following barrage documents are analyzed as a case (only part of the text is shown):
[0053] (barrage text sample not reproduced)
[0054] 1. Obtain one or more bullet-chat (barrage) texts of the video data and display the resulting barrage data set;
[0055] 2. Use the new word recognition algorithm based on weight-optimized mutual information and left-right information entropy to find the top 8 new words in the barrage text set, and update the word segmentation lexicon;
[0056] 1. Mutual information score data for candidate word strings (a computational sketch follows the score data below):
[0057] Format: 'second-order co-occurrence word string': (mutual information score, word frequency)
[0058]-[0059] (mutual information score data not reproduced)
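The mutual information computation could look like the following minimal sketch, which scores second-order (adjacent-pair) co-occurring strings and returns them in the format shown above. Plain pointwise mutual information is used as a stand-in; the weight-optimized variant referred to in the patent is not reproduced, and the function name is hypothetical.

```python
import math
from collections import Counter

# Minimal sketch: pointwise mutual information of second-order (adjacent)
# co-occurring character strings, returned as 'string': (MI, frequency).
# The patent's weight-optimized variant is not reproduced here.

def mutual_information_scores(chars):
    """`chars` is the corpus as a flat sequence of characters/tokens."""
    unigram_counts = Counter(chars)
    bigram_counts = Counter(zip(chars, chars[1:]))
    n_uni = sum(unigram_counts.values())
    n_bi = sum(bigram_counts.values())
    scores = {}
    for (a, b), freq in bigram_counts.items():
        p_ab = freq / n_bi
        p_a = unigram_counts[a] / n_uni
        p_b = unigram_counts[b] / n_uni
        scores[a + b] = (math.log(p_ab / (p_a * p_b)), freq)
    return scores
```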
[0060] 2. Left and right information entropy scores of the candidate word strings (sketched below):
[0061] Format: 'second-order co-occurrence word string': left (right) information entropy
[0062] (left and right information entropy score data not reproduced)
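The left and right information entropy of a candidate string can be computed as in the sketch below: the Shannon entropy of the characters that appear immediately to the left (right) of the string across the barrage texts. Function and variable names are hypothetical.

```python
import math
from collections import Counter

# Minimal sketch: left/right information entropy of a candidate string,
# i.e. the Shannon entropy of the characters immediately adjacent to it.
# High entropy on both sides suggests the candidate has free boundaries
# and is more likely to be an independent new word.

def left_right_entropy(texts, candidate):
    left_chars, right_chars = Counter(), Counter()
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            if start > 0:
                left_chars[text[start - 1]] += 1
            end = start + len(candidate)
            if end < len(text):
                right_chars[text[end]] += 1
            start = text.find(candidate, start + 1)

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total) for c in counter.values())

    return entropy(left_chars), entropy(right_chars)
```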
[0063] 3. Word string scores: the top 8 word strings are displayed. The results show that the higher the score, the greater the probability that the word string is a more c...
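The word-string score that ranks candidates and keeps the top 8 could be combined as in the sketch below. The combination rule used here (mutual information plus the smaller of the two boundary entropies) is an assumption for illustration; the patent's weighting scheme is not reproduced.

```python
# Minimal sketch: combine the two scores and keep the top-k candidates
# as new words. The combination rule (MI plus the smaller boundary
# entropy) is an illustrative assumption, not the patent's exact formula.

def top_new_words(mi_scores, boundary_entropies, k=8):
    """mi_scores: {string: (mi, freq)}; boundary_entropies: {string: (left_H, right_H)}."""
    combined = {}
    for s, (mi, _freq) in mi_scores.items():
        left_h, right_h = boundary_entropies.get(s, (0.0, 0.0))
        combined[s] = mi + min(left_h, right_h)
    return sorted(combined, key=combined.get, reverse=True)[:k]
```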