A short text clustering method based on weighted word vector representation and combinatorial similarity

A clustering method and vector representation technology, applied to text database clustering/classification and unstructured text data retrieval. It addresses the problems that differences between words are not considered and that the information of important words is weakened, with the effects of reducing time complexity and dimension and enhancing the weight of important words.

Inactive Publication Date: 2019-03-22
上海文军信息技术有限公司

AI Technical Summary

Problems solved by technology

The existing short text representation method simply assumes that all words are equally important and does not consider the differences between words, which may weaken the information carried by some important words.
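
The difference between treating all words equally and weighting them can be sketched as follows. This is a minimal illustration, not the patented procedure: the function name `average_vector`, the two-dimensional toy embeddings, and the weights 0.1 and 0.9 are all assumptions made for the example, and the source does not state here which weighting scheme it uses.

```python
def average_vector(word_vectors, weights=None):
    """Return the (optionally weighted) mean of a list of word vectors."""
    if weights is None:
        # Uniform case: every word contributes equally.
        weights = [1.0] * len(word_vectors)
    total = sum(weights)
    dim = len(word_vectors[0])
    return [
        sum(w * vec[k] for w, vec in zip(weights, word_vectors)) / total
        for k in range(dim)
    ]


# Hypothetical toy embeddings for a common word and an informative word.
vectors = [[1.0, 0.0], [0.0, 1.0]]

uniform = average_vector(vectors)                       # both words count the same
weighted = average_vector(vectors, weights=[0.1, 0.9])  # informative word dominates
```

With uniform weights the informative word is diluted to half of the representation; with the assumed weights it determines most of it.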

Embodiment Construction

[0049] To overcome the deficiencies of the prior art, the present invention proposes a short text clustering method based on weighted word vector representation and combined similarity, in order to further improve the accuracy of short text clustering.

[0050] To describe the present invention more specifically, its technical solutions are described in detail below with reference to the accompanying drawings and specific embodiments.

[0051] Figure 1 shows a flow chart of the short text clustering method based on weighted word vector representation and combined similarity in this embodiment; the specific process is:

[0052] Step 1: Data acquisition. Obtain the short text collection D = {D_1, D_2, ..., D_N}, where D_i denotes the i-th short text, 1 ≤ i ≤ N, and N is the total number of short texts in the collection D;

[0053] Step 2: For each short text D_i in the short text collection D, perform word segmentation and remove stop words from the...
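
The preprocessing in Step 2 can be sketched as below. This is only an illustration under stated assumptions: the `preprocess` function, the tiny `STOP_WORDS` set, and the English example texts are hypothetical, and a regular-expression tokenizer stands in for word segmentation. For Chinese short texts, a dedicated word segmenter would take the place of the regular expression.

```python
import re

# Hypothetical toy stop-word list; a real list would be far larger.
STOP_WORDS = {"the", "a", "is", "of", "and", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


# Toy short text collection D = {D_1, D_2}.
docs = ["The delivery is fast", "Refund of the order"]
tokenized = [preprocess(d) for d in docs]
```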

Abstract

The invention discloses a short text clustering method based on weighted word vector representation and combined similarity. The method comprises: preprocessing the short texts; constructing a weighted word vector representation of each short text; calculating the Euclidean distance similarity and the cosine similarity between short texts and constructing the combined similarity matrix; constructing a low-dimensional vector representation of each short text; and finally applying K-means to achieve more accurate short text clustering.
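
The combined similarity matrix mentioned in the abstract can be sketched as follows. The abstract does not give the formulas, so three things here are assumptions: Euclidean distance d is mapped to a similarity as 1 / (1 + d), the two similarities are mixed linearly with a hypothetical parameter `alpha`, and the toy document vectors are invented for the example.

```python
import math


def euclidean_similarity(u, v):
    """Assumed mapping from Euclidean distance d to a similarity: 1 / (1 + d)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + d)


def cosine_similarity(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity_matrix(vectors, alpha=0.5):
    """Build a symmetric matrix mixing both similarities (alpha is an assumption)."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = (alpha * euclidean_similarity(vectors[i], vectors[j])
                 + (1.0 - alpha) * cosine_similarity(vectors[i], vectors[j]))
            sim[i][j] = sim[j][i] = s
    return sim


# Hypothetical weighted document vectors for three short texts.
doc_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
S = combined_similarity_matrix(doc_vectors)
```

In this sketch the first two texts point in nearly the same direction and lie close together, so their combined similarity is higher than that between the first and third. The rows of such a matrix could then be reduced to a lower dimension and passed to K-means, as the abstract outlines.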

Description

Technical field

[0001] The invention belongs to the field of natural language processing and pattern recognition, and in particular relates to a short text clustering method based on weighted word vector representation and combined similarity.

Background technique

[0002] With the rapid development of the Internet and the widespread popularity of social media, people use mobile phone text messages, WeChat, Weibo, forums, etc. to share current news, product reviews and other information. A short text is a text of relatively short length and limited content (usually a text within 160 characters). In recent years, short texts on the Internet have grown extremely quickly and have become an important means of information dissemination. Short text allows users to quickly understand the content of a topic without taking up too much reading time. The main characteristics of short texts are that they are short in length, contain less content, and ...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/35
Inventor: 陈福陈小波
Owner: 上海文军信息技术有限公司