
K-means text clustering method and device with built-in constraint rules

A text clustering technology, applied in text database clustering/classification, unstructured text data retrieval, instruments, etc., which addresses problems such as high computational complexity and unsatisfactory clustering results.

Active Publication Date: 2020-10-23
ZHONGKE DINGFU BEIJING TECH DEV
Cites: 2; Cited by: 0

AI Technical Summary

Problems solved by technology

[0008] After analysis, the inventor believes that in each iteration of the typical k-means clustering algorithm, all the data in the data set must participate in the calculation: the distance from every text other than the cluster centers to each cluster center is computed. The computational complexity is therefore high, especially when the number of texts in the data set is large and/or the number of selected cluster centers is large.
In addition, the typical k-means clustering method extracts features from the text content to calculate the distance between texts. The clustering effect is not ideal: texts with similar features but different or unrelated topics are easily grouped together, while texts with dissimilar features but the same or related topics are classified into the wrong clusters.
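
The cost described in [0008] can be made concrete with a small sketch. It assumes texts have already been converted to numeric feature vectors and uses Euclidean distance; the function name `assign_to_nearest` and the toy vectors are hypothetical and are not taken from the patent. The sketch only counts how many distance computations one assignment pass of plain k-means performs.

```python
import math
from typing import Dict, List, Sequence, Tuple

def assign_to_nearest(
    vectors: Sequence[Sequence[float]], center_ids: List[int]
) -> Tuple[Dict[int, List[int]], int]:
    """One assignment pass of plain k-means, counting distance computations."""
    clusters = {c: [c] for c in center_ids}
    centers = set(center_ids)
    distance_calls = 0
    for i, vec in enumerate(vectors):
        if i in centers:
            continue  # a center already belongs to its own cluster
        best_center, best_dist = center_ids[0], math.inf
        for c in center_ids:
            d = math.dist(vec, vectors[c])
            distance_calls += 1
            if d < best_dist:
                best_center, best_dist = c, d
        clusters[best_center].append(i)
    return clusters, distance_calls

# Toy data (assumed): N = 6 feature vectors, k = 2 centers.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
clusters, distance_calls = assign_to_nearest(vectors, center_ids=[0, 3])
print(clusters)        # {0: [0, 1, 2], 3: [3, 4, 5]}
print(distance_calls)  # (N - k) * k = 8
```

Each pass costs (N - k) * k distance computations, so the count grows with both the number of texts and the number of cluster centers.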


Detailed Description of Embodiments

[0045] Referring to Figure 1, the first embodiment of the present application provides a k-means text clustering method with built-in constraint rules, including the following steps S100 to S500:

[0046] S100: Preprocess the text set to be clustered using a first constraint rule to obtain a first preprocessing set corresponding to that first constraint rule. Texts conforming to the first constraint rule must be clustered into the same cluster, and the first preprocessing set includes the texts conforming to the corresponding first constraint rule.

[0047] In step S100, the first constraint rule is a preset rule specifying that texts conforming to it must be clustered into the same cluster. For example, if two texts A and B conform to a certain first constraint rule, then when clustering a text set that includes A and B, A and B must be clustered into the same cluster. Here, the texts in the text set to be clustered that meet differ...
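
As a minimal sketch of step S100, the preprocessing can be pictured as filtering the text set with each rule. The source does not say how a constraint rule is represented, so modeling a rule as a Python predicate is an assumption, and the names `build_first_preprocessing_sets` and `mentions_refund` are hypothetical.

```python
from typing import Callable, Dict, List

Rule = Callable[[str], bool]  # assumed representation of a constraint rule

def build_first_preprocessing_sets(
    texts: List[str], rules: Dict[str, Rule]
) -> Dict[str, List[int]]:
    """For each first constraint rule, collect the indices of conforming texts."""
    return {
        name: [i for i, text in enumerate(texts) if rule(text)]
        for name, rule in rules.items()
    }

# Toy texts and a toy rule (both assumed for illustration).
texts = [
    "Refund not received",
    "App crashes on login",
    "How do I request a refund?",
]
rules = {"mentions_refund": lambda t: "refund" in t.lower()}

preprocessing_sets = build_first_preprocessing_sets(texts, rules)
print(preprocessing_sets)  # {'mentions_refund': [0, 2]}
```

Texts 0 and 2 conform to the same rule, so under the method they would have to end up in the same cluster.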



Abstract

The embodiment of the present application discloses a k-means text clustering method and device with built-in constraint rules. The method includes: preprocessing the text set to be clustered using a first constraint rule to obtain a first preprocessing set corresponding to the first constraint rule; obtaining k texts in the text set to be clustered as cluster centers, where k<N and N is the total number of texts in the text set to be clustered; if a cluster center is included in the first preprocessing set, adding the remaining texts of the first preprocessing set, other than the cluster center, to the cluster corresponding to that cluster center, and clearing from the text set to be clustered the texts that have already been added to a cluster; adding each remaining text in the text set to be clustered to the cluster corresponding to its nearest cluster center; and recalculating the new cluster center of each cluster, and outputting all the clusters if the new cluster centers meet a preset stop condition. This clustering method can reduce the computational complexity of text clustering while improving clustering accuracy.
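
One pass of the procedure in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: texts are assumed to be feature vectors, distance is assumed to be Euclidean, new centers are assumed to be cluster means, and the names `assign_with_first_constraint` and `recompute_centers` are hypothetical. The stop condition and the outer iteration loop are left out because the source does not specify them here.

```python
import math
from typing import Dict, List, Sequence

def assign_with_first_constraint(
    vectors: Sequence[Sequence[float]],
    center_ids: List[int],
    preprocessing_sets: List[List[int]],
) -> Dict[int, List[int]]:
    """Assign texts to clusters, honoring first-constraint (same-cluster) sets."""
    clusters = {c: [c] for c in center_ids}
    assigned = set(center_ids)
    # If a center belongs to a preprocessing set, the rest of that set joins
    # the center's cluster directly, with no distance computation.
    for c in center_ids:
        for group in preprocessing_sets:
            if c in group:
                for i in group:
                    if i not in assigned:
                        clusters[c].append(i)
                        assigned.add(i)
    # Remaining texts go to the cluster of the nearest center.
    for i, vec in enumerate(vectors):
        if i in assigned:
            continue
        nearest = min(center_ids, key=lambda c: math.dist(vec, vectors[c]))
        clusters[nearest].append(i)
    return clusters

def recompute_centers(
    vectors: Sequence[Sequence[float]], clusters: Dict[int, List[int]]
) -> Dict[int, List[float]]:
    """New center of each cluster, taken here as the mean of its members."""
    return {
        c: [sum(col) / len(members) for col in zip(*(vectors[i] for i in members))]
        for c, members in clusters.items()
    }

# Toy data (assumed): text 4 lies near center 2 but shares a rule with center 0.
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [4.9, 5.2]]
clusters = assign_with_first_constraint(vectors, [0, 2], [[0, 4]])
new_centers = recompute_centers(vectors, clusters)
print(clusters)  # {0: [0, 4, 1], 2: [2, 3]}
```

In this toy run, text 4 is placed with center 0 because of the rule even though it is closer to center 2, and it never enters the distance loop, which is where the saving in computation comes from.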

Description

Technical field

[0001] The invention relates to the technical field of information processing and text mining, in particular to a k-means text clustering method and device with built-in constraint rules.

Background art

[0002] Cluster analysis is one of the important tools in the field of data mining, and its basic task is to cluster data. Clustering is the process of dividing data into different clusters: data in the same cluster are highly similar, while data in different clusters are highly dissimilar. Commonly used clustering algorithms include hierarchical clustering, grid clustering, partition clustering, etc. Among them, the k-means clustering algorithm, a partition clustering method, is widely used and easy to implement.

[0003] A typical k-means clustering algorithm includes the following steps:

[0004] 1. Randomly select k texts from a data set including N texts as cluster centers, k<N;

[0005] 2. For each text except the clust...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06K9/62
CPC: G06F16/35, G06F18/24765, G06F18/23213
Inventor: 李德彦, 晋耀红, 席丽娜
Owner: ZHONGKE DINGFU BEIJING TECH DEV