K-means text clustering method and device with built-in constraint rules

A text clustering technology, applied in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc. It addresses problems such as unsatisfactory clustering results and high computational complexity.

Active Publication Date: 2018-04-13
ZHONGKE DINGFU BEIJING TECH DEV
2 Cites, 1 Cited by

AI Technical Summary

Problems solved by technology

[0008] After analysis, the inventor believes that in each iteration of the typical k-means clustering algorithm, all the data in the data set must take part in the calculation: the distance from every text other than the cluster centers to each cluster center has to be computed. The computational complexity is therefore high, especially when the data set contains a large number of texts and/or a large number of cluster centers is selected.
In addition, the typical k-means clustering method extracts features from the text content to calculate the distance between texts. The clustering effect is not ideal: texts with similar features but different or unrelated topics are easily grouped together, and texts with dissimilar features but the same or related topics are easily assigned to the wrong clusters.
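
As a rough illustration of the cost described in [0008], and assuming one distance computation per pair of text and cluster center, the count for a single iteration over N texts with k cluster centers can be written as follows. The symbol D_iter is introduced here for illustration only and does not appear in the source.

```latex
D_{\text{iter}} = (N - k)\,k
```

The count grows with both N and k, which is the situation the paragraph above describes as costly.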



Examples


Embodiment Construction

[0045] Referring to Figure 1, the first embodiment of the present application provides a k-means text clustering method with built-in constraint rules, including the following steps S100 to S500:

[0046] S100: Use the first constraint rule to preprocess the text set to be clustered, obtaining a first preprocessing set corresponding to the first constraint rule. Texts conforming to the first constraint rule must be clustered into the same cluster, and the first preprocessing set includes the texts conforming to the corresponding first constraint rule.

[0047] In step S100, the first constraint rule is a preset rule stating that texts conforming to it must be clustered into the same cluster. For example, if two texts A and B conform to a certain first constraint rule, then A and B must be placed in the same cluster when the text set containing them is clustered. Here, the texts in the text set to be clustered that meet differ...
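
The preprocessing in S100 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each first constraint rule can be expressed as a predicate over a text, and the names `Rule`, `build_preprocessing_sets`, and the sample rule `mentions_refund` are hypothetical. How texts matching several rules are handled is not covered by the excerpt above, so the sketch simply lists a text under every rule it matches.

```python
from typing import Callable, Dict, List

# Assumption: a first constraint rule is modelled as a predicate on the text.
Rule = Callable[[str], bool]


def build_preprocessing_sets(
    texts: List[str], rules: Dict[str, Rule]
) -> Dict[str, List[int]]:
    """For each rule name, collect the indices of the texts conforming to it.

    Each returned list plays the role of a "first preprocessing set":
    its members must later end up in the same cluster.
    """
    return {
        name: [i for i, text in enumerate(texts) if rule(text)]
        for name, rule in rules.items()
    }


# Hypothetical sample data and rule, for illustration only.
texts = [
    "I want a refund for my order",
    "The delivery was late",
    "Refund has not arrived yet",
]
rules = {"mentions_refund": lambda t: "refund" in t.lower()}

prep_sets = build_preprocessing_sets(texts, rules)
```

In this sample, texts 0 and 2 conform to the same rule, so they would have to be clustered together regardless of how far apart their feature vectors are.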



Abstract

Embodiments of the invention disclose a k-means text clustering method and device with built-in constraint rules. The method comprises the following steps: using a first constraint rule to preprocess a to-be-clustered text set to obtain a first preprocessing set corresponding to the first constraint rule; acquiring k texts in the to-be-clustered text set as cluster centers, where k is smaller than N and N is the total number of texts in the to-be-clustered text set; if a cluster center is contained in the first preprocessing set, adding the other texts in the first preprocessing set to the class cluster corresponding to that cluster center and removing the added texts from the to-be-clustered text set; adding each of the remaining texts in the to-be-clustered text set to the class cluster whose cluster center is closest to that text; and computing a new cluster center for each class cluster and, if the new cluster center meets a preset stopping condition, outputting all class clusters. With this clustering method, the computational complexity of text clustering can be reduced and the clustering precision can be improved.
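
The steps in the abstract can be sketched as a single assignment pass. This is a hedged sketch under several assumptions that are not in the source: texts are already represented as numeric feature vectors, distance is Euclidean, and when several cluster centers fall in the same preprocessing set the first one claims it. The function name `constrained_assignment_pass` and its parameters are hypothetical. The preset stopping condition and the repetition of the pass are left out because the abstract does not specify them.

```python
import numpy as np


def constrained_assignment_pass(X, center_idx, prep_sets):
    """One pass of the constrained assignment described in the abstract.

    X          : (N, d) array of text feature vectors (assumed given).
    center_idx : distinct indices of the k texts chosen as cluster centers.
    prep_sets  : list of index lists, one per first preprocessing set.
    Returns (labels, new_centers).
    """
    n, k = X.shape[0], len(center_idx)
    assert k < n, "k must be smaller than N"

    labels = np.full(n, -1)
    for c, idx in enumerate(center_idx):
        labels[idx] = c  # each center belongs to its own cluster

    # If a center lies in a preprocessing set, pull the rest of that set
    # into the center's cluster; these texts skip the distance step.
    for members in prep_sets:
        owners = [c for c, idx in enumerate(center_idx) if idx in members]
        if not owners:
            continue
        for i in members:
            if labels[i] == -1:
                labels[i] = owners[0]  # assumption: first center wins

    # Assign every remaining text to its nearest cluster center.
    remaining = np.flatnonzero(labels == -1)
    centers = X[center_idx]
    if remaining.size:
        diffs = X[remaining, None, :] - centers[None, :, :]
        dists = np.linalg.norm(diffs, axis=2)
        labels[remaining] = dists.argmin(axis=1)

    # Recompute each cluster center as the mean of its members.
    new_centers = np.stack([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, new_centers


# Hypothetical data: text 4 is near center 2 but shares a rule with text 0.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0], [9.0, 9.0]])
labels, new_centers = constrained_assignment_pass(X, [0, 2], [[0, 4]])
```

In the sample, text 4 is assigned to the cluster of text 0 without any distance computation, which illustrates how the constraint both overrides feature similarity and shrinks the set of texts that go through the nearest-center step.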

Description

Technical field

[0001] The invention relates to the technical fields of information processing and text mining, in particular to a k-means text clustering method and device with built-in constraint rules.

Background technique

[0002] Cluster analysis is one of the important tools in the field of data mining, and its basic work is to cluster data. Clustering is a process of classifying data into different clusters. The data in the same cluster have great similarity, while the data between different clusters have great dissimilarity. Commonly used clustering algorithms include hierarchical clustering, grid clustering, partition clustering, etc. Among them, the k-means clustering algorithm in partition clustering is relatively more commonly used and easy to implement.

[0003] A typical k-means clustering algorithm includes the following steps:

[0004] 1. Randomly select k texts from a data set including N texts as cluster centers, k<N;

[0005] 2. For each text except ...


Application Information

IPC(8): G06F17/30, G06K9/62
CPC: G06F16/35, G06F18/24765, G06F18/23213
Inventor: 李德彦, 晋耀红, 席丽娜
Owner: ZHONGKE DINGFU BEIJING TECH DEV