K-means text clustering method and device with built-in constraint rules

A text clustering technology, applied in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc., that addresses problems such as unsatisfactory clustering results and high computational complexity.

Active Publication Date: 2018-04-13

ZHONGKE DINGFU BEIJING TECH DEV

2 Cites, 1 Cited by

Problems solved by technology

[0008] After analysis, the inventor believes that in each iteration of the typical k-means clustering algorithm, all the data in the data set must take part in the computation: the distance from every text other than the cluster centers to every cluster center is calculated. The computational complexity is therefore high, especially when the data set contains a large number of texts and/or a large number of cluster centers is selected.

In addition, the typical k-means clustering method extracts features from the text content to calculate the distance between texts. The clustering effect is not ideal: texts with similar features but different or unrelated topics are easily grouped together, and texts with dissimilar features but the same or related topics are easily assigned to the wrong clusters.
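As a rough illustration of the cost described in [0008], the following sketch counts the distance calculations in one iteration, assuming every text that is not a cluster center is compared with all k centers. The function name and the `pre_assigned` parameter are hypothetical; `pre_assigned` models texts that are placed in a cluster by a constraint rule and therefore skip the distance step, as the abstract describes. The figures are illustrative, not measurements from the patent.

```python
def distance_computations(n_texts: int, k: int, pre_assigned: int = 0) -> int:
    """Count the distance calculations in one k-means iteration.

    Assumption: each text that is neither a cluster center nor already
    placed by a constraint rule is compared with all k cluster centers.
    """
    remaining = n_texts - k - pre_assigned
    if remaining < 0:
        raise ValueError("k + pre_assigned cannot exceed n_texts")
    return remaining * k


# Illustrative numbers only: 100,000 texts and 50 cluster centers.
baseline = distance_computations(100_000, 50)
# Same data, with 20,000 texts already placed by constraint rules.
constrained = distance_computations(100_000, 50, pre_assigned=20_000)
print(baseline, constrained)
```

The count grows with the product of the number of texts and the number of centers, which is why removing texts from the pool before the distance step lowers the per-iteration cost.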

Embodiment Construction

[0045] Referring to Figure 1, the first embodiment of the present application provides a k-means text clustering method with built-in constraint rules, comprising the following steps S100 to S500:

[0046] S100: Preprocess the text set to be clustered using a first constraint rule to obtain a first preprocessing set corresponding to that rule. Texts conforming to the first constraint rule must be clustered into the same cluster, and the first preprocessing set includes the texts conforming to the corresponding first constraint rule.

[0047] In step S100, the first constraint rule is a preset rule under which texts conforming to the rule must be clustered into the same cluster. For example, if two texts A and B conform to a certain first constraint rule, then when the text set to be clustered that includes A and B is clustered, A and B must be placed in the same cluster. Here, the texts in the text set to be clustered that meet differ...
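One way step S100 might look in code is sketched below. The patent text shown here does not say how a constraint rule is represented, so the sketch assumes each rule is a named predicate over the raw text (a hypothetical choice), and that a text matching several rules is placed in the set of the first rule it matches (also an assumption).

```python
from typing import Callable, Dict, List

# Hypothetical representation: a rule is a predicate over one text.
Rule = Callable[[str], bool]


def build_preprocessing_sets(
    texts: List[str], rules: Dict[str, Rule]
) -> Dict[str, List[int]]:
    """Return, for each rule, the indices of the texts that conform to it."""
    sets: Dict[str, List[int]] = {name: [] for name in rules}
    for idx, text in enumerate(texts):
        for name, rule in rules.items():
            if rule(text):
                # Assumption: the first matching rule wins.
                sets[name].append(idx)
                break
    return sets


texts = [
    "refund for order 12",
    "weather today",
    "order refund delayed",
    "rain and weather",
]
rules = {
    "refund": lambda t: "refund" in t,
    "weather": lambda t: "weather" in t,
}
sets = build_preprocessing_sets(texts, rules)
print(sets)
```

Each resulting list plays the role of a first preprocessing set: its members must end up in the same cluster.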

Abstract

Embodiments of the invention disclose a k-means text clustering method and device with built-in constraint rules. The method comprises the following steps: preprocessing a to-be-clustered text set with a first constraint rule to obtain a first preprocessing set corresponding to the first constraint rule; acquiring k texts in the to-be-clustered text set as cluster centers, where k is smaller than N and N is the total number of texts in the to-be-clustered text set; if a cluster center is contained in the first preprocessing set, adding the remaining texts of the first preprocessing set, other than the cluster center, to the class cluster corresponding to that cluster center, and removing the texts so added from the to-be-clustered text set; adding each remaining text in the to-be-clustered text set to the class cluster of the cluster center closest to it; and recomputing the cluster center of each class cluster and, if the new cluster centers meet a preset stopping condition, outputting all class clusters. The clustering method reduces the computational complexity of text clustering and at the same time improves clustering precision.
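A minimal sketch of one iteration of the procedure summarized in the abstract follows. It is not the patented implementation. It assumes texts are already converted to numeric vectors, uses Euclidean distance, counts each center as a member of its own cluster, and never absorbs one center into another center's cluster; all function and parameter names are hypothetical. The stopping condition and the outer loop are left out because the text shown here does not specify them.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def constrained_assign(
    vectors: List[Vector], center_ids: List[int], pre_sets: List[List[int]]
) -> Dict[int, List[int]]:
    """Assign every text index to a cluster keyed by its center's index."""
    clusters = {c: [c] for c in center_ids}
    pool = set(range(len(vectors))) - set(center_ids)

    # Texts that share a preprocessing set with a center join that
    # center's cluster directly and leave the pool (no distance needed).
    for c in center_ids:
        for group in pre_sets:
            if c in group:
                for idx in group:
                    if idx in pool:
                        clusters[c].append(idx)
                        pool.discard(idx)

    # Every text still in the pool goes to its nearest center.
    for idx in sorted(pool):
        nearest = min(
            center_ids, key=lambda c: euclidean(vectors[idx], vectors[c])
        )
        clusters[nearest].append(idx)
    return clusters


def mean_center(vectors: List[Vector], members: List[int]) -> List[float]:
    """Recompute a cluster center as the mean of its member vectors."""
    dim = len(vectors[members[0]])
    return [sum(vectors[m][d] for m in members) / len(members) for d in range(dim)]


vectors = [[0, 0], [10, 10], [9, 9], [1, 1], [0, 1]]
# Text 2 lies near center 1 but shares a constraint set with center 0.
clusters = constrained_assign(vectors, center_ids=[0, 1], pre_sets=[[0, 2]])
print(clusters)
```

In the usage example, text 2 is placed with center 0 despite being closer to center 1, which is the behavior the constraint rule is meant to enforce.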

Description

Technical field

[0001] The invention relates to the technical fields of information processing and text mining, in particular to a k-means text clustering method and device with built-in constraint rules.

Background

[0002] Cluster analysis is one of the important tools in the field of data mining, and its basic task is to cluster data. Clustering is the process of dividing data into different clusters: data in the same cluster are highly similar, while data in different clusters are highly dissimilar. Commonly used clustering algorithms include hierarchical clustering, grid clustering, partition clustering, etc. Among them, the k-means clustering algorithm, a partition clustering method, is commonly used and easy to implement.

[0003] A typical k-means clustering algorithm includes the following steps:

[0004] 1. Randomly select k texts from a data set including N texts as cluster centers, k<N;

[0005] 2. For each text except ...
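For orientation, the textbook form of k-means (Lloyd's iteration) can be sketched as below. This is the standard algorithm, not a reconstruction of the steps elided above. It assumes texts are already numeric vectors, uses Euclidean distance, and stops when the centers no longer change; the function name, the `seed` parameter, and that stopping rule are assumptions.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def kmeans(
    vectors: List[Vector], k: int, max_iter: int = 100, seed: int = 0
) -> Tuple[List[Vector], List[List[Vector]]]:
    """Plain k-means: random initial centers, nearest-center assignment, mean update."""
    rng = random.Random(seed)
    # Randomly select k of the N vectors as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    clusters: List[List[Vector]] = []
    for _ in range(max_iter):
        # Assign each vector to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k), key=lambda i: math.dist(v, centers[i]))
            clusters[j].append(v)
        # Recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)]
            if members
            else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # assumed stopping condition
            break
        centers = new_centers
    return centers, clusters


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

Every vector is compared with every center in each pass, which is the per-iteration cost the patent identifies as the source of high computational complexity.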

Application Information

IPC(8): G06F17/30, G06K9/62

CPC: G06F16/35, G06F18/24765, G06F18/23213

Inventor: 李德彦, 晋耀红, 席丽娜

Owner: ZHONGKE DINGFU BEIJING TECH DEV
