K-means text clustering method and device with built-in constraint rules

A text clustering technology, applied in text database clustering/classification, unstructured text data retrieval, instruments, etc., that addresses problems such as high computational complexity and unsatisfactory clustering results.

Active Publication Date: 2020-10-23

ZHONGKE DINGFU BEIJING TECH DEV



Problems solved by technology

[0008] After analysis, the inventor believes that in each iteration of the typical k-means clustering algorithm, all the data in the data set must participate in the calculation: the distance from every text other than the cluster centers to each cluster center is computed. The computational complexity is therefore high, especially when the number of texts in the data set is large and/or the number of selected cluster centers is large.

In addition, the typical k-means clustering method extracts features from the text content to calculate the distance between texts. The clustering effect is not ideal: texts with similar features but different or unrelated topics are easily grouped together, while texts with dissimilar features but the same or related topics are placed in the wrong clusters.

Embodiment Construction

[0045] Referring to Figure 1, the first embodiment of the present application provides a k-means text clustering method with built-in constraint rules, comprising steps S100 to S500:

[0046] S100: Preprocess the text set to be clustered using the first constraint rule to obtain a first preprocessing set corresponding to the first constraint rule. Texts conforming to the first constraint rule must be clustered into the same cluster, and the first preprocessing set includes the texts conforming to the corresponding first constraint rule.

[0047] In step S100, the first constraint rule is a preset rule under which texts conforming to the rule must be clustered into the same cluster. For example, if two texts A and B conform to a certain first constraint rule, then when the text set to be clustered that includes A and B is clustered, A and B must be placed in the same cluster. Here, the texts in the text set to be clustered that meet differ...
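The preprocessing in S100 can be pictured with a minimal sketch. It assumes, purely for illustration, that a first constraint rule is expressed as a predicate over a text; the names `Rule`, `build_preprocessing_sets`, and the keyword rule `"refund"` are hypothetical and do not come from the patent text.

```python
from typing import Callable, Dict, List

# Assumption: a first constraint rule is modelled as a predicate over a text.
Rule = Callable[[str], bool]


def build_preprocessing_sets(texts: List[str],
                             rules: Dict[str, Rule]) -> Dict[str, List[int]]:
    """For each rule, collect the indices of the texts that conform to it."""
    return {
        name: [i for i, text in enumerate(texts) if rule(text)]
        for name, rule in rules.items()
    }


# Hypothetical data: texts 0 and 2 conform to the same rule, so they would
# have to end up in the same cluster.
texts = [
    "How do I request a refund?",
    "The app crashes on startup",
    "Refund has not arrived yet",
]
rules = {"refund": lambda text: "refund" in text.lower()}
preprocessing_sets = build_preprocessing_sets(texts, rules)
```

Each value in `preprocessing_sets` plays the role of one first preprocessing set: a group of texts that the later clustering steps are expected to keep together.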

Abstract

The embodiment of the present application discloses a k-means text clustering method and device with built-in constraint rules. The method includes: preprocessing the text set to be clustered using the first constraint rule to obtain the first preprocessing set corresponding to the first constraint rule; obtaining k texts in the text set to be clustered as cluster centers, where k<N and N is the total number of texts in the text set to be clustered; if a cluster center is included in the first preprocessing set, adding the remaining texts in the first preprocessing set, other than the cluster center, to the cluster corresponding to that cluster center, and clearing from the text set to be clustered the texts that have already been added to a cluster; adding each remaining text in the text set to be clustered to the cluster corresponding to its nearest cluster center; and recalculating the new cluster center of each cluster, then outputting all the clusters if the new cluster centers meet the preset stop condition. The above clustering method can reduce the computational complexity of text clustering while improving clustering accuracy.
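One assignment pass of the method summarized above might look like the following sketch. It is not the patent's implementation: texts are assumed to be already converted to numeric feature vectors, Euclidean distance is assumed, a single first preprocessing set is used, and the function name `constrained_pass` is hypothetical. How later iterations and the stop condition are handled is not covered by the text available here, so the sketch stops after recomputing the centers once.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def constrained_pass(vectors: List[Vector],
                     center_ids: List[int],
                     prep_set: List[int]) -> Tuple[Dict[int, List[int]],
                                                   Dict[int, List[float]]]:
    """One pass: constrained assignment, nearest-center assignment, new centers."""
    clusters = {c: [c] for c in center_ids}
    remaining = set(range(len(vectors))) - set(center_ids)

    # If a cluster center lies in the preprocessing set, the rest of that set
    # joins the center's cluster without any distance computation and is
    # cleared from the pool. (Assumption: the first such center is used.)
    anchor = next((c for c in center_ids if c in prep_set), None)
    if anchor is not None:
        for i in prep_set:
            if i in remaining:
                clusters[anchor].append(i)
                remaining.discard(i)

    # Every text still in the pool joins the cluster of its nearest center.
    for i in sorted(remaining):
        nearest = min(center_ids,
                      key=lambda c: math.dist(vectors[i], vectors[c]))
        clusters[nearest].append(i)

    # Recompute each cluster center as the mean of its members.
    new_centers = {
        c: [sum(col) / len(members)
            for col in zip(*(vectors[j] for j in members))]
        for c, members in clusters.items()
    }
    return clusters, new_centers


# Hypothetical data: text 4 is geometrically closer to center 2, but the
# rule binds it to text 0, so it follows center 0.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [9.0, 9.0]]
clusters, new_centers = constrained_pass(vectors, center_ids=[0, 2],
                                         prep_set=[0, 4])
```

In this toy run only texts 1 and 3 require distance computations, which illustrates where the claimed reduction in computation would come from: constrained texts are placed by rule rather than by distance.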

Description

Technical field

[0001] The invention relates to the technical field of information processing and text mining, in particular to a k-means text clustering method and device with built-in constraint rules.

Background art

[0002] Cluster analysis is one of the important tools in the field of data mining, and its basic task is to cluster data. Clustering is the process of dividing data into different clusters: data in the same cluster are highly similar, while data in different clusters are highly dissimilar. Commonly used clustering algorithms include hierarchical clustering, grid clustering, partition clustering, etc. Among them, the k-means clustering algorithm, a partition clustering method, is the more commonly used and is easy to implement.

[0003] A typical k-means clustering algorithm includes the following steps:

[0004] 1. Randomly select k texts from a data set including N texts as cluster centers, k<N;

[0005] 2. For each text except the clust...
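For reference, the unconstrained baseline can be sketched as follows. Because the step list above is cut off, the assignment, update, and stopping steps here follow the standard textbook form of k-means rather than the patent's wording. Texts are assumed to be numeric feature vectors compared with Euclidean distance, and the function name `kmeans` and its parameters `max_iter` and `seed` are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple


def kmeans(vectors: List[Sequence[float]], k: int,
           max_iter: int = 100, seed: int = 0) -> Tuple[List[int],
                                                        List[List[float]]]:
    """Standard k-means on numeric vectors; returns labels and centers."""
    rng = random.Random(seed)
    # Step 1: randomly select k items as the initial cluster centers.
    centers = [list(v) for v in rng.sample(list(vectors), k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment: about N * k distance computations in every iteration.
        # For simplicity the initial centers are assigned like any other item.
        labels = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        # Update: each center becomes the mean of its assigned members.
        new_centers = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        # Stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical data: two well-separated groups of two points each.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, k=2)
```

The assignment line makes the cost discussed in this document visible: every iteration compares every item against every center, regardless of any prior knowledge about which texts belong together.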

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F16/35, G06K9/62

CPC: G06F16/35, G06F18/24765, G06F18/23213

Inventor: 李德彦, 晋耀红, 席丽娜

Owner: ZHONGKE DINGFU BEIJING TECH DEV
