File clustering method based on information bottleneck theory

An information bottleneck and document clustering technology, applied in electronic digital data processing, special data processing applications, instruments, etc. It addresses the problem that clustering accuracy is difficult to guarantee when time complexity is kept low, and achieves high accuracy, a simple principle, and fast results.

Status: Inactive; Publication Date: 2009-11-04

BEIHANG UNIV

0 Cites, 5 Cited by


Problems solved by technology

Incremental clustering has low time complexity, but its results often depend heavily on the order in which the documents are processed. Different orders may lead to different clustering results, so clustering accuracy is difficult to guarantee.


Embodiment Construction

[0020] The present invention uses information bottleneck theory to calculate the "similarity" relationship between documents and uses an incremental clustering algorithm to cluster them. This keeps the time complexity of the method relatively low and makes it suitable for applications with demanding time-performance requirements. At the same time, a sequence clustering algorithm is used to adjust the incremental clustering results so that the clustering process achieves high accuracy. A large number of experiments show that this method performs better than classical clustering algorithms such as the K-Means algorithm and the AIB algorithm.
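For orientation, the information bottleneck literature (in particular the agglomerative AIB algorithm named above) usually measures the "distance" between two clusters as the mutual information lost when they are merged. The formula below is that standard formulation, given as an assumption about what the similarity here looks like; it is not quoted from the patent text. The symbols p(c) (cluster prior), p(y | c) (word distribution of a cluster), and the weights π are notation introduced for this illustration.

```latex
d(c_i, c_j) = \bigl(p(c_i) + p(c_j)\bigr)\,
              \mathrm{JS}_{\pi}\bigl[\,p(y \mid c_i),\; p(y \mid c_j)\,\bigr],
\qquad
\pi_k = \frac{p(c_k)}{p(c_i) + p(c_j)}, \quad k \in \{i, j\}

\mathrm{JS}_{\pi}[p, q] = \pi_i\, D_{\mathrm{KL}}(p \,\|\, m) + \pi_j\, D_{\mathrm{KL}}(q \,\|\, m),
\qquad
m = \pi_i\, p + \pi_j\, q
```

A small value of d means the two clusters use words in nearly the same proportions, so merging them discards little information about the vocabulary.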

[0021] The present invention is a document clustering method based on information bottleneck theory. On the one hand, the method uses information bottleneck theory to calculate the similarity d between documents; on the other hand, [...] the clustering result C. The processing steps in the clustering process are:

[0022] Step 1, us...


Abstract

The invention discloses a file clustering method based on information bottleneck theory. The method first uses information bottleneck theory to calculate the similarity between files. An incremental clustering algorithm then clusters the files: for each file, the minimum mutual information loss against the existing clusters is calculated; if this loss satisfies a set threshold, the file is merged into the nearest cluster, otherwise a new cluster is created to hold the file. A sequence clustering method is then used to adjust the clustering result and improve clustering accuracy: each file is sampled in sequence during the adjustment, and the sampling frequency is set to control the intensity of the adjustment. The adjustment strategy covers all sample files, which helps improve clustering accuracy.
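The two phases described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the merge cost is the standard AIB weighted Jensen-Shannon cost, "satisfies a set threshold" is read as `cost <= threshold`, the "sampling frequency" is modelled as a number of full passes (`rounds`), and all function names and the `(weight, distribution)` document representation are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) in bits; terms with p[k] == 0 contribute 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def merge_cost(w1, p1, w2, p2):
    """Mutual information lost by merging two clusters (standard AIB cost).

    w1, w2 are prior weights; p1, p2 are word distributions p(y | cluster).
    """
    w = w1 + w2
    a, b = w1 / w, w2 / w
    mix = [a * x + b * y for x, y in zip(p1, p2)]
    return w * (a * kl(p1, mix) + b * kl(p2, mix))

def cluster_stats(members, docs):
    """Weight and word distribution of a cluster, given its document indices."""
    w = sum(docs[i][0] for i in members)
    dim = len(docs[0][1])
    p = [sum(docs[i][0] * docs[i][1][k] for i in members) / w for k in range(dim)]
    return w, p

def nearest(doc, clusters, docs):
    """Return (cost, index) of the cluster that is cheapest to merge doc into."""
    w, p = doc
    return min((merge_cost(w, p, *cluster_stats(c, docs)), j)
               for j, c in enumerate(clusters))

def incremental(docs, threshold):
    """Phase 1: one pass; join the nearest cluster or open a new one."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        if clusters:
            cost, j = nearest(doc, clusters, docs)
            if cost <= threshold:  # assumed reading of "satisfies a set threshold"
                clusters[j].append(i)
                continue
        clusters.append([i])
    return clusters

def adjust(docs, clusters, rounds):
    """Phase 2: draw each document in sequence and reassign it to its nearest cluster."""
    for _ in range(rounds):  # assumed stand-in for the "sampling frequency"
        for i in range(len(docs)):
            src = next(c for c in clusters if i in c)
            if len(src) == 1:
                continue  # this sketch leaves singleton clusters in place
            src.remove(i)
            _, j = nearest(docs[i], clusters, docs)
            clusters[j].append(i)
    return clusters

# Four toy documents over a two-word vocabulary, each with prior weight 0.25.
docs = [(0.25, [0.9, 0.1]), (0.25, [0.8, 0.2]),
        (0.25, [0.1, 0.9]), (0.25, [0.2, 0.8])]
clusters = adjust(docs, incremental(docs, threshold=0.05), rounds=1)
print(sorted(sorted(c) for c in clusters))
```

The first phase touches each file once, which is where the low time complexity comes from; the second phase revisits every file, so a file placed badly because of its arrival order gets a chance to move.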

Description

Technical Field

[0001] The invention relates to a clustering method for electronic documents. More specifically, it refers to a document clustering method based on information bottleneck theory.

Background Technique

[0002] The explosive growth of information on the Internet has made information harder to manage and use. To reveal the potentially valuable information or structure hidden behind Web data, Web mining technology has developed rapidly and been widely applied in recent years. Document clustering is one of the most important tools in the field of Web mining. Its purpose is to divide a set of documents into several clusters such that the text content within a cluster is highly similar, while the similarity between different clusters is as small as possible. Each clustering process mainly consists of two parts: the calculation of text content similarity and the text clustering method.

[0003] Most clustering procedures ...


Application Information


Patent Type & Authority: Applications (China)

IPC: IPC(8): G06F17/30

Inventor: 刘永利, 熊璋, 任捷, 欧阳元新

Owner: BEIHANG UNIV
