Method for rapidly clustering documents

A document clustering technology, applied in special data processing applications, instruments, electronic digital data processing, etc., that addresses problems such as low efficiency and achieves improved computing efficiency

Inactive Publication Date: 2009-04-15
HARBIN INST OF TECH
0 Cites, 6 Cited by

AI Technical Summary

Problems solved by technology

[0004] The present invention provides a fast document clustering method that overcomes the low efficiency of existing clustering methods, which is caused by high-dimensional feature quantization and frequent similarity calculations.

Examples

Embodiment Construction

[0025] The present invention is further described below with reference to Figures 1 to 3 and a specific embodiment:

[0026] In the present invention, a document is represented as a set of several representative words rather than, as is widely done, as a vector in the same high-dimensional space as the model nodes. In large-scale text clustering, this greatly reduces the memory required for the feature representation of the documents. Two issues must be handled properly under this scheme: first, how to construct the vector space in which the node vectors of the model reside; second, how to calculate similarity efficiently, given that documents and node vectors differ in representation and dimensionality.
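The similarity issue raised in [0026] can be illustrated with a minimal sketch. It assumes, as one reading of the "accumulated value" mentioned in the abstract, that the similarity is the sum of a node's weights on the dimensions listed in the document's index set; the function name `set_vector_similarity` and the sample values are hypothetical and not taken from the patent.

```python
from typing import Sequence, Set


def set_vector_similarity(doc_indices: Set[int], node_weights: Sequence[float]) -> float:
    """Accumulate the node's weights on the dimensions the document contains.

    Assumption: similarity is a plain sum over matched dimensions. Only
    len(doc_indices) lookups are needed, not a pass over the full vector.
    """
    return sum(node_weights[i] for i in doc_indices)


# Hypothetical 6-dimensional feature space; the document holds keywords 1 and 4.
doc = {1, 4}
node_a = [0.0, 0.5, 0.0, 0.0, 0.25, 0.0]
node_b = [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]

sim_a = set_vector_similarity(doc, node_a)  # weights on dims 1 and 4
sim_b = set_vector_similarity(doc, node_b)  # no weight on dims 1 and 4
```

Under this assumption the cost of one comparison depends on the number of keywords kept per document rather than on the dimensionality of the node vector.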

[0027] For the problem of vector space construction, there are two methods: one is to dynamically generate a vector space according to the actual situation of the samples to be clustered when the documents to be cl...
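As a rough illustration of the first method named in [0027], generating the vector space dynamically from the samples to be clustered, the sketch below assigns a new dimension to each keyword the first time it is seen. The keyword extraction by word frequency follows step 1 of the abstract; whitespace tokenization, the cutoff `k`, and the names `top_keywords` and `build_feature_space` are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def top_keywords(text: str, k: int) -> List[str]:
    """Return the k most frequent words (assumed whitespace tokenization)."""
    counts = Counter(text.lower().split())
    return [word for word, _ in counts.most_common(k)]


def build_feature_space(docs: List[str], k: int) -> Tuple[Dict[str, int], List[Set[int]]]:
    """Grow the word-to-dimension map from the documents themselves."""
    vocab: Dict[str, int] = {}
    doc_sets: List[Set[int]] = []
    for text in docs:
        indices: Set[int] = set()
        for word in top_keywords(text, k):
            if word not in vocab:
                vocab[word] = len(vocab)  # new keyword gets the next free dimension
            indices.add(vocab[word])
        doc_sets.append(indices)
    return vocab, doc_sets


docs = ["cluster cluster map", "map neuron neuron"]
vocab, doc_sets = build_feature_space(docs, k=2)
```

Each document ends up as a small set of integers, and the size of `vocab` gives the dimensionality the node vectors would need.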


Abstract

The invention provides a fast document clustering method, realized by the following steps: 1, a group of keywords is extracted from each document by word-frequency statistics; 2, each document is represented as a set of index values, namely the dimensions in the feature space that correspond to the keywords it contains; 3, each neuron in a self-organizing map model is represented as a vector in the feature space; 4, the documents are input in sequence, and the similarity between each document and every neuron is calculated; 5, the neuron with the maximum accumulated value is the winner, and the winner and its neighboring neurons adjust their weights toward the current document; 6, the individual dimensions on which a neuron matches the input document are adjusted, while the weights of the other dimensions are weakened; 7, once all documents have been input, the method ends. The invention uses a self-organizing map clustering model and reworks the document quantization (representation) and similarity calculation stages, so that computing efficiency is greatly improved for the same number of documents while clustering quality is maintained.
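Steps 3 to 7 can be sketched roughly as follows. This is not the patented implementation: the one-dimensional map, the random initialization, the parameters `rate`, `decay`, and `radius`, and the exact update formulas (pulling matched dimensions toward 1 and multiplying the others by a decay factor) are all assumptions chosen to keep the example short.

```python
import random
from typing import List, Set, Tuple


def train_som(doc_sets: List[Set[int]], dim: int, n_nodes: int = 4,
              rate: float = 0.5, decay: float = 0.9, radius: int = 1,
              seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """One pass over the documents on a 1-D map (assumed topology)."""
    rng = random.Random(seed)
    # Step 3: each neuron is a dense vector in the feature space.
    nodes = [[rng.random() * 0.1 for _ in range(dim)] for _ in range(n_nodes)]
    assignments: List[int] = []
    for doc in doc_sets:  # Step 4: documents are input in sequence.
        # Similarity = accumulated weight on the document's dimensions.
        sims = [sum(w[i] for i in doc) for w in nodes]
        # Step 5: the neuron with the maximum accumulated value wins.
        winner = max(range(n_nodes), key=sims.__getitem__)
        assignments.append(winner)
        lo, hi = max(0, winner - radius), min(n_nodes - 1, winner + radius)
        for n in range(lo, hi + 1):  # winner and its neighbors
            w = nodes[n]
            for i in range(dim):
                if i in doc:
                    w[i] += rate * (1.0 - w[i])  # Step 6: adjust matched dims
                else:
                    w[i] *= decay                # Step 6: weaken the rest
    return nodes, assignments  # Step 7: all documents processed.


# Hypothetical input: three documents as index sets in a 6-dimensional space.
doc_sets = [{0, 1}, {0, 1}, {4, 5}]
nodes, assignments = train_som(doc_sets, dim=6)
```

For clarity the update loop visits every dimension; a version aimed at efficiency would more likely touch only the matched dimensions and apply the weakening lazily.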

Description

(1) Technical field

[0001] The invention relates to document clustering technology, and in particular to a fast document clustering method.

(2) Background technology

[0002] With the growing popularity of the network and the remarkable progress of information infrastructure, people often face an enormous number of natural language documents. The prominent problem is how to organize, condense, and integrate the rich information and knowledge they contain quickly and effectively, so as to improve people's ability to grasp this mass of information and raise their level of cognition. In recent years especially, the automatic sorting of users' personal documents, large-scale monitoring of online public opinion, topic detection and tracking, tracking of online public opinion trends, and automatic classification of large numbers of forum documents have all become inseparable from fast and high-quality research. Su...

Application Information

IPC(8): G06F17/30
Inventor: 刘远超, 刘铭, 王晓龙
Owner: HARBIN INST OF TECH