
Rapid text clustering method on large corpus

A text clustering technique for large corpora, classified under relational databases, that addresses the high dimensionality and sparsity of text data and the slow convergence caused by long documents.

Active Publication Date: 2018-06-29
FUDAN UNIV
5 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

[0003] Because text data consists only of words, it is usually higher-dimensional and sparser than other extracted feature data. Clustering methods based solely on data similarity struggle to produce good results, whereas methods based on generative models, such as the Dirichlet multinomial mixture model, perform better.
[0004] However, the running time of the Dirichlet multinomial mixture model is proportional to document length. In a large corpus the documents are often long, so convergence is unsatisfactorily slow and overall data processing efficiency suffers.




Embodiment Construction

[0038] For convenience, the fast text clustering method with index optimization is referred to below as IGSDMM.

[0039] Two data sets are used as examples to show the advantages of the present invention over existing clustering algorithms. The data sets are as follows:

[0040] NG20. The dataset contains 18,846 documents from 20 mainstream Western newsgroups and is a classic benchmark for evaluating text clustering algorithms. The average length of documents in NG20 is 137.85, and the average number of words is 91.

[0041] Tweets. The dataset consists of 2,472 tweets associated with 89 queries. The relationship between tweets and queries is annotated by humans. The average length of tweets is 8.56, and the average number of words is 7.

[0042] Normalized mutual information (NMI) is widely used to measure the quality of clustering results. NMI measures the statistics shared between random variables representing clus...
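As a rough illustration of the NMI measure mentioned in [0042], the sketch below computes it from two label lists. The paragraph is truncated in this excerpt, so the exact normalization used in the patent is not known; the geometric-mean normalization chosen here, and the function name `nmi`, are assumptions made for illustration only.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings.

    Assumption: normalizes by sqrt(H(true) * H(pred)); other
    normalizations (arithmetic mean, min, max) are also in common use.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the joint and marginal label counts.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )

    # Entropy of each labeling.
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    # Degenerate case: at least one labeling puts everything in one group.
    if h_true == 0.0 or h_pred == 0.0:
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)
```

NMI is invariant to how cluster identifiers are named, so a predicted clustering that matches the ground truth up to relabeling scores 1, and a clustering that is statistically independent of the ground truth scores 0.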


Abstract

The invention belongs to the technical field of relational databases, and particularly relates to a rapid text clustering method on a large corpus. Because text data is usually high-dimensional and sparse, a clustering method based merely on data similarity rarely achieves good results, whereas methods based on a generative model, such as the Dirichlet multinomial mixture model, perform better. Accordingly, the method is optimized by exploiting the symmetric prior of the Dirichlet distribution and by constructing an index, so that the total time depends only on the number of distinct words in a document and the method runs efficiently on long documents.
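The following is a minimal sketch of why a symmetric Dirichlet prior allows the per-document cost to depend on distinct words rather than on document length. It is not the patent's algorithm: the index construction is not described in this excerpt, and the score used here is the collapsed Gibbs sampling score commonly used for Dirichlet multinomial mixture models, taken as an assumption. All function and parameter names (`log_score_distinct`, `log_score_per_token`, `m_z`, `n_z`, `n_zw`, `alpha`, `beta`) are hypothetical.

```python
import math
from collections import Counter


def _log_prior(m_z, num_docs, num_clusters, alpha):
    # Cluster-size term; m_z counts documents in cluster z, excluding this one.
    return math.log(m_z + alpha) - math.log(num_docs - 1 + num_clusters * alpha)


def log_score_per_token(tokens, m_z, n_z, n_zw, num_docs, num_clusters,
                        vocab_size, alpha, beta):
    """Baseline: one step per token, so cost grows with document length."""
    score = _log_prior(m_z, num_docs, num_clusters, alpha)
    seen = Counter()
    for i, word in enumerate(tokens):
        score += math.log(n_zw.get(word, 0) + beta + seen[word])
        score -= math.log(n_z + vocab_size * beta + i)
        seen[word] += 1
    return score


def log_score_distinct(tokens, m_z, n_z, n_zw, num_docs, num_clusters,
                       vocab_size, alpha, beta):
    """Same score, one step per distinct word.

    Uses prod_{j=0}^{c-1} (x + j) = Gamma(x + c) / Gamma(x), which applies
    because the symmetric prior adds the same beta to every word.
    """
    # In practice the counts would be precomputed once per document.
    counts = Counter(tokens)
    score = _log_prior(m_z, num_docs, num_clusters, alpha)
    for word, c in counts.items():
        base = n_zw.get(word, 0) + beta
        score += math.lgamma(base + c) - math.lgamma(base)
    denom = n_z + vocab_size * beta
    score -= math.lgamma(denom + len(tokens)) - math.lgamma(denom)
    return score
```

Both functions return the same unnormalized log-probability of assigning a document to cluster z; the second loops only over the document's distinct words once the word counts are available, which is the property the abstract attributes to the optimized method.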

Description

Technical field

[0001] The invention belongs to the technical field of relational databases, and in particular relates to a fast text clustering method on a large corpus.

Background technique

[0002] Text clustering is a common problem in data mining. It is an important means of organizing text information effectively and plays an important role in natural language processing research and related fields.

[0003] Because text data consists only of words, it is usually higher-dimensional and sparser than other extracted feature data. Clustering methods based solely on data similarity struggle to produce good results, whereas methods based on generative models, such as the Dirichlet multinomial mixture model, perform better.

[0004] However, the time spent by the Dirichlet multinomial mixture model is proportional to the length of the document. For a large corpus, the documents in it are often large, resulting in unsatisfact...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/285, G06F16/35
Inventor: 李林蔚, 郭良琛, 马会心, 何震瀛, 荆一楠, 王晓阳
Owner: FUDAN UNIV