A Topic Detection Method Based on Document Content and Interrelationships

A topic detection technique based on document content and interrelationships, applied to unstructured text data retrieval, network data retrieval, and other database retrieval. It addresses the problem that existing methods consider only document content.

Active Publication Date: 2020-10-30
TRANSN IOL TECH CO LTD
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

Most existing probability distribution-based topic modeling methods only consider document content

Embodiment Construction

[0032] The present invention will be described in further detail below in conjunction with the examples, but the protection scope of the present invention is not limited thereto.

[0033] The invention relates to a topic detection method based on document content and interrelationships. The method includes the following steps.

[0034] Step 1: Obtain N documents, and preprocess the documents to obtain a document-feature co-occurrence matrix X and a pairwise relationship matrix R.

[0035] In step 1, the preprocessing covers both English and Chinese text. English text preprocessing includes stemming and stop-word removal; Chinese text preprocessing includes word segmentation and removal of low-frequency words.
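The preprocessing in step 1 might be sketched as follows. This is a minimal illustration, not the patent's implementation: the stop-word list, the suffix-stripping rule in `naive_stem` (a stand-in for a real stemmer), and the `min_count` threshold are all assumptions, and Chinese word segmentation is assumed to have been done already by an external tool.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "are"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess_english(text):
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def drop_low_frequency(tokenized_docs, min_count=2):
    """Remove tokens whose corpus-wide count is below min_count.

    Expects documents that are already segmented into tokens.
    """
    counts = Counter(t for doc in tokenized_docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized_docs]
```

For example, `preprocess_english("The topics are detected in documents")` yields `["topic", "detect", "document"]` under these assumed rules.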

[0036] In the present invention, the document-feature co-occurrence matrix X refers to a matrix based on documents and words.
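One way to build such a document-word matrix is sketched below. The text does not say how entries are weighted, so raw term counts are an assumption here, and the function name `build_cooccurrence_matrix` is hypothetical.

```python
def build_cooccurrence_matrix(tokenized_docs):
    """Return (X, vocab): X[i][j] counts how often vocab[j] occurs in document i.

    Raw counts are an assumption; the source does not specify the weighting.
    """
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    index = {word: j for j, word in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in tokenized_docs]
    for i, doc in enumerate(tokenized_docs):
        for token in doc:
            X[i][index[token]] += 1
    return X, vocab

docs = [["topic", "model", "topic"], ["document", "model"]]
X, vocab = build_cooccurrence_matrix(docs)
# vocab is ["document", "model", "topic"]; X is [[0, 1, 2], [1, 1, 0]]
```

Each row of X corresponds to one of the N documents and each column to one word of the vocabulary.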

[0037] In the present invention, the pairwise relationship matrix R represents the...


Abstract

The invention relates to a topic detection method based on document content and interrelationships. The method preprocesses the obtained documents to produce a document-feature co-occurrence matrix and a pairwise relationship matrix, constructs an objective function on this basis, and iteratively computes a document representative degree matrix, a document membership degree matrix, a word representative degree matrix, and a word membership degree matrix. It then outputs the word representative degree matrix, in which each row corresponds to one topic; the word with the largest value in each row is taken as the topic word describing that topic. Because document clustering and word clustering are carried out together, joint clustering is more effective than clustering each separately. Considering document content and the relationships among documents at the same time yields a more comprehensive model than considering only one of them. By introducing the membership degree and the representative degree, the method applies not only to clustering problems but also to topic modeling problems.
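The final step described in the abstract, reading topic words off the word representative degree matrix, can be sketched as follows. The matrix values and vocabulary are made-up illustration data, the function name `topic_words` is hypothetical, and the `top_k` parameter is an added convenience; the abstract itself only mentions the word with the largest value in each row.

```python
def topic_words(word_rep, vocab, top_k=1):
    """For each row (one topic), return the top_k words with the largest values.

    word_rep: list of rows, one per topic, with one score per vocabulary word.
    top_k=1 matches the abstract; larger values are an assumed generalization.
    """
    topics = []
    for row in word_rep:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        topics.append([vocab[j] for j in ranked[:top_k]])
    return topics

# Made-up example: 2 topics over a 3-word vocabulary.
vocab = ["doc", "topic", "link"]
word_rep = [[0.1, 0.7, 0.2],
            [0.6, 0.1, 0.3]]
print(topic_words(word_rep, vocab))  # [['topic'], ['doc']]
```

This sketch covers only the read-out step; the iterative computation of the four matrices depends on the objective function, which is not given in this excerpt.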

Description

Technical field

[0001] The invention belongs to the technical field of digital computing equipment, data processing equipment, or data processing methods especially suited to specific functions, and in particular relates to a topic detection method based on document content and interrelationships.

Background technique

[0002] In many natural language processing and analysis problems, it is necessary to automatically detect the semantic topics of text content from massive Internet data through topic modeling methods, and at the same time to group and classify documents.

[0003] Current topic modeling methods are represented by LDA or pLDA, which treat topics as hidden variables and solve the topic model based on a latent Dirichlet distribution. Most existing probability distribution-based topic modeling methods only consider document content.

[0004] However, in many practical applications, there are often interrelationships between documents, s...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/30, G06F40/284, G06F40/289, G06F16/35, G06F16/953
CPC: G06F16/35, G06F16/951, G06F40/284, G06F40/289, G06F40/30
Inventor: 梅建萍, 王江涛
Owner: TRANSN IOL TECH CO LTD