Topic phrase extraction method

A topic phrase extraction technology, applied in natural language data processing, special data processing applications, and instruments. It addresses unsatisfactory topic extraction results, ambiguity of topic words, and the loss of word order, with the effect of reducing topic drift and ambiguity while keeping accuracy and recall high.

Inactive Publication Date: 2018-11-30
BEIJING INFORMATION SCI & TECH UNIV

AI Technical Summary

Problems solved by technology

However, the traditional LDA model rests on the "bag of words" assumption: it ignores the order of the words in a document, samples topics from the corpus itself, and uses only the semantic information inside the corpus, which makes topic words prone to ambiguity.
Therefore, in many cases, topic extraction with the LDA model gives unsatisfactory results: the information granularity of the topic words is too small, the topics are hard to recognize, and the topic words are ambiguous.
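
The order-insensitivity of the "bag of words" assumption can be illustrated with a short Python sketch. It is not taken from the patent; the function name bag_of_words and the two sample sentences are assumptions chosen only for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts only; the position of each word is discarded."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but the same multiset of words.
first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# A bag-of-words model cannot tell these two sentences apart.
print(first == second)  # True
```

Any model that sees only such counts, including a standard LDA model, receives identical input for both sentences, which is why the word order has to be recovered by some other means, such as phrases.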


Embodiment Construction

[0044] In order to make the object, technical solution, and advantages of the present invention clearer, the present invention is further described below in conjunction with the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit it. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0045] As shown in Figure 1, a topic phrase extraction method includes the following steps:

[0046] Step 1) Document preprocessing: remove stop words and punctuation marks and use '$' as a separator to obtain the experimental corpus Cp;

[0047] Step 2) Find DTSet, FCSet and NPSet (the document-topic set, the full-text lexical chain set, and the noun phrase set): On the basis of the experimental corpus Cp, use LDA training and Gibbs sampling to obtain DTSet, and use...
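
Step 1 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation. It assumes that every run of removed stop words and punctuation marks is replaced by a single '$', so that the separator marks where a phrase may not continue. The stop word list STOP_WORDS, the tokenizing regular expression, and the function name preprocess are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real system would use a fuller one.
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "in", "to"}
SEP = "$"


def preprocess(document):
    """Drop stop words and punctuation, leaving one '$' where they were removed."""
    # Words are runs of word characters; every other non-space character
    # is treated as a single punctuation token.
    tokens = re.findall(r"\w+|[^\w\s]", document.lower())
    out = []
    for tok in tokens:
        is_word = tok[0].isalnum()
        if is_word and tok not in STOP_WORDS:
            out.append(tok)
        elif out and out[-1] != SEP:
            # Collapse consecutive removals into one separator,
            # and never start the output with a separator.
            out.append(SEP)
    if out and out[-1] == SEP:
        out.pop()  # no trailing separator
    return out


cp = preprocess("The extraction of topic phrases, and the ambiguity of topic words.")
print(" ".join(cp))  # extraction $ topic phrases $ ambiguity $ topic words
```

Under this assumption, words that remain adjacent between two separators (for example "topic phrases") were also adjacent in the original text, which is the property a later phrase-building step would need.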


Abstract

The invention relates to a topic phrase extraction method. The method comprises preprocessing documents; obtaining a document-topic set, a full-text lexical chain set, and a noun phrase set; obtaining a central word set; obtaining a candidate topic phrase set; and obtaining a topic phrase set. Topic phrases are extracted by combining an LDA (latent Dirichlet allocation) model with lexical chains, so the method can draw on WordNet, a knowledge base with complete semantic information outside the corpus. Strong lexical chains are obtained through semantic relevance calculation and strong-chain rule filtering, which greatly reduces the ambiguity of topic words. Topic phrases are then extracted according to a central word extraction method, followed by N-P rule combination and deduplication steps. Because topics are expressed by phrases rich in semantic information, the problems of low granularity and low recognizability of topic words are addressed, topic extraction accuracy and recall can be ensured, topic drift is reduced, and the needs of practical applications can be well met.

Description

Technical field

[0001] The invention belongs to the technical field of text mining, and in particular relates to a topic phrase extraction method.

Background

[0002] Document topic extraction technology can not only improve the quality of document retrieval, but also effectively deal with the high-dimensional sparsity of the vector space representation of documents. It is widely used in NLP tasks such as text classification, clustering, and information recommendation. Topic extraction is therefore one of the research focuses in the field of text mining today.

[0003] The LDA model is a probabilistic topic model commonly used in document topic research. It can identify latent topic information in large-scale document collections and corpora without relying on knowledge bases. However, the traditional LDA model is based on the assumption of the "bag of words" model, ignoring the order of the words in the document, sampling the topics of...
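
To make the LDA background concrete, the following Python sketch trains a small LDA model with the standard collapsed Gibbs sampler and returns the document-topic and topic-word distributions. It is a textbook-style illustration, not the patent's training procedure. The function name lda_gibbs, the hyperparameters alpha and beta, the iteration count, the seed, and the toy documents are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    ndk = [[0] * num_topics for _ in docs]       # topic counts per document
    nkw = [[0] * V for _ in range(num_topics)]   # word counts per topic
    nk = [0] * num_topics                        # total words per topic
    z = []                                       # topic assignment of each token

    # Random initial assignment of a topic to every token.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Conditional distribution over topics for this token.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    theta = [
        [(ndk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)]
        for t in range(num_topics)
    ]
    return theta, phi, vocab


toy_docs = [
    ["topic", "model", "topic", "corpus"],
    ["yarn", "fabric", "yarn", "knit"],
    ["corpus", "model", "topic"],
]
theta, phi, vocab = lda_gibbs(toy_docs, num_topics=2, iters=50)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of theta is one document's distribution over topics. Note that the sampler only ever reads word counts, so it inherits the bag-of-words limitation described in [0003].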


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/289
Inventor: 吕学强, 董志安
Owner: BEIJING INFORMATION SCI & TECH UNIV