
A document keyword extraction method and device based on LDA and word vectors

A keyword extraction technique based on LDA and word vectors. It addresses three problems with earlier approaches: extracted keywords that fail to highlight a document's core content, keyword sets that do not cover the document's topics comprehensively, and redundant keywords. The method avoids interference from noisy data and improves the precision and accuracy of the extracted keywords.

Active Publication Date: 2019-05-17
HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

Existing studies have three deficiencies. First, keywords recommended at the topic level tend to be words that are common across documents, so they cannot highlight the core content of each individual document. Second, the extracted topic words may include irrelevant words, which causes the keywords to drift away from the document's subject. Third, when synonyms or near-synonyms have the highest topic relevance, the recommended keywords become redundant and cannot cover the document's topics comprehensively.



Embodiment Construction

[0014] The present invention is described in further detail below with reference to Figures 1 to 3.

[0015] Referring to Figure 1, a method for extracting keywords from a document based on LDA and word vectors comprises the following steps: (A) use a title discriminator 10 to judge whether the document's title matches its content; if not, skip the document, and if so, proceed to the next step; (B) use the LDA topic model to calculate the weight of each topic in the document, and use the TF-IDF algorithm to calculate the weight of each word in the document with respect to the topic; (C) calculate the weight of each word in the document from the results of step B, sort the words by weight in descending order, and take the first N sorted words as the document's candidate keyword set; (E) map the words of the document title and of the candidate keyword set into the word vector space; (F) Calculate the distan...
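Steps (B) and (C) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the text does not state how the topic weights and the word-to-topic weights are combined, so the sketch assumes a simple weighted sum over topics. The function name `candidate_keywords`, its parameters, and the toy weights are all hypothetical; in practice the inputs would come from a trained LDA model and a TF-IDF computation.

```python
from typing import Dict, List, Tuple


def candidate_keywords(
    topic_weights: Dict[int, float],
    word_topic_weights: Dict[int, Dict[str, float]],
    n: int,
) -> List[Tuple[str, float]]:
    """Rank words by document-level weight and keep the first n.

    ASSUMPTION: a word's weight in the document is the sum over topics of
    (topic weight in the document) * (word weight with respect to the topic).
    The source only says the weight is derived from the results of step B.
    """
    scores: Dict[str, float] = {}
    for topic, t_weight in topic_weights.items():
        for word, w_weight in word_topic_weights.get(topic, {}).items():
            scores[word] = scores.get(word, 0.0) + t_weight * w_weight
    # Sort by weight, largest first; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Toy inputs (hypothetical values, standing in for LDA and TF-IDF output).
topic_weights = {0: 0.7, 1: 0.3}
word_topic_weights = {
    0: {"translation": 0.5, "corpus": 0.25},
    1: {"corpus": 0.5, "vector": 0.25},
}
top = candidate_keywords(topic_weights, word_topic_weights, n=2)
```

With these toy weights, "translation" ranks first and "corpus" second, while "vector" falls outside the candidate set because N is 2.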


Abstract

The invention relates to the technical field of natural language processing and deep learning, and in particular to a document keyword extraction method based on LDA and word vectors, which comprises the following steps: (A) judging whether a document's title and content are consistent by using a title discriminator, and executing the next step if they are consistent; (B) calculating the weight of each topic in the document and the weight of each word in the document with respect to the topic; (C) calculating the weight of each word in the document, and sorting by weight to generate a candidate keyword set for the document; (E) mapping the words into a word vector space; (F) calculating the distance between the word vectors in the word vector space, sorting the words by distance, and selecting the first M sorted words as the keywords of the document. The invention further discloses an extraction device. Compared with traditional methods, the extracted document keywords have high precision and high reliability; documents whose titles do not match their content are filtered out, which avoids interference from noisy data and further improves accuracy.
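Steps (E) and (F) can be sketched as follows. This is an illustration under stated assumptions, not the method as claimed: the abstract does not specify the distance metric or which vectors are compared, so the sketch assumes cosine distance between each candidate word and the centroid of the title word vectors, with smaller distances ranked first. The names `cosine_distance` and `select_keywords` and the two-dimensional toy vectors are hypothetical; real vectors would come from a trained word embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def select_keywords(
    title_words: List[str],
    candidates: List[str],
    vectors: Dict[str, List[float]],
    m: int,
) -> List[str]:
    """Keep the m candidates closest to the title in word vector space.

    ASSUMPTION: distance is measured from each candidate to the centroid
    of the title word vectors; the source does not spell this out.
    """
    title_vecs = [vectors[w] for w in title_words if w in vectors]
    if not title_vecs:
        return candidates[:m]  # no title vectors: fall back to the input order
    dim = len(title_vecs[0])
    centroid = [sum(v[i] for v in title_vecs) / len(title_vecs) for i in range(dim)]
    scored = [
        (cosine_distance(vectors[w], centroid), w)
        for w in candidates
        if w in vectors
    ]
    scored.sort()  # smallest distance first
    return [w for _, w in scored[:m]]


# Toy two-dimensional vectors (hypothetical values).
vectors = {
    "machine": [1.0, 0.0],
    "translation": [0.8, 0.6],
    "corpus": [0.6, 0.8],
    "weather": [0.0, 1.0],
}
keywords = select_keywords(
    ["machine", "translation"], ["corpus", "weather", "translation"], vectors, m=2
)
```

In this toy example the candidate "weather" lies farthest from the title centroid and is dropped, which mirrors the stated goal of removing words that drift away from the document's subject.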

Description

Technical field

[0001] The present invention relates to the technical fields of natural language processing and deep learning, and in particular to a document keyword extraction method and device based on LDA and word vectors.

Background

[0002] Keywords describe the content of a text concisely and accurately, and generally consist of several words or phrases. Keyword extraction, also known as keyword tagging, refers to extracting a number of representative words or phrases from a text or a collection of texts to reflect its main semantic information; keywords are an important channel for reaching information of interest. The advent of the Internet era places new requirements on keyword extraction. The extracted keywords should have three characteristics: significance, readability, and comprehensiveness. Significance means that the extracted keywords should reflect the core content of the document. For example, "machine translation" is extracted from the d...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
Inventors: 胡泽林, 曹宜超, 高翊, 李淼, 冯韬, 付莎, 李华龙, 杨选将, 刘先旺, 郭盼盼, 曾伟辉
Owner: HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI