Word multi-prototype vector representation and word sense disambiguation method based on CRP clustering

A word vector representation and word sense disambiguation technique, applied in semantic analysis, natural language data processing, and special data processing applications. It addresses the heavy manpower and material cost of knowledge-base approaches, improves accuracy, and eliminates ambiguity.

Active Publication Date: 2018-12-18
NORTH CHINA UNIV OF WATER RESOURCES & ELECTRIC POWER
10 Cites, 26 Cited by

AI Technical Summary

Problems solved by technology

Methods based on an external knowledge base use that knowledge base (WordNet or HowNet) to explain or describe the different senses of a word, and so identify the specific sense of a polysemous word. However, constructing an external knowledge base or dictionary requires a great deal of manpower and material resources.

Examples

Embodiment Construction

[0041] A specific embodiment of the present invention will be described in detail below in conjunction with the accompanying drawings, but it should be understood that the protection scope of the present invention is not limited by the specific embodiment.

[0042] The invention discloses a word multi-prototype vector representation and word sense disambiguation method based on CRP clustering. As shown in Figure 1, the basic idea of the present invention is to construct a multi-prototype vector representation of words by CRP clustering of word context representations, identify polysemous words in sentences or short texts, eliminate the ambiguity of those polysemous words, and obtain a word vector representation of the specific sense of each word in its context. The multi-prototype representation of word vectors can represent the different senses of words in context more accurately. The concrete steps of the present invention are as follows:

[0043] In step S1, the te...

Abstract

The invention discloses a word multi-prototype vector representation and word sense disambiguation method based on CRP clustering, which comprises the following steps: 1, the text in a massive text corpus is cleaned and preprocessed to obtain plain text; the CRP algorithm is used to cluster the context window representations of each target polysemous word in the text corpus; the occurrences of the target polysemous word in the text corpus are marked according to their cluster; and word vectors are trained on the marked text corpus to obtain the multi-prototype vector representation of the polysemous word; 2, the target short text is preprocessed to obtain a short text word sequence; a target polysemous word in the word sequence is identified; the similarity between the context window representation of the target polysemous word and the centroid of each cluster corresponding to that word in the text corpus is computed; and the word vector corresponding to the cluster with maximum similarity is used as the word vector representation of the specific sense of the polysemous word in that context, thereby disambiguating the polysemous word. The invention solves the problem of expressing polysemy in word representations and the problem of identifying ambiguity in word meaning.
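The two steps in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the patented procedure: CRP is taken to mean the Chinese restaurant process; the context window representation is assumed to be the average of neighbouring word vectors; table assignment is made greedily (highest score wins) rather than by sampling, so the result is deterministic; and the names `context_vector`, `crp_cluster`, `disambiguate`, the parameter `alpha`, and the toy embeddings are all hypothetical. Training sense vectors on the marked corpus is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def context_vector(tokens: List[str], index: int,
                   embeddings: Dict[str, Vector], window: int = 2) -> List[float]:
    """Assumed context representation: mean embedding of words near tokens[index]."""
    dim = len(next(iter(embeddings.values())))
    lo, hi = max(0, index - window), min(len(tokens), index + window + 1)
    neighbours = [embeddings[t] for i, t in enumerate(tokens[lo:hi], start=lo)
                  if i != index and t in embeddings]
    if not neighbours:
        return [0.0] * dim
    return [sum(col) / len(neighbours) for col in zip(*neighbours)]


def crp_cluster(contexts: List[Vector],
                alpha: float = 0.5) -> Tuple[List[int], List[List[float]]]:
    """Greedy CRP-style clustering (an assumption; a sampler would draw instead).

    Existing cluster k scores n_k / (n + alpha) * cos(v, centroid_k);
    opening a new cluster scores alpha / (n + alpha).
    """
    sums: List[List[float]] = []   # per-cluster sum of member vectors
    counts: List[int] = []         # per-cluster member count
    labels: List[int] = []
    for v in contexts:
        n = len(labels)
        best_k, best_score = -1, alpha / (n + alpha)
        for k, (s, c) in enumerate(zip(sums, counts)):
            centroid = [x / c for x in s]
            score = c / (n + alpha) * max(0.0, cosine(v, centroid))
            if score > best_score:
                best_k, best_score = k, score
        if best_k == -1:           # open a new cluster
            sums.append(list(v))
            counts.append(1)
            best_k = len(sums) - 1
        else:                      # join the best existing cluster
            sums[best_k] = [a + b for a, b in zip(sums[best_k], v)]
            counts[best_k] += 1
        labels.append(best_k)
    centroids = [[x / c for x in s] for s, c in zip(sums, counts)]
    return labels, centroids


def disambiguate(context: Vector, centroids: List[List[float]]) -> int:
    """Return the index of the cluster whose centroid is most similar to context."""
    sims = [cosine(context, c) for c in centroids]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy 2-D embeddings and corpus (hypothetical data).
embeddings = {"money": (1.0, 0.0), "loan": (0.9, 0.1),
              "river": (0.0, 1.0), "water": (0.1, 0.9)}
corpus = [["loan", "bank", "money"], ["river", "bank", "water"],
          ["money", "bank", "loan"], ["water", "bank", "river"]]

# Step 1: cluster the contexts of "bank" and mark each occurrence with its cluster.
contexts = [context_vector(s, s.index("bank"), embeddings) for s in corpus]
labels, centroids = crp_cluster(contexts, alpha=0.5)
tagged = [[f"bank_{k}" if t == "bank" else t for t in s]
          for s, k in zip(corpus, labels)]
print(labels)  # [0, 1, 0, 1]

# Step 2: pick the sense of "bank" in a new short text by maximum similarity.
query = ["money", "bank"]
sense = disambiguate(context_vector(query, 1, embeddings), centroids)
print(sense)   # 0
```

In a full pipeline, the `tagged` corpus would be passed to a word vector trainer so that `bank_0` and `bank_1` each receive their own vector, and the vector for the selected sense would be returned in step 2.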

Description

Technical field

[0001] The invention relates to the field of natural language processing, in particular to a CRP clustering-based word multi-prototype vector representation and word sense disambiguation method.

Background technique

[0002] Among the many tasks in the field of natural language processing, the basic problem is how to represent language symbols as encodings that machines can process. Language symbols are mapped so that words, sentences, and texts are expressed as continuous low-dimensional vectors, giving semantic vector representations of words, sentences, and texts. Such representations have a wide range of applications in tasks such as recommendation engines and automatic text summarization.

[0003] Words are the most basic unit of language, and the vectorized representation of words is widely used in natural language processing tasks. A simple word vector representation is One-hot Representation. The disadvantage of this representation method is...
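A minimal sketch of one-hot encoding, assuming a toy three-word vocabulary (the vocabulary and the helper `one_hot` are illustrative and not taken from the patent). Each vector is as long as the vocabulary, and the vectors of any two distinct words have a dot product of 0.

```python
from typing import Dict, List


def one_hot(vocab: List[str]) -> Dict[str, List[int]]:
    """Map each word to a vector with a single 1 at the word's vocabulary index."""
    return {w: [1 if i == j else 0 for j in range(len(vocab))]
            for i, w in enumerate(vocab)}


vectors = one_hot(["bank", "river", "money"])
dot = sum(a * b for a, b in zip(vectors["bank"], vectors["money"]))
print(vectors["bank"])  # [1, 0, 0]
print(dot)              # 0
```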

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27, G06K9/62
CPC: G06F40/289, G06F40/30, G06F18/2321
Inventor: 李国佳, 郭鸿奇, 杨喜亮, 王国卿, 杨振中
Owner: NORTH CHINA UNIV OF WATER RESOURCES & ELECTRIC POWER