Chinese text data word vector representation method based on BIE position word list

A technology for text data and word vectors, applied in digital data processing, natural language data processing, instruments, etc. It addresses the problems of lexical information loss and limited applicability, achieves high accuracy, and improves the ability to solve entity nesting problems.

Pending Publication Date: 2022-04-05
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

The effectiveness of lexical information integration depends strongly on the embedding strategy. The WC-LSTM structure loses lexical information, and the Multi-digraph structure relies on dictionary labels, which makes it difficult to apply widely.


Examples


Embodiment 1

[0046] Refer to Figure 1, a flow chart of the method for representing Chinese text data word vectors based on a BIE position word list provided by an embodiment of the present invention. The method specifically includes:

[0047] Chinese text word vectors struggle to express the position and boundary information of Chinese words, which poses a major challenge for Chinese named entity recognition. This embodiment therefore focuses on Chinese text data sets.

[0048] The key of the present invention is how to express, within the word vector, the position of a character inside the word that contains it. In the Chinese entity recognition task, learning lexical boundaries helps the model distinguish entity boundaries, so the three BIE position dimensions (begin, inside, end) are used to express the position of characters within words. At the same time, in order to take into account both the total word set and the strong correlation word set, the weight T is u...
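The BIE grouping described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the function name `build_bie_word_sets`, the `max_len` cutoff, the toy lexicon, and the choice to skip single-character words (BIE has no separate dimension for them) are all hypothetical. For each character in a sentence, it collects the lexicon words in which that character sits at the beginning (B), inside (I), or end (E).

```python
def build_bie_word_sets(sentence, lexicon, max_len=4):
    """For every character index, collect matching lexicon words by B/I/E position.

    Assumption: only words of length >= 2 are matched, because a
    single-character word has no distinct begin/inside/end position.
    """
    n = len(sentence)
    sets = [{"B": set(), "I": set(), "E": set()} for _ in range(n)]
    for start in range(n):
        # Try every candidate span of length 2..max_len starting at `start`.
        for end in range(start + 2, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            sets[start]["B"].add(word)        # first character of the word
            sets[end - 1]["E"].add(word)      # last character of the word
            for mid in range(start + 1, end - 1):
                sets[mid]["I"].add(word)      # characters strictly inside
    return sets


# Hypothetical toy lexicon, used only for illustration.
lexicon = {"重庆", "邮电", "大学", "邮电大学", "重庆邮电大学"}
for ch, s in zip("重庆邮电大学", build_bie_word_sets("重庆邮电大学", lexicon, max_len=6)):
    print(ch, {pos: sorted(words) for pos, words in s.items()})
```

A character can appear in several sets at once: in this toy example the character 电 ends the word 邮电 and is also inside 邮电大学, which is the kind of overlapping boundary information the BIE dimensions are meant to preserve.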


Abstract

The invention relates to a Chinese text data word vector representation method based on a BIE position word list, in the field of deep learning and named entity recognition. The method comprises the following steps: S1, generating a total word set and a strong correlation word set, and constructing the BIE position word list; S2, constructing a position-independent word vector from the original word vector representation; S3, condensing the word vector representations in each word set with a word-frequency-weighted average pooling algorithm; and S4, weighting the BIE position word vector and concatenating the weighted vector with the original word vector to generate a word vector containing vocabulary position information. The method fuses the position information of the total vocabulary into the word vectors while highlighting the position information of strongly correlated vocabulary. It also expands the dimension of the character vector representation, so that Chinese entity recognition achieves higher accuracy.
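Steps S3 and S4 can be illustrated with a short sketch. It assumes pre-trained word vectors and word frequency counts are already available, and it treats the weight as a single scalar `t` multiplied onto each pooled position vector; the source text does not specify how the weight is computed or applied, so that choice, along with the names `weighted_pool` and `char_representation`, is purely hypothetical.

```python
def weighted_pool(words, word_vecs, word_freq, dim):
    """Word-frequency-weighted average of the vectors of `words` (zeros if empty)."""
    total = sum(word_freq[w] for w in words)
    if total == 0:
        return [0.0] * dim
    pooled = [0.0] * dim
    for w in words:
        weight = word_freq[w] / total  # more frequent words contribute more
        for k, value in enumerate(word_vecs[w]):
            pooled[k] += weight * value
    return pooled


def char_representation(char_vec, bie_sets, word_vecs, word_freq, dim, t=1.0):
    """Concatenate the original vector with the weighted B, I and E pooled vectors.

    Assumption: `t` is a plain scalar weight; the real weighting scheme
    is not given in the available text.
    """
    out = list(char_vec)
    for pos in ("B", "I", "E"):
        pooled = weighted_pool(sorted(bie_sets[pos]), word_vecs, word_freq, dim)
        out.extend(t * v for v in pooled)
    return out


# Hypothetical toy data: 2-dimensional word vectors and raw frequency counts.
word_vecs = {"重庆": [1.0, 0.0], "重庆邮电大学": [0.0, 1.0]}
word_freq = {"重庆": 3, "重庆邮电大学": 1}
bie_sets = {"B": {"重庆", "重庆邮电大学"}, "I": set(), "E": set()}
print(char_representation([0.5], bie_sets, word_vecs, word_freq, dim=2, t=2.0))
```

The output length is the original dimension plus three times the word vector dimension, which corresponds to the expanded representation dimension mentioned in the abstract.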

Description

Technical field

[0001] The invention belongs to the field of deep learning and named entity recognition, and relates to a Chinese text data word vector representation method based on a BIE position word list.

Background

[0002] Named Entity Recognition (NER) is a fundamental task in natural language processing and a subtask of information retrieval, relation extraction, and question answering systems. Unlike English text, where words are naturally delimited, the number of characters in a Chinese word is not fixed and there is no word segmentation marker. This makes it difficult to learn lexical boundary information in Chinese named entity recognition tasks. It is therefore necessary to incorporate lexical position information into the Chinese word vector representation to improve the accuracy of Chinese named entity recognition.

[0003] At present, the widely used vocabulary enhancement methods are vocabulary enhancement methods ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/295, G06F40/216, G06F40/284
Inventor: 王进, 王猛旗, 林兴, 杜雨露, 孙开伟
Owner: CHONGQING UNIV OF POSTS & TELECOMM