Data augmentation algorithm for Chinese named entity recognition based on a sequence generative adversarial network

A named entity recognition and sequence generation technology, applied in the Internet field, that addresses the lack of large amounts of labeled data and the manpower and time otherwise spent collecting external resources.

Active Publication Date: 2020-10-02
BEIJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0015] 1. Modifying the structure of the deep model can enrich the semantic representation of the text, but it does not solve the shortage of labeled data.
[0016] 2. Introducing external resources takes considerable manpower and time to collect them, and effective rules must be designed to incorporate them into the model.

Examples

Embodiment 1

[0061] As shown in Figures 1 and 2, the present invention provides a method for applying a data augmentation algorithm based on a sequence generative adversarial network to the named entity recognition task. Specifically, during training, the method includes:

[0062] Step 1: Process the sentences in the corpus. Divide each sentence into entity and non-entity parts according to its entity label information, and add both kinds of parts to the dictionary. For example, given a text sequence {c1, c2, c3, c4, c5, c6} with labels {O, O, B-PER, I-PER, O, O}, c1c2 and c5c6 are classified as non-entity parts and c3c4 as an entity part; these parts and their corresponding labels are then added to the dictionary.

[0063] Step 2: Using the dictionary built from the entity and non-entity parts, map the parts of each sentence to their corresponding dictionary indexes to form an index sequence.

[0064] Step ...
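
Steps 1 and 2 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`split_chunks`, `add_to_dictionary`, `to_index_sequence`), the use of a (text, label) pair as the dictionary key, and the assignment of indexes in order of first appearance are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# A chunk is (text, label); label is an entity type such as "PER", or "O"
# for a non-entity part. This representation is an assumption of the sketch.
Chunk = Tuple[str, str]


def split_chunks(chars: List[str], tags: List[str]) -> List[Chunk]:
    """Step 1 (sketch): group a BIO-tagged sequence into entity / non-entity parts."""
    chunks: List[Chunk] = []
    buf: List[str] = []
    label = ""
    for ch, tag in zip(chars, tags):
        # "O" marks a non-entity token; "B-X" / "I-X" mark an entity of type X.
        cur = "O" if tag == "O" else tag.split("-", 1)[1]
        # A new part starts at every "B-" tag or whenever the label changes.
        if buf and (tag.startswith("B-") or cur != label):
            chunks.append(("".join(buf), label))
            buf = []
        buf.append(ch)
        label = cur
    if buf:
        chunks.append(("".join(buf), label))
    return chunks


def add_to_dictionary(chunks: List[Chunk], dictionary: Dict[Chunk, int]) -> Dict[Chunk, int]:
    """Step 1 (sketch): add each unseen part, with its label, to the dictionary."""
    for chunk in chunks:
        if chunk not in dictionary:
            dictionary[chunk] = len(dictionary)
    return dictionary


def to_index_sequence(chunks: List[Chunk], dictionary: Dict[Chunk, int]) -> List[int]:
    """Step 2 (sketch): map each part of a sentence to its dictionary index."""
    return [dictionary[chunk] for chunk in chunks]


# The example sequence from paragraph [0062].
chars = ["c1", "c2", "c3", "c4", "c5", "c6"]
tags = ["O", "O", "B-PER", "I-PER", "O", "O"]

chunks = split_chunks(chars, tags)
dictionary = add_to_dictionary(chunks, {})
indexes = to_index_sequence(chunks, dictionary)

print(chunks)   # [('c1c2', 'O'), ('c3c4', 'PER'), ('c5c6', 'O')]
print(indexes)  # [0, 1, 2]
```

Keying the dictionary on the (text, label) pair keeps the label attached to each part, so an index sequence carries both the text and its entity annotation.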

Abstract

The invention provides a method that selects positive samples from source-domain data to extend the training data of a target domain, by fusing the semantic differences and label differences between sentences in the source and target domains, in order to improve named entity recognition performance in the target domain. Building on a conventional Bi-LSTM + CRF model, the semantic and label differences are introduced through the state representation and reward design of reinforcement learning. The trained decision network can then select, from the source-domain data, sentences that have a positive effect on target-domain named entity recognition. This expands the target domain's training data, addresses its shortage of training data, and improves its named entity recognition performance.

Description

Technical field

[0001] The invention relates to the technical field of the Internet, and in particular to a method that uses a sequence generative adversarial network to augment data and improve the performance of Chinese named entity recognition.

Background

[0002] In recent years, deep learning has made great progress in image, speech, and natural language processing. As an emerging branch of machine learning, deep learning is motivated by building neural networks that simulate the human brain for analysis and learning. In the image field, deep neural networks are used for object detection, for example by combining convolutional neural networks with candidate windows to detect pedestrians in images; in the speech field, deep learning is used for speech synthesis and recognition, providing intelligent voice systems; in the field of natural language processing, deep learning is applied to various everyday scenarios, ...

Application Information

IPC(8): G06F40/295, G06F40/216, G06F16/31, G06F16/35, G06F16/36, G06N3/04, G06N3/08
CPC: G06F40/295, G06F40/216, G06F16/316, G06F16/35, G06F16/36, G06N3/049, G06N3/084, G06N3/045
Inventor: 李思, 王蓬辉, 李明正, 孙忆南
Owner: BEIJING UNIV OF POSTS & TELECOMM