A Chinese Word Segmentation Method Using Word Context-Based Embedding and Neural Networks

A neural network and word-context technology, classified under biological neural network models, semantic analysis, electrical digital data processing, and related fields, that addresses the problem of segmentation models not making full use of word information

Active Publication Date: 2019-06-04
NANJING UNIV

AI Technical Summary

Problems solved by technology

[0007] Purpose of the invention: existing character-tagging-based models for Chinese word segmentation cannot make full use of word information. To address this shortcoming, the present invention proposes a method for learning character embeddings based on word context, which incorporates word-level information indirectly and thereby improves the accuracy of the Chinese word segmentation task.



Examples


Embodiment 1

[0145] First, the annotated data used in this embodiment is the Penn Chinese Treebank CTB 6.0, with 23,401 sentences in the training set, 2,078 sentences in the development set, and 2,795 sentences in the test set. The automatically segmented data comes from Chinese Gigaword (LDC2011T13), 41,071,242 sentences in total.

[0146] In this embodiment, the complete process of the Chinese word segmentation method of the present invention, which uses word-context-based character embedding and a neural network, is as follows:

[0147] Step 1-1: determine the tag set of the character tagging model, defining four tag types B, M, E and S, whose specific meanings are given in step 1-1 of the specification;
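As an illustration of the four-tag scheme in step 1-1, the following minimal Python sketch converts a segmented sentence into character tags and back. It assumes the conventional reading of the tags (B = first character of a multi-character word, M = middle character, E = last character, S = single-character word); the function names `words_to_tags` and `tags_to_words` and the sample sentence are hypothetical and do not come from the patent.

```python
def words_to_tags(words):
    """Map a list of words to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings defensively
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word list from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    words = ["南京", "大学", "很", "有名"]
    tags = words_to_tags(words)
    print(tags)
    print(tags_to_words("".join(words), tags))
```

Under this scheme, segmenting a sentence reduces to assigning one of four tags to every character, which is what lets a sequence tagging model perform segmentation.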

[0148] Step 1-2: train on the automatically segmented Chinese Gigaword data to obtain the character embedding matrix e_uni and the two-character (bigram) embedding matrix e_bi;
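The training objective for these embeddings is not given in this excerpt. The sketch below only illustrates one plausible way to derive training pairs from automatically segmented text, following the abstract's description of learning character embeddings from word context and position tags. Everything in it is an assumption for illustration: the function name `context_pairs`, the `window` parameter, and the `character/tag` target format are hypothetical.

```python
def position_tags(word):
    """B/M/E/S tags for the characters of one word (conventional reading)."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def context_pairs(words, window=1):
    """Pair every tagged character with the words surrounding its own word.

    Assumed setup: each target is "character/tag", and each context item is a
    neighbouring word within `window` words of the word containing the target.
    """
    pairs = []
    for i, word in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        context = [words[j] for j in range(lo, hi) if j != i]
        for ch, tag in zip(word, position_tags(word)):
            for ctx in context:
                pairs.append((f"{ch}/{tag}", ctx))
    return pairs


if __name__ == "__main__":
    for target, ctx in context_pairs(["我们", "爱", "北京"]):
        print(target, ctx)
```

Pairs of this kind could be fed to a skip-gram style trainer so that a character's vector reflects the words it tends to appear next to, which is one way word-level information can reach a character-based model indirectly.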

[0149] Step 2-1: read a Chinese sentence, "you come right now" (given here in translation), and compute the score of each tag at each position:

[0150]...
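The scoring formula and the decoding step are elided in this excerpt. As a hedged illustration of how per-position tag scores can be turned into a valid segmentation, the sketch below applies Viterbi decoding, a commonly used decoder for B/M/E/S tagging, with the usual transition constraints (for example, B may only be followed by M or E). It is not confirmed as the patent's procedure; the `viterbi` function, the `ALLOWED` table, and the score values are all hypothetical.

```python
import math

TAGS = ["B", "M", "E", "S"]
# Tags that may legally follow each tag in a B/M/E/S sequence.
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}


def viterbi(scores):
    """Return the highest-scoring valid tag path.

    scores: one dict per character, mapping each tag to a float score.
    A valid path starts with B or S and ends with E or S.
    """
    n = len(scores)
    if n == 0:
        return []
    neg_inf = -math.inf
    best = [{t: neg_inf for t in TAGS} for _ in range(n)]
    back = [{t: None for t in TAGS} for _ in range(n)]
    for t in ("B", "S"):
        best[0][t] = scores[0][t]
    for i in range(1, n):
        for prev in TAGS:
            if best[i - 1][prev] == neg_inf:
                continue  # prev is unreachable at position i - 1
            for t in ALLOWED[prev]:
                cand = best[i - 1][prev] + scores[i][t]
                if cand > best[i][t]:
                    best[i][t] = cand
                    back[i][t] = prev
    last = max(("E", "S"), key=lambda t: best[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


if __name__ == "__main__":
    toy_scores = [
        {"B": 2.0, "M": 0.0, "E": 0.0, "S": 1.0},
        {"B": 0.0, "M": 0.5, "E": 2.0, "S": 0.1},
        {"B": 1.0, "M": 3.0, "E": 0.2, "S": 1.5},
    ]
    print(viterbi(toy_scores))
```

In the toy scores, picking the best tag independently at each position would end the sentence on M, which is not a valid segmentation; constrained decoding avoids that.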

Embodiment 2

[0158] The algorithms of the present invention are all implemented in C++. The machine used for the experiments in this embodiment has an Intel(R) Core(TM) i7-4790K processor with a 4.0 GHz clock frequency and 24 GB of memory. The annotated data used in this embodiment is the Penn Chinese Treebank CTB 6.0, with 23,401 sentences in the training set, 2,078 sentences in the development set, and 2,795 sentences in the test set. The automatically segmented data comes from Chinese Gigaword (LDC2011T13), 41,071,242 sentences in total. The model parameters are trained on the Gigaword data and the CTB 6.0 data. The experimental results are shown in Table 1:

[0159] Table 1 Description of the experimental results

[0160]

[0161]

[0162] Among them, Xu and Sun (2016) used a word segmentation model based on a dependency recurrent neural network, Liu (2016) used a word segmentation model with segmentation representations, Zhang (2016) was a tra...


Abstract

The invention proposes a Chinese word segmentation method that uses word-context-based character embedding and a neural network. Character embeddings are learned on large-scale automatically segmented data and then used as the input to a neural segmentation model, which helps the model learn more effectively. The method comprises two concrete steps: learning character embeddings on large-scale automatically segmented data according to word context and word-position tags; and using those character embeddings as the input of the neural segmentation model so that segmentation performance improves. Compared with other neural Chinese word segmentation methods, this method uses character embeddings based on word context, so that word information is effectively integrated into the segmentation model and the accuracy of the word segmentation task improves.

Description

Technical field

[0001] The invention relates to a method for Chinese word segmentation by computer, and in particular to a method for automatic Chinese word segmentation that combines word-context-based character embedding with a neural network.

Background technique

[0002] Chinese word segmentation is a basic task of natural language processing. Its wide range of applications has attracted a large amount of research and driven the rapid development of related techniques. Unlike Western languages, Chinese has no explicit spaces between the words of a sentence. The smallest unit in most natural language processing tasks is the word, so for Chinese the first problem is to identify the sequence of words. Current approaches to Chinese word segmentation fall roughly into two categories: rule-based methods and statistical methods. The dictionary-based rule method requires the constr...
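To make the dictionary-based category concrete, the sketch below shows forward maximum matching, a classic dictionary-driven segmenter. It is offered only as an illustration of that family of methods and is not taken from the patent; the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are hypothetical.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation using the longest dictionary match.

    At each position, try the longest candidate first and shrink it until it
    is found in the dictionary; a single character is always accepted.
    """
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


if __name__ == "__main__":
    toy_dictionary = {"研究", "研究生", "生命", "起源"}
    print(forward_max_match("研究生命起源", toy_dictionary))
```

With this toy dictionary the greedy match takes 研究生 first and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. Ambiguities of this kind, together with the cost of building and maintaining the dictionary, are typical limitations of purely dictionary-based segmentation.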


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/27, G06N3/02
CPC: G06F40/289, G06F40/30, G06N3/02
Inventor: 戴新宇, 郁振庭, 陈家骏, 黄书剑, 张建兵
Owner: NANJING UNIV