A Chinese word segmentation method with feature alignment

A feature-aligned Chinese word segmentation technology, applied in special data processing applications, instruments, electrical digital data processing and similar fields, that addresses problems such as parameter growth.

Active Publication Date: 2019-03-15
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

Sequence labeling based on conditional random fields (CRFs) is a commonly used approach to Chinese word segmentation. With enough context features it achieves good results, but because of the nature of CRFs the number of parameters grows exponentially, so it is unwise to train a CRF directly on many features.



Examples


Embodiment 1

[0056] To further elaborate the scheme of the present invention, the labeled and unlabeled data of the PKU corpus from the commonly used Chinese word segmentation benchmark SIGHAN-2005 are taken as an example. Refer to Figure 1, a flow chart of the feature-aligned Chinese word segmentation method provided in this embodiment:

[0057] Step 1: Extract the bigrams formed by adjacent characters in the labeled data and the unlabeled data of the PKU corpus, and count how many times each bigram appears in the text. A bigram is removed if it occurs only once or if it contains a punctuation mark. The bigrams that remain, from both the labeled and the unlabeled data, are the ones used to build the model.
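The filtering in step 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function names `extract_bigrams` and `is_punctuation`, the `min_count` parameter, and the sample sentences are hypothetical, and punctuation is detected through Unicode general categories, which the source does not specify.

```python
from collections import Counter
import unicodedata


def is_punctuation(ch):
    # Assumption: any character whose Unicode category starts with "P"
    # (ASCII and full-width CJK punctuation alike) counts as punctuation.
    return unicodedata.category(ch).startswith("P")


def extract_bigrams(sentences, min_count=2):
    """Count adjacent-character bigrams and apply the two filters of step 1."""
    counts = Counter()
    for sentence in sentences:
        # Drop whitespace so that word delimiters in labeled data do not
        # become part of a bigram.
        chars = [c for c in sentence if not c.isspace()]
        for first, second in zip(chars, chars[1:]):
            counts[first + second] += 1
    return {
        bigram: n
        for bigram, n in counts.items()
        # Remove bigrams seen only once and bigrams containing punctuation.
        if n >= min_count and not any(is_punctuation(c) for c in bigram)
    }


# Hypothetical sample text, not taken from the PKU corpus.
sample = ["我爱北京，我爱中国。", "北京很大"]
kept = extract_bigrams(sample)
```

With the sample above, only the bigrams that occur at least twice and contain no punctuation survive; the same function would be run separately on the labeled and the unlabeled data.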

[0058] Step 2: Extract the following 19 features from the bigrams of the labeled data and unlabeled data obtained in step 1: count the number of times the current bigram appears in ...


Abstract

A feature-aligned Chinese word segmentation method comprises: 101, extracting features of bigrams from labeled data and unlabeled data; 102, aligning the features of the labeled data with those of the unlabeled data by the earth mover's distance (hereinafter EMD) method; 103, training the classifier xgboost on the aligned features of the labeled data, thereby predicting the probability that each bigram in the unlabeled data forms a word; 104, extracting a portion of the bigrams from the classifier's results, integrating them with the bigrams of the labeled data from step 101 as features of a conditional random field, and performing training; 105, sequence-labeling and segmenting the unlabeled data with the established model. The main features of the invention are that the labeled and unlabeled data are feature-aligned by EMD, the word-formation probability of bigrams is predicted by a learned classifier, and a new word segmenter is then formed by integrating the conditional random field in a stacking manner.
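This excerpt does not spell out how EMD is applied to align the features, so the following is only a minimal sketch of the distance itself for a single one-dimensional feature. It could, for example, be used to gauge how differently one bigram feature is distributed in the labeled and the unlabeled data. The function name `emd_1d` and the sample values are hypothetical; each sample is treated as an empirical distribution with equal weight per value.

```python
def emd_1d(xs, ys):
    """Earth mover's distance between two 1-D empirical distributions.

    In one dimension the EMD equals the area between the two empirical
    cumulative distribution functions, which is what the loop integrates.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Advance past every sample value <= left to get each CDF at `left`.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        cdf_x = i / len(xs)
        cdf_y = j / len(ys)
        # Both CDFs are constant on [left, right), so the area is a rectangle.
        distance += abs(cdf_x - cdf_y) * (right - left)
    return distance


# Hypothetical values of one feature in labeled vs. unlabeled data.
labeled_feature = [0.0, 1.0, 3.0]
unlabeled_feature = [5.0, 6.0, 8.0]
shift = emd_1d(labeled_feature, unlabeled_feature)
```

Here the second sample is the first one shifted by 5, so the distance comes out as 5. A value near zero would indicate that the feature is distributed similarly in both data sets.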

Description

Technical field

[0001] The invention belongs to the field of natural language processing, and in particular relates to a feature-aligned Chinese word segmentation method.

Background

[0002] As the most basic unit of language, words play a very important role in text analysis tasks. As an indispensable part of natural language processing, Chinese word segmentation has developed tremendously in recent years and is widely used in Chinese natural language processing tasks such as information retrieval, knowledge extraction and question answering. Because of the high cost of labeling data, the evolution of word usage, and the differing needs of different scenarios, existing Chinese word segmentation methods still have problems in practical tasks. Although they achieve high accuracy on regular text, segmenting low-frequency words remains a challenge in many cases. For example, in the absence of a predefined dictionary, the regional word "Ga...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06K9/62
CPC: G06F40/242, G06F40/289, G06F18/2148, G06F18/24
Inventor: 李智星冯开来沈柯任诗雅王化明李苑孙柱袁龙
Owner: CHONGQING UNIV OF POSTS & TELECOMM