A feature-aligned Chinese word segmentation method

A feature-aligned Chinese word segmentation technique, applied in fields such as instruments, computing, and electrical digital data processing. It addresses the problem of parameter growth, avoids over-fitting, and alleviates the difference in feature distribution.

Active Publication Date: 2022-07-01
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

Sequence labeling based on conditional random fields (CRFs) is also a commonly used approach to Chinese word segmentation. It achieves good results when given enough context features, but because of the nature of CRFs the number of parameters grows exponentially, so it is unwise to train a CRF directly on many features.



Examples


Embodiment 1

[0055] To further illustrate the solution of the present invention, the technical solution is described in detail using the labeled and unlabeled data of the PKU corpus from SIGHAN-2005, a commonly used Chinese word segmentation benchmark, as an example. Figure 1 is a flowchart of the feature-aligned Chinese word segmentation method provided in this embodiment:

[0056] Step 1: Extract the bigrams formed by adjacent characters from the labeled data and the unlabeled data of PKU respectively, and count how many times each bigram appears in the text. A bigram that occurs only once is removed, and a bigram that contains a punctuation mark is also removed. The remaining bigrams form the labeled and unlabeled data used to build the model.
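The filtering in Step 1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names `extract_bigrams` and `is_punctuation`, the `min_count` parameter, and the choice to treat a bigram as two adjacent characters (with whitespace ignored) are all assumptions made for the example.

```python
from collections import Counter
import unicodedata


def is_punctuation(ch):
    # Assumption: any Unicode category starting with "P" counts as punctuation,
    # which covers both ASCII and full-width CJK punctuation marks.
    return unicodedata.category(ch).startswith("P")


def extract_bigrams(sentences, min_count=2):
    """Count adjacent-character bigrams and drop rare or punctuated ones.

    `sentences` is an iterable of strings. Whitespace (for example the word
    separators in labeled data) is ignored when forming bigrams.
    """
    counts = Counter()
    for sentence in sentences:
        chars = [c for c in sentence if not c.isspace()]
        for left, right in zip(chars, chars[1:]):
            counts[left + right] += 1
    # Keep bigrams seen at least `min_count` times that contain no punctuation.
    return {
        bigram: n
        for bigram, n in counts.items()
        if n >= min_count and not any(is_punctuation(c) for c in bigram)
    }


if __name__ == "__main__":
    corpus = ["我爱北京。", "北京很大。"]
    print(extract_bigrams(corpus))
```

With `min_count=2`, a bigram that occurs exactly once is discarded, matching the rule in the paragraph above. The same function would be run once on the labeled text and once on the unlabeled text.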

[0057] Step 2: Extract the following 19 features for the bigrams of the labeled data and unlabeled data from step 1: count the number of times the current bigram appears in the document; calculate the multipli...



Abstract

The present invention claims a feature-aligned Chinese word segmentation method, which includes: 101, extracting features of bigrams from labeled data and unlabeled data; 102, aligning the features of the labeled data with those of the unlabeled data using the Earth Mover's Distance (hereinafter EMD); 103, training the classifier xgboost on the feature-aligned labeled data in order to predict the probability that a bigram in the unlabeled data forms a word; 104, selecting some bigrams from the classifier's results, integrating them with the bigrams of the labeled data from step 101, and using them as features to train a conditional random field; 105, sequence-labeling and segmenting the unlabeled data with the resulting model. The invention mainly performs feature alignment on labeled and unlabeled data through EMD, predicts the word-formation probability of bigrams through classifier learning, and then integrates a conditional random field in a stacking manner to form a new tokenizer.
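To make the distance used in step 102 concrete, the sketch below computes the one-dimensional Earth Mover's Distance between two samples of a single feature, for example the values that feature takes on labeled versus unlabeled bigrams. It only illustrates the distance itself. How the method uses EMD to align the features is not spelled out in this summary, and the function name `emd_1d` and the sample values are assumptions made for the example.

```python
def emd_1d(xs, ys):
    """Earth Mover's Distance between two 1-D empirical distributions.

    For one-dimensional data the EMD equals the area between the two
    empirical cumulative distribution functions (CDFs).
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Advance past every sample value <= left to get each CDF at `left`.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        cdf_gap = abs(i / len(xs) - j / len(ys))
        distance += cdf_gap * (right - left)
    return distance


if __name__ == "__main__":
    # Hypothetical values of one feature on labeled and unlabeled bigrams.
    labeled_feature = [0.0, 0.0]
    unlabeled_feature = [1.0, 3.0]
    print(emd_1d(labeled_feature, unlabeled_feature))
```

A distance of zero means the feature is distributed identically in both sets; a larger value indicates a larger distribution gap of the kind the alignment step is meant to reduce.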

Description

Technical field

[0001] The invention belongs to the field of natural language processing, and in particular relates to a feature-aligned Chinese word segmentation method.

Background art

[0002] As the most basic unit of language, words play an important role in text analysis tasks. As an indispensable part of natural language processing, Chinese word segmentation has made great progress in recent years and is widely used in Chinese natural language processing tasks such as information retrieval, knowledge extraction, and question answering. Because of the high cost of labeling data, the evolution of word usage, and the differing needs of different scenarios, existing Chinese word segmentation methods still have problems in practical tasks. Although they achieve high accuracy on regular text, segmentation of low-frequency words remains a challenge in many cases. For example, in the absence of a predefined dictionary, the local word "Gaotang" would be spli...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/289, G06F40/242, G06K9/62
CPC: G06F40/242, G06F40/289, G06F18/2148, G06F18/24
Inventor: 李智星冯开来沈柯任诗雅王化明李苑孙柱袁龙
Owner: CHONGQING UNIV OF POSTS & TELECOMM