Text word segmentation method and text word segmentation device

A text word segmentation method based on Chinese word segmentation technology, applicable to instruments, digital data processing, computing, and similar fields. It addresses the high time cost of existing methods, which need to label a large number of long texts.

Pending Publication Date: 2020-09-22
BEIJING DIDI INFINITY TECH & DEV

AI Technical Summary

Problems solved by technology

[0003] At present, Chinese word segmentation typically adopts statistics-based methods, represented by the Hidden Markov Model (HMM), and uses a dynamic programming algorithm to label the word sequence of the text to be segmented. However, with massive data, these methods need to label a large number of long texts, and the time cost is high.



Examples


Embodiment 1

[0087] As shown in Figure 1, the flowchart of the text word segmentation method provided in Embodiment 1 of the present application includes the following steps:

[0088] S101: Acquire the Chinese text to be processed.

[0089] In specific implementation, the Chinese text to be segmented can be obtained first.

[0090] It should be noted that in most Chinese natural language processing scenarios, words are the smallest basic unit of study. However, Chinese is written as a sequence of characters, and there are no markers such as spaces to indicate the boundaries between words, so word segmentation is the foundation of Chinese text processing. The quality of word segmentation is critical to all subsequent Chinese information processing.

[0091] S102: Divide the Chinese text into a plurality of Chinese short texts, wherein each Chinese short text includes a plurality of consecutive Chinese characters representing a semantic meaning.
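
Step S102 can be illustrated with a minimal sketch. It assumes that a "Chinese short text" is a maximal run of consecutive Chinese characters and that any non-Chinese character (punctuation, Latin letters, digits) acts as a delimiter; the source does not state the exact splitting rule. The function name `split_into_short_texts` and the Unicode range used to approximate "Chinese characters" are illustrative choices, not part of the application.

```python
import re
from typing import List

# Assumption: "Chinese characters" are approximated by the
# CJK Unified Ideographs block (U+4E00..U+9FFF).
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_into_short_texts(text: str) -> List[str]:
    """Return every maximal run of consecutive Chinese characters, in order.

    Non-Chinese characters are dropped, which both shortens each unit
    and filters out non-Chinese noise.
    """
    return CHINESE_RUN.findall(text)


if __name__ == "__main__":
    sample = "今天天气很好，我们去公园散步吧! OK?"
    print(split_into_short_texts(sample))
    # → ['今天天气很好', '我们去公园散步吧']
```

Splitting on a regular expression keeps this step linear in the length of the text, so it adds little cost ahead of the segmentation model.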

[0...

Embodiment approach 1

[0096] Embodiment approach 1: Input the multiple Chinese short texts into a pre-trained Chinese word segmentation model to obtain the segmented Chinese short texts, and then stitch all the segmented short texts together to output the segmented Chinese text.
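
The segment-then-stitch flow of this approach might look like the sketch below. The pre-trained model is represented by a hypothetical callable (`toy_model`, a lookup table of hand-written results) because the source does not specify the model; only the loop and the stitching step are the point here.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in for a pre-trained segmentation model:
# a lookup table of hand-written segmentations.
_TOY_RESULTS = {
    "今天天气很好": ["今天", "天气", "很", "好"],
    "我们去公园": ["我们", "去", "公园"],
}


def toy_model(short_text: str) -> List[str]:
    """Return a stored segmentation, or fall back to single characters."""
    return _TOY_RESULTS.get(short_text, list(short_text))


def segment_then_stitch(
    short_texts: Iterable[str],
    segment: Callable[[str], List[str]],
) -> List[str]:
    """Segment each short text in turn, then concatenate the word lists."""
    words: List[str] = []
    for short_text in short_texts:
        words.extend(segment(short_text))
    return words


if __name__ == "__main__":
    result = segment_then_stitch(["今天天气很好", "我们去公园"], toy_model)
    print(" ".join(result))
    # → 今天 天气 很 好 我们 去 公园
```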

Embodiment approach 2

[0097] Embodiment approach 2: All the Chinese short texts can be input in parallel into the pre-trained Chinese word segmentation model to output the segmented Chinese text.
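
One possible reading of "input in parallel" is to dispatch the short texts to a worker pool and then join the results in their original order. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the source does not say whether parallelism means threads, processes, or batched model inference, so this is an assumption. The function name `segment_in_parallel` and the `max_workers` default are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def segment_in_parallel(
    short_texts: Sequence[str],
    segment: Callable[[str], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Segment all short texts concurrently and return one flat word list.

    Executor.map yields results in input order, so the stitched output
    keeps the order of the original text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_text = list(pool.map(segment, short_texts))
    return [word for words in per_text for word in words]


if __name__ == "__main__":
    # Stand-in model: treat every character as its own word.
    print(segment_in_parallel(["今天", "天气"], list))
    # → ['今', '天', '天', '气']
```

Threads help mainly when the model call releases the interpreter lock (for example, a remote service or a native library); for a pure-Python model, a process pool or batched inference would be the more likely choice.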

[0098] Here, the Chinese word segmentation model is trained before the Chinese text is segmented, so it can be used directly for word segmentation. The Chinese word segmentation model can be a model based on string matching, a model based on understanding, a model based on statistics, and so on.
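
As an illustration of the string-matching family mentioned above, the following sketch implements forward maximum matching, a well-known dictionary-based technique. It is not claimed to be the model used in the application; the vocabulary, the `max_len` window, and the function name `forward_max_match` are all hypothetical.

```python
from typing import List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation using the longest dictionary match.

    At each position, try the longest candidate first and shrink it until
    it is found in the vocabulary; a single character is always accepted.
    """
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    toy_vocab = {"今天", "天气", "很好", "公园"}  # hypothetical dictionary
    print(forward_max_match("今天天气很好", toy_vocab))
    # → ['今天', '天气', '很好']
```

Because the matching window is bounded by `max_len`, shorter inputs mean fewer candidate lookups per call, which is consistent with the motivation for splitting long texts first.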

[0099] In the embodiment of the present application, the Chinese text to be processed is acquired and divided into multiple Chinese short texts, where each Chinese short text includes multiple consecutive Chinese characters representing a semantic meaning. This not only reduces the length of the Chinese text but also filters out the interference of non-Chinese characters. Further, based on the segmented...


Abstract

The invention relates to the technical field of Chinese text processing, and in particular to a text word segmentation method and a text word segmentation device. The method comprises obtaining a Chinese text to be processed and dividing it into a plurality of Chinese short texts, wherein each Chinese short text comprises a plurality of consecutive Chinese characters representing a semantic meaning. This reduces the length of the Chinese text and filters out the interference of non-Chinese characters. The segmented Chinese text is then output on the basis of the multiple Chinese short texts and a pre-trained Chinese word segmentation model, which improves the word segmentation efficiency for Chinese text.

Description

Technical field

[0001] This application relates to the technical field of Chinese text processing, and in particular to a word segmentation method and a word segmentation device for text.

Background

[0002] In most Chinese natural language processing scenarios, words are the smallest basic unit of study. However, Chinese is written as a sequence of characters, and there are no markers such as spaces between words. Word segmentation is therefore the foundation of Chinese text processing, and its quality is critical to all subsequent Chinese information processing.

[0003] At present, Chinese word segmentation typically uses statistics-based methods, represented by the Hidden Markov Model (HMM), and uses dynamic programming algorithms to label the word sequence of the text to be segmented. However, with massive data, these methods need to annotate a large amount of long text, and the time cost i...


Application Information

IPC(8): G06F40/289
CPC: Y02D10/00
Inventors: 陈坦访, 王伟玮, 李奘
Owner: BEIJING DIDI INFINITY TECH & DEV