LSTM-based word segmentation method

A word segmentation method in the field of data technology, applicable to neural learning methods, special data processing applications, instruments, and similar areas. It addresses the problems of shallow networks (few layers), low recognition accuracy, and low recognition rate, with the aim of improving word segmentation accuracy, recognition rate, and overall recognition effect.

Inactive Publication Date: 2018-03-27
北京知道未来信息技术有限公司

AI Technical Summary

Problems solved by technology

Knowledge-based artificial neural network models use only a small number of network layers in practice because of the vanishing gradient problem during training, so the final word segmentation result shows no obvious advantage.
[0007] Dictionary-based word segmentation relies heavily on the dictionary, is relatively inefficient, and cannot identify unregistered words. In the present invention, registered words are words that appear in the corpus vocabulary, and unregistered words are words that do not appear in the corpus vocabulary.
[0008] Word segmentation based on word-frequency statistics (such as N-Gram) can only use the semantics of the N-1 words preceding the current word, so its recognition accuracy is not high enough, and it becomes very inefficient as N increases. Its recognition rate for unregistered words is also low.
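
To make the unregistered-word limitation concrete, here is a minimal sketch of forward maximum matching, one common dictionary-based technique. It is not taken from the patent; the function name `forward_max_match`, the `max_len` parameter, and the toy vocabulary are assumptions chosen for illustration.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy longest-match segmentation against a fixed vocabulary.

    Illustrative only: the vocabulary and max_len are hypothetical.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the vocabulary hits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


vocab = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}

# Every word is registered, so the segmentation is clean.
registered = forward_max_match("我们喜欢自然语言处理", vocab)

# "区块链" is not in the vocabulary, so it falls apart into single characters.
unregistered = forward_max_match("我们喜欢区块链", vocab)
```

In the second call the unregistered word is split character by character, which is the failure mode paragraph [0007] describes.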


Embodiment Construction

[0043] In order to make the above-mentioned features and advantages of the present invention more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.

[0044] The flow of the present invention is shown in Figure 1. The implementation can be divided into two stages: 1) a training stage and 2) a prediction stage.

[0045] (1) Training phase:

[0046] Step 1: If there are multiple word segmentation corpora, integrate them into a single training corpus OrgData, formatted so that each word segmentation result occupies one line. Then convert the original training corpus into character-level corpus data. Specifically, following the BMES (Begin, Middle, End, Single) marking scheme, split the original training corpus into characters and label them to obtain New_Data. Suppose the label corresponding to a word is Label; then the character at the beginning of the word is marked as LabelB, the chara...
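
The character-level conversion in Step 1 can be sketched as follows. The source text is truncated after the LabelB rule, so the handling of the M, E, and S suffixes here follows the usual BMES convention and is an assumption, as are the helper names `to_bmes` and `from_bmes` and the optional `label` prefix.

```python
def to_bmes(words, label=""):
    """Convert a list of segmented words into (character, tag) pairs.

    Tags are label + B/M/E/S. The suffix rules follow the common BMES
    convention; the patent's exact rules are truncated in the source.
    """
    pairs = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            pairs.append((word, label + "S"))  # single-character word
            continue
        pairs.append((word[0], label + "B"))  # first character
        for ch in word[1:-1]:
            pairs.append((ch, label + "M"))  # interior characters
        pairs.append((word[-1], label + "E"))  # last character
    return pairs


def from_bmes(pairs):
    """Rebuild words from (character, tag) pairs; a word closes on E or S."""
    words, current = [], ""
    for ch, tag in pairs:
        current += ch
        if tag.endswith(("E", "S")):
            words.append(current)
            current = ""
    if current:
        words.append(current)  # tolerate a dangling B/M run at the end
    return words


segmented = ["我们", "喜欢", "自然语言", "处理", "了"]
tagged = to_bmes(segmented)
tag_string = "".join(tag for _, tag in tagged)
restored = from_bmes(tagged)
```

Decoding with `from_bmes` is the inverse step a prediction stage would need to turn per-character tags back into words.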


Abstract

The invention discloses an LSTM-based word segmentation method. The method comprises the steps of: 1) converting training corpus data into character-level corpus data; 2) splitting the corpus data into sentences and grouping the resulting sentences by sentence length to obtain a data set comprising n groups of sentences; 3) extracting multiple pieces of data from the data set to serve as iterative data; 4) converting the iterative data at each iteration into fixed-length vectors, inputting the vectors to the LSTM deep learning model, and training its parameters; when the change in the loss value between iterations is smaller than a set threshold, the loss no longer decreases, or the maximum number of iterations is reached, training stops and the trained LSTM model is obtained; and 5) converting the corpus data to be predicted into character-level corpus data and inputting it to the trained LSTM model to obtain a word segmentation result.
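
Steps 2 to 4 of the abstract can be illustrated with a small, framework-free sketch. It shows only the data handling around the model (grouping by length, fixed-length conversion, and the stopping test), not the LSTM itself. The names `group_by_length`, `to_fixed_length`, and `should_stop`, the `bucket_width` grouping rule, the `pad_id` value, and the default thresholds are all assumptions; the abstract does not specify them.

```python
from collections import defaultdict


def group_by_length(sentences, bucket_width=10):
    """Group sentences of similar length (hypothetical bucketing rule)."""
    groups = defaultdict(list)
    for sent in sentences:
        if sent:
            # Lengths 1..bucket_width go to group 0, the next range to 1, ...
            groups[(len(sent) - 1) // bucket_width].append(sent)
    return [groups[key] for key in sorted(groups)]


def to_fixed_length(char_ids, length, pad_id=0):
    """Pad with pad_id or truncate so the result has exactly `length` items."""
    return (list(char_ids) + [pad_id] * length)[:length]


def should_stop(losses, threshold=1e-4, max_iters=1000):
    """Apply the three stopping conditions named in the abstract.

    `losses` holds one loss value per completed iteration.
    """
    if len(losses) >= max_iters:
        return True  # maximum number of iterations reached
    if len(losses) < 2:
        return False  # not enough history to measure a change
    decrease = losses[-2] - losses[-1]
    # Covers both "change smaller than threshold" and "no longer reduced"
    # (a zero or negative decrease is also below a positive threshold).
    return decrease < threshold


buckets = group_by_length(["ab", "abc", "a" * 12], bucket_width=10)
padded = to_fixed_length([1, 2, 3], 5)
truncated = to_fixed_length([1, 2, 3, 4, 5, 6], 4)
```

Grouping by length before padding keeps the amount of padding per batch small, which is one plausible reason for the grouping step; the source does not state its motivation.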

Description

Technical field

[0001] The invention belongs to the technical field of computer software and relates to a word segmentation method based on LSTM.

Background technique

[0002] In natural language processing, Asian-language text, unlike Western text, has no natural space separators, so many Western text processing methods cannot be applied directly to Asian-language (Chinese, Korean, and Japanese) text. Such text must first go through word segmentation to be processed consistently with Western text. Word segmentation is therefore the basis of information processing for Asian-language text, and its application scenarios include:

[0003] 1. Search engine: An important function of a search engine is full-text indexing of documents. This involves segmenting the text and then building an inverted index from the document and its word segmentation results. The user will also query first when querying T...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/04, G06N3/08
CPC: G06N3/08, G06F40/211, G06F40/289, G06N3/045
Inventor: 岳永鹏, 唐华阳
Owner 北京知道未来信息技术有限公司