Word segmentation method, word segmentation device, named entity identification method and named entity identification system

A word segmentation method and related technology, classified under instruments, electrical digital data processing, and calculation. It addresses the problem of low word segmentation accuracy, with the stated effects of reducing video memory usage while ensuring and improving accuracy.

Pending Publication Date: 2020-02-04
成都数联铭品科技有限公司
Cites: 9. Cited by: 11.

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to remedy the insufficient word segmentation accuracy of the prior art by providing a word segmentation method and a word segmentation device, together with a named entity recognition method and system that use the word segmentation method, so as to improve the accuracy of word segmentation results.



Examples


Embodiment 1

[0037] As shown in Figure 1, this embodiment provides a word segmentation method that includes the following steps:

[0038] Step 1 (S1 in the figure), build a dictionary.

[0039] The word segmentation method provided in this embodiment is especially suitable for segmenting Chinese sentences, so a Chinese word segmenter is constructed here. Jieba is a frequently used Chinese word segmentation tool. Here, the jieba dictionary is used directly as the dictionary of the word segmenter, with uncommon words deleted and correct, commonly used words kept as far as possible, in order to reduce the size of the word segmenter. For simplicity, the jieba dictionary can of course also be used as the segmenter's dictionary without any processing.
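The pruning described in step 1 can be sketched as follows. This is a minimal illustration, not the patent's procedure: it assumes dictionary entries in jieba's plain-text `word frequency [tag]` line format, and it assumes that "not commonly used" can be approximated by a frequency cutoff. The function name `load_pruned_dictionary`, the `min_freq` parameter, and the sample frequencies are all hypothetical.

```python
from typing import Dict, Iterable


def load_pruned_dictionary(lines: Iterable[str], min_freq: int = 5) -> Dict[str, int]:
    """Parse 'word freq [tag]' lines and keep words with freq >= min_freq.

    The frequency cutoff is an assumed stand-in for "delete uncommon words".
    """
    kept: Dict[str, int] = {}
    for raw in lines:
        parts = raw.split()
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        word, freq = parts[0], int(parts[1])
        if freq >= min_freq:
            kept[word] = freq
    return kept


# Made-up entries in the assumed format; the counts are illustrative only.
sample = [
    "公司 5000 n",
    "餐饮 800 n",
    "是 90000 v",
    "罕见词 2 n",
]
dictionary = load_pruned_dictionary(sample, min_freq=5)
print(sorted(dictionary))
```

Passing `min_freq=0` keeps every well-formed entry, which corresponds to using the dictionary without any processing.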

[0040] Step 2: based on the dictionary built in step 1, generate a prefix tree (trie) for the sentence to be segmented, realize efficient word graph s...
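A prefix tree and the word graph scan it enables might look like the sketch below. It is a generic illustration under stated assumptions: the `TrieNode`, `build_trie`, and `build_dag` names are hypothetical, the toy word list and sentence are invented for the example, and the fallback that lets a character with no dictionary match stand alone is an assumption of this sketch.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of a character-level prefix tree."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def build_dag(sentence: str, root: TrieNode) -> Dict[int, List[int]]:
    """Map each start index to the end indices of dictionary words starting there."""
    dag: Dict[int, List[int]] = {}
    for start in range(len(sentence)):
        ends: List[int] = []
        node = root
        for end in range(start, len(sentence)):
            node = node.children.get(sentence[end])
            if node is None:
                break  # no dictionary word continues with this character
            if node.is_word:
                ends.append(end)
        # Assumed fallback: a character with no match forms a one-character edge.
        dag[start] = ends or [start]
    return dag


toy_words = ["是", "一", "家", "一家", "餐饮", "公司"]  # hypothetical dictionary
dag = build_dag("是一家餐饮公司", build_trie(toy_words))
print(dag)
```

Each entry `dag[i]` lists where a word starting at position `i` may end, so position 1 ("一") has two outgoing edges, one for "一" and one for "一家". Every edge points forward in the sentence, which is why the graph is directed and acyclic.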

Embodiment 2

[0057] Referring to Figure 3, this embodiment provides a named entity recognition method that uses the word segmentation method described in Embodiment 1. Specifically, the named entity recognition method includes the following steps:

[0058] Step 10: segment the sentence to be recognized using the method described in Embodiment 1, obtaining the words that make up the sentence; these words form a word sequence. Taking again the sentence "Xiang'eqing is a catering company" as the sentence to be recognized, in which "Xiang'eqing" is an unregistered word, the result obtained with the method described in Embodiment 1 is: Xiang|E|Qing|is|One|Home|Dining|Company.

[0059] Step 20: input the word sequence obtained from segmentation into a pre-trained, word-sequence-based NER model and output the recognition result, that is, the named entities identified in the sentence.

[0060] Among th...


Abstract

The invention relates to a word segmentation method, a word segmentation device, a named entity recognition method and a named entity recognition system. The word segmentation method comprises the following steps: constructing a dictionary; generating a prefix tree for the sentence to be segmented based on the dictionary and performing a word graph scan to generate a directed acyclic graph of all possible word formations; searching for the maximum-probability path by dynamic programming to find the most probable segmentation based on word frequency; and, for unregistered words in the sentence that do not exist in the dictionary, segmenting them character by character into a plurality of single characters. Because unregistered words are handled separately and split into single characters rather than into multi-character words, an unregistered name is prevented from being recombined with the preceding and following words after segmentation, which improves the recognition accuracy for unregistered names.
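The maximum-probability search and the character-level handling of unregistered words can be sketched together as below. This is an illustration under several assumptions, not the patented implementation: the function names and the toy frequency table are hypothetical, the word graph is built by plain substring lookup instead of a prefix tree to keep the block short, a character missing from the dictionary is given an assumed count of 1, and the input sentence is an assumed back-translation of the "Xiang'eqing is a catering company" example from the embodiments.

```python
import math
from typing import Dict, List, Tuple


def build_dag(sentence: str, freq: Dict[str, int]) -> Dict[int, List[int]]:
    """Word graph by substring lookup; unmatched characters get a one-character edge."""
    max_len = max(map(len, freq))
    dag: Dict[int, List[int]] = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, min(len(sentence), i + max_len))
                if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]
    return dag


def segment(sentence: str, freq: Dict[str, int]) -> List[str]:
    """Pick the path through the word graph with the highest total log probability."""
    if not sentence:
        return []
    n = len(sentence)
    dag = build_dag(sentence, freq)
    log_total = math.log(sum(freq.values()))
    # best[i] = (best log probability of sentence[i:], end index of its first word)
    best: Dict[int, Tuple[float, int]] = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):  # right-to-left dynamic programming
        best[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:  # follow the stored choices from left to right
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


# Hypothetical frequencies; "湘鄂情" is deliberately absent (unregistered).
toy_freq = {"是": 900, "一": 500, "家": 300, "一家": 200, "餐饮": 50, "公司": 400}
result = segment("湘鄂情是一家餐饮公司", toy_freq)
print("|".join(result))  # 湘|鄂|情|是|一家|餐饮|公司
```

The three characters of the unregistered name come out as separate tokens because no dictionary word covers them and nothing recombines them afterwards. With this toy table "一家" is kept as one word, since its single edge scores higher than the product of "一" and "家"; a different dictionary could split it.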

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a word segmentation method, a word segmentation device, and a named entity recognition method and system.

Background art

[0002] Natural language processing (NLP) is an important direction in computer science and artificial intelligence, and usually includes branches such as sentence classification, information extraction, automatic summarization, and entity recognition.

[0003] Word segmentation, which underlies natural language processing technology, is the process of segmenting a continuous sequence of characters into a sequence of words according to certain specifications. English separates words with spaces when written, so it can be split into words directly on spaces. Chinese text can usually be divided into characters, sentences and paragraphs by explicit delimiters, but it has no formal separator between words. Therefore, Chinese word...


Application Information

IPC(8): G06F40/295, G06F40/242, G06F40/216
Inventor 张发展刘世林罗镇权李焕曾途尹康杨李伟吴桐
Owner 成都数联铭品科技有限公司