
Natural language lexical analysis method, device and analyzer training method

A natural language lexical analysis technique, applied to lexical analysis methods, lexical analysis devices and lexical analyzer training. It addresses the problems of unknown words going unrecognized or being tagged blindly, and of being unable to learn how new words are formed from characters, thereby avoiding interfering information and improving accuracy.

Status: Inactive; Publication date: 2012-09-19
FUJITSU LTD
Cites: 4; Cited by: 23

AI Technical Summary

Problems solved by technology

Although this method can segment and part-of-speech tag known words, it has no principled way to tag the unknown words it recognizes, so that tagging remains largely blind.
In addition, the unknown words it marks are limited to words that occur with low frequency in the training corpus. For proper nouns that occur with high frequency in the training corpus, the method cannot learn the rules by which new words are formed from characters.
For example, if "carbon dioxide" and "ferric chloride" appear frequently in the training corpus, these two words are not treated as unknown words, so when "ferric oxide" appears in the test corpus, the method cannot recognize it.
Furthermore, the part of speech of the unknown words marked by this method cannot be determined.



Examples


First embodiment

[0025] According to the first embodiment of the present invention, a natural language lexical analysis method is proposed. Figure 1 shows a schematic flowchart of the method.

[0026] As shown in Figure 1, in step S110 the input natural language sentence is segmented into a plurality of sequences, each composed of words that may be words of the first type and/or characters that may be components of words of the second type. Chinese is used here as the example. Note that the embodiments of the present invention use Chinese only for illustration, and the invention is not limited to it; those skilled in the art can also apply it to natural languages such as Japanese and Korean.
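The segmentation in step S110 can be sketched as follows. This is a minimal illustration and not the patent's implementation: the lexicon of first-type words, the `max_len` cap, the "W"/"C" markers and the function name `enumerate_sequences` are all assumptions introduced here. At every position the sketch either consumes a lexicon entry (a possible first-type word) or a single character (a possible component of a second-type word).

```python
from typing import List, Set, Tuple

# Hypothetical token representation: ("W", word) for a possible first-type
# word, ("C", character) for a possible component of a second-type word.
Token = Tuple[str, str]


def enumerate_sequences(sentence: str, lexicon: Set[str],
                        max_len: int = 4) -> List[List[Token]]:
    """Return every candidate sequence for `sentence` (exhaustive, exponential)."""
    if not sentence:
        return [[]]
    results: List[List[Token]] = []
    # Option 1: a lexicon entry starting at the current position.
    for n in range(1, min(max_len, len(sentence)) + 1):
        head = sentence[:n]
        if head in lexicon:
            for rest in enumerate_sequences(sentence[n:], lexicon, max_len):
                results.append([("W", head)] + rest)
    # Option 2: the current character as a possible second-type component.
    for rest in enumerate_sequences(sentence[1:], lexicon, max_len):
        results.append([("C", sentence[0])] + rest)
    return results
```

The exhaustive recursion is only meant to make the search space visible; a practical analyzer would more likely store the candidates in a lattice and search it with dynamic programming.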

[0027] The phrase "possibly a word of the first type" means that the item is provisionally treated as a word of the first type, although in the final word segmentation result it may turn out not to be a legal word, or not to be a word of the first type. "A character that may ...

Second embodiment

[0050] According to one aspect of the present invention, a natural language lexical analysis device is provided. Figure 3 shows a schematic structural diagram of a natural language lexical analysis device 300 according to an embodiment of the present invention. As shown in Figure 3, the natural language lexical analysis device 300 may include a segmentation unit 310, a statistical probability model storage unit 320, a score calculation unit 330, a candidate sequence determination unit 340 and a labeling unit 350.

[0051] The segmentation unit 310 may be configured to segment the input natural language sentence into a plurality of sequences composed of words that may be words of the first type and/or characters that may be components of words of the second type, where words of the first type are all words other than words of the second type.
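One way to picture how the five units of device 300 fit together is the sketch below. It is an assumption-laden illustration: the class name `LexicalAnalysisDevice`, the dictionary-based model and the callable signatures are hypothetical, and choosing the single highest-scoring sequence is a simplification of candidate sequence determination.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]        # ("W", word) or ("C", character); hypothetical encoding
Seq = List[Token]
Model = Dict[Token, float]     # hypothetical stand-in for the stored probabilities


@dataclass
class LexicalAnalysisDevice:
    segment: Callable[[str], List[Seq]]                          # segmentation unit 310
    model: Model                                                 # model storage unit 320
    score: Callable[[Seq, Model], float]                         # score calculation unit 330
    label: Callable[[Seq, Model], List[Tuple[str, str, float]]]  # labeling unit 350

    def analyze(self, sentence: str) -> List[Tuple[str, str, float]]:
        candidates = self.segment(sentence)
        # Candidate sequence determination unit 340 (simplified): keep the
        # sequence with the highest score.
        best = max(candidates, key=lambda seq: self.score(seq, self.model))
        return self.label(best, self.model)
```

Passing the units in as callables keeps the sketch independent of any particular segmentation or scoring strategy.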

[0052] For convenience, this embodiment is again described using Chinese sentences. It should be ...

Third embodiment

[0066] According to another aspect of the present invention, a method for training a natural language lexical analyzer is provided. Figure 5 shows a flowchart of a method for training a natural language lexical analyzer according to another embodiment of the present invention.

[0067] As shown in Figure 5, the natural language lexical analyzer training method may include labeling natural language sequences to serve as a training corpus (step S510). In this labeling, words of the first type are marked with word information only, while the characters that form words of the second type are marked with character information. Words of the first type are all words other than words of the second type.
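The labeling in step S510 might look like the sketch below. The input format (word, part of speech, second-type flag) and the B/I/E/S position prefixes used as character information are assumptions made for illustration; the text above does not specify what the character information contains.

```python
from typing import List, Tuple

# Hypothetical input: (word, part_of_speech, is_second_type)
AnnotatedWord = Tuple[str, str, bool]
# Hypothetical output: (text, level, tag) where level is "word" or "char"
Label = Tuple[str, str, str]


def label_training_sentence(words: List[AnnotatedWord]) -> List[Label]:
    """Label first-type words as whole words and second-type words per character."""
    labeled: List[Label] = []
    for text, pos, is_second_type in words:
        if not is_second_type:
            # First-type word: word-level information only.
            labeled.append((text, "word", pos))
        elif len(text) == 1:
            # Single-character second-type word.
            labeled.append((text, "char", f"S-{pos}"))
        else:
            # Second-type word: one label per character, with an assumed
            # position prefix (B = begin, I = inside, E = end).
            last = len(text) - 1
            for i, ch in enumerate(text):
                position = "B" if i == 0 else "E" if i == last else "I"
                labeled.append((ch, "char", f"{position}-{pos}"))
    return labeled
```

For instance, a chemical name such as 二氧化碳 (carbon dioxide) flagged as second-type would be split into four character-level labels, which is what lets a model learn how such names are composed.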

[0068] Still taking Chinese as an example, the Chinese sentence "Xiao Ming is going to school tomorrow" is labeled and provided to the analyzer as training corpus. For example, "tomorrow", "go" and "go to school" can be classified as the first type of wo...


Abstract

The invention discloses a natural language analysis method, a natural language analysis device and an analyzer training method. The analysis method comprises the following steps: segmenting the input natural language sentence into a plurality of sequences composed of possible first-type words and/or characters that are possibly constituent parts of second-type words, where first-type words are all words other than second-type words; computing a score for each sequence with a statistical probability model, where the model comprises the statistical probability of a first-type word in its context and the statistical probability of a character being a constituent part of a second-type word; determining a candidate sequence according to the scores; and labeling the natural language sentence according to the candidate sequence. Each first-type word that may exist in the candidate sequence is labeled with word information obtained from the statistical probability model, and each remaining character is labeled with character information obtained from the statistical probability model.
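The scoring and candidate selection steps can be illustrated with a small sketch. It deliberately simplifies the model described above: probabilities are looked up per token without any context, the two probability tables and the `floor` value for unseen entries are hypothetical, and the single best sequence is taken as the candidate.

```python
import math
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # ("W", word) or ("C", character); hypothetical encoding


def score_sequence(seq: List[Token],
                   word_prob: Dict[str, float],
                   char_prob: Dict[str, float],
                   floor: float = 1e-8) -> float:
    """Sum of log-probabilities; unseen entries fall back to `floor`."""
    total = 0.0
    for kind, text in seq:
        table = word_prob if kind == "W" else char_prob
        total += math.log(table.get(text, floor))
    return total


def best_sequence(candidates: List[List[Token]],
                  word_prob: Dict[str, float],
                  char_prob: Dict[str, float]) -> List[Token]:
    """Pick the candidate sequence with the highest score."""
    return max(candidates,
               key=lambda s: score_sequence(s, word_prob, char_prob))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. A context-aware version would condition each lookup on the preceding token.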

Description

Technical field

[0001] The present invention relates to the field of natural language processing, and in particular to a natural language lexical analysis method, a natural language lexical analysis device and a natural language lexical analyzer training method.

Background

[0002] Natural language lexical analysis is the segmentation of a natural language sequence (e.g., a sentence or a paragraph) into the words that constitute it. Words sit at a level below sentences and above morphemes (such as characters in Chinese), although some sentences consist of a single word and some words consist of a single morpheme. Traditional natural language lexical analysis divides the natural language sequence into several combinations of possible words, calculates a score for each combination according to the probability of each word in its context, selects the combinations whose scores meet a certain threshold, and words in the sequence are ta...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventors: 孟遥, 于浩
Owner: FUJITSU LTD