Microblog text normalizing, word segmenting and part-of-speech tagging method and system

A part-of-speech tagging and microblog technology, applied in special data processing applications, instruments, and electrical digital data processing. It addresses the error propagation, rising task error rates, and low efficiency of serial processing, improving the performance of each task and the overall performance.

Active Publication Date: 2015-04-22
北京牡丹电子集团有限责任公司数字科技中心

AI Technical Summary

Problems solved by technology

[0005] 2) Children who do not hand in their homework will have safflowers
[0010] Microblog text is traditionally processed serially: text normalization is performed first, followed by other processing such as word segmentation and part-of-speech tagging. This approach is inefficient, and it propagates errors: if normalization is wrong, the error rate of the subsequent tasks inevitably increases.
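The error-propagation argument can be made concrete with a small calculation. The sketch below is only an illustration under a simplifying assumption that is not stated in the patent: each stage of a serial pipeline can be correct only if every earlier stage was correct, and stage errors are independent. The function name `pipeline_accuracy` and the accuracy figures are hypothetical.

```python
from math import prod
from typing import Sequence


def pipeline_accuracy(stage_accuracies: Sequence[float]) -> float:
    """Accuracy of a serial pipeline under the independence assumption.

    Each value is the probability that one stage is correct given correct
    input; the pipeline is correct only when every stage is correct.
    """
    return prod(stage_accuracies)


# Hypothetical figures: normalization, segmentation, tagging at 90% each.
serial = pipeline_accuracy([0.9, 0.9, 0.9])
print(f"end-to-end accuracy of the serial pipeline: {serial:.3f}")
```

Under these assumed numbers the end-to-end accuracy is well below that of any single stage, which is the effect the serial approach suffers from.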

Examples

Embodiment Construction

[0067] The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

[0068] As shown in Figure 1, a method for microblog text normalization, word segmentation, and part-of-speech tagging includes the following steps:

[0069] Step 1: construct an annotated corpus and divide the annotated data in it into a training set, a development set, and a test set;

[0070] Step 2: train an SVM model to construct a microblog dictionary, that is, the normalization candidate set;

[0071] Step 3: using the training set, the development set, and the microblog dictionary, train a joint model for microblog text normalization, word segmentation, and part-of-speech tagging with the BeamSearch method;

[0072] Step 4, use the joint model to perform text normaliz...
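The joint decoding described in the steps above can be sketched as a beam search in which every hypothesis carries a segmentation, a normalization, and a tag sequence at once. This is a minimal illustration, not the patent's model: the excerpt does not give the feature set or training procedure, so the hand-set scores in `WORD_TAG_SCORE`, the tag set, the entries of `MICROBLOG_DICT`, and the English-style toy input are all hypothetical stand-ins for learned weights and a learned dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical microblog dictionary: informal form -> standard candidates.
MICROBLOG_DICT: Dict[str, List[str]] = {"u": ["you"], "2": ["to", "too"]}

# Hypothetical toy scores standing in for learned feature weights.
WORD_TAG_SCORE: Dict[Tuple[str, str], float] = {
    ("see", "V"): 2.0,
    ("you", "PRON"): 2.0,
    ("to", "PART"): 1.0,
    ("too", "ADV"): 1.5,
}
TAGS = ["V", "PRON", "PART", "ADV", "X"]

Token = Tuple[str, str, str]  # (surface form, normalized form, tag)
Hypothesis = Tuple[float, List[Token]]


def candidates(surface: str) -> List[str]:
    """The surface form itself plus any dictionary normalizations."""
    return [surface] + MICROBLOG_DICT.get(surface, [])


def score(surface: str, norm: str, tag: str) -> float:
    """Toy local score; unknown (word, tag) pairs are penalized."""
    s = WORD_TAG_SCORE.get((norm, tag), -1.0)
    if norm != surface:
        s -= 0.1  # small assumed cost for rewriting the surface form
    return s


def beam_decode(text: str, beam_size: int = 4, max_word_len: int = 4) -> List[Token]:
    """Jointly segment, normalize, and tag `text` with beam search."""
    n = len(text)
    # agenda[i] holds the best partial hypotheses covering text[:i].
    agenda: List[List[Hypothesis]] = [[] for _ in range(n + 1)]
    agenda[0] = [(0.0, [])]
    for end in range(1, n + 1):
        expanded: List[Hypothesis] = []
        for start in range(max(0, end - max_word_len), end):
            surface = text[start:end]
            for prev_score, prev_tokens in agenda[start]:
                for norm in candidates(surface):
                    for tag in TAGS:
                        expanded.append((
                            prev_score + score(surface, norm, tag),
                            prev_tokens + [(surface, norm, tag)],
                        ))
        # Prune: keep only the beam_size highest-scoring hypotheses.
        expanded.sort(key=lambda h: h[0], reverse=True)
        agenda[end] = expanded[:beam_size]
    return agenda[n][0][1]


for surface, norm, tag in beam_decode("seeu2"):
    print(surface, norm, tag)
```

Because all three decisions are scored together, a good normalization candidate can rescue a segmentation that would otherwise look unlikely, and vice versa. In a trained system the toy score would be replaced by a learned model, and the beam would typically be used during training as well as decoding.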

Abstract

The invention relates to a method for microblog text normalization, word segmentation, and part-of-speech tagging. The method comprises the following steps: first, a tagged corpus is built and its tagged data is divided into a training set, a development set, and a test set; second, a microblog dictionary is built by training an SVM model; third, using the training set, the development set, and the microblog dictionary, a joint model for text normalization, word segmentation, and part-of-speech tagging is trained with the BeamSearch method; fourth, the joint model performs text normalization, word segmentation, and part-of-speech tagging simultaneously on the microblog text to be processed, and the performance of the joint model is tested. The method uses a large number of microblog sentences with tags as the training corpus and expands the candidate results through the microblog dictionary. The joint model handles the three tasks at the same time, and because the tasks influence one another, the performance of each task improves and so does the overall performance.
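The first step of the abstract, dividing the tagged corpus into three sets, could look like the sketch below. The split ratios, the fixed random seed, and the function name `split_corpus` are assumptions made for illustration; the excerpt does not state how the patent divides the data.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_corpus(
    sentences: Sequence[T],
    train_frac: float = 0.8,  # assumed ratio, not from the patent
    dev_frac: float = 0.1,    # assumed ratio, not from the patent
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle tagged sentences and split them into train/dev/test lists."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = int(len(items) * train_frac)
    n_dev = int(len(items) * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # everything left over
    return train, dev, test


train, dev, test = split_corpus([f"sentence-{i}" for i in range(100)])
print(len(train), len(dev), len(test))
```

The development set is typically used to tune settings such as the beam size, while the test set is held out for the final performance check mentioned in the fourth step.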

Description

Technical field

[0001] The invention relates to the field of natural language processing, in particular to a BeamSearch-based microblog text normalization method.

Background

[0002] Commonly used natural language processing techniques such as word segmentation, part-of-speech tagging, and syntactic analysis all assume normalized text, and they perform poorly on non-normalized text such as Weibo posts. Models trained on traditional corpora therefore cannot be applied directly to microblog text, and new research on microblog text processing is needed.

[0003] Weibo text contains many non-standard language phenomena, especially the heavy use of non-standard words. For example:

[0004] 1) I just saw Chen Laoshi's scarf and found out that friend c has come to Ningbo.

[0005] 2) Children who do not hand in their homework will have safflowers.

[0006] If the traditional model is used for word segmentation and part-of-speech...

Application Information

IPC(8): G06F17/27, G06F17/30
Inventor: 滕顺祥, 钱涛, 姬东鸿, 白旭
Owner: 北京牡丹电子集团有限责任公司数字科技中心