Microblog text normalization method based on context graph random walk and phonetic configuration codes

A technique combining random walks with word context, applied in the computer field, that addresses the inability of existing methods to meet the normalization requirements of Chinese microblog text.

Status: Inactive
Publication date: 2019-07-19
Owner: 中森云链(成都)科技有限责任公司

AI Technical Summary

Problems solved by technology

Therefore, existing normalization methods cannot...



Embodiment Construction

[0041] The present invention is a microblog text normalization method based on context graph random walk and phonetic configuration codes. The overall process is shown in Figure 1 and includes the following steps:

[0042] Step 1: Segment the Chinese microblog (Weibo) text into words.

[0043] Step 2: Use a standard dictionary to identify non-standard words in the microblog text, and extract the context of each word.

[0044] Step 3: Construct a context graph from the words, the contexts corresponding to each word, and the number of times each word co-occurs with each of its contexts.

[0045] Step 4: Perform a random walk on the context graph to obtain a context-based normalization candidate set for each non-standard word.

[0046] Step 5: From the phonetic configuration codes of the individual Chinese characters, derive the phonetic configuration code of each word.

[0047] Step 6: For each non-standard word, extract the feature vector of the phonetic configuration code, input it into the phonetic configuration code model, and output...
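Steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the one-token context window, the walk count, the hop limit, and the toy corpus are all assumptions made for illustration. The sketch treats the context graph as a bipartite graph between words and contexts whose edge weights are co-occurrence counts, then runs seeded random walks from a non-standard word and ranks the dictionary words the walks reach.

```python
import random
from collections import Counter, defaultdict


def find_nonstandard(tokens, dictionary):
    """Step 2 (sketch): any token missing from the standard dictionary is non-standard."""
    return [t for t in tokens if t not in dictionary]


def extract_contexts(tokens, window=1):
    """Yield (word, context) pairs; a context is the neighbouring tokens around a '_' slot."""
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield word, " ".join(left + ["_"] + right)


def build_context_graph(sentences, window=1):
    """Step 3 (sketch): bipartite word/context graph weighted by co-occurrence counts."""
    counts = Counter()
    for tokens in sentences:
        counts.update(extract_contexts(tokens, window))
    word_to_ctx, ctx_to_word = defaultdict(dict), defaultdict(dict)
    for (word, ctx), n in counts.items():
        word_to_ctx[word][ctx] = n
        ctx_to_word[ctx][word] = n
    return word_to_ctx, ctx_to_word


def weighted_step(rng, neighbours):
    """Pick one neighbour with probability proportional to its edge weight."""
    nodes = list(neighbours)
    return rng.choices(nodes, weights=[neighbours[n] for n in nodes], k=1)[0]


def random_walk_candidates(start, graph, dictionary, walks=200, max_hops=4, seed=0):
    """Step 4 (sketch): walk word -> context -> word until a dictionary word is reached."""
    word_to_ctx, ctx_to_word = graph
    if start not in word_to_ctx:
        return []
    rng = random.Random(seed)  # seeded so repeated runs give the same ranking
    hits = Counter()
    for _walk in range(walks):
        word = start
        for _hop in range(max_hops):
            ctx = weighted_step(rng, word_to_ctx[word])
            word = weighted_step(rng, ctx_to_word[ctx])
            if word in dictionary:
                hits[word] += 1
                break
    # Dictionary words reached most often come first.
    return [w for w, _count in hits.most_common()]


# Toy, pre-segmented corpus (Step 1 is assumed to have been done already).
sentences = [
    ["你", "在", "说", "神马", "呢"],
    ["你", "在", "说", "什么", "呢"],
    ["他", "在", "说", "什么", "呢"],
    ["我", "喜欢", "吃", "苹果"],
]
dictionary = {"你", "他", "我", "在", "说", "什么", "呢", "喜欢", "吃", "苹果"}

graph = build_context_graph(sentences)
candidates = random_walk_candidates("神马", graph, dictionary)
```

In this toy corpus the non-standard word 神马 shares its only context with 什么, so 什么 is the only dictionary word the walks can reach. A real system would need a much larger corpus and would have to choose the window size and hop limit empirically.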


Abstract

The invention provides a microblog text normalization method based on context graph random walk and phonetic configuration codes, and belongs to the field of computer technology, specifically social media text content analysis and mining. The method comprises the following steps: identifying non-standard words and extracting word contexts; constructing a context graph and performing a random walk on it to obtain a context-based normalization candidate set; obtaining a normalization candidate set based on phonetic configuration by using the phonetic configuration codes of the Chinese characters; and processing the two candidate sets to obtain the final normalization result. The method overcomes the failure of traditional methods to take Chinese character pronunciation fully into account. Unlike written language such as news, social media text is full of non-standard abbreviations, homophones and similar-form words, so natural language processing tools perform poorly on microblog text. The invention therefore provides a microblog text normalization method that combines phonetic configuration codes with an understanding of the preceding and following context, making it possible to analyze and mine the text with natural language processing tools after normalization.
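The phonetic-configuration side of the method can be sketched in Python as follows. This is only an illustration under loud assumptions: the per-character codes in `CHAR_CODES` are invented placeholders rather than real phonetic configuration codes, and a simple position-match similarity stands in for the trained model that the method feeds feature vectors into. The names `word_code`, `code_similarity` and `phonetic_candidates` are hypothetical.

```python
# Hypothetical toy table: fixed-length codes invented for this example only.
CHAR_CODES = {
    "神": "SHEN2A",
    "什": "SHEN2B",
    "马": "MA3XXC",
    "么": "ME5XXD",
    "苹": "PING2E",
    "果": "GUO3XF",
}


def word_code(word):
    """Build a word-level code by concatenating the codes of its characters."""
    return "".join(CHAR_CODES[ch] for ch in word)


def code_similarity(code_a, code_b):
    """Fraction of positions at which two equal-length codes agree (0.0 if lengths differ)."""
    if len(code_a) != len(code_b) or not code_a:
        return 0.0
    matches = sum(a == b for a, b in zip(code_a, code_b))
    return matches / len(code_a)


def phonetic_candidates(nonstandard, dictionary_words, min_similarity=0.5):
    """Rank dictionary words by code similarity to the non-standard word.

    Stand-in for the model-based scoring: only words at or above the
    (assumed) similarity threshold are kept, best match first.
    """
    target = word_code(nonstandard)
    scored = [(w, code_similarity(target, word_code(w))) for w in dictionary_words]
    kept = [(w, s) for w, s in scored if s >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ranked = phonetic_candidates("神马", ["苹果", "什么"])
```

With these placeholder codes, 什么 agrees with 神马 at 8 of 12 positions and is kept, while 苹果 agrees at only 2 of 12 and falls below the threshold. How the resulting set is merged with the context-based set is not specified in the text above, so the sketch stops at candidate generation.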

Description

Technical field

[0001] The invention belongs to the technical field of computers, and in particular relates to a microblog text normalization method based on context graph random walk and phonetic configuration codes.

Background

[0002] With the popularity of social networks, new users continue to join them, and tens of thousands of text items are generated on social platforms every day. Weibo has become one of the most important social networking platforms because it is immediate, brief and fast-spreading. It has also become an important medium for obtaining news and current events, connecting with other people, self-expression, social sharing and social participation. These microblog data therefore have great research value. However, microblog text contains a large number of non-standard words, so existing natural language processing tools perform poorly when applied to it directly. If the non-standard words in ...


Application Information

IPC(8): G06F17/27, G06Q50/00
CPC: G06Q50/01, G06F40/284, G06F40/289
Inventor: not disclosed (不公告发明人)
Owner: 中森云链(成都)科技有限责任公司