Sensitive word filtering method based on text content

A sensitive word filtering method, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems such as poor flexibility, the large number of Chinese characters, and the time-consuming, labor-intensive updating and maintenance of manual dictionaries, and it achieves high recall.

Active Publication Date: 2017-12-12
杭州言旭网络科技有限公司
3 Cites, 30 Cited by

AI Technical Summary

Problems solved by technology

[0002] The Internet provides people with a free, convenient, and open space in which anyone can speak freely in a virtual world. As a result, a large amount of network information is presented to people, and sensitive information has appeared in quick succession, causing a serious negative impact on society. Sensitive information is disseminated through various carriers, mainly pictures, sound, video, and text. Non-technical measures alone cannot achieve good results; the timely discovery, tracking, and handling of sensitive information by technical means plays a pivotal role in reducing the harm that sensitive information on the Internet does to society. Sensitive information filtering has therefore become one of the urgent and important technical problems in the field of information processing. English words are separated by spaces, and English uses only 26 letters, whereas Chinese words have no obvious separators other than the necessary punctuation marks, and the number of Chinese characters is huge. Many English sensitive information filtering algorithms are therefore not suitable for filtering Chinese sensitive information. To filter sensitive information well, it is necessary to develop a Chinese sensitive information filtering algorithm that is practical at the information source, in the transmission channel, and at the receiver.
[0003] Early text filtering technology relied mainly on simple keyword matching and word frequency statistics. Among single-pattern matching algorithms, the classic ones are the BF, KMP, and BM algorithms; among multi-pattern matching algorithms, the classic ones are the AC, CW, and WM algorithms. These algorithms can match keywords to a certain extent, but they suffer from high time complexity, slow matching speed in practice, poor flexibility, and difficulty of practical application. Later, some scholars used text classification technology to filter sensitive text information: text features are first extracted, the texts are divided into several categories according to those features, the sensitivity of a text is judged by the category it belongs to, and the sensitive text is then filtered out. Many algorithms have appeared in text classification technology, such as the AP clustering algorithm, the K-means algorithm based on the vector space model, and the suffix tree clustering (STC) algorithm. These algorithms have contributed greatly to identifying sensitive texts, but they cannot deal with the individual sensitive words inside a text. Another common filtering method removes stop words, transliterated words, and the like from the text. However, Chinese text has no obvious word boundaries, so it is difficult to identify sensitive words that are not in the dictionary by word segmentation, and updating and maintaining a manual dictionary is time-consuming and labor-intensive; word segmentation technology itself therefore faces great obstacles. In short texts on online platforms such as Weibo, instant messaging, and the WeChat "circle of friends", people often use stop words such as modal particles and auxiliary words, along with emotive punctuation marks. If preprocessing such as removing stop words and symbols is applied to such texts, the user experience becomes noticeably worse, and the approach is of limited practical use.

Examples

Embodiment Construction

[0048] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0049] As shown in Figure 1, a sensitive word filtering method based on text content includes the following steps:

[0050] Construct a Chinese sensitive lexicon, expand the Chinese words in the Chinese sensitive lexicon into Chinese-pinyin mixed words, and form a Chinese-pinyin mixed sensitive lexicon;

[0051] The expansion of the Chinese sensitive lexicon into the Chinese-pinyin mixed sensitive lexicon is based on the idea of permutation and combination: each Chinese word is expanded using its characters and their corresponding pinyin, so that the sensitive lexicon is complete and comprehensive. Sensitive words are words with politically sensitive tendencies, violent tendencies, unhealthy overtones, or uncivilized language.
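The permutation-and-combination expansion described in [0051] can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the `PINYIN` table, the function names `expand_word` and `expand_lexicon`, and the rule "each character is either kept or replaced by its pinyin" are all hypothetical choices made for the example, and the neutral word 敏感 ("sensitive") stands in for a real lexicon entry.

```python
from itertools import product

# Hypothetical character-to-pinyin table (assumption for illustration only);
# a real system would need a complete mapping.
PINYIN = {"敏": "min", "感": "gan", "词": "ci"}


def expand_word(word, pinyin=PINYIN):
    """Return every variant of `word` in which each character is either
    kept as-is or replaced by its pinyin (when a pinyin entry exists)."""
    choices = []
    for ch in word:
        options = [ch]
        if ch in pinyin:
            options.append(pinyin[ch])
        choices.append(options)
    # Cartesian product over the per-character options.
    return ["".join(combo) for combo in product(*choices)]


def expand_lexicon(words, pinyin=PINYIN):
    """Expand a collection of Chinese words into a set of mixed variants."""
    expanded = set()
    for word in words:
        expanded.update(expand_word(word, pinyin))
    return expanded


variants = expand_word("敏感")
print(variants)
```

A word of n characters yields up to 2^n variants under this rule, so the expanded lexicon grows quickly with word length.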

[0052] The Chinese-pinyin mixed sensitive lexicon can be expressed as:

[0053] $C_{\text{sen\_word}} = \{c_0, c_1, c_2, \ldots, c_i, \ldots$

Abstract

The invention discloses a sensitive word filtering method based on text content. The method comprises the following steps: a Chinese sensitive lexicon is constructed, and the Chinese words in it are expanded into Chinese-pinyin mixed words to form a Chinese-pinyin mixed sensitive lexicon; the transition function of a deterministic finite automaton over all sensitive words is established through a sensitive word search tree structure, and the sensitive words in the Chinese-pinyin mixed sensitive lexicon are built into a sensitive word tree; the sensitive words are then retrieved in a text according to the structure of the sensitive word tree, and each retrieved sensitive word is replaced with designated symbols to complete the filtering. The method has a high recall ratio and is easy to implement in practical applications.
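The tree-based retrieval and replacement described in the abstract can be illustrated with a small sketch. It assumes a plain prefix tree stored as nested dictionaries, a longest-match scan restarted at each text position, and `*` as the designated symbol; the names `build_tree` and `filter_text` are hypothetical, and the patent's actual transition function may differ (for example, this sketch has no failure links as in the AC algorithm).

```python
# Sentinel key marking "a sensitive word ends at this node".
# An object() can never be equal to a character of the input text.
END = object()


def build_tree(words):
    """Build a sensitive word tree (prefix tree) as nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def filter_text(text, tree, mask="*"):
    """Replace every sensitive word found in `text` with `mask` characters,
    preferring the longest match that starts at each position."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node = tree
        j = i
        match_end = -1
        # Walk the tree as far as the text allows, remembering the last
        # position at which a complete word was seen.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                match_end = j
        if match_end > i:
            out.append(mask * (match_end - i))
            i = match_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


tree = build_tree(["敏感", "敏gan", "min感", "mingan"])
print(filter_text("这是敏感词和mingan", tree))
```

Restarting the walk at every position keeps the code short at the cost of re-reading some characters; it is adequate for illustrating how the tree drives both retrieval and replacement.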

Description

Technical field

[0001] The invention relates to the field of sensitive word filtering, and in particular to a sensitive word filtering method based on text content.

Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/322, G06F16/335, G06F16/9535, G06F40/289
Inventor: 李英祥, 吴珊, 胡志恒, 李倩宇
Owner: 杭州言旭网络科技有限公司