An unsupervised method for long-text recognition of Internet public opinion spam

A method for identifying spam text in Internet public opinion, applied in the field of information processing. It addresses the problems that existing approaches become unusable or inaccurate when little supervised data is available and that supervised data is costly to obtain, and it achieves the effect of reducing costs.

Active Publication Date: 2021-06-22
南京擎盾信息科技有限公司

AI Technical Summary

Problems solved by technology

In practice, obtaining a large amount of supervised data is very costly. When little or no supervised data is available, the performance of such models or methods degrades sharply, and they may become unusable.
For the second type of method, accuracy is often relatively low when judging whether a long text is spam.

Examples

Embodiment 1

[0052] An unsupervised method for recognizing long spam texts in Internet public opinion. The method processes a long public opinion text to be predicted through the following steps:

[0053] (1) Corpus acquisition: obtain labeled public opinion spam texts and normal texts from the existing internal system;

[0054] (2) Model training: build two models, namely a language model trained on Internet public opinion text and a BERT next sentence prediction model trained on Internet public opinion text, and input the long public opinion text to be predicted into the language model and the BERT next sentence prediction model respectively;

[0055] The judgment process of the language model is as follows:

[0056] (X1) Statistical language model;

[0057] The statistical language model is used to calculate the probability that a sentence S is a normal sentence, formalized as $p(S) = p(w_1, w_2, \ldots, w_n)$, where ...
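
As a rough illustration of how a sentence probability of this kind, and the perplexity derived from it, can be computed, the sketch below trains a bigram model with add-one smoothing on tokenized sentences. The bigram order, the smoothing scheme, the `<unk>` handling, and every name in the code (`BigramLM`, `prob`, `log_prob`, `perplexity`) are assumptions made for illustration; the paragraph above is truncated and does not specify them.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one smoothing (illustrative assumption)."""

    def __init__(self, sentences):
        # sentences: iterable of token lists, e.g. word-segmented normal texts
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = {"</s>", "<unk>"}
        for tokens in sentences:
            padded = ["<s>"] + list(tokens) + ["</s>"]
            self.vocab.update(tokens)
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))

    def _pad(self, tokens):
        # Map out-of-vocabulary tokens to <unk>, then add boundary markers.
        known = [t if t in self.vocab else "<unk>" for t in tokens]
        return ["<s>"] + known + ["</s>"]

    def prob(self, prev, word):
        # p(word | prev) with add-one smoothing over the vocabulary
        return (self.bigram_counts[(prev, word)] + 1) / (
            self.context_counts[prev] + len(self.vocab)
        )

    def log_prob(self, tokens):
        # log p(S) as a sum of conditional log-probabilities
        padded = self._pad(tokens)
        return sum(math.log(self.prob(a, b)) for a, b in zip(padded, padded[1:]))

    def perplexity(self, tokens):
        # Normalize by the number of predicted tokens (words plus </s>).
        return math.exp(-self.log_prob(tokens) / (len(tokens) + 1))
```

Under this sketch, a sentence that resembles the training corpus receives a lower perplexity than a scrambled or out-of-domain sentence, which is the property a perplexity threshold would rely on.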

Embodiment 2

[0068] An unsupervised method for recognizing long spam texts in Internet public opinion. The method processes a long public opinion text to be predicted through the following steps:

[0069] (1) Corpus acquisition: obtain labeled public opinion spam texts and normal texts from the existing internal system;

[0070] (2) Model training: build two models, namely a language model trained on Internet public opinion text and a BERT next sentence prediction model trained on Internet public opinion text, and input the long public opinion text to be predicted into the language model and the BERT next sentence prediction model respectively;

[0071] The judgment process of the language model is as follows:

[0072] (X1) Statistical language model;

[0073] The statistical language model is used to calculate the probability that a sentence S is a normal sentence, formalized as $p(S) = p(w_1, w_2, \ldots, w_n)$, where ...

Embodiment 3

[0091] An unsupervised method for recognizing long spam texts in Internet public opinion. The method processes a long public opinion text to be predicted through the following steps:

[0092] (1) Corpus acquisition: obtain labeled public opinion spam texts and normal texts from the existing internal system;

[0093] (2) Model training: build two models, namely a language model trained on Internet public opinion text and a BERT next sentence prediction model trained on Internet public opinion text, and input the long public opinion text to be predicted into the language model and the BERT next sentence prediction model respectively;

[0094] The judgment process of the language model is as follows:

[0095] (X1) Statistical language model;

[0096] The statistical language model is used to calculate the probability that a sentence S is a normal sentence, formalized as $p(S) = p(w_1, w_2, \ldots, w_n)$, where ...

Abstract

The invention discloses an unsupervised method for recognizing long spam texts in Internet public opinion. The method includes the following steps: obtaining labeled public opinion spam texts and normal texts from an existing internal system; building two models, namely a language model trained on Internet public opinion text and a BERT next sentence prediction model trained on Internet public opinion text; and inputting the long public opinion text to be predicted into the language model and the BERT next sentence prediction model respectively. The invention uses the perplexity of the language model to evaluate whether the text within each sentence is spam, uses the BERT next sentence prediction model to evaluate the contextual coherence between the sentences of the text, and combines the two to complete spam recognition for long texts. It can identify spam text automatically while greatly reducing the cost of obtaining supervised data, allowing a system without supervised data to identify spam text from the start.
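
The abstract does not say how the two signals are combined, so the following is only one possible sketch. It assumes two scoring callables supplied by the caller: `perplexity` (for example, from a language model trained on normal text) and `next_sentence_prob` (for example, the "is next" probability from a BERT next sentence prediction model). The function name `flag_long_text`, the averaging, the two thresholds, and the OR rule are all hypothetical choices, not the patented decision rule.

```python
from typing import Callable, Sequence


def flag_long_text(
    sentences: Sequence[str],
    perplexity: Callable[[str], float],
    next_sentence_prob: Callable[[str, str], float],
    max_perplexity: float,
    min_coherence: float,
) -> bool:
    """Return True if the text is flagged as spam under the assumed rule."""
    if not sentences:
        return False

    # Signal 1: how unusual each sentence looks on its own.
    avg_ppl = sum(perplexity(s) for s in sentences) / len(sentences)

    # Signal 2: how well each sentence follows the previous one.
    pairs = list(zip(sentences, sentences[1:]))
    if pairs:
        avg_coherence = sum(next_sentence_prob(a, b) for a, b in pairs) / len(pairs)
    else:
        avg_coherence = 1.0  # a single sentence has no pair to contradict it

    # Assumed combination: flag if either signal crosses its threshold.
    return avg_ppl > max_perplexity or avg_coherence < min_coherence
```

Keeping the scorers as plain callables lets the same decision logic be exercised with simple stand-ins before any trained model is plugged in; the thresholds would have to be tuned on held-out text.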

Description

Technical field

[0001] The invention relates to the technical field of information processing, in particular to an unsupervised method for recognizing long spam texts in Internet public opinion.

Background technique

[0002] Internet public opinion refers to the socio-political attitudes, beliefs, and values that the public forms and holds toward public issues and social managers as intermediary social events occur, develop, and change within a certain social space. It is the sum of the beliefs, attitudes, opinions, and emotions expressed by a large number of people about various phenomena and problems in society. The rapid formation of Internet public opinion has a huge impact on society. With the rapid development of the Internet worldwide, online media has been recognized as the "fourth medium" after newspapers, radio, and television, and the Internet has become one of the main carriers of public opinion. For the network public opinion text data...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35
CPC: G06F16/353, G06F16/355
Inventor: 王义真, 杜向阳, 吴明勇
Owner: 南京擎盾信息科技有限公司