Unsupervised online public opinion junk long text recognition method

A recognition method for online public opinion text, applied in the field of information processing, that addresses the problems of degraded effectiveness, unusable models, and low accuracy.

Active Publication Date: 2020-10-02
南京擎盾信息科技有限公司

AI Technical Summary

Problems solved by technology

However, in practice, obtaining a large amount of supervised (labeled) data is very costly. When little or no supervised data is available, the effectiveness of such models or methods is greatly reduced, or they become unusable.
For the second type of method, accuracy is often relatively low when judging whether a long text is junk text.


Examples

Embodiment 1

[0049] An unsupervised method for recognizing junk long texts in online public opinion, the recognition method comprising the following steps:

[0050] (1) Corpus acquisition: obtain the corresponding labeled public opinion junk text and normal text data from the existing internal system;

[0051] (2) Model training: build two models, namely a language model trained on online public opinion text and a BERT next sentence prediction model based on online public opinion text, and input the online public opinion long text to be predicted into the language model and the BERT next sentence prediction model, respectively;

[0052] The judgment process of the language model is as follows:

[0053] (X1) Statistical language model;

[0054] A statistical language model is used to compute the probability that a sentence is a normal sentence, formalized as $P(S) = P(w_1, w_2, \ldots, w_n)$, where $P(S)$ denotes the probability of the sentence $S$, and $w_i$ denotes the $i$-th mini...
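The statistical language model step can be illustrated with a minimal sketch. It assumes a bigram model with add-one smoothing and pre-tokenized sentences; the source excerpt is truncated before it specifies the model order, smoothing, or tokenization, so these choices and the function names `train_bigram` and `perplexity` are hypothetical. Perplexity is computed as $\exp\left(-\frac{1}{n}\sum_i \log P(w_i \mid w_{i-1})\right)$; a higher value suggests the sentence is less like the training text.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences.

    Each sentence is padded with <s> and </s> boundary markers.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams):
    """Perplexity of one tokenized sentence under the bigram model.

    Add-one (Laplace) smoothing is an assumption made for this sketch;
    it keeps unseen bigrams from having zero probability.
    """
    vocab_size = len(unigrams)
    padded = ["<s>"] + list(tokens) + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    n = len(padded) - 1  # number of predicted tokens, including </s>
    return math.exp(-log_prob / n)
```

A sentence whose word order matches the training corpus should score a lower perplexity than the same words in scrambled order, which is the signal the method uses to judge whether the interior of a sentence looks like junk.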

Embodiment 2

[0063] An unsupervised method for recognizing junk long texts in online public opinion, the recognition method comprising the following steps:

[0064] (1) Corpus acquisition: obtain the corresponding labeled public opinion junk text and normal text data from the existing internal system;

[0065] (2) Model training: build two models, namely a language model trained on online public opinion text and a BERT next sentence prediction model based on online public opinion text, and input the online public opinion long text to be predicted into the language model and the BERT next sentence prediction model, respectively;

[0066] The judgment process of the language model is as follows:

[0067] (X1) Statistical language model;

[0068] A statistical language model is used to compute the probability that a sentence is a normal sentence, formalized as $P(S) = P(w_1, w_2, \ldots, w_n)$, where $P(S)$ denotes the probability of the sentence $S$, and $w_i$ denotes the $i$-th mini...

Embodiment 3

[0083] An unsupervised method for recognizing junk long texts in online public opinion, the recognition method comprising the following steps:

[0084] (1) Corpus acquisition: obtain the corresponding labeled public opinion junk text and normal text data from the existing internal system;

[0085] (2) Model training: build two models, namely a language model trained on online public opinion text and a BERT next sentence prediction model based on online public opinion text, and input the online public opinion long text to be predicted into the language model and the BERT next sentence prediction model, respectively;

[0086] The judgment process of the language model is as follows:

[0087] (X1) Statistical language model;

[0088] A statistical language model is used to compute the probability that a sentence is a normal sentence, formalized as $P(S) = P(w_1, w_2, \ldots, w_n)$, where $P(S)$ denotes the probability of the sentence $S$, and $w_i$ denotes the $i$-th mini...


Abstract

The invention discloses an unsupervised method for recognizing junk long texts in online public opinion. The recognition method comprises the following steps: obtaining the corresponding labeled public opinion junk text and normal text data from an existing internal system; constructing two models, namely a language model trained on online public opinion text and a BERT next sentence prediction model based on online public opinion text, and inputting the online public opinion long text to be predicted into the language model and the BERT next sentence prediction model, respectively; evaluating whether the interior of a sentence is junk text by using the perplexity of the language model; evaluating the contextual coherence between the sentences of the text by using the BERT next sentence prediction model; and completing the junk text recognition task for the long text by combining the junk text information and the supervised data. Junk text information can thus be recognized automatically, the cost of obtaining supervised data is greatly reduced, and a system without supervised data can recognize junk text from the start.
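One way the two signals described in the abstract might be combined is sketched below. The abstract does not state the decision rule, so the thresholds, the ratio-based rule, and every name here (`is_junk_long_text`, `sentence_perplexity`, `next_sentence_prob`) are hypothetical. The two scorers are passed in as callables: the first stands in for the language model perplexity, the second for the probability, as produced by a BERT next sentence prediction model, that one sentence follows another.

```python
from typing import Callable, Sequence

def is_junk_long_text(
    sentences: Sequence[str],
    sentence_perplexity: Callable[[str], float],
    next_sentence_prob: Callable[[str, str], float],
    ppl_threshold: float = 200.0,   # hypothetical cut-off, to be tuned
    nsp_threshold: float = 0.5,     # hypothetical cut-off, to be tuned
    max_bad_ratio: float = 0.5,     # hypothetical tolerance, to be tuned
) -> bool:
    """Flag a long text as junk from intra- and inter-sentence signals."""
    # Intra-sentence check: share of sentences with high perplexity.
    ppl_bad = sum(1 for s in sentences if sentence_perplexity(s) > ppl_threshold)
    ppl_ratio = ppl_bad / len(sentences) if sentences else 0.0

    # Inter-sentence check: share of adjacent pairs judged incoherent.
    pairs = list(zip(sentences, sentences[1:]))
    nsp_bad = sum(1 for a, b in pairs if next_sentence_prob(a, b) < nsp_threshold)
    nsp_ratio = nsp_bad / len(pairs) if pairs else 0.0

    # Assumed rule: junk if either signal fails for most of the text.
    return ppl_ratio > max_bad_ratio or nsp_ratio > max_bad_ratio
```

Passing the scorers as arguments keeps the sketch independent of any particular model library; in practice each callable would wrap a trained model, and the thresholds would need to be calibrated on real public opinion text.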

Description

Technical field

[0001] The invention relates to the technical field of information processing, and in particular to an unsupervised method for recognizing junk long texts in online public opinion.

Background

[0002] Internet public opinion refers to the socio-political attitudes, beliefs, and values that the public forms and holds toward public issues and social administrators in connection with the occurrence, development, and change of intermediary social events within a given social space. It is the sum of the beliefs, attitudes, opinions, and emotions expressed by a large number of people about various phenomena and problems in society. The rapid formation of Internet public opinion has a huge impact on society. With the rapid worldwide development of the Internet, online media has been recognized as the "fourth medium" after newspapers, radio, and television, and the Internet has become one of the main carriers of public opinion. For the Internet public opinion text dat...


Application Information

IPC(8): G06F16/35
CPC: G06F16/353, G06F16/355
Inventor: 王义真, 杜向阳, 吴明勇
Owner: 南京擎盾信息科技有限公司