Method for detecting repetition data of social media

A duplicate-data detection technology for social media, applied in the field of detecting duplicate data in social media. It addresses the problems of low applicability, poor automation, and difficulty in eliminating semantic differences, and achieves fast, efficient detection.

Inactive Publication Date: 2016-06-15
EAST CHINA NORMAL UNIV
Cites: 3; Cited by: 20

AI Technical Summary

Problems solved by technology

Methods based on semantic analysis use natural language understanding and processing to detect duplicate documents. Such methods can detect not only partial differences within document sentences but also duplicate documents that are worded differently yet carry the same meaning. However, extracting semantic features from text is complicated, and semantic analysis is harder still for a semantically complex language such as Chinese. It is difficult to eliminate semantic differences with this approach, so the accuracy of duplicate detection is hard to guarantee.
Methods based on writing style assume that every author has individual writing habits, and that these habitual styles can serve as a document fingerprint for detecting duplicates. Such methods are poorly automated and have limited applicability.


Embodiment Construction

[0027] The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the processes, conditions, and experimental methods for implementing the present invention are general and common knowledge in this field, and the present invention places no special limitation on them.

[0028] The present invention applies the Locality-Sensitive Hashing (LSH) algorithm to duplicate detection in social media. Locality-sensitive hashing was first proposed by Piotr Indyk and Rajeev Motwani in 1998, and a concrete implementation was given in 1999 by Aristides Gionis, P. Indyk et al. The idea behind locality-sensitive hashing is that, if a certain error is tolerated, two elements that are similar remain similar after the mapping operation, so that we can focus on those elements that are most likely to be simil...
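
The locality-sensitive idea can be illustrated with a small Python sketch. It is not taken from the patent text: it assumes each text is already represented by a fixed-length integer signature and uses the common "banding" variant of LSH, in which a signature is cut into bands and two texts become candidates if they agree on every row of at least one band. The function name `lsh_candidate_pairs` and the parameters `bands` and `rows` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """Return pairs of ids whose signatures agree on at least one full band.

    signatures: dict mapping an id to a list of ints of length bands * rows.
    bands, rows: hypothetical tuning parameters, not values from the patent.
    """
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for b in range(bands):
            # The band index is part of the key, so equal values that occur
            # in different bands never share a bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)

    pairs = set()
    for ids in buckets.values():
        # Only ids that share a bucket are ever compared with each other.
        for x, y in combinations(sorted(ids), 2):
            pairs.add((x, y))
    return pairs


if __name__ == "__main__":
    demo = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
    print(lsh_candidate_pairs(demo, bands=2, rows=2))
```

Under this scheme only texts that share a bucket are compared, which is where the savings over comparing every pair come from. More rows per band make the filter stricter, and more bands make it more permissive.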

Abstract

The invention discloses a method for detecting duplicate data in social media. The method comprises the following steps: dividing each text data item of the social media data into multiple text elements, which form the set corresponding to that text data item; using a hash function to map all text elements in the set to hash values and taking the minimum hash value, then repeating the mapping multiple times to obtain an array of minimum hash values that serves as the minimum hash signature of the text data item; using a locality-sensitive hashing algorithm to map the text data items, according to their minimum hash signatures, to different detection queues; and calculating the Jaccard similarity between any two text data items in the same detection queue. Text data items whose Jaccard similarity exceeds a threshold are determined to be duplicate data. The method can increase the efficiency of duplicate detection on large volumes of text.
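
The signature and similarity steps of the abstract can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the abstract does not say how a text is divided into elements or which hash functions are used, so the character 3-gram splitting, the MD5-based base hash, the linear hash family, and all names (`shingles`, `make_hash_params`, `minhash_signature`, `jaccard`) are hypothetical choices.

```python
import hashlib
import random

# Assumption: a Mersenne prime used as the modulus of the hash family.
PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Split a text into the set of its overlapping k-character substrings."""
    text = text.strip()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _base_hash(element):
    # Deterministic 64-bit value; the built-in hash() is salted per process.
    digest = hashlib.md5(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_hash_params(num_hashes, seed=42):
    """Draw (a, b) coefficients for num_hashes functions h -> (a*h + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(elements, params):
    """Keep the minimum hash value of the set under each hash function."""
    hashed = [_base_hash(e) for e in elements]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    params = make_hash_params(num_hashes=64)
    s1 = shingles("social media duplicate detection")
    s2 = shingles("social media duplicate detector")
    sig1 = minhash_signature(s1, params)
    sig2 = minhash_signature(s2, params)
    matching = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
    print("exact Jaccard:", jaccard(s1, s2))
    print("fraction of matching signature positions:", matching)
```

The fraction of positions at which two signatures agree is an estimate of the Jaccard similarity of the underlying sets, which is why short signatures can stand in for full sets when texts are grouped. The exact Jaccard computation is then needed only for texts that end up in the same detection queue.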

Description

Technical field

[0001] The invention belongs to the technical field of data mining, and in particular relates to a method for detecting duplicate data in social media.

Background art

[0002] With the rapid development of social media, duplicate spam on social media has proliferated as well. Plagiarism and copying are common on social media, and large amounts of fully or nearly duplicated content flood social networks and crowd out what users see. This seriously affects people's normal social life, blurs social topics and trends, infringes the copyright of original authors, and hinders the healthy development of social networks. Detecting duplicate content and removing duplicate social media spam is therefore extremely important.

[0003] Many mature algorithms, theories, and technologies exist for duplicate detection. Duplicate detection research and technology mainly focus on document duplicatio...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 章群燕, 石丹丹, 钱卫宁, 周傲英
Owner: EAST CHINA NORMAL UNIV