Rank-Based Text Matching Method for Plagiarism Detection

A text matching method that can be used in unstructured text data retrieval, text database clustering/classification, semantic analysis, and similar tasks. It addresses problems such as poor detection performance and achieves improved statistical significance and good performance.

Active Publication Date: 2021-09-03
HEILONGJIANG INST OF TECH

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to provide a ranking-based text matching method for plagiarism detection, in order to solve the problem that heuristic methods rely on expert experience, which results in poor detection performance.

Examples

Embodiment Construction

[0068] As shown in Figures 1 and 2, this embodiment describes the ranking-based text matching method for plagiarism detection as follows:

[0069] 1 About Plagiarism

[0070] Generally, plagiarism can be divided into low-obfuscation (low-ambiguity) plagiarism, such as full copy, partial copy, and simple modification, and high-obfuscation (highly fuzzy) plagiarism, including paraphrase plagiarism, summary plagiarism, cross-language plagiarism, etc. (Alzahrani et al., 2012). Poor performance on high-obfuscation plagiarism is currently the biggest problem in plagiarism detection, and heuristic methods are far from satisfactory on it. The main reason is that the vocabulary of highly obfuscated plagiarized text differs substantially from that of the source text, so very few words match and it is difficult to identify plagiarized matches accurately.
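The vocabulary gap described above can be illustrated with a minimal sketch. It is not part of the patent: the function names, the whitespace tokenizer, and the three example sentences are assumptions chosen for illustration, and Jaccard word overlap stands in for whichever overlap measure a heuristic detector might use.

```python
# Illustrative only: word-set overlap between a source sentence and two rewrites.
PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in words if w]

def jaccard(a, b):
    """Jaccard overlap between the word sets of two texts."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical example texts.
source = "The quick brown fox jumps over the lazy dog."
near_copy = "The quick brown fox leaps over the lazy dog."                   # simple modification
paraphrase = "A fast auburn canine-like animal hops above a sleepy hound."  # heavy rewording

print("near copy :", round(jaccard(source, near_copy), 3))
print("paraphrase:", round(jaccard(source, paraphrase), 3))
```

In this toy case the simple modification keeps most of the source words, while the paraphrase shares none of them, so a detector that only counts shared words has nothing to match on.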

[0071] 2 Analysis of Plagiarism Matching Problems

[0072] To i...

Abstract

The invention provides a ranking-based text matching method for plagiarism detection and relates to the technical field of plagiarism detection. To detect high-obfuscation (highly fuzzy) plagiarism, the invention addresses the problem that heuristic methods rely on expert experience and cannot integrate the various effective features used in plagiarism detection. Plagiarized text matching is formalized as a ranking task: given a suspicious text segment, the method applies a sequence-based learning-to-rank method to find the segment in the source document that it most likely plagiarizes. The invention introduces the machine translation evaluation metric METEOR to capture lexical and semantic similarity. The method is evaluated on the PAN 2012 and PAN 2013 plagiarism detection datasets and compared with the best-performing methods in the PAN 2012, 2013, and 2014 evaluations. On the high-obfuscation and summary plagiarism subsets, the invention improves the Plagdet evaluation metric by 22% and 43%, respectively, over the baseline method. The method is also more time-efficient than the baseline method.
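The ranking formulation can be sketched roughly as follows. This is not the patented method: the patent learns a ranking from several features, whereas this sketch orders candidate source segments by a single hand-written score. That score is a recall-weighted harmonic mean of unigram precision and recall in the style of METEOR's F-mean, using exact word matches only. Full METEOR also matches stems and synonyms and applies a fragmentation penalty, all omitted here. The names `fmean` and `rank_source_segments`, the whitespace tokenizer, and the example texts are assumptions.

```python
from collections import Counter

def tokens(text):
    """Very simple tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()

def fmean(candidate, reference, alpha=0.9):
    """Recall-weighted harmonic mean of unigram precision and recall.

    With alpha=0.9 this equals 10*P*R / (R + 9*P). Exact matches only.
    """
    c, r = Counter(tokens(candidate)), Counter(tokens(reference))
    matches = sum((c & r).values())  # clipped unigram matches
    if matches == 0:
        return 0.0
    precision = matches / sum(c.values())
    recall = matches / sum(r.values())
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

def rank_source_segments(suspicious, source_segments):
    """Return (score, segment) pairs, best-scoring source segment first."""
    scored = [(fmean(seg, suspicious), seg) for seg in source_segments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical suspicious segment and candidate source segments.
suspicious = "ranking models order candidate segments by relevance"
segments = [
    "the weather was cold and windy all week",
    "ranking models order candidate segments by their relevance",
    "candidate segments are stored in an index",
]

ranked = rank_source_segments(suspicious, segments)
for score, seg in ranked:
    print(f"{score:.3f}  {seg}")
```

A learned ranker would replace the single `fmean` score with a model trained over many such features, which is where a purely lexical score like this one would be supplemented for heavily reworded text.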

Description

Technical Field

[0001] The invention relates to a text matching method for plagiarism detection, in the technical field of plagiarism detection.

Background

[0002] Plagiarized text matching is the core task of plagiarism detection. It aims to find the plagiarized fragments that match between a suspicious document and the source document it plagiarized (Potthast et al., 2012a; 2013a; 2014). Researchers have done a great deal of work on plagiarized text matching, most of it based on heuristic methods: words or characters are used to represent the suspicious document and the plagiarized source document, and exact or likely plagiarism matches are then identified either by counting the overlapping characters and words between fragments of the two documents or by the similarity of their text vectors.

[0003] Such methods perform well on low-obfuscation (low-ambiguity) plagiarism detection but unsatisfactorily on high-obfuscation (highly fuzzy) plagiarism detection. For example, ta...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/30, G06F16/35
CPC: G06F40/30
Inventor: 孔蕾蕾, 韩中元, 齐浩亮
Owner: HEILONGJIANG INST OF TECH