A Method of Checking Duplication of Programming Language Codes Based on Tree and Sequence Similarity

Tree and sequence similarity techniques, applied to duplicate checking of programming language code, address the low detection accuracy and high cost of existing methods. The result is an accurate and efficient algorithm with strong resistance to interference and improved duplicate-checking accuracy.

Active Publication Date: 2021-06-01
HUAQIAO UNIVERSITY

AI Technical Summary

Problems solved by technology

Statistics-based methods have low detection accuracy. They are too abstract, resist obfuscation poorly, and ignore the structural characteristics of the program, although their space complexity is low.

Token-based methods also have low detection accuracy, which depends mainly on how tokens are selected and extracted. They rely chiefly on text structure and lexical analysis, and their time and space complexity is low. They can resist obfuscations such as renaming variables and moving functions, but their overall resistance to obfuscation is low and they struggle with injected redundant code.

Tree-based methods generally have high detection accuracy, which depends mainly on how finely the tree is refined. They take grammatical features into account and resist obfuscation well, but they have difficulty with moved functions, statement splitting, and similar changes. Their time and space complexity is high, mainly because building the tree is expensive.

Graph-based methods generally have high detection accuracy, which depends on how finely the graph is refined. They resist obfuscation well and fully account for the syntactic and semantic features of the program. They can resist layout obfuscation but have difficulty resisting some data and control obfuscation. Their time and space complexity and construction cost are high, and subgraph matching is an NP-complete problem.

In general, statistics-based and token-based methods have low detection accuracy, while graph-based methods are accurate but have high time and space complexity. The tree-based duplicate checking method combines high accuracy with comparatively low time and space complexity, which makes it suitable for duplicate checking when few data samples are available.



Examples


Specific embodiment

[0073] The code duplicate checking method is described below using program code 1 and program code 2. The specific implementation is as follows:

[0074] Step a: remove information in the code that interferes with the similarity comparison.

[0075] Table 2 shows the given program code 1 and program code 2 to be checked. Comments, console output, operators, and similar information are removed from both programs, and the processed results are shown in Table 3. The processed results give the variable sequences of program code 1 and program code 2 and preserve the structure of the program.

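[0076]-[0078] Table 2: program code 1 and program code 2 to be checked (table not reproduced in this extract).

[0079]-[0081] Table 3: processed results for program code 1 and program code 2 (table not reproduced in this extract).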
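Since Tables 2 and 3 are not available here, the preprocessing in step a can be illustrated with a minimal sketch. It assumes C-like source text, and the helper names and the list of console calls are hypothetical choices made for this example, not details taken from the patent. It strips comments, drops lines that contain console output calls, and blanks out operators while leaving keywords, braces, and identifiers (and therefore the program structure) in place.

```python
import re

# Assumption: C-like input. This list of console calls is illustrative only.
CONSOLE_CALLS = ("printf", "puts", "cout", "System.out.println")

def strip_comments(source: str) -> str:
    """Remove /* ... */ block comments and // line comments (naive: ignores string literals)."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)

def drop_console_lines(source: str) -> str:
    """Drop whole lines that contain one of the assumed console-output calls."""
    kept = [line for line in source.splitlines()
            if not any(call in line for call in CONSOLE_CALLS)]
    return "\n".join(kept)

def strip_operators(source: str) -> str:
    """Replace arithmetic, comparison, logical, and assignment operators with spaces."""
    return re.sub(r"[+\-*/%=<>!&|^~]+", " ", source)

def preprocess(source: str) -> str:
    """Apply the three removals, then normalize whitespace and discard empty lines."""
    # Comments go first, because '/' and '*' are also operator characters.
    cleaned = strip_operators(drop_console_lines(strip_comments(source)))
    lines = [" ".join(line.split()) for line in cleaned.splitlines()]
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    sample = 'int main() {\n  // add\n  int a = b + c; /* sum */\n  printf("%d", a);\n  return a;\n}'
    print(preprocess(sample))
```

The regular expressions are deliberately simple: a `//` inside a string literal would be treated as a comment, so a real checker would more likely work on a token stream from a proper lexer.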

[0082] Step b: construct a program structure tree according to the program structure.

[0083] A tree is built from the processed results. In this embodiment, every leaf node is a function, denoted Fun. As shown in Figure 3, there are 6 leaf nodes in program code 1 and p...
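The tree-building rules are cut off in this extract, so the following is only a sketch of the general idea: a tree whose leaves are functions labelled `Fun`, each recording where its variables occur. As a stand-in it parses Python source with the standard `ast` module. The `Node` class, the `Program` and `Class` labels, and the choice to record 0-based occurrence indices are all assumptions made for illustration.

```python
import ast
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                      # "Program", "Class", or "Fun" (assumed labels)
    children: list = field(default_factory=list)
    variables: dict = field(default_factory=dict)   # variable name -> occurrence indices

def variable_positions(func: ast.AST) -> dict:
    """Map each name used in a function to the 0-based indices of its occurrences.

    Occurrences are ordered by source position. Parameters in the signature are
    ast.arg nodes, not ast.Name, so only uses inside the function are counted.
    """
    names = [n for n in ast.walk(func) if isinstance(n, ast.Name)]
    names.sort(key=lambda n: (n.lineno, n.col_offset))
    positions = {}
    for index, name in enumerate(names):
        positions.setdefault(name.id, []).append(index)
    return positions

def build_tree(source: str) -> Node:
    """Build a tree with one Fun leaf per function; classes become inner nodes."""
    root = Node("Program")

    def visit(parent: Node, body: list) -> None:
        for stmt in body:
            if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
                parent.children.append(Node("Fun", variables=variable_positions(stmt)))
            elif isinstance(stmt, ast.ClassDef):
                child = Node("Class")
                parent.children.append(child)
                visit(child, stmt.body)

    visit(root, ast.parse(source).body)
    return root

if __name__ == "__main__":
    demo = "def f(a, b):\n    c = a + b\n    return c\n"
    print(build_tree(demo).children[0].variables)
```

Keeping positions as indices rather than names is what later allows two functions to be compared even after their variables have been renamed.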


Abstract

The invention discloses a method for checking duplication of programming language code based on tree and sequence similarity. First, the two pieces of program code to be compared are preprocessed: text content such as comments, console output statements, and operators is removed, and the duplicate checking method is determined. Next, a tree is built according to the control structure of the program, and the positions of the variables in each leaf node of the tree are recorded. Then a relative position sequence is established for the variables in each leaf node; on this basis, similar variables are sought between functions, similar leaf nodes are identified, and finally the similarity between the two pieces of code is determined. The method not only removes the influence of irrelevant information on the duplicate checking result but also performs well against variable renaming, relocated functions, and redundant code. A corresponding code duplicate checking system can be developed from the method to improve the efficiency of code duplicate checking; it is particularly effective in computer programming teaching in colleges and universities.
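The abstract does not give the exact formulas, so the comparison stage can only be sketched under assumptions. In the hypothetical version below, each leaf is a dictionary from variable name to a non-empty list of occurrence indices, a variable's relative position sequence is taken to be the gaps between its consecutive occurrences, two variables count as similar when their first position and gaps coincide, and leaves are paired by best match regardless of order. None of these definitions is confirmed by the source.

```python
from collections import Counter

def relative_sequence(positions):
    """Gaps between consecutive occurrences, e.g. [0, 3, 4] -> (3, 1)."""
    return tuple(b - a for a, b in zip(positions, positions[1:]))

def signature(positions):
    """Name-independent fingerprint: first occurrence plus the gap sequence.

    Assumes `positions` is a non-empty, ascending list of indices.
    """
    return (positions[0], relative_sequence(positions))

def leaf_similarity(leaf_a, leaf_b):
    """Share of variables whose signatures can be matched one-to-one between two leaves."""
    if not leaf_a or not leaf_b:
        return 0.0
    sig_a = Counter(signature(p) for p in leaf_a.values())
    sig_b = Counter(signature(p) for p in leaf_b.values())
    matched = sum((sig_a & sig_b).values())   # multiset intersection
    return matched / max(len(leaf_a), len(leaf_b))

def code_similarity(leaves_a, leaves_b):
    """Average, over the leaves of the larger program, of the best match in the other."""
    if not leaves_a or not leaves_b:
        return 0.0
    if len(leaves_a) >= len(leaves_b):
        big, small = leaves_a, leaves_b
    else:
        big, small = leaves_b, leaves_a
    return sum(max(leaf_similarity(x, y) for y in small) for x in big) / len(big)

if __name__ == "__main__":
    original = [{"c": [0, 3], "a": [1], "b": [2]}, {"x": [0]}]
    # Same functions, renamed variables, functions listed in the opposite order.
    suspect = [{"k": [0]}, {"z": [0, 3], "x": [1], "y": [2]}]
    print(code_similarity(original, suspect))
```

Because signatures ignore names and leaves are matched by best score, this toy measure is unaffected by renaming and by reordering functions. Inserted redundant statements would shift the positions, so handling them as the abstract claims would need a more tolerant sequence comparison than exact equality.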

Description

Technical field

[0001] The invention relates to the field of data analysis and processing, in particular to a method for checking duplication of programming language code based on tree and sequence similarity.

Background technique

[0002] Existing methods for checking program code duplication include statistics-based, token-based, tree-based, and graph-based methods. Specifically, statistics-based methods have low detection accuracy: they are too abstract, resist obfuscation poorly, and do not consider the structural characteristics of the program, although their space complexity is low. Token-based methods also have low detection accuracy, which depends mainly on the selection and extraction of tokens; their resistance to obfuscation is low and they struggle with injected redundant code, though they can resist obfuscations such as renaming variables and moving functions; the time and...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F8/75
CPC: G06F8/751
Inventors: 李海波, 孙映川, 林汤权, 童俊成
Owner: HUAQIAO UNIVERSITY