Source code comparison method and system oriented to technical features and program product

A technical-feature-oriented source code comparison technology, applied in the field of natural language processing, that addresses the failure of existing methods to consider the semantic information, call information, and structural semantics of code technical features, with the effect of improving comparison accuracy.

Active Publication Date: 2022-08-09
SHANDONG UNIV

AI Technical Summary

Problems solved by technology

[0004] Code similarity calculation methods based on the call structure compute the similarity between codes mainly from the logical structure of the code, which generally includes the abstract syntax tree and the program call graph. One example is a source code similarity evaluation method based on semantic information at function granularity. It computes embedding vectors for each function's identifier and control flow graph, and so considers the structure and node information of the control flow graph, but it does not consider the semantic information of technical features such as the built-in classes imported by the code.
[0005] Code similarity calculation methods based on static features extract measurement values from the source code to form feature vectors, and then use the similarity between the feature vectors as the code similarity. For example, the invention patent with publication number CN111290784A discloses a program source code similarity detection method suitable for large-scale samples, in which a locality-sensitive hash value is calculated from the text feature sequence and feature weight sequence of each sample to be tested and is used as the sample feature vector. Methods of this type apply to most program code, but they generally do not consider the call information and structural semantics of the code's technical features.
[0006] Binary-based code similarity calculation methods generally disassemble the binary code to obtain the instruction sequence of each function, vectorize the instruction features, and finally calculate the code similarity from the feature vectors. For example, the Chinese patent with publication number CN113554101A discloses a binary code similarity detection method based on deep learning. It uses Structure2Vec to generate a graph embedding of the control flow graph of each binary function and introduces a CNN to process the sequence structure information between the basic blocks of the control flow graph, so as to better capture the order relationship between the blocks inside a function. This method adapts well to cross-architecture and cross-version similarity detection, but it does not consider the semantic information of technical features such as the function names in the code.



Examples


Embodiment 1

[0069] As shown in Figure 1 and Figure 2, a technical-feature-oriented source code comparison method includes:

[0070] A code file preprocessing phase, which outputs the function call graph, function names, and built-in class names of the code;
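The preprocessing phase can be illustrated with a minimal sketch for Python source files, using the standard `ast` module. The source does not say which languages or tools are used, so everything here is an assumption: the helper name `extract_features` is hypothetical, "built-in class" is interpreted narrowly as a type defined in Python's `builtins` module, and callees are identified by bare name only.

```python
import ast
import builtins


def extract_features(source: str):
    """Return (call_edges, called_function_names, builtin_class_names).

    Hypothetical helper: the patent does not specify its extraction tool.
    """
    tree = ast.parse(source)
    # Assumption: a "built-in class" is any type exposed by the builtins module.
    builtin_classes = {n for n, o in vars(builtins).items() if isinstance(o, type)}
    edges, called, classes = set(), [], []

    def callee_name(call):
        func = call.func
        if isinstance(func, ast.Name):        # plain call: foo(...)
            return func.id
        if isinstance(func, ast.Attribute):   # attribute call: obj.foo(...)
            return func.attr
        return None

    for fn in ast.walk(tree):
        if not isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(fn):
            if not isinstance(node, ast.Call):
                continue
            name = callee_name(node)
            if name is None:
                continue
            if name in builtin_classes:
                classes.append(name)
            else:
                called.append(name)
                edges.add((fn.name, name))    # caller -> callee edge
    return edges, called, classes


sample = (
    "def helper(x):\n"
    "    return list(x)\n"
    "\n"
    "def main():\n"
    "    data = helper(range(3))\n"
    "    print(len(data))\n"
)
edges, called, classes = extract_features(sample)
```

In this sample, `main` gets edges to `helper`, `print`, and `len`, while `list` and `range` are collected as built-in classes. A real pipeline would also need to resolve imports and methods, which this sketch ignores.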

[0071] A function call graph semantic encoding phase, which applies an auto-encoder method based on a graph convolutional neural network to the function call graph to obtain the function call structure vector. The graph semantic encoding is implemented according to the following reference: William L. Hamilton, Rex Ying, Jure Leskovec. Inductive representation learning on large graphs[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 1025-1035;
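As a rough illustration of a graph auto-encoder on a call graph, the following sketch runs one forward pass with NumPy. It is not the patented encoder: it assumes an undirected call graph, one-hot node features, untrained random weights, the common symmetric-normalized propagation rule rather than the sampling scheme of the cited reference, an inner-product decoder, and mean pooling to obtain a single structure vector. All function and variable names are hypothetical.

```python
import numpy as np


def normalize_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (assumed propagation rule)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_encode(A, X, W0, W1):
    """Two graph-convolution layers mapping node features X to embeddings Z."""
    A_norm = normalize_adjacency(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)        # first layer with ReLU
    return A_norm @ H @ W1                      # second layer, linear


def decode(Z):
    """Inner-product decoder: edge probabilities sigmoid(Z Z^T)."""
    return 1.0 / (1.0 + np.exp(-(Z @ Z.T)))


rng = np.random.default_rng(0)
# Toy call graph with 4 functions (undirected for simplicity).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.eye(4)                                   # one-hot node features (assumption)
W0 = rng.normal(scale=0.1, size=(4, 8))         # untrained weights, illustration only
W1 = rng.normal(scale=0.1, size=(8, 3))

Z = gcn_encode(A, X, W0, W1)                    # node embeddings, shape (4, 3)
A_rec = decode(Z)                               # reconstructed adjacency, shape (4, 4)
structure_vector = Z.mean(axis=0)               # mean-pooled graph vector (assumption)
```

In training, the weights would be updated so that `A_rec` approaches `A`; only then would `structure_vector` carry useful structural information.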

[0072] A call information encoding phase, in which the function names and built-in class names are encoded with the TF-IDF algorithm to obtain the function vector and the built-in class vector, which are used for similarity comparison;

[0073] Finally, the function call structure vector, func...

Embodiment 2

[0090] According to the technical-feature-oriented source code comparison method described in Embodiment 1, in order for the structural semantic vector to contain rich node and edge information, the reconstructed adjacency matrix should be as similar as possible to the original adjacency matrix. The cross-entropy between the adjacency matrix of the reconstructed graph and the adjacency matrix of the original graph is therefore used as the loss function, calculated as shown in formula (VI):

[0091] [Formula (VI)]

[0092] In formula (VI), N denotes the number of nodes in the function call graph; the remaining two symbols denote, respectively, the elements of the adjacency matrix A of the original graph and the elements of the reconstructed adjacency matrix.
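Because formula (VI) itself is not reproduced here, the following is only a sketch of a standard element-wise binary cross-entropy between an original and a reconstructed adjacency matrix. Averaging over all N×N entries, the clipping constant `eps`, and the function name `reconstruction_loss` are assumptions rather than details taken from the patent.

```python
import math


def reconstruction_loss(A, A_rec, eps=1e-12):
    """Mean binary cross-entropy between adjacency A (0/1) and probabilities A_rec."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Clip probabilities so log() never receives 0.
            p = min(max(A_rec[i][j], eps), 1.0 - eps)
            total -= A[i][j] * math.log(p) + (1 - A[i][j]) * math.log(1.0 - p)
    return total / (n * n)  # normalization over N*N entries is an assumption


A = [[0, 1],
     [1, 0]]
good = [[0.1, 0.9],
        [0.9, 0.1]]   # close to A
bad = [[0.9, 0.1],
       [0.1, 0.9]]    # far from A

loss_good = reconstruction_loss(A, good)
loss_bad = reconstruction_loss(A, bad)
```

A reconstruction close to the original graph yields a lower loss than a poor one, which is the property the auto-encoder training relies on.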

Embodiment 3

[0094] According to the technical-feature-oriented source code comparison method described in Embodiment 1,

[0095] The call information encoding phase for function names and built-in class names includes the following. In the function similarity comparison part, function similarity is compared using the call information vector based on function names. Algorithm code contains a large number of functions that provide basic functionality, each with a clearly defined purpose, input parameters, and return values, so code with similar functionality also calls similar functions. Using the TF-IDF calculation module on the function names, the function call information vector h_f is calculated as shown in formula (VII):

[0096] [Formula (VII)]

[0097] In formula (VII), h_f denotes the function vector; f(fun_i) denotes the TF-IDF value of the i-th called function in the code;
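A function vector of this kind can be sketched in plain Python as one TF-IDF weight per called function name. The exact weighting in formula (VII) is not shown, so the smoothed IDF used below, the helper name `tfidf_vectors`, and the toy call lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each list of called function names to a TF-IDF vector over a shared vocabulary."""
    vocab = sorted({name for doc in docs for name in doc})
    n_docs = len(docs)
    # Document frequency: in how many code files each function name is called.
    df = Counter(name for doc in docs for name in set(doc))
    # Smoothed IDF (assumed variant; the patent's exact formula is not shown).
    idf = {name: math.log((1 + n_docs) / (1 + df[name])) + 1.0 for name in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([counts[name] / total * idf[name] for name in vocab])
    return vocab, vectors


# Hypothetical call lists extracted from two code files.
calls_a = ["conv2d", "relu", "softmax", "relu"]
calls_b = ["lstm", "softmax"]
vocab, (h_a, h_b) = tfidf_vectors([calls_a, calls_b])
```

Function names that appear in only one file, such as `conv2d` or `lstm` here, receive a higher IDF than `softmax`, which both files call, so they contribute more to distinguishing the two codes.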

[0098] In the built-in class similarity comparison, the c...


Abstract

The invention discloses a technical-feature-oriented source code comparison method, system, and program product, belonging to the technical field of natural language processing. The method uses semantic encoding based on the function call structure and analyzes code similarity in terms of the function call structure, function names, built-in classes, and the like. Graph semantic encoding is performed with an auto-encoder method based on a graph convolutional neural network, and code structure semantics are compared using the resulting semantic vectors. Function similarity and built-in class similarity are compared using the call information vectors of the function names and built-in class names. Finally, the structure vector, the function vector, and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity. The method comprehensively considers technical feature information such as function names, call structures, and built-in classes, and so can better compare code according to its technical features.
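The final step, concatenating the three vectors and comparing two codes, can be sketched as follows. The source does not name the similarity measure, so cosine similarity is an assumption here, and the vectors and the helper names `technical_feature_vector` and `cosine` are invented purely for illustration.

```python
import math


def technical_feature_vector(structure_vec, function_vec, class_vec):
    """Concatenate structure, function, and built-in class vectors into one vector."""
    return list(structure_vec) + list(function_vec) + list(class_vec)


def cosine(u, v):
    """Cosine similarity (assumed measure); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up component vectors for three code files.
code_a = technical_feature_vector([0.2, 0.1], [0.5, 0.0, 0.3], [1.0, 0.0])
code_b = technical_feature_vector([0.2, 0.1], [0.5, 0.0, 0.3], [1.0, 0.0])
code_c = technical_feature_vector([0.0, 0.9], [0.0, 0.7, 0.0], [0.0, 1.0])

sim_ab = cosine(code_a, code_b)   # identical features
sim_ac = cosine(code_a, code_c)   # mostly different features
```

Plain concatenation gives each component an influence proportional to its magnitude, so in practice the three vectors might need normalizing or weighting before they are joined.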

Description

Technical field

[0001] The present invention discloses a technical-feature-oriented source code comparison method, system, and program product, belonging to the technical field of natural language processing.

Background

[0002] Open source platforms provide researchers with an environment for sharing and exchanging code. More and more deep learning models and code are shared on open source platforms, creating an ecosystem of reusable code. For a specific problem, researchers therefore need to find related solutions. Modern algorithm design builds code modularly, and such code usually contains a large number of basic functions, whose function names, call structure, and built-in class information provide important code technical features. The same problem can have a variety of solutions; in text classification, for example, different neural network structures can be used: convolutional neural networks, recurrent...


Application Information

IPC(8): G06F8/75, G06F8/41
CPC: G06F8/75, G06F8/436, G06F8/44, Y02D10/00
Inventor: 龚斌, 宁祥东, 孙宇清, 万林
Owner SHANDONG UNIV