Method for calculating similarity of XML documents

A document similarity technology, applied in the database field, that addresses problems such as high time complexity, loss of infrequent paths, and loss of structural information.

Status: Inactive. Publication Date: 2010-11-03

NANKAI UNIV

Cites: 0. Cited by: 19.

AI Technical Summary
Problems solved by technology

There are three classic algorithms based on tree edit distance (Selkow, Chawathe, and Dalamagas), but tree edit distance algorithms generally have high time complexity.

Methods based on frequent paths can calculate document similarity quickly, but they discard all non-frequent paths. This loses a large amount of structural information, so their accuracy is relatively low.

Examples

Embodiment 1

[0055] Embodiment 1: The specific method of constructing the BPC model based on the XML document tree is described as follows:

[0056] 1. According to the present invention, an XML document is defined as an XML document tree, and a BPC model is established for each node on the basis of that tree. Figure 1 shows an XML document and its corresponding XML document tree. Table 1 lists the BPC model of each node, using the document tree in Figure 1 as an example.
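As a rough illustration of the first step only (treating an XML document as a document tree), the following Python sketch parses a small document and lists the label path from the root to each node. The sample document, the function name label_paths, and the choice of root-to-node label paths are assumptions made for illustration. This excerpt does not give the exact definition of the BPC model, so the sketch does not attempt to build it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; not the document from Figure 1.
SAMPLE = "<book><title/><author><name/></author></book>"


def label_paths(xml_text):
    """Return the root-to-node label path of every element, in document order."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        path = prefix + [node.tag]   # labels from the root down to this node
        paths.append(path)
        for child in node:           # iterate over child elements
            walk(child, path)

    walk(root, [])
    return paths


if __name__ == "__main__":
    for p in label_paths(SAMPLE):
        print("/".join(p))
```

Each printed line corresponds to one node of the document tree; a per-node constraint model such as BPC would be attached to these nodes.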

Embodiment 2

[0057] Embodiment 2: A specific method for calculating document similarity based on the N-Gram idea is described as follows:

[0058] Algorithm 1. CreateGram: generate an (i+1)-Gram item from two adjacent i-Gram items

[0059] Input: item_1, item_2 /* two adjacent i-Gram items represented by positive integers */

[0060] t /* radix t */

[0061] Output: item /* (i+1)-Gram item represented by a positive integer */

[0062] ① item := item_1 × t + item_2 % t;

[0063] ② RETURN item;

[0064] ③ Algorithm ends

[0065] The algorithm generates an (i+1)-Gram item from two adjacent i-Gram items. The radix t is the total number of distinct labels in the two path constraints being compared, plus 1. Introducing the radix t guarantees that, within the same path constraint, the integer range containing the i-Gram items and the integer range containing the j-Gram items do not intersect whenever i≠j.
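The arithmetic of Algorithm 1 can be sketched in Python as follows. The function create_gram mirrors step ① directly. The helpers encode_labels and grams_up_to are hypothetical names introduced here, and they assume that each distinct label is numbered 1 to t-1, so that an i-Gram item reads as an i-digit number in radix t with no zero digits. That numbering is an assumption consistent with the definition of t, not something this excerpt states.

```python
def create_gram(item1: int, item2: int, t: int) -> int:
    """Algorithm 1: combine two adjacent i-Gram items into one (i+1)-Gram item."""
    # item1 holds labels d1..di, item2 holds d2..d(i+1); item2 % t is d(i+1).
    return item1 * t + item2 % t


def encode_labels(path1, path2):
    """Assumed encoding: number the distinct labels 1..t-1 and return (codes, t)."""
    labels = sorted(set(path1) | set(path2))
    codes = {label: k for k, label in enumerate(labels, start=1)}
    return codes, len(labels) + 1   # t = number of distinct labels + 1


def grams_up_to(path, codes, t, n):
    """Return [1-Gram items, 2-Gram items, ...] for one path, up to n-Grams."""
    level = [codes[label] for label in path]
    levels = [level]
    for _ in range(1, n):
        # Pair each item with its right neighbour to build the next level.
        level = [create_gram(a, b, t) for a, b in zip(level, level[1:])]
        if not level:               # path too short for longer grams
            break
        levels.append(level)
    return levels


if __name__ == "__main__":
    p1, p2 = ["a", "b", "c"], ["a", "c", "d"]
    codes, t = encode_labels(p1, p2)
    print(t, grams_up_to(p1, codes, t, 3))
```

With four distinct labels, t is 5. Under this encoding the 1-Gram items fall in 1..4, the 2-Gram items in 6..24, and the 3-Gram items in 31..124, which illustrates the non-intersection property described above.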

[0066] Algorithm 2. The...

Abstract

The invention belongs to the technical field of databases and establishes an XML document constraint model called the bidirectional path constraint model. Based on this model, the invention discloses a new method for calculating the similarity of XML documents. The bidirectional path constraint of each node extracts the structural information of an XML document more completely, so the similarity of XML documents is measured more accurately. The mature N-Gram idea from the field of natural language processing is introduced, and an N-Gram-based partitioning scheme is applied to the similarity calculation of path constraints. The extraction and manipulation of N-Gram information are simplified by representing items as positive integers with weights. The method can be used for XML document classification, clustering, schema extraction, and similar tasks.

Description

Technical field

[0001] The invention belongs to the technical field of databases, and in particular relates to a method for calculating the similarity of XML documents.

Background art

[0002] Extensible Markup Language (XML) has become the standard format for representing and exchanging data on the Web. With the promotion and adoption of XML-related standards, many industries use XML as a meta-language to define their own domain-specific sub-languages for storing and sharing domain data. As a result, large numbers of XML documents continue to emerge in various fields, and mining knowledge from them has become an urgent problem. XML data mining is an important application of knowledge discovery technology, and similarity calculation plays a fundamental role in it.

[0003] XML document mining is divided into content mining and structure mining, which can be used for XML data ext...

Application Information

IPC(8): G06F17/30

Inventors: 汪陈应, 袁晓洁, 廉鑫, 林伟坚

Owner: NANKAI UNIV
