
Method for calculating similarity of XML documents

An XML document similarity calculation technology, applied in the database field, that addresses problems of existing methods such as high time complexity, loss of infrequent paths, and loss of structural information.

Status: Inactive
Publication Date: 2010-11-03
NANKAI UNIV

AI Technical Summary

Problems solved by technology

There are three classic algorithms based on tree edit distance (Selkow, Chawathe and Dalamagas), but the time complexity of tree edit distance algorithms is generally high.
Methods based on frequent paths can calculate document similarity quickly, but they discard all non-frequent paths, losing a large amount of structural information, so their accuracy is relatively low.



Examples


Embodiment 1

[0055] Embodiment 1: The specific method of constructing the BPC model based on the XML document tree is described as follows:

[0056] 1. According to the present invention, an XML document is defined as an XML document tree, and a BPC model is established for each node on the basis of the document tree. Figure 1 shows an XML document and its corresponding XML document tree; Table 1 lists the BPC model of each node, taking the document tree in Figure 1 as an example.
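Figure 1 and Table 1 are not reproduced in this excerpt, and the exact form of the BPC model is not given here, so the following is only a minimal sketch of the first step: turning an XML document into a document tree and listing the label path from the root to each node. The function name `label_paths` and the sample document are hypothetical and are not taken from the patent.

```python
import xml.etree.ElementTree as ET


def label_paths(xml_text):
    """Parse an XML string and return the root-to-node label path of every element.

    Illustration only: this shows the document-tree view of an XML document,
    not the BPC model itself, which this excerpt does not define.
    """
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        # Extend the path with this node's label, record it, then visit children
        path = prefix + [node.tag]
        paths.append("/".join(path))
        for child in node:
            walk(child, path)

    walk(root, [])
    return paths


# Hypothetical sample document (not the one shown in Figure 1)
doc = "<book><title>XML</title><author><name>A</name></author></book>"
paths = label_paths(doc)
```

Each entry of `paths` describes one node of the document tree by the labels on the way down from the root, which is the kind of structural information a per-node constraint model can be built from.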

Embodiment 2

[0057] Embodiment 2: A specific method for calculating document similarity based on the N-Gram idea is described as follows:

[0058] Algorithm 1. The method CreateGram, which generates an (i+1)-Gram item from two adjacent i-Gram items

[0059] Input: item1, item2 /* two adjacent i-Gram items represented by positive integers */

[0060] t /* radix t */

[0061] Output: item /* (i+1)-Gram item represented by a positive integer */

[0062] ① item := item1 × t + item2 % t;

[0063] ② RETURN item;

[0064] ③ Algorithm ends

[0065] The algorithm generates an (i+1)-Gram item from two adjacent i-Gram items. The base t in the algorithm is the total number of different labels in the two path constraints to be compared, plus 1. Introducing the base t guarantees that, for the same path constraint, when i≠j the integer range containing the i-Gram items and the integer range containing the j-Gram items do not intersect.
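The arithmetic of Algorithm 1 can be sketched in Python as follows. `create_gram` follows the formula in step ①. The helper `grams_of_path` and the encoding of labels as the integers 1 to t-1 are assumptions made for illustration; the excerpt states only that items are positive integers and does not give the encoding.

```python
def create_gram(item1: int, item2: int, t: int) -> int:
    """Combine two adjacent i-Gram items into one (i+1)-Gram item (step 1 of Algorithm 1).

    item1 * t shifts item1 one base-t digit to the left; item2 % t is the
    last base-t digit of item2, i.e. the one label item1 does not cover.
    """
    return item1 * t + item2 % t


def grams_of_path(labels, codes):
    """Hypothetical helper: build 1-Gram, 2-Gram, ... items for one label path.

    Assumes each label is encoded as an integer in 1..t-1, where
    t = number of distinct labels + 1 (the radix described in the text).
    """
    t = len(codes) + 1
    level = [codes[label] for label in labels]  # 1-Gram items
    levels = [level]
    while len(level) > 1:
        # Each pair of adjacent i-Gram items yields one (i+1)-Gram item
        level = [create_gram(a, b, t) for a, b in zip(level, level[1:])]
        levels.append(level)
    return levels


# Three distinct labels, so t = 4 (assumed encoding, for illustration only)
codes = {"book": 1, "title": 2, "author": 3}
levels = grams_of_path(["book", "author", "title"], codes)
# levels == [[1, 3, 2], [7, 14], [30]]
```

Under this assumed encoding every i-Gram item is an i-digit base-t number with a non-zero leading digit, so it lies in the range from t^(i-1) to t^i - 1. This is why items of different lengths cannot collide: in the example the 1-Gram items fall in 1..3, the 2-Gram items in 4..15, and the 3-Gram item in 16..63.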

[0066] Algorithm 2. The...


Abstract

The invention belongs to the technical field of databases and establishes an XML document constraint model called the bidirectional path constraint model. Based on this model, the invention discloses a new method for calculating the similarity of XML documents. The bidirectional path constraint of each node extracts the structural information of XML documents more completely, so that the similarity of XML documents is measured more accurately. The N-Gram idea, which is well established in the field of natural language processing, is introduced, and an N-Gram-based partitioning scheme is applied to the similarity calculation of path constraints. The extraction and manipulation of N-Gram information are simplified by representing items as positive integers with weights. The method can be used for XML document classification, clustering, schema extraction and similar tasks.

Description

【Technical field】

[0001] The invention belongs to the technical field of databases, and in particular relates to a method for calculating the similarity of XML documents.

【Background art】

[0002] Extensible Markup Language (XML) has become the standard format for representing and exchanging data on the Web. With the promotion and adoption of XML-related standards, many industries use XML as a meta-language to define their own domain-specific sub-languages for storing and sharing the data of their domain. As a result, large numbers of XML documents continue to emerge in various fields, and mining knowledge from them has become an urgent problem. XML data mining is an important application of knowledge discovery technology, and similarity calculation plays a fundamental role in XML data mining.

[0003] XML document mining is divided into content mining and structure mining, which can be used for XML data ext...


Application Information

IPC(8): G06F17/30
Inventor: 汪陈应, 袁晓洁, 廉鑫, 林伟坚
Owner NANKAI UNIV