An XML big data clustering integration method for parallel AP propagation

An integration method in the field of big data technology, applied to electrical digital data processing, special data processing applications, semi-structured data query, and related areas. It addresses the problems of data noise, many outliers, fast generation speed, and huge data volume, with the effects of eliminating ambiguity problems, widening the differences between clustering members, and improving clustering performance.

Active Publication Date: 2017-05-17
西安蓝雪信息技术有限公司

AI Technical Summary

Problems solved by technology

[0002] At present, XML big data, like other types of big data, is characterized by large volume, complex structure, fast generation speed, and high value at low density, with data volumes ranging from MB through GB, TB, and PB up to ZB. In addition, the data is non-convex and very unevenly distributed, contains much noise and many outliers, and often appears on the Web in the form of data streams. For such fast-changing, highly time-sensitive XML big data, traditional clustering integration algorithms show obvious deficiencies on large XML data sets, mainly: (1) large storage footprint, slow prediction speed, and poor prediction results; (2) online machine learning is difficult, being effective for small-scale data but poor for large-scale data; (3) poor dynamic and real-time behavior, so streaming data cannot be processed; (4) lacking prior knowledge, the algorithm cannot accurately capture the global characteristics of the XML data distribution, so clustering accuracy and clustering results ultimately fail to meet requirements.

Examples

Embodiment 1

[0035] Step 1: Preprocess each piece of XML big data by cleaning, dividing, and extracting. After cleaning, extract all nodes and their node subsets from the data using a division method that combines scale and content, and compute the frequency of each node subset in its data. According to node frequency, place nodes and their descendants that belong to the same subject content in the same subset as far as possible, and place nodes with different subject content in different subsets. Then extract n subtrees from the divided subsets according to keyword frequency, find all paths from the root node to the leaf nodes of each extracted subtree, and use these paths as the input source for disambiguation. Ambiguous words are disambiguated, and the semantic relevance and contextual semantic similarity of each keyword are obtained;

[0036] Its similarity is obtained as follows: Assume that n subtree sets D'=(...
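
The path-finding part of Step 1 (all paths from the root node to the leaf nodes of an extracted subtree) can be illustrated with a minimal sketch. It assumes the subtree is available as a parsed XML element and represents each path as a "/"-joined string of tag names; the function name `root_to_leaf_paths` and the sample document are hypothetical and do not come from the patent, which does not state how paths are encoded.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(root):
    """Return every root-to-leaf tag path of an XML (sub)tree, in document order."""
    paths = []
    # Iterative depth-first traversal; each stack entry carries the path so far.
    stack = [(root, [root.tag])]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            # A node without child elements is a leaf: record the full path.
            paths.append("/".join(path))
        # Push children in reverse so the first child is visited first.
        for child in reversed(children):
            stack.append((child, path + [child.tag]))
    return paths


# Hypothetical sample subtree, used only for illustration.
doc = ET.fromstring("<book><title/><author><name/><email/></author></book>")
paths = root_to_leaf_paths(doc)
```

Each resulting path string could then serve as one input item for the disambiguation stage described above.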

Abstract

The invention discloses an XML big data clustering integration method for parallel AP propagation. The method comprises: preprocessing each piece of XML big data by cleaning, dividing, and extracting; treating all keywords in an extracted subtree as the feature description of a data point; applying the basic idea of clustering integration together with the idea of decomposing the large similarity matrix; and performing the final clustering integration. In this method, a random subspace classifier is established and subtrees are randomly selected in parallel, which widens the differences between clustering members and improves clustering performance. Disambiguation is introduced to solve the ambiguity problem caused by inconsistent semantic context and content across subtrees, and semantic similarity is combined with path similarity to eliminate the influence of inaccurate XML document similarity calculation on the initial clustering result. System capacity theory is used to improve the iterative updates of the attribution and absorption matrices of the AP algorithm (the availability and responsibility matrices in standard AP terminology), thereby achieving clustering integration of XML big data and improving the efficiency of the clustering integration method.
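
For orientation, the message passing that the AP algorithm iterates can be sketched as follows. This is the standard, unmodified affinity propagation update of the responsibility and availability matrices, not the improved iteration based on system capacity theory that the abstract refers to, whose details are not given here. The function name, the damping value, the toy data, and the use of negative squared distance as the similarity are all assumptions made for illustration.

```python
import numpy as np


def affinity_propagation(S, damping=0.7, max_iter=200):
    """Standard AP message passing on a square similarity matrix S.

    S[i, k] is the similarity of point i to candidate exemplar k; the
    diagonal S[k, k] holds the preference of k to become an exemplar.
    """
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))  # responsibility matrix
    A = np.zeros((n, n))  # availability matrix

    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) = sum over i' != k of max(0, r(i',k))
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new

    # Points whose self-responsibility plus self-availability is positive
    # are taken as exemplars; every other point joins its most similar one.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = exemplars[S[:, exemplars].argmax(axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels


# Toy data (assumption): two well-separated groups of 1-D points.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
S = -(x[:, None] - x[None, :]) ** 2
off_diag = S[~np.eye(len(x), dtype=bool)]
np.fill_diagonal(S, np.median(off_diag))  # common default preference
exemplars, labels = affinity_propagation(S)
```

In the method described here, the similarity matrix would instead come from the combined semantic and path similarity of the extracted subtrees, and several such runs over randomly selected subtrees would supply the clustering members to be integrated.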

Description

Technical field

[0001] The invention belongs to the field of big data integration methods, and in particular relates to an XML big data clustering integration method for parallel AP propagation.

Background

[0002] At present, XML big data, like other types of big data, is characterized by large volume, complex structure, fast generation speed, and high value at low density, with data volumes ranging from MB through GB, TB, and PB up to ZB. In addition, the data is non-convex and very unevenly distributed, contains much noise and many outliers, and often appears on the Web in the form of data streams. For such fast-changing, highly time-sensitive XML big data, traditional clustering integration algorithms show obvious deficiencies on large XML data sets, mainly: (1) large storage footprint, slow prediction speed, and poor prediction results; (2) ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30, G06F9/44
CPC: G06F16/83
Inventor: 蒋勇
Owner: 西安蓝雪信息技术有限公司