Method for dealing with online connection of skewed data streams

A processing method for data streams, applied in the field of data processing, that addresses low model efficiency and the inability to dynamically allocate query nodes, improving throughput and reducing computing costs.

Active Publication Date: 2017-11-10
RENMIN UNIVERSITY OF CHINA

AI Technical Summary

Problems solved by technology

The join model based on a complete bipartite graph cannot dynamically allocate query nodes, and its data-grouping parameters must be set manually.
For full-history join queries over skewed data in particular, the model is less efficient.


Examples

Embodiment

[0053] Query tasks. Three query tasks are used in total: two are the equi-joins Q3 and Q5 provided by TPC-H, and one is a band (range) join query. The band query is:

[0054] SELECT * FROM LINEITEM L1, LINEITEM L2

[0055] WHERE ABS(L1.orderkey - L2.orderkey) <= 1

[0056] AND (L1.shipmode = 'TRUCK' AND L2.shipinstruct = 'NONE')

[0057] AND L1.quantity > 48
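
The band predicate can be illustrated with a minimal in-memory sketch. The `LineItem` class, the sample rows, and the nested-loop strategy are assumptions made for illustration only; they are not the patent's streaming implementation, and only the four columns used by the query are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    """Hypothetical stand-in for a LINEITEM row (only the columns the query uses)."""
    orderkey: int
    shipmode: str
    shipinstruct: str
    quantity: int


def band_match(l1: LineItem, l2: LineItem) -> bool:
    """Return True when the pair (l1, l2) satisfies the band query's WHERE clause."""
    return (
        abs(l1.orderkey - l2.orderkey) <= 1
        and l1.shipmode == "TRUCK"
        and l2.shipinstruct == "NONE"
        and l1.quantity > 48
    )


def band_join(rows):
    """Naive nested-loop self-join; adequate for illustration, not for streams."""
    return [(a, b) for a in rows for b in rows if band_match(a, b)]


# Made-up sample rows, not TPC-H data.
rows = [
    LineItem(1, "TRUCK", "NONE", 50),
    LineItem(2, "MAIL", "NONE", 10),
    LineItem(5, "TRUCK", "COLLECT COD", 49),
]
pairs = band_join(rows)
matched_keys = [(a.orderkey, b.orderkey) for a, b in pairs]
```

Unlike an equi-join, a tuple here can match neighbours whose keys differ by one, so routing purely by key hash would not by itself bring all matching pairs together.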

[0058] Comparison models. Three algorithms are compared on query performance: DB, JB and JB6. DB is the algorithm proposed by this invention, a dynamic bipartite graph join model. JB distributes the cluster nodes evenly across the two sides of the bipartite graph. JB6 does the same and then divides the nodes within each side into 6 subgroups for random routing.
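
The node layouts of JB and JB6 can be sketched roughly as follows. This is only an illustration under assumptions: the function names are hypothetical, the cluster size of 24 is made up, and choosing the subgroup by key hash before picking a random node inside it is an assumed reading of "random routing" that the text does not spell out.

```python
import random


def split_sides(nodes):
    """JB-style layout: divide the cluster nodes evenly between the two sides."""
    half = len(nodes) // 2
    return nodes[:half], nodes[half:]


def make_subgroups(side, k):
    """JB6-style layout (k = 6): divide one side's nodes into k subgroups."""
    return [side[i::k] for i in range(k)]


def route_store(key, subgroups, rng):
    """Assumed routing: choose a subgroup by key hash, then a random node in it."""
    group = subgroups[hash(key) % len(subgroups)]
    return rng.choice(group)


nodes = list(range(24))              # hypothetical cluster of 24 node ids
r_side, s_side = split_sides(nodes)  # 12 nodes per side
r_groups = make_subgroups(r_side, 6)
rng = random.Random(0)               # seeded only to make the sketch repeatable
node = route_store(42, r_groups, rng)
```

With integer keys, `hash(key)` is the key itself, so key 42 always lands in subgroup 0 and only the node within that subgroup varies.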

[0059] Using 10GB of data with z=1, we compared the throughput and latency of the three models for ...


Abstract

The invention relates to a method for processing online joins over skewed data streams. The method comprises the following steps: partitioning the tuples of a data stream R and a data stream S with a hash function on the key value, assigning the tuples to different nodes on the same side for storage, and simultaneously sending them to the processing units on the other side to perform the online join; periodically, at a preset time interval, monitoring the load statistics of the nodes on each side of the bipartite graph join model, and collecting and sending these statistics to a pre-built data stream controller; if the data stream controller detects that some processing units exceed the critical value of the load balancing factor, dynamically formulating a migration strategy according to a heuristic rule; before data migration, temporarily storing newly arriving data in Kafka and suspending the join of the new data; then migrating the data streams and the join state information according to the migration strategy, and updating the routing table synchronously; and finally resuming transmission of the data temporarily stored in Kafka together with the new data, and completing the subsequent online join operations.
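
The first two steps of the abstract (hash partitioning with cross-side probing, and load monitoring) can be sketched as below. This is a single-process illustration under stated assumptions, not the patented system: the class and function names are hypothetical, the load balancing factor is assumed to be a unit's load divided by the mean load, and the migration heuristic, Kafka buffering, and routing-table update are omitted because the abstract does not give their details.

```python
from collections import defaultdict


class Side:
    """One side of the bipartite model; stores the tuples of a single stream."""

    def __init__(self, num_units):
        self.units = [defaultdict(list) for _ in range(num_units)]

    def unit_for(self, key):
        # Hash partitioning on the join key (integer keys keep this deterministic).
        return hash(key) % len(self.units)

    def store(self, key, value):
        self.units[self.unit_for(key)][key].append(value)

    def probe(self, key):
        return list(self.units[self.unit_for(key)].get(key, []))

    def loads(self):
        """Number of stored tuples per unit, as a simple load statistic."""
        return [sum(len(v) for v in unit.values()) for unit in self.units]


class BipartiteJoin:
    """Equi-join of streams R and S: store on own side, probe the other side."""

    def __init__(self, units_per_side):
        self.sides = {"R": Side(units_per_side), "S": Side(units_per_side)}

    def on_tuple(self, stream, key, value):
        other = "S" if stream == "R" else "R"
        matches = self.sides[other].probe(key)
        self.sides[stream].store(key, value)
        if stream == "R":
            return [(value, m) for m in matches]
        return [(m, value) for m in matches]


def overloaded_units(loads, threshold):
    """Assumed definition: flag units whose load / mean load exceeds threshold."""
    mean = sum(loads) / len(loads)
    if mean == 0:
        return []
    return [i for i, load in enumerate(loads) if load / mean > threshold]


join = BipartiteJoin(units_per_side=4)
results = []
for stream, key, value in [("R", 7, "r1"), ("S", 7, "s1"), ("R", 7, "r2"), ("R", 3, "r3")]:
    results += join.on_tuple(stream, key, value)

r_loads = join.sides["R"].loads()
hot = overloaded_units(r_loads, threshold=2.0)
```

In this toy run keys 7 and 3 both hash to unit 3 on the R side, so that unit holds every stored R tuple and is flagged. This is the kind of skew that the controller in the abstract is meant to detect before formulating a migration strategy.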

Description

Technical field

[0001] The invention relates to a data processing method, in particular to a method for processing online joins over skewed data streams.

Background

[0002] Generally, a join model based on a complete bipartite graph can support join operations over distributed data streams. The model is memory-efficient and easy to scale and extend. However, it cannot dynamically allocate query nodes, and its data-grouping parameters must be set manually. For full-history join queries over skewed data in particular, the model is less efficient.

Summary of the invention

[0003] In view of the above problems, the object of the present invention is to provide a method for processing online joins over skewed data streams, which can effectively handle join operations on skewed data and further improve the throughput of the distributed data stream management sy...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/24542, G06F16/24549, G06F16/2456, G06F16/24568, G06F16/27
Inventors: 孟小峰, 王春凯
Owner: RENMIN UNIVERSITY OF CHINA