Method and system for implementing partition load balancing in Spark environment

A partition load balancing technique for Spark environments, applicable to special data processing applications, database distribution/replication, structured data retrieval, etc. It addresses the long processing time of Spark applications, achieves a uniform distribution of partitioned data and high execution efficiency, and has low implementation complexity.

Active Publication Date: 2020-04-03
HUNAN UNIV

AI Technical Summary

Problems solved by technology

[0005] In view of the above defects of, or needs for improvement in, the prior art, the present invention provides a method and system for implementing partition load balancing in a Spark environment. Its purpose is to solve the technical problem that, when data skew occurs, the existing hash-based partitioning method causes excessive processing time for the entire Spark application.
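The skew problem described here can be illustrated with a small, self-contained sketch. It is not taken from the patent: the workload, the key names, and the use of `zlib.crc32` as a deterministic stand-in for a hash partitioner are all assumptions made for illustration.

```python
import zlib

def hash_partition(key: str, num_partitions: int) -> int:
    # Deterministic stand-in for a hash partitioner (assumption: crc32,
    # not the hash function Spark itself uses).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_loads(keys, num_partitions):
    # Count how many records land in each partition.
    loads = [0] * num_partitions
    for key in keys:
        loads[hash_partition(key, num_partitions)] += 1
    return loads

# Hypothetical skewed workload: one hot key plus many rare keys.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
loads = partition_loads(keys, 4)

# Ratio of the heaviest partition to the average partition load.
skew = max(loads) / (sum(loads) / len(loads))
print(loads, round(skew, 2))
```

Because every record with the same key hashes to the same partition, the partition holding the hot key receives at least 9000 of the 10000 records, and the task processing it determines the running time of the whole stage.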



Embodiment Construction

[0056] In order to make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

[0057] The basic idea of the present invention is to estimate the Map-side data distribution during the Spark shuffle process by means of an optimized, step-size-based acceptance-rejection sampling, so as to obtain a more accurate Map-side data distribution, and then to generate, according to the sampling rate and the data distribution, a repartitioning strategy for the intermediate data on the Map side and the Reduce...
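As a rough illustration of turning a sampled key distribution into a repartitioning strategy, the sketch below scales sampled counts by the sampling rate and then assigns keys to reducers greedily, heaviest key first. The greedy assignment is a stand-in chosen for illustration; the excerpt does not state which strategy the invention actually uses, and the function names and sample values are hypothetical.

```python
import heapq

def estimate_counts(sampled_counts, sampling_rate):
    # Scale sampled key counts up to estimated full-data counts.
    return {key: count / sampling_rate for key, count in sampled_counts.items()}

def greedy_assignment(estimated, num_reducers):
    # Min-heap of (current load, reducer id); the lightest reducer is on top.
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Place the heaviest keys first, each on the currently lightest reducer.
    for key, weight in sorted(estimated.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + weight, reducer))
    # Recompute the final per-reducer loads from the assignment.
    loads = [0.0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += estimated[key]
    return assignment, loads

# Hypothetical sampled key counts obtained at a 25% sampling rate.
sampled = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 10}
estimated = estimate_counts(sampled, 0.25)
assignment, loads = greedy_assignment(estimated, 2)
print(assignment, loads)
```

This sketch keeps every key on a single reducer, so a key heavier than the average load still unbalances the result; handling such keys would require a strategy beyond what is shown here.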

Abstract

The invention discloses a method for implementing partition load balancing in a Spark environment. The method comprises the steps of: receiving a Spark application sent by a user, and analyzing the Spark application to obtain an analysis result, which includes an RDD graph representing the relationships among the plurality of resilient distributed datasets (RDDs) and a directed acyclic graph (DAG) describing the scheduling stages; sequentially determining the dependency relationship between every two adjacent scheduling stages according to the DAG; numbering the wide dependencies among all the obtained dependency relationships; setting a counter cnt = 1, and judging whether cnt is greater than the total number of wide dependencies; and if not, sampling the data in all partitions of the last RDD on the Map side corresponding to the cnt-th wide dependency to obtain, for each partition, a hash table representing its data key distribution, and merging the hash tables obtained for all partitions. The method can solve the problem of data skew in big data computation while optimizing the allocation of computing resources and shortening program running time.
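The sampling and merging step of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: partitions are modeled as plain lists of (key, value) records, and fixed-step systematic sampling stands in for the invention's optimized step-size-based acceptance-rejection sampling, whose details are not given in this excerpt. The function names are hypothetical.

```python
from collections import Counter
from itertools import islice

def sample_partition(records, step):
    # Take every `step`-th (key, value) record and count its key
    # in a per-partition hash table.
    return Counter(key for key, _ in islice(records, 0, None, step))

def merge_tables(tables):
    # Merge the per-partition hash tables into one global key distribution.
    merged = Counter()
    for table in tables:
        merged.update(table)
    return merged

# Hypothetical Map-side partitions of the last RDD before a wide dependency.
partitions = [
    [("a", 1)] * 8 + [("b", 1)] * 4,
    [("a", 1)] * 6 + [("c", 1)] * 6,
]
tables = [sample_partition(p, step=2) for p in partitions]
merged = merge_tables(tables)
print(merged)
```

The merged table approximates the global key distribution at half the cost of a full scan; multiplying each count by the step gives an estimate of the true per-key record count.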

Description

Technical field

[0001] The invention belongs to the field of big data and distributed parallel computing, and relates to a method and system for implementing partition load balancing in a Spark environment.

Background

[0002] With the rapid development of the Internet, people's daily activities generate large amounts of data, and both the total volume of data and its growth rate are increasing day by day. Single-machine computing can no longer meet the demands of such data, which led to the MapReduce programming model. MapReduce is a software framework for processing massive data in parallel in a reliable, fault-tolerant manner. Apache Spark is a fast, general-purpose engine for large-scale data processing based on the MapReduce model. It achieves high performance in both batch and stream processing, using a state-of-the-art directed acyclic graph (DAG) scheduler, a query optimizer, and a physical execution engine....

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/27, G06F16/22
CPC: G06F16/2255, G06F16/278
Inventor: 唐卓, 刘翔, 李肯立, 杜利凡, 贺凯林, 李文, 张学东, 阳王东, 周旭, 刘楚波, 曹嵘晖
Owner: HUNAN UNIV