Load balancing method for processing MapReduce data skew

A load balancing technique for data skew, applied in electrical digital data processing, resource allocation, and program control design, that addresses problems such as high energy consumption, long completion times, and the inability to minimize the completion time of online job sets.

Inactive Publication Date: 2017-05-17
田文洪 +5

AI Technical Summary

Problems solved by technology

As the number of users of MapReduce cluster systems has grown, the Capacity Scheduler and the Hadoop Fair Scheduler (HFS) have made cluster sharing more efficient, but existing schedulers do not support minimizing the completion time of online job sets. When online jobs are submitted as a job set, the completion time may be longer, resulting in higher total energy consumption.



Examples


Detailed Description of the Embodiments

[0043] Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but are not intended to limit its scope.

[0044] As shown in Figure 2, an embodiment of the present invention provides a load balancing method for handling MapReduce data skew. The method includes the following steps:

[0045] S101. Sample and analyze the input data to determine the average number of tasks on each Reduce node.

[0046] S102. Sort the tasks in descending order of task number, based on the time coefficient; where the numbers are equal, sort by serial number.

[0047] S103. Allocate the tasks in sorted order, following the principle of largest remaining resource capacity, until all tasks are allocated.
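Steps S101 to S103 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not define the "time coefficient" or the "number of tasks" precisely, so the sketch treats each key group as one task, uses its estimated record count multiplied by a per-key time coefficient as its cost, takes the serial number to be the order in which keys were first seen, and gives every Reduce node an initial capacity equal to the average load. All function and parameter names are hypothetical.

```python
import random
from collections import Counter


def sample_key_counts(keys, sample_rate, seed=0):
    """S101 (assumed form): estimate per-key record counts from a random sample."""
    rng = random.Random(seed)
    sample = [k for k in keys if rng.random() < sample_rate]
    # Scale the sampled counts back up to the full input size.
    return {k: c / sample_rate for k, c in Counter(sample).items()}


def average_load(counts, num_reducers):
    """S101: average amount of work per Reduce node."""
    return sum(counts.values()) / num_reducers


def sort_tasks(counts, time_coeff=None):
    """S102: descending by count * time coefficient, ties broken by serial number.

    The serial number is assumed to be the key's insertion order in `counts`;
    a missing time coefficient is assumed to be 1.0.
    """
    time_coeff = time_coeff or {}
    tasks = [
        (serial, key, counts[key] * time_coeff.get(key, 1.0))
        for serial, key in enumerate(counts)
    ]
    return sorted(tasks, key=lambda t: (-t[2], t[0]))


def allocate(tasks, num_reducers, capacity):
    """S103: give each task to the reducer with the largest remaining capacity."""
    remaining = [capacity] * num_reducers
    assignment = {}
    for _serial, key, cost in tasks:
        # max() returns the first maximal element, so ties go to the lowest index.
        target = max(range(num_reducers), key=lambda r: remaining[r])
        assignment[key] = target
        remaining[target] -= cost
    return assignment, remaining


if __name__ == "__main__":
    counts = {"a": 6, "b": 3, "c": 3, "d": 2, "e": 2}
    capacity = average_load(counts, num_reducers=2)
    assignment, remaining = allocate(sort_tasks(counts), 2, capacity)
    print(assignment, remaining)
```

Processing the largest tasks first and always filling the emptiest node is a common greedy heuristic for this kind of balancing; remaining capacity may go negative when the tasks cannot be divided evenly.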

[00...


Abstract

An embodiment of the invention discloses a load balancing method for handling MapReduce data skew, in the field of cluster scheduling and load balancing. As large-scale MapReduce clusters are widely used to process big data, one of the main current problems is how to minimize job completion time and improve MapReduce service efficiency. Prior MapReduce research has paid little attention to data balancing, so a Reduce-side load balancing algorithm is provided to address data skew during MapReduce execution. The method includes the steps of: sampling and analyzing the input data to determine the average number of tasks for each Reduce node; sorting the tasks in descending order of task number, based on a time coefficient, and by sequence number where the numbers are equal; assigning the tasks in sorted order according to the principle of largest remaining resource capacity until all tasks are assigned; and submitting the resulting assignment to a custom Partition function and executing the job.
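The final step, handing the assignment to a custom Partition function, might look like the sketch below. It is an illustration only: the source does not give the function's form, so `make_partitioner` and its fallback behavior (hashing any key that was not seen during sampling) are assumptions. In Hadoop itself the corresponding extension point is the Java `Partitioner` class with its `getPartition(key, value, numPartitions)` method; Python is used here purely to show the lookup logic.

```python
import zlib


def make_partitioner(assignment, num_reducers):
    """Build a partition function from a precomputed key -> reducer mapping.

    `assignment` is assumed to come from an earlier balancing step; keys that
    were not seen during sampling fall back to a stable hash (an assumption).
    """
    def partition(key):
        if key in assignment:
            return assignment[key]
        # zlib.crc32 is deterministic across runs, unlike the built-in hash().
        return zlib.crc32(key.encode("utf-8")) % num_reducers

    return partition


if __name__ == "__main__":
    partition = make_partitioner({"a": 0, "b": 1}, num_reducers=2)
    print(partition("a"), partition("b"), partition("unseen-key"))
```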

Description

Technical field

[0001] The invention relates to the technical field of online cluster scheduling, and in particular to a load balancing method and device for handling task data skew in Hadoop clusters.

Background

[0002] Hadoop is a software framework for distributed processing of large amounts of data in a reliable, efficient, and scalable manner. A Hadoop cluster deployment is divided into three main parts: client machines, master nodes, and slave nodes, as shown in Figure 1. Data storage (the Hadoop Distributed File System, HDFS) and the supervision of parallel computation (MapReduce) over that data are the two key functional modules of Hadoop, and the master nodes are mainly responsible for both. HDFS adopts a master-slave (Master/Slave) structure: an HDFS cluster is composed of one name node (NameNode) and several data nodes (DataNode). The MapReduce framework is co...


Application Information

IPC(8): G06F9/50
CPC: G06F9/5088, G06F2209/503
Inventor: 田文洪, 李国忠
Owner: 田文洪