Data transmission method and device for Shuffle process

A data transmission method and device for the Shuffle process, applied in the computer field. It addresses the heavy consumption of computing resources and time caused by re-executing the Shuffle process, as well as overly long read times, so as to decouple computation from storage, avoid re-execution of the Shuffle process, and improve stability and performance.

Pending Publication Date: 2021-12-07
BEIJING WODONG TIANJUN INFORMATION TECH CO LTD +1

AI Technical Summary

Problems solved by technology

[0003] Because the partitions of a Spark RDD are usually distributed across different computing nodes, the failure of a node (for example a network failure or a hard disk failure) triggers a serious error (FetchFailed) in Spark. This causes the Shuffle process to be re-executed, which consumes additional computing resources and time.
Moreover, the number of random reads in the Shuffle Read process depends on the number of partitions of the two RDDs before and after the Shuffle process. When the target RDD (the RDD formed after the Shuffle) has too many partitions, the large number of reads often makes the read time too long; when it has too few partitions, each partition of the target RDD must process a large amount of data, which can easily cause an OOM (OutOfMemoryError, memory overflow).
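
The trade-off described above can be illustrated with a rough calculation. The sketch below is not taken from the patent: it assumes the worst case in which every partition of the target RDD reads one block from every partition of the source RDD, and that the data is spread evenly over the target partitions. The function name `shuffle_read_cost` and the example sizes are hypothetical.

```python
def shuffle_read_cost(source_partitions, target_partitions, total_bytes):
    """Estimate shuffle read cost under the stated worst-case assumptions."""
    # Assumption: each target partition fetches one block from each source partition.
    random_reads = source_partitions * target_partitions
    # Assumption: data is evenly distributed over the target partitions.
    bytes_per_target_partition = total_bytes / target_partitions
    return random_reads, bytes_per_target_partition


total = 1024 ** 4  # 1 TiB of shuffle data (hypothetical)
for targets in (10, 1_000, 100_000):
    reads, per_part = shuffle_read_cost(1_000, targets, total)
    print(f"{targets:>7} target partitions: {reads:>11} reads, "
          f"{per_part / 1024 ** 3:10.3f} GiB per partition")
```

Under these assumptions, a small number of target partitions keeps the read count low but leaves each partition with a large amount of data (the OOM risk), while a large number of target partitions shrinks each partition but multiplies the number of reads.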



Embodiment Construction

[0027] The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.

[0028] It should be noted that, provided there is no conflict, the embodiments of the present disclosure and the features of those embodiments may be combined with one another. The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

[0029] Figure 1 shows an exemplary architecture 100 to which the data transmission method for the Shuffle process or the data transmission device for the Shuffle process of the present disclosure can be applied.

[0030] As shown in Figure 1, the system archit...


Abstract

The embodiments of the invention disclose a data transmission method and device for a Shuffle process. In one specific embodiment, the method comprises: obtaining the data set before the Shuffle process, which comprises at least one partition, and the number of partitions of the data set after the Shuffle process; obtaining the number of target service components; determining, according to the number of target service components, a correspondence between each target service component and the partitions of the data set after the Shuffle process; and sending the data of each partition of the data set before the Shuffle process to the corresponding target service component for the data set after the Shuffle process. This embodiment decouples the Spark computation process from the storage process, effectively avoids re-execution of the Shuffle process caused by local node faults, and improves the stability and performance of Spark.
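
The steps in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the abstract does not say how partitions are mapped to service components or how a record's post-Shuffle partition is chosen, so the modulo mapping, the CRC32 hash partitioning by key, and all names (`assign_partitions`, `target_partition`, `route_records`) are hypothetical.

```python
import zlib
from collections import defaultdict


def assign_partitions(num_target_partitions, num_components):
    """Map each post-Shuffle partition to a service component.

    Assumption: a simple modulo rule; the source does not specify the rule.
    """
    return {p: p % num_components for p in range(num_target_partitions)}


def target_partition(key, num_target_partitions):
    """Pick the post-Shuffle partition of a key (assumed: stable CRC32 hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_target_partitions


def route_records(source_partitions, num_target_partitions, num_components):
    """Group the records of every pre-Shuffle partition by destination.

    Returns {component_id: {target_partition_id: [records]}}; a real system
    would send each group over the network instead of collecting it in memory.
    """
    mapping = assign_partitions(num_target_partitions, num_components)
    outbox = defaultdict(lambda: defaultdict(list))
    for partition in source_partitions:
        for key, value in partition:
            tp = target_partition(key, num_target_partitions)
            outbox[mapping[tp]][tp].append((key, value))
    return outbox


# Two pre-Shuffle partitions, four post-Shuffle partitions, two components.
source = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
outbox = route_records(source, num_target_partitions=4, num_components=2)
for component, parts in sorted(outbox.items()):
    print(component, {tp: recs for tp, recs in sorted(parts.items())})
```

In this sketch every post-Shuffle partition is owned by exactly one component, so a reader of that partition would contact a single component rather than every computing node that held pre-Shuffle data.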

Description

Technical field

[0001] The embodiments of the present disclosure relate to the field of computer technology, and in particular to a data transmission method and device for a Shuffle process.

Background

[0002] With the rapid development of computer technology, general-purpose computing engines suited to large-scale data processing have emerged. In the prior art, the Spark computing engine is commonly used to convert between RDDs (Resilient Distributed Datasets) according to their dependencies. When the dependency between the RDDs before and after a conversion is a wide dependency, the conversion must be completed through a Shuffle.

[0003] Because the partitions of a Spark RDD are usually distributed across different computing nodes, the failure of a node (for example a network failure or a hard disk failure) triggers a serious error (FetchFailed) in Spark, wh...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F21/60, H04L29/08
CPC: G06F21/606, H04L67/1044
Inventor: 王文生, 石磊, 吴雪扬
Owner: BEIJING WODONG TIANJUN INFORMATION TECH CO LTD