Distributed training network system with storage network and service network separated and communication method

A technology that separates the service network from the storage network, applied in the fields of artificial intelligence model training, cloud computing, and data center networking. It addresses problems such as storage streams occupying bandwidth, mutual communication interference, and unpredictable network data transmission time, and it improves training efficiency.

Active Publication Date: 2021-07-16
CLUSTAR TECH LO LTD

AI Technical Summary

Problems solved by technology

Caching the next round's sample set while the current round of training executes (overlap) can in theory avoid waiting between iterations and improve training efficiency. In practice, however, the storage flow and the service flow share the same physical network bandwidth, and overlapping the two inevitably leads to uncontrolled competition for, and preemption of, that bandwidth. Because of this mutual interference, the time required for network data transmission becomes unpredictable, and communication becomes the performance bottleneck of distributed training.
The interference runs in both directions. The storage stream usually occupies a large share of bandwidth and so interferes with parameter transfer, training cluster management, and other service communication during distributed training. Conversely, under the PS (parameter server) communication model, the network congestion that forms at the parameter server bottleneck interferes with storage network communication, that is, it adversely affects the sample set caching process.
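The effect of sharing one link can be illustrated with a toy calculation. The sketch below is not taken from the patent: it assumes an idealized link that is split evenly among all active flows, and the function names, the 1 GB transfer size, and the 10 Gbps link are hypothetical values chosen only for illustration. It shows how the time for one parameter transfer grows with the number of competing storage flows, while a fixed reserved share gives a time that does not depend on storage activity.

```python
def transfer_time(size_gb: float, link_gbps: float, share: float) -> float:
    """Seconds to move size_gb gigabytes at a fixed fraction of the link rate."""
    return size_gb * 8 / (link_gbps * share)


def shared_link_time(size_gb: float, link_gbps: float, competing_flows: int) -> float:
    """Transfer time when the link is split evenly among all active flows.

    Assumption: ideal fair sharing, so the parameter flow receives
    1 / (1 + competing_flows) of the link.
    """
    share = 1 / (1 + competing_flows)
    return transfer_time(size_gb, link_gbps, share)


def isolated_time(size_gb: float, link_gbps: float, reserved_share: float) -> float:
    """Transfer time when the service traffic has a fixed reserved share."""
    return transfer_time(size_gb, link_gbps, reserved_share)


if __name__ == "__main__":
    # Hypothetical numbers: a 1 GB parameter transfer on a 10 Gbps link.
    for flows in (0, 1, 3, 7):
        print(f"{flows} competing storage flows:",
              shared_link_time(1.0, 10.0, flows), "s")
    print("reserved 50% share:", isolated_time(1.0, 10.0, 0.5), "s")
```

In this model the shared-link time depends entirely on how many storage flows happen to be active, which is the unpredictability described above; the reserved-share time is fixed in advance.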



Examples


Detailed Description of Embodiments

[0033] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0034] The following are some preferred embodiments of the present invention.

[0035] Some of the above preferred embodiments provide a distributed training network system with separated storage and service networks. As shown in Figure 1, the physical topology of this system includes a storage node, a working node and a parameter server node, where the working node is con...


Abstract

The invention provides a distributed training network system with separated storage and service networks, and a communication method. A storage/service network controller is coupled with the operating system of a working node and with a storage network interface and a service network interface. The storage network and the service network transmit in parallel over the same physical network while remaining logically isolated from each other. With this further network communication management, the mutual interference between transfers that run in parallel on the same physical network during distributed training, such as sample set cache data transmission and training parameter transmission, is resolved, and distributed training efficiency is further improved.
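The controller arrangement in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the class names `TrafficClass`, `NetworkInterface` and `StorageServiceController`, the interface names, and the use of in-memory queues in place of real network interfaces are all hypothetical. It shows only the core idea that traffic is classified once and then handed to one of two logically separate interfaces, so that storage and service payloads never share a queue.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class TrafficClass(Enum):
    STORAGE = "storage"   # sample set cache data
    SERVICE = "service"   # training parameters, cluster management


@dataclass
class NetworkInterface:
    """Stand-in for a logical interface; a queue replaces the real network."""
    name: str
    queue: deque = field(default_factory=deque)

    def send(self, payload: bytes) -> None:
        self.queue.append(payload)


@dataclass
class StorageServiceController:
    """Hypothetical controller that routes each payload by traffic class."""
    storage_if: NetworkInterface
    service_if: NetworkInterface

    def dispatch(self, traffic_class: TrafficClass, payload: bytes) -> str:
        # Pick the interface from the traffic class; the two queues stay separate.
        if traffic_class is TrafficClass.STORAGE:
            iface = self.storage_if
        else:
            iface = self.service_if
        iface.send(payload)
        return iface.name


if __name__ == "__main__":
    controller = StorageServiceController(
        storage_if=NetworkInterface("storage0"),
        service_if=NetworkInterface("service0"),
    )
    print(controller.dispatch(TrafficClass.STORAGE, b"sample batch"))
    print(controller.dispatch(TrafficClass.SERVICE, b"gradient update"))
```

In a real deployment the two interfaces would map to whatever isolation mechanism the network provides; the sketch makes no claim about which mechanism the patent uses.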

Description

Technical field

[0001] The present invention relates to the technical fields of artificial intelligence model training, cloud computing, and data center networks, and in particular to a distributed training network system and communication method that separate the storage and service networks.

Background

[0002] Thanks to advances in algorithms, data, and hardware computing capability, artificial intelligence is currently in its third development boom. In terms of algorithms, the introduction of deep learning and the development of related algorithms have greatly improved the capability of machine learning; the subsequent breakthroughs in algorithm research, represented by deep learning and reinforcement learning, and the continuous optimization of algorithm models have greatly improved the accuracy of artificial intelligence applications (such as speech recognition and image recognition). In terms of data, with the technological advancement and p...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L29/08, G06N20/00
CPC: H04L67/10, H04L67/1097, G06N20/00
Inventor: 胡水海, 孙军欢, 任正行
Owner: CLUSTAR TECH LO LTD