Elastic quota scheduling method, device, and medium for AI computing clusters
A computing cluster scheduling technology, applied in the field of cloud computing, that addresses problems such as the inability to adjust cluster quotas according to the load of the cloud platform, and achieves the effect of improving utilization.
Examples
Embodiment 1
[0035] An embodiment of the present invention provides an elastic quota scheduling method for an AI computing cluster; see Figure 1. The method includes the following steps:
[0036] S100: According to the performance of the cloud platform, set a scan interval at which GPU and CPU resource load information is acquired periodically. Write the configuration file with int maxgpu = utils.getconf("MAXGPU") and int GPUADD = Utils.GetConf("AddGPU"), and write the CPU configuration file in the same way. This is needed because, once the GPUs required by the tasks an enterprise user has submitted to the cloud platform reach the upper limit, the user's new tasks must wait: they are not computed until the current computing task is completed.
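As a minimal sketch of step S100, the Python below reads the two GPU settings named in the text and runs a fixed number of periodic scans. The keys MAXGPU and AddGPU come from the paragraph above; the key SCAN_INTERVAL_SECONDS, the helper names get_conf and scan_periodically, and all numeric values are assumptions made for illustration, not part of the patent text.

```python
import time
from typing import Callable, Mapping

# Hypothetical configuration values. MAXGPU and AddGPU are named in the text;
# SCAN_INTERVAL_SECONDS and every number here are illustrative assumptions.
EXAMPLE_CONF = {"MAXGPU": "8", "AddGPU": "2", "SCAN_INTERVAL_SECONDS": "30"}


def get_conf(conf: Mapping[str, str], key: str) -> int:
    """Return an integer setting, failing loudly if the key is missing."""
    if key not in conf:
        raise KeyError(f"missing configuration key: {key}")
    return int(conf[key])


def scan_periodically(read_load: Callable[[], dict], interval_s: float,
                      rounds: int,
                      sleep: Callable[[float], None] = time.sleep) -> list:
    """Call read_load once per scan interval for a fixed number of rounds."""
    samples = []
    for i in range(rounds):
        samples.append(read_load())
        if i < rounds - 1:
            sleep(interval_s)  # wait until the next scan is due
    return samples


max_gpu = get_conf(EXAMPLE_CONF, "MAXGPU")
gpu_add = get_conf(EXAMPLE_CONF, "AddGPU")
scan_interval = get_conf(EXAMPLE_CONF, "SCAN_INTERVAL_SECONDS")
```

Passing the sleep function as a parameter keeps the loop easy to exercise without real waiting; a CPU configuration could be read with the same get_conf helper.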
[0037] Set the expansion threshold. This value can be set according to the overall performance of the cloud platform cluster. Code:
[0038]
[0039] Depending on whether the current GPU's idle qua...
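The source text is cut off at this point, so the exact expansion rule is not available. Purely as an illustration of how an expansion threshold could interact with the MAXGPU and AddGPU settings, the sketch below assumes that the quota grows by the configured increment when the idle share of the current GPU quota drops below the threshold, and that it never exceeds the upper limit. The function name next_gpu_quota, its parameters, and the comparison rule are all assumptions, not the patent's stated method.

```python
def next_gpu_quota(current_quota: int, gpu_idle: int, expand_threshold: float,
                   gpu_add: int, max_gpu: int) -> int:
    """Return the GPU quota to use after one scan.

    Assumed rule (not taken from the source): expand by gpu_add when the idle
    share of the current quota falls below expand_threshold, capped at max_gpu.
    """
    if current_quota <= 0:
        raise ValueError("current_quota must be positive")
    idle_ratio = gpu_idle / current_quota
    if idle_ratio < expand_threshold:
        # Not enough idle capacity: grow the quota, but respect the upper limit.
        return min(current_quota + gpu_add, max_gpu)
    # Enough idle capacity remains: leave the quota unchanged.
    return current_quota
```

Under these assumptions, a tenant with a quota of 4 GPUs and none idle would move to 6 with an increment of 2, while a tenant at 7 would be capped at an upper limit of 8.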
Embodiment 2
[0049] An embodiment of the present invention provides an elastic quota scheduling system for AI computing clusters; see Figure 2. The system includes a threshold configuration module, a load monitoring module, and a quota elastic management module.
[0050] The threshold configuration module sets the scan interval, the expansion threshold, and the expansion policy according to the cloud platform or user needs.
[0051] The load monitoring module uses the open-source components Prometheus and cAdvisor, or directly uses the docker stats command of the container management component Docker. The load monitoring module scans at the scan interval. Here, a container corresponds to the computing resource cluster purchased by each enterprise user on the cloud platform. Because the performance of each container is different, the GPU total, GPU idle amount, CPU total, and CPU idle amount of each monitored container are recorded in th...
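One way the docker stats route could look is sketched below in Python. It relies on the real flags --no-stream and --format "{{json .}}", which make docker stats print one JSON object per container, including the Name and CPUPerc fields. Note that docker stats reports CPU, memory, network, and block I/O but not GPU usage, so GPU figures would have to come from another source that this sketch does not cover. The function names, the sample container names, and the sample numbers are assumptions for illustration.

```python
import json
import subprocess


def parse_docker_stats(output: str) -> dict:
    """Map container name to CPU usage percent from JSON-lines stats output."""
    usage = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        row = json.loads(line)
        usage[row["Name"]] = float(row["CPUPerc"].rstrip("%"))
    return usage


def read_docker_stats() -> dict:
    """Take one snapshot from the local Docker daemon (requires Docker)."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_stats(result.stdout)


# Sample text in the shape produced by the JSON format; the container names
# and percentages are made up for illustration.
SAMPLE = (
    '{"Name":"tenant-a","CPUPerc":"12.50%"}\n'
    '{"Name":"tenant-b","CPUPerc":"0.00%"}\n'
)
cpu_usage = parse_docker_stats(SAMPLE)
```

Keeping the parser separate from the subprocess call lets the parsing logic be checked against captured output on a machine without Docker installed.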