An Active Prediction Method for Supercomputer Job Failure Based on Application Similarity

A job failure prediction technology for supercomputers, applied in computer components, forecasting, and calculation. It addresses the problems of prolonged job waiting times, unsatisfactory prediction results, and wasted system resources, with the effects of reducing clustering computation cost, improving prediction performance, and providing strong resistance to overfitting.

Active Publication Date: 2022-04-19
CALCULATION AERODYNAMICS INST CHINA AERODYNAMICS RES & DEV CENT

AI Technical Summary

Problems solved by technology

[0002] A large number of jobs are submitted to a supercomputer and wait to be executed, but a job may fail during execution for various reasons, such as system resources that cannot meet the job's requirements, memory errors, and software or hardware failures. Job failures waste system resources and prolong the waiting time of other jobs in the queue. Job failure prediction can be used to mitigate the impact of these failures, so predicting job failures effectively is critical to improving system reliability and system resource utilization.
[0003] Many prediction methods exist for software and hardware failures of supercomputers (high-performance computing systems), but research on predicting job failures is relatively scarce. Existing work mainly uses statistical methods such as linear analysis and quadratic discriminant analysis. The core idea of these methods is to find a linearly separable relationship for job failure, but the results are not ideal because the methods require a large number of data samples and are computationally inefficient. In addition, the features used to predict failure are mostly resource and performance attributes, which are complex and changeable and cannot accurately describe the application characteristics of a job. This is a further reason why prediction based on linear analysis performs poorly.


Examples

Embodiment 1

[0083] Embodiment 1: A method for actively predicting supercomputer job failure based on application similarity, comprising the steps:

[0084] S1: extract feature data from the job log, add the job path data, preprocess them together, and use the result as the input features of the machine learning model;

[0085] S2: the machine learning model processes the input feature data to actively predict the job failure status.
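
Steps S1 and S2 can be illustrated with a minimal sketch. The source does not name the machine learning model or the log fields, so everything below is an assumption made for illustration: the record fields (`nodes`, `hours`, `path`, `failed`), the scaling constants, and the small nearest-neighbour classifier that stands in for the unspecified model.

```python
from collections import Counter

# Hypothetical job-log records; every field name and value is illustrative.
TRAIN_LOG = [
    {"nodes": 4, "hours": 1.0, "path": "/proj/cfd/solver", "failed": 0},
    {"nodes": 8, "hours": 2.0, "path": "/proj/cfd/solver", "failed": 0},
    {"nodes": 4, "hours": 1.5, "path": "/proj/cfd/solver", "failed": 0},
    {"nodes": 64, "hours": 12.0, "path": "/home/u1/test/a.out", "failed": 1},
    {"nodes": 32, "hours": 10.0, "path": "/home/u1/test/a.out", "failed": 1},
    {"nodes": 64, "hours": 8.0, "path": "/home/u1/test/a.out", "failed": 1},
]


def extract_features(record):
    """S1: join log-derived features with the job path into one feature tuple."""
    # Scaling constants (64 nodes, 12 hours) are assumed, not from the source.
    return (record["nodes"] / 64.0, record["hours"] / 12.0, record["path"])


def distance(a, b):
    """Absolute difference on numeric features, 0/1 mismatch on the job path."""
    path_term = 0.0 if a[2] == b[2] else 1.0
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) + path_term


def predict_failure(train, record, k=3):
    """S2: majority vote over the k nearest training jobs (1 = predicted failure)."""
    x = extract_features(record)
    ranked = sorted(train, key=lambda r: distance(extract_features(r), x))
    votes = Counter(r["failed"] for r in ranked[:k])
    return votes.most_common(1)[0][0]


new_job = {"nodes": 16, "hours": 3.0, "path": "/home/u1/test/a.out"}
prediction = predict_failure(TRAIN_LOG, new_job)
```

In this toy data the job path dominates the distance, so a new job submitted from a path whose earlier jobs failed is itself flagged as a likely failure. That is the intuition behind using the path as an application-similarity feature; a production model would be trained on real log data.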

Embodiment 2

[0086] Embodiment 2: On the basis of Embodiment 1, the job path data comes from additional monitoring information.

Embodiment 3

[0087] Embodiment 3: On the basis of Embodiment 1, the preprocessing in step S1 includes clustering preprocessing.
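
The clustering preprocessing can be sketched as follows. The visible text does not specify the clustering algorithm or the similarity measure, so this is only one possible reading: job paths are grouped greedily by string similarity using `difflib.SequenceMatcher`, and the resulting cluster id serves as an application-similarity feature. The function names, the `threshold` value, and the sample paths are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] between two job paths."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_paths(paths, threshold=0.8):
    """Greedy single-pass clustering of job paths.

    Each path joins the first cluster whose representative is at least
    `threshold` similar; otherwise it opens a new cluster.
    Returns a dict mapping path -> integer cluster id.
    """
    representatives = []  # first path seen in each cluster
    labels = {}
    for path in paths:
        for cluster_id, rep in enumerate(representatives):
            if similarity(path, rep) >= threshold:
                labels[path] = cluster_id
                break
        else:
            # No existing cluster is similar enough: start a new one.
            labels[path] = len(representatives)
            representatives.append(path)
    return labels


# Hypothetical job paths: two runs of one application, one unrelated job.
paths = [
    "/proj/cfd/case01/solver",
    "/proj/cfd/case02/solver",
    "/home/u1/md/lammps_run",
]
labels = cluster_paths(paths)
```

Comparing each path only against cluster representatives, rather than against every other path, keeps the number of comparisons small. Whether the patented method lowers its clustering cost in this way is not stated in the visible text.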


Abstract

The invention discloses an active prediction method for supercomputer job failure based on application similarity, belonging to the field of supercomputers. The method comprises the steps: S1, extracting feature data from job logs, adding job path data, preprocessing them together, and using the result as the input features of a machine learning model; S2, processing the input feature data with the machine learning model to actively predict the job failure status. The invention mines features that accurately describe the application attributes of a job, which improves prediction performance. It uses a machine learning algorithm for job failure prediction, which improves the robustness of the prediction model and is especially suitable for nonlinear data. Its clustering method for job application attributes significantly reduces clustering computation overhead and reduces errors. The method achieves high prediction efficiency and can be applied in practice to large supercomputers.

Description

Technical field
[0001] The invention relates to the field of supercomputers, and more specifically to a method for actively predicting supercomputer job failures based on application similarity.
Background
[0002] A large number of jobs are submitted to a supercomputer and wait to be executed, but a job may fail during execution for various reasons, such as system resources that cannot meet the job's requirements, memory errors, and software or hardware failures. Job failures waste system resources and prolong the waiting time of other jobs in the queue. Job failure prediction can be used to mitigate the impact of these failures, so predicting job failures effectively is critical to improving system reliability and system resource utilization.
[0003] At present, there are many prediction methods for software and hardware failures of supercomputers (high-pe...

Application Information

Patent Type & Authority: Patents (China)
IPC (8): G06Q10/04; G06N20/00; G06K9/62
CPC: G06Q10/04; G06N20/00; G06F18/23213; G06F18/22; G06F18/214
Inventor: 喻杰, 鲜港, 杨文祥, 周隆放, 王昉, 王岳青, 邓亮, 杨志供, 赵丹, 陈呈, 杨超, 代喆
Owner CALCULATION AERODYNAMICS INST CHINA AERODYNAMICS RES & DEV CENT