Supercomputer job failure active prediction method based on application similarity

A supercomputer job failure prediction method, applied in fields such as computer components, prediction, and computing. It addresses the problems of prolonged job waiting time, unsatisfactory prediction results, and wasted system resources, with the effects of reducing the cost of clustering computation, improving prediction results, and providing strong resistance to overfitting.

Active Publication Date: 2022-03-11
CALCULATION AERODYNAMICS INST CHINA AERODYNAMICS RES & DEV CENT

AI Technical Summary

Problems solved by technology

[0002] A large number of jobs are submitted to a supercomputer and wait to be executed, but a job may fail during execution for various reasons, such as system resources that cannot meet the job's requirements, memory errors, and software or hardware failures. Job failures waste system resources and prolong the waiting time of other jobs in the queue. Job failure prediction can be used to mitigate the impact of these failures, so predicting job failures effectively is critical to improving system reliability and system resource utilization.
[0003] At present, there are many prediction methods for software and hardware failures of supercomputers (high-performance computing systems), but research on prediction methods for job failures is relatively scarce. Job failures are mainly predicted with statistical methods such as linear analysis and quadratic discriminant analysis. The core idea of this type of method is to find a linearly separable relationship for job failures, but the results are not ideal, because these methods require a large number of data samples and are not computationally efficient. In addition, the features used to predict failure are mostly resource and performance attributes, which are complex and changeable and cannot accurately describe the application characteristics of a job. This is a further reason why prediction methods based on linear analysis perform poorly.

Examples

Embodiment 1

[0083] Embodiment 1: A method for actively predicting failure of supercomputer jobs based on application similarity, comprising steps:

[0084] S1: extract feature data from the job log, add the job path data, preprocess both together, and use the result as the input features of the machine learning algorithm model;

[0085] S2: the machine learning algorithm model processes the input feature data and actively predicts the job failure status.
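
Steps S1 and S2 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the log field names (`nodes`, `walltime_h`, `path`, `failed`), the sample records, the use of the parent directory of the job path as a proxy for the application, and the k-nearest-neighbor vote standing in for the unspecified machine learning algorithm model. It is not the patented implementation.

```python
from collections import Counter
import posixpath

# Hypothetical job-log records; field names and values are illustrative only.
JOB_LOG = [
    {"nodes": 4, "walltime_h": 2.0, "path": "/home/u1/cfd_solver/case01", "failed": 0},
    {"nodes": 8, "walltime_h": 4.0, "path": "/home/u1/cfd_solver/case02", "failed": 0},
    {"nodes": 4, "walltime_h": 3.0, "path": "/home/u1/cfd_solver/case03", "failed": 0},
    {"nodes": 64, "walltime_h": 12.0, "path": "/home/u2/mesh_gen/big01", "failed": 1},
    {"nodes": 32, "walltime_h": 10.0, "path": "/home/u2/mesh_gen/big02", "failed": 1},
    {"nodes": 64, "walltime_h": 8.0, "path": "/home/u2/mesh_gen/big03", "failed": 1},
]


def extract_features(record):
    """S1: turn one log record into (numeric features, application key)."""
    # Assumption: the parent directory of the job path identifies the application.
    app_key = posixpath.dirname(record["path"])
    return (float(record["nodes"]), float(record["walltime_h"])), app_key


def make_scaler(rows):
    """S1 preprocessing: min-max scale each numeric column to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return lambda row: tuple((v - lo) / s for v, lo, s in zip(row, lows, spans))


def distance(a, b):
    """Euclidean distance on numeric features plus a penalty for a different application."""
    (xa, ka), (xb, kb) = a, b
    numeric = sum((p - q) ** 2 for p, q in zip(xa, xb)) ** 0.5
    return numeric + (0.0 if ka == kb else 1.0)


raw = [extract_features(r) for r in JOB_LOG]
scaler = make_scaler([x for x, _ in raw])
train = [((scaler(x), key), r["failed"]) for (x, key), r in zip(raw, JOB_LOG)]


def predict_job(record, k=3):
    """S2: majority vote among the k most similar past jobs (1 = predicted failure)."""
    x, key = extract_features(record)
    query = (scaler(x), key)
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


print(predict_job({"nodes": 48, "walltime_h": 9.0, "path": "/home/u2/mesh_gen/big04"}))
print(predict_job({"nodes": 4, "walltime_h": 2.5, "path": "/home/u1/cfd_solver/case04"}))
```

The point of the sketch is only the data flow: the job path enters the feature vector alongside the log features, so jobs from the same application end up close together before the model votes.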

Embodiment 2

[0086] Embodiment 2: On the basis of Embodiment 1, the job path data comes from additional monitoring information.

Embodiment 3

[0087] Embodiment 3: On the basis of Embodiment 1, the preprocessing in step S1 includes clustering preprocessing.
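
One way to picture the clustering preprocessing is to group job paths by string similarity and use the resulting cluster id as an application feature. The sketch below is a stand-in under stated assumptions: the text above does not specify the clustering algorithm or the similarity measure, so a greedy leader clustering over `difflib.SequenceMatcher` ratios is used here purely for illustration, and the paths and the `threshold` value are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] between two job paths."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_paths(paths, threshold=0.8):
    """Greedy leader clustering: each path joins the first cluster whose
    leader is at least `threshold` similar, otherwise it starts a new cluster."""
    leaders = []
    labels = []
    for path in paths:
        for idx, leader in enumerate(leaders):
            if similarity(path, leader) >= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(path)
            labels.append(len(leaders) - 1)
    return labels, leaders


# Hypothetical job paths for illustration.
paths = [
    "/home/u1/cfd_solver/case01",
    "/home/u1/cfd_solver/case02",
    "/home/u2/mesh_gen/big01",
    "/home/u1/cfd_solver/case17",
    "/home/u2/mesh_gen/big02",
]
labels, leaders = cluster_paths(paths)
print(labels)
print(leaders)
```

Each path is compared only against the current cluster leaders rather than against every other path, which keeps the number of similarity computations small when many jobs belong to a few applications.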

Abstract

The invention discloses a method for actively predicting supercomputer job failures based on application similarity, and belongs to the field of supercomputers. The method comprises the steps of: S1, extracting feature data from the job log, adding job path data, preprocessing the feature data and the job path data, and taking the preprocessed data as the input features of a machine learning algorithm model; and S2, after the machine learning algorithm model processes the input feature data, actively predicting the job failure status. The method mines features that accurately describe the application attributes of a job, which improves prediction. It uses a machine learning algorithm to predict job failures, which improves the robustness of the prediction model and makes the method especially suitable for nonlinear data. Clustering on the job application attributes significantly reduces the clustering computation overhead and reduces the error. The prediction efficiency is high, and the method can be applied in practice to large supercomputers.

Description

Technical field

[0001] The invention relates to the field of supercomputers, and more specifically to a method for actively predicting supercomputer job failures based on application similarity.

Background

[0002] A large number of jobs are submitted to a supercomputer and wait to be executed, but a job may fail during execution for various reasons, such as system resources that cannot meet the job's requirements, memory errors, and software or hardware failures. Job failures waste system resources and prolong the waiting time of other jobs in the queue. Job failure prediction can be used to mitigate the impact of these failures, so predicting job failures effectively is critical to improving system reliability and system resource utilization.

[0003] At present, there are many prediction methods for software and hardware failures of supercomputers (high-pe...

Application Information

IPC(8): G06Q10/04, G06N20/00, G06K9/62
CPC: G06Q10/04, G06N20/00, G06F18/23213, G06F18/22, G06F18/214
Inventor: 喻杰, 鲜港, 杨文祥, 周隆放, 王昉, 王岳青, 邓亮, 杨志供, 赵丹, 陈呈, 杨超, 代喆
Owner CALCULATION AERODYNAMICS INST CHINA AERODYNAMICS RES & DEV CENT