Commit 84f6f79: Add draft gpu troubles
1 file changed, 92 additions, 0 deletions

# GPU Troubleshooting Guide

This guide describes how to identify and resolve GPU issues on Amazon EC2 instances with NVIDIA GPUs.

While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons that are captured by Xid messages. These messages are written to `/var/log/messages` on Amazon Linux, and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.
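
To check whether a node has reported Xid errors, search the kernel log for the NVIDIA driver's `Xid` entries; each match includes the Xid number listed in the table below. A minimal sketch, assuming the Amazon Linux log path (use `/var/log/syslog` or `/var/log/kern.log` on Ubuntu):

```bash
# Look for Xid events reported by the NVIDIA kernel driver (NVRM)
sudo grep -i "NVRM: Xid" /var/log/messages
```
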
| Xid | Failure                                                                                           | Resolution          | Orchestrator                                            |
| --- | ------------------------------------------------------------------------------------------------- | ------------------- | ------------------------------------------------------- |
| 48  | Double Bit ECC                                                                                      | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
| 64  | ECC page retirement or row remapper recording failure<br>All reserved rows for bank are remapped    | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
| 79  | GPU has fallen off the bus                                                                          | Reboot              |                                                         |
| 94  | Contained ECC error                                                                                 | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
| 95  | Uncontained ECC error                                                                               | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
| 120 | GSP Error                                                                                           | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |

# AWS ParallelCluster
## Terminate and replace instances
1. Create a reservation to prevent the node from being used by any new jobs.

```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```
1. Identify the jobs running on the node to terminate.

```bash
squeue -w [NODE_TO_TERMINATE] -o %A -h
```
1. Cancel those jobs.

```bash
scancel [JOB_ID]
```
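
If several jobs are running on the node, the two steps above can be combined. A minimal sketch, reusing the same `squeue` and `scancel` calls:

```bash
# Cancel every job currently scheduled on the node in one pass
squeue -w [NODE_TO_TERMINATE] -o %A -h | xargs -r scancel
```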
1. Place the node in **DRAIN**.

```bash
sudo /opt/slurm/bin/scontrol update node=[NODE_TO_TERMINATE] state=drain reason=gpus-fail
```

The node will show a **DRAIN** status; the instance will then be terminated and replaced.
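
Before deleting the reservation, the node state can be verified with `scontrol` (a sketch using standard Slurm commands):

```bash
# Show the node's state and the reason set above (expect a drain/drained state with reason "gpus-fail")
sudo /opt/slurm/bin/scontrol show node [NODE_TO_TERMINATE]

# List reservations to confirm the name of the maintenance reservation (e.g. root_[RES_NUMBER])
sudo /opt/slurm/bin/scontrol show reservation
```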
1. Delete the reservation.

```bash
sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
```
## Reset GPUs
1. Create a reservation to prevent the node from being used by any new jobs.

```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```
1. Identify the jobs running on the node.

```bash
squeue -w [NODE_TO_TERMINATE] -o %A -h
```
1. Cancel those jobs.

```bash
scancel [JOB_ID]
```
1. Reset the GPUs.

```bash
sudo /opt/slurm/bin/srun -w [NODE_TO_TERMINATE] nvidia-smi -r
```
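
To confirm the reset cleared the error state, the ECC status can be queried on the node with the same `srun` pattern (a sketch; `nvidia-smi -q -d ECC` prints the volatile and aggregate ECC error counters):

```bash
# Query ECC error counters on the node after the reset
sudo /opt/slurm/bin/srun -w [NODE_TO_TERMINATE] nvidia-smi -q -d ECC
```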
1. Delete the reservation.

```bash
sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
```
# Amazon SageMaker HyperPod
TBD
# Amazon EKS
TBD
