Commit c9e0f5e: Add draft gpu troubles (1 file changed, 70 additions)

# GPU Troubleshooting Guide

This guide describes how to identify and resolve GPU issues on Amazon EC2 instances equipped with NVIDIA GPUs.

When running High-Performance Computing (HPC) or Machine Learning workloads, GPUs can fail for a variety of reasons, and these failures are reported by the NVIDIA driver as Xid messages.
Xid messages are written to `/var/log/messages` on Amazon Linux, and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.
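
As a quick check, you can search the kernel ring buffer and the system logs for Xid events (a minimal sketch; use the log paths that match your distribution, as noted above):

```bash
# Kernel ring buffer (both distributions)
sudo dmesg -T | grep -i xid

# Amazon Linux system log
sudo grep -i xid /var/log/messages

# Ubuntu system logs
sudo grep -i xid /var/log/syslog /var/log/kern.log
```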

| Xid | Failure               | Resolution          | Orchestrator                                            |
| --- | --------------------- | ------------------- | ------------------------------------------------------- |
| 48  | Double Bit ECC        | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
| 94  | Contained ECC error   | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
| 95  | Uncontained ECC error | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
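
To confirm which GPU reported the error and inspect its ECC counters, `nvidia-smi` can be queried on the affected node (a sketch; run it over SSH or through the scheduler, whichever fits your setup):

```bash
# Show ECC error counters for every GPU on the node
nvidia-smi -q -d ECC
```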

# AWS ParallelCluster

## Terminate and replace instances

1. Create a reservation to isolate the node and prevent it from being used by any jobs.
   ```bash
   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
   ```

1. Cancel any jobs running on the node (if you do not know the job ID, see the lookup commands after this procedure).
   ```bash
   scancel [JOB_ID]
   ```

1. Place the node in **DRAIN**.
   ```bash
   sudo /opt/slurm/bin/scontrol update node=[NODE_TO_TERMINATE] state=drain reason=gpus-fail
   ```

   The node will show a **DRAIN** status, and the instance will then be terminated and replaced by AWS ParallelCluster. You can watch the node state with the `sinfo` command shown after this procedure.

1. Delete the reservation. Slurm auto-generates its name from the creating user and a sequence number (for example `root_1`).
   ```bash
   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
   ```
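
The placeholders above can be filled in with Slurm's standard query commands (a sketch, assuming the same `/opt/slurm/bin` install path used throughout this guide):

```bash
# List jobs currently running on the node (gives the [JOB_ID] to cancel)
/opt/slurm/bin/squeue -w [NODE_TO_TERMINATE]

# Check the node state; it should report "drain" before the instance is replaced
/opt/slurm/bin/sinfo -n [NODE_TO_TERMINATE]

# List reservations to find the auto-generated name (root_[RES_NUMBER])
/opt/slurm/bin/scontrol show res
```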

## Reset GPUs

1. Create a reservation to isolate the node and prevent it from being used by any jobs.
   ```bash
   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_RESET]
   ```

1. Cancel any jobs running on the node (the job ID can be found with `squeue`, as shown in the previous procedure).
   ```bash
   scancel [JOB_ID]
   ```

1. Reset the GPUs (a quick health check after the reset is sketched after this procedure).
   ```bash
   sudo /opt/slurm/bin/srun -w [NODE_TO_RESET] nvidia-smi -r
   ```

1. Delete the reservation.
   ```bash
   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
   ```
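
After the reset completes, it is worth confirming that the GPUs are healthy before releasing the node back to jobs (a minimal sketch; clean `nvidia-smi` output and stable ECC counters are the usual signals):

```bash
# Quick health check on the node after the reset
sudo /opt/slurm/bin/srun -w [NODE_TO_RESET] nvidia-smi

# Inspect the ECC counters again
sudo /opt/slurm/bin/srun -w [NODE_TO_RESET] nvidia-smi -q -d ECC
```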

# Amazon SageMaker HyperPod

TBD

# Amazon EKS

TBD
