Examples

To use the available resources efficiently while ensuring your data is safely preserved, we recommend the following workflow:

  1. Create a working pod without GPU resources to work on your code. Ensure you are saving your work in a persistent volume. See the Persistent volume example.
  2. To test and debug your code, use a GPU shell. See the GPU shell example, and ensure you are correctly mounting your persistent volume. Keep the GPU shell running only while you are actively using the GPU.
  3. Once your code is ready, run it as a GPU workload, ensuring it automatically exits once the job is complete. See the GPU workload example.

Simple ephemeral shell

$ kubectl run --rm -it --image debian:bookworm ephemeral-shell -- /bin/bash
If you don't see a command prompt, try pressing enter.
root@ephemeral-shell:/# nproc
256
root@ephemeral-shell:/# exit
exit
Session ended, resume using 'kubectl attach ephemeral-shell -c ephemeral-shell -i -t' command when the pod is running
pod "ephemeral-shell" deleted

The shell and any files in its filesystem are deleted immediately upon exit. No data is preserved.
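If you need to keep a file produced inside an ephemeral shell, copy it out from a second terminal before exiting. A minimal sketch, assuming the shell above is still running (the /tmp/results.txt path is only a placeholder, and kubectl cp requires tar inside the container image):

$ # in a second terminal, while ephemeral-shell is still running
$ kubectl cp ephemeral-shell:/tmp/results.txt ./results.txt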

Simple persistent shell

Your data will not be preserved across restarts. To preserve data, see the "Persistent volume" example below. We cannot recover data that was lost because a persistent volume was not used.

# simple-persistent-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-persistent-shell
spec:
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
$ # create the pod
$ kubectl apply -f simple-persistent-shell.yaml
pod/simple-persistent-shell created

$ # open a shell session
$ kubectl exec -it -f simple-persistent-shell.yaml -- bash
root@simple-persistent-shell:/# exit
exit

$ # later, delete the pod
$ kubectl delete -f simple-persistent-shell.yaml
pod "simple-persistent-shell" deleted

Persistent volume

If you want to preserve data across pod restarts, use a persistent volume.

# persistent-volume.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistent-volume
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
# persistent-volume-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: persistent-volume-shell
spec:
  volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: persistent-volume
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
      volumeMounts:
        - name: my-volume
          mountPath: /data
$ # create resources
$ kubectl apply -f persistent-volume.yaml
persistentvolumeclaim/persistent-volume created
$ kubectl apply -f persistent-volume-shell.yaml
pod/persistent-volume-shell created

$ # open a shell session
$ kubectl exec -it -f persistent-volume-shell.yaml -- bash
root@persistent-volume-shell:/# df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127p1    100G     0  100G   0% /data

In this example, we mount a 100 GiB volume to the /data directory. Your data will be irrecoverably lost if the PersistentVolumeClaim is deleted.
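Because the data lives on the volume rather than in the pod, it survives deleting and recreating the pod. A minimal sketch reusing the manifests above (example.txt is just a placeholder file):

$ kubectl exec -it -f persistent-volume-shell.yaml -- bash
root@persistent-volume-shell:/# echo hello > /data/example.txt
root@persistent-volume-shell:/# exit
exit

$ # delete and recreate the pod; the claim (and its data) remains
$ kubectl delete -f persistent-volume-shell.yaml
pod "persistent-volume-shell" deleted
$ kubectl apply -f persistent-volume-shell.yaml
pod/persistent-volume-shell created

$ # once the pod is Running again, the file is still there
$ kubectl exec -it -f persistent-volume-shell.yaml -- cat /data/example.txt
hello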

GPU shell

Use this example to spawn an ephemeral shell with access to GPU resources.

# gpu-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-shell
spec:
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ['/bin/bash', '-c', 'sleep inf']
      resources:
        limits:
          nvidia.com/gpu: 4
$ kubectl apply -f gpu-shell.yaml
pod/gpu-shell created

$ kubectl exec -it -f gpu-shell.yaml -- bash
root@gpu-shell:/# nvidia-smi
Tue Jun  4 11:55:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P0              53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0              51W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:B7:00.0 Off |                    0 |
| N/A   28C    P0              54W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0              58W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

In this example, we create a pod with 4 GPUs attached. These GPU resources are exclusively allocated to your pod as long as this pod is running.

If you allocate GPU resources but let the GPU idle for extended periods of time, we will terminate your pod without warning. Furthermore, your access may be permanently restricted. We actively monitor GPU utilization and take action if we detect abuse.

This warning also applies to "guaranteed" CPU or memory quotas.
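When you are done with an interactive GPU session, release the GPUs immediately by deleting the pod:

$ # release the GPUs as soon as you are done
$ kubectl delete -f gpu-shell.yaml
pod "gpu-shell" deleted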

GPU workload

Similar to the GPU shell, but the pod exits (and releases its GPU resources) as soon as the process terminates.

# gpu-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  terminationGracePeriodSeconds: 1
  restartPolicy: Never
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ['/bin/bash', '-c', 'nvidia-smi']
      resources:
        limits:
          nvidia.com/gpu: 4

Note the restartPolicy: Never and the modified command line.

$ # create the pod
$ kubectl apply -f gpu-workload.yaml
pod/gpu-workload created

$ # watch the pod start and eventually exit
$ kubectl get -f gpu-workload.yaml --watch
NAME           READY   STATUS    RESTARTS   AGE
gpu-workload   1/1     Running   0          6s
gpu-workload   0/1     Completed   0          7s
^C

$ # view logs (standard outputs)
$ kubectl logs gpu-workload --follow
Tue Jun  4 12:14:54 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P0              53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0              51W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:B7:00.0 Off |                    0 |
| N/A   28C    P0              54W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0              61W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

$ # clean up the pod
$ kubectl delete -f gpu-workload.yaml
pod "gpu-workload" deleted

Logs may be truncated to save space, and will be permanently deleted if you delete your pod. If you want to preserve logs, we recommend writing them to a persistent volume.
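As a sketch, the GPU workload above can be combined with the persistent volume from the earlier example, redirecting output to a file under /data. The gpu-workload-logged name and the log file name are placeholders; adjust the command to your own job, and note that the claim above is ReadWriteOnce, so it must not be mounted by another pod on a different node at the same time.

# gpu-workload-logged.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-logged
spec:
  terminationGracePeriodSeconds: 1
  restartPolicy: Never
  volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: persistent-volume
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ['/bin/bash', '-c', 'nvidia-smi > /data/nvidia-smi.log 2>&1']
      volumeMounts:
        - name: my-volume
          mountPath: /data
      resources:
        limits:
          nvidia.com/gpu: 4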

Completed pods do not use node resources. Still, it is a good idea to clean up completed pods you are no longer using, as they can clutter your namespace and may result in name collisions.
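One way to find and remove completed pods in bulk (a sketch; review the list before deleting, since pod logs are removed along with the pods):

$ # list completed pods
$ kubectl get pods --field-selector=status.phase==Succeeded

$ # delete them once their logs are no longer needed
$ kubectl delete pods --field-selector=status.phase==Succeeded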