SNUCSE GPU Service 3.0

SNUCSE GPU Service (SGS) 3.0 is a free service provided to students in the SNU CSE department. Before using this service, make sure you read this manual very carefully.

This service is provided for research purposes only. Please limit your use to the reasons stated in your workspace application form.

Best-Effort

This service is provided on a best-effort basis. Bacchus volunteers will do their best to ensure smooth operations for everyone, but service may be interrupted due to technical issues, maintenance, or resources being in use by other users.

Your containers may be terminated, without warning, at any time. To prevent data loss, ensure all important data is stored in persistent volumes. We cannot recover lost data from terminated containers.

Resource policy

There are two types of quotas: guaranteed (including GPU and storage) and limits.

Guaranteed resources, when assigned to a running pod, are reserved exclusively for that pod for as long as it runs. This means that even if your container is completely idle, other users will not be able to use those resources.

If we discover that guaranteed resources (including GPUs) have been allocated but are not actually being used, we may take appropriate action, including:

  • Terminating the offending containers
  • Disabling your workspace
  • Restricting access to this service permanently

Limits are the maximum amount of resources that can be used across all containers in the workspace. Limits may be overcommitted, and we cannot guarantee that compute resources will be available to you unless you use guaranteed quotas.
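
In pod terms, guaranteed quotas correspond to resource requests, and limit quotas to resource limits. As a rough sketch (the values here are hypothetical), the two appear side by side in a container spec:

# fragment of a container spec; the values are hypothetical
resources:
  requests:
    cpu: '4'    # guaranteed: reserved exclusively while the pod runs, even if idle
  limits:
    cpu: '16'   # limit: usage cap only; limits may be overcommitted across users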

Cleanup

Bacchus may periodically clean up old workspaces to free up resources. Specifically, we plan on performing cleanup operations at the following times:

  • At the beginning of the spring/fall semesters
  • At the beginning of the summer/winter breaks

We will contact users before performing any cleanup operations. If you do not respond in time, your workspace will first be disabled and later deleted. We cannot recover lost data once your workspace has been deleted.

Security policy

As a general rule, each workspace has similar security properties to running on a shared Unix machine with other users.

For example, other users may be able to see the following:

  • names of your pods and containers
  • images being run in your containers
  • command lines of processes running in your containers

However, the following information is hidden from other users:

  • the contents of your persistent volumes or ephemeral storage
  • contents of your secrets
  • your container logs

If you intend to test the security of the service, please inform us in advance.

Contact

SGS is developed and maintained by volunteers at Bacchus. If you have any questions, please reach out to us at contact@bacchus.snucse.org.

Changes from 2.0

If you have used the SGS 2.0 service, you may notice some changes in the 3.0 service.

Workspace request

We have developed a unified workspace and quota management system. You can now request a workspace directly from the web interface. You no longer need to create UserBootstrap objects to create resource quotas.

Nodes

We now manage multiple nodes in a single cluster. Each node belongs to a nodegroup (undergraduate or graduate), and each workspace is assigned to a nodegroup. Pods in your workspace are automatically modified to run only on nodes in your nodegroup. If you need to run on a specific node, use a nodeSelector in your pod configuration, as sketched below.
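
For example, a pod pinned to a specific node via a nodeSelector on the standard kubernetes.io/hostname label might look like the following sketch (the pod name is ours; ford is the node shown in the node listing later in this manual):

# pinned-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-shell
spec:
  nodeSelector:
    kubernetes.io/hostname: ford
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']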

Resource model

We no longer grant guaranteed CPU or memory (request) quotas by default. If you are absolutely sure you need request quotas for your use-case, you must justify your request in the workspace request form.

Pod CPU and memory limits are now automatically set to your workspace quota value using LimitRanges. If you need to run multiple containers (multiple containers in a pod, multiple pods, or both), adjust the limits in your pod configuration, as sketched below.
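
For example (a sketch with hypothetical quota values), a workspace with a 16 vCPU / 64 GiB limit quota could run two pods like this one, each capped at half the quota:

# half-quota-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: half-quota-shell
spec:
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
      resources:
        limits:
          cpu: '8'       # hypothetical: half of a 16 vCPU workspace limit
          memory: 32Gi   # hypothetical: half of a 64 GiB workspace limit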

Permissions

Users can now query node details. You no longer need to contact Bacchus to check the status of available node resources.

Multiple users can now be added to a single workspace. If you are collaborating with multiple users, for example for coursework, you can now share a single workspace.

We now enforce the baseline Pod Security Standard. Contact us if this is too restrictive for your use-case.

Registry

Harbor projects are now created automatically upon workspace approval. You no longer need to create the project manually.

We now automatically configure imagePullSecrets for your workspace under the default ServiceAccount. You no longer need to configure this manually, or specify imagePullSecrets in your pod configuration.
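
You can confirm this with kubectl; the output should reference the sgs-registry secret:

$ kubectl get serviceaccount default -o jsonpath='{.imagePullSecrets}'
[{"name":"sgs-registry"}]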

Request a workspace

Before continuing, you must already have a SNUCSE ID account registered in either the undergraduate or graduate groups.

To request a workspace, fill out the Workspace request form on the SGS workspace management page.

Workspace request form

After submitting the form, your workspace will be in the "Pending approval" state.

Pending approval

Your workspace request will be reviewed by Bacchus volunteers. Once approved, your workspace will transition to the "Enabled" state.

Enabled

You may request updates to your quota or the users list at any time. Use the "reason" field to explain the purpose of the change. Similar to workspace requests, change requests will be reviewed by Bacchus volunteers.

Request changes

Configure access

Install CLI tools

In order to access our cluster, you need to download and install Kubernetes CLI tooling (kubectl) and the authentication plugin (kubelogin). Refer to the linked pages for installation instructions.

Verify CLI tool installation

Verify your CLI tools were installed with the following commands. Your specific version numbers may differ.

$ kubectl version --client
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3

$ kubectl oidc-login --version
kubelogin version v1.28.1

Download the kubeconfig file

Open the SGS workspace management page and navigate to your workspace. Click the download button at the bottom of the page to download the kubeconfig file.

Download kubeconfig

Place the downloaded kubeconfig file in the default kubeconfig location:

  • Unix (Linux, macOS): ~/.kube/config
  • Windows: %USERPROFILE%\.kube\config
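
For example, on Unix-like systems you might move the downloaded file into place like this (the download path is an assumption):

$ mkdir -p ~/.kube
$ mv ~/Downloads/config ~/.kube/config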

Verify your configuration

Use the kubectl auth whoami command to check everything is working correctly. It should automatically open a browser window to log in to SNUCSE ID. After logging in, you should see something similar to the following output:

$ kubectl auth whoami
ATTRIBUTE   VALUE
Username    id:yseong
Groups      [id:undergraduate system:authenticated]

Query node details

Query the list of nodes available to you using kubectl:

$ kubectl get node --selector node-restriction.kubernetes.io/nodegroup=undergraduate
NAME   STATUS   ROLES    AGE   VERSION
ford   Ready    <none>   85d   v1.30.1

$ kubectl get node --selector node-restriction.kubernetes.io/nodegroup=graduate
NAME      STATUS   ROLES    AGE   VERSION
bentley   Ready    <none>   31d   v1.30.1

Your containers will automatically be assigned to one of the nodes in your workspace's nodegroup.

To query the available resources on a node, use the kubectl describe node command.

$ kubectl describe node bentley
Name:               bentley
[...]
Allocatable:
  cpu:                256
  ephemeral-storage:  1699582627075
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056508660Ki
  nvidia.com/gpu:     8
  pods:               110
[...]
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (0%)  64 (25%)
  memory             10Mi (0%)  64Gi (6%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     4          4

In the example output above, you can see bentley's total allocatable resources alongside its current allocations:

Resource   Allocatable   Allocated (Requests)
CPU        256 vCPUs     0.1 vCPUs
Memory     ~1 TiB        10 MiB
GPU        8 GPUs        4 GPUs

Use the registry

We provide a per-workspace private registry for your convenience. You can access the registry at sgs-registry.snucse.org.

The registry shares a high-speed network with the cluster, so image pulls from the registry should be significantly faster than pulls from public registries such as NGC (nvcr.io) or Docker Hub.

Authenticate to the registry

  1. Log in to the registry web interface at sgs-registry.snucse.org.
  2. Click your account name in the top right corner.
  3. Click "User Profile".
  4. Copy the "CLI secret" from the profile modal.
  5. Configure your Docker client to use the registry:
    $ docker login sgs-registry.snucse.org -u <username> -p <cli-secret>
    

Push images to the registry

Navigate to your workspace project's "Repositories" tab and refer to the "Push command" section for instructions on how to push images to the registry.

$ podman tag my-image:latest sgs-registry.snucse.org/ws-5y8frda38hqz1/this/is/my/image:my-tag
$ podman push sgs-registry.snucse.org/ws-5y8frda38hqz1/this/is/my/image:my-tag

Use images from the registry in your workspace

We automatically configure your workspace with the necessary credentials to pull images from your workspace project.

$ kubectl run --rm -it --image sgs-registry.snucse.org/ws-5y8frda38hqz1/this/is/my/image:my-tag registry-test

Do not delete the automatically configured sgs-registry secret. It cannot be recovered once deleted.

Run your workload

Check out the examples page to get started.

Storage

Your container's root filesystem is ephemeral and everything on it will be lost when the container is terminated. Also, each container is limited to 10 GiB of ephemeral storage. If your workload exceeds this limit, the container will be automatically terminated.

$ # run a container that writes to the ephemeral storage
$ kubectl run --image debian:bookworm ephemeral-shell -- bash -c 'cat /dev/zero > /example'

$ # after a while, the pod gets killed automatically
$ kubectl get events -w | grep ephemeral
0s          Normal    Scheduled             pod/ephemeral-shell           Successfully assigned ws-5y8frda38hqz1/ephemeral-shell to bentley
0s          Normal    Pulled                pod/ephemeral-shell           Container image "debian:bookworm" already present on machine
0s          Normal    Created               pod/ephemeral-shell           Created container ephemeral-shell
0s          Normal    Started               pod/ephemeral-shell           Started container ephemeral-shell
2s          Warning   Evicted               pod/ephemeral-shell           Pod ephemeral local storage usage exceeds the total limit of containers 10Gi.
2s          Normal    Killing               pod/ephemeral-shell           Stopping container ephemeral-shell

We strongly recommend that you use persistent volumes for your data. For details, see the persistent volume example.

If you need to install large runtimes (Python libraries through pip, etc.), we recommend building your image with the dependencies pre-installed. You can use our private registry to host your images.
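
As a sketch (the package list is only an example), you could base a Dockerfile on the CUDA image used elsewhere in this manual, bake in your dependencies, and push the result to your workspace project:

$ cat > Dockerfile <<'EOF'
FROM nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
# pre-install Python and the libraries your workload needs (example packages)
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install numpy torch
EOF
$ podman build -t sgs-registry.snucse.org/ws-5y8frda38hqz1/my-image:with-deps .
$ podman push sgs-registry.snucse.org/ws-5y8frda38hqz1/my-image:with-deps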

Examples

To use the available resources efficiently while ensuring your data is safely preserved, we recommend the following workflow:

  1. Create a working pod without GPU resources to work on your code. Ensure you are saving your work in a persistent volume. See the Persistent volume example.
  2. To test and debug your code, use a GPU shell. See the GPU shell example, but ensure you are correctly mounting your persistent volume. Please try to limit the time the GPU shell is running while not actively using the GPU.
  3. Once your code is ready, run it as a GPU workload, ensuring it automatically exits once the job is complete. See the GPU workload example.

Simple ephemeral shell

$ kubectl run --rm -it --image debian:bookworm ephemeral-shell -- /bin/bash
If you don't see a command prompt, try pressing enter.
root@ephemeral-shell:/# nproc
256
root@ephemeral-shell:/# exit
exit
Session ended, resume using 'kubectl attach ephemeral-shell -c ephemeral-shell -i -t' command when the pod is running
pod "ephemeral-shell" deleted

The shell and any files in its filesystem are deleted immediately upon exit. No data is preserved.

Simple persistent shell

Your data will not be preserved across restarts. See the next "Persistent volume" example to preserve data. We cannot recover data lost due to not using persistent volumes.

# simple-persistent-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-persistent-shell
spec:
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']

$ # create the pod
$ kubectl apply -f simple-persistent-shell.yaml
pod/simple-persistent-shell created

$ # open a shell session
$ kubectl exec -it -f simple-persistent-shell.yaml -- bash
root@simple-persistent-shell:/# exit
exit

$ # later, delete the pod
$ kubectl delete -f simple-persistent-shell.yaml
pod "simple-persistent-shell" deleted

Persistent volume

If you want to preserve data across pod restarts and deletions, use a persistent volume.

# persistent-volume.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistent-volume
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi

# persistent-volume-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: persistent-volume-shell
spec:
  volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: persistent-volume
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
      volumeMounts:
        - name: my-volume
          mountPath: /data

$ # create resources
$ kubectl apply -f persistent-volume.yaml
persistentvolumeclaim/persistent-volume created
$ kubectl apply -f persistent-volume-shell.yaml
pod/persistent-volume-shell created

$ # open a shell session
$ kubectl exec -it -f persistent-volume-shell.yaml -- bash
root@persistent-volume-shell:/# df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127p1    100G     0  100G   0% /data

In this example, we mount a 100 GiB volume to the /data directory. Your data will be irrecoverably lost if the PersistentVolumeClaim is deleted.
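
Before relying on the volume, you can verify that the claim exists and is bound:

$ # the STATUS column should show Bound
$ kubectl get pvc persistent-volume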

GPU shell

Use this example to spawn an ephemeral shell with access to GPU resources.

# gpu-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-shell
spec:
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ['/bin/bash', '-c', 'sleep inf']
      resources:
        limits:
          nvidia.com/gpu: 4

$ kubectl apply -f gpu-shell.yaml
pod/gpu-shell created

$ kubectl exec -it -f gpu-shell.yaml -- bash
root@gpu-shell:/# nvidia-smi
Tue Jun  4 11:55:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P0              53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0              51W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:B7:00.0 Off |                    0 |
| N/A   28C    P0              54W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0              58W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

In this example, we create a pod with 4 GPUs attached. These GPU resources are exclusively allocated to your pod as long as this pod is running.

If you allocate GPU resources but let the GPU idle for extended periods of time, we will terminate your pod without warning. Furthermore, your access may be permanently restricted. We actively monitor GPU utilization and take action if we detect abuse.

This warning also applies to "guaranteed" CPU or memory quotas.

GPU workload

Similar to the GPU shell, but the pod completes (and releases its GPU resources) once the process terminates.

# gpu-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  terminationGracePeriodSeconds: 1
  restartPolicy: Never
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ['/bin/bash', '-c', 'nvidia-smi']
      resources:
        limits:
          nvidia.com/gpu: 4

Note the restartPolicy: Never and the modified command line.

$ # create the pod
$ kubectl apply -f gpu-workload.yaml
pod/gpu-workload created

$ # watch the pod start and eventually exit
$ kubectl get -f gpu-workload.yaml --watch
NAME           READY   STATUS    RESTARTS   AGE
gpu-workload   1/1     Running   0          6s
gpu-workload   0/1     Completed   0          7s
^C

$ # view logs (standard outputs)
$ kubectl logs gpu-workload --follow
Tue Jun  4 12:14:54 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P0              53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0              51W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:B7:00.0 Off |                    0 |
| N/A   28C    P0              54W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0              61W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

$ # clean up the pod
$ kubectl delete -f gpu-workload.yaml
pod "gpu-workload" deleted

Logs may be truncated to save space, and will be permanently deleted if you delete your pod. If you want to preserve logs, we recommend writing them to a persistent volume.
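
For example (a sketch combining the GPU workload and persistent volume examples above), the workload's output can be duplicated into the mounted volume with tee:

# gpu-workload-logged.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-logged
spec:
  terminationGracePeriodSeconds: 1
  restartPolicy: Never
  volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: persistent-volume
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      # tee writes a copy of the output to the persistent volume
      command: ['/bin/bash', '-c', 'nvidia-smi 2>&1 | tee /data/nvidia-smi.log']
      volumeMounts:
        - name: my-volume
          mountPath: /data
      resources:
        limits:
          nvidia.com/gpu: 4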

Completed pods do not use node resources. Still, it is a good idea to clean up completed pods you are no longer using, as they can clutter your namespace and may result in name collisions.
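
For example, all completed pods in your namespace can be removed in one command:

$ kubectl delete pods --field-selector status.phase==Succeeded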