SNUCSE GPU Service 3.0

SNUCSE GPU Service (SGS) 3.0 is a free service provided to students in the SNU CSE department. Before using this service, make sure you read this manual very carefully.

This service is provided for research purposes only. Please limit your use to the reasons stated in your workspace application form.

Best-Effort

This service is provided on a best-effort basis. Bacchus volunteers will do their best to ensure smooth operations for everyone, but service may be interrupted due to technical issues, maintenance, or because the resources are in use by other users.

Your containers may be terminated, without warning, at any time. To prevent data loss, ensure all important data is stored in persistent volumes. We cannot recover lost data from terminated containers.

Resource policy

There are two types of quotas: guaranteed (including GPU and storage) and limits.

Guaranteed resources, when assigned to a running pod, are reserved exclusively for the container for the duration of its run-time. This means that even if your container is completely idle, other users will not be able to use those resources.

If we discover that guaranteed resources (including GPUs) have been allocated but are not actually being used, we may take appropriate action, including:

  • Terminating the offending containers
  • Disabling your workspace
  • Restricting access to this service permanently

Limits are the maximum amount of resources that can be used across all containers in the workspace. These may be overcommitted, and we cannot guarantee that compute resources will be available to you unless you use guaranteed quotas.
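
In Kubernetes terms, guaranteed quotas correspond to resource requests and limit quotas to resource limits in your pod specification. A minimal sketch with illustrative values (these are not defaults):

# resource-example.yaml (illustrative values only)
apiVersion: v1
kind: Pod
metadata:
  name: resource-example
spec:
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
      resources:
        requests:          # guaranteed: reserved exclusively while the pod runs
          cpu: '2'
          memory: 4Gi
        limits:            # limit: the maximum the container may use
          cpu: '8'
          memory: 16Gi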

Cleanup

Bacchus may periodically clean up old workspaces to free up resources. Specifically, we plan on performing cleanup operations at the following times:

  • At the beginning of the spring/fall semesters
  • At the beginning of the summer/winter breaks

We will contact users before performing any cleanup operations. If you do not respond in time, your workspace will be first disabled and later deleted. We cannot recover lost data once your workspace has been deleted.

Security policy

As a general rule, each workspace has similar security properties to running on a shared Unix machine with other users.

For example, other users may be able to see the following:

  • names of your pods and containers
  • images being run in your containers
  • command lines of processes running in your containers

However, the following information is hidden from other users:

  • the contents of your persistent volumes or ephemeral storage
  • contents of your secrets
  • your container logs

If you intend to test the security of the service, please inform us in advance.

Contact

SGS is developed and maintained by volunteers at Bacchus. If you have any questions, please reach out to us at contact@bacchus.snucse.org.

Changes from 2.0

If you have used the SGS 2.0 service, you may notice some changes in the 3.0 service.

Workspace request

We have developed a unified workspace and quota management system. You can now request a workspace directly from the web interface. You no longer need to create UserBootstrap objects to create resource quotas.

Nodes

We now manage multiple nodes in a single cluster. Each node belongs to a nodegroup (undergraduate or graduate), and each workspace is assigned to a nodegroup. Pods in your workspace are automatically modified to run only on nodes in your nodegroup. If you need to run on a specific node, use a nodeSelector in your pod configuration, as in the sketch below.
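
A minimal sketch, assuming you want to pin a pod to the "ford" node via the standard kubernetes.io/hostname label:

# node-selector-example.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: pinned-shell
spec:
  nodeSelector:
    kubernetes.io/hostname: ford    # run only on the "ford" node
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']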

Resource model

We no longer grant guaranteed CPU or memory (request) quotas by default. If you are absolutely sure you need request quotas for your use-case, you must justify your request in the workspace request form.

Pod CPU and memory limits are now automatically set to your workspace quota value using LimitRanges. If you need to run multiple containers (multiple containers in a pod, multiple pods, or even both), adjust the limits in your pod configuration.
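
For example, if a pod runs two containers, the LimitRange default (your full workspace quota) would otherwise be applied to each container, which together can exceed your workspace quota. A minimal sketch that splits the limits explicitly (the values are illustrative; adjust them to your own quota):

# multi-container-limits.yaml (illustrative values; adjust to your workspace quota)
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-limits
spec:
  terminationGracePeriodSeconds: 1
  containers:
    - name: trainer
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
      resources:
        limits:
          cpu: '12'
          memory: 48Gi
    - name: monitor
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
      resources:
        limits:
          cpu: '4'
          memory: 16Gi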

Permissions & collaborating with others

Users can now query node details. You no longer need to contact Bacchus to check the status of available node resources.

Multiple users can now be added to a single workspace. If you are collaborating with multiple users, for example for coursework, you can now share a single workspace.

We now enforce the baseline Pod Security Standard. Contact us if this is too restrictive for your use-case.
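
For reference, the baseline standard rejects pods that require host-level privileges. A sketch of a spec that would be refused at admission (the fields shown are examples of what baseline disallows):

# rejected-by-baseline.yaml (illustrative; this pod would be rejected)
apiVersion: v1
kind: Pod
metadata:
  name: privileged-example
spec:
  hostNetwork: true                # host namespaces are not allowed under baseline
  containers:
    - name: app
      image: debian:bookworm
      securityContext:
        privileged: true           # privileged containers are not allowed under baseline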

Registry

Harbor projects are now created automatically upon workspace approval. You no longer need to create the project manually.

We now automatically configure imagePullSecrets for your workspace under the default ServiceAccount. You no longer need to configure this manually, or specify imagePullSecrets in your pod configuration.

Resources and Eligibility

For undergraduate researchers

Eligibility

We offer SNUCSE GPU Service to SNU undergraduate researchers who meet one of the following criteria:

  • Enrolled in Computer Science and Engineering as a single major.
  • Enrolled in Computer Science and Engineering as a double major or minor.
  • Enrolled in the College of Liberal Studies, majoring in Computer Science.

Node group

The node group for undergraduate researchers includes the "ford" node.

  • ProLiant XL645d Gen10 Plus "ford"
    • NVIDIA A100 40GB x 4
    • AMD EPYC 75F3 x 1, 64 threads
    • 768GB DDR4

For graduate researchers

Eligibility

We offer SNUCSE GPU Service to SNU graduate researchers who meet one of the following criteria:

  • Enrolled in Computer Science and Engineering.
  • Enrolled in Interdisciplinary Program in Artificial Intelligence, with an advisor affiliated with the Department of Computer Science and Engineering.

Node group

The node group for graduate researchers includes the "bentley" node.

  • NVIDIA DGX A100 "bentley"
    • NVIDIA A100 40GB x 8
    • AMD EPYC 7742 x 2, 256 threads
    • 1024GB DDR4

Request a workspace

Before continuing, you must already have a SNUCSE ID account registered in either the undergraduate or graduate groups.

To request a workspace, fill out the Workspace request form on the SGS workspace management page.

Workspace request form

After submitting the form, your workspace will be in the "Pending approval" state.

Pending approval

Your workspace request will be reviewed by Bacchus volunteers. Once approved, your workspace will transition to the "Enabled" state.

Enabled

You may request updates to your quota or the users list at any time. Use the "reason" field to explain the purpose of the change. Similar to workspace requests, change requests will be reviewed by Bacchus volunteers.

Request changes

GPU Resource Allocation Request Guidelines

Important: To ensure that our limited GPU resources are allocated in an optimal and transparent manner, we require all researchers to provide a clear and concise justification for their resource requests.

When submitting a request for GPU resources, please provide a concise explanation of how the allocated resources will be utilized to support your research objectives. This justification should include:

  • If you are requesting a workspace in the graduate nodegroup, you must provide the name of your advisor and/or the principal investigator (PI) of the project.
  • A brief overview of your research project, including its goals and objectives.
  • If you (or your team) are targeting a specific journal or conference, please include the Call for Papers (CfP) URL.
    • This helps us prioritize your request if the deadline is approaching.
  • A description of the specific GPU resources required, including:
    • A brief description of the workloads you are going to run on the GPU(s).
    • If you are requesting multiple GPUs, please provide a detailed justification in terms of VRAM requirements or research challenges.
  • A description of your storage requirements. Explain why you need a specific amount of storage by providing details on:
    • The size of the model (i.e. the approximate number of model parameters)
    • The size and number of datasets.
    • Any specific storage-related constraints or limitations.
  • If you entered non-zero values for 'CPU Guaranteed' or 'Memory Guaranteed', please provide a detailed justification for your request.
    • In most cases, you do not need guaranteed CPU or memory resources; guaranteed resources only increase the risk of your workloads being terminated for underutilizing them.
  • An estimate of the expected duration of the project.

Failure to provide a satisfactory justification may result in delays or rejection of your resource request.

For collaborative projects involving multiple researchers, please submit a single workspace request and then add your team members to the workspace by providing their id.snucse.org account usernames. This eliminates the need for individual workspace requests for each researcher.

Applications may be submitted in either Korean or English, and our reviewers will be able to assess them in either language.

Why is this important?

As system administrators, we are accountable to the CSE department for ensuring that our GPU resources are allocated in a way that maximizes their impact on research projects. By providing a clear justification for your resource requests, you are helping us to:

  • Evaluate the merits and urgency of each request and prioritize allocations accordingly.
  • Report to the CSE department on the effective utilization of our GPU resources and the impact on research outcomes.
  • Continuously improve our resource allocation processes to better support the research community.

Thank you for your understanding, and please feel free to reach out to us at contact@bacchus.snucse.org.

Configure access

Install CLI tools

To access our cluster, you need to download and install the Kubernetes CLI tool (kubectl) and the authentication plugin (kubelogin). Refer to the linked pages for installation instructions.
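
For example, on macOS with Homebrew (one of several installation methods; check the linked pages for the current instructions for your platform):

$ # install kubectl
$ brew install kubectl

$ # install kubelogin (the oidc-login plugin for kubectl) from its Homebrew tap
$ brew install int128/kubelogin/kubelogin

Alternatively, kubelogin can be installed as a krew plugin with kubectl krew install oidc-login.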

Verify CLI tool installation

Verify that your CLI tools are installed correctly with the following commands. Your exact version numbers may differ.

$ kubectl version --client
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3

$ kubectl oidc-login --version
kubelogin version v1.28.1

Download the kubeconfig file

Open the SGS workspace management page and navigate to your workspace. Click the download button at the bottom of the page to download the kubeconfig file.

Download kubeconfig

Place your downloaded kubeconfig file in the default kubeconfig location:

  • Unix (Linux, macOS): ~/.kube/config
  • Windows: %USERPROFILE%\.kube\config
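
For example, on Linux or macOS (assuming your browser saved the file as ~/Downloads/kubeconfig; adjust the path to match your download):

$ mkdir -p ~/.kube
$ mv ~/Downloads/kubeconfig ~/.kube/config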

Verify your configuration

Use the kubectl auth whoami command to check that everything is working correctly. It should automatically open a browser window for you to log in to SNUCSE ID. After logging in, you should see output similar to the following:

$ kubectl auth whoami
ATTRIBUTE   VALUE
Username    id:yseong
Groups      [id:undergraduate system:authenticated]

Query node details

Query the list of nodes available to you using kubectl:

$ kubectl get node --selector node-restriction.kubernetes.io/nodegroup=undergraduate
NAME   STATUS   ROLES    AGE   VERSION
ford   Ready    <none>   85d   v1.30.1

$ kubectl get node --selector node-restriction.kubernetes.io/nodegroup=graduate
NAME      STATUS   ROLES    AGE   VERSION
bentley   Ready    <none>   31d   v1.30.1

Your containers will automatically be assigned to one of the nodes in your workspace's nodegroup.

To query the available resources on a node, use the kubectl describe node command.

$ kubectl describe node bentley
Name:               bentley
[...]
Allocatable:
  cpu:                256
  ephemeral-storage:  1699582627075
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056508660Ki
  nvidia.com/gpu:     8
  pods:               110
[...]
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (0%)  64 (25%)
  memory             10Mi (0%)  64Gi (6%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     4          4

In the above example output, you can see that bentley has the following resources available in total:

Resource   Allocatable   Allocated (Requests)
CPU        256 vCPUs     0.1 vCPUs
Memory     ~1 TiB        10 MiB
GPU        8 GPUs        4 GPUs

Use the registry

We provide a per-workspace private registry for your convenience. You can access the registry at sgs-registry.snucse.org.

The registry shares a high-speed network with the cluster, so image pulls from the registry should be significantly faster than pulling from public registries such as nvcr.io or Docker Hub.

Authenticate to the registry

  1. Log in to the registry web interface at sgs-registry.snucse.org.
  2. Click your account name in the top right corner.
  3. Click "User Profile".
  4. Copy the "CLI secret" from the profile modal.
  5. Configure your Docker client to use the registry:
    $ docker login sgs-registry.snucse.org -u <username> -p <cli-secret>
    

Push images to the registry

Navigate to your workspace project's "Repositories" tab and refer to the "Push command" section for instructions on how to push images to the registry.

$ podman tag my-image:latest sgs-registry.snucse.org/ws-5y8frda38hqz1/this/is/my/image:my-tag
$ podman push sgs-registry.snucse.org/ws-5y8frda38hqz1/this/is/my/image:my-tag

Use images from the registry in your workspace

We automatically configure your workspace with the necessary credentials to pull images from your workspace project.

$ kubectl run --rm -it --image sgs-registry.snucse.org/ws-5y8frda38hqz1/this/is/my/image:my-tag registry-test

Do not delete the automatically configured sgs-registry secret. It cannot be recovered once deleted.
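
If you want to confirm that the pull secret is attached, you can inspect the default ServiceAccount in your workspace:

$ # the sgs-registry secret should be listed under imagePullSecrets
$ kubectl get serviceaccount default -o yaml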

Run your workload

Check out the examples page to get started.

Storage

Your container's root filesystem is ephemeral and everything on it will be lost when the container is terminated. Also, each container is limited to 10 GiB of ephemeral storage. If your workload exceeds this limit, the container will be automatically terminated.

$ # run a container that writes to the ephemeral storage
$ kubectl run --image debian:bookworm ephemeral-shell -- bash -c 'cat /dev/zero > /example'

$ # after a while, the pod gets killed automatically
$ kubectl get events -w | grep ephemeral
0s          Normal    Scheduled             pod/ephemeral-shell           Successfully assigned ws-5y8frda38hqz1/ephemeral-shell to bentley
0s          Normal    Pulled                pod/ephemeral-shell           Container image "debian:bookworm" already present on machine
0s          Normal    Created               pod/ephemeral-shell           Created container ephemeral-shell
0s          Normal    Started               pod/ephemeral-shell           Started container ephemeral-shell
2s          Warning   Evicted               pod/ephemeral-shell           Pod ephemeral local storage usage exceeds the total limit of containers 10Gi.
2s          Normal    Killing               pod/ephemeral-shell           Stopping container ephemeral-shell

We strongly recommend that you use persistent volumes for your data. For details, see the persistent volume example.

If you need to install large runtimes (Python libraries through pip, etc.), we recommend building your image with the dependencies pre-installed. You can use our private registry to host your images.
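
A minimal sketch of this approach, assuming podman (as in the registry examples above) and using illustrative Python packages; adapt the base image, dependencies, and image path to your workload:

$ # write a Dockerfile that bakes the dependencies into the image (packages are illustrative)
$ cat > Dockerfile <<'EOF'
FROM nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir torch numpy
EOF

$ # build the image and push it to your workspace project
$ podman build -t sgs-registry.snucse.org/ws-5y8frda38hqz1/my-image:latest .
$ podman push sgs-registry.snucse.org/ws-5y8frda38hqz1/my-image:latest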

Examples

To use the available resources efficiently while ensuring your data is safely preserved, we recommend the following workflow:

  1. Create a working pod without GPU resources to work on your code. Ensure you are saving your work in a persistent volume. See the Persistent volume example.
  2. To test and debug your code, use a GPU shell. See the GPU shell example, but ensure you are correctly mounting your persistent volume. Please try to limit the time the GPU shell is running while not actively using the GPU.
  3. Once your code is ready, run it as a GPU workload, ensuring it automatically exits once the job is complete. See the GPU workload example.

Simple ephemeral shell

$ kubectl run --rm -it --image debian:bookworm ephemeral-shell -- /bin/bash
If you don't see a command prompt, try pressing enter.
root@ephemeral-shell:/# nproc
256
root@ephemeral-shell:/# exit
exit
Session ended, resume using 'kubectl attach ephemeral-shell -c ephemeral-shell -i -t' command when the pod is running
pod "ephemeral-shell" deleted

The shell and any files in its filesystem are deleted immediately upon exit. No data is preserved.

Simple persistent shell

Your data will not be preserved across restarts. See the next "Persistent volume" example to preserve data. We cannot recover data lost due to not using persistent volumes.

# simple-persistent-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-persistent-shell
spec:
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
$ # create the pod
$ kubectl apply -f simple-persistent-shell.yaml
pod/simple-persistent-shell created

$ # open a shell session
$ kubectl exec -it -f simple-persistent-shell.yaml -- bash
root@simple-persistent-shell:/# exit
exit

$ # later, delete the pod
$ kubectl delete -f simple-persistent-shell.yaml
pod "simple-persistent-shell" deleted

Persistent volume

If you want to preserve data across reboots, use a persistent volume.

# persistent-volume.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistent-volume
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
# persistent-volume-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: persistent-volume-shell
spec:
  volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: persistent-volume
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: debian:bookworm
      command: ['/bin/bash', '-c', 'sleep inf']
      volumeMounts:
        - name: my-volume
          mountPath: /data
$ # create resources
$ kubectl apply -f persistent-volume.yaml
persistentvolumeclaim/persistent-volume created
$ kubectl apply -f persistent-volume-shell.yaml
pod/persistent-volume-shell created

$ # open a shell session
$ kubectl exec -it -f persistent-volume-shell.yaml -- bash
root@persistent-volume-shell:/# df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127p1    100G     0  100G   0% /data

In this example, we mount a 100 GiB volume to the /data directory. Your data will be irrecoverably lost if the PersistentVolumeClaim is deleted.

GPU shell

Use this example to spawn an ephemeral shell with access to GPU resources.

# gpu-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-shell
spec:
  terminationGracePeriodSeconds: 1
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ['/bin/bash', '-c', 'sleep inf']
      resources:
        limits:
          nvidia.com/gpu: 4
$ kubectl apply -f gpu-shell.yaml
pod/gpu-shell created

$ kubectl exec -it -f gpu-shell.yaml -- bash
root@gpu-shell:/# nvidia-smi
Tue Jun  4 11:55:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P0              53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0              51W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:B7:00.0 Off |                    0 |
| N/A   28C    P0              54W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0              58W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

In this example, we create a pod with 4 GPUs attached. These GPU resources are exclusively allocated to your pod as long as this pod is running.

If you allocate GPU resources but let the GPU idle for extended periods of time, we will terminate your pod without warning. Furthermore, your access may be permanently restricted. We actively monitor GPU utilization and take action if we detect abuse.

This warning also applies to "guaranteed" CPU or memory quotas.

GPU workload

Similar to the GPU shell, but the pod exits (and de-allocates its GPU resources) once the process terminates.

# gpu-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  terminationGracePeriodSeconds: 1
  restartPolicy: Never
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ['/bin/bash', '-c', 'nvidia-smi']
      resources:
        limits:
          nvidia.com/gpu: 4

Note the restartPolicy: Never and the modified command line.

$ # create the pod
$ kubectl apply -f gpu-workload.yaml
pod/gpu-workload created

$ # watch the pod start and eventually exit
$ kubectl get -f gpu-workload.yaml --watch
NAME           READY   STATUS    RESTARTS   AGE
gpu-workload   1/1     Running   0          6s
gpu-workload   0/1     Completed   0          7s
^C

$ # view logs (standard outputs)
$ kubectl logs gpu-workload --follow
Tue Jun  4 12:14:54 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P0              53W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0              51W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:B7:00.0 Off |                    0 |
| N/A   28C    P0              54W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0              61W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

$ # clean up the pod
$ kubectl delete -f gpu-workload.yaml
pod "gpu-workload" deleted

Logs may be truncated to save space, and will be permanently deleted if you delete your pod. If you want to preserve logs, we recommend writing them to a persistent volume.
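
For example, you could combine the GPU workload and persistent volume examples above so that the output is also written to the mounted volume. A sketch, assuming the persistent-volume claim from the earlier example and an illustrative log path:

# gpu-workload-with-logs.yaml (illustrative; combines the GPU workload and persistent volume examples)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-with-logs
spec:
  terminationGracePeriodSeconds: 1
  restartPolicy: Never
  volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: persistent-volume
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      # tee duplicates stdout/stderr into a log file on the persistent volume
      command: ['/bin/bash', '-c', 'nvidia-smi 2>&1 | tee /data/gpu-workload.log']
      volumeMounts:
        - name: my-volume
          mountPath: /data
      resources:
        limits:
          nvidia.com/gpu: 4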

Completed pods do not use node resources. Still, it is a good idea to clean up completed pods you are no longer using, as they can clutter your namespace and may result in name collisions.
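
For example, you can delete a single completed pod by name, or clean up all pods in the Succeeded phase at once:

$ # delete one completed pod
$ kubectl delete pod gpu-workload

$ # delete every pod that has completed successfully
$ kubectl delete pods --field-selector status.phase=Succeeded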