Datalab Cluster¶

The cluster is intended for experiments that run for extended periods of time or that process large datasets. It consists of multiple computing nodes. Experiments run in a container-based system (based on Docker images), which lets every user set up their own computing environment (software, libraries, …).

Important notes¶

Resource allocation
Only allocate the resources you need. For example, avoid allocating GPUs if you only run a CPU workload.
Login node
Do not perform computationally intensive work on the login node. Start a Slurm job instead.
Time limit

By default, jobs are created with a time limit of 10 days. When a job exceeds its time limit, it is killed. If you know your job is going to run longer, you can submit it using the “..-long” QOS.


Fair use
So that all jobs can be processed, resources should not be reserved while nothing is running. Whenever possible, use the basic submit command (see Run a python script) so that the resources are freed as soon as the script finishes. Batch jobs can also be used. Furthermore, you can set a time limit with the --time option to make sure a job does not keep running too long without doing any work.
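As a rough illustration of the notes above, a batch job that requests only CPU resources and sets its own time limit might look like the following sketch. The job name, output file name, and resource values are assumptions chosen for illustration, not settings prescribed by this cluster, and the final echo stands in for your actual workload.

```shell
#!/bin/bash
# Hypothetical job name and log file; %j is replaced by the Slurm job ID.
#SBATCH --job-name=example
#SBATCH --output=example-%j.out
# Stop the job after one hour instead of relying on the default time limit.
#SBATCH --time=01:00:00
# Request only what the workload needs (illustrative values).
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
# No GPU is requested for this CPU-only job. For a GPU workload, a line such as
# the following (with a single leading '#') would request one GPU:
##SBATCH --gres=gpu:1

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; fall back to 1 otherwise.
CPUS="${SLURM_CPUS_PER_TASK:-1}"

# Placeholder workload: replace this line with the command that runs your experiment.
echo "Running on ${HOSTNAME:-unknown} with ${CPUS} CPU(s)"
```

Saved under a name of your choice (for example job.sh, a hypothetical file name), the script would be submitted with sbatch job.sh. The resources are released as soon as the script exits or the time limit is reached.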

Changes (August 2018)¶

  • Singularity can be used for running containers
    • Singularity is a tool for running containers especially on cluster environments
    • It can pull images from either singularity-hub or docker-hub
    • Instead of a central image manager, the images are stored as simple files, so users can manage their images on their own
    • Apart from the commands used to run containers, the functionality is the same as with Shifter (but more stable)
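The Singularity workflow described above can be sketched roughly as follows. The image ubuntu:18.04 is only an example chosen for illustration, and the name of the file written by the pull step depends on the installed Singularity version, so the sketch does not rely on it.

```shell
#!/bin/bash
# Example image on Docker Hub (an assumption for illustration; use your own image).
IMAGE_URI="docker://ubuntu:18.04"

if command -v singularity >/dev/null 2>&1; then
    # Download the image and store it as a plain file in the current directory.
    # The resulting file name depends on the Singularity version.
    singularity pull "$IMAGE_URI"

    # Run a single command inside the container. exec also accepts the URI
    # directly, or the path of an image file pulled earlier.
    singularity exec "$IMAGE_URI" cat /etc/os-release
else
    echo "singularity is not available on this machine"
fi
```

Because the pulled image is an ordinary file, it can be copied, renamed, or deleted like any other file in your home directory.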

Changes (November 2017)¶

  • Shifter replaces docker (nvidia-docker) for container environment
    • Shifter can be used in the same way as nvidia-docker was before. It also uses docker images to run containers.
    • Shifter makes the host users available in the container, so you have the same user environment as on the host. As a result, data written from the container also has the correct permissions.
    • Shifter automatically mounts volumes. This ensures you have the same folder structure in the container as on the host. (e.g. data from /home/user/… is also available in the container as /home/user/…)
    • Port mappings are no longer needed, since ports are forwarded directly to the host. As before, you can only use the ports assigned to you.
    • Shifter only allows read-only images, so you cannot install any software at runtime on the cluster.
  • Login
    • You can now use your ZHAW credentials to log in, once this has been enabled for you.
    • Login to the cluster is now only possible on gpulogin.cloudlab.zhaw.ch.
  • Resource Limits
    • To ensure that a single user doesn’t use too many resources, limits have been introduced. (see Resource limits)

Contribute¶

If you have any tips, tricks, examples, or guides that could help others, feel free to add them to this documentation via a GitHub pull request, or just let us know.