
Python with TensorFlow or PyTorch


!!! CAUTION !!! Please do NOT run the python command directly in the SSH terminal!

By doing so, you would be running a potentially intensive computing task on the OSCER login machines. Intensive computing tasks have a tendency to crash the login machines, preventing all other 1000+ OSCER users from logging in! Please make sure to read the Python Basic Setup instruction to understand how to properly set up a Python environment and submit a Python batch job on the supercomputer.

If you have trouble following the instructions below, feel free to join OSCER's weekly Zoom help sessions.

Python TensorFlow / PyTorch Setup WITHOUT Mamba/Conda

If you're doing deep learning or neural network research, tensorflow and pytorch need no introduction: they are the two most common frameworks for deep learning and AI applications. There are many different ways to install tensorflow/torch. In this guide, we show you how to install tensorflow/torch and properly load the required modules on the supercomputer (e.g. cuDNN) in a python virtual environment, WITHOUT using conda/miniconda/mamba. The conda approach is covered in our Mamba (conda) instruction.

If you haven't read the Python Basic Setup instruction, please do so. The steps here are the same as in the Python Basic Setup instruction, except:

  1. In step 3 (activating a python virtual environment and installing packages), for tensorflow, type:
    pip install tensorflow

    to install the latest tensorflow version.
    If you need a specific tensorflow version, such as 2.13.x, type:
    pip install tensorflow==2.13.*


    If you plan to use TensorRT (for real-time inference) and/or want to suppress the "TensorRT not found" warning, type:
    pip install tensorrt==8.6.0

    Note that this will take up about 1 GB of your precious home storage, so unless you really need it, you don't have to install it.
    As of this writing, the latest version of TensorRT (8.6.1) sometimes does not install correctly, so we recommend installing the version that is one patch release behind the latest (8.6.0).

    For the latest pytorch built with CUDA 11.8, first you need to install the wheel package:
    pip install wheel

    Then, according to the official pytorch installation instruction, type:
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
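
    Once the packages are installed, you can confirm what pip put into your virtual environment with a short python script like the one below (a minimal sketch assuming both tensorflow and the cu118 pytorch wheel were installed as above; run it as a batch job as described in step 4, NOT directly on the login machine):

    # minimal sketch: print the installed versions (submit as a batch job, not on the login node)
    import tensorflow as tf
    import torch

    print("tensorflow:", tf.__version__)
    print("torch:", torch.__version__)
    print("torch built for CUDA:", torch.version.cuda)  # reports 11.8 for the cu118 wheel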


  2. In step 4 (creating and submitting a python batch job), for both tensorflow and pytorch, you need to load a cuDNN module associated with your tensorflow/pytorch version in your batch script before the main command. For example, if you need to load cuDNN version 8.6.0.163 with CUDA 11.8, type:
    module load cuDNN/8.6.0.163-CUDA-11.8.0

    (it's Case SenSiTive)
    The cuDNN/8.6.0.163-CUDA-11.8.0 module also works with pytorch built with CUDA 11.8.

    Again, to see which cuDNN modules are available on the supercomputer, type:
    module avail cudnn

    (case insensitive)

    See the tested build configurations page on the TensorFlow website for a list of tensorflow builds with their tested Python, CUDA, and cuDNN versions.

    Example of tensorflow batch script and python code:
    Below is the content of my batch script test_tensorflow.sbatch, designed to submit a simple tensorflow python command from my test project on partition debug, requesting 1 CPU, 1 GB of memory, for 10 minutes:
    #!/bin/bash
    #
    #SBATCH --partition=debug
    #SBATCH --output=python_%J_stdout.txt
    #SBATCH --error=python_%J_stderr.txt
    #SBATCH --ntasks=1
    #SBATCH --mem=1G
    #SBATCH --time=00:10:00

    module load Python/3.10.8-GCCcore-12.2.0
    module load cuDNN/8.6.0.163-CUDA-11.8.0
    source $HOME/test/test_env/bin/activate

    python ~/test/test_tensorflow.py


    And my test_tensorflow.py file has the following content to print out the tensorflow version:
    import tensorflow as tf
    print(tf.__version__)
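
    If you eventually run tensorflow on a GPU node, you can extend test_tensorflow.py with a check like the sketch below, which lists the GPUs tensorflow can see (it prints an empty list unless the job actually runs on a GPU node with a GPU requested, e.g. via an #SBATCH --gres=gpu:1 line, which the example script above does not include):

    # optional sketch: list the GPUs visible to tensorflow
    import tensorflow as tf
    print(tf.config.list_physical_devices('GPU'))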


    Example of torch batch script and python code:
    Below is the content of my batch script test_torch.sbatch, designed to submit a simple torch python command from my test project on partition debug, requesting 1 CPU, 1 GB of memory, for 10 minutes:
    #!/bin/bash
    #
    #SBATCH --partition=debug
    #SBATCH --output=python_%J_stdout.txt
    #SBATCH --error=python_%J_stderr.txt
    #SBATCH --ntasks=1
    #SBATCH --mem=1G
    #SBATCH --time=00:10:00

    module load Python/3.10.8-GCCcore-12.2.0
    module load cuDNN/8.6.0.163-CUDA-11.8.0
    source $HOME/test/test_env/bin/activate

    python ~/test/test_torch.py


    And my test_torch.py file has the following content to print out the torch version:
    import torch
    print(torch.__version__)
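
    Similarly, if you run pytorch on a GPU node, a check like the sketch below reports whether torch can see a GPU (again, this requires submitting to a GPU node with a GPU requested, e.g. #SBATCH --gres=gpu:1, which the example script above does not do):

    # optional sketch: check whether torch can see a GPU
    import torch
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))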



Mamba/Conda TensorFlow/PyTorch Setup

If you insist on setting up your machine-learning/AI environment using mamba/conda, please first read our OSCER Mamba instruction. At the end of step 2, once you have created and activated your mamba environment, you can install additional packages, such as tensorflow or pytorch.

For tensorflow: in step 2, after you have activated your mamba environment, type:
mamba install -c conda-forge tensorflow=2.15.0

This command will install tensorflow 2.15.0 into your mamba environment. Refer to the conda-forge tensorflow page to see the latest tensorflow version available on Anaconda.
Then, in step 3, you need to load the correct CUDA and cuDNN versions associated with your tensorflow version before the main python command in your batch script. For tensorflow 2.15.0, you need to load the cuDNN/8.9.2.26-CUDA-12.2.0 module. Here is an example batch script for mamba tensorflow 2.15.0:
#!/bin/bash
#
#SBATCH --partition=debug
#SBATCH --output=conda_%J_stdout.txt
#SBATCH --error=conda_%J_stderr.txt
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:10:00

module load cuDNN/8.9.2.26-CUDA-12.2.0
python test_mamba_tensorflow.py
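
The batch script above runs test_mamba_tensorflow.py; a minimal sketch of what that file could contain, mirroring the earlier test_tensorflow.py and assuming the mamba environment from step 2 is available to the job, is:

import tensorflow as tf

# print the tensorflow version and the CUDA/cuDNN versions this build targets
print(tf.__version__)
build_info = tf.sysconfig.get_build_info()
print("built with CUDA:", build_info.get("cuda_version"))
print("built with cuDNN:", build_info.get("cudnn_version"))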

If you prefer to install your own cuDNN and CUDA packages from your favorite mamba repo (such as conda-forge), keep in mind that they will take up quite a bit of your home directory space (expect at LEAST 3-4 GB for the cuDNN and CUDA toolkit packages). Currently, the conda-forge repo does NOT have CUDA toolkit 12.2.0 yet - another reason why we do NOT recommend this approach. Once it does, you would need to install it along with cuDNN 8.9.7.29 after activating your mamba environment in step 2 of the OSCER Mamba instruction:
mamba install -c conda-forge cudatoolkit=12.2.0
mamba install -c conda-forge cudnn=8.9.7.29

If you install your own CUDA toolkit and cuDNN packages from conda-forge repo, you do NOT need to load the OSCER cuDNN module in your batch script.

For PyTorch: in step 2, after you have activated your mamba environment, type:
mamba install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

This command is based on the official getting-started instructions on the PyTorch website. There is no need to load OSCER's CUDA toolkit module to run pytorch in your batch script.
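
To confirm that the mamba-installed pytorch works and was built against CUDA 12.1 (the version requested by the command above), a minimal test script along the lines of the earlier test_torch.py could be:

import torch

# print the torch version and the CUDA version this build targets
print(torch.__version__)
print("built for CUDA:", torch.version.cuda)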