!!! CAUTION !!! Please do NOT run python commands directly in the SSH terminal!
By doing so, you are running a potentially intensive computing task on the OSCER login machines. Intensive computing tasks tend to crash the login machines, which prevents all 1000+ other OSCER users from logging in! Please make sure to read the Python Basic Setup instructions to understand how to properly set up a Python environment and submit a Python batch job on the supercomputer.
If you have trouble following the instructions below, feel free to join the OSCER weekly Zoom help sessions.
Python Tensorflow Setup WITHOUT Mamba/Conda
If you're doing deep learning or neural network research, tensorflow needs no introduction. It is one of the earliest widely used frameworks for deep learning and AI applications. There are many different ways to install tensorflow. In this guide, we show you how to install tensorflow in a python virtual environment and properly load the required modules on the supercomputer (i.e. cuDNN), WITHOUT using conda/miniconda/mamba. For the conda-based approach, see our Mamba (conda) instructions.
If you haven't read the Python Basic Setup instructions, please do so. The steps are the same as in the Python Basic Setup instructions, except:
- In step 3 (activating a python virtual environment and installing packages), for tensorflow, type:
pip install tensorflow
to install the latest tensorflow version.
If you need a specific tensorflow version, such as 2.13.x, type:
pip install tensorflow==2.13.*
If you plan to use tensorRT (NVIDIA's library for optimized inference) and/or want to suppress the "tensorRT not found" warning, type:
pip install tensorrt==8.6.0
Note that this will take up about 1GB of your precious home storage, so unless you really need it, you don't have to install it.
As of this writing, the latest version of tensorRT (8.6.1) sometimes fails to install correctly. We therefore recommend installing the release just before it (8.6.0).
- In step 4 (creating and submitting a python batch job), for both tensorflow and pytorch, you need to load a cuDNN module associated with your tensorflow version in your batch script before the main command. For example, if you need to load cuDNN version 8.6.0.163 with CUDA 11.8, type:
module load cuDNN/8.6.0.163-CUDA-11.8.0
(it's Case SenSiTive)
The cuDNN/8.6.0.163-CUDA-11.8.0 module also works with pytorch built with CUDA 11.8.
Again, to see which cuDNN modules are available on the supercomputer, type:
module avail cudnn
(case insensitive)
See here for a list of tensorflow builds with tested Python, CUDA, and cuDNN versions.
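Before picking a cuDNN module, it helps to confirm which tensorflow (and tensorRT) version pip actually installed in your virtual environment. The following is a minimal sketch using only the Python standard library; the helper names installed_version and matches_series are our own and purely illustrative. As with any python command, run it from a batch job, not on the login machines.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_series(version, series):
    """True if `version` equals `series` or starts with `series` plus a dot.

    For example, "2.13.1" matches the series "2.13", but "2.130.0" does not.
    """
    if version is None:
        return False
    return version == series or version.startswith(series + ".")


if __name__ == "__main__":
    # Report what is installed in the currently active environment.
    for name in ("tensorflow", "tensorrt"):
        print(name, installed_version(name) or "not installed")
```

If you pinned tensorflow to 2.13.x as shown above, matches_series(installed_version("tensorflow"), "2.13") should return True inside that environment.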
Example of tensorflow batch script and python code:
*** UPDATE 20241007 *** you need to define two environment variables, XLA_FLAGS and CUDA_DIR, to prevent tensorflow jobs from being terminated early on the supercomputer:
export XLA_FLAGS="--xla_gpu_cuda_data_dir=${CUDA_HOME}"
export CUDA_DIR=${CUDA_HOME}
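One thing to watch for: if CUDA_HOME is empty when these export lines run (for example, because the cuDNN module was not loaded first), both variables end up empty without any error message. The sketch below is one way to catch that at the top of your python script. It assumes CUDA_HOME is provided by the loaded module, as the export lines imply; the function name missing_cuda_settings is hypothetical.

```python
import os


def missing_cuda_settings(env):
    """Return a list of problems with the XLA/CUDA variables in `env`.

    `env` is any mapping of variable names to values, e.g. os.environ.
    """
    problems = []
    if not env.get("CUDA_DIR", ""):
        problems.append("CUDA_DIR is unset or empty")

    # XLA_FLAGS may hold several space-separated flags; look for ours.
    flag_prefix = "--xla_gpu_cuda_data_dir="
    values = [
        flag[len(flag_prefix):]
        for flag in env.get("XLA_FLAGS", "").split()
        if flag.startswith(flag_prefix)
    ]
    if not values or not values[0]:
        problems.append("XLA_FLAGS has no --xla_gpu_cuda_data_dir value")
    return problems


if __name__ == "__main__":
    for problem in missing_cuda_settings(os.environ):
        print("WARNING:", problem)
```

An empty list means both variables carry a value. It does not verify that the directory they point to is a working CUDA installation.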
Below is the content of my batch script test_tensorflow.sbatch, designed to submit a simple tensorflow python command from my test project on the debug partition, requesting 1 CPU, 1GB of memory, for 10 minutes:
#!/bin/bash
#
#SBATCH --partition=debug
#SBATCH --output=python_%J_stdout.txt
#SBATCH --error=python_%J_stderr.txt
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:10:00
module load Python/3.10.8-GCCcore-12.2.0
module load cuDNN/8.6.0.163-CUDA-11.8.0
source $HOME/test/test_env/bin/activate
export XLA_FLAGS="--xla_gpu_cuda_data_dir=${CUDA_HOME}"
export CUDA_DIR=${CUDA_HOME}
python ~/test/test_tensorflow.py
And my test_tensorflow.py file has the following content to print out the tensorflow version:
import tensorflow as tf
print(tf.__version__)
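If you also want to know whether tensorflow was built with CUDA and whether it can see a GPU, the test script can be extended along these lines. This is a sketch, not part of the original test project; the function name tensorflow_report is our own. The example batch script above does not request a GPU, so an empty GPU list is likely the expected result for that job.

```python
def tensorflow_report():
    """Collect basic facts about the tensorflow install as a dict."""
    try:
        import tensorflow as tf
    except ImportError as exc:
        # Wrong environment activated, or tensorflow not installed in it.
        return {"installed": False, "error": str(exc)}

    gpus = tf.config.list_physical_devices("GPU")
    return {
        "installed": True,
        "version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [gpu.name for gpu in gpus],
    }


if __name__ == "__main__":
    for key, value in tensorflow_report().items():
        print(f"{key}: {value}")
```

The output lands in the stdout file named by the --output line of your batch script.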
Mamba/Conda Tensorflow setup
If you insist on setting up your machine-learning/AI environment using mamba/conda, please first read our OSCER Mamba instructions. At the end of step 2, once you have created and activated your mamba environment, you can install additional packages, such as tensorflow.
For tensorflow: in step 2, after you have activated your mamba environment, type:
mamba install -c conda-forge tensorflow=2.15.0
This command will install tensorflow 2.15.0 into your mamba environment. Refer to the conda-forge tensorflow page to see the latest version of tensorflow available there.
Then, in step 3, you need to load the correct CUDA and cuDNN versions associated with your tensorflow version before the main python command in your batch script.
*** UPDATE 20241007 *** you need to define two environment variables, XLA_FLAGS and CUDA_DIR, to prevent tensorflow jobs from being terminated early on the supercomputer:
export XLA_FLAGS="--xla_gpu_cuda_data_dir=${CUDA_HOME}"
export CUDA_DIR=${CUDA_HOME}
For tensorflow 2.15.0, you need to load cuDNN version 8.9.2.26-CUDA-12.2.0. Here is an example batch script for mamba tensorflow 2.15.0:
#!/bin/bash
#
#SBATCH --partition=debug
#SBATCH --output=conda_%J_stdout.txt
#SBATCH --error=conda_%J_stderr.txt
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:10:00
module load cuDNN/8.9.2.26-CUDA-12.2.0
export XLA_FLAGS="--xla_gpu_cuda_data_dir=${CUDA_HOME}"
export CUDA_DIR=${CUDA_HOME}
python test_mamba_tensorflow.py
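The content of test_mamba_tensorflow.py is not shown here. If your job fails to import tensorflow, one common cause is that the python picked up by the batch script is not the one from your mamba environment. The sketch below is a hypothetical check you could place at the top of such a script. It relies on the CONDA_PREFIX variable, which conda/mamba set when an environment is activated; the function name environment_summary is our own.

```python
import os
import sys


def environment_summary(env=None):
    """Describe which python is running and whether it belongs to the
    conda/mamba environment named by CONDA_PREFIX in `env`."""
    env = os.environ if env is None else env
    prefix = env.get("CONDA_PREFIX")
    in_env = bool(prefix) and (
        os.path.realpath(sys.prefix) == os.path.realpath(prefix)
    )
    return {
        "python": sys.executable,
        "conda_prefix": prefix,
        "python_in_env": in_env,
    }


if __name__ == "__main__":
    for key, value in environment_summary().items():
        print(f"{key}: {value}")
```

If python_in_env is False, make sure the environment is activated inside the batch script before the python command, as described in the OSCER Mamba instructions.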
If you prefer to install your own cuDNN and CUDA packages from your favorite mamba repo (such as conda-forge), keep in mind that they will consume quite a bit of your home directory (expect at LEAST 3-4GB for cuDNN and the CUDA toolkit). As of this writing, the conda-forge repo does NOT have CUDA toolkit 12.2.0 yet, which is another reason why we do NOT recommend this approach. Once it does, you need to install it and cuDNN 8.9.7.29 after activating your mamba environment in step 2 of the OSCER Mamba instructions:
mamba install -c conda-forge cudatoolkit=12.2.0
mamba install -c conda-forge cudnn=8.9.7.29
If you install your own CUDA toolkit and cuDNN packages from the conda-forge repo, you do NOT need to load the OSCER cuDNN module in your batch script.