With funding from the OU Vice President for Research, the Data Institute for Societal Challenges (DISC) has acquired a set of CPU and GPU nodes for the OSCER supercomputer, as well as large-scale storage on OURdisk. These resources are available to all members of DISC who are located on the OU campuses, including faculty, researchers, postdocs, and students.
Purchased resources (only some are available at this time):
2 x 64-core CPU-only nodes with 2 TB of RAM
1 x Quad 80GB A100 node
5 x Dual 80GB A100 nodes
2 x Dual 80GB H100 nodes
93 TB of OURdisk storage
Become a member of DISC: https://www.ou.edu/disc/about/people/disc-membership
Apply for a supercomputer account: https://www.ou.edu/oscer/support/accounts/new_account
Group: for the "Group" field, use the Unix group name already associated with your research group
Apply for access to the DISC supercomputer resources: https://ousurvey.qualtrics.com/jfe/form/SV_ac6ajVyfgZXeWy2
The queues associated with the standard DISC partitions operate according to the typical OSCER policies, which take into account the resources requested by each job and each user's recent resource utilization. A key effect of these policies is that resource allocations are prioritized to balance the available resources across users. To meet deadlines for submission of papers, research proposals/reports, theses, or dissertations, users can request short-term, high priority access to the DISC resources. Jobs in the high priority partitions will generally be executed before jobs in the standard partitions, though they will not interrupt currently executing jobs. Requests take the form of a short proposal that includes the following:
Project title and short narrative (1 paragraph)
Name and username of requester and (if applicable) name of supervisor
Resources requested: CPUs (number of processes and threads), GPUs (number), and memory footprint
Requested duration of the high priority access, in months (typically 1-2 months)
The nature of the deadline that the user is working to meet
Send proposals (pdf format) to: disc@ou.edu
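Once high priority access has been approved, jobs are submitted by selecting the corresponding high priority partition listed in the table below. A minimal sketch, assuming access to disc_dual_a100_hp was granted:
#SBATCH --partition=disc_dual_a100_hp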
Because these resources are shared by a large number of users, it is important for all of us to take steps to make efficient use of them. Specific steps include:
Reserve appropriate amounts of memory and numbers of CPU threads
Reserve GPUs as part of your resource request & only use GPUs that have been explicitly assigned to you
Optimize use of allocated GPUs (our goal is that allocated GPUs be used at near 100% capacity)
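One way to verify that an allocated GPU is actually being kept busy is to log its utilization while the job runs. A minimal sketch, assuming nvidia-smi is available on the GPU nodes and using train.py as a placeholder for your own program:
# Inside your batch file: log GPU utilization every 60 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60 > gpu_util.log &
SMI_PID=$!
# Run the actual program
python train.py
# Stop the utilization logger once the program finishes
kill $SMI_PID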
The current partition state is as follows. Note that not all nodes have been installed yet.
Partition Name | High Priority Name | Notes | Nodes | Threads Available | Memory Available | GPUs Available | Max LSCRATCH
---|---|---|---|---|---|---|---
disc | n/a | All DISC-owned nodes | Currently 12 | n/a | n/a | n/a | n/a
disc_largemem | disc_largemem_hp | Not yet installed | 2 (c915-c916) | 128 | 2 TB | 0 | 852 GB
disc_dual_a100 | disc_dual_a100_hp | 4/5 installed; GPU: 2x A100 80GB | 5 (c862-c866) | 128 | 500 GB | 2 | 852 GB
disc_quad_a100 | disc_quad_a100_hp | GPU: 4x A100 80GB | 1 (c856) | 128 | 1 TB | 4 | 852 GB
disc_dual_h100 | disc_dual_h100_hp | GPU: 2x H100 80GB | 4 (c849-c852) | 128 | 500 GB | 2 | 852 GB
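The live state of these partitions (which nodes are idle, allocated, or down) can be checked with the standard Slurm sinfo command, using the partition names from the table above, for example:
sinfo -p disc,disc_largemem,disc_dual_a100,disc_quad_a100,disc_dual_h100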
Storage options include:
Your own home directory (~20 GB): /home/username
Local scratch disks are high-speed storage (SSDs) attached to each compute node; your local scratch space is created dynamically when your job starts.
The name of your assigned local scratch space is stored in the $LSCRATCH environment variable
The size of your local scratch scales with your CPU request: 852 GB * cpus-per-task / total threads on the node
The local scratch is destroyed as soon as your job completes, so copy any results you want to keep back to permanent storage before the job ends (see the sketch after this list)
DISC OURdisk space is available by request
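As an illustration of the local scratch workflow: a job requesting 10 of a node's 128 threads receives roughly 852 GB * 10 / 128, or about 66 GB, of local scratch. A minimal sketch of staging data in and copying results back (the file names are placeholders for your own data and outputs):
# Stage input data onto the node-local SSD for fast access
cp /home/username/my_dataset.tar $LSCRATCH/
tar -xf $LSCRATCH/my_dataset.tar -C $LSCRATCH
# ... run your program, reading inputs and writing intermediate files under $LSCRATCH ...
# Copy the results back before the job ends; $LSCRATCH is destroyed at job completion
cp $LSCRATCH/results.tar /home/username/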
Batch files specify the details of your jobs, including the resource request and the specific program to execute. For the DISC nodes, your resource request should include the following lines (as applicable); a complete example batch file is sketched after these options.
Select one of the disc partitions:
#SBATCH --partition=disc
Select a specific node or list of nodes to execute on (optional):
#SBATCH --nodelist=c851
Maximum physical memory to use (example: 15 gigabytes):
#SBATCH --mem=15G
Maximum number of threads your program will use (example: 10 threads):
#SBATCH --cpus-per-task=10
Number of GPUs (example: 2 GPUs):
#SBATCH --gpus-per-node=2
(never use GPUs without including this reservation, as you can otherwise disrupt other users)
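Putting these options together, a complete batch file for the DISC nodes might look like the following. This is a minimal sketch: the job name, output file, time limit, and train.py are placeholders to adapt to your own work and to OSCER's policies.
#!/bin/bash
#SBATCH --partition=disc
#SBATCH --job-name=example_job
#SBATCH --output=example_job_%j.out
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=15G
#SBATCH --gpus-per-node=1

# Execute the program with the reserved resources
python train.py
Submit the batch file with sbatch (for example, sbatch example_job.sh) and check its status with squeue -u $USER.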
OSCER supercomputer documentation: https://www.ou.edu/oscer
Presentation on using the OSCER supercomputer (with a focus on deep learning): https://docs.google.com/presentation/d/1ctPshEn6Mj8lYwBqhk0YgJQ8yMYLzattO-y6BRp1Il8/edit?usp=sharing
OSCER-specific help: support@oscer.ou.edu
Setting up experiments, optimizing resource use, deep learning: Dr. Andrew H. Fagg, DISC & School of Computer Science (andrewhfagg@gmail.com)
Deep learning resources: https://github.com/Symbiotic-Computing-Laboratory/deep_learning_practice
For more information, please contact Chongle Pan at cpan@ou.edu and Andrew Fagg at andrewhfagg@gmail.com