With funding from the OU Vice President for Research, the Data Institute for Societal Challenges (DISC) has acquired a set of CPU and GPU nodes for the OSCER supercomputer, as well as large-scale storage on OURdisk. These resources are available to all members of DISC who are located on the OU campuses, including faculty, researchers, postdocs, and students.
Purchased resources (only some are available at this time):
2 x 64-core CPU-only nodes with 2 TB of RAM
1 x Quad 80GB A100 node
5 x Dual 80GB A100 nodes
2 x Dual 80GB H100 nodes
93 TB of OURdisk storage
Become a member of DISC: https://www.ou.edu/disc/about/people/disc-membership
Apply for a supercomputer account: https://www.ou.edu/oscer/support/accounts/new_account
Group: for the "Group" field, use the Unix group name already associated with your research group
Apply for access to the DISC supercomputer resources: https://ousurvey.qualtrics.com/jfe/form/SV_ac6ajVyfgZXeWy2
The queues associated with the standard DISC partitions operate according to the typical OSCER policies, which take into account the resources requested by each job and each user's recent resource utilization. A key effect of these policies is that resource allocations are prioritized to balance the available resources across users. To meet deadlines for submission of papers, research proposals/reports, theses, or dissertations, users can request short-term, high priority access to the DISC resources. Jobs in the high priority partitions will generally be executed before jobs in the standard partitions, though they will not interrupt currently executing jobs. Requests take the form of a short proposal that includes the following:
Project title and short narrative (1 paragraph)
Name and username of requester and (if applicable) name of supervisor
Resources requested: CPUs (number of processes and threads), GPUs (number), and memory footprint
Requested duration of the high priority access, in months (typically 1-2 months)
The nature of the deadline that the user is working to meet
Send proposals (pdf format) to: disc@ou.edu
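Once high priority access has been approved, jobs are submitted by selecting the corresponding high priority partition listed in the table below. A minimal sketch, assuming access to disc_dual_a100_hp was granted:
#SBATCH --partition=disc_dual_a100_hp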
Because these resources are shared by a large number of users, it is important for all of us to take steps to make efficient use of them. Specific steps include:
Reserve appropriate amounts of memory and numbers of CPU threads
Reserve GPUs as part of your resource request & only use GPUs that have been explicitly assigned to you
Optimize use of allocated GPUs (our goal is that allocated GPUs be used at near 100% capacity)
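One way to verify that an allocated GPU is actually being kept busy is to log its utilization while the job runs. A minimal sketch, assuming nvidia-smi is available on the GPU nodes and using train.py as a placeholder for your own program:
# Inside your batch file: log GPU utilization every 60 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60 > gpu_util.log &
SMI_PID=$!
# Run the actual program
python train.py
# Stop the utilization logger once the program finishes
kill $SMI_PID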
The current partition state is as follows. Note that not all nodes have been installed yet.
Partition Name | High Priority Name | Notes | Nodes | Threads Available | Memory Available | GPUs Available | Max LSCRATCH
---|---|---|---|---|---|---|---
disc | n/a | All DISC-owned nodes | Currently 12 | n/a | n/a | n/a | n/a
disc_largemem | disc_largemem_hp | Not yet installed | 2 (c915-c916) | 128 | 2 TB | 0 | 852 GB
disc_dual_a100 | disc_dual_a100_hp | 4/5 installed; GPU: 2x A100 80GB | 5 (c862-c866) | 128 | 500 GB | 2 | 852 GB
disc_quad_a100 | disc_quad_a100_hp | GPU: 4x A100 80GB | 1 (c856) | 128 | 1 TB | 4 | 852 GB
disc_dual_h100 | disc_dual_h100_hp | GPU: 2x H100 80GB | 4 (c849-c852) | 128 | 500 GB | 2 | 852 GB
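The live state of these partitions (which nodes are idle, allocated, or down) can be checked with the standard Slurm sinfo command, using the partition names from the table above, for example:
sinfo -p disc,disc_largemem,disc_dual_a100,disc_quad_a100,disc_dual_h100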
Storage options include:
Your own home directory (~20 GB): /home/username
Local scratch disks are high-speed storage (SSDs) attached to each compute node; your local scratch space is created dynamically when your job starts.
The name of your assigned local scratch space is stored in the $LSCRATCH environment variable
The size of your local scratch scales with your CPU request: 852 GB * cpus-per-task / total threads on the node
The local scratch is destroyed as soon as your job completes, so copy any results you want to keep back to permanent storage before the job ends (see the sketch after this list)
DISC OURdisk space is available by request
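As an illustration of the local scratch workflow: a job requesting 10 of a node's 128 threads receives roughly 852 GB * 10 / 128, or about 66 GB, of local scratch. A minimal sketch of staging data in and copying results back (the file names are placeholders for your own data and outputs):
# Stage input data onto the node-local SSD for fast access
cp /home/username/my_dataset.tar $LSCRATCH/
tar -xf $LSCRATCH/my_dataset.tar -C $LSCRATCH
# ... run your program, reading inputs and writing intermediate files under $LSCRATCH ...
# Copy the results back before the job ends; $LSCRATCH is destroyed at job completion
cp $LSCRATCH/results.tar /home/username/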
Batch files specify the details of your jobs, including the resource request and the specific program to execute. For the DISC nodes, your resource request should include the following lines (as applicable); a complete example batch file is sketched after these options.
Select one of the disc partitions:
#SBATCH --partition=disc
Select a specific node or list of nodes to execute on (optional):
#SBATCH --nodelist=c851
Maximum physical memory to use (example: 15 gigabytes):
#SBATCH --mem=15G
Maximum number of threads your program will use (example: 10 threads):
#SBATCH --cpus-per-task=10
Number of GPUs (example: 2 GPUs):
#SBATCH --gpus-per-node=2
(never use GPUs without including this reservation, as you can otherwise disrupt other users)
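Putting these options together, a complete batch file for the DISC nodes might look like the following. This is a minimal sketch: the job name, output file, time limit, and train.py are placeholders to adapt to your own work and to OSCER's policies.
#!/bin/bash
#SBATCH --partition=disc
#SBATCH --job-name=example_job
#SBATCH --output=example_job_%j.out
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=15G
#SBATCH --gpus-per-node=1

# Execute the program with the reserved resources
python train.py
Submit the batch file with sbatch (for example, sbatch example_job.sh) and check its status with squeue -u $USER.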
OSCER supercomputer documentation: https://www.ou.edu/oscer
Presentation on using the OSCER supercomputer (with a focus on deep learning): https://docs.google.com/presentation/d/1ctPshEn6Mj8lYwBqhk0YgJQ8yMYLzattO-y6BRp1Il8/edit?usp=sharing
OSCER-specific help: support@oscer.ou.edu
Setting up experiments, optimizing resource use, deep learning: Dr. Andrew H. Fagg, DISC & School of Computer Science (andrewhfagg@gmail.com)
Deep learning resources: https://github.com/Symbiotic-Computing-Laboratory/deep_learning_practice
For more information, please contact Chongle Pan at cpan@ou.edu and Andrew Fagg at andrewhfagg@gmail.com