Nvidia DGX-1

The Stanford Computer Vision Lab has added an Nvidia DGX-1 machine to its compute cluster. Currently, it is only accessible via the SAIL network. Please file a help request at http://support.cs.stanford.edu if you have any questions about using the machine.


Specification

Hostname    visionlab-dgx1.stanford.edu
CPU         2x Intel Xeon E5-2698 v4 (20 cores each, 2.2 GHz)
RAM         512GB
GPU         8x Tesla P100
Networking  10GbE
Storage     4x 2TB SSD (RAID 0), NFS-shared storage

Nvidia-Docker

Nvidia recommends using Nvidia-Docker and its provided containers for optimized performance and convenience. The following table lists the containers that Nvidia officially supports and that are available on the DGX-1 as of this writing.

REPOSITORY                  TAG
nvidia/cuda                 latest
nvcr.io/nvidia/digits       17.04
nvcr.io/nvidia/caffe        17.04
nvcr.io/nvidia/tensorflow   17.04
nvcr.io/nvidia/pytorch      17.04
nvcr.io/nvidia/caffe2       17.04
nvcr.io/nvidia/theano       17.04
nvcr.io/nvidia/mxnet        17.04
nvcr.io/nvidia/cntk         17.04
nvcr.io/nvidia/torch        17.04

Here is how to get started.

  • Request access to visionlab-dgx1.stanford.edu by filing a support request at https://support.cs.stanford.edu. Please state which sponsoring faculty member you are working with.
  • SSH into visionlab-dgx1.stanford.edu from the campus network, or from off campus via the Stanford VPN service (a full, non-split tunnel is required).

# List the currently loaded images; the Nvidia images should already be loaded. Note the TAG column: you will need it when running the docker command.
docker images

REPOSITORY                  TAG                 IMAGE ID            CREATED             SIZE
nvidia/cuda                 latest              569f547756e0        8 days ago          1.671 GB
nvcr.io/nvidia/digits       17.04               3736f3fe071f        4 weeks ago         4.171 GB
nvcr.io/nvidia/caffe        17.04               87c288427f2d        4 weeks ago         2.794 GB
nvcr.io/nvidia/tensorflow   17.04               121558cb5849        6 weeks ago         3.028 GB
nvcr.io/nvidia/pytorch      17.04               2f0834174e65        6 weeks ago         3.793 GB
nvcr.io/nvidia/caffe2       17.04               e5b67a4f6726        6 weeks ago         2.633 GB
nvcr.io/nvidia/theano       17.04               24943feafc9b        6 weeks ago         2.386 GB
nvcr.io/nvidia/mxnet        17.04               24afec0cd359        7 weeks ago         2.338 GB
nvcr.io/nvidia/cntk         17.04               61e61de9fa43        7 weeks ago         5.741 GB
nvcr.io/nvidia/torch        17.04               a337ffb42c8e        7 weeks ago         2.9 GB
nvidia/cuda                 7.5                 cf43500d0050        5 months ago        1.232 GB
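
Since later commands need each image as REPOSITORY:TAG, it can be convenient to print the listing in that form. The following is a minimal sketch, not part of the official setup: `image_refs` is a hypothetical helper name, and it assumes the table layout shown in the listing above (a header row, then REPOSITORY and TAG in the first two columns).

```shell
#!/bin/sh
# Hypothetical helper: turn `docker images` table output (read from stdin)
# into REPOSITORY:TAG references, one per line.
image_refs() {
    # Skip the header row; columns 1 and 2 are REPOSITORY and TAG.
    awk 'NR > 1 && NF >= 2 { print $1 ":" $2 }'
}

# On the DGX-1 this would be used as:
#   docker images | image_refs
# Here two rows copied from the listing above stand in for the real output.
image_refs <<'EOF'
REPOSITORY                  TAG                 IMAGE ID            CREATED             SIZE
nvidia/cuda                 latest              569f547756e0        8 days ago          1.671 GB
nvcr.io/nvidia/torch        17.04               a337ffb42c8e        7 weeks ago         2.9 GB
EOF
```

Docker releases that support the `--format` option should be able to produce the same list directly with `docker images --format '{{.Repository}}:{{.Tag}}'`; whether the Docker version installed on this machine supports it is an assumption to verify.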

# Optionally, load your own image from a tar archive
docker load --input /raid/scratch/u/<framework>.tar

# Verify GPU access by running nvidia-smi inside a container
nvidia-docker run --rm nvidia/cuda nvidia-smi

jimmyw@visionlab-dgx1:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
Wed May 24 23:59:21 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 0000:06:00.0     Off |                    0 |
| N/A   35C    P0    31W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 0000:07:00.0     Off |                    0 |
| N/A   37C    P0    34W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 0000:0A:00.0     Off |                    0 |
| N/A   36C    P0    35W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 0000:0B:00.0     Off |                    0 |
| N/A   37C    P0    33W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  Off  | 0000:85:00.0     Off |                    0 |
| N/A   38C    P0    32W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  Off  | 0000:86:00.0     Off |                    0 |
| N/A   35C    P0    32W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  Off  | 0000:89:00.0     Off |                    0 |
| N/A   37C    P0    33W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  Off  | 0000:8A:00.0     Off |                    0 |
| N/A   38C    P0    32W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
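
Because the machine is shared, it may be useful to restrict a container to GPUs that look idle. The sketch below makes several assumptions: `idle_gpus` is a hypothetical helper, "no memory in use" is only a rough proxy for an idle GPU, and the `NV_GPU` variable is honored by the `nvidia-docker` wrapper installed on this machine (it is documented for nvidia-docker 1.x). The sample rows are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "index, memory.used" CSV rows from stdin and
# print the indices of GPUs with no memory in use as a comma-separated list.
idle_gpus() {
    awk -F', *' '$2 == 0 { print $1 }' | paste -sd, -
}

# On the DGX-1 the rows would come from nvidia-smi:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | idle_gpus
# and the resulting list could be passed to nvidia-docker through NV_GPU, e.g.:
#   NV_GPU="0,2" nvidia-docker run --rm nvidia/cuda nvidia-smi
# Made-up sample rows stand in for real output here.
printf '0, 0\n1, 15234\n2, 0\n' | idle_gpus
```

With the sample rows, GPUs 0 and 2 report no memory in use, so the helper prints `0,2`. Another user may claim a GPU between the check and the launch, so treat the result as a hint rather than a reservation.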

# Launch a framework container in interactive mode. Specify the image as REPOSITORY:TAG; the tag can be omitted only when it is "latest".
nvidia-docker run --rm -ti nvcr.io/nvidia/torch:17.04

jimmyw@visionlab-dgx1:~$ nvidia-docker run --rm -ti nvcr.io/nvidia/torch:17.04
  ______             __   |  Torch7
 /_  __/__  ________/ /   |  Scientific computing for Lua.
  / / / _ \/ __/ __/ _ \  |
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch
                          |  http://torch.ch

NVIDIA Release 17.04 (build 17724)

Container image Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
Copyright (c) 2016, Soumith Chintala, Ronan Collobert, Koray Kavukcuoglu, Clement Farabet
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

root@0a24ff58cde5:/workspace# th

  ______             __   |  Torch7
 /_  __/__  ________/ /   |  Scientific computing for Lua.
  / / / _ \/ __/ __/ _ \  |  Type ? for help
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch
                          |  http://torch.ch

th>