I’ve been working on a server with 3 NVIDIA GPUs for my internship. A few months ago, I noticed that the same GPU can show up with a different ID depending on where you look, so I decided to take a closer look at how the IDs are enumerated.
What is the issue exactly?
There are potentially two different GPU ID orders for the same machine: one from nvidia-smi and one from the CUDA library. Below are the two enumeration schemes I observed on the server.
With nvidia-smi
We can get a list of GPUs and their IDs with nvidia-smi. Below is the output I get from the server:
Sun Dec 10 13:42:07 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82 Driver Version: 375.82 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 Off | N/A |
| 22% 55C P8 31W / 250W | 11853MiB / 12205MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 22% 60C P8 18W / 250W | 114MiB / 12207MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 27% 66C P2 72W / 250W | 8452MiB / 12207MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
The IDs here are assigned in ascending PCI-E bus ID order. The GPU with bus ID 0000:05:00.0 (see the Bus-Id column above) has ID 0, while the one with bus ID 0000:09:00.0 has ID 2.
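If you only want the mapping between IDs and bus IDs, nvidia-smi can also print it in a compact, script-friendly form via its query interface (the field names below are listed by nvidia-smi --help-query-gpu):

$ nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv

This prints one CSV row per GPU with its nvidia-smi index, name, and PCI bus ID.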
With the CUDA library
This is not really a way to “get” GPU IDs, as no such function is available from the CUDA library. However, we can observe how the CUDA library assigns IDs to the GPUs by “setting” the device we want to use and then running something on it. In the following scenario, I use the bandwidthTest program provided in the CUDA samples and select the GPU I want with CUDA_VISIBLE_DEVICES.
$ CUDA_VISIBLE_DEVICES=0 ./bandwidthTest
You may think that it would use GPU 0 as shown in nvidia-smi, but it doesn’t! Look at the nvidia-smi output below (I’ve hidden unrelated processes):
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 6289 C ./bandwidthTest 140MiB |
+-----------------------------------------------------------------------------+
Clearly, it’s using device 2 in nvidia-smi.
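To check the mapping without running a benchmark at all, you can also ask the CUDA runtime for each device’s PCI bus ID and compare it with the Bus-Id column in nvidia-smi. Here is a minimal sketch of that idea (my own code, not part of the CUDA samples), compiled with nvcc, using cudaGetDeviceCount, cudaGetDeviceProperties, and cudaDeviceGetPCIBusId:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // PCI bus ID in the same domain:bus:device.function form that
        // nvidia-smi shows in its Bus-Id column.
        char busId[32] = {0};
        cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev);

        std::printf("CUDA device %d: %s (Bus-Id %s)\n", dev, prop.name, busId);
    }
    return 0;
}

Without any extra environment variables, the indices printed here follow CUDA’s default order, so they will not necessarily line up with the nvidia-smi IDs.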
Why are there two different schemes?
After searching online and digging into the CUDA documentation, I finally figured out the reason. If you look at the CUDA Environment Variables section, under the category “Device Enumeration and Properties” there is a variable named CUDA_DEVICE_ORDER with two possible values, FASTEST_FIRST and PCI_BUS_ID. The documentation says,
FASTEST_FIRST causes CUDA to guess which device is fastest using a simple heuristic, and make that device 0, leaving the order of the rest of the devices unspecified. PCI_BUS_ID orders devices by PCI bus ID in ascending order.
By default, this environment variable is set to FASTEST_FIRST. Therefore, it can assign different IDs to the devices than PCI_BUS_ID would if your devices happen to have different speeds. After I manually set this variable to PCI_BUS_ID, the IDs are consistent with the IDs in nvidia-smi.
$ export CUDA_DEVICE_ORDER=PCI_BUS_ID
$ CUDA_VISIBLE_DEVICES=0 ./bandwidthTest
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9410 C ./bandwidthTest 140MiB |
+-----------------------------------------------------------------------------+
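If you would rather not depend on the shell environment, the same effect can be obtained from inside a program, as long as the variable is set before the first CUDA call (the environment is read when the CUDA runtime initializes). A minimal sketch, assuming a POSIX setenv and therefore Linux:

#include <stdio.h>
#include <stdlib.h>       // setenv() is POSIX, not standard C
#include <cuda_runtime.h>

int main(void) {
    // Must run before any CUDA API call in this process;
    // the variable is only consulted at CUDA initialization.
    setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 1);

    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Visible CUDA devices (PCI bus order): %d\n", count);
    return 0;
}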
Why does it matter?
Well, it matters when you want to use a particular GPU (or a set of GPUs). Maybe you want to test a particular device, or you want to select some idle GPUs to run your program. In my case, when I discovered this issue, I was trying to run my program on the idle GPUs.
If you want to be absolutely sure that you are using the correct GPUs, I would recommend setting CUDA_DEVICE_ORDER to PCI_BUS_ID so that the IDs in CUDA programs are always consistent with what you see in nvidia-smi.
References
- CUDA C Programming Guide
- Inconsistency of IDs between ‘nvidia-smi -L’ and cuDeviceGetName() - StackOverflow