
Conflict with a CUDA-11 PyTorch installation #10729

Open
bryant1410 opened this issue Aug 20, 2024 · 3 comments

@bryant1410

The XGBoost package for Python depends on nvidia-nccl-cu12, which is built for CUDA 12. I have a PyTorch 2.4.0 installation for CUDA 11.8, but when I use distributed mode, PyTorch picks up the NCCL library installed by XGBoost, which causes problems in my environment.

My workaround is to install the CPU-only version of XGBoost. However, I still want to use XGBoost with CUDA support, so it would be nice if I could use it with nvidia-nccl-cu11 instead. I'm not sure what the solution could be (maybe optional dependency groups for XGBoost, such as a cu11 extra, or a separate package). Note this could also become a problem again when CUDA 13 comes out.
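For reference, a minimal sketch of that CPU-only workaround, assuming the CPU-only wheel is published on PyPI as xgboost-cpu (package name is an assumption; adjust for your setup):

  # Drop the GPU wheel and the CUDA 12 NCCL it pulled in, then install the CPU-only wheel
  pip uninstall -y xgboost nvidia-nccl-cu12
  pip install xgboost-cpu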

@trivialfis
Member

trivialfis commented Aug 20, 2024

Thank you for raising an issue. I don't know what the right solution is either. As a workaround, one can either use conda/mamba, or install xgboost without its dependencies (pip install --no-deps) and then pick up those dependencies manually afterwards.
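Roughly something like this (an untested sketch; the exact list of dependencies to reinstall afterwards may differ between XGBoost versions):

  # Install xgboost without pulling in nvidia-nccl-cu12
  pip install --no-deps xgboost
  # Then install the remaining runtime dependencies by hand,
  # choosing the CUDA 11 NCCL wheel instead of the CUDA 12 one
  pip install numpy scipy nvidia-nccl-cu11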

The binary wheel itself can use cu11; the question is just how to specify that in the pyproject.toml.

@trivialfis
Member

May I ask what specific problem you ran into?

@bryant1410
Author

After initializing PyTorch distributed mode, when calling broadcast_object_list, I ran into a weird NCCL error:

  File "file.py", line 2, in function
    dist.broadcast_object_list(objects, src=src)
  File "env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2901, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2205, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.22.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

It was fixed by removing the nvidia-nccl-cu12 package that the xgboost installation had pulled in.
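For anyone hitting the same error, roughly what the fix looked like (the nvidia-nccl-cu11 install is only needed if nothing else in the environment already provides NCCL for CUDA 11):

  # Remove the CUDA 12 NCCL wheel that the xgboost install pulled in
  pip uninstall -y nvidia-nccl-cu12
  # Optionally install the CUDA 11 variant if NCCL is still needed
  pip install nvidia-nccl-cu11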
