
add gpu topology-aware scheduling proposal #1115

Draft: happy2048 wants to merge 1 commit into main from feature/add-gputopology-proposal
Conversation

happy2048

Ⅰ. Describe what this PR does

Add GPU topology-aware scheduling proposal

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@koordinator-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign yihuifeng after the PR has been reviewed.
You can assign the PR to them by writing /assign @yihuifeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@happy2048 happy2048 force-pushed the feature/add-gputopology-proposal branch from 4ded088 to 23068f5 on March 14, 2023 07:46
@codecov

codecov bot commented Mar 14, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.01% ⚠️

Comparison is base (c77422b) 66.99% compared to head (23068f5) 66.98%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1115      +/-   ##
==========================================
- Coverage   66.99%   66.98%   -0.01%     
==========================================
  Files         263      263              
  Lines       28978    28978              
==========================================
- Hits        19413    19412       -1     
- Misses       8201     8205       +4     
+ Partials     1364     1361       -3     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 66.98% <ø> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

see 1 file with indirect coverage changes


☔ View full report in Codecov by Sentry.

@hormes hormes added this to the v1.3 milestone Mar 22, 2023
@jasonliu747 (Member) left a comment:

As we discussed in the bi-weekly meeting, many issues/questions regarding this proposal are still unresolved. I'll mark this proposal as WIP temporarily, and feel free to request a review when you think it's ready.

## Motivation
The NVIDIA Collective Communication Library (NCCL) is a Magnum IO library provided by NVIDIA that implements GPU-accelerated collective operations. NCCL is topology-aware (it automatically detects the connection type between GPU cards, with no manual configuration required) and is optimized over PCIe, NVLink, Ethernet, and InfiniBand interconnects to deliver high bandwidth and low latency. In distributed deep learning training jobs, a distributed training framework (PyTorch, MPI) combined with NCCL can accelerate training. NCCL detects the connections between GPU cards; different connection types provide different bandwidths, and the available bandwidth affects the training time of the job.

The following is a matrix describing the bandwidth between the 8 GPU cards on a node; the values are in GB/s:
Contributor:

Strictly speaking, not all 8-card GPU types support NVLink. Should we list some example GPU models that do, such as the V100 (whereas, e.g., the 1080 Ti does not support it)?

The following is a matrix describing the bandwidth between the 8 GPU cards on a node; the values are in GB/s:
```
Bandwidth Matrix:
gpu_0 gpu_1 gpu_2 gpu_3 gpu_4 gpu_5 gpu_6 gpu_7
```
Contributor:

There is an NVLink picture in the images directory; including it may help readers understand the speeds between different cards.
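To make the matrix above concrete, here is a minimal Go sketch (illustrative only, not part of the proposal) of how a per-node bandwidth matrix could be represented and how the bottleneck (minimum pairwise) bandwidth of a candidate GPU combination could be computed from it. The type name, function name, and bandwidth values are assumptions made up for this example.

```go
package main

import "fmt"

// BandwidthMatrix holds measured GPU-to-GPU bandwidth in GB/s for one node,
// e.g. collected from a peer-to-peer bandwidth benchmark.
type BandwidthMatrix [][]float64

// bottleneckBandwidth returns the smallest pairwise bandwidth among the selected
// GPU indices. This is the value a topology-aware allocator wants to maximize
// when choosing which GPUs to hand to a pod.
func bottleneckBandwidth(m BandwidthMatrix, gpus []int) float64 {
	bottleneck := -1.0
	for i := 0; i < len(gpus); i++ {
		for j := i + 1; j < len(gpus); j++ {
			bw := m[gpus[i]][gpus[j]]
			if bottleneck < 0 || bw < bottleneck {
				bottleneck = bw
			}
		}
	}
	return bottleneck
}

func main() {
	// Toy 4-GPU matrix: GPUs 0/1 and 2/3 are NVLink pairs (high bandwidth),
	// other pairs talk over PCIe (lower bandwidth). The numbers are illustrative.
	m := BandwidthMatrix{
		{0, 48, 10, 10},
		{48, 0, 10, 10},
		{10, 10, 0, 48},
		{10, 10, 48, 0},
	}
	fmt.Println(bottleneckBandwidth(m, []int{0, 1})) // 48: stays on the NVLink pair
	fmt.Println(bottleneckBandwidth(m, []int{0, 2})) // 10: limited by the PCIe link
}
```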

4. If a node cannot place all the pods of the training job, the scheduler will try to place these pods on the fewest nodes possible to avoid node resource fragmentation.
### Non-Goals/Future Work

1. This proposal assumes that a training job can tolerate some of its pods running on a node first while the remaining pods are pending. If the training job cannot tolerate this situation, the GPU topology plugin needs to be used together with the gang plugin to implement All Or Nothing scheduling; that is, this solution does not itself implement the All Or Nothing scheduling logic.
Contributor:

I have a question about this part: why do we need to say that some pods run on the node first? Maybe we should declare that we need a PodGroup to describe a group of pods and find the best scheduling result for them. Whether the pods need to launch together or not after scheduling is not very related to this topic; that's my opinion.

## Proposal
### User stories
#### Story 1
**Single Pod requests GPU cards:** The training job has only one pod, and the pod requests more than one GPU card. The training job uses the NCCL library for communication between GPU cards, so the communication bandwidth between GPU cards needs to be considered when allocating GPU cards to the pod.
Contributor:

"when allocating GPU cards to pods" -> "when allocating GPU cards to the pod"; this may be a grammar mistake.

#### Story 1
**Single Pod requests GPU cards:** The training job has only one pod, and the pod requests more than one GPU card. The training job uses the NCCL library for communication between GPU cards, so the communication bandwidth between GPU cards needs to be considered when allocating GPU cards to the pod.
#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the workers' underlying communication framework uses the NCCL library, and there is data communication between GPU cards. If a single node can run all of these workers, they should preferentially run on that node to reduce the communication delay between GPUs. If one node cannot run all of them, consider running them across multiple nodes; when each node selects GPUs for the workers that will run on it, the communication bandwidth between GPU cards should be considered, and the GPU combination with the largest bottleneck bandwidth is preferred.
Contributor:

"when each node selects GPUs for the workers, which should be run on the node" reads a little strangely.
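For reference, the "GPU combination with the largest bottleneck bandwidth" preference described in Story 2 could be implemented by enumerating candidate combinations and keeping the one whose worst pairwise link is best. The following self-contained Go sketch is only an illustration of that idea; the function names and bandwidth values are assumptions, and the proposal may well compute this differently.

```go
package main

import "fmt"

// bottleneck returns the minimum pairwise bandwidth (GB/s) within a GPU set.
func bottleneck(bw [][]float64, gpus []int) float64 {
	b := -1.0
	for i := 0; i < len(gpus); i++ {
		for j := i + 1; j < len(gpus); j++ {
			if b < 0 || bw[gpus[i]][gpus[j]] < b {
				b = bw[gpus[i]][gpus[j]]
			}
		}
	}
	return b
}

// pickGPUs enumerates every combination of k GPUs out of the free ones and
// returns the combination whose bottleneck bandwidth is the largest.
func pickGPUs(bw [][]float64, free []int, k int) (best []int, bestBW float64) {
	var current []int
	var walk func(start int)
	walk = func(start int) {
		if len(current) == k {
			if b := bottleneck(bw, current); best == nil || b > bestBW {
				best = append([]int(nil), current...)
				bestBW = b
			}
			return
		}
		for i := start; i < len(free); i++ {
			current = append(current, free[i])
			walk(i + 1)
			current = current[:len(current)-1]
		}
	}
	walk(0)
	return best, bestBW
}

func main() {
	// Toy matrix: {0,1} and {2,3} are fast NVLink pairs, other links are PCIe.
	bw := [][]float64{
		{0, 48, 10, 10},
		{48, 0, 10, 10},
		{10, 10, 0, 48},
		{10, 10, 48, 0},
	}
	gpus, b := pickGPUs(bw, []int{0, 1, 2, 3}, 2)
	fmt.Println(gpus, b) // [0 1] 48: an NVLink pair beats any cross-pair choice
}
```

Brute-force enumeration grows quickly with the GPU count; in practice the search could be pruned by grouping GPUs into NVLink/PCIe switch "islands" or by caching scores per request size.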

#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the workers' underlying communication framework uses the NCCL library, and there is data communication between GPU cards. If a single node can run all of these workers, they should preferentially run on that node to reduce the communication delay between GPUs. If one node cannot run all of them, consider running them across multiple nodes; when each node selects GPUs for the workers that will run on it, the communication bandwidth between GPU cards should be considered, and the GPU combination with the largest bottleneck bandwidth is preferred.

In this scenario, the following situation may occur: some workers (or pods) of the training job are running while the remaining pods are pending because they could not be scheduled in time. If the training job can tolerate this situation, no special handling is required; if it cannot, the running pods occupy and waste resources. To avoid this situation, it is necessary to ensure All Or Nothing resource scheduling, which requires gang scheduling.
@buptcozy (Contributor) commented on Apr 11, 2023:

My comment on this part is the same as the one before.

#### Story 1
**Single Pod requests GPU cards:** The training job has only one pod, and the pod requests more than one GPU card. The training job uses the NCCL library for communication between GPU cards, so the communication bandwidth between GPU cards needs to be considered when allocating GPU cards to the pod.
#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the workers' underlying communication framework uses the NCCL library, and there is data communication between GPU cards. If a single node can run all of these workers, they should preferentially run on that node to reduce the communication delay between GPUs. If one node cannot run all of them, consider running them across multiple nodes; when each node selects GPUs for the workers that will run on it, the communication bandwidth between GPU cards should be considered, and the GPU combination with the largest bottleneck bandwidth is preferred.
Contributor:

If the pods must be scheduled across several nodes, for example 2 nodes, then one node's GPU resources must be fully used (free -> 0), while the other node may end up with free -> 0 or not. For the free -> 0 node there is no need to consider topology. For the node whose free GPUs are not 0, should we pick the worst topology rather than the best? We may want to reserve the good topology for another pod that can run entirely on one node; on the other hand, once pods cross nodes, the bottleneck is the network speed between the nodes, so trying to get the best GPU combination on a partially used node seems useless.

#### Main steps
The main steps are described as follows:

- When pod1 starts to be scheduled, the GPU topology plugin uses two specific pod labels (which will be introduced later) in the preFilter extension to find the pods of the same group that have not been scheduled yet (including pod1 itself), for example [pod1, pod2, pod3].
Contributor:

This is why I think this feature is so related to coscheduling. Maybe we should ensure pod1/pod2/pod3 are closely connected in the scheduling queue; only that way can we make sure the scheduling plan can actually be achieved. However, we can only load one sort plugin in the scheduler framework, and coscheduling also needs the sort plugin. Actually, we haven't found a case that needs NVLink but does not need coscheduling, so maybe we can reuse the coscheduling plugin to achieve this feature. Coscheduling can also help recognize the relationship between pod1/pod2/pod3.
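As a rough illustration of the preFilter step quoted above (finding the not-yet-scheduled pods of the same group), the following Go sketch lists pods by a group label through a client-go lister and keeps the ones without a node assigned. The label key here is hypothetical; the proposal defines its own two labels later, and a real plugin would run inside the scheduler framework's PreFilter extension point rather than as a standalone function.

```go
package gputopology

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// gpuGroupLabel is a hypothetical label key used only in this sketch; the
// proposal introduces its own labels for identifying a scheduling group.
const gpuGroupLabel = "example.io/gpu-topology-group"

// findUnscheduledGroupPods returns the pods (including pod itself) that carry
// the same group label as pod and have not been bound to a node yet. A
// preFilter-style step could use this list to plan GPU placement for the
// whole group in one pass.
func findUnscheduledGroupPods(podLister listersv1.PodLister, pod *corev1.Pod) ([]*corev1.Pod, error) {
	group, ok := pod.Labels[gpuGroupLabel]
	if !ok {
		// Not part of a topology-aware group: schedule this pod on its own.
		return []*corev1.Pod{pod}, nil
	}
	selector := labels.SelectorFromSet(labels.Set{gpuGroupLabel: group})
	candidates, err := podLister.Pods(pod.Namespace).List(selector)
	if err != nil {
		return nil, err
	}
	var pending []*corev1.Pod
	for _, p := range candidates {
		if p.Spec.NodeName == "" { // not scheduled yet
			pending = append(pending, p)
		}
	}
	return pending, nil
}
```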


- If one node cannot place [pod1, pod2, pod3], then try to place these three pods on 2 nodes. After allocating GPUs to the pods, the combination that leaves fewer remaining GPU resources on the node is preferred.
Contributor:

[pod1, pod2, pod2]->[pod1, pod2, pod3]


- If one node cannot place [pod1, pod2, pod3], then try to place these three pods on 2 nodes. After allocating GPUs to the pods, the combination that leaves fewer remaining GPU resources on the node is preferred.
Contributor:

If a job has 10 pods and a node can only fit 2 pods, will you try every arrangement and combination, like pod1+pod2, pod1+pod3, and so on (C2/100)? Maybe we can assume that all pods in a GPU job that needs NVLink are isomorphic (in the real world they indeed are), which makes the problem much easier. We want the semantics of "node A can place 2 pods, node B can place 1 pod, ...", not "node A can place pod1+pod3, node B can place pod2"...

Contributor:

With pod1/pod2/pod3 and node1/node2/node3, the arrangements and combinations are [pod1/pod2 on node1, pod1/pod3 on node1, ...], [pod1/pod2 on node2, pod1/pod3 on node2, ...], which is determined by the pod count and node count and is terrible. Instead, the process could be: first calculate each node's maximum assignable pod count, sort the nodes, then place the isomorphic pods onto the nodes from max to min.
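Here is a minimal Go sketch of the greedy process suggested in this comment, assuming the group's pods are isomorphic (each requests the same number of GPUs): compute each node's maximum assignable pod count, sort nodes from largest to smallest capacity, and fill them in order. The type and function names are illustrative, not from the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeCapacity records how many pods of the (isomorphic) group a node can
// still hold, e.g. free GPUs on the node divided by the GPUs each pod requests.
type nodeCapacity struct {
	name    string
	maxPods int
}

// planPlacement follows the suggestion above: sort nodes by their maximum
// assignable pod count, from largest to smallest, and fill them greedily.
// It returns how many pods go to each node and whether the whole group fits.
func planPlacement(nodes []nodeCapacity, totalPods int) (map[string]int, bool) {
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].maxPods > nodes[j].maxPods })
	plan := map[string]int{}
	remaining := totalPods
	for _, n := range nodes {
		if remaining == 0 {
			break
		}
		take := n.maxPods
		if take > remaining {
			take = remaining
		}
		if take > 0 {
			plan[n.name] = take
			remaining -= take
		}
	}
	return plan, remaining == 0
}

func main() {
	nodes := []nodeCapacity{{"node-a", 2}, {"node-b", 1}, {"node-c", 4}}
	plan, ok := planPlacement(nodes, 5)
	fmt.Println(plan, ok) // map[node-a:1 node-c:4] true: the group lands on the fewest nodes
}
```

This keeps the semantics of "node A can place 2 pods, node B can place 1 pod" and avoids enumerating pod-to-node combinations; it also lines up with goal 4 (use the fewest nodes to reduce fragmentation).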

@zwzhang0107 zwzhang0107 modified the milestones: v1.3, v1.4 Aug 8, 2023
@eahydra eahydra modified the milestones: v1.4, v1.5 Dec 6, 2023
@saintube saintube modified the milestones: v1.5, v1.6 Jun 3, 2024