
add gpu topology-aware scheduling proposal #1115

Draft: happy2048 wants to merge 1 commit into main from feature/add-gputopology-proposal
Conversation

happy2048

Ⅰ. Describe what this PR does

Add GPU topology-aware scheduling proposal

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@koordinator-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign yihuifeng after the PR has been reviewed.
You can assign the PR to them by writing /assign @yihuifeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@happy2048 happy2048 force-pushed the feature/add-gputopology-proposal branch from 4ded088 to 23068f5 on March 14, 2023 07:46
@codecov

codecov bot commented Mar 14, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.01% ⚠️

Comparison is base (c77422b) 66.99% compared to head (23068f5) 66.98%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1115      +/-   ##
==========================================
- Coverage   66.99%   66.98%   -0.01%     
==========================================
  Files         263      263              
  Lines       28978    28978              
==========================================
- Hits        19413    19412       -1     
- Misses       8201     8205       +4     
+ Partials     1364     1361       -3     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 66.98% <ø> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

see 1 file with indirect coverage changes


☔ View full report in Codecov by Sentry.

@hormes hormes added this to the v1.3 milestone Mar 22, 2023
@jasonliu747 (Member) left a comment:

As we discussed in the bi-weekly meeting, many issues/questions regarding this proposal are still unresolved. I'll mark this proposal as WIP temporarily, and feel free to request a review when you think it's ready.

## Motivation
The NVIDIA Collective Communication Library (NCCL) is a Magnum IO library provided by NVIDIA that implements GPU-accelerated collective operations. NCCL is topology-aware (it automatically detects the connection type between GPU cards, with no manual configuration required) and is optimized over PCIe, NVLink, Ethernet, and InfiniBand interconnects to deliver high bandwidth and low latency. In distributed deep learning training jobs, a distributed training framework (PyTorch, MPI) combined with NCCL can accelerate training. NCCL detects the connections between GPU cards; different connection types provide different bandwidths, and the available bandwidth affects the training time of the job.

The following is a matrix describing the bandwidth between the 8 GPU cards on a node; the values are in GB/s:
Contributor:

Strictly speaking, not all 8-card GPU types support NVLink. Should we list some example GPU models that do, such as the V100 (whereas, e.g., the 1080 Ti does not support it)?

The following is a matrix describing the bandwidth between the 8 GPU cards on a node; the values are in GB/s:
```
Bandwidth Matrix:
gpu_0 gpu_1 gpu_2 gpu_3 gpu_4 gpu_5 gpu_6 gpu_7
```
Contributor:

There is an NVLink picture in the images directory; including it may help readers understand the speeds between different cards.
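To make the matrix above concrete, here is a minimal Go sketch (illustrative only, not part of the proposal) of how a per-node bandwidth matrix could be represented and how the bottleneck (minimum pairwise) bandwidth of a candidate GPU combination could be computed from it. The type name, function name, and bandwidth values are assumptions made up for this example.

```go
package main

import "fmt"

// BandwidthMatrix holds measured GPU-to-GPU bandwidth in GB/s for one node,
// e.g. collected from a peer-to-peer bandwidth benchmark.
type BandwidthMatrix [][]float64

// bottleneckBandwidth returns the smallest pairwise bandwidth among the selected
// GPU indices. This is the value a topology-aware allocator wants to maximize
// when choosing which GPUs to hand to a pod.
func bottleneckBandwidth(m BandwidthMatrix, gpus []int) float64 {
	bottleneck := -1.0
	for i := 0; i < len(gpus); i++ {
		for j := i + 1; j < len(gpus); j++ {
			bw := m[gpus[i]][gpus[j]]
			if bottleneck < 0 || bw < bottleneck {
				bottleneck = bw
			}
		}
	}
	return bottleneck
}

func main() {
	// Toy 4-GPU matrix: GPUs 0/1 and 2/3 are NVLink pairs (high bandwidth),
	// other pairs talk over PCIe (lower bandwidth). The numbers are illustrative.
	m := BandwidthMatrix{
		{0, 48, 10, 10},
		{48, 0, 10, 10},
		{10, 10, 0, 48},
		{10, 10, 48, 0},
	}
	fmt.Println(bottleneckBandwidth(m, []int{0, 1})) // 48: stays on the NVLink pair
	fmt.Println(bottleneckBandwidth(m, []int{0, 2})) // 10: limited by the PCIe link
}
```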

4. If a node cannot place all the pods of the training job, the scheduler will try to place these pods on the fewest nodes possible to avoid node resource fragmentation.
### Non-Goals/Future Work

1. This proposal assumes that a training job can tolerate some of its pods running on a node first while the remaining pods are pending. If the training job cannot tolerate this situation, the GPU topology plugin needs to be used together with the gang plugin to implement All Or Nothing scheduling; that is, this solution does not itself implement the All Or Nothing scheduling logic.
Contributor:

I have a question about this part: why do we need to say that some pods run on the node first? Maybe we should declare that we need a PodGroup to describe a group of pods and find the best scheduling result for them. Whether the pods need to launch together or not after scheduling is not very related to this topic; that's my opinion.

## Proposal
### User stories
#### Story 1
**Single Pod requests GPU cards:** The training job has only one pod, and the pod requests more than one GPU card. The training job uses the NCCL library for communication between GPU cards, so the communication bandwidth between GPU cards needs to be considered when allocating GPU cards to the pod.
Contributor:

"when allocating GPU cards to pods" -> "when allocating GPU cards to the pod"; this may be a grammar mistake.

#### Story 1
**Single Pod requests GPU cards:** The training job has only one pod, and the pod requests more than one GPU card. The training job uses the NCCL library for communication between GPU cards, so the communication bandwidth between GPU cards needs to be considered when allocating GPU cards to the pod.
#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the workers' underlying communication framework uses the NCCL library, and there is data communication between GPU cards. If a single node can run all of these workers, they should preferentially run on that node to reduce the communication delay between GPUs. If one node cannot run all of them, consider running them across multiple nodes; when each node selects GPUs for the workers that will run on it, the communication bandwidth between GPU cards should be considered, and the GPU combination with the largest bottleneck bandwidth is preferred.
Contributor:

"when each node selects GPUs for the workers, which should be run on the node" reads a little strangely.
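For reference, the "GPU combination with the largest bottleneck bandwidth" preference described in Story 2 could be implemented by enumerating candidate combinations and keeping the one whose worst pairwise link is best. The following self-contained Go sketch is only an illustration of that idea; the function names and bandwidth values are assumptions, and the proposal may well compute this differently.

```go
package main

import "fmt"

// bottleneck returns the minimum pairwise bandwidth (GB/s) within a GPU set.
func bottleneck(bw [][]float64, gpus []int) float64 {
	b := -1.0
	for i := 0; i < len(gpus); i++ {
		for j := i + 1; j < len(gpus); j++ {
			if b < 0 || bw[gpus[i]][gpus[j]] < b {
				b = bw[gpus[i]][gpus[j]]
			}
		}
	}
	return b
}

// pickGPUs enumerates every combination of k GPUs out of the free ones and
// returns the combination whose bottleneck bandwidth is the largest.
func pickGPUs(bw [][]float64, free []int, k int) (best []int, bestBW float64) {
	var current []int
	var walk func(start int)
	walk = func(start int) {
		if len(current) == k {
			if b := bottleneck(bw, current); best == nil || b > bestBW {
				best = append([]int(nil), current...)
				bestBW = b
			}
			return
		}
		for i := start; i < len(free); i++ {
			current = append(current, free[i])
			walk(i + 1)
			current = current[:len(current)-1]
		}
	}
	walk(0)
	return best, bestBW
}

func main() {
	// Toy matrix: {0,1} and {2,3} are fast NVLink pairs, other links are PCIe.
	bw := [][]float64{
		{0, 48, 10, 10},
		{48, 0, 10, 10},
		{10, 10, 0, 48},
		{10, 10, 48, 0},
	}
	gpus, b := pickGPUs(bw, []int{0, 1, 2, 3}, 2)
	fmt.Println(gpus, b) // [0 1] 48: an NVLink pair beats any cross-pair choice
}
```

Brute-force enumeration grows quickly with the GPU count; in practice the search could be pruned by grouping GPUs into NVLink/PCIe switch "islands" or by caching scores per request size.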

#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the workers' underlying communication framework uses the NCCL library, and there is data communication between GPU cards. If a single node can run all of these workers, they should preferentially run on that node to reduce the communication delay between GPUs. If one node cannot run all of them, consider running them across multiple nodes; when each node selects GPUs for the workers that will run on it, the communication bandwidth between GPU cards should be considered, and the GPU combination with the largest bottleneck bandwidth is preferred.

In this scenario, the following situation may occur: some workers (or pods) of the training job are running while the remaining pods are pending because they could not be scheduled in time. If the training job can tolerate this situation, no special handling is required; if it cannot, the running pods occupy and waste resources. To avoid this situation, it is necessary to ensure All Or Nothing resource scheduling, which requires gang scheduling.
@buptcozy (Contributor) commented on Apr 11, 2023:

My comment on this part is the same as the one before.

#### Story 1
**Single Pod requests GPU cards:** The training job has only one pod, and the pod requests more than one GPU card. The training job uses the NCCL library for communication between GPU cards, so the communication bandwidth between GPU cards needs to be considered when allocating GPU cards to the pod.
#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the workers' underlying communication framework uses the NCCL library, and there is data communication between GPU cards. If a single node can run all of these workers, they should preferentially run on that node to reduce the communication delay between GPUs. If one node cannot run all of them, consider running them across multiple nodes; when each node selects GPUs for the workers that will run on it, the communication bandwidth between GPU cards should be considered, and the GPU combination with the largest bottleneck bandwidth is preferred.
Contributor:

If the pods must be scheduled across several nodes, for example 2 nodes, then one node's GPU resources must be fully used (free -> 0), while the other node may end up with free -> 0 or not. For the free -> 0 node there is no need to consider topology. For the node whose free GPUs are not 0, should we pick the worst topology rather than the best? We may want to reserve the good topology for another pod that can run entirely on one node; on the other hand, once pods cross nodes, the bottleneck is the network speed between the nodes, so trying to get the best GPU combination on a partially used node seems useless.

#### Main steps
The main steps are described as follows:

- When pod1 starts to be scheduled, the GPU topology plugin uses two specific pod labels (which will be introduced later) in the preFilter extension to find the pods of the same group that have not been scheduled yet (including pod1 itself), for example [pod1, pod2, pod3].
Contributor:

This is why I think this feature is so related to coscheduling. Maybe we should ensure pod1/pod2/pod3 are closely connected in the scheduling queue; only that way can we make sure the scheduling plan can actually be achieved. However, we can only load one sort plugin in the scheduler framework, and coscheduling also needs the sort plugin. Actually, we haven't found a case that needs NVLink but does not need coscheduling, so maybe we can reuse the coscheduling plugin to achieve this feature. Coscheduling can also help recognize the relationship between pod1/pod2/pod3.
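As a rough illustration of the preFilter step quoted above (finding the not-yet-scheduled pods of the same group), the following Go sketch lists pods by a group label through a client-go lister and keeps the ones without a node assigned. The label key here is hypothetical; the proposal defines its own two labels later, and a real plugin would run inside the scheduler framework's PreFilter extension point rather than as a standalone function.

```go
package gputopology

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// gpuGroupLabel is a hypothetical label key used only in this sketch; the
// proposal introduces its own labels for identifying a scheduling group.
const gpuGroupLabel = "example.io/gpu-topology-group"

// findUnscheduledGroupPods returns the pods (including pod itself) that carry
// the same group label as pod and have not been bound to a node yet. A
// preFilter-style step could use this list to plan GPU placement for the
// whole group in one pass.
func findUnscheduledGroupPods(podLister listersv1.PodLister, pod *corev1.Pod) ([]*corev1.Pod, error) {
	group, ok := pod.Labels[gpuGroupLabel]
	if !ok {
		// Not part of a topology-aware group: schedule this pod on its own.
		return []*corev1.Pod{pod}, nil
	}
	selector := labels.SelectorFromSet(labels.Set{gpuGroupLabel: group})
	candidates, err := podLister.Pods(pod.Namespace).List(selector)
	if err != nil {
		return nil, err
	}
	var pending []*corev1.Pod
	for _, p := range candidates {
		if p.Spec.NodeName == "" { // not scheduled yet
			pending = append(pending, p)
		}
	}
	return pending, nil
}
```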


- If one node cannot place [pod1, pod2, pod3], then try to place these three pods on 2 nodes. After allocating GPUs to the pods, the combination that leaves fewer remaining GPU resources on the node is preferred.
Contributor:

[pod1, pod2, pod2]->[pod1, pod2, pod3]


- If one node cannot place [pod1, pod2, pod3], then try to place these three pods on 2 nodes. After allocating GPUs to the pods, the combination that leaves fewer remaining GPU resources on the node is preferred.
Contributor:

If a job has 10 pods and a node can only fit 2 pods, will you try every arrangement and combination, like pod1+pod2, pod1+pod3, and so on (C2/100)? Maybe we can assume that all pods in a GPU job that needs NVLink are isomorphic (in the real world they indeed are), which makes the problem much easier. We want the semantics of "node A can place 2 pods, node B can place 1 pod, ...", not "node A can place pod1+pod3, node B can place pod2"...

Contributor:

With pod1/pod2/pod3 and node1/node2/node3, the arrangements and combinations are [pod1/pod2 on node1, pod1/pod3 on node1, ...], [pod1/pod2 on node2, pod1/pod3 on node2, ...], which is determined by the pod count and node count and is terrible. Instead, the process could be: first calculate each node's maximum assignable pod count, sort the nodes, then place the isomorphic pods onto the nodes from max to min.
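Here is a minimal Go sketch of the greedy process suggested in this comment, assuming the group's pods are isomorphic (each requests the same number of GPUs): compute each node's maximum assignable pod count, sort nodes from largest to smallest capacity, and fill them in order. The type and function names are illustrative, not from the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeCapacity records how many pods of the (isomorphic) group a node can
// still hold, e.g. free GPUs on the node divided by the GPUs each pod requests.
type nodeCapacity struct {
	name    string
	maxPods int
}

// planPlacement follows the suggestion above: sort nodes by their maximum
// assignable pod count, from largest to smallest, and fill them greedily.
// It returns how many pods go to each node and whether the whole group fits.
func planPlacement(nodes []nodeCapacity, totalPods int) (map[string]int, bool) {
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].maxPods > nodes[j].maxPods })
	plan := map[string]int{}
	remaining := totalPods
	for _, n := range nodes {
		if remaining == 0 {
			break
		}
		take := n.maxPods
		if take > remaining {
			take = remaining
		}
		if take > 0 {
			plan[n.name] = take
			remaining -= take
		}
	}
	return plan, remaining == 0
}

func main() {
	nodes := []nodeCapacity{{"node-a", 2}, {"node-b", 1}, {"node-c", 4}}
	plan, ok := planPlacement(nodes, 5)
	fmt.Println(plan, ok) // map[node-a:1 node-c:4] true: the group lands on the fewest nodes
}
```

This keeps the semantics of "node A can place 2 pods, node B can place 1 pod" and avoids enumerating pod-to-node combinations; it also lines up with goal 4 (use the fewest nodes to reduce fragmentation).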

@zwzhang0107 zwzhang0107 modified the milestones: v1.3, v1.4 Aug 8, 2023
@eahydra eahydra modified the milestones: v1.4, v1.5 Dec 6, 2023
@saintube saintube modified the milestones: v1.5, v1.6 Jun 3, 2024