-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: NodeOverlay #1305
base: main
Are you sure you want to change the base?
RFC: NodeOverlay #1305
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ellistarn The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
(nit) In the PR description, the NodeOverlays all have the same |
Is this an overlay for Nodes, or an overlay for NodePools (maybe more than one NodePool, if we support matching via a selector)? |
In a real scenario with a saving plan, there is a commitment to a consistent amount of usage. Once the committed usage is exhausted, the price will revert to the on-demand rate. Does this PR take that into consideration? What I expect is:
|
@ellistarn We redefined the value of |
It sounds like you don't agree with the zone naming used in an existing integration. You can:
I don't think this PR would need to change to accommodate the ask. |
Thanks for trying to tackle this! I wonder if there is a cleaner way to represent the overlays to reduce sprawl. Thinking more generally, each dimension can be additive, adding a new custom resource, and/or adjustment, price changes or resource modification. (Assuming we aren't supporting deleting properties at this time). Price is a slight outlier from resources so we can separate out. Would it make sense to have at the highest level price and resources. Then subsection of adjustment and additions. Adjustment would support +/- percent or values and additions would just be a dict of new custom resources and values? |
@sftim I mean nodeoverlay can provide more overlay options, not just resources. For example, Lables. |
### Price | ||
|
||
Each instance offering comes with a price, which is used as a minimization function in Karpenter's scheduling algorithm. However, Karpenter cannot perfectly understand all factors that influence price, such as an external vendor's fee, a pricing discount, or an external motiviation such as carbon-offset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth describing the potential interaction of client-side use (and customization) of pricing with provider-specific server-side price-sensitive allocation strategies (like those supported by EC2 Fleet).
Hi all! Our use case is we run Circle CI runner jobs on a EKS cluster. And we have this limit for buildah to work:
But we got the following error:
|
operator: In | ||
values: ["m5.large", "m5.2xlarge", "m5.4xlarge", "m5.8xlarge", "m5.12xlarge"] | ||
capacity: | ||
smarter-devices/fuse: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, the "extended resource support" in this design would only support statically applied extended resources on selected nodes, correct? This would not cover cases where dynamic extended resources are needed, for example setting an extended resource for available network bandwidth which varies by EC2 instance type.
Additionally, karpenter still needs to have its provisioning and deprovisioning logic updated to account for arbitrary extended resources other than the gpu resource, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would not cover cases where dynamic extended resources are needed, for example setting an extended resource for available network bandwidth which varies by EC2 instance type.
You could achieve through multiple NodeOverlays, that match different selectors.
e.g. m5.large has 2, m5.2xlarge has 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think defining a NodeOverlay for each network bandwidth amount is a reasonable solution for my use case. We use a wide variety of instance types and sizes.
Do you think it could be feasible to incorporate some level of templating for values that can be easily retrieved from the EC2 API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For NetworkBandwidth, I'd love to just support this one first class.
We could always try to do something that attempts to model this like:
resourcesPerCPU:
network: 1Gi
But it's hard to capture all potential relationships without devolving into something like https://github.com/google/cel-spec
Pull Request Test Coverage Report for Build 9882748974Details
💛 - Coveralls |
Hi All I'm a bit confused about the relationship between kubernetes-sigs/karpenter and aws/karpenter-provider-aws. The lack of support for custom resource requests/limits is blocking me from autocalling with xilinx devices. I want to understand how to deploy this branch to EKS. Any help or understanding would be appreciated. |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
This branch adds a “request for comments“. @ellistarn should the RFC and implementation land as one commit? Seems unusual. |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
i think this shouldn't be closed. |
/lifecycle frozen |
@njtran: The In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What is the progress now? Why does it seems not to be updated. |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
Fixes:
vmMemoryOverheadPercent
aws/karpenter-provider-aws#5161Description
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.