Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Price Improvement Threshold #1440

Open
wmgroot opened this issue Jul 17, 2024 · 7 comments
Open

Price Improvement Threshold #1440

wmgroot opened this issue Jul 17, 2024 · 7 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@wmgroot
Copy link
Contributor

wmgroot commented Jul 17, 2024

Description

One piece of implementation for RFC #1433 (Consolidation Policies)

What problem are you trying to solve?
Make it possible to balance the cost of disrupting pods against the cost of using a more expensive instance type.

Currently, single-node consolidation will replace any instance with a cheaper version that can house the same pods, resulting in very frequent pod disruption. Disruption pods has its own cost, which for certain workloads can negate savings or result in a net loss of money.

This concept has already been documented as an improvement for spot consolidation, but I see no reason why it wouldn't be useful for generalized consolidation.
https://github.com/kubernetes-sigs/karpenter/blob/main/designs/spot-consolidation.md#2-price-improvement-factor

How important is this feature to you?
Very important. We are currently running a fork of Karpenter with single-node consolidation completely disabled due to the frequent node disruption we are seeing for small price improvements (individual pods are frequently being disrupted multiple times an hour).

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@wmgroot wmgroot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 17, 2024
@ellistarn
Copy link
Contributor

Just brainstorming:

I'm curious if we might want to reason about this as spec.disruption.cost. We factor in a percentage of the cost of the node when making consolidation decisions. We might use this to sort nodes, and favor smaller nodes for consolidation. It also happens to directly map to priceImprovementFactor: 1.1 == disruption.cost: .1. Another thing to think about is that Kubernetes APIs don't allow floats, so we may need to do something like priceImprovementPercent: 110 or disruption.costPercent: 10.

@wmgroot
Copy link
Contributor Author

wmgroot commented Jul 19, 2024

To check my understanding of disruption.costPercent: 10, you're effectively giving users a way to say that "I'm losing 10% of what I'm paying for this node, so make sure the savings for the replacement node are more than 10% before replacing this node"?

If I'm understanding that correctly, it sounds to me like it would be better placed on an individual NodeClaim spec rather than in NodePool.spec.disruption.

@ellistarn
Copy link
Contributor

If I'm understanding that correctly, it sounds to me like it would be better placed on an individual NodeClaim spec rather than in NodePool.spec.disruption.

This is a good point. Similar line of reasoning to TGP and expireAfter.

If we model it as priceImprovementFactor, then it might be considered a property of the NodePool. I think it makes more sense at the NodeClaim level, though. Given this, how would you expect it to interact with drift?

@wmgroot
Copy link
Contributor Author

wmgroot commented Jul 19, 2024

Additionally, it could potentially support fixed or percent values via IntOrString, such as:

NodeClaim.spec.disruptionCost: 10%
NodeClaim.spec.disruptionCost: 20 (effectively saying a one-time disruption of this node would cost $20)

The problem with fixed value disruption costs is that karpenter has no way to predict how long the node would be around in the future to gauge if a one-time $20 cost is worth it. Maybe it could use historical data on node age to make a prediction.

@wmgroot
Copy link
Contributor Author

wmgroot commented Jul 19, 2024

Given this, how would you expect it to interact with drift?

I expect drift to not take this feature into account at all. If the node needs to be replaced, it needs to be replaced regardless of cost. When a new nodeclaim is provisioned to replace each node, the usual provisioning logic should still choose the cheapest available, which could result in minor cost improvements that are below the configured threshold.

@njtran
Copy link
Contributor

njtran commented Aug 12, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 12, 2024
@njtran
Copy link
Contributor

njtran commented Aug 12, 2024

I've marked this as accepted, but I think we definitely need to talk this out. Excited to discuss more about this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants