
feat: Add option to configure extra pod capacity for alternate cnis #6042

Closed
wants to merge 1 commit into from

Conversation

@cnmcavoy (Contributor) commented Apr 15, 2024

Fixes #5780

Description
Adds an option to configure extra pod capacity for alternative cni runtimes. Karpenter hardcodes 2 extra pods beyond the ENI IP limits, to account for aws-cni and kube-proxy. This exposes that value as a configurable option while leaving the default untouched.

We want to be able to use Cilium w/kube-proxy replacement mode enabled, and so we need to be able to set this to "1" to account for 1 fewer host networked pod.
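For context, the ENI-based capacity math described above can be sketched as follows. This is an illustrative reconstruction of the AWS VPC CNI max-pods formula, not Karpenter's actual source; the function name and parameters are made up for this example:

```python
# Illustrative sketch of ENI-based pod capacity (not Karpenter's actual code).
# AWS publishes per-instance-type ENI and IP-per-ENI limits; the "+ 2"
# allowance for host-networked pods (aws-cni, kube-proxy) is the hardcoded
# value this PR proposes to make configurable.
def pod_capacity(enis: int, ips_per_eni: int, extra_pods: int = 2) -> int:
    # Each ENI reserves one IP as its primary address, so it contributes
    # (ips_per_eni - 1) pod IPs. Host-networked pods don't consume ENI IPs,
    # hence the extra allowance on top of the IP-derived limit.
    return enis * (ips_per_eni - 1) + extra_pods

# m5.large: 3 ENIs with 10 IPv4 addresses each -> the familiar 29 max pods
print(pod_capacity(3, 10))                 # 29
# Cilium with kube-proxy replacement: one fewer host-networked pod
print(pod_capacity(3, 10, extra_pods=1))   # 28
```

Setting the allowance to 1 instead of 2 shrinks the advertised pod capacity by one, keeping it in line with what the CNI can actually serve when kube-proxy is not running.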

How was this change tested?
make presubmit

Does this change impact docs?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cnmcavoy cnmcavoy requested a review from a team as a code owner April 15, 2024 18:56
@cnmcavoy cnmcavoy requested a review from engedaam April 15, 2024 18:56
netlify bot commented Apr 15, 2024

Deploy Preview for karpenter-docs-prod canceled.

Latest commit: 96e8c1f
Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6696b9657762e40008020ba0

github-actions bot commented May 1, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@njtran (Contributor) commented Jun 6, 2024

Sorry this closed and slipped through the cracks. I wanted to truly understand the use-cases here before we make this configurable, but I agree that if the default is the same, this should be easy to reason about.

@njtran njtran self-assigned this Jun 6, 2024
coveralls commented Jun 6, 2024

Pull Request Test Coverage Report for Build 9407938329

Details

  • 9 of 9 (100.0%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.07%) to 82.571%

Totals:
  • Change from base Build 9355595220: +0.07%
  • Covered Lines: 5557
  • Relevant Lines: 6730

@cnmcavoy (Contributor, Author) commented Jun 6, 2024

> Sorry this closed and slipped through the cracks. I wanted to truly understand the use-cases here before we make this configurable, but I agree that if the default is the same, this should be easy to reason about.

The core change we are asking for is to the hardcoded assumption that each node has "2" extra pods of capacity because of the host-networked daemonsets. With Cilium in kube-proxyless mode, that assumption does not hold. I assume that even rarer Multus CNI setups or chained CNI setups also need a different value.

@njtran (Contributor) commented Jun 7, 2024

@cnmcavoy does this solve your issue? kubernetes-sigs/karpenter#1305

@ellistarn can add more details here, but this could potentially be modeled as an overlay on the node.

Do you think this expected pod overhead differs by instance type, or should the overhead be the same across the cluster, since it's related to the cluster's networking setup?

@Bryce-Soghigian (Contributor)
I have some relevant context: we just had the discussion of whether we should lower the IP configurations we generate in the Azure provider based on the static host-networking addons.

We had a similar value, "StaticAzureCNIHostNetworkAddons", and we decided to just respect the max-pods value we pass into kubelet and not subtract the deficit at all. I'm not sure how the AWS provider generates IP configurations or how ENIs work, but maybe this is a useful data point.

See this commit for more details

cc: @paulgmiller, our networking lead at AKS, who might have more context.

@Bryce-Soghigian (Contributor)
> "NOTE: There are some pods that we see on every node for networking (like kube-proxy and ip-masq-agent). We could be subtracting IPs for these pods, and we do for AKS Engine. In the case of AKS, however, we have a dynamic number of host-networking addon pods. So we are explicitly choosing not to subtract these IPs like AKS Engine and some parts of AKS do. The AKS networking team regrets adding this behavior, and Karpenter is a chance to do better, so we shall."

@cnmcavoy (Contributor, Author) commented Jun 11, 2024

@Bryce-Soghigian

Interesting. If I am following your AKS context, you are suggesting Karpenter should remove the hardcoded static limit entirely and report the stated EC2 pod capacity. The consequence would be that the pod capacity would not match the node's AWS ENI capacity, for those of us using ENI IPAM. What do you do when pods try to start on a node and can't receive an IP, because pod capacity != CNI capacity? I opened this PR because we hit this in production at Indeed.

@njtran

I do not see how node overlays address this bug report. As a user of Karpenter, I want the pod capacity of nodes to be correct. I do not want to set my own custom pod capacity; I want the node's pod capacity to match the number of ENIs that can be attached to the node. Currently it does not, because there are hardcoded magic numbers in the math that assume too much. Does this help clarify what this request is about?

@ellistarn ellistarn changed the title fix: Add option to configure extra pod capacity for alternate cnis feat: Add option to configure extra pod capacity for alternate cnis Jun 11, 2024
@ellistarn (Contributor) commented Jun 11, 2024

There's a broader challenge here: host-networked pods are not correctly accounted for. We make a bad assumption that there will be 2 (e.g. kube-proxy/vpc-cni), but in reality there are often far more than this due to other daemons, which artificially restricts nodes to fewer pods than they could actually run.

Another way to achieve your feature request is via a negative overhead node overlay.

```yaml
kind: NodeOverlay
metadata:
  name: host-networking
spec:
  overhead:
    pods: -1
```

This will select all nodes, and tell the scheduler that there's an additional pod slot available.
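Reading the NodeOverlay example above, the sign convention is that overhead is subtracted from the computed capacity, so a negative `pods` overhead frees a slot and a positive one reserves a slot. A minimal sketch under that assumption (the function name is illustrative, not part of any proposed API):

```python
# Illustrative sketch of the proposed NodeOverlay overhead semantics
# (kubernetes-sigs/karpenter#1305); names are hypothetical, not the
# implementation. Overhead is subtracted from the base pod capacity.
def schedulable_pods(base_capacity: int, overhead_pods: int) -> int:
    # Negative overhead advertises extra slots; positive overhead reserves
    # slots for daemons the base math did not account for.
    return base_capacity - overhead_pods

print(schedulable_pods(29, -1))  # 30: one extra slot, per the -1 overlay
print(schedulable_pods(29, 1))   # 28: one slot reserved instead
```

Note the two directions map onto the two concerns in this thread: a positive overhead covers extra host-networked daemons, while a negative overhead reclaims the slot freed by kube-proxy replacement.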


Successfully merging this pull request may close these issues.

Karpenter pod capacity exceeds available IPs from ENIs with other CNIs