
Sharded kube state metrics returns stale metrics #2431

Open · mal-berbatov-ci opened this issue Jun 25, 2024 · 4 comments

Labels: kind/bug, triage/accepted

@mal-berbatov-ci

Hi,

This is tangentially related to #2372, but in our case, the state that causes this is definitely not label changes to the kube state metrics statefulset.

What happened:
Occasionally, we will see that a shard of our KSM is reporting stale metrics; namely that a pod is stuck in “Pending” state. We can easily verify that this isn’t the case, and rolling out the statefulset will clear the issue.
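
As an aside for anyone trying to confirm this from the Prometheus side rather than via kubectl: since KSM emits exactly one phase series with value 1 per pod, a pod showing more than one active phase (a stale shard's "Pending" next to the live shard's "Running") can be surfaced with something like the sketch below (label names depend on your scrape config; this is not a query we ran, just an illustration):

```promql
# Sketch: a pod with more than one active (value 1) phase series indicates
# a stale shard disagreeing with the live one. The exported_pod label
# assumes honor_labels: false in the scrape config; adjust to your setup.
count by (namespace, exported_pod) (kube_pod_status_phase == 1) > 1
```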

Anything else we need to know?:
We have been seeing this at least since v2.9.2 of KSM, and only on sharded KSM installations. It is rare, but happens occasionally.
After upgrading KSM to v2.12.0, we saw an issue where a version upgrade of a component across our k8s fleet (resulting in updated labels for those pods) caused a flurry of alerts about the pods being stuck in a pending state. The pods themselves were fine, but our KSM installations on v2.12.0 all started serving stale metrics. Just to clarify, the statefulset labels for KSM were unchanged; rather, the labels of an unrelated component were updated.

We were also staggering the v2.12.0 update of KSM, meaning that our dev/staging clusters were on v2.12.0 whilst our production clusters were on v2.9.2. The component version upgrade was being rolled out to all clusters; however, it was only our dev/staging clusters that saw KSM serving stale metrics. It looks quite likely that some change from v2.9.2 → v2.12.0 has made this issue worse.

Prior to this particular component version upgrade with KSM on v2.12.0, I haven't been able to spot any kind of pattern for when KSM falls into the state of serving stale metrics. I can verify, however, that in all cases the statefulset labels for KSM remained unchanged. We do not drop kube_statefulset_labels, so I have been able to verify this.
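
(For example, a query along these lines, with illustrative namespace/name values, shows the label set KSM reports for its own StatefulSet:)

```promql
# Sketch: inspect the labels KSM reports for its own StatefulSet.
# "monitoring" and "kube-state-metrics" are illustrative names; label_*
# labels only appear if allowed via --metric-labels-allowlist.
kube_statefulset_labels{namespace="monitoring", statefulset="kube-state-metrics"}
```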

Environment:
kube-state-metrics version: v2.12.0 & v2.9.2
Kubernetes version (use kubectl version): v1.27 to v1.30
Cloud provider or hardware configuration: GKE

@mal-berbatov-ci added the kind/bug label on Jun 25, 2024
@k8s-ci-robot added the needs-triage label on Jun 25, 2024
@dgrisonnet (Member)

/assign @CatherineF-dev
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jun 27, 2024
@CatherineF-dev (Contributor) commented Jun 27, 2024

> serving stale metrics

Do we know which metric this was?

> Sharded kube state metrics

Does non-sharded kube-state-metrics have this issue or not?

Also, could you reproduce this issue consistently?

@mal-berbatov-ci (Author)

The stale metric we saw was `kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}`. We have an alert checking this expression, which is what made us wary of the issue. The pods in question were all reporting "Pending". I did not check other metrics; I'd hazard a guess that they were also stale, but if the issue happens again, I can double check.
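
For context, a minimal sketch of how that expression behaves (the `== 1` filter is added here for illustration and isn't necessarily part of our actual rule):

```promql
# Sketch of the alerting pattern: KSM emits a 0/1 series per phase, so the
# filter keeps only pods whose *current* phase is Pending/Unknown/Failed.
# A real rule would typically add a "for:" duration to ignore transients.
kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} == 1
```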

Non-sharded KSM does not have this issue.

I could not reproduce the issue consistently; I tried once but couldn't replicate it.

@mal-berbatov-ci (Author)

Actually, looking at historical metrics, they are in fact all stale for the specific `exported_pod` values that were seeing this issue. It also looks like, when the rollout of the service we were updating happened, the metric swapped from shard 0 to shard 1. And until we rollout-restarted the KSM statefulset, some metrics for other components were not getting registered.

All in all, it definitely does look like stale metrics were being served across every single shard of KSM, and until they were restarted, values were not updated.
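
For anyone digging into something similar, a sketch of the kind of query that shows which shard a given series is coming from (`<affected-pod>` is a placeholder; label names assume `honor_labels: false`, where the scraping KSM pod lands in `pod` and the target pod in `exported_pod`):

```promql
# Sketch: break a suspect series down by the KSM shard serving it. With
# honor_labels: false, "pod" is the scraping KSM pod (kube-state-metrics-0,
# -1, ...) and the target pod appears as "exported_pod".
sum by (pod) (
  kube_pod_status_phase{exported_pod="<affected-pod>", phase="Pending"}
)
```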
