
idle nodes on gcs cluster #769

Open
rabernat opened this issue Oct 4, 2020 · 11 comments

@rabernat
Member

rabernat commented Oct 4, 2020

I randomly logged into the Google Cloud console to monitor our cluster tonight. I found that the cluster was scaled up to 8 nodes / 34 vCPUs / 170 GB of memory.

[screenshot]

However, afaict there are only two Jupyter users logged in:

[screenshot]

I poked around the nodepools, and the nodes seemed to be heavily undersubscribed.

[screenshot]

This is as far as my debugging skills go. I don't know how to figure out what pods are running on those nodes. I wish the elastic nodepools would scale down. Maybe there are some permanent services whose pods got stuck on those nodes and now they can't be scaled down?

This is important because it costs a lot of money to have these VMs constantly running.
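
For anyone debugging the same thing, listing the pods scheduled on a particular node is one way to see what is pinning it; a rough sketch (assuming kubectl access to the cluster, with the node name as a placeholder):

# list every pod scheduled on a given node, across all namespaces
$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# or read the "Non-terminated Pods" section of the node description
$ kubectl describe node <node-name>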

@TomAugspurger
Member

TomAugspurger commented Oct 4, 2020

Seems like Dask Gateway and some JupyterHub pods are occupying these nodes.

$ kubectl get pod -o wide -n prod | grep highmem 
api-gcp-uscentral1b-prod-dask-gateway-55dffbc7d4-dpcwn           1/1     Running   0          4d18h   10.36.33.179    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-snip   <none>           <none>
continuous-image-puller-bgfgp                                    1/1     Running   0          11d     10.36.33.185    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-snip   <none>           <none>
continuous-image-puller-jkk8q                                    1/1     Running   0          11d     10.37.170.127   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
continuous-image-puller-smrd8                                    1/1     Running   0          11d     10.36.24.14     gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-mzj9   <none>           <none>
continuous-image-puller-xgp8w                                    1/1     Running   0          11d     10.36.29.15     gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-1frw   <none>           <none>

$ kubectl get pod -o wide -n prod | grep standard
continuous-image-puller-hh8kk                                    1/1     Running   0          11d     10.36.248.192   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>
continuous-image-puller-n4779                                    1/1     Running   0          11d     10.37.142.146   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955   <none>           <none>
controller-gcp-uscentral1b-prod-dask-gateway-5f77b8d797-s84ph    1/1     Running   0          16d     10.36.248.64    gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>
gcp-uscentral1b-prod-grafana-7fdb568f65-zxj9c                    2/2     Running   0          16d     10.36.248.62    gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>
gcp-uscentral1b-prod-ingress-nginx-controller-79cd5cd96c-z8crw   1/1     Running   1          16d     10.37.142.217   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955   <none>           <none>
gcp-uscentral1b-prod-kube-state-metrics-58d7c65fd7-kxhwk         1/1     Running   3          16d     10.37.142.218   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955   <none>           <none>
gcp-uscentral1b-prod-prome-operator-6b5b49dccb-w4ph8             2/2     Running   0          16d     10.36.248.63    gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>
prometheus-gcp-uscentral1b-prod-prome-prometheus-0               3/3     Running   4          16d     10.37.142.219   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955   <none>           <none>
traefik-gcp-uscentral1b-prod-dask-gateway-55d7854bf7-xgdhc       1/1     Running   1          16d     10.36.248.61    gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>

The dask-gateway pods should likely be in the core node pool, along with JupyterHub. There's no reason to keep them separate, I think. I can take care of that.

I'm not sure about the continuous-image-puller. I gather that it's a JupyterHub thing, but I'm not sure what the impact of disabling it would be. It seems to me like it shouldn't be the sole thing keeping a node from scaling down (and maybe if we fix the dask-gateway pods, it will scale down).

@rabernat
Member Author

rabernat commented Oct 5, 2020

Thanks for looking into this Tom!

Do we need some sort of cron job that checks whether these services are running on non-core nodes?
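
A rough sketch of what such a check could look like, assuming the standard GKE cloud.google.com/gke-nodepool node label (the loop is illustrative, not an existing script):

# print the node pool backing each node that currently hosts a pod in the prod namespace
for node in $(kubectl get pods -n prod -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u); do
  pool=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.cloud\.google\.com/gke-nodepool}')
  echo "$node -> $pool"
done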

@TomAugspurger
Member

With dask/dask-gateway#325 and dask/dask-gateway#324 we'll be able to set things up so that these pods don't run on non-core nodes in the first place. That'll need to wait for the next dask-gateway release.

In the meantime, we can patch around it:

# file: patch.yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: hub.jupyter.org/node-purpose
                operator: In
                values:
                - core
$ kubectl -n staging patch deployment traefik-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/traefik-gcp-uscentral1b-staging-dask-gateway patched

$ kubectl -n staging patch deployment api-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/api-gcp-uscentral1b-staging-dask-gateway patched

$ kubectl -n staging patch deployment controller-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/controller-gcp-uscentral1b-staging-dask-gateway patched

I've confirmed that those were moved to the default pool for staging at least, and things seem to still work. Still to do:

  • Do the same for prod.
  • Integrate this into CI (so we don't lose it on each deployment).
  • Verify that the continuous-image-puller stuff doesn't keep a node pool alive.
  • Ensure that the Grafana monitoring things live in the core pool.
  • Maybe clean up the Pangeo Forge stuff (a Prefect agent, and some other pod I don't recognize) in staging.

I'll get to those later.

@TomAugspurger
Member

I might have broken some prometheus / grafana things (the hub should be fine):

Error: UPGRADE FAILED: cannot patch "gcp-uscentral1b-staging-grafana" with kind Ingress: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://gcp-uscentral1b-prod-ingress-nginx-controller-admission.prod.svc:443/extensions/v1beta1/ingresses?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-ingress-nginx-controller-admission" && cannot patch "gcp-uscentral1b-staging-pr-alertmanager.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-etcd" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-k8s.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kube-apiserver-availability.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kube-apiserver-slos" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kube-apiserver.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kube-prometheus-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kube-prometheus-node-recording.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post 
https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kube-scheduler.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kube-state-metrics" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kubelet.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kubernetes-apps" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kubernetes-resources" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kubernetes-storage" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system-apiserver" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system-controller-manager" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system-kubelet" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch 
"gcp-uscentral1b-staging-pr-kubernetes-system-scheduler" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-node-network" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-prometheus-operator" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" && cannot patch "gcp-uscentral1b-staging-pr-prometheus" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator"

I need to figure out what pods are actually needed per namespace for prometheus-operator to function.
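
Since the errors above are all "no endpoints available", one quick check (a sketch, using the service names from the error message) is whether the webhook services in the prod namespace still have backing pods:

# do the admission webhook services have any endpoints?
$ kubectl -n prod get endpoints gcp-uscentral1b-prod-prome-operator gcp-uscentral1b-prod-ingress-nginx-controller-admission

# are the corresponding pods actually running?
$ kubectl -n prod get pods | grep -E 'prome-operator|ingress-nginx'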

@TomAugspurger
Member

TomAugspurger commented Oct 7, 2020

@consideRatio the GCP cluster has a node with just system pods and two continuous-image-puller pods (one each for prod and staging):

$ kubectl get pod -o wide --all-namespaces | grep gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd
kube-system   fluentd-gke-kv5n6                                                 2/2     Running     0          49d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   gke-metadata-server-p8wpm                                         1/1     Running     0          49d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   gke-metrics-agent-px4vg                                           1/1     Running     0          49d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   kube-dns-7c976ddbdb-kqglx                                         4/4     Running     2          49d    10.37.170.162   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   kube-proxy-gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd    1/1     Running     0          68d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   netd-nbvfh                                                        1/1     Running     0          69d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   prometheus-to-sd-9fsqg                                            1/1     Running     0          69d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
prod          continuous-image-puller-52hgv                                     1/1     Running     0          10h    10.37.170.239   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
staging       continuous-image-puller-7fvmg                                     1/1     Running     0          10h    10.37.170.229   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>

That node is in an auto-provisioned node pool set to auto-scale down all the way to zero. I wouldn't expect the continuous-image-puller pods to keep a node from auto-scaling down, though perhaps that's incorrect. Does that look strange to you?

@TomAugspurger
Member

From https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/optimization.html:

It is important to realize that the continuous-image-puller together with a Cluster Autoscaler (CA) won't guarantee a reduced wait time for users. It only helps if the CA scales up before real users arrive, but the CA will generally fail to do so. This is because it will only add a node if one or more pods won't fit on the current nodes but would fit more if a node is added, but at that point users are already waiting. To scale up nodes ahead of time we can use user-placeholders.

This suggests that the continuous-image-puller isn't all that useful on its own, and since we aren't using user-placeholders, perhaps we should just remove the continuous-image-puller.
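
If we go that route, disabling it should just be a z2jh Helm value; a sketch, assuming the standard zero-to-jupyterhub chart layout (not yet verified against our deployment config):

# jupyterhub (z2jh) chart values: turn off the continuous image puller
prePuller:
  continuous:
    enabled: false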

TomAugspurger added a commit to TomAugspurger/pangeo-cloud-federation that referenced this issue Oct 7, 2020
@consideRatio
Member

Hmmm, I guess if you only pull a single image and don't have user placeholders, then it's just a pod requesting no resources that can be evicted by other pods if needed.

It is very harmless in the latest z2jh release, and it won't block scale-down. I would inspect the nodes individually with kubectl describe nodes to see what pods ran on them, and I would check what the cluster-autoscaler status ConfigMap in the kube-system namespace says.
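
For reference, the concrete commands would be something like the following sketch (using the node name from the output above):

# see which pods ran on the node and what resources they requested
$ kubectl describe node gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd

# see what the cluster autoscaler reports about scale-down candidates
$ kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml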

@TomAugspurger
Member

TomAugspurger commented Oct 7, 2020 via email

@TomAugspurger
Member

TomAugspurger commented Nov 6, 2020

A few more stray pods that I'll pin down to the core pool:

  • gcp-uscentral1b-prod-ingress-nginx-controller-79cd5cd96c-qdmzj
  • gcp-uscentral1b-staging-ingress-nginx-controller-865cfd455mrcps
  • mlflow-84f8c9d9c-2vssl

TomAugspurger added a commit to TomAugspurger/pangeo-cloud-federation that referenced this issue Nov 6, 2020
This removes a few more pieces from the metrics based on
prometheus-operator, which we replaced with separate prometheus and
grafana charts.

The dependency on nginx-ingress caused the stray pods in
pangeo-data#769 (comment),
which were unused.
TomAugspurger added a commit that referenced this issue Nov 6, 2020
@TomAugspurger
Member

Leaving a note here for future debugging. I noticed that the node gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-88l4 wasn't scaling down, despite having just kube-system pods and the prometheus-node-exporter DaemonSet. https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility suggests viewing the logs at https://console.cloud.google.com/logs/query;query=logName%3D%22projects%2Fpangeo-181919%2Flogs%2Fcontainer.googleapis.com%252Fcluster-autoscaler-visibility%22?authuser=1&angularJsUrl=%2Flogs%2Fviewer%3Fsupportedpurview%3Dproject%26authuser%3D1&project=pangeo-181919&supportedpurview=project&query=%0A, with this filter:

logName="projects/pangeo-181919/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"

I see a NoDecisionStatus, and in the logs:

reason: {
  parameters: [
    0: "metrics-server-v0.3.6-5cf765ff9-9pvxn"
  ]
  messageId: "no.scale.down.node.pod.kube.system.unmovable"
}

So there's a system pod that was scheduled onto the high-memory pool. Ideally those would be in the core pool. I'll see if I can add an annotation to it.
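
For the record, the same decision logs can be pulled from the CLI rather than the console; a sketch, assuming an authenticated gcloud pointed at the pangeo-181919 project:

$ gcloud logging read \
    'logName="projects/pangeo-181919/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"' \
    --project=pangeo-181919 --limit=20 --format=json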

@TomAugspurger
Member

TomAugspurger commented Nov 16, 2020

Hmm, according to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods

Metrics Server is best left alone, as restarting it causes the loss of metrics for >1 minute, as well as metrics in dashboard from the last 15 minutes. Metrics Server downtime also means effective HPA downtime as it relies on metrics. Add PDB for it only if you're sure you don't mind.

We're probably OK with that. I wonder whether defining a PDB is better than (somehow?) setting the nodeAffinity so that it ends up in the core pool in the first place. We would want the affinity regardless, so that it doesn't bounce between non-core nodes.
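
If we do go the PDB route, it would look something like the sketch below (assuming the GKE metrics-server pods carry the k8s-app: metrics-server label; not applied anywhere yet):

# file: metrics-server-pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: metrics-server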
