Replies: 3 comments 2 replies
-
Can we move this to a discussion rather than an issue?
-
So I ran into this issue with multiple replicas in Kubernetes. I'm using a shared storage system (EFS), set up in a cluster: one leader node and many read-only replicas, with ActiveMQ and a Postgres database for persistence (3 replicas in a StatefulSet plus a headless Service so each pod can be contacted via cluster DNS).

Initially, with small deployments of 1-2 read-only replicas as StatefulSets, we had no problems. Once I changed the type to Deployment, because I wanted to improve the rollout time of a read-only (stateless) pod, I started running into issues. We mount the same storage from the leader onto the read-only replicas, but we cannot mount the data_dir read-only: each GeoServer instance expects to write some small data (logs, cluster-node config), and herein lies our problem. Each pod that starts will attempt to write a newly hashed password over an already existing hash that is shared between pods, causing the leader, when it sees the change, to send a request to all workers to reload configuration. With more than 2-3 pods starting at the same time, I would often see a few files go blank: ./data_dir/security/roles/default/roles.xml and roles.xml.orig. These files would be wiped out and Tomcat would not be able to start, causing the pod to fail its health check. This is confirmed in /usr/local/tomcat/logs/localhost.x.log (a more detailed error there describes the problem).

My fix: create a custom read-only image that removes some of the wrapper scripts around password handling (only our master instance should be making these changes); see the sketch below. Since this change, I can spawn as many pods as I like without a problem.
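For anyone who wants to approximate this without rebuilding the image, here is a minimal, hypothetical sketch of a read-only Deployment that bypasses the image's entrypoint wrapper scripts. It assumes a Tomcat-based GeoServer image where `catalina.sh run` starts Tomcat directly; the image tag, PVC name, and data_dir path are placeholders, not values from this thread.

```yaml
# Hypothetical sketch: read-only GeoServer replicas sharing the leader's data_dir.
# Assumes a Tomcat-based image where "catalina.sh run" starts the server directly,
# skipping the entrypoint scripts that rewrite shared security/password files.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: geoserver-readonly
spec:
  replicas: 3
  selector:
    matchLabels:
      app: geoserver-readonly
  template:
    metadata:
      labels:
        app: geoserver-readonly
    spec:
      containers:
        - name: geoserver
          image: example/geoserver:readonly   # placeholder image
          # Bypass the stock entrypoint so no password/role files are rewritten;
          # only the leader instance should modify the shared security files.
          command: ["catalina.sh", "run"]
          env:
            - name: GEOSERVER_DATA_DIR
              value: /opt/geoserver/data_dir
          volumeMounts:
            - name: shared-data
              mountPath: /opt/geoserver/data_dir
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: geoserver-shared-data   # placeholder PVC backed by EFS
```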
-
I am seeing this on OpenShift with nfs-provisioner, and with only a single pod that is redeployed.
-
What is the bug or the crash?
GeoServer running in a Kubernetes pod fails to start after a pod restart. The primary symptoms are ThreadLocal errors and WebappClassLoaderBase illegal-state exceptions. This occurs specifically when the pod, which mounts a persistent volume for the GeoServer data directory, is restarted.
Steps to reproduce the issue
1. Deploy GeoServer on Kubernetes using the kartoza charts with this value for persistence (an illustrative values sketch follows this list).
2. Attach a persistent volume to the GeoServer pod for storing the data directory.
3. Add workspaces, stores, and layers to the instance.
4. Kill the pod.
5. Let Kubernetes automatically restart the GeoServer pod.
6. Observe the errors in the pod logs.
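Since the original persistence values are not shown here, below is a purely hypothetical illustration of what such a chart values fragment could look like. The key names (`persistence.enabled`, `storageClass`, `accessModes`, `size`) are assumptions for illustration, not confirmed kartoza chart keys; check the chart's own values.yaml for the actual names.

```yaml
# Hypothetical values.yaml fragment for a GeoServer Helm deployment.
# Key names are illustrative assumptions; verify against the chart's values.yaml.
persistence:
  enabled: true
  storageClass: standard-rwo   # placeholder storage class
  accessModes:
    - ReadWriteOnce
  size: 10Gi
```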
Versions
GeoServer Version: 2.23.2 (but appears with any version)
Docker Image: docker.io/kartoza/geoserver:2.23.2
Google Kubernetes Engine (GKE) standard cluster
Filestore instance as persistent volume
Additional context
The problem arises specifically when restarting the pod. The initial deployment, with no data linked to the GeoServer instance, does not show these errors. The persistent volume seems to be correctly configured, and this setup worked seamlessly before. I found recommendations online suggesting a change to the default data_dir, so I mounted my persistent volume to a different location (/opt/persistence/data_dir) and updated the container environment variables accordingly:
GEOSERVER_DATA_DIR: /opt/persistence/data_dir
GEOWEBCACHE_CACHE_DIR: /opt/persistence/data_dir/gwc
This change was expected to resolve the issue, but the startup errors persisted.
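For reference, a minimal sketch of how that mount and those environment variables would be wired in the pod spec is below; the volume and claim names are placeholders, while the image tag and env values come from this report.

```yaml
# Sketch of the relevant pod spec fragment; volume/claim names are placeholders.
containers:
  - name: geoserver
    image: docker.io/kartoza/geoserver:2.23.2
    env:
      - name: GEOSERVER_DATA_DIR
        value: /opt/persistence/data_dir
      - name: GEOWEBCACHE_CACHE_DIR
        value: /opt/persistence/data_dir/gwc
    volumeMounts:
      - name: persistence
        mountPath: /opt/persistence
volumes:
  - name: persistence
    persistentVolumeClaim:
      claimName: geoserver-persistence   # placeholder, backed by Filestore
```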
Additionally, as a solution to this problem, I am open to recommendations on best practices for maintaining GeoServer state when deploying on Kubernetes while ensuring horizontal scalability.