Fail to start a distributed DB using unidirectional out-of-band peering #2386

Open
rmedina97 opened this issue Mar 13, 2024 · 4 comments
Labels
kind/bug Something isn't working

Comments

@rmedina97

What happened:

I attempted to deploy the Liqo example of a stateful application in my hierarchical architecture, comprising one consumer and two provider clusters. However, only the first POD, db-mariadb-galera-0, successfully starts. The second POD fails to connect with the first and enters a CrashLoopBackOff state. Both PODs were scheduled in the provider clusters.

What you expected to happen:

I expected the PODs to be able to communicate with each other.

How to reproduce it (as minimally and precisely as possible):

Create three clusters using k3s (each with distinct POD and service CIDRs), peer them with Liqo as one consumer and two providers, and install the example Helm chart.
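
Before peering, it can help to confirm that the CIDRs chosen for the three clusters really are disjoint. The following is a minimal sketch using only the Python standard library. The provider CIDR values are hypothetical placeholders, not the ones from this report; the consumer values are the k3s defaults.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(cidrs):
    """Return every pair of named CIDRs whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


# Hypothetical addressing plan. Only the consumer values are the k3s defaults.
plan = {
    "consumer-pods": "10.42.0.0/16",
    "consumer-services": "10.43.0.0/16",
    "provider1-pods": "10.52.0.0/16",
    "provider1-services": "10.53.0.0/16",
    "provider2-pods": "10.62.0.0/16",
    "provider2-services": "10.63.0.0/16",
}

print(overlapping_cidrs(plan) or "no overlaps")
```

An empty result means no two ranges collide; any returned pair names the two entries that need to be changed.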

Anything else we need to know?:

I found a workaround: the entire DB is able to start only if there is a working POD in the consumer cluster. Otherwise, only the first POD starts. Additionally, bidirectional peering between every cluster resolves the issue, but my preference is to adhere to the hierarchical structure.
I first noticed this problem using the Percona XtraDB operator (another distributed DB application) with three PODs. If the POD in the consumer cluster is deleted and rescheduled onto a provider cluster, that POD again enters CrashLoopBackOff, but the other running PODs continue to work as normal.

Environment:

  • Liqo version: v0.10.1
  • Liqoctl version: v0.10.1
  • Kubernetes version (use kubectl version): k3s v1.24.17+k3s1
  • Cloud provider or hardware configuration:
  • Node image: Linux Ubuntu Server 20.04
  • Network plugin and version:
  • Install tools:
  • Others:
@rmedina97 rmedina97 added the kind/bug Something isn't working label Mar 13, 2024
@aleoli
Member

aleoli commented Mar 13, 2024

Hi @RiccardoStud! For better reproducibility, how do you install the MariaDB-galera cluster? Do you use an operator or a Helm chart? If so, please indicate which one.

@rmedina97
Author

Hi @aleoli! I used the Helm chart from the Liqo guide, running the command:
helm install db bitnami/mariadb-galera -n liqo-demo -f manifests/values.yaml.
I only changed the namespace name to match mine. For additional context, when I deploy the chart using only two of my clusters (one provider and one consumer), it functions normally.

@fra98
Member

fra98 commented Mar 25, 2024

Hi @rmedina97.
I reproduced your deployment and can confirm it is not working with this specific topology.
This is by design: in Liqo, pods on different leaf clusters cannot reach each other directly using their original IPs. Instead, their addresses are remapped onto the external CIDR of the originating cluster. The deployment could still work in some cases:

  • pods in leaf clusters connect to a pod in the originating cluster (if the DB does not require a full mesh between all replicas)
  • leaf clusters are peered directly (you do not need a bidirectional peering, a peering in either direction is enough). This works because the leaf clusters then know each other's pod CIDRs and do not need to use the ExternalCIDR for IP mappings.
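
The remapping described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not Liqo's actual IPAM: the `ExternalCidrTable` class, the `address_from_other_leaf` helper, and every address below are hypothetical. It shows that without a direct peering, a pod on one leaf must dial an address from the originating cluster's external CIDR, so the original pod IP of a replica on the other leaf is not usable.

```python
import ipaddress

# Hypothetical addresses, chosen only for illustration.
CONSUMER_EXTERNAL_CIDR = "10.70.0.0/16"
GALERA_0_ON_LEAF_A = "10.50.1.10"


class ExternalCidrTable:
    """Toy stand-in for an IPAM: one external-CIDR address per remote pod IP."""

    def __init__(self, cidr):
        self._free = iter(ipaddress.ip_network(cidr).hosts())
        self._mapped = {}

    def translate(self, pod_ip):
        # Allocate the next free address the first time a pod IP is seen.
        if pod_ip not in self._mapped:
            self._mapped[pod_ip] = str(next(self._free))
        return self._mapped[pod_ip]


def address_from_other_leaf(pod_ip, table, leaves_peered):
    """Address a pod on one leaf must dial to reach pod_ip on another leaf."""
    if leaves_peered:
        # Directly peered leaves know each other's pod CIDRs (assumed disjoint).
        return pod_ip
    return table.translate(pod_ip)


table = ExternalCidrTable(CONSUMER_EXTERNAL_CIDR)
for peered in (False, True):
    dial = address_from_other_leaf(GALERA_0_ON_LEAF_A, table, peered)
    usable = dial == GALERA_0_ON_LEAF_A
    print(f"leaves peered={peered}: dial {dial}, original IP usable={usable}")
```

In this model, a replica that only learns its peer's original pod IP can connect when the leaves are peered directly, but not in the purely hierarchical topology.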

Please note that a newly redesigned network will be merged soon, and we will test distributed DB scenarios again.

@rmedina97
Author

Thanks for the comprehensive answer. I will adopt one of the suggested solutions for now.

3 participants