storcon: Updating of observed state is racy #9124

VladLazar · 2024-09-24T14:12:55Z

The "normal" code path for updating tenant observed state is via Service::process_result: once a reconciler finishes, it pushes its result onto a channel. We read from the channel in a background loop and set the observed state of the shard to whatever the ReconcileResult suggests. Reconcilers themselves operate on a snapshot of the observed state.
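To make the flow concrete, here is a minimal sketch of that path. All the types and function names below (`LocationConf`, `spawn_reconciler`, `process_results`, `run_once`) are simplified, hypothetical stand-ins rather than the real storage controller code, and the results loop is run after the reconciler joins just to keep the example deterministic.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::sync::Mutex;
use std::thread;

// Hypothetical, simplified stand-ins; not the real storage controller types.
type ShardId = u32;
type NodeId = u32;

#[derive(Clone, Debug, PartialEq)]
enum LocationConf {
    AttachedSingle { generation: u32 },
}

type ObservedState = HashMap<NodeId, LocationConf>;

struct ReconcileResult {
    shard: ShardId,
    observed: ObservedState,
}

/// A "reconciler": works on its own snapshot of the observed state and
/// reports what it believes the new observed state is over the channel.
fn spawn_reconciler(
    shard: ShardId,
    snapshot: ObservedState,
    target: NodeId,
    generation: u32,
    tx: mpsc::Sender<ReconcileResult>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut observed = snapshot;
        observed.insert(target, LocationConf::AttachedSingle { generation });
        tx.send(ReconcileResult { shard, observed })
            .expect("result receiver dropped");
    })
}

/// The results loop: overwrite each shard's observed state with whatever
/// the result carries. Returns once every sender has been dropped.
fn process_results(
    rx: mpsc::Receiver<ReconcileResult>,
    shards: &Mutex<HashMap<ShardId, ObservedState>>,
) {
    for result in rx {
        shards.lock().unwrap().insert(result.shard, result.observed);
    }
}

/// Run one reconcile for shard 7, attaching it to node 2 at generation 1.
fn run_once() -> HashMap<ShardId, ObservedState> {
    let shards = Mutex::new(HashMap::new());
    let (tx, rx) = mpsc::channel();
    let snapshot = ObservedState::new();
    spawn_reconciler(7, snapshot, 2, 1, tx)
        .join()
        .expect("reconciler panicked");
    process_results(rx, &shards);
    shards.into_inner().unwrap()
}
```

The thing to notice is that `process_results` replaces the shard's whole observed state with the reconciler's view, which was derived from a snapshot taken when the reconciler started.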

However, we also update the observed state inline by grabbing the lock (potentially non-exhaustive list below):

- Service::node_configure (probably the biggest offender): updates observed state in response to nodes coming online/offline
- Service::re_attach: this looks safe at a first approximation 🤷
- Service::do_tenant_shard_split
- Service::node_drop
- Service::node_delete

It's probably obvious by now that this pattern is no bueno, but let's use https://github.com/neondatabase/cloud/issues/17362 as an example race:

1. In response to the attached node A going unavailable, we migrate shard X to node B. We are now in AttachedSingle with both A and B (different generations tho). Node A is unavailable, so we skip detaching from it for now.
2. Node A comes back online. We update the observed state for all shards on A here. This includes a location with A: AttachedSingle for shard X.
3. Reconcile from step (1) finishes and clobbers the updates we made in step (2). Now we've lost knowledge of X's stale attachment on A and can't detach it.
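The interleaving above can be replayed deterministically in a few lines. This is only a sketch: `Location`, `ObservedState`, `replay_race`, the node names and the generation numbers are all made up for illustration, and it assumes the reconciler's snapshot held no usable entry for A because A was unavailable when the snapshot was taken.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins; not the real storage controller types.
type NodeId = &'static str;

#[derive(Clone, Debug, PartialEq)]
enum Location {
    AttachedSingle { generation: u32 },
}

type ObservedState = HashMap<NodeId, Location>;

/// Replays the three steps for shard X and returns its final observed state.
fn replay_race() -> ObservedState {
    // Shard X's observed state as held by the service. Assumption: nothing
    // usable is recorded for A while it is unavailable.
    let mut shard_x = ObservedState::new();

    // Step 1: the reconciler starts from a snapshot, attaches X to B with a
    // newer generation, and skips the detach from the unavailable node A.
    let mut reconciler_view = shard_x.clone();
    reconciler_view.insert("B", Location::AttachedSingle { generation: 2 });

    // Step 2: A comes back online and the inline update records its stale
    // attachment directly on the shard, under the lock.
    shard_x.insert("A", Location::AttachedSingle { generation: 1 });

    // Step 3: the reconcile result arrives and overwrites the observed state
    // wholesale, dropping the entry written in step 2.
    shard_x = reconciler_view;

    shard_x
}
```

Calling `replay_race()` yields an observed state that contains only B. The entry for A written in step 2 is gone, which is exactly the lost knowledge of the stale attachment.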

Note that we now pass the observed state around with storage controller rolling restarts, so these inconsistencies propagate through restarts.


VladLazar added c/storage/controller Component: Storage Controller t/bug Issue Type: Bug c/storage Component: storage labels Sep 24, 2024

VladLazar mentioned this issue Sep 30, 2024

storcon: Deal with timeline CRUD in multi-attached states correctly #9144

Open
