[BUG] when snapshot metadata has no size, it's treated as live even if it is not #1071

Open · baroquebobcat (Contributor) opened this issue Sep 12, 2024 · 0 comments
Labels: bug (Something isn't working)

Describe the bug

Upgrading an index node from an image built before #940 to one built after #1030 causes it to try to delete all snapshot metadata for its partition, because it treats the existing metadata as stale live metadata; it then crashes when that deletion fails.

This happens for two reasons (sketched below):

  1. the size defaults to 0 for snapshot metadata written before #940
  2. liveness of snapshot metadata is now determined by checking whether the size is 0
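
To make the interaction concrete, here is a minimal sketch of the post-#1030 behavior as I understand it (class shape and names are illustrative, not the actual Astra code):

// Hypothetical sketch of why pre-#940 metadata is misclassified after #1030;
// names are illustrative, not copied from Astra.
public class SnapshotMetadata {
  public final String snapshotId;
  public final long sizeInBytes;

  public SnapshotMetadata(String snapshotId, long sizeInBytes) {
    this.snapshotId = snapshotId;
    this.sizeInBytes = sizeInBytes;
  }

  // Post-#1030 style check: liveness is inferred from size alone.
  public boolean isLive() {
    return sizeInBytes == 0;
  }
}

// A persistent snapshot written before #940 deserializes with the default
// size of 0, so isLive() returns true and the stale-live cleanup task tries
// to delete it, then crashes when that deletion fails.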

To work around it, I think we would need to update our snapshot metadata zk nodes so that the persistent snapshots have non-zero sizes, which we could do by either

  1. deploying an image after Add sizeInBytes field #940 but before New S3 object storage implementation / BlobStore #1030, letting the older snapshots expire, and then deploying an image after #1030, or
  2. writing a migration script that adds a size to all the persistent zk nodes (a rough sketch follows this list)
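
As a rough sketch of option 2, something like the following could walk the snapshot nodes and backfill a non-zero size (this is hypothetical: it uses Curator directly, and the ZK path layout and SnapshotMetadata (de)serialization are placeholders that would have to match what Astra actually stores):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BackfillSnapshotSizes {
  public static void main(String[] args) throws Exception {
    // Illustrative values; the real connect string and snapshot root depend
    // on the deployment and on Astra's metadata store layout.
    String zkConnect = args[0];
    String snapshotRoot = "/snapshots";

    CuratorFramework client =
        CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3));
    client.start();
    try {
      for (String child : client.getChildren().forPath(snapshotRoot)) {
        String path = snapshotRoot + "/" + child;
        byte[] data = client.getData().forPath(path);
        // Placeholder: deserialize with whatever Astra uses for
        // SnapshotMetadata, skip live snapshots and nodes that already have
        // a non-zero size, set sizeInBytes, and re-serialize.
        byte[] updated = withNonZeroSize(data);
        client.setData().forPath(path, updated);
      }
    } finally {
      client.close();
    }
  }

  // Placeholder for the real parse-and-rewrite logic.
  private static byte[] withNonZeroSize(byte[] data) {
    return data;
  }
}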

To fix it so we don't need an ops workaround, we could revert SnapshotMetadata::isLive to the previous implementation, or to some other implementation that tolerates previous SnapshotMetadata schemas (e.g. use -1 as the size for live snapshots so it doesn't overlap with the default value, or introduce a new field to mark liveness).
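
For example, the sentinel-value variant could look roughly like this (illustrative only; live snapshot metadata would also have to be written with the sentinel):

// Illustrative sketch of the sentinel-value option: live snapshots are
// written with size -1, which cannot collide with the default of 0 that old
// persistent metadata deserializes to.
public static final long LIVE_SNAPSHOT_SIZE_SENTINEL = -1;

public boolean isLive() {
  return sizeInBytes == LIVE_SNAPSHOT_SIZE_SENTINEL;
}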

I think reverting to checking the prefix of the id is the best least-effort solution given the current state of things, since the code that adds the prefix is still in place and another part of Astra, the query service, still references it:

private static String getRawSnapshotName(SearchMetadata searchMetadata) {
  return searchMetadata.snapshotName.startsWith("LIVE")
      ? searchMetadata.snapshotName.substring(5) // LIVE_
      : searchMetadata.snapshotName;
}

for (SearchMetadata searchMetadata : queryableSearchMetadataNodes) {
  if (!searchMetadata.snapshotName.startsWith("LIVE")) {
    cacheNodeHostedSearchMetadata.add(searchMetadata);
  }
}
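
Reverting would mean something along these lines (a sketch consistent with the "LIVE_" convention above, not the exact previous implementation):

// Sketch of a prefix-based liveness check, matching the "LIVE_" naming
// convention the query service relies on above.
public boolean isLive() {
  return snapshotId.startsWith("LIVE_");
}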

Requirements (place an x in each of the [ ])

  • [x] I've read and understood the Contributing guidelines and have done my best effort to follow them.
  • [x] I've read and agree to the Code of Conduct.
  • [x] I've searched for any related issues and avoided creating a duplicate issue.

To Reproduce

  1. deploy an index node at a SHA before Add sizeInBytes field #940
  2. wait until a number of snapshots have been generated
  3. deploy it again at a SHA after New S3 object storage implementation / BlobStore #1030
  4. see it start to crash after the task that deletes stale live snapshots fails

Expected behavior

Not crashing.

Screenshots

If applicable, add screenshots to help explain your problem.

Reproducible in:

Astra version:

JVM version:

OS version(s):

Additional context

Add any other context about the problem here.
