[BUG] when snapshot metadata has no size, it's treated as live even if it is not #1071

Open · baroquebobcat (Contributor) opened this issue Sep 12, 2024 · 0 comments
Labels: bug (Something isn't working)

Describe the bug

Upgrading an index node from an image built before #940 to one built after #1030 causes it to try to delete all snapshot metadata for its partition, because it treats the existing metadata as stale live metadata; it then crashes when that deletion fails.

This happens for two reasons (sketched below):

  1. the size defaults to 0 for snapshot metadata written before #940
  2. liveness of snapshot metadata is now determined by checking whether the size is 0
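
To make the interaction concrete, here is a minimal sketch of the post-#1030 behavior as I understand it (class shape and names are illustrative, not the actual Astra code):

// Hypothetical sketch of why pre-#940 metadata is misclassified after #1030;
// names are illustrative, not copied from Astra.
public class SnapshotMetadata {
  public final String snapshotId;
  public final long sizeInBytes;

  public SnapshotMetadata(String snapshotId, long sizeInBytes) {
    this.snapshotId = snapshotId;
    this.sizeInBytes = sizeInBytes;
  }

  // Post-#1030 style check: liveness is inferred from size alone.
  public boolean isLive() {
    return sizeInBytes == 0;
  }
}

// A persistent snapshot written before #940 deserializes with the default
// size of 0, so isLive() returns true and the stale-live cleanup task tries
// to delete it, then crashes when that deletion fails.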

To work around it, I think we would need to update our snapshot metadata zk nodes so that the persistent snapshots have non-zero sizes, which we could do by either

  1. deploying an image after Add sizeInBytes field #940 but before New S3 object storage implementation / BlobStore #1030, letting the older snapshots expire, and then deploying an image after #1030, or
  2. writing a migration script that adds a size to all the persistent zk nodes (a rough sketch follows this list)
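
As a rough sketch of option 2, something like the following could walk the snapshot nodes and backfill a non-zero size (this is hypothetical: it uses Curator directly, and the ZK path layout and SnapshotMetadata (de)serialization are placeholders that would have to match what Astra actually stores):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BackfillSnapshotSizes {
  public static void main(String[] args) throws Exception {
    // Illustrative values; the real connect string and snapshot root depend
    // on the deployment and on Astra's metadata store layout.
    String zkConnect = args[0];
    String snapshotRoot = "/snapshots";

    CuratorFramework client =
        CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3));
    client.start();
    try {
      for (String child : client.getChildren().forPath(snapshotRoot)) {
        String path = snapshotRoot + "/" + child;
        byte[] data = client.getData().forPath(path);
        // Placeholder: deserialize with whatever Astra uses for
        // SnapshotMetadata, skip live snapshots and nodes that already have
        // a non-zero size, set sizeInBytes, and re-serialize.
        byte[] updated = withNonZeroSize(data);
        client.setData().forPath(path, updated);
      }
    } finally {
      client.close();
    }
  }

  // Placeholder for the real parse-and-rewrite logic.
  private static byte[] withNonZeroSize(byte[] data) {
    return data;
  }
}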

To fix it so we don't need an ops workaround, we could revert SnapshotMetadata::isLive to the previous implementation, or to some other implementation that tolerates previous SnapshotMetadata schemas (e.g. use -1 as the size for live snapshots so it doesn't overlap with the default value, or introduce a new field to mark liveness).
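
For example, the sentinel-value variant could look roughly like this (illustrative only; live snapshot metadata would also have to be written with the sentinel):

// Illustrative sketch of the sentinel-value option: live snapshots are
// written with size -1, which cannot collide with the default of 0 that old
// persistent metadata deserializes to.
public static final long LIVE_SNAPSHOT_SIZE_SENTINEL = -1;

public boolean isLive() {
  return sizeInBytes == LIVE_SNAPSHOT_SIZE_SENTINEL;
}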

I think reverting to checking the prefix of the id is the best least-effort solution given the current state of things, since the code that adds the prefix is still in place and another part of Astra, the query service, still references it:

private static String getRawSnapshotName(SearchMetadata searchMetadata) {
  return searchMetadata.snapshotName.startsWith("LIVE")
      ? searchMetadata.snapshotName.substring(5) // LIVE_
      : searchMetadata.snapshotName;
}

for (SearchMetadata searchMetadata : queryableSearchMetadataNodes) {
  if (!searchMetadata.snapshotName.startsWith("LIVE")) {
    cacheNodeHostedSearchMetadata.add(searchMetadata);
  }
}
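
Reverting would mean something along these lines (a sketch consistent with the "LIVE_" convention above, not the exact previous implementation):

// Sketch of a prefix-based liveness check, matching the "LIVE_" naming
// convention the query service relies on above.
public boolean isLive() {
  return snapshotId.startsWith("LIVE_");
}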

Requirements (place an x in each of the [ ])

  • [x] I've read and understood the Contributing guidelines and have done my best effort to follow them.
  • [x] I've read and agree to the Code of Conduct.
  • [x] I've searched for any related issues and avoided creating a duplicate issue.

To Reproduce

  1. deploy an index node at a SHA before Add sizeInBytes field #940
  2. wait until a number of snapshots have been generated
  3. deploy it again at a SHA after New S3 object storage implementation / BlobStore #1030
  4. see it start to crash after the task that deletes stale live snapshots fails

Expected behavior

Not crashing.

Screenshots

If applicable, add screenshots to help explain your problem.

Reproducible in:

Astra version:

JVM version:

OS version(s):

Additional context

Add any other context about the problem here.
