Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decrease allowed elasticsearch delay to refresh interval in dedupe rule run #35154

Closed
wants to merge 1 commit into from

Conversation

gherceg
Copy link
Contributor

@gherceg gherceg commented Sep 30, 2024

Product Description

Technical Summary

The current allowed delay of 1 hour is too lenient and leading to performance issues on elasticsearch.

Our current configuration is set to refresh an index every 5 seconds. As far as I know, there isn't a reported bug in Elasticsearch v5.6 around the refresh interval. It is the case that it takes time to refresh the index, which appears to be roughly ~10 seconds, so refreshes are occurring every 15 seconds in steady state. As documented by Elasticsearch, it is also the case that:

By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds

So my math is 5s refresh interval + 10s avg refresh time + 30s grace period = 45 seconds, and rounding up to a minute seems fine. This should ensure that if the case update is not reflected in ES when the dedupe processor runs, it will definitely trigger a refresh for the index since it performed a search request, and be requeued/resaved to try again. Once it is retried, if it is still not showing up in elasticsearch, I'm inclined to say the reason is unrelated to a stale index in elasticsearch until we have proof otherwise.

Feature Flag

Safety Assurance

Safety story

This change only impacts the deduplication feature. In the worst case, it will lead to fewer duplicates being flagged, but my understanding is that it is already the case that we were still reaching this code path very frequently even with the 1 hour grace period set. If we want to be sure that this change does not have a drastic impact, perhaps the best way is to add a metric in this code path that reports the count for how often a matching case is not found in ES and the grace period has been exceeded, and see how much this change impacts that metric. @mjriley would that make you feel better about this change? PR here

Automated test coverage

QA Plan

No

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

At most, the time it takes for a case update to be reflected in
elasticsearch should be roughly the refresh interval + refresh time and
an additional buffer to ensure the index has been read from to trigger a
refresh.
@@ -1166,7 +1166,8 @@ def _handle_case_duplicate(self, case, rule):
dedupe_load_counter('unknown', case.domain)()

if not case_matching_rule_criteria_exists_in_es(case, rule):
ALLOWED_ES_DELAY = timedelta(hours=1)
# refresh interval + avg time to refresh + extra buffer = 1 minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that 1 hour is to accomodate for the pillow lag and not the elasticsearch delay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I don't think it was initially set to 1 hour to accommodate that specifically (Matt can correct me if I'm wrong), but it is true that it helps with that issue. Here's how I think about it.

The case to es processor and dedupe processor both run within the case pillow, so the lag we actually care about is between when the case to es processor ran and when the dedupe processor ran. It seems more than likely that when the dedupe processor runs, the updated case will not be available in ES, so we have to resave it. The problem is that we are determining lag based on the server_modified_on property, which pillow lag is going to impact.

To your point, this change is too extreme in its current form because as soon as pillow lag is above 1 minute, given that it seems likely the dedupe processor won't be able to find the case in ES on its first attempt, we won't retry it since enough time has past since the initial server_modified_on time.

@AmitPhulera
Copy link
Contributor

which appears to be roughly ~10 seconds

Curious where you found out this metric from?

@gherceg
Copy link
Contributor Author

gherceg commented Oct 1, 2024

Eh I think I'm misinterpreting ES metrics actually. ES reports the amount of time spent refreshing, which you can see as a cumulative metric since the process started running, or as a count which shows something like "each second, here is how much time ES has spent refreshing over the last second?", which would make sense in the context of multiple shards/nodes. However I don't think that means a refresh on any particular index takes ~10 seconds so ignore me. It is the case that every ~15 seconds, there is a spike in the time spent refreshing that I don't have a great grasp on.

@gherceg
Copy link
Contributor Author

gherceg commented Oct 1, 2024

Closing as this isn't the right approach, but has sparked some offline conversations to find another way to resolve this issue.

@gherceg gherceg closed this Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants