Decrease allowed elasticsearch delay to refresh interval in dedupe rule run #35154

gherceg · 2024-09-30T12:04:25Z

Product Description

Technical Summary

The current allowed delay of 1 hour is too lenient and is leading to performance issues on Elasticsearch.

Our current configuration refreshes an index every 5 seconds. As far as I know, there isn't a reported bug in Elasticsearch v5.6 around the refresh interval. Refreshing the index does take time, which appears to be roughly 10 seconds, so refreshes occur about every 15 seconds in steady state. As documented by Elasticsearch, it is also the case that:

By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds

So my math is 5s refresh interval + 10s avg refresh time + 30s grace period = 45 seconds, and rounding up to a minute seems fine. This should ensure that if the case update is not reflected in ES when the dedupe processor runs, it will definitely trigger a refresh for the index since it performed a search request, and be requeued/resaved to try again. Once it is retried, if it is still not showing up in elasticsearch, I'm inclined to say the reason is unrelated to a stale index in elasticsearch until we have proof otherwise.
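For reference, the arithmetic above can be written out as a small sketch. The constant and function names here are illustrative only and are not part of the codebase, and the 10 second refresh time is the estimate quoted above rather than a measured value.

```python
from datetime import timedelta

# Figures quoted in the description above (names are hypothetical).
REFRESH_INTERVAL = timedelta(seconds=5)
AVG_REFRESH_TIME = timedelta(seconds=10)   # estimate, not a measured value
SEARCH_IDLE_WINDOW = timedelta(seconds=30)


def proposed_allowed_delay():
    """Return (raw total, total rounded up to whole minutes)."""
    total = REFRESH_INTERVAL + AVG_REFRESH_TIME + SEARCH_IDLE_WINDOW
    minute = timedelta(minutes=1)
    # Ceiling division: timedelta // timedelta yields an int.
    whole_minutes = -(-total // minute)
    return total, whole_minutes * minute


total, rounded = proposed_allowed_delay()
print(total, rounded)
```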

Feature Flag

Safety Assurance

Safety story

This change only impacts the deduplication feature. In the worst case, it will lead to fewer duplicates being flagged, but my understanding is that it is already the case that we were still reaching this code path very frequently even with the 1 hour grace period set. If we want to be sure that this change does not have a drastic impact, perhaps the best way is to add a metric in this code path that reports the count for how often a matching case is not found in ES and the grace period has been exceeded, and see how much this change impacts that metric. @mjriley would that make you feel better about this change? PR here
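A rough sketch of the kind of counter described above, assuming a plain in-memory `Counter` as a stand-in for a real metrics client. The metric name, function name, and arguments are all hypothetical and do not reflect what the linked PR implements.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for a real metrics client.
metrics = Counter()


def record_missing_case(domain, server_modified_on, now, allowed_delay):
    """Count a case only when it is missing from ES and the grace period has been exceeded.

    Assumes the caller has already established that no matching case was found in ES.
    Returns True when the grace period was exceeded.
    """
    exceeded = now - server_modified_on > allowed_delay
    if exceeded:
        metrics[("dedupe.no_matching_case", domain)] += 1
    return exceeded


modified = datetime(2024, 9, 30, 12, 0, 0)
record_missing_case("demo", modified, modified + timedelta(minutes=5), timedelta(minutes=1))
record_missing_case("demo", modified, modified + timedelta(seconds=20), timedelta(minutes=1))
print(metrics)
```

Comparing this count before and after changing the allowed delay would show how often the shorter grace period changes the outcome.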

Automated test coverage

QA Plan

No

Rollback instructions

This PR can be reverted after deploy with no further considerations

Labels & Review

Risk label is set correctly
The set of people pinged as reviewers is appropriate for the level of risk of the change

At most, the time it takes for a case update to be reflected in elasticsearch should be roughly the refresh interval + refresh time and an additional buffer to ensure the index has been read from to trigger a refresh.

AmitPhulera · 2024-10-01T08:08:59Z

corehq/apps/data_interfaces/models.py

@@ -1166,7 +1166,8 @@ def _handle_case_duplicate(self, case, rule):
        dedupe_load_counter('unknown', case.domain)()

        if not case_matching_rule_criteria_exists_in_es(case, rule):
-            ALLOWED_ES_DELAY = timedelta(hours=1)
+            # refresh interval + avg time to refresh + extra buffer = 1 minute


I thought the 1 hour was meant to accommodate pillow lag, not the Elasticsearch delay?

Ah yeah I don't think it was initially set to 1 hour to accommodate that specifically (Matt can correct me if I'm wrong), but it is true that it helps with that issue. Here's how I think about it.

The case to es processor and dedupe processor both run within the case pillow, so the lag we actually care about is between when the case to es processor ran and when the dedupe processor ran. It seems more than likely that when the dedupe processor runs, the updated case will not be available in ES, so we have to resave it. The problem is that we are determining lag based on the server_modified_on property, which pillow lag is going to impact.

To your point, this change is too extreme in its current form. As soon as pillow lag is above 1 minute, and given that the dedupe processor likely won't find the case in ES on its first attempt, we won't retry it, because too much time has already passed since the initial server_modified_on time.
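To make the interaction concrete, here is a minimal sketch of a grace-period check based on server_modified_on. The `should_retry` helper is hypothetical and only approximates the comparison in `_handle_case_duplicate`; the 2 minute pillow lag is an assumed example value.

```python
from datetime import datetime, timedelta


def should_retry(server_modified_on, now, allowed_delay):
    """Hypothetical stand-in: retry only while the case's age is within allowed_delay."""
    return now - server_modified_on <= allowed_delay


modified = datetime(2024, 10, 1, 12, 0, 0)
pillow_lag = timedelta(minutes=2)        # assumed lag before the pillow processes the change
first_attempt = modified + pillow_lag    # when the dedupe processor first sees the case

# With a 1 minute grace period, the first attempt already falls outside the window.
print(should_retry(modified, first_attempt, timedelta(minutes=1)))
# With the existing 1 hour grace period, the case would still be resaved and retried.
print(should_retry(modified, first_attempt, timedelta(hours=1)))
```

Because the age is measured from server_modified_on rather than from when the case to ES processor ran, any pillow lag is counted against the grace period.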

AmitPhulera · 2024-10-01T08:09:16Z

which appears to be roughly ~10 seconds

Curious where you found this metric?

gherceg · 2024-10-01T10:28:47Z

Eh, I think I'm misinterpreting the ES metrics, actually. ES reports the amount of time spent refreshing, which you can view either as a cumulative metric since the process started running, or as a count which shows something like "each second, here is how much time ES has spent refreshing over the last second", which would make sense in the context of multiple shards/nodes. However, I don't think that means a refresh on any particular index takes ~10 seconds, so ignore me. It is the case that every ~15 seconds there is a spike in the time spent refreshing that I don't have a great grasp on.
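As an illustration of the two views of the metric, a cumulative counter can be turned into per-interval values by differencing consecutive samples. The sample numbers below are made up, and the function name is hypothetical.

```python
def per_interval_deltas(cumulative_samples):
    """Convert a monotonically increasing counter into per-interval increments."""
    return [
        later - earlier
        for earlier, later in zip(cumulative_samples, cumulative_samples[1:])
    ]


# Made-up cumulative refresh time in milliseconds, sampled at a fixed interval
# and summed across all shards and nodes.
samples = [1000, 1200, 1250, 11250]
print(per_interval_deltas(samples))
```

A large increment in one interval is the total across every shard that refreshed in that window, so it does not imply that any single index took that long to refresh.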

gherceg · 2024-10-01T12:54:39Z

Closing as this isn't the right approach, but has sparked some offline conversations to find another way to resolve this issue.

Decrease allowed elasticsearch delay in dedupe

42143e2


gherceg requested review from mjriley and AmitPhulera September 30, 2024 12:04

gherceg mentioned this pull request Sep 30, 2024

Add metric for no matching case in dedupe #35155

Merged

3 tasks

AmitPhulera reviewed Oct 1, 2024

gherceg closed this Oct 1, 2024
