
Re-implement RotatingScope with thread local storage #5573

Open · wants to merge 2 commits into base: master
Conversation

wangjian-pg

What this PR does / why we need it:

The current implementation of metric rotation relies on timing alone to ensure that the scope pointer used by the worker threads is still valid, which, as far as I can tell, is neither good practice nor sound design.

After digging into Envoy's implementation of ThreadLocalStoreImpl (which faces a similar problem: how to propagate scoped metrics to worker threads and expire them when the scope is released), I found that storing the active scope reference in thread-local storage and replacing it with a new one when rotation occurs is a better design. With this approach we eliminate raw pointers entirely, making rotation behavior more predictable while remaining lock-free.

I re-implemented the rotation as described and tested the code in my environment; it works as expected. Please take a look. I'm looking forward to any feedback.

@wangjian-pg wangjian-pg requested a review from a team as a code owner May 23, 2024 13:35
@istio-policy-bot

😊 Welcome @wangjian-pg! This is either your first contribution to the Istio proxy repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.


linux-foundation-easycla bot commented May 23, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 23, 2024
@istio-testing
Collaborator

Hi @wangjian-pg. Thanks for your PR.

I'm waiting for an Istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@lei-tang
Contributor

/ok-to-test

@istio-testing istio-testing added ok-to-test Set this label to allow normal testing to take place for a PR not submitted by an Istio org member. and removed needs-ok-to-test labels May 23, 2024
@wangjian-pg
Author

@lei-tang I have formatted the changes with clang-lint, and all tests have passed. Could you please review the code?

Contributor

@lei-tang lei-tang left a comment


@wangjian-pg Thank you for your PR!

Can you provide a concrete example to describe the problems of current implementation?

@wangjian-pg
Author

@lei-tang The current implementation generally works well under normal circumstances.

However, consider a scenario where thousands of processes are running on a host with limited resources, say only two CPU cores. The operating system may schedule the worker thread off its core while it is allocating a new metric through the currently active scope pointer inside Stats::Utility::counterFromStatName, while the main thread keeps running on the other core. On a heavily loaded system, the worker thread may not get any CPU time for a prolonged period (more than one second). When it is finally scheduled back and resumes execution inside Stats::Utility::counterFromStatName, the scope pointer may already have expired and been released by the main thread, leaving the worker dereferencing a dangling pointer.

The real problem is that it relies on timing to ensure that the scope pointer used by the worker threads is valid and lacks any other synchronization mechanism. This behavior is non-deterministic in a time-sharing operating system.

By the way, as far as I know, using std::atomic may introduce a memory barrier and could potentially lead to performance degradation. However, I haven't yet benchmarked this and it shouldn't be a big deal.

@jezhang2014

If the http.stats plugin is used as the statistics plugin, is it possible that all of the historical statistical data is cleared when the onDelete callback is triggered? If so, this could cause a sudden discontinuity in the monitoring system's data and potentially trigger alerts.

To avoid this issue, it would be better to only clear the historical statistical data for targets (e.g., Pods) that have not received requests for a long time, while leaving the statistical data for targets that are still actively receiving requests unaffected. This would help maintain the continuity of the statistical data and prevent unnecessary alerts.

@wangjian-pg
Author

@jezhang2014
I believe it's not a problem of dropping historical stats data. Whether the envoy proxy is deployed as a sidecar or a gateway, it can be terminated at any time for reasons like upgrades or scale-down. This is common in dynamic cloud environments, such as with k8s HPA. The in-memory stats data would be lost whenever the envoy proxy is terminated or restarted. This shouldn't be a problem, and it's the responsibility of the monitoring system to handle this situation.

AFAIK, the widely adopted monitoring system Prometheus has a built-in function to detect stats data reset. You may find more information here: https://prometheus.io/docs/prometheus/latest/querying/functions/#resets

@wangjian-pg wangjian-pg requested a review from lei-tang May 28, 2024 02:32
@wangjian-pg
Author

wangjian-pg commented May 29, 2024

@kyessenov Could you please take a look at this PR?

@zirain zirain requested a review from kyessenov June 1, 2024 13:39
@wangjian-pg
Author

@kyessenov Could you please take a look at this PR?

@kyessenov kyessenov self-assigned this Jul 10, 2024
@istio-policy-bot istio-policy-bot added the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Jul 10, 2024
@zirain zirain reopened this Jul 11, 2024
@zirain zirain removed the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Jul 11, 2024
@zirain
Member

zirain commented Jul 11, 2024

not stale

@istio-testing
Collaborator

PR needs rebase.


@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Aug 27, 2024
Labels
needs-rebase Indicates a PR needs to be rebased before being merged · ok-to-test Set this label to allow normal testing to take place for a PR not submitted by an Istio org member. · size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants