
Re-implement RotatingScope with thread local storage #5573

Open · wants to merge 2 commits into base: master
Conversation

wangjian-pg

What this PR does / why we need it:

The current implementation of metric rotation relies on timing alone to ensure that the scope pointer used by the worker threads is still valid, which, as far as I can tell, is neither good practice nor sound design.

After digging into Envoy's implementation of ThreadLocalStoreImpl (which faces a similar problem: how to propagate scoped metrics to worker threads and expire them when the scope is released), I found that storing the active scope reference in thread-local storage and replacing it with a new one when rotation occurs is a better design. With this approach we eliminate raw pointers entirely, making rotation behavior more predictable while remaining lock-free.

I re-implemented the rotation as described and tested the code in my environment; it works as expected. Please take a look. I'm looking forward to any feedback.

@wangjian-pg wangjian-pg requested a review from a team as a code owner May 23, 2024 13:35
@istio-policy-bot

😊 Welcome @wangjian-pg! This is either your first contribution to the Istio proxy repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.


linux-foundation-easycla bot commented May 23, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 23, 2024
@istio-testing
Collaborator

Hi @wangjian-pg. Thanks for your PR.

I'm waiting for an Istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@lei-tang
Contributor

/ok-to-test

@istio-testing istio-testing added ok-to-test Set this label to allow normal testing to take place for a PR not submitted by an Istio org member. and removed needs-ok-to-test labels May 23, 2024
@wangjian-pg
Author

@lei-tang I have formatted the changes with clang-lint, and all tests have passed. Could you please review the code?

Contributor

@lei-tang lei-tang left a comment


@wangjian-pg Thank you for your PR!

Can you provide a concrete example to describe the problems of current implementation?

@wangjian-pg
Author

@lei-tang The current implementation generally works well under normal circumstances.

However, consider a scenario where thousands of processes are running on a host with limited resources, say only two CPU cores. The operating system may schedule the worker thread off its core while it is allocating a new metric through the currently active scope pointer inside Stats::Utility::counterFromStatName, while the main thread keeps running on the other core. On a heavily loaded system, the worker thread may not get any CPU time for a prolonged period (more than one second). When it is finally scheduled back and resumes execution inside Stats::Utility::counterFromStatName, the scope pointer may already have expired and been released by the main thread, leaving the worker dereferencing a dangling pointer.

The real problem is that it relies on timing to ensure that the scope pointer used by the worker threads is valid and lacks any other synchronization mechanism. This behavior is non-deterministic in a time-sharing operating system.

By the way, as far as I know, using std::atomic may introduce a memory barrier and could potentially lead to performance degradation. However, I haven't yet benchmarked this and it shouldn't be a big deal.

@jezhang2014

If the http.stats plugin is used as the statistics plugin, is it possible that all of the historical statistical data is cleared when the onDelete callback is triggered? If so, this could cause a sudden discontinuity in the monitoring system's data and potentially trigger alerts.

To avoid this issue, it would be better to only clear the historical statistical data for targets (e.g., Pods) that have not received requests for a long time, while leaving the statistical data for targets that are still actively receiving requests unaffected. This would help maintain the continuity of the statistical data and prevent unnecessary alerts.

@wangjian-pg
Author

@jezhang2014
I believe it's not a problem of dropping historical stats data. Whether the envoy proxy is deployed as a sidecar or a gateway, it can be terminated at any time for reasons like upgrades or scale-down. This is common in dynamic cloud environments, such as with k8s HPA. The in-memory stats data would be lost whenever the envoy proxy is terminated or restarted. This shouldn't be a problem, and it's the responsibility of the monitoring system to handle this situation.

AFAIK, the widely adopted monitoring system Prometheus has a built-in function to detect stats data reset. You may find more information here: https://prometheus.io/docs/prometheus/latest/querying/functions/#resets

@wangjian-pg wangjian-pg requested a review from lei-tang May 28, 2024 02:32
@wangjian-pg
Author

wangjian-pg commented May 29, 2024

@kyessenov Could you please take a look at this PR?

@zirain zirain requested a review from kyessenov June 1, 2024 13:39
@wangjian-pg
Author

@kyessenov Could you please take a look at this PR?

@kyessenov kyessenov self-assigned this Jul 10, 2024
@istio-policy-bot istio-policy-bot added the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Jul 10, 2024
@zirain zirain reopened this Jul 11, 2024
@zirain zirain removed the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Jul 11, 2024
@zirain
Member

zirain commented Jul 11, 2024

not stale

@istio-testing
Collaborator

PR needs rebase.


@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Aug 27, 2024
Labels
needs-rebase Indicates a PR needs to be rebased before being merged · ok-to-test Set this label to allow normal testing to take place for a PR not submitted by an Istio org member. · size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants