-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added SVE implementation to improve the performance on ARM architecture #10680
base: master
Are you sure you want to change the base?
Conversation
Thank you for the PR! I'm not an expert in SIMDs, is it guaranteed to have aligned pointers and padded memory allocation for the intrinsics? |
60554bc
to
5194c17
Compare
Hi @trivialfis |
Thank you for the detailed info. Could you please help explain why it works without the use of specialized allocators like https://en.cppreference.com/w/c/memory/aligned_alloc ? It's important for us to know the logic for future maintenance. |
Specialized allocators like aligned_alloc() doesn't help with SVE intrinsics because:
|
Hi @trivialfis, |
Hi @trivialfis, As @divya2108 mentioned, SVE has predication support. These lines create masks which limit the load/stores from going out of bounds: xgboost/src/common/hist_util.cc Lines 265 to 266 in 5194c17
SVE is also happy to do element-aligned loads and stores rather than full vectors. |
Thank you for the explanation! I will take a deeper look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Started looking into this PR today. Thank you for working on using the arm intrinsic, but could you please add detailed code comments and extract the code into an independent section (like a function that can be inlined)? Most people here (me included) have a background closer to data science instead of low-level programming.
CMakeLists.txt
Outdated
@@ -265,6 +265,51 @@ if(${CMAKE_SYSTEM_NAME} MATCHES "OS400") | |||
set(CMAKE_CXX_ARCHIVE_CREATE "<CMAKE_AR> -X64 qc <TARGET> <OBJECTS>") | |||
endif() | |||
|
|||
if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please extract this into a module similar to cmake/PrefetchIntrinsics.cmake
?
CMakeLists.txt
Outdated
if(RUN_RESULT EQUAL 0) | ||
message(STATUS "ARM SVE hardware support detected") | ||
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=armv8-a+sve") | ||
string(APPEND CMAKE_CXX_FLAGS " -DSVE_SUPPORT_DETECTED") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please prefix the flag with XGBOOST_
and use targeted flags instead of the CMAKE_CXX_FLAGS
.
src/common/hist_util.cc
Outdated
svfloat64_t pgh_t0_vec = svdup_n_f64(pgh_t[0]); | ||
svfloat64_t pgh_t1_vec = svdup_n_f64(pgh_t[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you don't need the pgh_t
in the SVE code section. Could you please use p_gpair
and have the names of the loaded vectors, such as svfloat64_t grad, svfloat64_t hess
, for readability?
Hi @trivialfis, Thank you for suggesting the appropriate changes. I have made the modifications as recommended. Could you please review the updated changes? |
The CMake logic looks right. It only compiles SVE code when the compiler supports it and during the runtime it triggers the SVE code only when the hardware supports SVE (I see there's a runtime HW check for SVE ISA). Changes LGTM. |
Can you point me to where the runtime check happens? As far as I can tell, this only works if the build environment supports compiling and running SVE. The new path is conditionally compiled with This makes me think this is only for users compiling from sources on a specific piece of hardware. If we wanted this to work in the generically distributed wheel, we'd have to do the SVE runtime check instead of the Correct me if I'm wrong 😸 |
I agree with you. I found the HW detection logic here: https://github.com/dmlc/xgboost/pull/10680/files#diff-5650b69c609ef22dea88915eb256a6838341248d3ddfd17430388f7f7e58c4feR24 But this is just for compile time. I think similar logic need to be used during runtime and a runtime check is required. Since there's already a working SVE HW detection logic, should be easy to reintroduce it in the source code file. CC @divya2108 |
Is SVE guaranteed to be available for ARM implementation? |
- Changed cmake design by extracting the code into cmake/CheckSVEsupport.cmake - Prefixed the flags with XGBOOST_ and used targeted flags - Extracted the SVE code into an inlined function - Added detailed code comments - Modified vector names for better readability
Signed-off-by: divya2108 <[email protected]>
dca00be
to
f5edc42
Compare
No, SVE is not guaranteed to be available on all ARM implementations. While ARMv8-A architecture, which includes SVE support, is present in newer processors like Graviton3, Graviton4, Grace, it is not mandatory for all ARM CPUs to implement SVE. The code in hist_util.cc checks for SVE support at runtime to ensure that the target hardware supports it & runs the default code otherwise. |
Yes, I agree. Thank you for bringing this to notice. I have verified this by building the code on different architectures. Here is a summary for more clarity: |
@trivialfis Thanks for the initial review and your comments. Can you please suggest any additional feedback which might need further clarification/evaluation from our side or any improvements to incorporate in the proposed implementation. Thanks. cc: @divya2108 |
Sorry for the slow reply, got stuck at some other work lately. One question, is it possible to reduce the call frequency of |
Yes, it's possible to reduce the frequency of calls to check_sve_hw_support by implementing a caching mechanism that checks the SVE hardware support status only once at the beginning of a training session. I have stored the result and it is being reused throughout the session. |
78302b5
to
5fa3712
Compare
Motivation: This pull request aims to improve the performance of training algorithm of XGBoost on ARM architecture by leveraging SVE intrinsics.
Brief description: