Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: gpu cleanup #1704

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

maryamtahhan
Copy link
Collaborator

@maryamtahhan maryamtahhan commented Aug 19, 2024

  • remove init() functions in gpu devices - change to dynamic device registration.
  • remove custom gpu tags from the Makefile with the exception of habana (libhlml.go cgo implementation in vendor directory is problematic) so that all device implementations are built in.
  • add missing copyright tags
  • cleanup dcgm, nvml, and habana globals

Copy link

github-actions bot commented Aug 19, 2024

🤖 SeineSailor

Here is a concise summary of the pull request changes:

Summary: The "chore: gpu cleanup" pull request refactors and cleans up GPU-related code, centralizing GPU configuration and updating accelerator interfaces. Key changes include:

  • Modified init() functions in GPU devices
  • Removed custom GPU flags
  • Updated GetActiveAcceleratorByType function calls
  • Centralized GPU configuration in main() function
  • Refactored AcceleratorType constant and related functions
  • Changed Registry type to use string keys
  • Updated Device() method in Accelerator interface
  • Changed New() function to accept a string type for the accelerator
  • Added habanaCheck function for Habana GPU initialization error checking

Impact: These internal cleanups do not affect the external interface or behavior of the code. However, developers should be aware of the updated import path and altered function signatures, which may impact dependent code.

Observations/Suggestions:

  • The changes seem to improve code organization and consistency, making it easier to maintain and extend GPU-related functionality.
  • It would be beneficial to include additional tests to ensure the refactored code behaves as expected, especially for the updated GetActiveAcceleratorByType function calls.
  • Consider documenting the changes and their rationale to facilitate knowledge transfer and future maintenance.

@maryamtahhan maryamtahhan force-pushed the cleanup-gpu branch 3 times, most recently from 7989ce0 to 2e14406 Compare August 19, 2024 13:16
@maryamtahhan maryamtahhan marked this pull request as ready for review August 19, 2024 13:27
ifeq ($(shell ldconfig -p | grep -q libnvml_injection.so && echo exists),exists)
GPU_TAGS := nvml
endif
ifeq ($(shell ldconfig -p | grep -q libdcgm.so && echo exists),exists)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rootfs is it ok to remove the dcgm library check?

Copy link
Collaborator Author

@maryamtahhan maryamtahhan Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is removing the GPU_TAGS := dcgm meaning the dcgm and NVML GPU code will be built into Kepler by default. The device plugins/gpu code checks at runtime for the libraries and don't run if the library doesn't exist.

The only reason I kept the HABANA tag is because there are issues building habana if the library doesn't exist on the system because of their cgo dependency

@maryamtahhan
Copy link
Collaborator Author

will rebase once #1788 is merged

Cleanup the init() calls in the gpu code. Remove
gpu tags from the Makefile with the exception of
habana (as the libhlml.go cgo implementation in
vendor directory is problematic if libhlml.so isn't
available).

Signed-off-by: Maryam Tahhan <[email protected]>
@maryamtahhan maryamtahhan force-pushed the cleanup-gpu branch 2 times, most recently from 0b12e73 to 330fe0f Compare September 30, 2024 12:45
@maryamtahhan
Copy link
Collaborator Author

Tested with Kubeadm :

 kubectl logs -n kepler kepler-exporter-wm9sd
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I0930 13:33:58.022593  779728 exporter.go:103] Kepler running on version: 0b12e73b-dirty
I0930 13:33:58.022740  779728 config.go:293] using gCgroup ID in the BPF program: true
I0930 13:33:58.022774  779728 config.go:295] kernel version: 5.14
I0930 13:33:58.022938  779728 config.go:322] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0930 13:33:58.022952  779728 config.go:256] ENABLE_EBPF_CGROUPID: true
I0930 13:33:58.022959  779728 config.go:257] ENABLE_GPU: true
I0930 13:33:58.022964  779728 config.go:258] ENABLE_PROCESS_METRICS: false
I0930 13:33:58.022970  779728 config.go:259] EXPOSE_HW_COUNTER_METRICS: true
I0930 13:33:58.022975  779728 config.go:260] EXPOSE_IRQ_COUNTER_METRICS: true
I0930 13:33:58.022981  779728 config.go:261] EXPOSE_BPF_METRICS: true
I0930 13:33:58.022988  779728 config.go:262] EXPOSE_COMPONENT_POWER: true
I0930 13:33:58.022994  779728 config.go:263] EXPOSE_ESTIMATED_IDLE_POWER_METRICS: true. This only impacts when the power is estimated using pre-prained models. Estimated idle power is meaningful only when Kepler is running on bare-metal or with a single virtual machine (VM) on the node.
I0930 13:33:58.023002  779728 config.go:264] EXPERIMENTAL_BPF_SAMPLE_RATE: 0
I0930 13:33:58.023046  779728 power.go:59] use sysfs to obtain power
I0930 13:33:58.023119  779728 csv_cred.go:68] failed to read csv file: node name u37-h31-000-r7425.rdu3.labs.perfscale.redhat.com not found in file /etc/redfish/redfish.csv
I0930 13:33:58.023130  779728 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I0930 13:33:58.025843  779728 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0930 13:33:58.025851  779728 power.go:79] using none to obtain power
I0930 13:33:58.029368  779728 dcgm.go:66] Initializing dcgm Successful
E0930 13:33:58.029379  779728 device.go:135] Device with type NVML doesn't exist
I0930 13:33:58.029390  779728 device.go:163] Try to Register DCGM
I0930 13:33:58.029395  779728 device.go:121] Adding the device to the registry [gpu][DCGM]
I0930 13:33:58.029402  779728 device.go:170] Registered DCGM
I0930 13:33:58.029411  779728 dcgm.go:69] Using DCGM to obtain processor power
I0930 13:33:58.029415  779728 fakehabana.go:24] Error initializing habana: ERROR_LIBRARY_NOT_FOUND
I0930 13:33:58.033739  779728 nvml.go:50] Initializing nvml Successful
I0930 13:33:58.033779  779728 nvml.go:55] Error registering nvml: DCGM already registered. Skipping NVML
I0930 13:33:58.033792  779728 accelerator.go:139] Initializing the Accelerator of type gpu
I0930 13:33:58.033796  779728 device.go:183] Starting up DCGM
I0930 13:33:58.046281  779728 dcgm.go:455] Created device group "dev-grp-2024-09-30-13-33-58"
I0930 13:33:58.092921  779728 dcgm.go:123] DCGM initialized successfully
I0930 13:33:58.092944  779728 dcgm.go:88] Using DCGM to obtain gpu power
I0930 13:33:58.092952  779728 accelerator.go:151] Startup gpu Accelerator successful

@maryamtahhan
Copy link
Collaborator Author

Maybe one improvement would be to not do init twice (once as a registration test then again for the real initialisation), I'm open to suggestion.

Signed-off-by: Maryam Tahhan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants