chore: gpu cleanup #1704

maryamtahhan · 2024-08-19T12:32:31Z

remove init() functions in gpu devices - change to dynamic device registration.
remove custom gpu tags from the Makefile with the exception of habana (libhlml.go cgo implementation in vendor directory is problematic) so that all device implementations are built in.
add missing copyright tags
cleanup dcgm, nvml, and habana globals

github-actions · 2024-08-19T12:33:11Z

🤖 SeineSailor

Here is a concise summary of the pull request changes:

Summary: The "chore: gpu cleanup" pull request refactors and cleans up GPU-related code, centralizing GPU configuration and updating accelerator interfaces. Key changes include:

Modified init() functions in GPU devices
Removed custom GPU flags
Updated GetActiveAcceleratorByType function calls
Centralized GPU configuration in main() function
Refactored AcceleratorType constant and related functions
Changed Registry type to use string keys
Updated Device() method in Accelerator interface
Changed New() function to accept a string type for the accelerator
Added habanaCheck function for Habana GPU initialization error checking

Impact: These internal cleanups do not affect the external interface or behavior of the code. However, developers should be aware of the updated import path and altered function signatures, which may impact dependent code.

Observations/Suggestions:

The changes seem to improve code organization and consistency, making it easier to maintain and extend GPU-related functionality.
It would be beneficial to include additional tests to ensure the refactored code behaves as expected, especially for the updated GetActiveAcceleratorByType function calls.
Consider documenting the changes and their rationale to facilitate knowledge transfer and future maintenance.

pkg/sensors/accelerator/devices/device.go

pkg/sensors/accelerator/devices/mock_device.go

marceloamaral · 2024-09-12T14:05:44Z

Makefile

-ifeq ($(shell ldconfig -p | grep -q libnvml_injection.so && echo exists),exists)
-	GPU_TAGS := nvml
-endif
-ifeq ($(shell ldconfig -p | grep -q libdcgm.so && echo exists),exists)


@rootfs is it ok to remove the dcgm library check?

this is removing the GPU_TAGS := dcgm meaning the dcgm and NVML GPU code will be built into Kepler by default. The device plugins/gpu code checks at runtime for the libraries and don't run if the library doesn't exist.

The only reason I kept the HABANA tag is because there are issues building habana if the library doesn't exist on the system because of their cgo dependency

maryamtahhan · 2024-09-27T15:34:01Z

will rebase once #1788 is merged

Cleanup the init() calls in the gpu code. Remove gpu tags from the Makefile with the exception of habana (as the libhlml.go cgo implementation in vendor directory is problematic if libhlml.so isn't available). Signed-off-by: Maryam Tahhan <[email protected]>

Signed-off-by: Maryam Tahhan <[email protected]>

maryamtahhan · 2024-09-30T12:48:07Z

Tested with Kubeadm :

 kubectl logs -n kepler kepler-exporter-wm9sd
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I0930 13:33:58.022593  779728 exporter.go:103] Kepler running on version: 0b12e73b-dirty
I0930 13:33:58.022740  779728 config.go:293] using gCgroup ID in the BPF program: true
I0930 13:33:58.022774  779728 config.go:295] kernel version: 5.14
I0930 13:33:58.022938  779728 config.go:322] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0930 13:33:58.022952  779728 config.go:256] ENABLE_EBPF_CGROUPID: true
I0930 13:33:58.022959  779728 config.go:257] ENABLE_GPU: true
I0930 13:33:58.022964  779728 config.go:258] ENABLE_PROCESS_METRICS: false
I0930 13:33:58.022970  779728 config.go:259] EXPOSE_HW_COUNTER_METRICS: true
I0930 13:33:58.022975  779728 config.go:260] EXPOSE_IRQ_COUNTER_METRICS: true
I0930 13:33:58.022981  779728 config.go:261] EXPOSE_BPF_METRICS: true
I0930 13:33:58.022988  779728 config.go:262] EXPOSE_COMPONENT_POWER: true
I0930 13:33:58.022994  779728 config.go:263] EXPOSE_ESTIMATED_IDLE_POWER_METRICS: true. This only impacts when the power is estimated using pre-prained models. Estimated idle power is meaningful only when Kepler is running on bare-metal or with a single virtual machine (VM) on the node.
I0930 13:33:58.023002  779728 config.go:264] EXPERIMENTAL_BPF_SAMPLE_RATE: 0
I0930 13:33:58.023046  779728 power.go:59] use sysfs to obtain power
I0930 13:33:58.023119  779728 csv_cred.go:68] failed to read csv file: node name u37-h31-000-r7425.rdu3.labs.perfscale.redhat.com not found in file /etc/redfish/redfish.csv
I0930 13:33:58.023130  779728 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I0930 13:33:58.025843  779728 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0930 13:33:58.025851  779728 power.go:79] using none to obtain power
I0930 13:33:58.029368  779728 dcgm.go:66] Initializing dcgm Successful
E0930 13:33:58.029379  779728 device.go:135] Device with type NVML doesn't exist
I0930 13:33:58.029390  779728 device.go:163] Try to Register DCGM
I0930 13:33:58.029395  779728 device.go:121] Adding the device to the registry [gpu][DCGM]
I0930 13:33:58.029402  779728 device.go:170] Registered DCGM
I0930 13:33:58.029411  779728 dcgm.go:69] Using DCGM to obtain processor power
I0930 13:33:58.029415  779728 fakehabana.go:24] Error initializing habana: ERROR_LIBRARY_NOT_FOUND
I0930 13:33:58.033739  779728 nvml.go:50] Initializing nvml Successful
I0930 13:33:58.033779  779728 nvml.go:55] Error registering nvml: DCGM already registered. Skipping NVML
I0930 13:33:58.033792  779728 accelerator.go:139] Initializing the Accelerator of type gpu
I0930 13:33:58.033796  779728 device.go:183] Starting up DCGM
I0930 13:33:58.046281  779728 dcgm.go:455] Created device group "dev-grp-2024-09-30-13-33-58"
I0930 13:33:58.092921  779728 dcgm.go:123] DCGM initialized successfully
I0930 13:33:58.092944  779728 dcgm.go:88] Using DCGM to obtain gpu power
I0930 13:33:58.092952  779728 accelerator.go:151] Startup gpu Accelerator successful

maryamtahhan · 2024-09-30T12:55:34Z

Maybe one improvement would be to not do init twice (once as a registration test then again for the real initialisation), I'm open to suggestion.

Signed-off-by: Maryam Tahhan <[email protected]>

maryamtahhan force-pushed the cleanup-gpu branch 3 times, most recently from 7989ce0 to 2e14406 Compare August 19, 2024 13:16

maryamtahhan commented Aug 19, 2024

View reviewed changes

pkg/sensors/accelerator/devices/device.go Outdated Show resolved Hide resolved

maryamtahhan commented Aug 19, 2024

View reviewed changes

pkg/sensors/accelerator/devices/mock_device.go Outdated Show resolved Hide resolved

maryamtahhan marked this pull request as ready for review August 19, 2024 13:27

rootfs requested a review from marceloamaral September 12, 2024 13:59

marceloamaral reviewed Sep 12, 2024

View reviewed changes

maryamtahhan mentioned this pull request Sep 16, 2024

fixes: aa66ada adding the needed libraries to the builder stage #1785

Merged

maryamtahhan force-pushed the cleanup-gpu branch from bba5100 to 448635e Compare September 30, 2024 11:59

maryamtahhan added 2 commits September 30, 2024 08:06

chore: gpu cleanup

28130c3

Cleanup the init() calls in the gpu code. Remove gpu tags from the Makefile with the exception of habana (as the libhlml.go cgo implementation in vendor directory is problematic if libhlml.so isn't available). Signed-off-by: Maryam Tahhan <[email protected]>

chore: add missing copyright tags to gpu code

dfe9c28

Signed-off-by: Maryam Tahhan <[email protected]>

maryamtahhan force-pushed the cleanup-gpu branch 2 times, most recently from 0b12e73 to 330fe0f Compare September 30, 2024 12:45

maryamtahhan requested review from sthaha, dave-tucker and rootfs September 30, 2024 12:48

chore: remove gpu global vars

330fe0f

Signed-off-by: Maryam Tahhan <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: gpu cleanup #1704

chore: gpu cleanup #1704

maryamtahhan commented Aug 19, 2024 •

edited

Loading

github-actions bot commented Aug 19, 2024 •

edited

Loading

marceloamaral Sep 12, 2024

maryamtahhan Sep 12, 2024 •

edited

Loading

maryamtahhan commented Sep 27, 2024

maryamtahhan commented Sep 30, 2024

maryamtahhan commented Sep 30, 2024

chore: gpu cleanup #1704

Are you sure you want to change the base?

chore: gpu cleanup #1704

Conversation

maryamtahhan commented Aug 19, 2024 • edited Loading

github-actions bot commented Aug 19, 2024 • edited Loading

marceloamaral Sep 12, 2024

Choose a reason for hiding this comment

maryamtahhan Sep 12, 2024 • edited Loading

Choose a reason for hiding this comment

maryamtahhan commented Sep 27, 2024

maryamtahhan commented Sep 30, 2024

maryamtahhan commented Sep 30, 2024

maryamtahhan commented Aug 19, 2024 •

edited

Loading

github-actions bot commented Aug 19, 2024 •

edited

Loading

maryamtahhan Sep 12, 2024 •

edited

Loading