Relax NVIDIA ABI version check for version ranges with no changes #10628

Open
EtiennePerot opened this issue Jul 7, 2024 · 4 comments
Labels
area: gpu Issue related to sandboxed GPU access type: enhancement New feature or request

Comments

@EtiennePerot
Contributor

EtiennePerot commented Jul 7, 2024

Description

Currently, nvproxy's ABI version tree describes the ABI of each individual version number. This means that users need to have exactly the right driver version in order to use runsc. This hampers usability of nvproxy; see #10605 and #10624 for recent examples.

I propose that nvproxy's logic be relaxed for version ranges with no ABI differences. In other words, if the ABI has not changed from version 1.20.30 to version 1.40.50, then when running on a host with driver version 1.25.10, nvproxy should detect that this version falls within the range and automatically use its definition for ABI version 1.40.50.
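
As a rough illustration of the lookup described above, here is a minimal sketch in Go. It is not nvproxy's actual data structure: the `Version` and `abiRange` types, the `lookupABI` function, and the ABI names are all hypothetical, and it assumes the ranges are sorted, non-overlapping, and already verified to contain no ABI changes.

```go
package main

import (
	"fmt"
	"sort"
)

// Version is a hypothetical three-part driver version, e.g. 1.25.10.
type Version struct{ Major, Minor, Patch int }

// Less reports whether v sorts strictly before o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Patch < o.Patch
}

// abiRange (hypothetical) covers the inclusive range [First, Last],
// which is assumed to have been verified as free of ABI changes.
type abiRange struct {
	First, Last Version
	ABI         string // name of the ABI definition to use for the whole range
}

// lookupABI returns the ABI definition for the host driver version.
// ranges must be sorted by First and must not overlap.
func lookupABI(ranges []abiRange, host Version) (string, bool) {
	// Binary search for the first range whose Last is >= host.
	i := sort.Search(len(ranges), func(i int) bool {
		return !ranges[i].Last.Less(host)
	})
	if i == len(ranges) || host.Less(ranges[i].First) {
		return "", false // host falls in a gap between verified ranges
	}
	return ranges[i].ABI, true
}

func main() {
	ranges := []abiRange{
		{Version{1, 20, 30}, Version{1, 40, 50}, "abi-1.40.50"},
		{Version{1, 50, 0}, Version{1, 50, 0}, "abi-1.50.0"},
	}
	abi, ok := lookupABI(ranges, Version{1, 25, 10})
	fmt.Println(abi, ok)
	_, ok = lookupABI(ranges, Version{1, 45, 0})
	fmt.Println(ok)
}
```

A host version that falls in a gap between verified ranges is rejected rather than rounded to a neighbor, which keeps the current strict behavior for anything that has not been checked.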

This of course assumes that every version in the middle of this range indeed doesn't have any ABI changes. Therefore, in order to support this feature, the first task is to retroactively verify that this is the case between existing supported nvproxy ABI versions. For example, if there was any ABI change that was later reverted in the middle of the range, the nvproxy ABI version tree needs to have this range split to reflect the reality. An NVIDIA driver diffing tool should help here.
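
To make the range-splitting step concrete, here is a small sketch in Go. It assumes a diffing tool can reduce each driver release to some comparable ABI fingerprint; the `release` and `span` types, the `buildSpans` function, and the fingerprint values are made up for illustration and do not describe an existing tool.

```go
package main

import "fmt"

// release pairs a driver version with a hypothetical fingerprint of its ABI,
// e.g. a hash of the relevant struct layouts produced by a diffing tool.
type release struct {
	Version     string
	Fingerprint string
}

// span is a run of consecutive releases that share one fingerprint.
type span struct {
	First, Last string
	Fingerprint string
}

// buildSpans groups consecutive releases with equal fingerprints.
// releases must already be sorted in ascending version order.
func buildSpans(releases []release) []span {
	var spans []span
	for _, r := range releases {
		if n := len(spans); n > 0 && spans[n-1].Fingerprint == r.Fingerprint {
			spans[n-1].Last = r.Version // extend the current run
			continue
		}
		spans = append(spans, span{r.Version, r.Version, r.Fingerprint})
	}
	return spans
}

func main() {
	// The ABI changes at 1.30.00 and is reverted at 1.35.00.
	releases := []release{
		{"1.20.30", "A"}, {"1.25.10", "A"}, {"1.30.00", "B"},
		{"1.35.00", "A"}, {"1.40.50", "A"},
	}
	for _, s := range buildSpans(releases) {
		fmt.Printf("%s..%s -> %s\n", s.First, s.Last, s.Fingerprint)
	}
}
```

Because only adjacent releases are merged, a change that is later reverted yields three separate spans instead of one span from 1.20.30 to 1.40.50.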

Is this feature related to a specific bug?

N/A

Do you have a specific solution in mind?

See above.

@EtiennePerot EtiennePerot added the type: enhancement New feature or request label Jul 7, 2024
@EtiennePerot
Contributor Author

/cc @AC-Dap @ayushr2

@ayushr2
Collaborator

ayushr2 commented Jul 9, 2024

That makes sense. Maybe we need to re-organize how the version->abi data is stored into something like a segment tree, which is more effective in indicating ranges.

> This of course assumes that every version in the middle of this range indeed doesn't have any ABI changes.

Yes, this is a big assumption because roll-backs and roll-forwards are possible within ranges. The driver diffing tool is necessary here.

@nixprime
Member

nixprime commented Jul 9, 2024

> This of course assumes that every version in the middle of this range indeed doesn't have any ABI changes. Therefore, in order to support this feature, the first task is to retroactively verify that this is the case between existing supported nvproxy ABI versions.

IIUC this isn't a one-off cost: we'd also need to verify this property for every future version, extending the driver versions supported by nvproxy to "every version within some min/max bounds", which seems like potentially quite a lot of dev burden.

@EtiennePerot
Contributor Author

EtiennePerot commented Jul 11, 2024

> we'd also need to verify this property for every future version, extending the driver versions supported by nvproxy to "every version within some min/max bounds", which seems like potentially quite a lot of dev burden.

True, but I don't think this is a bad thing to do even without this feature. If an NVIDIA struct is changed and then changed back somewhere between versions A and B, chances are that the meaning of its fields may also have changed between A and B, even if the fields are back to being identical. At least, that probability is much higher than in the case where the struct didn't change at all in the version range. So it'd be a good time to look at that struct again.
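
One way to flag the "changed and then changed back" case for manual review is sketched below in Go. It assumes each release between A and B can be reduced to a comparable fingerprint of the struct in question; `revertedWithin` and the fingerprint strings are hypothetical names used only for this example.

```go
package main

import "fmt"

// revertedWithin reports whether the first and last fingerprints match
// while at least one intermediate fingerprint differs, i.e. the struct
// was changed and later changed back somewhere inside the range.
// fps holds one hypothetical fingerprint per release, in version order.
func revertedWithin(fps []string) bool {
	if len(fps) < 3 || fps[0] != fps[len(fps)-1] {
		return false
	}
	for _, fp := range fps[1 : len(fps)-1] {
		if fp != fps[0] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(revertedWithin([]string{"A", "A", "B", "A"})) // changed, then reverted
	fmt.Println(revertedWithin([]string{"A", "A", "A", "A"})) // never changed
}
```

A `true` result only says the layout round-tripped; whether the field semantics survived the round trip still needs a human to look at the struct.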

@ayushr2 ayushr2 added the area: gpu Issue related to sandboxed GPU access label Jul 11, 2024