DRAFT: Propose an alternative to the types in capacity_types.go #8

Draft · wants to merge 1 commit into main

Conversation

klueska
Contributor

@klueska klueska commented May 3, 2024

I spent some time today going through the types in the file pkg/api/capacity_types.go in detail. I have proposed my changes to it as an entirely new file (instead of a diff to the original) so that we can more easily comment on it line-by-line without the diff getting in the way.

The biggest thing to note is the following:
I'm coming around to the idea that DevicePool (or whatever future resource models we come up with) could be a top-level object, rather than being wrapped in multiple levels of indirection as we do today to support the NamedResources model:

type ResourceSlice struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	NodeName string `json:"nodeName,omitempty" protobuf:"bytes,2,opt,name=nodeName"`
	DriverName string `json:"driverName" protobuf:"bytes,3,name=driverName"`
	ResourceModel `json:",inline" protobuf:"bytes,4,name=resourceModel"`
}

type ResourceModel struct {
	NamedResources *NamedResourcesResources
}

type NamedResourcesResources struct {
	Instances []NamedResourcesInstance `json:"instances" protobuf:"bytes,1,name=instances"`
}

Having all of these levels of indirection doesn't actually add much value. The original idea was that the scheduler (or whoever else cares about ResourceSlices) only has one object type to subscribe to, and we can simply extend its embedded ResourceModel with more types over time.

This actually makes it awkward, however, when we need to split our Instances at the lowest level across multiple API server objects. Each object needs to be wrapped in multiple levels of indirection, making it somewhat hard to reason about when trying to reconstruct all objects from a given driver.

By contrast, making DevicePool (or whatever future resource models we come up with) a top-level API server object gives us a natural way of breaking things apart. Each pool is its own self-contained object, so there is no ambiguity about where the object boundaries should be. After thinking about it for a while, this actually feels like the more natural and "Kubernetes-like" way of abstracting these objects.

The major difference is that the scheduler can no longer subscribe to a single ResourceSlice type and have all supported model implementations from all drivers flow through it. It potentially has to subscribe to multiple top-level objects representing different models and react to them accordingly. This seems like a reasonable trade-off, since one would always have to write code that is aware of the different models anyway.

The second biggest thing to note is the following:
I have (slightly) renamed and "future-proofed" all of the types that represent named resources and their requests. Whether we want to do this is still being debated, but I wanted to get a concrete proposal for what it would look like as a diff to @johnbelamaric's prototype, rather than continuing to talk in "what-ifs".

The third biggest thing to note is the following:
We can now have a meaningful Status attached to the top-level type. In this case I have expanded it to include all allocated devices and the claims they have been allocated to.
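
To make this concrete, here is a rough sketch of the overall shape this leads to (a sketch only; the type and field names below are illustrative, not necessarily the ones in the proposed file):

type DevicePool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DevicePoolSpec   `json:"spec"`
	Status DevicePoolStatus `json:"status,omitempty"`
}

// DevicePoolSpec keeps the node/driver attribution that the wrapper types
// carried, plus the pool contents themselves.
type DevicePoolSpec struct {
	NodeName   *string  `json:"nodeName,omitempty"`
	DriverName string   `json:"driverName"`
	Devices    []Device `json:"devices"`
}

// Device stands in for the per-device type defined in the proposed file.
type Device struct {
	Name string `json:"name"`
	// ...
}

// DevicePoolStatus is where the meaningful status mentioned above can live:
// which devices are allocated, and to which claims.
type DevicePoolStatus struct {
	AllocatedDevices []AllocatedDevice `json:"allocatedDevices,omitempty"`
}

// AllocatedDevice records a single allocation from this pool.
type AllocatedDevice struct {
	// Device is the name of the device within this pool.
	Device string `json:"device"`
	// ClaimNamespace and ClaimName identify the ResourceClaim the device
	// was allocated to.
	ClaimNamespace string `json:"claimNamespace"`
	ClaimName      string `json:"claimName"`
}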

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 3, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from pohly May 3, 2024 12:59
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 3, 2024
@klueska klueska marked this pull request as draft May 3, 2024 13:00
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 3, 2024
@pohly
Contributor

pohly commented May 3, 2024

The original idea was that the scheduler (or whoever else cares about ResourceSlices) only has one object type to subscribe to, and we can simply extend its embedded ResourceModel with more types over time.

This is what kubernetes/kubernetes#123699 depends upon: kubelet knows that drivers may access ResourceSlice and ResourceClaim through the REST proxy and denies access to everything else. When combining kubelet from 1.31 with a future DRA driver which wants to publish a "SomeFutureType" instead of a "ResourceSlice", the driver's request will get denied. I'm working on a KEP update for this issue and have a prototype implementation ready that I'll share together with it.

This actually makes it awkward, however, when we need to split our Instances at the lowest level across multiple API server objects. Each object needs to be wrapped in multiple levels of indirection, making it somewhat hard to reason about when trying to reconstruct all objects from a given driver.

There's one additional indirection ("namedResources"), right? I don't find it hard to ignore that - but perhaps I am simply used to it. 🤷

@klueska
Contributor Author

klueska commented May 3, 2024

I think where it got awkward was when I was trying to adapt @johnbelamaric's model to fit into this:

type DeviceModel struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    Spec   DeviceModelSpec
    Status DeviceModelStatus
}

type DeviceModelSpec struct {
    NodeName *string
    DriverName string 
    *DeviceModelVariant
}

type DeviceModelVariant struct {
    DevicePools *DevicePools
    // Other future models
}

type DevicePools struct {
    Pools []DevicePool
}

type DevicePool struct {
...
}

These five structs collapse into just the last one, with multiple instances of it published as top-level objects.

@klueska
Contributor Author

klueska commented May 3, 2024

I'm not saying we necessarily want to do this, but couldn't the kubelet be updated to whitelist any new model objects we define (just as the scheduler would need to know to pull them in too)?

@johnbelamaric
Contributor

johnbelamaric commented May 3, 2024

This is what kubernetes/kubernetes#123699 depends upon: kubelet knows that drivers may access ResourceSlice and ResourceClaim through the REST proxy and denies access to everything else. When combining kubelet from 1.31 with a future DRA driver which wants to publish a "SomeFutureType" instead of a "ResourceSlice", the driver's request will get denied. I'm working on a KEP update for this issue and have a prototype implementation ready that I'll share together with it.

Let's not forget we have versioning in the API. So, we start with v1 without the indirection. One option then is to add a new type; another option is to create DevicePool v2 which in turn has the indirection but roundtrips to v1. Kubelet can use an Unstructured for passthrough, so it can still check the types without changes and without versioning issues between apiserver/kubelet/driver. Unless I am missing something.
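
For illustration, here is a minimal sketch of the kind of check that Unstructured enables (the allowlist and helper below are hypothetical, not the actual kubelet code):

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// Hypothetical allowlist of group/kinds a driver may write through the
// proxy. Versions are deliberately not part of the key, so new versions
// round-trip without kubelet changes.
var allowedKinds = map[schema.GroupKind]bool{
	{Group: "resource.k8s.io", Kind: "ResourceSlice"}: true,
	{Group: "resource.k8s.io", Kind: "ResourceClaim"}: true,
}

// checkPassthrough inspects only the type metadata of the raw object,
// without kubelet needing the typed API for it.
func checkPassthrough(raw []byte) error {
	obj := &unstructured.Unstructured{}
	if err := obj.UnmarshalJSON(raw); err != nil {
		return err
	}
	gvk := obj.GroupVersionKind()
	if !allowedKinds[gvk.GroupKind()] {
		return fmt.Errorf("access to %s is not allowed through the proxy", gvk)
	}
	return nil
}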

A couple questions as you think through that design:

  • Should kubelet enforce that pool.Driver == the one for the specified driver? Or even just set it, assuming the driver registers with kubelet?
  • Similarly, should the pool name be something the driver comes up with, or kubelet? Pool names coming from nodes (we could have ones that come from elsewhere) should generally be <node-name>-<something>. Are there exceptions to that? Should the driver or the kubelet control it?

@pohly
Contributor

pohly commented May 3, 2024

Kubelet can use an Unstructured for passthrough, so it can still check the types without changes and without versioning issues between apiserver/kubelet/driver.

kubelet still needs to check whether a request for some unknown resource should be allowed. Let's discuss in kubernetes/enhancements#4615

@pohly
Contributor

pohly commented May 3, 2024

So, we start with v1 without the indirection. One option then is to add a new type; another option is to create DevicePool v2 which in turn has the indirection but roundtrips to v1.

This doesn't work without the one-of-many concept already being in v1. A new field can be added to v1 to make the round-trip possible, but a client built against v1 without that field will never see it. We have to define in advance how such a client can detect that it doesn't have some information that it should have.
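
To make the lost-field problem concrete (the types below are purely illustrative):

// What a client compiled against the original v1 can see.
type PoolSpecV1 struct {
	Devices []string `json:"devices"` // simplified stand-in for the device list
}

// v2 adds an alternative model behind a one-of.
type PoolSpecV2 struct {
	Devices []string `json:"devices,omitempty"`
	// Other has no representation in the original v1 struct. Even if the
	// apiserver stashes it in a new v1 field for round-tripping, a client
	// built against the original v1 silently drops it, which is why v1
	// needs a way to signal "there is data here you cannot see" from the
	// start.
	Other *OtherModel `json:"other,omitempty"`
}

// OtherModel stands in for some future resource model.
type OtherModel struct{}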

johnbelamaric added a commit to johnbelamaric/wg-device-management that referenced this pull request May 6, 2024
*Claim Model*

- `DeviceType` is gone. Instead, this version takes Tim's suggestion:
  - `DeviceClass` is mandatory for claims.
  - `DeviceClass` can refer either to the criteria that defines a set of devices
    (an individual class), OR to a label selector.
  - The label selector will be used to identify a set of classes. So,
    essentially, a class is now either a "leaf" class, or it is a set of
    classes.
  - If a claim specifies a "non leaf" class, then the scheduler must flatten
    that to a list of leaf classes, and can consider any of those classes as
    containing possible solutions. This is more expressive than DeviceType and
    fits better into the API conventions.
- `DeviceClaimInstance.DeviceClaimDetail` is now a pointer so clients can check
  if it is nil or not, in order to differentiate between that and the `OneOf`
  form of `DeviceClaimInstance`. An alternative would be to add `AllOf` as a
  peer to `OneOf`. This is just a bit more awkward, with an additional YAML
  stanza with a single entry for the expected most common case. But it may match
  better with API conventions.
- Moved AdminAccess from class to claim. This is necessary because if we are to
  control by quota, it's not clear that we can look up the AdminAccess boolean
  that is in the *class* when we apply quota to the *claim*. This still seems
  like the wrong way to do this, but it does avoid a separate top-level type.
  For AdminAccess, make it clear only the Constraints and Driver matter; the
  rest of the claim details do not. Additionally, ALL devices matching the
  Driver and Constraints will be associated with the claim and able to be added
  to a container.
- Changed requests.count to requests.devices to be more explicit. Defaults to 1.
- Removed Limits. Instead, we now have `MaxDevices`, which will default to the
  value of `requests.devices`. Thus, to get exactly N devices, you only need to
  specify `requests.devices = N` in the claim. If you want 1-N devices, you only
  need to specify `maxDevices = N`. If you want N-M devices, you would specify
  `requests.devices = N, maxDevices = M` (see the defaulting sketch after this
  commit message).
- `DeviceClaimStatus` now has an atomic listMap for allocations, owned by the
  scheduler, and a separate listMap with entries owned by the drivers to capture
  the results of the actuation of that allocation.

*Capacity Model*

- `DevicePool.Resources` => `DevicePool.SharedResources` (as in kubernetes-sigs#8)
- `DevicePoolStatus` changes from kubernetes-sigs#8
- `Device.Requests` => `Device.ConsumesSharedResources` (similar to kubernetes-sigs#8)
- `Device.Resources` => `Device.ProvidesClaimResources`

*NVIDIA DGX A100 Example*

- Added `memory` to the `Device.ProvidesClaimResources`

*PodSpec Examples*

- Still working on these.

There are still some open questions that I will summarize in a comment.
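
As a sanity check on the requests.devices / maxDevices semantics above, here is a small sketch of the defaulting (only the field semantics come from the commit message; the helper itself is hypothetical):

// deviceRange computes the allowed number of devices for a claim:
// requests.devices defaults to 1, and maxDevices defaults to the
// (possibly defaulted) value of requests.devices.
func deviceRange(requestedDevices, maxDevices *int64) (min, max int64) {
	min = 1
	if requestedDevices != nil {
		min = *requestedDevices
	}
	max = min
	if maxDevices != nil {
		max = *maxDevices
	}
	return min, max
}

// deviceRange(&two, nil)   => 2, 2  (exactly 2 devices)
// deviceRange(nil, &four)  => 1, 4  (1 to 4 devices)
// deviceRange(&two, &four) => 2, 4  (2 to 4 devices)
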
Inline review comments on the proposed file, attached to this snippet:

// different types of resources this struct can represent.
type DeviceResource struct {
	Name string `json:"name"`
	*DeviceResourceValue `json:",inline"`
Contributor Author

I've noticed that this indirection (and the others like it) still hasn't made it into the prototype or been discussed much. I really feel like limiting ourselves to just a quantity is going to bite us in the future.
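
For reference, a hypothetical sketch of the kind of value one-of being argued for here (the field names are illustrative, not taken from the prototype; resource is k8s.io/apimachinery/pkg/api/resource):

// DeviceResourceValue is a one-of: exactly one field is expected to be set.
// Starting with only a quantity, but leaving room for other value kinds,
// keeps DeviceResource itself stable as new kinds are added later.
type DeviceResourceValue struct {
	// Quantity covers the common countable/measurable case.
	Quantity *resource.Quantity `json:"quantity,omitempty"`
	// Ranges, sets, versions, etc. could be added here later without
	// touching DeviceResource. (Purely illustrative.)
}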

Contributor

I agree, this is probably needed. We'll need to convince @thockin that the complexity is worth it. I think that given that the "capacity" model is primarily for driver developers, a little more complexity is OK. Users will have to be able to understand it though for troubleshooting.

Can you make sure this is covered in the "open questions" doc?

I'm not sure I understand what this is?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 17, 2024
Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/work-in-progress - Indicates that a PR should not merge because it is a work in progress.
  • lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.