DRAFT: Propose an alternative to the types in capacity_types.go #8

Draft · wants to merge 1 commit into main

Conversation

klueska
Contributor

@klueska klueska commented May 3, 2024

I spent some time today going through the types in the file pkg/api/capacity_types.go in detail. I have proposed my changes to it as an entirely new file (instead of a diff to the original) so that we can more easily comment on it line-by-line without the diff getting in the way.

The biggest thing to note is the following:
I'm coming around to the idea that DevicePool (or whatever future resource models we come up with) could be a top-level object, rather than being wrapped in multiple levels of indirection as we do today to support the NamedResources model:

type ResourceSlice struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	NodeName string `json:"nodeName,omitempty" protobuf:"bytes,2,opt,name=nodeName"`
	DriverName string `json:"driverName" protobuf:"bytes,3,name=driverName"`
	ResourceModel `json:",inline" protobuf:"bytes,4,name=resourceModel"`
}

type ResourceModel struct {
	NamedResources *NamedResourcesResources
}

type NamedResourcesResources struct {
	Instances []NamedResourcesInstance `json:"instances" protobuf:"bytes,1,name=instances"`
}

Having all of these levels of indirection doesn't actually add much value. The original idea was that the scheduler (or whoever else cares about ResourceSlices) only has one object type to subscribe to, and we can simply extend its embedded ResourceModel with more types over time.

This actually makes it awkward, however, when we need to split our Instances at the lowest level across multiple API server objects. Each object needs to be wrapped in multiple levels of indirection, making it somewhat hard to reason about when trying to reconstruct all objects from a given driver.

By contrast, making DevicePool (or whatever future resource models we come up with) a top-level API server object gives us a natural way of breaking things apart. Each pool is its own self-contained object, so there is no ambiguity about where the object boundaries should be. After thinking about it for a while, this actually feels like the more natural and "Kubernetes-like" way of abstracting these objects.

The major difference is that the scheduler can no longer subscribe to a single ResourceSlice type and have all supported model implementations from all drivers flow through it. It potentially has to subscribe to multiple top-level objects representing different models and react to them accordingly. This seems like a reasonable trade-off, since one would always have to write code that is aware of the different models anyway.

The second biggest thing to note is the following:
I have (slightly) renamed and "future-proofed" all of the types that represent named resources and their requests. Whether we want to do this is still being debated, but I wanted to get a concrete proposal for what it would look like as a diff to @johnbelamaric's prototype, rather than continuing to talk in "what-ifs".

The third biggest thing to note is the following:
We can now have a meaningful Status attached to the top-level type. In this case I have expanded it to include all allocated devices and the claims they have been allocated to.
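
To make this concrete, here is a rough sketch of the overall shape this leads to (a sketch only; the type and field names below are illustrative, not necessarily the ones in the proposed file):

type DevicePool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DevicePoolSpec   `json:"spec"`
	Status DevicePoolStatus `json:"status,omitempty"`
}

// DevicePoolSpec keeps the node/driver attribution that the wrapper types
// carried, plus the pool contents themselves.
type DevicePoolSpec struct {
	NodeName   *string  `json:"nodeName,omitempty"`
	DriverName string   `json:"driverName"`
	Devices    []Device `json:"devices"`
}

// Device stands in for the per-device type defined in the proposed file.
type Device struct {
	Name string `json:"name"`
	// ...
}

// DevicePoolStatus is where the meaningful status mentioned above can live:
// which devices are allocated, and to which claims.
type DevicePoolStatus struct {
	AllocatedDevices []AllocatedDevice `json:"allocatedDevices,omitempty"`
}

// AllocatedDevice records a single allocation from this pool.
type AllocatedDevice struct {
	// Device is the name of the device within this pool.
	Device string `json:"device"`
	// ClaimNamespace and ClaimName identify the ResourceClaim the device
	// was allocated to.
	ClaimNamespace string `json:"claimNamespace"`
	ClaimName      string `json:"claimName"`
}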

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 3, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from pohly May 3, 2024 12:59
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 3, 2024
@klueska klueska marked this pull request as draft May 3, 2024 13:00
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 3, 2024
@pohly
Contributor

pohly commented May 3, 2024

The original idea was that the scheduler (or whoever else cares about ResourceSlices) only has one object type to subscribe to, and we can simply extend its embedded ResourceModel with more types over time.

This is what kubernetes/kubernetes#123699 depends upon: kubelet knows that drivers may access ResourceSlice and ResourceClaim through the REST proxy and denies access to everything else. When combining kubelet from 1.31 with a future DRA driver which wants to publish a "SomeFutureType" instead of a "ResourceSlice", the driver's request will get denied. I'm working on a KEP update for this issue and have a prototype implementation ready that I'll share together with it.

This actually makes it awkward, however, when we need to split our Instances at the lowest level across multiple API server objects. Each object needs to be wrapped in multiple levels of indirection, making it somewhat hard to reason about when trying to reconstruct all objects from a given driver.

There's one additional indirection ("namedResources"), right? I don't find it hard to ignore that - but perhaps I am simply used to it. 🤷

@klueska
Contributor Author

klueska commented May 3, 2024

I think where it got awkward was when I was trying to adapt @johnbelamaric's model to fit into this:

type DeviceModel struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    Spec   DeviceModelSpec
    Status DeviceModelStatus
}

type DeviceModelSpec struct {
    NodeName *string
    DriverName string 
    *DeviceModelVariant
}

type DeviceModelVariant struct {
    DevicePools *DevicePools
    // Other future models
}

type DevicePools struct {
    Pools []DevicePool
}

type DevicePool struct {
...
}

These five structs collapse into just the last one, with multiple instances of it published as top-level objects.

@klueska
Contributor Author

klueska commented May 3, 2024

I'm not saying we necessarily want to do this, but couldn't the kubelet be updated to whitelist any new model objects we define (just as the scheduler would need to know to pull them in too)?

@johnbelamaric
Contributor

johnbelamaric commented May 3, 2024

This is what kubernetes/kubernetes#123699 depends upon: kubelet knows that drivers may access ResourceSlice and ResourceClaim through the REST proxy and denies access to everything else. When combining kubelet from 1.31 with a future DRA driver which wants to publish a "SomeFutureType" instead of a "ResourceSlice", the driver's request will get denied. I'm working on a KEP update for this issue and have a prototype implementation ready that I'll share together with it.

Let's not forget we have versioning in the API. So, we start with v1 without the indirection. One option then is to add a new type; another option is to create DevicePool v2 which in turn has the indirection but roundtrips to v1. Kubelet can use an Unstructured for passthrough, so it can still check the types without changes and without versioning issues between apiserver/kubelet/driver. Unless I am missing something.
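
For illustration, here is a minimal sketch of the kind of check that Unstructured enables (the allowlist and helper below are hypothetical, not the actual kubelet code):

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// Hypothetical allowlist of group/kinds a driver may write through the
// proxy. Versions are deliberately not part of the key, so new versions
// round-trip without kubelet changes.
var allowedKinds = map[schema.GroupKind]bool{
	{Group: "resource.k8s.io", Kind: "ResourceSlice"}: true,
	{Group: "resource.k8s.io", Kind: "ResourceClaim"}: true,
}

// checkPassthrough inspects only the type metadata of the raw object,
// without kubelet needing the typed API for it.
func checkPassthrough(raw []byte) error {
	obj := &unstructured.Unstructured{}
	if err := obj.UnmarshalJSON(raw); err != nil {
		return err
	}
	gvk := obj.GroupVersionKind()
	if !allowedKinds[gvk.GroupKind()] {
		return fmt.Errorf("access to %s is not allowed through the proxy", gvk)
	}
	return nil
}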

A couple questions as you think through that design:

  • Should kubelet enforce that pool.Driver == the one for the specified driver? Or even just set it, assuming the driver registers with kubelet?
  • Similarly, should the pool name be something the driver comes up with, or kubelet? Pool names coming from nodes (we could have ones that come from elsewhere) should generally be <node-name>-<something>. Are there exceptions to that? Should the driver or the kubelet control it?

@pohly
Contributor

pohly commented May 3, 2024

Kubelet can use an Unstructured for passthrough, so it can still check the types without changes and without versioning issues between apiserver/kubelet/driver.

kubelet still needs to check whether a request for some unknown resource should be allowed. Let's discuss in kubernetes/enhancements#4615

@pohly
Contributor

pohly commented May 3, 2024

So, we start with v1 without the indirection. One option then is to add a new type; another option is to create DevicePool v2 which in turn has the indirection but roundtrips to v1.

This doesn't work without the one-of-many concept already being in v1. A new field can be added to v1 to make the round-trip possible, but a client built against v1 without that field will never see it. We have to define in advance how such a client can detect that it doesn't have some information that it should have.
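
To make the lost-field problem concrete (the types below are purely illustrative):

// What a client compiled against the original v1 can see.
type PoolSpecV1 struct {
	Devices []string `json:"devices"` // simplified stand-in for the device list
}

// v2 adds an alternative model behind a one-of.
type PoolSpecV2 struct {
	Devices []string `json:"devices,omitempty"`
	// Other has no representation in the original v1 struct. Even if the
	// apiserver stashes it in a new v1 field for round-tripping, a client
	// built against the original v1 silently drops it, which is why v1
	// needs a way to signal "there is data here you cannot see" from the
	// start.
	Other *OtherModel `json:"other,omitempty"`
}

// OtherModel stands in for some future resource model.
type OtherModel struct{}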

johnbelamaric added a commit to johnbelamaric/wg-device-management that referenced this pull request May 6, 2024
*Claim Model*

- `DeviceType` is gone. Instead, this version takes Tim's suggestion:
  - `DeviceClass` is mandatory for claims.
  - `DeviceClass` can refer either to the criteria that defines a set of devices
    (an individual class), OR to a label selector.
  - The label selector will be used to identify a set of classes. So,
    essentially, a class is now either a "leaf" class, or it is a set of
    classes.
  - If a claim specifies a "non leaf" class, then the scheduler must flatten
    that to a list of leaf classes, and can consider any of those classes as
    containing possible solutions. This is more expressive than DeviceType and
    fits better into the API conventions.
- `DeviceClaimInstance.DeviceClaimDetail` is now a pointer so clients can check
  if it is nil or not, in order to differentiate between that and the `OneOf`
  form of `DeviceClaimInstance`. An alternative would be to add `AllOf` as a
  peer to `OneOf`. This is just a bit more awkward, with an additional YAML
  stanza with a single entry for the expected most common case. But it may match
  better with API conventions.
- Moved AdminAccess from class to claim. This is necessary because if we are to
  control by quota, it's not clear that we can look up the AdminAccess boolean
  that is in the *class* when we apply quota to the *claim*. This still seems
  like the wrong way to do this, but it does avoid a separate top-level type.
  For AdminAccess, make it clear only the Constraints and Driver matter; the
  rest of the claim details do not. Additionally, ALL devices matching the
  Driver and Constraints will be associated with the claim and able to be added
  to a container.
- Changed requests.count to requests.devices to be more explicit. Defaults to 1.
- Removed Limits. Instead, we now have `MaxDevices`, which will default to the
  value of `requests.devices`. Thus, to get exactly N devices, you only need to
  specify `requests.devices = N` in the claim. If you want 1-N devices, you only
  need to specify `maxDevices = N`. If you want N-M devices, you would specify
  `requests.devices = N, maxDevices = M` (see the defaulting sketch after this
  commit message).
- `DeviceClaimStatus` now has an atomic listMap for allocations, owned by the
  scheduler, and a separate listMap with entries owned by the drivers to capture
  the results of the actuation of that allocation.

*Capacity Model*

- `DevicePool.Resources` => `DevicePool.SharedResources` (as in kubernetes-sigs#8)
- `DevicePoolStatus` changes from kubernetes-sigs#8
- `Device.Requests` => `Device.ConsumesSharedResources` (similar to kubernetes-sigs#8)
- `Device.Resources` => `Device.ProvidesClaimResources`

*NVIDIA DGX A100 Example*

- Added `memory` to the `Device.ProvidesClaimResources`

*PodSpec Examples*

- Still working on these.

There are still some open questions that I will summarize in a comment.
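
As a sanity check on the requests.devices / maxDevices semantics above, here is a small sketch of the defaulting (only the field semantics come from the commit message; the helper itself is hypothetical):

// deviceRange computes the allowed number of devices for a claim:
// requests.devices defaults to 1, and maxDevices defaults to the
// (possibly defaulted) value of requests.devices.
func deviceRange(requestedDevices, maxDevices *int64) (min, max int64) {
	min = 1
	if requestedDevices != nil {
		min = *requestedDevices
	}
	max = min
	if maxDevices != nil {
		max = *maxDevices
	}
	return min, max
}

// deviceRange(&two, nil)   => 2, 2  (exactly 2 devices)
// deviceRange(nil, &four)  => 1, 4  (1 to 4 devices)
// deviceRange(&two, &four) => 2, 4  (2 to 4 devices)
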
Inline review comments on the proposed file, attached to this snippet:

// different types of resources this struct can represent.
type DeviceResource struct {
	Name string `json:"name"`
	*DeviceResourceValue `json:",inline"`
Contributor Author

I've noticed that this indirection (and the others like it) still hasn't made it into the prototype or been discussed much. I really feel like limiting ourselves to just a quantity is going to bite us in the future.
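
For reference, a hypothetical sketch of the kind of value one-of being argued for here (the field names are illustrative, not taken from the prototype; resource is k8s.io/apimachinery/pkg/api/resource):

// DeviceResourceValue is a one-of: exactly one field is expected to be set.
// Starting with only a quantity, but leaving room for other value kinds,
// keeps DeviceResource itself stable as new kinds are added later.
type DeviceResourceValue struct {
	// Quantity covers the common countable/measurable case.
	Quantity *resource.Quantity `json:"quantity,omitempty"`
	// Ranges, sets, versions, etc. could be added here later without
	// touching DeviceResource. (Purely illustrative.)
}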

Contributor

I agree, this is probably needed. We'll need to convince @thockin that the complexity is worth it. I think that given that the "capacity" model is primarily for driver developers, a little more complexity is OK. Users will have to be able to understand it though for troubleshooting.

Can you make sure this is covered in the "open questions" doc?

I'm not sure I understand what this is?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 17, 2024
Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/work-in-progress - Indicates that a PR should not merge because it is a work in progress.
  • lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.