
Update API for 1.31 without partitionable model #36

Open
wants to merge 3 commits into main

Conversation

johnbelamaric (Contributor)

This attempts to update the API to:

  1. Remove the partitionable model
  2. Allow extension points for the future addition of a partitionable model, in line with the discussions in #20 (dra-evolution: partitioning of devices).

This is rough, not exactly what would go into k/k. But I think it has what we need. It allows us to do common attrs and common capacity (common, not shared). It looks a little like some of Kevin's earlier prototypes, and is a sort of simplified version of #31 and #35.

cc @pohly @klueska

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes (the PR's author has signed the CNCF CLA), approved (approved by an approver from all required OWNERS files), and size/M (changes 30-99 lines, ignoring generated files) labels on Jun 18, 2024
Comment on lines 62 to 66
-// SharedCapacity defines the set of shared capacity consumable by
-// devices in this ResourceSlice.
+// CommonData represents information that is common across various devices,
+// to reduce redundancy in the devices list.
 //
 // Must not have more than 128 entries.
 //
 // +listType=atomic
 // +optional
-SharedCapacity []SharedCapacity `json:"sharedCapacity,omitempty"`
+CommonData []CommonDeviceData `json:"commonData,omitempty"`
Contributor

I wouldn't include this. If we are ripping out the model that supports partitioning, I think we should just go back to an explicit list of attributes per device.

Contributor Author

We need an extension point to insert the things we are debating in #20. That's the primary reason it is here. Adding the common things was just a way to give it some meaning rather than leaving it a stub. With this, an old scheduler can see that there is an entry in this list, but when it unmarshals it, all the fields the scheduler is aware of are nil, so it knows "ooo, I should ignore this slice, I don't know what to do with it".

If we don't have this, then we need some other way for an older scheduler to know that "it doesn't understand this object".
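
To make that concrete, here is a minimal Go sketch of the detection logic, assuming hypothetical type and field names (not the actual k/k API):

package sketch

// Illustrative stand-ins for the real API types.
type DeviceAttribute struct {
	Name  string
	Value string
}

// CommonDeviceData is a one-of: exactly one member is set per entry.
// This (old) build only knows about Attributes; a future API revision
// would add more members alongside it.
type CommonDeviceData struct {
	Attributes []DeviceAttribute `json:"attributes,omitempty"`
}

type ResourceSliceSpec struct {
	CommonData []CommonDeviceData `json:"commonData,omitempty"`
}

// understandsSlice reports whether an old scheduler can safely use the
// slice: if any CommonData entry has none of the fields this build knows
// about set, the entry must carry a newer one-of member, so the slice
// should be ignored.
func understandsSlice(spec ResourceSliceSpec) bool {
	for _, cd := range spec.CommonData {
		if cd.Attributes == nil {
			return false
		}
	}
	return true
}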

Contributor Author

The alternative would be to have a discriminator field, like SliceContent: "SimpleDevices" or something. Then we would add new values to that so old schedulers know if they should handle the slice or not.

Contributor

My current thinking is that we would not have any extension points directly in the device at this point. Instead every device will have the ability to point to one (and exactly one) DeviceShape/DeviceTemplate. That is what will have all of these extra bells and whistles. The device itself is still just a named list of attributes (and now capacities).
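
For illustration, a rough Go sketch of that split; DeviceShape and its fields are hypothetical names from this discussion, not anything the PR defines:

package sketch

// DeviceShape (or DeviceTemplate) is where the extension points would
// live: common attributes, partitioning data, and future additions.
type DeviceShape struct {
	Name string `json:"name"`
	// Common attributes, common capacity, partitioning, etc. go here.
}

// Device stays a simple named list of attributes and capacities, plus a
// reference to exactly one shape in the same slice.
type Device struct {
	Name       string            `json:"name"`
	ShapeName  string            `json:"shapeName"`
	Attributes map[string]string `json:"attributes,omitempty"`
	Capacity   map[string]int64  `json:"capacity,omitempty"`
}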

Contributor Author

Ok, that works for me. The thing is, we still need a way for old schedulers to know "this slice is not something I understand". So a "SliceType" discriminator, perhaps? Or we need a field like the one I have here, which is a one-of.

-// +optional
-SharedCapacity []SharedCapacity `json:"sharedCapacity,omitempty"`
+// +required
+SliceContent string `json:"sliceContent"`
Contributor Author

Should this be an enum instead, @pohly ?

Contributor

So the scheduler will decode this object, read this field, and then decide if it knows how to work with it or not? An enum seems appropriate then, since this field should only be allowed to have a specific set of values.
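
A sketch of what that enum could look like, reusing the "SimpleDevices" value floated above (the Go identifiers are made up):

package sketch

// SliceContentType enumerates the kinds of content a ResourceSlice can
// carry. Clients must ignore slices whose value they don't recognize.
type SliceContentType string

const (
	// SliceContentSimpleDevices would be the only value defined
	// initially; later API revisions would add more.
	SliceContentSimpleDevices SliceContentType = "SimpleDevices"
)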

Contributor

Do we need this?

If a current-rev scheduler only knows how to work with "devices", it uses the slice.devices field to find those.

If a future API adds "gizmos" or some variant of "devices" (remember, I wanted to keep the current field named as "namedDevices"...), then it adds a slice.gizmos field. A current-rev scheduler will see an empty slice and ignore it because it contains no devices.
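
A hypothetical sketch of that layout (the Gizmos field does not exist; it stands in for any future device variant):

package sketch

type Device struct{ Name string }
type Gizmo struct{ Name string }

// ResourceSliceSpec in some future revision: a one-of across sibling
// fields, with no discriminator.
type ResourceSliceSpec struct {
	Devices []Device `json:"devices,omitempty"`
	Gizmos  []Gizmo  `json:"gizmos,omitempty"` // unknown to old builds
}

// A current-rev scheduler only reads Devices; a "gizmo" slice decodes
// with an empty Devices list and is naturally skipped.
func usable(spec ResourceSliceSpec) bool {
	return len(spec.Devices) > 0
}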

Contributor

Or is the intent to add additional fields to the Device struct? In that case such a SliceContent discriminator would be useful because an unknown value tells old clients that there is something in the slice.devices that they don't know about, but should know.

I just wouldn't call it SliceContent, but rather DeviceType: it's a description of the slice.devices.

If we add slice.gizmos (separate field), then they can have a slice.gizmoType.

@johnbelamaric (Contributor Author), Jun 20, 2024

We likely would add another field in addition to "Devices" to qualify or add onto the devices.

If we want to avoid the discriminator, I would avoid the term "Devices". But I don't think "NamedDevices" works because "named" isn't a differentiating feature. Maybe "SimpleDevices" or something?

If we keep the discriminator, I do not think "DeviceType" is right, because it wouldn't make sense for "gizmos", for example.

Contributor

Having said that, I am okay with adding the discriminator. It may be redundant when adding a second slice with a different type, but for the case where the same type gets extended (https://github.com/kubernetes-sigs/wg-device-management/pull/36/files/3a7aa4f9789c7979426ea3c65497545136428f55#r1646662809) it is useful.

My question about validation after a downgrade also applies here.

Contributor Author

According to the bullet on "enumerated types" here, we should be OK as long as we document that schedulers that see unknown slice types should ignore those slices. I think older clients doing a write would be subject to the validation of the API version they are using, meaning they would not be able to use the new enum value. I think if there is a field associated with the new enum value, we should clear it (the client doesn't know about the new enum, but we do). I think that's consistent with the logic described here, but we should discuss it with the API reviewers.

One question: Should (can?) we ensure all slices of a pool are of the same slice type? If not, old schedulers would have to ignore the whole pool, at least for "all", if ANY slice is of an unknown type...
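
If mixed pools were allowed, the pool-level check an old scheduler would need might look like this (a sketch with hypothetical types):

package sketch

type SliceContentType string

type ResourceSlice struct {
	SliceContent SliceContentType `json:"sliceContent"`
}

// poolUsableForAll reports whether an old scheduler may satisfy an
// "all devices" request from this pool: a single slice of an unknown
// type hides devices from us, so the whole pool must be skipped.
func poolUsableForAll(pool []ResourceSlice, known map[SliceContentType]bool) bool {
	for _, s := range pool {
		if !known[s.SliceContent] {
			return false
		}
	}
	return true
}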

Contributor Author

It would seem to me that adding an enum value should require a bump in the API version.

Contributor

Once we have reached v1, we cannot bump it any further?

I'm still not exactly clear on how to implement validation, but there must be precedents, so I guess it is okay.

@thockin, Jun 28, 2024

A discriminator is required when you want the absence of config to imply a default, so that version skew isn't perceived as the default case.

For cases that are truly one-of (exactly one field; empty lists don't count), a discriminator is not needed, because any actor can perceive the lack of a value and choose not to act. A discriminator can make for better error messages, but that's about all.

Whether we extend Device to encompass partitions (assuming we can do that safely) or we add PartitionableDevice here, it shouldn't matter -- both are viable. Likewise, whether we want devices/gizmos/widgets/partitionables to co-exist in a slice or to be mutually exclusive, I think both are viable. There may be other things to consider to make those decisions, but I don't think this aspect matters.

@johnbelamaric (Contributor Author)

Ok, I simplified it, and now it's basically just a discriminator with a single valid value in 1.31, but more will come.

pohly added a commit to pohly/kubernetes that referenced this pull request Jul 8, 2024
@pohly mentioned this pull request Jul 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale (remained open with no activity and has become stale) label on Sep 26, 2024