
dra-evolution: quota mechanism #24

Closed
pohly opened this issue Jun 5, 2024 · 9 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

pohly commented Jun 5, 2024

The "future extension" proposal in #14 for quota was to check quota at allocation time.

Pros:

  • Can limit resource consumption based on what actually gets made available to a user, rather than on what is requested (which might be a lower limit).
  • Supports creating more claims and pods than can run at the moment ("batching" - might not be relevant).
  • Can support "one of" (if X exceeds quota, use Y).

Cons:

  • All schedulers need to also consider the ResourceQuota when checking devices.
  • Exceeding quota has to be reported as part of scheduling failures. OTOH, users typically also don't create ResourceClaims manually, so there is some indirection with admission checks, too.
pohly commented Jun 5, 2024

@johnbelamaric: let's continue here.

You wrote:

Whether it's memory or devices, as a workload author, I should be notified at admission time that I am asking for more than allowed.

Note that workload authors typically don't create claims and pods directly. They create an app object (Deployment) and the app controller creates the pods. Such a user doesn't get feedback at admission time, do they?

With DRA, such a user creates a ResourceClaimTemplate or uses some future "create claim for me" syntactic sugar. Then the resource claim controller tries to create the ResourceClaim and gets an admission error which needs to be propagated back to the user. Currently that happens via an event for the pod.

@johnbelamaric

Note that workload authors typically don't create claims and pods directly. They create an app object (Deployment) and the app controller creates the pods. Such a user doesn't get feedback at admission time, do they?

Hmm. Good point...

pohly commented Jun 6, 2024

The alternative to "check at allocation time" is "check at admission time". What exactly should be checked, and how, would need to be defined further. It cannot rely on ResourcePool information (which might not be available yet), so simulating the actual allocation isn't feasible. Counting the usage of a (then mandatory) DeviceClass was mentioned as a possible check.

I already implemented that for the "class per claim" approach from classic DRA, see kubernetes/kubernetes#120611. We haven't merged it because it wasn't clear whether it's the right approach.
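
For illustration, a minimal sketch of such a per-class counting check at admission time (the simpleClaim type, the checkDeviceClassQuota helper, and the limits map are made-up names for this sketch, not code from that PR; only "fmt" is needed as an import):

// Purely illustrative: count ResourceClaims per device class in a namespace
// and reject admission once a configured per-class limit would be exceeded.
type simpleClaim struct {
    Namespace string
    // Device class name per request in the claim.
    RequestedClasses []string
}

// checkDeviceClassQuota returns an error if admitting newClaim would push the
// per-class request count in its namespace above the configured limit.
func checkDeviceClassQuota(existing []simpleClaim, newClaim simpleClaim, limits map[string]int) error {
    counts := map[string]int{}
    for _, c := range existing {
        if c.Namespace != newClaim.Namespace {
            continue
        }
        for _, class := range c.RequestedClasses {
            counts[class]++
        }
    }
    for _, class := range newClaim.RequestedClasses {
        counts[class]++
        if limit, ok := limits[class]; ok && counts[class] > limit {
            return fmt.Errorf("device class %q: %d requests exceed quota of %d", class, counts[class], limit)
        }
    }
    return nil
}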

Besides not being precise when trying to limit "total GPU memory used by a user" (admission only sees what was requested, not what was actually given to the user), I also don't see how we can support future use cases like "give me all GPUs on a node". At admission time it is unknown how many GPUs that will be, so should such a claim be allowed or denied when there is a maximum for "number of GPUs"?

If it's allowed, we would still need to check the actual number at allocation time, which defeats the purpose of checking at admission time that "different schedulers don't need to know about quota".

pohly commented Jun 11, 2024

We now support "give me all GPUs on a node" in #24, so we have to figure out how to do quota for such requests.

Another example where "quota by class at admission time" breaks down is the future "give me device X, otherwise Y" use case.

Suppose at admission time, X is not allowed, but Y is. Should the request be allowed or denied? If it's allowed, the scheduler has to replicate the quota checks, which is what "check at admission time" was meant to avoid. If it's denied, it's a false negative because the claim could have been satisfied.

pohly added a commit to pohly/enhancements that referenced this issue Jun 11, 2024
pohly commented Jun 11, 2024

More complete proposal for allocation-time quota checking:

// Quota controls whether a ResourceClaim may get allocated.
// Quota is namespaced and applies to claims within the same namespace.
type Quota struct {
    metav1.TypeMeta
    // Standard object metadata.
    metav1.ObjectMeta

    Spec QuotaSpec
}

type QuotaSpec struct {
    // Controls whether devices may get allocated with admin access
    // (concurrent with normal use, potentially privileged access permissions
    // depending on the driver). If multiple quota objects exist and at least one
    // has a true value, access will be allowed. The default is to deny such access.
    //
    // +optional
    AllowAdminAccess bool `json:"allowAdminAccess,omitempty"`

    // The total number of allocated devices matching these selectors must not be
    // exceeded. This has to be checked in addition to other claim constraints
    // when checking whether a device can be allocated for a claim.
    MaxDeviceCounts []MaxDeviceCount

    // The sum of some quantity attribute of allocated devices must not
    // exceed a maximum value. This has to be checked in addition to other claim constraints
    // when checking whether a device can be allocated for a claim.
    MaxQuantity []MaxQuantity

    // Other useful future extensions (>= 1.32):

    // DeviceLimits is a CEL expression which takes the currently allocated
    // devices and their attributes and some new allocations as input and
    // returns false if those allocations together are not permitted in the
    // namespace.
    //
    // DeviceLimits string

    // A class listed in DeviceClassDenyList must not be used in this
    // namespace. This can be useful for classes which contain
    // configuration parameters that a user in this namespace should not have
    // access to.
    //
    // DeviceClassDenyList []string

    // A class listed in DeviceClassAllowList may be used in this namespace
    // even when that class is marked as "privileged". Normally classes
    // are not privileged and using them does not require explicit listing
    // here, but some classes may contain more sensitive configuration parameters
    // that not every user should have access to.
    //
    // DeviceClassAllowList []string
}

type MaxDeviceCount struct {
    // The maximum number of allocated devices. May be zero to prevent using
    // certain devices.
    Maximum resource.Quantity

    // Only devices matching all selectors are counted.
    //
    // +listType=atomic
    Selectors []Selector
}

type MaxQuantity struct {
    // The maximum sum of a certain quantity attribute. Only allocated devices which
    // have this attribute as a quantity contribute towards the sum.
    Maximum resource.Quantity

    // The fully-qualified attribute name ("<domain>/<identifier>").
    AttributeName string
}
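
For illustration, a minimal sketch of how a scheduler could apply MaxDeviceCounts and MaxQuantity when deciding whether one more device may be allocated in a namespace (the allocatedDevice type and the matchesSelectors helper are assumptions for this sketch, not part of the proposal; resource is k8s.io/apimachinery/pkg/api/resource):

// Illustrative sketch only, not part of the proposal above.
type allocatedDevice struct {
    // Quantity attributes by fully-qualified name ("<domain>/<identifier>").
    Attributes map[string]resource.Quantity
}

// matchesSelectors stands in for whatever selector evaluation the real
// implementation would use (e.g. CEL over device attributes).
func matchesSelectors(d allocatedDevice, selectors []Selector) bool {
    return true // placeholder
}

// deviceFitsQuota reports whether allocating candidate in addition to the
// devices already allocated in the namespace stays within the quota.
func deviceFitsQuota(spec QuotaSpec, allocated []allocatedDevice, candidate allocatedDevice) bool {
    for _, limit := range spec.MaxDeviceCounts {
        count := int64(0)
        for _, d := range allocated {
            if matchesSelectors(d, limit.Selectors) {
                count++
            }
        }
        if matchesSelectors(candidate, limit.Selectors) {
            count++
        }
        if count > limit.Maximum.Value() {
            return false
        }
    }
    for _, limit := range spec.MaxQuantity {
        var sum resource.Quantity
        for _, d := range allocated {
            if q, ok := d.Attributes[limit.AttributeName]; ok {
                sum.Add(q)
            }
        }
        if q, ok := candidate.Attributes[limit.AttributeName]; ok {
            sum.Add(q)
        }
        if sum.Cmp(limit.Maximum) > 0 {
            return false
        }
    }
    return true
}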

As it stands now, a single Quota object per namespace can contain all restrictions.
If we add a new field to the type (like DeviceLimits or the class allow/deny lists),
old schedulers will ignore it and thus not apply the new restriction.

Should we make QuotaSpec a one-of? Exactly one field would need to be set (true
or non-empty) per object. This prevents putting all restrictions into a single
object.

Or shall we nest a one-of inside a slice called "rules" or "settings"?

The semantics in both cases would be that if an old scheduler encounters a
Quota object with an empty one-of, it knows that it cannot handle
allocation inside that namespace.
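
For illustration, the "rules" variant could look roughly like this (a sketch only; the field and type names are placeholders):

// Sketch of the one-of-in-a-slice variant, not part of the proposal above.
// Exactly one field per rule must be set; an old scheduler that finds a rule
// in which all fields it knows about are unset can conclude that it must not
// allocate devices in this namespace.
type QuotaSpec struct {
    // +listType=atomic
    Rules []QuotaRule
}

type QuotaRule struct {
    // +optional
    AllowAdminAccess *bool

    // +optional
    MaxDeviceCount *MaxDeviceCount

    // +optional
    MaxQuantity *MaxQuantity
}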

pohly commented Jun 12, 2024

We now support "give me all GPUs on a node" in #24, so we have to figure out how to do quota for such requests.

There are two (crude) options:

  • Deny admission of claims which use this feature.
  • Do a worst-case analysis and assume that "all" = as many as allowed per claim = 32 (current limit for allocation result).

For a future "device X or Y" request, admission would have to be denied unless both X and Y are under quota.
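
For illustration, the worst-case counting from the second option above might look like this (a sketch; 32 is the current limit for an allocation result):

// Illustrative only: at admission time the real device count for an
// "all devices on a node" request is unknown, so the check has to assume
// the maximum that a single allocation result may contain.
const maxDevicesPerAllocationResult = 32

func worstCaseDeviceCount(requestsAllDevices bool, requestedCount int64) int64 {
    if requestsAllDevices {
        return maxDevicesPerAllocationResult
    }
    return requestedCount
}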

I still don't think that we should do quota that way, but at least there would be a way to adapt kubernetes/kubernetes#120611 to per-request device classes.

As I mentioned in kubernetes/enhancements#4709 (comment), that admission check can also enforce that a device class name is set in namespaces where the check is configured. Elsewhere the name can remain optional.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Sep 10, 2024
pohly commented Sep 11, 2024

/close

Let's plan for an allocation-time quota mechanism in kubernetes/enhancements#4840.

@k8s-ci-robot

@pohly: Closing this issue.

In response to this:

/close

Let's plan for an allocation-time quota mechanism in kubernetes/enhancements#4840.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
