Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ws): initial Workspace and WorkspaceKind controller loops #22

Merged
merged 19 commits into from
Jul 15, 2024

Conversation

Adembc
Copy link

@Adembc Adembc commented Jul 6, 2024

This PR makes the following changes to the workspace controller:

  • Adds necessary RBAC permissions.
  • Updates NewControllerManagedBy to ensure our controller own StatefulSets and Services.
  • Sets the owner reference of the Workspace to include the WorkspaceKind it uses
  • Creates special indexes for cached resources (StatefulSet, Service, WorkspaceKind) to optimize lookups.
  • Adds watchers to the Workspace controller for:
    • WorkspaceKind resources that own the Workspace.
    • Pods that have workspaceName in their labels.
  • Generates StatefulSet and service from Workspace and WorkspaceKind CRDs with a generateName prefix set tows-{WORKSPACE_NAME}- .
  • Implements retry logic on AlreadyExists error when creating StatefulSets and Services.
  • Updates the status field during reconciliation to reflect the current state of the Workspace

@thesuperzapper
Copy link
Member

We can do a proper review of this in our session on Monday, but until then feel free to keep working on it.

@jiridanek you also might want to take a look at this (if you were making a similar PR).

@jiridanek
Copy link
Member

@jiridanek you also might want to take a look at this (if you were making a similar PR).

I weren't, in the end.

Does it run with

?

Adembc and others added 2 commits July 7, 2024 17:23
Signed-off-by: Adem Baccara <[email protected]>
Signed-off-by: Mathew Wicks <[email protected]>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/XL labels Jul 9, 2024
@thesuperzapper
Copy link
Member

@Adembc I have made a significant commit 54e7ed2 to your branch, honestly it's hard to list all the changes, but it should give you a good starting point so we can get closer to mering this.

The keys things you need to do are:

  1. Review the TODO's I have left:
    • Not all of them need to be done before merging, but many do.
  2. Do some tests with the new commit (I am 100% certain it does not fully work right now, I was coding blind a lot of it so I am sure stuff will need to be fixed when you run it in the cluster).
    - Please document what tests you run and how you run them.
    - If possible, update the make test-e2e to automate some of them

I won't do any more commits until our next review, so feel free to make new commits on the branch to move the PR forward.

Signed-off-by: Mathew Wicks <[email protected]>
@thesuperzapper
Copy link
Member

@Adembc I have made one more change in 01ce3ff

Which is to implement:

  • spec.deferUpdates in the Workspace (to allow users to prevent updates being applied when the notebook is paused)
  • status.podTemplateOptions.imageConfig.redirectChain on the Workspace (to allow the UI to know the full chain of redirects that are going to be applied on restart, rather than having to calculate it itself)

You should be good to make your own commits now, I dont have any more I need to make in the short term.

Signed-off-by: Mathew Wicks <[email protected]>
@thesuperzapper
Copy link
Member

@Adembc I have added another change in 5f2597f

The changes are:

  • Adding spec.podTemplate.extraVolumeMounts and spec.podTemplate.extraVolumes to WorkspaceKind
    • This is for PyTorch, so users can mount a memory-backed emptyDir to /dev/shm.
  • Correctly handle the fact that the same data volume can be mounted multiple times in Workspace, by chaning the listMapKey to "mountPath"
  • Renamed spec.podTemplate.volumes.data[].name to spec.podTemplate.volumes.data[].pvcName in Workspace
  • Added spec.podTemplate.volumes.data[].readOnly in Workspace to allow read-only data volumes to be mounted

Signed-off-by: Adem Baccara <[email protected]>
Signed-off-by: Mathew Wicks <[email protected]>
@thesuperzapper
Copy link
Member

thesuperzapper commented Jul 15, 2024

@Adembc I have made some changes in a49ffba

The key changes are:

  • Implemented the WorkspaceKind controller
  • Made the make test work properly for Workspace and WorkspaceKind:
    • We now run the proper two reconsillers in the suite_test.go
  • Factored out a common r.updateWorkspaceState() function in the Workspace controller (which immediately patches the status and returns, because its mostly used for permeant error states, where we need to wait for some external change).
  • Set finalizers on the WorkspaceKind to prevent in-use ones being deleted (and remove it when no Workspaces use the WorkspaceKind anymore, in the WorkspaceKind controller).

Also, please leave the containerSecurityContext as allowPrivilegeEscalation: false and capabilities ... drop ... all, we might need to make the default image the RC for Kubeflow 1.9.0 (which I fixed support for these kinds of securityContexts), e.g:

  • docker.io/kubeflownotebookswg/jupyter-scipy:v1.9.0-rc.2

I have left some TODOs, especially in the tests. Remember that the StatefulSet controller doesnt run in envtest, so the Pods will never actually be created, but we can still test most of it (and do the actual Pod testing in the make e2e in a future PR).

Adembc and others added 2 commits July 15, 2024 16:47
Signed-off-by: Mathew Wicks <[email protected]>
@thesuperzapper
Copy link
Member

@Adembc I have made some changes in 27d03c4

They key changes are:

  • Added spec.podTemplate.options.imageConfig.values[].spec.ports[].id to WorkspaceKind:
    • This ID is used for the HTTP path, and so we can reference it consistently in the spec.podTemplate.extraEnv go templates
  • Changed the go templating in extraEnv[].value to be a function called httpPathPrefix that allows users to reference the port ID they want the HTTP path of.
  • Massively reduced log spam by moving all non-important logs for log.V(2) and making most non-fatal error logs into log.V(0):
    • I have also filtered out the UPDATE conflict errors and told the controllers to requeue when they happen (this happens a lot because we have multiple reconcile requests all fighting over the chance to update the resources)

I will make one more commit, then we can merge this so we can do the TODOs in separate follow up PRs to spread the work out between people.

Signed-off-by: Mathew Wicks <[email protected]>
@thesuperzapper
Copy link
Member

@Adembc I made one last change in aa403b1

The change was to move the default of each option in WorkspaceKind under a spawner section, to help clarify that this is NOT a default on the Workspace, but only for the Spawner UI.

It also creates space to do future "display configs" for each option in the Spawner UI.

@thesuperzapper thesuperzapper changed the title feat(ws): implement a reconciliation loop for the workspace feat(ws): initial Workspace and WorkspaceKind controller loops Jul 15, 2024
@thesuperzapper
Copy link
Member

@Adembc I am going to merge this now, let's create follow up PRs for the TODOs and tasks in your doc.

/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: thesuperzapper

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 0cacff7 into kubeflow:notebooks-v2 Jul 15, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants