
Roadmap for migrating existing Pangeo hubs to 2i2c operation #12

Closed · rabernat opened this issue Nov 9, 2020 · 13 comments

rabernat commented Nov 9, 2020

The Pangeo Cloud Operations Working Group (myself, @jhamman, @consideRatio, and @scottyhq) met today and discussed our future plans. We have another impending sustainability crisis: @salvis2, @TomAugspurger, and Scott have been doing much of the heavy lifting on operations, and they have either already left (Sebastian) or are soon leaving the project. So we need to plan a transition, and I'd like to use this thread to discuss that plan.

Key questions:

  • What infrastructure are we currently operating?
  • What is special / custom about Pangeo hubs / binders compared to vanilla deployments?
  • What do we want Pangeo Cloud Jupyter infrastructure to look like in 4 months?
  • What effort is required to migrate our existing deployments to 2i2c-managed hubs?
  • What effort is required to maintain / operate the hubs post migration?

I will try to answer some of these in subsequent replies, but for now I'll just open the issue with those questions. Anyone else should feel free to chime in.

choldgraf (Member) commented Nov 9, 2020

Two other key questions:

  • what kind of time frame are we working with? You mentioned that some folks are leaving "soon" but how soon are we talking and for which people?
  • what kind of resources will Pangeo have to support running hub infrastructure after this transition? @rabernat's sub-award will be enough to fund a person...do we need more than that? If so, do we need to go out and get more money, or do we have pre-existing money to work with?

This is related to a conversation that @yuvipanda and I had this morning, which is that we should try to take the current model we have for the "hubs for all" educational pilot, and adapt it for pangeo-style hubs. This would entail some combination of:

  • Define what is "a Pangeo Hub" and provide a replicable pattern around it that improves upon the current "hubploy" model
  • Make it easy to deploy an N+1th hub, both in general but specifically on 2i2c infra (similar to the Hubs for All project where it's just a YAML file you edit)
  • Make it easier to maintain a collection of such hubs

TomAugspurger commented Nov 9, 2020 via email

scottyhq commented Nov 9, 2020

Thanks for starting the convo @rabernat. I also won't be disappearing, but I do plan to scale back significantly on infrastructure-related work ;)

What effort is required to maintain / operate the hubs post migration?

Pangeo infrastructure efforts have generally been flush with credits from cloud providers, but short on dedicated personnel to develop and maintain infrastructure. I've been lucky enough to have ~30% FTE from a NASA award over the last two years to dedicate to keeping Pangeo infrastructure running on AWS, and we also had @salvis2 working ~75% on this for the last year. That award is spinning down in the next 6 months, so we're trying our best to simply keep development frozen and focus on research applications.

A lot of time is required when you want new things (dask-gateway integration, scratch buckets for all, helm chart version bumps, shiny new things like spawner modifications, custom NFS solutions, metrics, cost tracking, etc.).

What infrastructure are we currently operating?

aws-uswest2.pangeo.io
aws-uswest2-binder.pangeo.io
(2 EKS clusters running on a UW eScience-managed account)

What is special / custom about Pangeo hubs / binders compared to vanilla deployments?

  • automated admission to anyone with a github account who fills out a google form
  • object storage scratch space that wipes every 24hrs / X days
  • integrated with dask-gateway (see the usage sketch after this list)
  • generous home directory quotas ;)
  • curated geoscience-focused docker images
  • multicore notebook server comparable to 'standard laptop' specs in the cloud
  • runs on AWS and Azure in addition to GCP
  • pangeo gallery is powered by equivalent binderhubs running on GCP and AWS
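
For anyone unfamiliar with the dask-gateway integration, here is a minimal sketch of what it looks like from a notebook session on one of these hubs, assuming (as the Pangeo helm charts typically arrange) that the hub pre-sets the DASK_GATEWAY_* environment variables so Gateway() needs no arguments:

```python
# Minimal dask-gateway usage sketch from a hub notebook session.
# Assumes the hub pre-sets DASK_GATEWAY_* environment variables so
# Gateway() can discover the gateway server without arguments.
from dask_gateway import Gateway

gateway = Gateway()              # connect to the hub's gateway server
cluster = gateway.new_cluster()  # request a new Dask cluster
cluster.scale(4)                 # ask the gateway for 4 workers
client = cluster.get_client()    # Dask client bound to this cluster
```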

What do we want Pangeo Cloud Jupyter infrastructure to look like in 4 months?

I absolutely love having access to a capable environment on AWS us-west-2, and the binderhub in AWS us-west-2 has been instrumental for demos and workshops. The hubs and binderhubs are a rallying point for the pangeo community and the most frictionless way to get people experimenting with cloud-based workflows and dask distributed computing in my opinion. Their fully open nature is also extremely unsustainable!

At the very least I'd like to see an open-access binderhub on AWS us-west-2 maintained. I'd love for 2i2c to figure out a pay-per-team pangeo hub solution, where usage costs are tracked at the research group or grant team level. And in the meantime, if we are flush with credits and have a person dedicated to maintenance, I'd selfishly love for the AWS us-west-2 jupyterhub to continue as is!
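
One possible building block for that pay-per-team idea (purely a sketch, not something the current hubs do): label each user's pod with a team identifier so the cloud provider's cost-allocation tooling can aggregate usage per group. Assuming KubeSpawner and a hypothetical user-to-team mapping, the hub config might look roughly like:

```python
# jupyterhub_config.py -- hypothetical sketch of per-team pod labeling.
# TEAM_FOR_USER is a made-up mapping; a real hub might pull it from a
# database or the signup form instead.
TEAM_FOR_USER = {"alice": "earthcube-team-a", "bob": "moore-team-b"}

def assign_team_label(spawner):
    team = TEAM_FOR_USER.get(spawner.user.name, "unassigned")
    # extra_labels puts Kubernetes labels on the spawned user pod; cost
    # allocation tools (e.g. GKE usage metering) can group on these.
    spawner.extra_labels.update({"pangeo.io/team": team})

c.Spawner.pre_spawn_hook = assign_team_label
```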

@robfatland

I have a couple thoughts...

If Pangeo pivots into a more operational service, it's probably worth talking to CloudBank for back-office stuff like account management. That still leaves a tech-ops gap, but it might lead to something.

I suggest setting aside 2+ days for an internal "messaging workshop"; all the comments above suggest this to me. Maybe the goal of that time spent would be a draft proposal. I use the term 'messaging' because Pangeo is, among other things, a testbed, and this -- I claim! -- raises some impedance-matching questions. Someone might ask "Isn't Pangeo too much horsepower / too experimental for the 2i2c democratization plan?" and so on.

@choldgraf (Member)

@robfatland could you go into more detail on this comment:

"Isn't Pangeo too much horsepower / too experimental for the 2i2c democratization plan?'

And just as extra explanation - the goal of 2i2c is not just to democratize access to straightforward hub infrastructure, but also to be a place for cutting-edge-style development in collaboration with research/education partners. Those two kinds of models won't be the same hub (or the same group of collaborators), but I think the vision for 2i2c is that it can help amplify the kinds of experimental work done in Pangeo in a more sustainable manner.

@rabernat (Author)

What infrastructure are we currently operating?

Just want to mention that our main deployment is still the GCP-based cluster https://us-central1-b.gcp.pangeo.io/.

@rabernat (Author)

Just cross-referencing an issue on the pangeo-cloud-federation repo that we basically have no capacity to handle right now: Action Required: Suspicious Activity Observed on Google Cloud Project pangeo. This highlights the need for someone to monitor these hubs on a regular basis.

@TomAugspurger

I just posted pangeo-data/pangeo-cloud-federation#874, which tries to document some of the special things about our GCP deployment.

I'm mostly offline for the next few days still, but I want to make a transition to 2i2c as smooth as possible.

@yuvipanda (Member)

Thinking of two kinds of deployments, mostly from the perspective of who / what pays for them.

  1. 'Community' deployments, open for a wide array of folks to use. This would be all the hubs in pangeo-cloud-federation right now, primarily funded by grants to 'further the mission', as it were. It would include all the binderhubs too.
  2. 'Organizational' deployments, restricted to a particular community of users from a distinct 'organization'. 2i2c runs a hub for the Farallon Institute for use by folks there. These are almost 'internal' hubs, but run exactly like Pangeo hubs - just with a different funding source and support setup.

Funding source and support expectations differentiate (1) and (2), and we should aim to keep everything else exactly the same. This eases support burden.

Operationally, I think the next important thing is probably to make sure that the community deployments can keep going. Given that credits seem to exist, it looks to me like the issue is one of onboarding 'maintainers' - folks who can take care of the infrastructure and work on improvements where needed - the jobs that @TomAugspurger, @scottyhq, @salvis2, and others have been graciously doing until now. Does this sound right?

@rabernat (Author)

Given that credits seem to exist, it looks to me like the issue is one of onboarding 'maintainers' - folks who can take care of the infrastructure and work on improvements where needed - the jobs that @TomAugspurger, @scottyhq, @salvis2, and others have been graciously doing until now. Does this sound right?

Correct. But we are not expecting anyone to do this for free. Our Moore Foundation and NSF EarthCube grants have funding for this type of work. (Money has just flowed to Anaconda and UW for this purpose.)

@yuvipanda (Member)

Once 2i2c-org/infrastructure#134 is merged, we can create many Pangeo-style hubs (notebook + dask) just by editing hubs.yaml. Eventually, I want to be able to trivially spin hubs up and down with minimal setup cost, so we can run ephemeral clusters for specific events without putting a lot of pressure on long-running hubs.
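
To make that concrete, a hypothetical hubs.yaml entry for a Pangeo-style hub might look something like the sketch below; the field names are illustrative guesses, not the actual schema from 2i2c-org/infrastructure#134:

```yaml
# Hypothetical hubs.yaml entry -- field names are illustrative only.
hubs:
  - name: pangeo-demo                        # made-up hub name
    domain: pangeo-demo.pilot.2i2c.cloud     # made-up domain
    template: daskhub                        # notebook + dask-gateway flavour
    auth:
      type: github
    profile:
      image: pangeo/pangeo-notebook:latest   # curated geoscience image
      cpu: 4
      memory: 16G
```

The point is that the N+1th hub becomes a small diff to one file rather than a new deployment repo.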

@rabernat (Author)

@yuvipanda and I had a great meeting today with the Pangeo cloud infrastructure working group. (@TomAugspurger and @jhamman were the other attendees). We did a deep dive on the compatibility between Pangeo's existing hubs and what 2i2c is doing. The good news is that it looks like the migration should be pretty straightforward.

Notes from the meeting are here:
https://docs.google.com/document/d/1I-2VNNHoAjjeYvlCezQhFLmiu2OevqGDS5nUAP-6Hfw/edit

We can work on translating these to some specific action items as issues in this repo.

@yuvipanda (Member)

Cross-referencing berkeley-dsep-infra/datahub#2124 (comment) from @rabernat as the kind of thing that'd be nice in these hubs :)
