Unable to specify extras in a python wheel installation for Databricks Asset Bundles #1602

Open
aabilov-dataminr opened this issue Jul 17, 2024 · 4 comments
aabilov-dataminr commented Jul 17, 2024

Describe the issue

When packaging a Python wheel, it's standard practice to put some dependencies into extras groups, declared under [project.optional-dependencies] and installed with, e.g., pip install "llm-workflows[train]". This is commonly used in GPU/ML experimentation repositories to scope dependency groups to specific use cases or workflows.

When attempting to specify an extras group in the libraries config for a DABs project, the bundle build throws an error:

databricks bundle deploy                                  
Building llm-workflows...
Error: file dist/*.whl[train] is referenced in libraries section but doesn't exist on the local file system

Hoping that this can be resolved! The only possible workarounds as of now are very disruptive to standard Python packaging workflows:

  • workaround 1: enumerate all extra libraries in each task config (a sketch follows below)
    • major downside: every extra library must be duplicated in both the package definition and databricks.yml
  • workaround 2: split the package into multiple separate wheels
    • major downside: huge overhead to create new dependency groups

If there are other/better workarounds, I'd love to hear them!
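
For reference, workaround 1 boils down to listing the extras' packages as pypi libraries next to the wheel in each task config; the transformers pin below is copied from pyproject.toml and has to be kept in sync by hand:

          libraries:
            - whl: ./dist/*.whl
            - pypi:
                package: "transformers==4.41.2"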

Configuration (shortened for brevity)

In pyproject.toml:

dependencies = [
    "databricks-sdk>=0.29.0"
]

[project.optional-dependencies]
train = [
    "transformers==4.41.2"
]

In databricks.yml:

experimental:
  python_wheel_wrapper: true

artifacts:
  llm-workflows:
    type: whl
    path: ./
    build: python3 -m build . --wheel


# ...task config
      tasks:
        - task_key: "task"
          spark_python_task:
            python_file: "./llm_workflows/cli/generate.py"
          libraries:
            - whl: ./dist/*.whl[train]

Steps to reproduce the behavior

databricks bundle deploy

Expected Behavior

Instead of attempting to find a local file literally named ./dist/*.whl[train], the bundle should recognize that [train] is an extras group, glob for the wheel without the suffix, and install the extras appropriately. This is standard behavior for Python wheels.

Actual Behavior

Bundle build fails because the wheel file can't be found.

OS and CLI version

OS X, Databricks CLI v0.219.0

Is this a regression?

No

aabilov-dataminr added the DABs (DABs related issues) label on Jul 17, 2024
aabilov-dataminr (Author) commented

Actually, workaround 2 does not work... I tried splitting the repo into two packages, but it seems DABs cleans up the dist folder in between wheel builds 😞

Artifacts config:

artifacts:
  llm-workflows-core:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_core --wheel --outdir dist
  llm-workflows-train:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_train --wheel --outdir dist

Deploy run:

databricks bundle deploy                                 
Building llm-workflows-core...
Building llm-workflows-train...
Uploading dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.27.32-py3-none-any.whl...
Error: upload for llm-workflows-core failed, error: unable to read /Users/aabilov/git/dm-llm-workflows/dist/dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.27.32-py3-none-any.whl: no such file or directory

Expected behavior:
When specifying two artifacts to be built into the same dist folder, I would expect both wheels to end up in that folder:

> python3 -m build llm_workflows_core --wheel --outdir dist
> python3 -m build llm_workflows_train --wheel --outdir dist
> ls dist                                                   
dm_llm_workflows_core-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.30.10-py3-none-any.whl
dm_llm_workflows_train-0.4.3.post8+git.94ebd0cd.dirty.2024.7.17t17.30.15-py3-none-any.whl

pietern added the Bug (Something isn't working) label on Jul 18, 2024
pietern (Contributor) commented Jul 18, 2024

Thanks for reporting the issue, @aabilov-dataminr.

We'll take a look at the cleanup of dist in between builds; that seems wrong. In the meantime, you could try having each build output to a different directory (which won't be cleaned up).
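
For example, something like this (just a sketch; the dist_core and dist_train directory names are arbitrary), with each task's whl glob then pointing at the matching directory:

artifacts:
  llm-workflows-core:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_core --wheel --outdir dist_core
  llm-workflows-train:
    type: whl
    path: ./
    build: python3 -m build llm_workflows_train --wheel --outdir dist_train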

As for proper extras support, we'll take a look as well. If this works at the API level, we should keep the extras suffix intact when we glob to find the wheel file.

andrewnester self-assigned this on Jul 18, 2024
github-merge-queue bot pushed a commit that referenced this issue Jul 24, 2024
## Changes
The prepare stage, which performs the cleanup, is now executed once before all builds run, so artifacts built into the same folder are correctly kept.

Fixes workaround 2 from this issue #1602

## Tests
Added a unit test.
andrewnester (Contributor) commented

@aabilov-dataminr the fix to support workaround 2 has been merged and released in version 0.224.1; please give it a try. In the meantime, I'm verifying whether the Databricks backend supports providing libraries with extras, and I'll keep this issue updated.

j-4 commented Aug 29, 2024

We use another workaround for installing extra dependencies for our integration tests: after specifying the wheel file(s) as a cluster dependency, we install the extra dependencies at runtime with a subprocess call.

the test resource config:

targets:
  test:
    sync:
      include:
        - ../dist/*.whl
    resources: 
      jobs: 
        integration-test: 
          name: integration-test
          tasks:
            - task_key: "main"
              spark_python_task:
                python_file: ${workspace.file_path}/tests/entrypoint.py
              libraries:
                - whl: ../dist/*.whl
              ...
          job_clusters:
            - job_cluster_key: test-cluster
              new_cluster:
                ...
                spark_env_vars:
                  DIST_FOLDER_PATH: ${workspace.file_path}/dist
...

the databricks.yml:

...
artifacts:
  default:
    type: whl
    path: .
...

and the entrypoint.py file:

import os
import subprocess
import sys

if __name__ == "__main__":
    # no bytecode io
    sys.dont_write_bytecode = True
    # install extra dependencies, workaround for https://github.com/databricks/cli/issues/1602
    dist_folder = os.environ.get("DIST_FOLDER_PATH")
    if dist_folder is None:
        raise KeyError(
            "The env variable DIST_FOLDER_PATH is not set but is needed to run the tests."
        )
    # collect every wheel that was synced into the dist folder
    wheel_files = [os.path.join(dist_folder, f) for f in os.listdir(dist_folder) if f.endswith(".whl")]
    for wheel_file in wheel_files:
        # installing "<wheel>[test]" also pulls in the wheel's test extras
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", f"{wheel_file}[test]"]
        )
...
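
The f"{wheel_file}[test]" spec is the key detail: pip resolves the test extras declared in the wheel's own metadata at install time, so the extras never have to appear in the bundle's libraries config at all.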
