Allow import-db within stages #10575

Open
fabiannagel opened this issue Oct 1, 2024 · 1 comment
Labels
feature request Requesting a new feature p2-medium Medium priority, should be done, but less important

Comments

@fabiannagel

I'm missing the possibility to run import-db as part of my pipeline. Consider the following scenario:

stages:
  
  convert:
    cmd: python convert.py
    deps:
      - data/raw_input_data
    outs:
      - data/converted_input_data

  ingest:
    cmd: python ingest.py && dvc import-db --table ingested_data --conn pgsql -o data/database
    deps:
      - data/converted_input_data
    outs:
      - data/database

  run:
    cmd: python app.py
    deps:
      - data/database

ingest consumes my converted input data, applies some transformations and populates the application database. The state of the database is persisted via import-db, which is the data dependency for running the application.

Right now, dvc repro throws the following error with this config:

Running stage 'ingest':
> dvc import-db --table ingested_data --conn pgsql -o data/database
ERROR: output 'data/database' is already specified in stage: 'ingest'.
Use `dvc remove ingest` to stop tracking the overlapping output.
ERROR: failed to reproduce 'ingest': failed to run: dvc import-db --table ingested_data --conn pgsql -o data/database, exited with 255

It would be great to have a flag that tells dvc import-db it is running as part of a pipeline, so that overlapping outputs are not an issue.

@shcheklein shcheklein added the feature request Requesting a new feature label Oct 1, 2024
@shcheklein
Member

It makes sense to expand pipeline stages to support DB imports (or other imports?). What do you think, @skshetry?

For now, I would recommend running the query directly from a Python script. You can take a look at the DbDependency implementation and borrow some SQL wrapper code from it (it should not be very complicated, I think).
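
As a rough illustration of that workaround, here is a minimal sketch of a table export that could be called at the end of ingest.py. It is not taken from DVC's DbDependency code: the function name export_table is hypothetical, the output format (CSV) is an assumption, and sqlite3 from the standard library is imported only so the sketch can be tried without a server. With PostgreSQL you would pass a connection from whichever DB-API driver you already use.

```python
import csv
import sqlite3  # stand-in driver; any DB-API 2.0 connection should work the same way
from pathlib import Path


def export_table(conn, table, out_path):
    """Dump every row of `table` to a CSV file at `out_path` and return the row count.

    Hypothetical helper for illustration; not part of DVC.
    """
    # Table names cannot be bound as query parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")

    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table}")
    header = [col[0] for col in cur.description]
    rows = cur.fetchall()

    # Create the parent directory (e.g. data/) if it does not exist yet.
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

With something like this, ingest.py could call export_table(conn, "ingested_data", "data/database") after populating the database, and the ingest stage's cmd would shrink to python ingest.py. Since data/database is then written by the stage's own command, no second DVC command tries to claim the same output.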

@shcheklein shcheklein added the p2-medium Medium priority, should be done, but less important label Oct 1, 2024