pull: stale data when converted from "imported" to pipeline-local file #10457

Open
bakaleks opened this issue Jun 11, 2024 · 0 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? p3-nice-to-have It should be done this or next sprint triage Needs to be triaged


Bug Report

Description

Imagine a scenario in which a DVC data pipeline uses a file imported from a data registry as a stage dependency, and two users work on the same pipeline.

If user1 changes the imported data file locally and reproduces the pipeline, the imported file is automatically committed (as with 'dvc commit'), but the local changes are not pushed to the remote. I guess this is by design, because imported files should be changed in the data registry rather than locally. If user2 clones the pipeline and runs 'dvc pull', he or she naturally won't get the local changes made by user1.

If user1 then tries to rectify the situation by converting the imported data file into a "local" pipeline data file, by removing the .dvc file and running 'dvc add', this does the trick on user1's side: the updated file is pushed to the remote by 'dvc push'.

Nevertheless, and this is where the bug manifests itself, when user2 runs 'git pull' and 'dvc pull', the file content is still the "dataregistry" version and not the pipeline-local one.

The workaround for user2 is to remove the data file and the .dvc/cache folder and run 'dvc pull' again, which pulls the correct version of the file.
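To confirm that a workspace file is stale after 'dvc pull', one can compare its checksum with the hash recorded in the .dvc file. The following is only a minimal sketch, not part of DVC: the helper name check_dvc_hash is hypothetical, and it assumes the DVC 3 file layout in which data.txt.dvc has an "outs:" section whose first "md5:" entry is the plain MD5 of the file. It also assumes md5sum and awk are available (Linux).

```shell
# Hypothetical helper (not a DVC command): compare a workspace file's MD5
# with the hash recorded under "outs:" in its .dvc file.
# Assumption: DVC 3 layout, i.e. "outs:" followed by "- md5: <hash>".
check_dvc_hash() {
    file="$1"
    # First "md5:" entry after the "outs:" line; "hash: md5" does not match "md5:".
    recorded=$(awk '/^outs:/ {in_outs = 1} in_outs && /md5:/ {print $NF; exit}' "$file.dvc")
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$recorded" = "$actual" ]; then
        echo "OK: $file matches $file.dvc"
    else
        echo "STALE: $file has $actual but $file.dvc records $recorded"
        return 1
    fi
}

# Usage (in pipeline_user2, after 'git pull' and 'dvc pull'):
#   check_dvc_hash data.txt
```

If the helper reports STALE right after 'dvc pull', user2 is in the state described above, and removing data.txt and .dvc/cache before pulling again is the workaround.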

Reproduce

echo "INFO: Creating data registry" 
mkdir dataregistry_bare 
git init --bare dataregistry_bare
git clone dataregistry_bare dataregistry
cd dataregistry
dvc init
rm -rf /tmp/dvc-storage-dataregistry
mkdir -p /tmp/dvc-storage-dataregistry
dvc remote add -d local /tmp/dvc-storage-dataregistry
echo "file from dataregistry">data.txt
dvc add data.txt
git add .
git commit -m 'added data.txt to dataregistry'
git push
dvc push
cd ..

echo "INFO: user1 creates the pipeline"
mkdir pipeline_bare
git init --bare pipeline_bare
git clone pipeline_bare pipeline_user1
cd pipeline_user1
dvc init
rm -rf /tmp/dvc-storage-pipeline
mkdir -p /tmp/dvc-storage-pipeline
dvc remote add -d local /tmp/dvc-storage-pipeline
dvc import ../dataregistry_bare data.txt
dvc stage add -n do_something -d data.txt echo "Doing something"
dvc repro
git add .
git commit -m 'repro pipeline with dep file from dataregistry'
git push
cd ..

echo "INFO: user2 clones the pipeline"
git clone pipeline_bare pipeline_user2
cd pipeline_user2
dvc pull
cat data.txt
echo "INFO: expecting 'file from dataregistry'"

echo "INFO: back to pipeline_user1 and manually change data.txt before dvc repro"
cd ../pipeline_user1
echo "changed file locally in pipeline">data.txt
dvc repro
git add .
git commit -m 'changed file locally in pipeline'
git push
dvc push

echo "INFO: switching back to pipeline_user2"
cd ../pipeline_user2
git pull
dvc pull
cat data.txt
echo 'INFO: expecting "file from dataregistry"'
echo "INFO: of course, local changes in pipeline_user1 to data.txt are not visible here"

echo "INFO: back to pipeline_user1 to fix the issue by adding data file locally to the pipeline"
cd ../pipeline_user1
rm data.txt.dvc
dvc add data.txt
git add .
git commit -m 'dvc add data.txt locally to the pipeline'
git push
dvc push

echo 'INFO: back to pipeline_user2'
cd ../pipeline_user2
git pull
dvc pull
cat data.txt
echo 'INFO: expected "changed file locally in pipeline"'

echo "INFO: Fix by removing data file and cache and new dvc pull"
rm -rf .dvc/cache data.txt
dvc pull
cat data.txt
echo 'INFO: expected "changed file locally in pipeline"'

Expected

The expected output of each step is given in the echo "INFO: " lines of the reproduce script.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.51.2 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.31
Subprojects:

Supports:
        azure (adlfs = 2024.4.1, knack = 0.11.0, azure-identity = 1.16.0),
        gdrive (pydrive2 = 1.19.0),
        gs (gcsfs = 2024.5.0),
        hdfs (fsspec = 2024.5.0, pyarrow = 16.1.0),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.5.0, boto3 = 1.34.106),
        ssh (sshfs = 2024.4.1),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2024.5.0)
Config:
        Global: /home/cd4tll/.config/dvc
        System: /etc/xdg/dvc

Additional Information (if any):

@shcheklein shcheklein added bug Did we break something? triage Needs to be triaged A: data-sync Related to dvc get/fetch/import/pull/push labels Jun 11, 2024
@dberenbaum dberenbaum added the p3-nice-to-have It should be done this or next sprint label Jun 12, 2024