DVC do not cache output of pipeline properly #10549

Open
imyhxy opened this issue Sep 5, 2024 · 11 comments
Labels
A: data-management (Related to dvc add/checkout/commit/move/remove), A: pipelines (Related to the pipelines feature), optimize (Optimizes DVC)

Comments

@imyhxy

imyhxy commented Sep 5, 2024

Bug Report

repro: doesn't cache output properly with reflink setup.

Description

I have 4 pipelines that transform the same input dataset for different tasks. The images are processed the same way, and cache.type is set to reflink. So, according to the documentation, there should be only one copy of the output images on disk. But this is not the case: none of the pipeline outputs are reflinked to the cached files.

If I run dvc checkout -R --relink after the pipeline has executed, disk usage behaves normally.

The output of btrfs fi du -s . right after repro:

     Total   Exclusive  Set shared  Filename
  90.50GiB    31.21GiB    29.64GiB  .

The output of btrfs fi du -s . right after dvc checkout -R --relink:

     Total   Exclusive  Set shared  Filename
  90.50GiB     1.07GiB    29.83GiB  .

Reproduce

Expected

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.55.2 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-6.7.10-060710-generic-x86_64-with-glibc2.39
Subprojects:
	
Supports:
	azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
	gdrive (pydrive2 = 1.20.0),
	gs (gcsfs = 2024.6.1),
	hdfs (fsspec = 2024.6.1, pyarrow = 17.0.0),
	http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
	oss (ossfs = 2023.12.0),
	s3 (s3fs = 2024.6.1, boto3 = 1.35.7),
	ssh (sshfs = 2024.6.0),
	webdav (webdav4 = 0.10.0),
	webdavs (webdav4 = 0.10.0),
	webhdfs (fsspec = 2024.6.1)
Config:
	Global: /home/fkwong/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/sda1
Caches: local
Remotes: None
Workspace directory: btrfs on /dev/sda1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/055b5579042ae6f272efc40fc232cbdd

Additional Information (if any):

@skshetry
Member

skshetry commented Sep 5, 2024

Can you try removing hardlink and symlink from the cache types config? You can remove the cache.type config entirely, since reflink,copy is the default.

It'd be great if you could debug, and see why it's not being reflinked by adding a breakpoint here in dvc-objects.

https://github.com/iterative/dvc-objects/blob/716dba66f1687162f12ec85b08959196709111e0/src/dvc_objects/fs/generic.py#L337

You can also try with the following snippets and see if they are getting reflinked:

from dvc.fs import LocalFileSystem

fs = LocalFileSystem()
fs.reflink("existing-file", "cloned-file")

@skshetry added the "awaiting response" label (we are waiting for your reply, please respond! :)) on Sep 5, 2024
@imyhxy
Author

imyhxy commented Sep 5, 2024

@skshetry I think I have nearly found the cause. When multiple pipelines create the same output, only the first one gets reflinked.

Here is a screenshot of 5 pipelines operating on a one-image dataset. I used filefrag to check the status of each output image. You can see that only the first file is marked as shared.

[Screenshot: filefrag output for the five output images]

@imyhxy
Author

imyhxy commented Sep 5, 2024

I was able to create a minimal reproduction project.

  1. A clean project.

          Total   Exclusive  Set shared  Filename
        1.83MiB     1.83MiB       0.00B  .

  2. Add the raw image: dvc add data/raw/input/testing.jpg

          Total   Exclusive  Set shared  Filename
        3.64MiB    16.00KiB     1.81MiB  .

  3. Run the pipeline for the first time: dvc repro

          Total   Exclusive  Set shared  Filename
       12.70MiB     9.08MiB     1.81MiB  .

  4. Check with filefrag:

    Filesystem type is: 9123683e
    File size of data/prepared/myrepo-one/testing.jpg is 1897127 (464 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     463:  640195273.. 640195736:    464:             last,eof
    data/prepared/myrepo-one/testing.jpg: 1 extent found
    Filesystem type is: 9123683e
    File size of data/prepared/myrepo-two/testing.jpg is 1897127 (464 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     463:  640317096.. 640317559:    464:             last,eof
    data/prepared/myrepo-two/testing.jpg: 1 extent found
    Filesystem type is: 9123683e
    File size of data/prepared/myrepo-three/testing.jpg is 1897127 (464 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     463:  640213630.. 640214093:    464:             last,eof
    data/prepared/myrepo-three/testing.jpg: 1 extent found
    Filesystem type is: 9123683e
    File size of data/prepared/myrepo-four/testing.jpg is 1897127 (464 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     463:  640287558.. 640288021:    464:             last,eof
    data/prepared/myrepo-four/testing.jpg: 1 extent found
    Filesystem type is: 9123683e
    File size of data/prepared/myrepo-five/testing.jpg is 1897127 (464 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     463:  640195737.. 640196200:    464:             last,eof
    data/prepared/myrepo-five/testing.jpg: 1 extent found
  5. Run dvc checkout -R --relink:

          Total   Exclusive  Set shared  Filename
       12.70MiB    16.00KiB     1.81MiB  .
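
The filefrag check from step 4 can be scripted rather than read by eye. The following is a minimal sketch, assuming the `filefrag -v` layout shown above (extent index first, flags in the last colon-separated field) and that reflinked extents show up with a `shared` flag. The helper names `extent_flags`, `is_fully_shared`, and `check` are made up for this illustration.

```python
import subprocess


def extent_flags(filefrag_output: str) -> list[set[str]]:
    """Return the flag set of every extent row in `filefrag -v` output."""
    flags = []
    for line in filefrag_output.splitlines():
        fields = line.split(":")
        # Extent rows start with a numeric extent index and end with the
        # flags field; headers and summary lines do not match this shape.
        if len(fields) < 5 or not fields[0].strip().isdigit():
            continue
        flags.append({f.strip() for f in fields[-1].split(",") if f.strip()})
    return flags


def is_fully_shared(filefrag_output: str) -> bool:
    """True when the file has extents and all of them carry `shared`."""
    flags = extent_flags(filefrag_output)
    return bool(flags) and all("shared" in f for f in flags)


def check(path: str) -> bool:
    """Run `filefrag -v` on `path` (requires e2fsprogs) and inspect it."""
    out = subprocess.run(
        ["filefrag", "-v", path], check=True, capture_output=True, text=True
    ).stdout
    return is_fully_shared(out)
```

For the five outputs listed in step 4, `is_fully_shared` would return False for each, since their only flags are `last` and `eof`.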

dvc_testing2.zip

@imyhxy
Author

imyhxy commented Sep 5, 2024

The first file in the screenshot above is shared while none of the files in the reproduction project are, because the pipeline in the screenshot changed the input image. So we can conclude that if an output file is already in the DVC cache, DVC won't create a reflink from the cached version to the workspace version.

@skshetry
Member

skshetry commented Sep 5, 2024

I think this is due to a relink optimization that I did recently for checkout (which is used during repro): iterative/dvc-data#548.

DVC looks at the file in the workspace, and tries to determine if it needs to relink based on cache-types. So, for example, if a file is not a symlink, and you have cache_type = symlink set, it'll have to relink via symlink.

But, DVC does not have a way to determine if a file should be reflinked or not. So, it leaves it as-is in the workspace, which saves us from doing checkout which can be expensive.
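
To make the distinction concrete, here is a rough sketch of why symlinks and hardlinks can be detected from file metadata while a reflink cannot. It is not DVC's actual code: `needs_relink` is a hypothetical helper, and treating the first configured cache type as the preferred one is an assumption made for this illustration.

```python
import os
import stat


def needs_relink(path: str, cache_types: list[str]) -> bool:
    """Guess from lstat() alone whether a workspace file must be relinked."""
    preferred = cache_types[0]  # assumption: first entry is the preferred type
    st = os.lstat(path)  # lstat so that a symlink itself is inspected
    is_symlink = stat.S_ISLNK(st.st_mode)
    is_hardlink = not is_symlink and st.st_nlink > 1
    if preferred == "symlink":
        return not is_symlink
    if preferred == "hardlink":
        return not is_hardlink
    # "reflink" or "copy": a reflinked file and a plain copy are both regular
    # files with a link count of 1, so metadata cannot tell them apart. Only
    # an obvious mismatch (a symlink or hardlink) can be flagged here.
    return is_symlink or is_hardlink
```

With `cache_types=["reflink", "copy"]`, a plain copy in the workspace returns False, which corresponds to the file being left as-is.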

If you are worried about storage, I think dvc checkout --relink is a correct fix.

@imyhxy
Author

imyhxy commented Sep 5, 2024

filefrag is able to check whether a file is reflinked or not.

@imyhxy
Author

imyhxy commented Sep 5, 2024

I have a solution. When cache.type = reflink and the workspace version equals the cached version, DVC performs the checkout with a touch: it creates a reflink from the cache to the workspace and updates the timestamp of the cached file. That way, once a file has been reflinked, the timestamp of the cached version is never older than the workspace one. The pseudocode would be:

if cache.type == reflink:
    if md5sum of cached file == md5sum of workspace file:
        if timestamp of cached file older than workspace file:
            create reflink from cache to workspace and touch the cache to update timestamp
        else:
            do nothing
    else:
        create reflink from cache to workspace and touch the cache to update timestamp
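
The pseudocode above can be tried out with a small runnable sketch. Everything here is illustrative rather than DVC code: `checkout_with_touch` and `md5sum` are hypothetical names, and `shutil.copyfile` is only a portable stand-in for a real reflink call, which can be passed in through the `link` parameter.

```python
import hashlib
import os
import shutil
import time


def md5sum(path: str) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkout_with_touch(cache: str, workspace: str, link=shutil.copyfile) -> bool:
    """Relink `workspace` from `cache` unless the cache is already newer.

    Returns True when a relink happened, False when nothing was done.
    """
    same = os.path.exists(workspace) and md5sum(cache) == md5sum(workspace)
    if same and os.stat(cache).st_mtime_ns >= os.stat(workspace).st_mtime_ns:
        return False  # cache was touched after the last relink: do nothing
    if os.path.lexists(workspace):
        os.remove(workspace)
    link(cache, workspace)
    # "Touch" the cache so it ends up no older than the new workspace file.
    stamp = max(time.time_ns(), os.stat(workspace).st_mtime_ns)
    os.utime(cache, ns=(stamp, stamp))
    return True
```

A first call on an identical but newer workspace file relinks it; a second call right after returns False because the cached file is no longer older.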

@imyhxy
Author

imyhxy commented Sep 5, 2024

Besides, I don't think this issue can be ignored. Even when no two pipelines generate the same output, if a user updates an existing pipeline to generate a new output in which most of the files are the same as those in the cache, all those files will be duplicated between the cache and the workspace.

@skshetry
Member

skshetry commented Sep 5, 2024

I may be open to adding some config option to force a relink. Any thoughts @dberenbaum, @shcheklein?

@shcheklein added the bug, A: pipelines, optimize, and A: data-management labels and removed the awaiting response and bug labels on Sep 7, 2024
@shcheklein
Member

Just to clarify and better understand things first, folks, a few questions:

"filefrag is able to check whether a file is reflinked or not."

Do we know how it does this? Is it FS-specific, or is there a general syscall that can do this? Is it expensive or not?

@skshetry if we had an isreflink call, would that help? (I assume it would, right?)

"So, it leaves it as-is in the workspace, which saves us from doing checkout which can be expensive."

Could you clarify a bit: is it expensive because we would do a full output checkout (all files), since we can't detect the difference?

We still traverse and check the link type, right? For reflinks specifically, would it be the same or less expensive to force a relink right away without doing those checks?

@imyhxy
Author

imyhxy commented Sep 7, 2024

FYI, https://github.com/tytso/e2fsprogs/blob/950a0d69c82b585aba30118f01bf80151deffe8c/misc/filefrag.c#L269 is the line where filefrag gets the file flags.
