Import ignore cache #10657

Open
konstantin-frolov opened this issue Dec 20, 2024 · 1 comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? p1-important Important, aka current backlog of things to do


konstantin-frolov commented Dec 20, 2024

Bug Report

DVC 3.56: import ignores the cache

Description

I have a local DVC repo with JSON annotation files, each added individually, and a large data storage directory with thousands of image files, added as a whole folder.
I use symlinks for the cache.
But importing the data storage into another repo does not simply create symlinks to the images. DVC downloads the files first, then links them.

Reproduce

Local data repo

Config

cache.type=symlink
core.autostage=true

Local storage dirs:

annotations/
     master_annotation.json
     train_annotation.json
     test_annotations.json
data_storage/
     image_0
     image_1
     ...
     image_N

Commands run in the local data repo:

dvc add ./annotations/*
dvc add ./data_storage

Project repo

Config

cache.type=symlink
cache.dir=path/to/local/data/repo/.dvc/cache
core.autostage=true

Commands:

dvc import path/to/local/data/repo data_storage

This command starts downloading (copying) the files from the cache.

dvc import path/to/local/data/repo data_storage --no-download

This checks the import and creates the data_storage.dvc file in the project repo, but

dvc checkout data_storage.dvc

or

dvc checkout data_storage.dvc --relink

starts downloading the files again.

Expected

I think DVC should create symlinks to the files without downloading the originals.
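
One way to check the final state of the workspace against this expectation is to count how each file under data_storage is materialized. The following is a minimal sketch in plain Python, not part of DVC; the function name `summarize_links` and its arguments are hypothetical, and it assumes Python 3.9+ (for `Path.is_relative_to`). It only inspects the end result, so it cannot tell whether files were copied before being linked.

```python
import os
from pathlib import Path


def summarize_links(workspace_dir, cache_dir):
    """Count files under workspace_dir by how they are materialized.

    Hypothetical helper: it does not call DVC, it only inspects the filesystem.
    """
    cache_root = Path(cache_dir).resolve()
    counts = {"symlink_to_cache": 0, "symlink_elsewhere": 0, "regular_file": 0}
    # os.walk lists symlinks that point at files in `files`; symlinks that
    # point at directories are listed in `dirs` and are not counted here.
    for root, _dirs, files in os.walk(workspace_dir):
        for name in files:
            path = Path(root) / name
            if path.is_symlink():
                # Resolve the link and check whether it lands inside the cache.
                if path.resolve().is_relative_to(cache_root):
                    counts["symlink_to_cache"] += 1
                else:
                    counts["symlink_elsewhere"] += 1
            else:
                counts["regular_file"] += 1
    return counts
```

Called as `summarize_links("data_storage", "path/to/local/data/repo/.dvc/cache")` in the project repo, a fully linked import would be expected to report zero regular files.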

Environment information

Output of dvc doctor in local data repo:

-------------------------
Platform: Python 3.12.7 on Linux-5.15.0-86-generic-x86_64-with-glibc2.31
Subprojects:

Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.19.0),
        gdrive (pydrive2 = 1.21.1),
        gs (gcsfs = 2024.10.0),
        hdfs (fsspec = 2024.10.0, pyarrow = 18.0.0),
        http (aiohttp = 3.10.10, aiohttp-retry = 2.9.0),
        https (aiohttp = 3.10.10, aiohttp-retry = 2.9.0),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.10.0, boto3 = 1.35.36),
        ssh (sshfs = 2024.9.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.10.0)
Config:
        Global: /home/user/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: nfs on ip-addr:/storage/
Caches: local
Remotes: None
Workspace directory: nfs on ip-addr:/storage/
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/76de345055c7e5635fd954ee44e5d4e2

Output of dvc doctor in project repo:

DVC version: 3.56.0 (deb)
-------------------------
Platform: Python 3.12.7 on Linux-5.15.0-86-generic-x86_64-with-glibc2.31
Subprojects:

Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.19.0),
        gdrive (pydrive2 = 1.21.1),
        gs (gcsfs = 2024.10.0),
        hdfs (fsspec = 2024.10.0, pyarrow = 18.0.0),
        http (aiohttp = 3.10.10, aiohttp-retry = 2.9.0),
        https (aiohttp = 3.10.10, aiohttp-retry = 2.9.0),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.10.0, boto3 = 1.35.36),
        ssh (sshfs = 2024.9.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.10.0)
Config:
        Global: /home/user/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: nfs on ip-addr:/storage/
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/f967073321531b0cc07fba234dd73d7b
@shcheklein
Member

Yes, I can confirm that it first copies the files from the cache, then removes them and replaces them with links. This can be problematic with a shared NAS cache, where we want to work with links alone (we don't want to, or can't, have the data on disk).
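
To make the difference concrete, here is a toy model of the two orderings in plain Python. It is not DVC's implementation; the function names are hypothetical and exist only for illustration. Both functions end with the same symlink in the workspace, but the first one transiently writes a full copy of the data to the workspace disk, which is the step that hurts with a shared NAS cache.

```python
import os
import shutil


def copy_then_relink(cache_file, dest):
    """Model of the observed order: copy, delete the copy, then link.

    Returns the number of bytes transiently written to the workspace.
    """
    shutil.copyfile(cache_file, dest)  # full copy lands in the workspace
    written = os.path.getsize(dest)
    os.remove(dest)                    # the copy is discarded
    os.symlink(cache_file, dest)       # final state: a symlink to the cache
    return written


def link_directly(cache_file, dest):
    """Model of the requested order: only a symlink is ever created.

    Returns the number of bytes transiently written to the workspace (none).
    """
    os.symlink(cache_file, dest)
    return 0
```

For a cache file of N bytes, the first function writes N bytes before linking and the second writes none, while the resulting workspace entries are indistinguishable afterwards.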

@skshetry do you remember if this is expected behavior or a regression?

@shcheklein shcheklein added bug Did we break something? p1-important Important, aka current backlog of things to do A: data-sync Related to dvc get/fetch/import/pull/push labels Dec 22, 2024