
[Bug] FileLock dependency incompatible with filesystem #329

Closed
jarednielsen opened this issue Jun 30, 2020 · 11 comments

@jarednielsen
Contributor

I'm downloading a dataset successfully with
load_dataset("wikitext", "wikitext-2-raw-v1")

But when I attempt to cache it on an external volume, it hangs indefinitely:
load_dataset("wikitext", "wikitext-2-raw-v1", cache_dir="/fsx") # /fsx is an external volume mount

The filesystem when hanging looks like this:

/fsx
----downloads
       ----94be...73.lock
----wikitext
       ----wikitext-2-raw
             ----wikitext-2-raw-1.0.0.incomplete

It appears that on this filesystem, the FileLock object is forever stuck in its "acquire" stage. I have verified that the issue lies specifically with the filelock dependency:

open("/fsx/hello.txt", "w").write("hello")  # succeeds

from filelock import FileLock
with FileLock("/fsx/hello.lock"):
    open("/fsx/hello.txt", "w").write("hello")  # hangs indefinitely

Has anyone else run into this issue? I'd raise it directly on the FileLock repo, but that project appears abandoned, with the last update over a year ago. Alternatively, if there's a solution that would remove the FileLock dependency from this project, I would appreciate that.

@thomwolf
Member

Hi, can you give details on your environment/os/packages versions/etc?

@jarednielsen
Contributor Author

jarednielsen commented Jun 30, 2020

Environment is Ubuntu 18.04, Python 3.7.5, nlp==0.3.0, filelock==3.0.12.

The external volume is Amazon FSx for Lustre, which by default creates files with limited permissions. My working theory is that FileLock creates a lock file that isn't writable, so the lock can never be acquired, and removing the .lock file doesn't help. Yet Python is able to create new files and write to them outside of the FileLock package.

When I attempt to use FileLock within a Docker container by writing to /root/.cache/hello.txt, it succeeds. So there's some permissions issue. But it's not a Docker configuration issue; I've replicated it without Docker.

echo "hello world" >> hello.txt
ls -l

-rw-rw-r-- 1 ubuntu ubuntu 10 Jun 30 19:52 hello.txt

@jarednielsen
Contributor Author

jarednielsen commented Jun 30, 2020

Looks like the flock syscall does not work on Lustre filesystems by default: tox-dev/filelock#67.

I added the -o flock option when mounting the filesystem, as described here, which fixed the issue.
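For anyone landing here, the remount looks roughly like this (a sketch for an FSx for Lustre client; the DNS name and mount name are placeholders for your filesystem, and root access on the client is required):

```shell
# Unmount, then remount the Lustre client with flock enabled.
# <fs-dns-name> and <mountname> are placeholders.
sudo umount /fsx
sudo mount -t lustre -o noatime,flock <fs-dns-name>@tcp:/<mountname> /fsx
```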

@thomwolf
Member

thomwolf commented Jul 1, 2020

Awesome, thanks a lot for sharing your fix!

@orm011

orm011 commented Jun 22, 2022

I'm wondering if this can be revisited. In some managed environments the person using HF cannot change the filesystem mount flags (and the organization may be unwilling to change them due to other concerns), but can guarantee there will be no concurrent writes, for example because HF is offline and the models/datasets were downloaded earlier.

The real fix would be in FileLock itself, which does not seem very active and does not handle failed flock system calls; handling that failure would be one way to fix this, as mentioned in the issue below, also raised by @jarednielsen:

tox-dev/filelock#67

@jgbos

jgbos commented Sep 8, 2022

I'm wondering if this can be revisited. In some managed environments the person using HF cannot change the filesystem mount flags (and the organization may be unwilling to change them due to other concerns), but can guarantee there will be no concurrent writes, for example because HF is offline and the models/datasets were downloaded earlier.

I am one of those users. Is there a workaround for this?

@orm011

orm011 commented Sep 8, 2022

The machines I use have a shared FS with the filelock problem, as well as a local one that does not have it. Using some env vars (HF_HOME, which controls both models and datasets, and HF_DATASETS_OFFLINE), one can influence where the downloads happen and whether the locks get taken, for both the transformers and datasets libraries. Some of the relevant documentation is here: https://huggingface.co/docs/transformers/installation#cache-setup. I do end up using different settings when I download the models and when I use them, and have to rsync the models to the local filesystem using a separate script.
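The two-phase workflow described above might look roughly like this (a sketch; all paths are examples, download_data.py stands in for whatever script calls load_dataset, and whether offline mode avoids the locks may depend on the library version):

```shell
# Phase 1: download into a cache on a filesystem where flock works.
export HF_HOME=/local/scratch/hf
python download_data.py

# Phase 2: sync the cache to where the jobs read it, then run offline.
rsync -a /local/scratch/hf/ /path/where/jobs/read/hf/
export HF_HOME=/path/where/jobs/read/hf
export HF_DATASETS_OFFLINE=1
python train.py
```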

@jgbos

jgbos commented Sep 8, 2022

Thanks @orm011. These filesystems are such a pain. I'll dig around; it looks like setting cache_dir to a non-Lustre filesystem works for transformers but not for datasets.

@orm011

orm011 commented Sep 8, 2022

Note that I export HF_HOME= in the shell prior to running Python, and I change no other variables. (I do not use the cache_dir argument; I think I ran into similar issues with it. Nor do I use HF_DATASETS_CACHE, though maybe that works, or maybe you can set it in Python prior to importing the library.) Then datasets.load_dataset() works without any additional flags; the datasets go into HF_HOME/datasets/ and the models into HF_HOME/transformers/ (and the lock files are all there as well).
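Setting the variables from Python before the import, as suggested above, would look like this (a sketch; the path is an example, not a recommendation):

```python
import os

# Must be set before importing datasets/transformers so the cache
# location is picked up when the library initializes.
os.environ["HF_HOME"] = "/local/scratch/hf"   # a non-Lustre path
os.environ["HF_DATASETS_OFFLINE"] = "1"       # optional: avoid network access

# import datasets  # safe to import only after the variables are set
```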

@amanikiruga

I am using a shared cluster with a Lustre system that I can't change. I am unable to download or load datasets onto the filesystem because of the file lock. @thomwolf can this issue be reopened?

@natalialanzoni

I am using a shared cluster with a Lustre system that I can't change. I am unable to download or load datasets onto the filesystem because of the file lock. @thomwolf can this issue be reopened?

Hi, I am having this issue as well. Has there been a solution for this? Thanks!
