
[Bug] FileLock dependency incompatible with filesystem #329

Closed
jarednielsen opened this issue Jun 30, 2020 · 11 comments

@jarednielsen
Contributor

I'm downloading a dataset successfully with
load_dataset("wikitext", "wikitext-2-raw-v1")

But when I attempt to cache it on an external volume, it hangs indefinitely:
load_dataset("wikitext", "wikitext-2-raw-v1", cache_dir="/fsx") # /fsx is an external volume mount

The filesystem when hanging looks like this:

/fsx
----downloads
       ----94be...73.lock
----wikitext
       ----wikitext-2-raw
             ----wikitext-2-raw-1.0.0.incomplete

It appears that on this filesystem, the FileLock object is forever stuck in its "acquire" stage. I have verified that the issue lies specifically with the filelock dependency:

open("/fsx/hello.txt", "w").write("hello")  # succeeds

from filelock import FileLock
with FileLock("/fsx/hello.lock"):
    open("/fsx/hello.txt", "w").write("hello")  # hangs indefinitely

Has anyone else run into this issue? I'd raise it directly on the FileLock repo, but that project appears abandoned, with the last update over a year ago. Alternatively, if there's a solution that would remove the FileLock dependency from this project, I would appreciate that.

@thomwolf
Member

Hi, can you give details on your environment/os/packages versions/etc?

@jarednielsen
Contributor Author

jarednielsen commented Jun 30, 2020

Environment is Ubuntu 18.04, Python 3.7.5, nlp==0.3.0, filelock==3.0.12.

The external volume is Amazon FSx for Lustre, which by default creates files with limited permissions. My working theory is that FileLock creates a lock file that isn't writable, so the lock can never be acquired, and removing the .lock file doesn't help. Yet Python is able to create new files and write to them outside of the FileLock package.

When I attempt to use FileLock within a Docker container by writing to /root/.cache/hello.txt, it succeeds. So there's some permissions issue. But it's not a Docker configuration issue; I've replicated it without Docker.

echo "hello world" >> hello.txt
ls -l

-rw-rw-r-- 1 ubuntu ubuntu 10 Jun 30 19:52 hello.txt

@jarednielsen
Contributor Author

jarednielsen commented Jun 30, 2020

Looks like the flock syscall does not work on Lustre filesystems by default: tox-dev/filelock#67.

I added the -o flock option when mounting the filesystem, as described here, which fixed the issue.
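For anyone landing here, the remount looks roughly like this (a sketch for an FSx for Lustre client; the DNS name and mount name are placeholders for your filesystem, and root access on the client is required):

```shell
# Unmount, then remount the Lustre client with flock enabled.
# <fs-dns-name> and <mountname> are placeholders.
sudo umount /fsx
sudo mount -t lustre -o noatime,flock <fs-dns-name>@tcp:/<mountname> /fsx
```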

@thomwolf
Member

thomwolf commented Jul 1, 2020

Awesome, thanks a lot for sharing your fix!

@orm011

orm011 commented Jun 22, 2022

I'm wondering if this can be revisited. In some managed environments the person using HF cannot change the filesystem mount flags (and the organization may be unwilling to change them due to other concerns), but can guarantee there will be no concurrent writes, for example because HF is offline and the models/datasets were downloaded earlier.

The real fix would be in FileLock itself, which does not seem very active and does not handle failed flock system calls; handling that failure would be one way to fix this, as mentioned in the issue below, also raised by @jarednielsen:

tox-dev/filelock#67

@jgbos

jgbos commented Sep 8, 2022

I'm wondering if this can be revisited. In some managed environments the person using HF cannot change the filesystem mount flags (and the organization may be unwilling to change them due to other concerns), but can guarantee there will be no concurrent writes, for example because HF is offline and the models/datasets were downloaded earlier.

I am one of those users. Is there a workaround for this?

@orm011

orm011 commented Sep 8, 2022

The machines I use have a shared FS with the filelock problem, as well as a local one that does not have it. Using some env vars (HF_HOME, which controls both models and datasets, and HF_DATASETS_OFFLINE), one can influence where the downloads happen and whether the locks get taken, for both the transformers and datasets libraries. Some of the relevant documentation is here: https://huggingface.co/docs/transformers/installation#cache-setup. I do end up using different settings when I download the models and when I use them, and have to rsync the models to the local filesystem using a separate script.
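The two-phase workflow described above might look roughly like this (a sketch; all paths are examples, download_data.py stands in for whatever script calls load_dataset, and whether offline mode avoids the locks may depend on the library version):

```shell
# Phase 1: download into a cache on a filesystem where flock works.
export HF_HOME=/local/scratch/hf
python download_data.py

# Phase 2: sync the cache to where the jobs read it, then run offline.
rsync -a /local/scratch/hf/ /path/where/jobs/read/hf/
export HF_HOME=/path/where/jobs/read/hf
export HF_DATASETS_OFFLINE=1
python train.py
```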

@jgbos

jgbos commented Sep 8, 2022

Thanks @orm011. These filesystems are such a pain. I'll dig around; it looks like setting cache_dir to a non-Lustre filesystem works for transformers but not for datasets.

@orm011

orm011 commented Sep 8, 2022

Note that I export HF_HOME= in the shell prior to running Python, and I change no other variables. (I do not use the cache_dir argument; I think I ran into similar issues with it. Nor do I use HF_DATASETS_CACHE, though maybe that works, or maybe you can set it in Python prior to importing the library.) Then datasets.load_dataset() works without any additional flags; the datasets go into HF_HOME/datasets/ and the models into HF_HOME/transformers/ (and the lock files are all there as well).
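Setting the variables from Python before the import, as suggested above, would look like this (a sketch; the path is an example, not a recommendation):

```python
import os

# Must be set before importing datasets/transformers so the cache
# location is picked up when the library initializes.
os.environ["HF_HOME"] = "/local/scratch/hf"   # a non-Lustre path
os.environ["HF_DATASETS_OFFLINE"] = "1"       # optional: avoid network access

# import datasets  # safe to import only after the variables are set
```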

@amanikiruga

I am using a shared cluster with a Lustre system that I can't change. I am unable to download or load datasets onto the filesystem because of the file lock. @thomwolf can this issue be reopened?

@natalialanzoni

I am using a shared cluster with a Lustre system that I can't change. I am unable to download or load datasets onto the filesystem because of the file lock. @thomwolf can this issue be reopened?

Hi, I am having this issue as well. Has there been a solution for this? Thanks!
