Mixed permissions of blobs/locks in a multi-user Hub cache #2580

Open
jantrienes opened this issue Sep 30, 2024 · 3 comments
Labels
bug Something isn't working

Comments

jantrienes commented Sep 30, 2024

Describe the bug

We would like to share models across users. To this end, we configured HF_HUB_CACHE which worked great for a while! However, we started to run into PermissionError related to files in .locks.

The problem seems to be mixed group permissions on the files in .locks. I'm attaching the artifact list for this model below, but we see the problem for other models, too. The output of umask is 0002 for all users of the system.
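One possible explanation, sketched here under assumptions: the os.open(self.lock_file, open_flags, self._context.mode) call in the traceback below passes an explicit mode, and a umask can only remove permission bits from that mode, never add them. If the requested mode were 0o644 (an assumption; the mode actually used by the installed filelock version is not shown in this issue), a umask of 0002 would still produce -rw-r--r--. The helper name create_with_mode is made up for this illustration.

```python
import os
import stat
import tempfile


def create_with_mode(path: str, mode: int, umask: int) -> int:
    """Create `path` with os.open(mode) under `umask`; return the resulting permission bits."""
    previous = os.umask(umask)
    try:
        fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, mode)
        os.close(fd)
    finally:
        os.umask(previous)  # always restore the caller's umask
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as tmp:
    # Assumed lock-file mode: group write is absent no matter how permissive the umask is.
    strict = create_with_mode(os.path.join(tmp, "a.lock"), 0o644, 0o002)
    # A group-writable mode survives the same umask.
    shared = create_with_mode(os.path.join(tmp, "b.lock"), 0o664, 0o002)
    print(oct(strict), oct(shared))
```

If this is what happens, a second user in the same group can read such a lock file but cannot open it for writing, which would match the PermissionError raised from os.open.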

Questions:

  1. Is setting HF_HUB_CACHE sufficient for sharing hub cache across users?
  2. If I understand correctly, the lock files should be released after use. However, FileLock does not actually delete them, which may explain the problem we are facing. The relevant logic seems to be here:

     try:
         return lock.release()
     except OSError:
         try:
             Path(lock_file).unlink()
         except OSError:
             pass

A workaround would be to delete the files in .locks, but not all users have permission to do that, and asking each individual user to delete their own files is tedious. So I'm curious to hear your thoughts on this scenario. Thanks!
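To illustrate why the lock files pile up, here is a minimal sketch using plain fcntl.flock. It is not filelock's code; it only assumes that the Unix backend (filelock/_unix.py in the traceback below) behaves similarly, that is, releasing the lock unlocks and closes the file descriptor but leaves the file on disk. The function name lock_and_release is hypothetical. Unix only.

```python
import fcntl
import os
import tempfile


def lock_and_release(lock_path: str) -> bool:
    """Take and release an flock-based lock; report whether the lock file still exists."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # ... the protected download would happen here ...
        fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
    # Nothing above unlinks the file, so it outlives the lock.
    return os.path.exists(lock_path)


with tempfile.TemporaryDirectory() as tmp:
    still_there = lock_and_release(os.path.join(tmp, "demo.lock"))
    print(still_there)
```

Under that assumption, the unlink in the snippet above would only run when release() raises OSError, so in the normal case the file keeps the owner and mode of whoever created it first.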

Reproduction

No response

Logs

Here is a full stack trace and a list of the artifacts with permission mismatch.
$ python -c "import transformers; transformers.AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')"
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/trienes/.conda/envs/test/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 844, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/trienes/.conda/envs/test/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 676, in get_tokenizer_config
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/home/trienes/.conda/envs/test/lib/python3.12/site-packages/transformers/utils/hub.py", line 403, in cached_file
    resolved_file = hf_hub_download(
                    ^^^^^^^^^^^^^^^^
  File "/home/trienes/.conda/envs/test/lib/python3.12/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/trienes/.conda/envs/test/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/trienes/.conda/envs/test/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1232, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/trienes/.conda/envs/test/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1380, in _hf_hub_download_to_cache_dir
    with WeakFileLock(lock_path):
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/trienes/.conda/envs/test/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/home/trienes/.conda/envs/test/lib/python3.12/site-packages/huggingface_hub/utils/_fixes.py", line 98, in WeakFileLock
    lock.acquire()
  File "/home/trienes/.local/lib/python3.12/site-packages/filelock/_api.py", line 295, in acquire
    self._acquire()
  File "/home/trienes/.local/lib/python3.12/site-packages/filelock/_unix.py", line 42, in _acquire
    fd = os.open(self.lock_file, open_flags, self._context.mode)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/scratch_shared/ag_seifertg/.cache/huggingface/hub/.locks/models--meta-llama--Meta-Llama-3.1-8B-Instruct/db88166e2bc4c799fd5d1ae643b75e84d03ee70e.lock'

"blob" files get group read-write:

ls -la $HF_HUB_CACHE/models--meta-llama--Meta-Llama-3.1-8B-Instruct/blobs/
total 15672361
drwxrwsr-x 2 trienes ag_seifertg          9 Sep 23 17:11 .
drwxrwsr-x 6 trienes ag_seifertg          6 Sep 30 16:30 ..
-rw-rw-r-- 1 trienes ag_seifertg 4999802720 Sep 23 17:10 09d433f650646834a83c580877bd60c6d1f88f7755305c12576b5c7058f9af15
-rw-rw-r-- 1 trienes ag_seifertg        855 Sep 23 16:55 0bb6fd75b3ad2fe988565929f329945262c2814e
-rw-rw-r-- 1 trienes ag_seifertg      23950 Sep 23 17:08 0fd8120f1c6acddc268ebc2583058efaf699a771
-rw-rw-r-- 1 trienes ag_seifertg 4976698672 Sep 23 17:09 2b1879f356aed350030bb40eb45ad362c89d9891096f79a3ab323d3ba5607668
-rw-rw-r-- 1 trienes ag_seifertg 1168138808 Sep 23 17:10 92ecfe1a2414458b4821ac8c13cf8cb70aed66b5eea8dc5ad9eeb4ff309d6d7b
-rw-rw-r-- 1 trienes ag_seifertg        184 Sep 23 17:11 cc7276afd599de091142c6ed3005faf8a74aa257
-rw-rw-r-- 1 trienes ag_seifertg 4915916176 Sep 23 17:10 fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa

The ".locks" files, however, do not all get the same set of permissions:

ls -la $HF_HUB_CACHE/.locks/models--meta-llama--Meta-Llama-3.1-8B-Instruct/
total 96
drwxrwsr-x  2 trienes  ag_seifertg 13 Aug  5 15:36 .
drwxrwsr-x 50 trienes  ag_seifertg 50 Sep 30 17:10 ..
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:32 02ee80b6196926a5ad790a004d9efd6ab1ba6542.lock
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:30 09d433f650646834a83c580877bd60c6d1f88f7755305c12576b5c7058f9af15.lock
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:29 0bb6fd75b3ad2fe988565929f329945262c2814e.lock
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:29 0fd8120f1c6acddc268ebc2583058efaf699a771.lock
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:29 2b1879f356aed350030bb40eb45ad362c89d9891096f79a3ab323d3ba5607668.lock
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:32 421cda369d1e01e742b01d82e3a39c7cc82a8586.lock
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:32 5cc5f00a5b203e90a27a3bd60d1ec393b07971e8.lock
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:32 92ecfe1a2414458b4821ac8c13cf8cb70aed66b5eea8dc5ad9eeb4ff309d6d7b.lock
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:32 cc7276afd599de091142c6ed3005faf8a74aa257.lock
-rw-r--r--  1 derzhana ag_seifertg  0 Aug  5 15:36 db88166e2bc4c799fd5d1ae643b75e84d03ee70e.lock   <-- here is the conflicting file
-rw-rw-r--  1 trienes  ag_seifertg  0 Jul 25 10:31 fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.lock
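
Rather than eyeballing ls output for every model, the check can be scripted. This is a rough sketch, not part of huggingface_hub; the function name find_non_group_writable is made up, and the demo runs on a temporary directory instead of the real cache.

```python
import os
import stat
import tempfile


def find_non_group_writable(locks_dir: str) -> list[str]:
    """Return sorted paths of *.lock files under locks_dir that lack the group-write bit."""
    offenders = []
    for root, _dirs, files in os.walk(locks_dir):
        for name in files:
            if not name.endswith(".lock"):
                continue
            path = os.path.join(root, name)
            if not os.stat(path).st_mode & stat.S_IWGRP:
                offenders.append(path)
    return sorted(offenders)


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the situation from the listing: one shared lock, one owner-only lock.
    for name, mode in (("good.lock", 0o664), ("bad.lock", 0o644)):
        path = os.path.join(tmp, name)
        open(path, "w").close()
        os.chmod(path, mode)
    found = [os.path.basename(p) for p in find_non_group_writable(tmp)]
    print(found)
```

On the real system the argument would be the .locks directory inside HF_HUB_CACHE, for example os.path.join(os.environ["HF_HUB_CACHE"], ".locks").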

System info

- huggingface_hub version: 0.25.1
- Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.28
- Python version: 3.12.6
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /home/trienes/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: jantrienes
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: N/A
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 2.1.1
- pydantic: N/A
- aiohttp: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /scratch_shared/ag_seifertg/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/trienes/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/trienes/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
jantrienes added the bug label Sep 30, 2024

catwell commented Dec 20, 2024

👋 I have the same issue, has anyone looked at it / solved it?

jantrienes commented Dec 20, 2024

We ended up creating a cron job for this.

# Ensure Hugging Face cache files are group-writable
*/10 * * * * root chmod -R g+rwxs [HF_HUB_CACHE] >> /var/log/cron 2>&1
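
For anyone who prefers something narrower than chmod -R g+rwxs (which also sets the execute and setgid bits on regular files), here is a hedged Python sketch of the same idea. The function name make_group_shared is hypothetical and nothing here comes from huggingface_hub. It adds g+rw to regular files and g+rwx plus setgid to directories below the given path, skips symlinks, and skips entries the caller does not own (only the owner or root can chmod), so it would still need to run as root or once per user.

```python
import os
import stat
import tempfile


def make_group_shared(cache_dir: str) -> int:
    """Add g+rw to files and g+rwx plus setgid to directories below cache_dir; return entries changed."""
    changed = 0
    uid = os.geteuid()
    for root, dirs, files in os.walk(cache_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if stat.S_ISLNK(st.st_mode):
                continue  # do not chmod through symlinks
            if uid != 0 and st.st_uid != uid:
                continue  # chmod would fail on files owned by someone else
            current = stat.S_IMODE(st.st_mode)
            if stat.S_ISDIR(st.st_mode):
                wanted = current | stat.S_IRWXG | stat.S_ISGID
            else:
                wanted = current | stat.S_IRGRP | stat.S_IWGRP
            if wanted != current:
                os.chmod(path, wanted)
                changed += 1
    return changed


with tempfile.TemporaryDirectory() as tmp:
    # Small stand-in for the cache: one directory and one owner-only lock file.
    sub = os.path.join(tmp, ".locks")
    os.mkdir(sub)
    os.chmod(sub, 0o755)
    lock = os.path.join(sub, "demo.lock")
    open(lock, "w").close()
    os.chmod(lock, 0o644)
    changed = make_group_shared(tmp)
    file_mode = stat.S_IMODE(os.stat(lock).st_mode)
    dir_mode = stat.S_IMODE(os.stat(sub).st_mode)
    print(changed, oct(file_mode))
```

Like the cron job, this only repairs permissions after the fact; a lock file created between two runs can still trigger the PermissionError.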

catwell commented Dec 20, 2024

@jantrienes Thanks, it's not a fix but it's a good workaround.
