Describe the bug
I am trying to run training jobs on Google TPUs using Hugging Face's DataLoader on SlimPajama-627B and c4, but I run into a 429 Client Error: Too Many Requests for url error when I call load_dataset. Oddly, I am able to train successfully with the wikitext dataset. Is there something I need to set up specifically to train with SlimPajama or C4 on TPUs? It is not clear to me why I am getting these errors.
Steps to reproduce the bug
These are the commands you can run to reproduce the error below. You will need a ClearML account (you can create one here) with a queue set up to run on Google TPUs:
git clone https://github.com/clankur/muGPT.git
cd muGPT
python -m train --config-name=slim_v4-32_84m.yaml +training.queue={NAME_OF_CLEARML_QUEUE}
The error I see:
Traceback (most recent call last):
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/clearml/binding/hydra_bind.py", line 230, in _patched_task_function
return task_function(a_config, *a_args, **a_kwargs)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/train.py", line 1037, in main
main_contained(config, logger)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/train.py", line 840, in main_contained
loader = get_loader("train", config.training_data, config.training.tokens)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/input_loader.py", line 549, in get_loader
return HuggingFaceDataLoader(split, config, token_batch_params)
File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/input_loader.py", line 395, in __init__
self.dataset = load_dataset(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 2112, in load_dataset
builder_instance = load_dataset_builder(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1798, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1495, in dataset_module_factory
raise e1 from None
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1479, in dataset_module_factory
).get_module()
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1034, in get_module
else get_data_patterns(base_path, download_config=self.download_config)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 457, in get_data_patterns
return _get_data_files_patterns(resolver)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 248, in _get_data_files_patterns
data_files = pattern_resolver(pattern)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 340, in resolve_pattern
for filepath, info in fs.glob(pattern, detail=True).items()
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 409, in glob
return super().glob(path, **kwargs)
File "/home/clankur/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/fsspec/spec.py", line 602, in glob
allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 429, in find
out = self._ls_tree(path, recursive=True, refresh=refresh, revision=resolved_path.revision, **kwargs)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 358, in _ls_tree
self._ls_tree(
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 375, in _ls_tree
for path_info in tree:
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3080, in list_repo_tree
for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/utils/_pagination.py", line 46, in paginate
hf_raise_for_status(r)
File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/cerebras/SlimPajama-627B/tree/2d0accdd58c5d5511943ca1f5ff0e3eb5e293543?recursive=True&expand=True&cursor=ZXlKbWFXeGxYMjVoYldVaU9pSjBaWE4wTDJOb2RXNXJNUzlsZUdGdGNHeGxYMmh2YkdSdmRYUmZPVFEzTG1wemIyNXNMbnB6ZENKOTo2MjUw (Request ID: Root=1-67673de9-1413900606ede7712b08ef2c;1304c09c-3e69-4222-be14-f10ee709d49c)
maximum queue size reached
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
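The 429 above is raised while huggingface_hub paginates the repository file listing, before any data is read. One possible workaround, which I have not verified on the TPU setup, is to retry the load_dataset call with exponential backoff. The sketch below is a minimal version of that idea; call_with_backoff and its parameters are hypothetical names introduced here, and matching on "429" in the exception text is an assumption about how the error surfaces.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      is_retryable=lambda exc: "429" in str(exc),
                      sleep=time.sleep):
    """Call fn(); if it raises a retryable error, wait and try again.

    The wait doubles after every failed attempt (capped at max_delay) and
    gets random jitter so parallel workers do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not rate limits.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


# Hypothetical usage (requires the `datasets` package and network access):
#   from datasets import load_dataset
#   dataset = call_with_backoff(
#       lambda: load_dataset("cerebras/SlimPajama-627B", split="train")
#   )
```

Whether retries are enough depends on how long the rate limit lasts; if every TPU host lists the repository at the same moment, the jitter may matter more than the number of attempts.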
Expected behavior
I'd expect the DataLoader to load from the SlimPajama-627B and c4 datasets without issue.
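Since the failing request is a recursive listing of the whole repository, another thing that might help is passing explicit data_files patterns so that only a few directories need to be listed. This is a sketch under assumptions: the split/chunkN/*.jsonl.zst layout is my reading of the SlimPajama-627B repository and should be checked against it, the helper names are hypothetical, and I have not confirmed that this avoids the 429.

```python
def slimpajama_data_files(split="train", chunks=(1,)):
    """Build glob patterns that cover only the requested chunk directories.

    Assumes files live under <split>/chunk<N>/ with a .jsonl.zst suffix.
    """
    return {split: [f"{split}/chunk{i}/*.jsonl.zst" for i in chunks]}


def load_slimpajama_subset(split="train", chunks=(1,)):
    """Load only the selected chunks, streaming instead of downloading."""
    # Imported lazily so the pattern helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "cerebras/SlimPajama-627B",
        data_files=slimpajama_data_files(split, chunks),
        split=split,
        streaming=True,
    )
```

For example, slimpajama_data_files("train", (1, 2)) yields patterns for train/chunk1 and train/chunk2 only, so the file resolution should touch far fewer entries than the full tree.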
Environment info
datasets version: 2.14.4