Converting Arrow to WebDataset TAR Format for Offline Use #7347

Closed
katie312 opened this issue Dec 27, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@katie312

Feature request

Hi,

I've downloaded an Arrow-formatted dataset for offline use with Hugging Face's datasets library:

import json
from datasets import load_dataset

dataset = load_dataset("pixparse/cc3m-wds")
dataset.save_to_disk("./cc3m_1") 

Now I need to convert it to WebDataset's TAR format for offline data ingestion.
Is there a straightforward way to do this conversion without an internet connection? Can I simply convert it with

tar -cvf 

btw, when I tried:

import webdataset as wds
from huggingface_hub import get_token
from torch.utils.data import DataLoader

hf_token = get_token()
url = "https://huggingface.co/datasets/timm/imagenet-12k-wds/resolve/main/imagenet12k-train-{{0000..1023}}.tar"
url = f"pipe:curl -s -L {url} -H 'Authorization:Bearer {hf_token}'"
dataset = wds.WebDataset(url).decode()
dataset.save_to_disk("./cc3m_webdataset") 

an error occurred:

AttributeError: 'WebDataset' object has no attribute 'save_to_disk'

Thanks a lot!

Motivation

Converting Arrow to WebDataset TAR Format

Your contribution

No clue yet

katie312 added the enhancement (New feature or request) label Dec 27, 2024
@hamad350

hamad350 commented Dec 27, 2024 via email

@lhoestq
Member

lhoestq commented Dec 27, 2024

now I need to convert it to WebDataset's TAR format for offline data ingestion.

you can directly download the .TAR files from HF using e.g. huggingface-cli download and load them in webdataset :)

@hamad350

hamad350 commented Dec 27, 2024 via email

@katie312
Author

now I need to convert it to WebDataset's TAR format for offline data ingestion.

you can directly download the .TAR files from HF using e.g. huggingface-cli download and load them in webdataset :)

Thanks a lot! I completely forgot about huggingface-cli download. Thanks for the reminder!
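
For cases where the original TAR shards are not available, the rows can also be repacked locally without network access. The sketch below is a minimal, unofficial illustration using only Python's standard tarfile module. It relies on the WebDataset convention that files sharing a basename form one sample and that the extension names the field. The helper write_wds_shard and the example_rows data are hypothetical. Rows read from the saved Arrow dataset (e.g. via datasets.load_from_disk) would first have to be encoded to bytes, and the "jpg"/"txt" field names are assumed here, not verified against pixparse/cc3m-wds.

```python
import io
import tarfile


def write_wds_shard(samples, tar_path):
    """Write samples to one TAR file using the WebDataset naming convention.

    Each sample is a dict with a "__key__" string plus extension -> bytes
    entries; every entry is stored in the archive as "<key>.<extension>".
    Files of the same sample are written next to each other.
    """
    with tarfile.open(tar_path, "w") as tar:
        for sample in samples:
            key = sample["__key__"]
            for ext, payload in sample.items():
                if ext.startswith("__"):
                    continue  # skip metadata fields such as __key__
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


# Hypothetical in-memory rows standing in for rows read from the Arrow dataset.
example_rows = [
    {"__key__": "000000", "txt": b"a caption", "jpg": b"<jpeg bytes>"},
    {"__key__": "000001", "txt": b"another caption", "jpg": b"<jpeg bytes>"},
]
```

Calling write_wds_shard(example_rows, "shard-000000.tar") should produce a shard whose members are named 000000.txt, 000000.jpg, 000001.txt and 000001.jpg. For a large dataset the rows would normally be split across several numbered shards rather than written to a single file.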
