Converting Arrow to WebDataset TAR Format for Offline Use #7347
Comments
Hi,
I've downloaded an Arrow-formatted dataset for offline use with the Hugging Face datasets library:
import json
from datasets import load_dataset
dataset = load_dataset("pixparse/cc3m-wds")
dataset.save_to_disk("./cc3m_1")
Now I need to convert it to WebDataset's TAR format for offline data ingestion.
Is there a straightforward way to do this conversion without an internet connection? Can I simply convert it with
tar -cvf
By the way, when I tried:
import webdataset as wds
from huggingface_hub import get_token
from torch.utils.data import DataLoader
hf_token = get_token()
url = "https://huggingface.co/datasets/timm/imagenet-12k-wds/resolve/main/imagenet12k-train-{{0000..1023}}.tar"
url = f"pipe:curl -s -L {url} -H 'Authorization:Bearer {hf_token}'"
dataset = wds.WebDataset(url).decode()
dataset.save_to_disk("./cc3m_webdataset")
an error occurred:
AttributeError: 'WebDataset' object has no attribute 'save_to_disk'
Thanks a lot!
Motivation
Converting Arrow to WebDataset TAR Format
Your contribution
No clue yet
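A note on the `tar -cvf` idea: a folder written by `save_to_disk` contains Arrow files plus metadata, so archiving it as-is would not give the per-sample layout WebDataset reads, where files sharing a basename (the key) form one sample. The following is a minimal sketch of that layout using only the Python standard library, so it runs offline. The `rows` list, its field names, and the shard file name are hypothetical stand-ins for records read back from the saved dataset; the webdataset library also ships its own writers (`TarWriter`, `ShardWriter`), which may be the more convenient route.

```python
import io
import os
import tarfile
import tempfile


def write_wds_shard(samples, tar_path):
    """Write samples to one WebDataset-style TAR shard.

    Each sample is a dict with a unique "__key__" plus one entry per
    file extension (e.g. "jpg", "txt") holding that file's raw bytes.
    """
    with tarfile.open(tar_path, "w") as tar:
        for sample in samples:
            key = sample["__key__"]
            for ext, payload in sample.items():
                if ext == "__key__":
                    continue
                # Members of one sample share the key and differ by extension.
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


# Hypothetical rows standing in for records read from the saved dataset.
rows = [
    {"image_bytes": b"\xff\xd8fake-jpeg-0", "caption": "a cat"},
    {"image_bytes": b"\xff\xd8fake-jpeg-1", "caption": "a dog"},
]

samples = [
    {
        "__key__": f"{i:09d}",
        "jpg": row["image_bytes"],
        "txt": row["caption"].encode("utf-8"),
    }
    for i, row in enumerate(rows)
]

out_dir = tempfile.mkdtemp()
shard_path = os.path.join(out_dir, "shard-000000.tar")
write_wds_shard(samples, shard_path)
```

For a real dataset the rows would come from iterating over the dataset loaded from disk, and the output would normally be split across many shards rather than one file.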
From: katie312 ***@***.***>
Sent: Friday, December 27, 2024 4:41:21 AM
To: huggingface/datasets ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [huggingface/datasets] Converting Arrow to WebDataset TAR Format for Offline Use (Issue #7347)
You can directly download the .tar files from HF using e.g. huggingface-cli download and load them in webdataset :)
الفله سنه والطبقه يوم
From: Quentin Lhoest ***@***.***>
Sent: Friday, December 27, 2024 4:14:43 PM
To: huggingface/datasets ***@***.***>
Cc: hamad350 ***@***.***>; Comment ***@***.***>
Subject: Re: [huggingface/datasets] Converting Arrow to WebDataset TAR Format for Offline Use (Issue #7347)
now I need to convert it to WebDataset's TAR format for offline data ingestion.
you can directly download the .TAR files from HF using e.g. huggingface-cli download and load them in webdataset :)
Thanks a lot! I completely forgot about huggingface-cli download. Thanks for the reminder!
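For anyone who wants to check what a downloaded shard contains before pointing a loader at it, here is a rough standard-library sketch that groups the members of a local .tar file into samples by key. It is not the webdataset library's own reader (which also accepts local shard paths); the function name `iter_wds_samples` and the demo file names are made up for illustration, and the sketch assumes the members of one sample are stored next to each other, as WebDataset shards normally are.

```python
import io
import os
import tarfile
import tempfile


def iter_wds_samples(tar_path):
    """Yield one dict per sample, grouping consecutive tar members by key."""
    current_key, sample = None, {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # The key is the path up to the first dot of the file name;
            # everything after that dot is treated as the extension.
            dirname, base = os.path.split(member.name)
            stem, _, ext = base.partition(".")
            key = os.path.join(dirname, stem)
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


# Build a tiny demo shard so the sketch runs without any download.
tmp_dir = tempfile.mkdtemp()
demo_shard = os.path.join(tmp_dir, "demo-000000.tar")
demo_files = [
    ("0001.jpg", b"img-1"),
    ("0001.txt", b"a cat"),
    ("0002.jpg", b"img-2"),
    ("0002.txt", b"a dog"),
]
with tarfile.open(demo_shard, "w") as tar:
    for name, payload in demo_files:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

loaded = list(iter_wds_samples(demo_shard))
```

Each item in `loaded` is a dict holding the sample key plus one raw-bytes entry per file extension found for that key.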