Currently the estimated row count assumes that file sizes are consistent across the entire set. From what I've seen, this results in wildly inaccurate estimates for WebDataset. WebDatasets are typically created with a set number of samples per file, so a simpler and more accurate estimate can be calculated as the row count of one shard multiplied by the total number of shards.
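To make the proposal concrete, here is a minimal sketch of the shard-based estimate. It is not the existing estimator's code: the function names (`sample_key`, `count_samples_in_shard`, `estimate_num_rows`, `build_demo_shard`) are hypothetical, it uses only the standard-library `tarfile` module, and it assumes the usual WebDataset layout where files sharing the same name up to the first dot belong to one sample.

```python
import io
import tarfile


def sample_key(member_name: str) -> str:
    """Return the sample key: the path up to the first dot in the file name."""
    directory, _, filename = member_name.rpartition("/")
    stem = filename.split(".", 1)[0]
    return f"{directory}/{stem}" if directory else stem


def count_samples_in_shard(fileobj) -> int:
    """Count distinct sample keys in one tar shard, reading it as a stream."""
    keys = set()
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                keys.add(sample_key(member.name))
    return len(keys)


def estimate_num_rows(first_shard, num_shards: int) -> int:
    """Estimate total rows, assuming every shard holds as many samples as the first."""
    return count_samples_in_shard(first_shard) * num_shards


def build_demo_shard(num_samples: int) -> io.BytesIO:
    """Build a tiny in-memory shard with two files (.jpg, .txt) per sample."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(num_samples):
            for ext, payload in (("jpg", b"img"), ("txt", b"caption")):
                info = tarfile.TarInfo(name=f"{i:06d}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # 3 samples in the first shard, 10 shards in total.
    print(estimate_num_rows(build_demo_shard(3), num_shards=10))
```

The last shard of a dataset is often only partially filled, so this estimate can overshoot by up to one shard's worth of samples.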
The current estimator takes a sample from the dataset by streaming the first 5GB of in-memory data.
It was made to work for arbitrary file formats.
For WebDataset, as far as I know, there is no strict rule requiring a fixed number of samples per shard, so I don't know how often your method would be more or less accurate. Unless this rule is enforced somewhere?
Yes, I am aware of how the current estimator works. As stated in the issue, it assumes a consistent file size across the entire set.
There may not be a strict rule requiring a fixed number of samples, just as there is no rule that the number of samples in the first 5GB matches the rest. Nevertheless, a fixed number of samples per shard is the typical usage, and WebDataset's ShardWriter does enforce a fixed number of samples per shard. I don't know how inaccurate the current method is across the whole of Hugging Face; you'd have to check that. I do know it's inaccurate for at least 2 of my own datasets: 288k estimated vs 237k actual in one case, and 689k estimated vs 929k actual in another.
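For scale, the two mismatches quoted above can be expressed as relative errors. This is a small sketch using the rounded figures from the comment (the helper name `relative_error` is made up here), so the percentages are approximate:

```python
def relative_error(estimated: float, actual: float) -> float:
    """Signed relative error of an estimate: positive means overestimate."""
    return (estimated - actual) / actual


# Rounded figures quoted in the comment above: (estimated, actual).
cases = [(288_000, 237_000), (689_000, 929_000)]

for estimated, actual in cases:
    print(f"{estimated} vs {actual}: {relative_error(estimated, actual):+.1%}")
# prints:
# 288000 vs 237000: +21.5%
# 689000 vs 929000: -25.8%
```

So the size-based estimate is off by roughly a fifth to a quarter in both cases, once as an overestimate and once as an underestimate.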
Oh great to see that the ShardWriter does enforce this. Since it's the official implementation and most people use it we can probably rely on this assumption :)
I'd be happy to provide some guidance if you want to look into how to implement this!