Improve estimated row count #3055

hlky · 2024-08-31T21:28:52Z

Currently, the estimated row count assumes file sizes are consistent across the entire set. From what I've seen, this results in wildly inaccurate estimates for WebDataset. WebDatasets are typically created with a set number of samples per file, so a simpler and more accurate estimate can be calculated from the row count of one shard multiplied by the total number of shards.
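A minimal sketch of the proposed estimate, using only the standard library. It assumes the usual WebDataset convention that files sharing the same name up to the first dot of the base name form one sample. The function names `count_samples_in_shard` and `estimate_num_rows_from_shard` are hypothetical and are not part of any existing codebase.

```python
import io
import tarfile


def count_samples_in_shard(fileobj) -> int:
    """Count WebDataset samples in one tar shard.

    Assumption: files whose names match up to the first dot of the
    base name (e.g. 0001.jpg and 0001.json) belong to the same sample.
    """
    keys = set()
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            dirname, _, basename = member.name.rpartition("/")
            key = basename.split(".", 1)[0]
            keys.add(f"{dirname}/{key}" if dirname else key)
    return len(keys)


def estimate_num_rows_from_shard(rows_in_one_shard: int, num_shards: int) -> int:
    """Proposed estimate: rows in one shard times the number of shards."""
    return rows_in_one_shard * num_shards
```

Only one shard has to be read to get the estimate. It is exact when every shard holds the same number of samples, and off by at most one shard's worth of rows when only the last shard is partially filled.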

lhoestq · 2024-09-03T18:54:39Z

The current estimator uses this formula:

estimated_num_rows = total_files_bytes / sampled_bytes * num_rows_in_sample

The sample is taken from the dataset by streaming the first 5GB of in-memory data.

It was made to work for arbitrary file formats.
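As a rough illustration of the formula above (a sketch, not the actual implementation), the function below takes its parameter names from the formula. The function itself and the byte counts in the example are made up for illustration.

```python
def estimate_num_rows(total_files_bytes: int, sampled_bytes: int, num_rows_in_sample: int) -> int:
    """Extrapolate the row count from a sample, assuming bytes per row is constant."""
    if sampled_bytes <= 0:
        raise ValueError("sampled_bytes must be positive")
    return int(total_files_bytes / sampled_bytes * num_rows_in_sample)


# Hypothetical dataset with two shards of 10 rows each (20 rows in total):
# the first shard is 1000 bytes, the second is 3000 bytes.
# Sampling only the first shard overestimates the total.
estimate = estimate_num_rows(total_files_bytes=4000, sampled_bytes=1000, num_rows_in_sample=10)
```

Here `estimate` is 40 while the true count is 20, because the rows in the sampled part are smaller than the rows in the rest of the dataset.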

For WebDataset, afaik there is no strict rule to have a fixed number of samples per shard, so I don't know how often your method would be more or less accurate. Unless this rule is enforced somewhere?

hlky · 2024-09-03T19:52:51Z

Yes, I am aware of how the current estimator works. As stated in the issue, it assumes a consistent file size across the entire set.

There may not be a strict rule to have a fixed number of samples, just as there is no rule that the number of samples in the first 5GB is the same as in the rest. Nevertheless, a fixed number of samples per shard is the typical usage, and WebDataset's ShardWriter does enforce a fixed number of samples per shard. I don't know how inaccurate the current method is across all of Hugging Face; you'd have to check that. I do know it's inaccurate for at least 2 of my own datasets: 288k estimated vs 237k actual in one case, and 689k estimated vs 929k actual in another.
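To put the two reported cases on a common scale, here is a small sketch computing the relative error of each estimate. The inputs are the rounded figures quoted above, so the percentages are approximate. `relative_error` is a hypothetical helper.

```python
def relative_error(estimated: float, actual: float) -> float:
    """Signed relative error: positive means the estimate is too high."""
    return (estimated - actual) / actual


# Rounded figures from the two datasets mentioned above.
first = relative_error(288_000, 237_000)   # overestimate
second = relative_error(689_000, 929_000)  # underestimate

print(f"{first:+.1%}")   # +21.5%
print(f"{second:+.1%}")  # -25.8%
```

The errors go in opposite directions, so a constant correction factor on the current estimate would not fix both cases.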

lhoestq · 2024-09-04T09:45:09Z

Oh, great to see that the ShardWriter does enforce this. Since it's the official implementation and most people use it, we can probably rely on this assumption :)

I'd be happy to provide some guidance if you want to look into how to implement this!

severo added improvement / optimization P2 Nice to have labels Sep 3, 2024
