
Support custom fingerprinting with Dataset.from_generator #6194

Open
bilelomrani1 opened this issue Aug 29, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@bilelomrani1 commented Aug 29, 2023

Feature request

When using Dataset.from_generator, the generator is hashed when building the fingerprint. Similar to .map, it would be useful to let the user bypass this hashing by accepting a fingerprint argument in .from_generator.

Motivation

Using the .from_generator constructor with a non-picklable generator fails. By accepting a fingerprint argument to .from_generator, the user would have the opportunity to manually fingerprint the dataset and thus bypass the crash.
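A minimal sketch of the failure mode and of the proposed usage (the fingerprint= argument is the feature being requested here and does not exist in datasets). datasets hashes the generator with a dill-based pickler, which handles more than the stdlib pickle module, but the same class of failure arises with genuinely unpicklable state; the stdlib analogue below reproduces it:

```python
import pickle


def make_gen(resource):
    # A generator function defined inside another function (a closure)
    # cannot be pickled by the stdlib pickle module.
    def gen():
        yield {"row": repr(resource)}
    return gen


gen = make_gen(object())

failed = False
try:
    pickle.dumps(gen)
except Exception:
    # e.g. "Can't pickle local object 'make_gen.<locals>.gen'"
    failed = True

assert failed

# Proposed (hypothetical) usage, letting the user supply the fingerprint
# instead of having the generator hashed:
# ds = Dataset.from_generator(gen, fingerprint="my-dataset-v1")
```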

Your contribution

If validated, I can try to submit a PR for this.

@bilelomrani1 bilelomrani1 added the enhancement New feature or request label Aug 29, 2023
@mariosasko (Collaborator)

The fingerprint parameter serves a slightly different purpose: we use it to inject a new fingerprint after transforming a Dataset (computed from the previous fingerprint + transform + transform args), e.g., to be able to compute the cache file for a transform. There is no concept of a fingerprint before a Dataset is fully initialized, but we still need to hash the args (e.g., the generator func) of the "dataset creation methods" (from_generator, from_csv, etc.) to compute the cache directory (to store the initial version and transformed dataset versions).

I agree it should be easier to bypass the hashing mechanism in this instance, too. However, we should probably first address #5080 before solving this (e.g., maybe by exposing hash in load_dataset/load_dataset_builder).

@mlin commented Sep 6, 2023

Adding +1 here:

If the generator needs to access some external resources or state, it's not always straightforward to make it picklable. So I'd like to be able to override the default cache-key derivation so that it doesn't need to pickle the generator (and, of course, I'd accept responsibility for that part of cache consistency).

Appears to be a recurrent roadbump: #6118 #5963 #5819 #5750 #4983

@mlin commented Sep 6, 2023

Silly hack incoming:

import uuid

class _DatasetGeneratorPickleHack:
    def __init__(self, generator, generator_id=None):
        self.generator = generator
        self.generator_id = (
            generator_id if generator_id is not None else str(uuid.uuid4())
        )

    def __call__(self, *args, **kwargs):
        return self.generator(*args, **kwargs)

    def __reduce__(self):
        return (_DatasetGeneratorPickleHack_raise, (self.generator_id,))


def _DatasetGeneratorPickleHack_raise(*args, **kwargs):
    raise AssertionError("cannot actually unpickle _DatasetGeneratorPickleHack!")

Now Dataset.from_generator(_DatasetGeneratorPickleHack(gen)) works even if gen is unpicklable, because datasets pickles only the shim object and never traverses gen. One can then work out how to set generator_id meaningfully to allow cache reuse.

@hartmans (Contributor)
I'd like some way to do this too. I find that sometimes the hash doesn't cover enough, so the dataset is not regenerated even when the underlying data has changed; by supplying a custom fingerprint I could do a better job of controlling when my dataset is regenerated.
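One way to get that control, sketched here as a hypothetical helper (data_fingerprint is not part of datasets): derive an id from the files the generator actually reads, so the id changes whenever the data does.

```python
import hashlib
from pathlib import Path


def data_fingerprint(root):
    """Derive an id from file names, sizes, and mtimes under `root`,
    so the id (and hence any cache key built from it) changes
    whenever the underlying data changes."""
    h = hashlib.sha256()
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            st = p.stat()
            h.update(str(p.relative_to(root)).encode())
            h.update(str(st.st_size).encode())
            h.update(str(st.st_mtime_ns).encode())
    return h.hexdigest()[:16]
```

Such an id could then be passed as generator_id to the shim above, or as the proposed fingerprint argument, to invalidate the cache on data changes.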

@urjitbhatia
I ran into the same thing: my actual generator reads from a disk source that may have new data (images) available at some point, and the cache ends up skipping the generator call entirely. Thanks for the hack @mlin 👋

@vttrifonov commented Dec 22, 2024

Just wanted to pitch my support for easy control over the generator id. Requiring that generators be picklable just to get a unique id is limiting: plenty of classes (maybe even some of hf.datasets' own) are written with no pickle support in mind. Also, as mentioned above, the state of a generator might extend beyond its pickle.
