Support custom fingerprinting with Dataset.from_generator
#6194
Comments
I agree it should be easier to bypass the hashing mechanism in this instance, too. However, we should probably first address #5080 before solving this (e.g., maybe exposing
Adding +1 here: if the generator needs to access external resources or state, it's not always straightforward to make it picklable. So I'd like to be able to override the default cache key derivation, which currently has to pickle the generator (and of course, I'd accept responsibility for that part of cache consistency). This appears to be a recurring road bump: #6118 #5963 #5819 #5750 #4983
Silly hack incoming:

    import uuid

    class _DatasetGeneratorPickleHack:
        def __init__(self, generator, generator_id=None):
            self.generator = generator
            # Fall back to a random id, which never matches an earlier cache entry.
            self.generator_id = (
                generator_id if generator_id is not None else str(uuid.uuid4())
            )

        def __call__(self, *args, **kwargs):
            # Forward positional and keyword arguments to the wrapped generator.
            return self.generator(*args, **kwargs)

        def __reduce__(self):
            # Pickle only the id, so the hash depends on generator_id alone.
            return (_DatasetGeneratorPickleHack_raise, (self.generator_id,))

    def _DatasetGeneratorPickleHack_raise(*args, **kwargs):
        raise AssertionError("cannot actually unpickle _DatasetGeneratorPickleHack!")

Now
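A minimal sketch of why this kind of wrapper controls the cache key: because `__reduce__` returns only the id, the serialized bytes depend on `generator_id` and not on the wrapped generator. The names `GeneratorWithId` and `_raise_on_unpickle` are hypothetical stand-ins for the class above, and the standard-library `pickle` is used here as an assumption in place of whatever serializer the library applies when hashing.

```python
import pickle
import uuid


def _raise_on_unpickle(*args, **kwargs):
    raise AssertionError("cannot actually unpickle GeneratorWithId!")


class GeneratorWithId:
    """Hypothetical stand-in for the wrapper above: it pickles as its id only."""

    def __init__(self, generator, generator_id=None):
        self.generator = generator
        self.generator_id = (
            generator_id if generator_id is not None else str(uuid.uuid4())
        )

    def __call__(self, *args, **kwargs):
        return self.generator(*args, **kwargs)

    def __reduce__(self):
        # Only the id is serialized; the wrapped generator is never touched.
        return (_raise_on_unpickle, (self.generator_id,))


def gen_a():
    yield {"x": 1}


def gen_b():
    yield {"x": 2}


# Different generators, same id: identical serialized bytes.
same_1 = pickle.dumps(GeneratorWithId(gen_a, "my-dataset-v1"))
same_2 = pickle.dumps(GeneratorWithId(gen_b, "my-dataset-v1"))
# Same generator, different id: different serialized bytes.
other = pickle.dumps(GeneratorWithId(gen_a, "my-dataset-v2"))

print(same_1 == same_2)  # True
print(same_1 == other)   # False
```

The flip side is that the caller now owns cache consistency: reusing an id after the generator's output has changed would make both versions look identical to anything that hashes the pickle.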
I'd like some way to do this too. Sometimes the hash doesn't cover enough, so the dataset is not regenerated even when the underlying data has changed. By supplying a custom fingerprint, I could better control when my dataset is regenerated.
I ran into the same thing: my actual generator reads from a disk source that might have new data (images) available at some point, and it ends up skipping the call to the generator. Thanks for the hack @mlin 👋
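For the disk-backed case described in the two comments above, one possible way to build an id that changes whenever the underlying files change is to hash the directory listing. This is only a sketch: `directory_state_id` is a hypothetical helper, and it assumes that relative path, size, and modification time are enough to detect new or changed data.

```python
import hashlib
from pathlib import Path


def directory_state_id(root, pattern="*"):
    """Hash the relative path, size, and mtime of every matching file under root."""
    root = Path(root)
    digest = hashlib.sha256()
    # Sort so the id does not depend on filesystem iteration order.
    for path in sorted(p for p in root.rglob(pattern) if p.is_file()):
        stat = path.stat()
        fields = [
            path.relative_to(root).as_posix(),
            str(stat.st_size),
            str(stat.st_mtime_ns),
        ]
        digest.update("\0".join(fields).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

The resulting string could then be passed as `generator_id` to the wrapper from the hack above, so that adding an image produces a new id and the generator is called again. Files rewritten with identical size and timestamp would go unnoticed; hashing file contents instead would close that gap at the cost of reading every file.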
Just wanted to pitch in my support for easy control over the generator id. Requiring that generators are picklable just to get a unique id is limiting: plenty of classes (maybe even hf.datasets' own) are written with no pickle support in mind. Also, as mentioned above, the state of a generator might extend beyond its pickle.
Feature request
When using `Dataset.from_generator`, the generator is hashed when building the fingerprint. Similar to `.map`, it would be interesting to let the user bypass this hashing by accepting a `fingerprint` argument to `.from_generator`.
Motivation
Using the `.from_generator` constructor with a non-picklable generator fails. By accepting a `fingerprint` argument to `.from_generator`, the user would have the opportunity to manually fingerprint the dataset and thus bypass the crash.
Your contribution
If validated, I can try to submit a PR for this.
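To illustrate the failure mode described under Motivation, here is a small sketch of a generator that cannot be serialized because it is bound to an object holding a lock. `ImageSource` is a hypothetical class, and the standard-library `pickle` is used as an assumption in place of the serializer the library uses for hashing; the exact error raised inside `Dataset.from_generator` may differ.

```python
import pickle
import threading


class ImageSource:
    """Hypothetical data source holding a resource that cannot be pickled."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = ["a", "b", "c"]

    def generate(self):
        with self._lock:
            for item in self._items:
                yield {"name": item}


source = ImageSource()

try:
    # Pickling the bound method drags in `source`, and with it the lock.
    pickle.dumps(source.generate)
    picklable = True
except (TypeError, pickle.PicklingError) as err:
    picklable = False
    print(f"pickling failed: {err}")

# The generator itself works fine; only serialization is the problem.
rows = list(source.generate())
```

A user-supplied fingerprint would let a generator like this be used without any serialization step, which is the bypass the request asks for.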