
Enable the Audio Feature to decode / read with an offset + duration #7310

Open
TParcollet opened this issue Dec 7, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@TParcollet

Feature request

For most large speech datasets, we do not wish to generate hundreds of millions of small audio samples. Instead, it is quite common to provide larger audio files along with a frame offset (soundfile's start and stop arguments). We should be able to pass these arguments to Audio() (via the corresponding columns in the dataset row).

Motivation

I am currently generating a fairly big dataset to .parquet. Unfortunately, it does not work, because all existing functions load the whole .wav file corresponding to the row. All my attempts at bypassing this have failed. We should be able to put into the Table only the bytes corresponding to what soundfile reads with an offset (i.e. a subset of the audio file).

Your contribution

I can totally test whatever code on my large dataset creation script.

@TParcollet TParcollet added the enhancement New feature or request label Dec 7, 2024
@TParcollet TParcollet changed the title Enable the Audio feature to decode / read with an offset + duration Enable the Audio Feature to decode / read with an offset + duration Dec 7, 2024
@lhoestq
Member

lhoestq commented Dec 9, 2024

Hi ! What about having audio + start + duration columns and enabling something like this?

for example in ds:
    array = example["audio"].read(start=example["start"], frames=example["duration"])

@TParcollet
Author

TParcollet commented Dec 9, 2024

Hi @lhoestq, this would work for a file-based dataset but would be terrible for a sharded one, as it would duplicate the large audio file many times. Also, very long audio files do not embed well in Parquet files, even with large_binary(). It crashed a few times for me until I switched to one sample == one file :-(
