
Enable the Audio Feature to decode / read with an offset + duration #7310

Open
TParcollet opened this issue Dec 7, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@TParcollet

Feature request

For most large speech datasets, we do not wish to generate hundreds of millions of small audio samples. Instead, it is quite common to provide larger audio files along with a frame offset (soundfile's start and stop arguments). We should be able to pass these arguments to Audio() (via the corresponding columns in the dataset row).

Motivation

I am currently generating a fairly big dataset to .parquet. Unfortunately, it does not work, because all existing functions load the whole .wav file corresponding to the row. All my attempts at bypassing this have failed. We should be able to put into the Table only the bytes corresponding to what soundfile reads with an offset (i.e. a subset of the audio file).

Your contribution

I can totally test whatever code on my large dataset creation script.

@TParcollet TParcollet added the enhancement New feature or request label Dec 7, 2024
@TParcollet TParcollet changed the title Enable the Audio feature to decode / read with an offset + duration Enable the Audio Feature to decode / read with an offset + duration Dec 7, 2024
@lhoestq
Member

lhoestq commented Dec 9, 2024

Hi ! What about having audio + start + duration columns and enabling something like this?

for example in ds:
    array = example["audio"].read(start=example["start"], frames=example["duration"])

@TParcollet
Author

TParcollet commented Dec 9, 2024

Hi @lhoestq, this would work for a file-based dataset but would be terrible for a sharded one, as it would duplicate the large audio file many times. Also, very long audio files do not embed well in Parquet files, even with large_binary(). It crashed a few times for me until I switched to one sample == one file :-(
