Feature request
For most large speech datasets, we do not wish to generate hundreds of millions of small audio samples. Instead, it is quite common to provide larger audio files together with frame offsets (the soundfile start and stop arguments). We should be able to pass these arguments to Audio(), e.g. via the IDs of the corresponding columns in the dataset row.
Motivation
I am currently generating a fairly big dataset to .parquet(). Unfortunately, it does not work because all existing functions load the whole .wav file corresponding to the row. All my attempts at bypassing this have failed. We should be able to put in the Table only the bytes that soundfile reads for a given offset (i.e. a subset of the audio file).
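To make the request concrete, here is a minimal sketch of the behavior I have in mind. It is not existing Audio() functionality. To stay dependency-free it uses the standard-library wave module rather than soundfile, so it only handles PCM .wav files. The helper name read_segment_as_wav_bytes and the start / stop parameters (in frames) are hypothetical.

```python
import io
import wave


def read_segment_as_wav_bytes(path, start, stop):
    """Return WAV-encoded bytes holding only frames [start, stop) of `path`.

    Hypothetical helper: `start` and `stop` are frame indices, mirroring
    the soundfile start/stop arguments mentioned above.
    """
    with wave.open(path, "rb") as src:
        params = src.getparams()
        # Clamp the requested window to the actual file length.
        start = max(0, min(start, params.nframes))
        stop = max(start, min(stop, params.nframes))
        src.setpos(start)                       # seek to the first wanted frame
        frames = src.readframes(stop - start)   # read only the segment

    # Re-encode the segment as a small standalone WAV held in memory.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(frames)
    return buffer.getvalue()
```

The returned bytes are what I would expect to end up in the Table for that row, instead of the whole file. With soundfile, the decoding step would presumably be sf.read(path, start=start, stop=stop), which also covers formats other than PCM .wav.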
Your contribution
I am happy to test any proposed code on my large dataset creation script.
TParcollet changed the title from "Enable the Audio feature to decode / read with an offset + duration" to "Enable the Audio Feature to decode / read with an offset + duration" on Dec 7, 2024
Hi @lhoestq, this would work with a file-based dataset, but it would be terrible for a sharded one because it would duplicate the large audio file many times. Also, very long audio files do not embed well in the parquet file, even with large_binary(). It crashed a few times for me until I switched to one sample == one file :-(