Creating new dataset from list loses information. (Audio Information Lost - either Datatype or Values). #7306

ai-nikolai · 2024-12-05T09:07:53Z

Describe the bug

When creating a dataset from a list of datapoints, information about the individual items is lost.

Specifically, when creating a dataset from a list of datapoints taken from another dataset, either the datatype or the values are lost. See the examples below.

Question: what is the best way to create a dataset from a list of datapoints?

For example, when running this code:

from datasets import load_dataset, Dataset
commonvoice_data = load_dataset("mozilla-foundation/common_voice_17_0", "it", split="test", streaming=True)
datapoint = next(iter(commonvoice_data))
out = [datapoint]
new_data = Dataset.from_list(out)  # this loses datatype information
new_data2 = Dataset.from_list(out, features=commonvoice_data.features)  # this loses value information

We get the following:

datapoint: (the original datapoint)

'audio': {'path': 'it_test_0/common_voice_it_23606167.mp3', 'array': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
       2.21619011e-05, 2.72628222e-05, 0.00000000e+00]), 'sampling_rate': 48000}

Original Dataset Features:

>>> commonvoice_data.features
'audio': Audio(sampling_rate=48000, mono=True, decode=True, id=None)

Here we see that the "audio" column has the proper values (both path and array) and the correct datatype (Audio).

new_data[0]:

(Not shown here, because printing it outputs the entire array.)

New Dataset 1 Features:

>>> new_data.features
'audio': {'array': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'path': Value(dtype='string', id=None), 'sampling_rate': Value(dtype='int64', id=None)}

Here we see that the "audio" column has the correct values but is no longer of the Audio datatype.

new_data2[0]:

'audio': {'path': None, 'array': array([0., 0., 0., ..., 0., 0., 0.]), 'sampling_rate': 48000},

New Dataset 2 Features:

>>> new_data2.features
'audio': Audio(sampling_rate=48000, mono=True, decode=True, id=None),

Here we see that the "audio" column has the correct datatype, but the array and path values were lost: path is None and the printed array contains only zeros.

Steps to reproduce the bug

Run:

from datasets import load_dataset, Dataset
commonvoice_data = load_dataset("mozilla-foundation/common_voice_17_0", "it", split="test", streaming=True)
datapoint = next(iter(commonvoice_data))
out = [datapoint]
new_data = Dataset.from_list(out)  # this loses datatype information
new_data2 = Dataset.from_list(out, features=commonvoice_data.features)  # this loses value information

Expected behavior

Expected:

datapoint == new_data[0]

AND

datapoint == new_data2[0]
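
Note that a plain `==` between two datapoints that contain arrays may not give a single True/False, so the expectation above is easier to check field by field. The following is only a rough sketch of such a check, not part of `datasets`: `diff_datapoints` is a hypothetical helper, and the two example items are shortened stand-ins built from the values printed above (plain lists instead of arrays), not real loaded data.

```python
import math
import numbers
from collections.abc import Mapping


def diff_datapoints(expected, actual, rel_tol=1e-9, abs_tol=0.0, prefix=""):
    """Return the dotted key paths at which `actual` differs from `expected`.

    Hypothetical helper for illustration; it is not a `datasets` API.
    """
    # A dict on one side only is always a mismatch.
    if isinstance(expected, Mapping) != isinstance(actual, Mapping):
        return [prefix]
    # Nested dicts (such as the "audio" column): recurse key by key.
    if isinstance(expected, Mapping):
        diffs = []
        for key in sorted(set(expected) | set(actual)):
            path = f"{prefix}.{key}" if prefix else str(key)
            if key not in expected or key not in actual:
                diffs.append(path)
            else:
                diffs.extend(
                    diff_datapoints(expected[key], actual[key], rel_tol, abs_tol, path)
                )
        return diffs
    # Strings and None are compared directly.
    if isinstance(expected, str) or isinstance(actual, str) or expected is None or actual is None:
        return [] if expected == actual else [prefix]
    # Numbers are compared with a tolerance.
    if isinstance(expected, numbers.Real) and isinstance(actual, numbers.Real):
        close = math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
        return [] if close else [prefix]
    # Array-like values: same length and element-wise equal.
    if hasattr(expected, "__len__") and hasattr(actual, "__len__"):
        if len(expected) != len(actual):
            return [prefix]
        changed = any(
            diff_datapoints(a, b, rel_tol, abs_tol, prefix)
            for a, b in zip(expected, actual)
        )
        return [prefix] if changed else []
    # Anything else falls back to ordinary equality.
    return [] if expected == actual else [prefix]


# Shortened stand-ins for `datapoint` and `new_data2[0]` (illustrative values only).
original_item = {
    "audio": {
        "path": "it_test_0/common_voice_it_23606167.mp3",
        "array": [0.0, 2.21619011e-05, 2.72628222e-05],
        "sampling_rate": 48000,
    }
}
roundtrip_item = {
    "audio": {"path": None, "array": [0.0, 0.0, 0.0], "sampling_rate": 48000}
}

print(diff_datapoints(original_item, roundtrip_item))
```

With the real objects, the same call would be `diff_datapoints(datapoint, new_data2[0])`; an empty list would mean the round trip preserved every field within the chosen tolerance.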

Environment info

datasets version: 3.1.0
Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
Python version: 3.10.12
huggingface_hub version: 0.26.2
PyArrow version: 15.0.2
Pandas version: 2.2.2
fsspec version: 2024.3.1
