Inconsistent behavior of data_files and data_dir in the load_dataset method.
Steps to reproduce the bug
First
I have three files named 'train.json', 'val.json', and 'test.json'.
Each one contains a simple dict {text:'aaa'}.
Their paths are /data/train.json, /data/val.json, and /data/test.json.
I load the dataset once with the data_files argument and once with the data_dir argument.
The two results are not the same, even though the documentation states that the two arguments behave the same way.
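The two loading calls described above can be sketched roughly as follows. This is a minimal reproduction sketch, not the original code from the report: it assumes the files hold valid JSON such as {"text": "aaa"} and writes them to a temporary directory instead of /data. The helper names make_fixture and compare_split_names are hypothetical; only load_dataset("json", data_files=...) and load_dataset("json", data_dir=...) come from the datasets library.

```python
import json
from pathlib import Path


def make_fixture(root, names=("train.json", "val.json", "test.json")):
    """Write one small JSON record per file (assumed content) and return the paths."""
    root = Path(root)
    paths = []
    for name in names:
        path = root / name
        path.write_text(json.dumps({"text": "aaa"}))
        paths.append(path)
    return paths


def compare_split_names(root):
    """Load the same files both ways and return the split names each call yields."""
    # Imported lazily so make_fixture works even without datasets installed.
    from datasets import load_dataset

    paths = sorted(Path(root).glob("*.json"))
    by_files = load_dataset("json", data_files=[str(p) for p in paths])
    by_dir = load_dataset("json", data_dir=str(root))
    return list(by_files.keys()), list(by_dir.keys())
```

Calling make_fixture on an empty directory and then compare_split_names on the same directory should show whether the two calls produce the same set of splits.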
Second
If some filenames include 'test' while others do not, load_dataset returns only the test split and the other files are dropped.
Given two files named test.json and 1.json, each containing a simple dict {text:'aaa'}, I load the dataset with load_dataset.
Only test is returned; 1.json is missing.
Nothing changes even if I manually set split='train'.
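One possible workaround for the second case is to pass data_files as a mapping from split name to file list, which is a documented form of the argument. The sketch below assumes that an explicit mapping is used as given, so that filenames are not consulted; the helper names all_files_as_train and load_as_single_train are hypothetical.

```python
from pathlib import Path


def all_files_as_train(data_dir):
    """Map every .json file in data_dir to the 'train' split, ignoring filenames."""
    files = sorted(str(p) for p in Path(data_dir).glob("*.json"))
    return {"train": files}


def load_as_single_train(data_dir):
    """Load all JSON files in data_dir under an explicit 'train' split."""
    # Imported lazily so all_files_as_train works without datasets installed.
    from datasets import load_dataset

    return load_dataset("json", data_files=all_files_as_train(data_dir))
```

With test.json and 1.json in the directory, all_files_as_train lists both files under "train", so neither one is left to filename-based split detection.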
Expected behavior
Fix the above bugs.
Although the documentation says that load_dataset will "Find which file goes into which split (e.g. train/test) based on file and directory names or on the YAML configuration", I would like to be able to decide manually whether it does so. Users may accidentally put the string 'test' in a filename when they just want a single train dataset. If the number of files in data_dir is huge, it is not easy to find out what causes the second situation mentioned above.
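Until such an option exists, a small scan of the directory can help locate the files that might trigger split detection. The sketch below is only an approximation: the keyword list in SPLIT_KEYWORDS is an assumption made for illustration and may not match the patterns the library actually uses, and files_with_split_keywords is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed keyword list for illustration; the library's real patterns may differ.
SPLIT_KEYWORDS = ("train", "test", "val", "valid", "validation", "dev")


def files_with_split_keywords(data_dir):
    """Return {filename: keyword} for files whose stem contains a split-like token."""
    hits = {}
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file():
            continue
        # Split the stem on non-alphanumeric characters and compare whole tokens.
        tokens = re.split(r"[^a-z0-9]+", path.stem.lower())
        for keyword in SPLIT_KEYWORDS:
            if keyword in tokens:
                hits[path.name] = keyword
                break
    return hits
```

Running it on a large data_dir gives a short list of candidate files to rename or to assign to a split explicitly.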
Environment info
datasets==3.2.0
Ubuntu18.84