Can Autosklearn handle Multi-Class/Multi-Label Classification and which classifiers will it use? #1429

Open
asmgx opened this issue Mar 25, 2022 · 8 comments · May be fixed by #1530

@asmgx
asmgx commented Mar 25, 2022

I have been trying to use AutoSklearn for multi-class classification.

My labels look like this:

0 1 2 3 4 ... 200
1 0 1 1 1 ... 1
0 1 0 0 1 ... 0
1 0 0 1 0 ... 0
1 1 0 1 0 ... 1
0 1 1 0 1 ... 0
1 1 1 0 0 ... 1
1 0 1 0 1 ... 0

I used this code

y = y[:, (65,67,54,133,122,63,102,105,39)]
X = df.drop(Code, axis=1, errors='ignore')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

automl = autosklearn.classification.AutoSklearnClassifier(
    include={'feature_preprocessor': ["no_preprocessing"]},
    exclude={'classifier': ['random_forest']},
    time_left_for_this_task=60*5,
    per_run_time_limit=60*1,
    memory_limit=1024 * 10,
    n_jobs=-1,
    metric=autosklearn.metrics.f1_macro,
)

But now I want to train Autosklearn on multi-class multi-label classification.

Which of these methods should I use?

1-

clf = OneVsRestClassifier(automl, n_jobs=-1)
clf.fit(X_train, y_train)

2-


clf = automl
clf.fit(X_train, y_train)

3-

I have to loop one class at a time and use

clf = automl
clf.fit(X_train, y_train)

so it will be like

for i in (65,67,54,133,122,63,102,105,39):
    y = z[:, i]  # one label column at a time
    X = df.drop(Code, axis=1, errors='ignore')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    automl = autosklearn.classification.AutoSklearnClassifier(
        include={'feature_preprocessor': ["no_preprocessing"]},
        exclude={'classifier': ['random_forest']},
        time_left_for_this_task=60*5,
        per_run_time_limit=60*1,
        memory_limit=1024 * 10,
        n_jobs=1,
        metric=autosklearn.metrics.f1_macro,
    )

    clf = automl
    clf.fit(X_train, y_train)

so I get a different model for each label?
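
A minimal sketch of option 3, assuming one independent model per label column. The loop above rebinds `clf` on every pass, so only the last model survives; here each fitted model is kept in a dict keyed by its column index. `fit_per_label`, `predict_per_label`, and `MajorityClassifier` are hypothetical names, and the majority classifier is only a stand-in for any object with `fit`/`predict`, such as a freshly constructed `AutoSklearnClassifier`.

```python
from collections import Counter


class MajorityClassifier:
    """Stand-in estimator (hypothetical): always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def fit_per_label(X, Y, label_columns, make_model):
    """Fit one independent model per selected label column of Y."""
    models = {}
    for col in label_columns:
        y_col = [row[col] for row in Y]  # a single target column
        model = make_model()             # a new model for every column
        model.fit(X, y_col)
        models[col] = model
    return models


def predict_per_label(models, X):
    """Return one prediction row per sample, columns ordered like `models`."""
    per_column = [model.predict(X) for model in models.values()]
    return [list(row) for row in zip(*per_column)]


# Toy data: 4 samples, 2 features, 3 label columns; only columns 0 and 2 are used.
X = [[73, 84], [37, 88], [93, 90], [77, 44]]
Y = [[0, 1, 1], [0, 0, 1], [1, 0, 1], [0, 1, 1]]
models = fit_per_label(X, Y, label_columns=(0, 2), make_model=MajorityClassifier)
predictions = predict_per_label(models, X)
```

Passing a factory (`make_model`) rather than a single instance matters here: refitting one shared object would discard the previous column's model.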

@eddiebergman
Contributor

Hey again @asmgx,

Just as a note, the example you give at first is multi-label as there are multiple label columns, and not just one.

Method 2 will not work, as we do not natively support multi-class multi-label classification. This is because sklearn models usually don't support it natively and require adapters, similar to the one you show in option 1. However, option 1 will not work either; read its description carefully: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn-multiclass-onevsrestclassifier. It supports one or the other, but not both simultaneously.

In general, I don't think support for multi-class multi-label is very widespread, and I would advise reframing the problem as you suggest in option 3. One option, as you suggest, is to fit one classifier per multi-class target column and combine their results at the end. Another option is to one-hot encode each multi-class target column into multiple binary ones. In the same way you can one-hot encode categorical feature columns, you can one-hot encode target columns that contain multiple classes, repeating this for each column in your output. This can increase the number of target columns dramatically, depending on the number of classes, and it also makes translating between your original targets and the one-hot encoded variant more difficult to implement.
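
The one-hot reframing can be sketched roughly as follows, in plain Python so it runs without any library. `encode_targets` and `decode_targets` are hypothetical helper names, and the class list per column is assumed to be taken from the training targets.

```python
def encode_targets(Y, classes_per_column):
    """Expand each multiclass target column into one binary column per class."""
    encoded = []
    for row in Y:
        bits = []
        for value, classes in zip(row, classes_per_column):
            bits.extend(1 if value == c else 0 for c in classes)
        encoded.append(bits)
    return encoded


def decode_targets(encoded, classes_per_column):
    """Map binary indicator rows back to the original class values.

    Within each block of columns the highest value wins, so this also works
    on per-class scores and not only on exact 0/1 indicators.
    """
    decoded = []
    for bits in encoded:
        row, start = [], 0
        for classes in classes_per_column:
            block = bits[start:start + len(classes)]
            row.append(classes[block.index(max(block))])
            start += len(classes)
        decoded.append(row)
    return decoded


# Two multiclass target columns, three classes each.
Y = [[1, 2], [0, 2], [0, 0], [2, 1]]
classes_per_column = [sorted({row[j] for row in Y}) for j in range(len(Y[0]))]
Y_binary = encode_targets(Y, classes_per_column)   # 4 rows, 3 + 3 = 6 binary columns
Y_restored = decode_targets(Y_binary, classes_per_column)
```

One caveat with this design: a multilabel model trained on `Y_binary` is not constrained to predict exactly one active class per original column, which is why the decoder picks the largest value in each block rather than expecting a clean one-hot row.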

But to reiterate, we don't support it natively and implementation is left to the user.

Best,
Eddie

@vgargan2

Hello to all,

For my undergraduate thesis, I am trying to benchmark some AutoML tools. Specifically, I am trying to plot ROC curves and calculate the area under the ROC curve for multiclass (not multilabel) classification on some datasets from OpenML-CC18 using Autosklearn. Basically, I am trying to implement this using AutoSklearnClassifier.

As eddiebergman already correctly pointed out, the clf = OneVsRestClassifier(automl, n_jobs=-1); clf.fit(X_train, y_train) approach can't be used directly.

Can you please provide an example of how this can be done?

Thanks in advance!

@eddiebergman
Contributor

Hi @vgargan2,

We support regular multi-class classification out of the box. I realize we don't have an example that shows this, but we regularly test on the benchmark openml/s/218, which is similar in spirit to OpenML-CC18.
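
For the area under the ROC curve on a multiclass problem, one common route is a one-vs-rest average over the classes. The sketch below is plain Python and assumes you already have per-class probabilities for the test set, for example from the fitted classifier's `predict_proba`; `binary_auc`, `macro_ovr_auc`, and the toy numbers are made up for illustration. scikit-learn's `roc_auc_score` with `multi_class="ovr"` is the library route for the same kind of score.

```python
def binary_auc(y_true, scores):
    """AUC as the chance that a positive is scored above a negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def macro_ovr_auc(y_true, proba, classes):
    """Unweighted mean of the one-vs-rest AUC of every class."""
    aucs = []
    for k, cls in enumerate(classes):
        is_cls = [1 if t == cls else 0 for t in y_true]  # this class vs. the rest
        aucs.append(binary_auc(is_cls, [row[k] for row in proba]))
    return sum(aucs) / len(aucs)


# Toy test labels and class probabilities (columns ordered as classes 0, 1, 2).
y_test = [0, 1, 2, 2, 1, 0]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.3, 0.3, 0.4],
    [0.5, 0.25, 0.25],
    [0.6, 0.3, 0.1],
]
score = macro_ovr_auc(y_test, proba, classes=[0, 1, 2])
```

The same binarised labels (`is_cls`) and per-class score column are what you would feed to a ROC curve routine to plot one curve per class.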

In case this thread begins to confuse other readers, I'm going to distinguish the four cases and clarify which ones we support.

  • Binary Classification - Supported | e.g. [0, 1, 1, 0, 0, 1]
  • Multiclass Classification - Supported | e.g. [0, 2, 3, 1, 3, 3, 2, 1, 0]
  • Multilabel Classification - Supported | e.g. [[0, 1, 0], [1, 1, 0], [1, 0, 0]]
  • Multilabel Multiclass Classification - Not Supported | e.g. [[1, 2, 0], [2, 1, 0], [3, 2, 1]]
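
To check which of the four cases a target falls into, a rough helper like the one below may be enough. `describe_target` is a hypothetical name and the rule is a simplification: it looks only at whether `y` is nested and how many distinct values it holds. scikit-learn's `sklearn.utils.multiclass.type_of_target` performs a more careful version of this check and reports the last two cases as `multilabel-indicator` and `multiclass-multioutput`.

```python
def describe_target(y):
    """Name the kind of classification target, following the four cases above."""
    is_2d = len(y) > 0 and isinstance(y[0], (list, tuple))
    values = {v for row in y for v in row} if is_2d else set(y)
    is_binary = len(values) <= 2
    if not is_2d:
        return "binary" if is_binary else "multiclass"
    return "multilabel" if is_binary else "multiclass-multilabel"


kinds = [
    describe_target([0, 1, 1, 0, 0, 1]),
    describe_target([0, 2, 3, 1, 3, 3, 2, 1, 0]),
    describe_target([[0, 1, 0], [1, 1, 0], [1, 0, 0]]),
    describe_target([[1, 2, 0], [2, 1, 0], [3, 2, 1]]),
]
```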

Best,
Eddie

@asmgx
Author

asmgx commented Apr 11, 2022

@eddiebergman this is confusing.
You are saying that multilabel classification is supported, which is the same case as the example I gave at the beginning of this post.

Do you mean that a dataset with target values like the following is supported?

RowNo  Feature1  Feature2  Feature3  |  Label1  Label2  Label3  Label4  Label5
------------------------------------------------------------------------------
1      73        84        34        |  0       1       1       0       1
2      37        88        84        |  0       0       0       1       1
3      93        90        58        |  1       0       1       1       0
4      77        44        66        |  1       1       1       0       0
5      48        82        38        |  1       1       0       1       1
6      53        87        42        |  0       1       0       0       0
7      80        55        28        |  1       0       0       1       0
8      66        74        97        |  0       0       1       1       1

Can you advise how we can work with this example?

@eddiebergman
Contributor

@asmgx, I apologise, I misread your example in the very first section. Yes, that example is multilabel and it is supported. Nothing needs to be done to support it; autosklearn will work out of the box with those labels: automl = AutoSklearnClassifier(); automl.fit(X, y)

I read the column headers as being non-binary and assumed you meant multiclass-multilabel classification, especially given the title of the issue.

This whole issue shows that we should have a clear documentation section about this. I also sometimes mix up multiclass and multilabel, and I don't expect everyone to know that the two can be combined into the entirely different multiclass-multilabel setting, which sklearn has limited support for.

For those scrolling to the bottom of the issue:

import numpy as np
from autosklearn.classification import AutoSklearnClassifier

# Nothing has to be done for multi-label OR multi-class
X = np.random.rand(4, 2)  # 4 examples, 2 features

# For binary
binary_y = [1, 0, 1, 1]
automl = AutoSklearnClassifier()
automl.fit(X, binary_y)

# For multiclass
multiclass_y = [1, 2, 0, 2]
automl = AutoSklearnClassifier()
automl.fit(X, multiclass_y)

# For multilabel
multilabel_y = [[1, 0], [0, 0], [1, 1], [1, 0]]
automl = AutoSklearnClassifier()
automl.fit(X, multilabel_y)

# For multiclass-multilabel y
# NOT SUPPORTED
multiclass_multilabel_y = [[1, 2], [0, 2], [0, 0], [2, 1]]

@asmgx
Author

asmgx commented Apr 11, 2022

@eddiebergman Thanks,
Is there more documentation on how AutoSklearn supports multi-label datasets, and how it builds its models?
I know that not all algorithms support multi-label targets natively, so does it use OneVsRestClassifier internally, or does it loop over all the labels?

Is there any documentation that covers this?

@eddiebergman
Contributor

Nothing special is done. When doing multi-label classification, we only consider models that natively support multilabel classification.

if dataset_properties.get('multilabel') is True and \

There's no documentation covering this, but there probably should be something that describes all of it.
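
The idea of only considering natively capable models can be illustrated with the sketch below. This is not auto-sklearn's code: `Candidate`, `supports_multilabel`, `usable_candidates`, and the model names are all hypothetical, and only the `dataset_properties.get('multilabel')` lookup mirrors the line quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one candidate classifier."""
    name: str
    supports_multilabel: bool


def usable_candidates(candidates, dataset_properties):
    """Drop candidates that cannot handle multilabel targets natively."""
    multilabel = dataset_properties.get("multilabel") is True
    return [c.name for c in candidates if c.supports_multilabel or not multilabel]


candidates = [
    Candidate("model_a", supports_multilabel=True),
    Candidate("model_b", supports_multilabel=False),
    Candidate("model_c", supports_multilabel=True),
]
for_multilabel = usable_candidates(candidates, {"multilabel": True})
for_other_tasks = usable_candidates(candidates, {})
```

With this kind of filter no wrapper such as `OneVsRestClassifier` is involved; models without native support are simply left out of the search for multilabel data.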

@mfeurer
Contributor

mfeurer commented Apr 19, 2022

We document the supported tasks here, but we should potentially rename this to "supported target types" and link to scikit-learn's glossary; for example, for multi-label we should link to https://scikit-learn.org/stable/glossary.html#term-multilabel. Indeed, we have no documentation on which classifier is used for which target type, and it would be great to have that.

@eddiebergman eddiebergman changed the title How to use Autosklearn with Multi-Class Multi-Label Classification? How to use Autosklearn with Multi-Class/Multi-Label Classification? Apr 19, 2022
@eddiebergman eddiebergman added documentation Something to be documented and removed Feedback-Required labels Apr 19, 2022
@eddiebergman eddiebergman changed the title How to use Autosklearn with Multi-Class/Multi-Label Classification? Can Autosklearn handle Multi-Class/Multi-Label Classification and which classifiers will it use? Apr 24, 2022
@eddiebergman eddiebergman linked a pull request Jun 10, 2022 that will close this issue
@eddiebergman eddiebergman self-assigned this Jun 10, 2022
@eddiebergman eddiebergman added this to the v0.16 milestone Jun 10, 2022
@eddiebergman eddiebergman linked a pull request Jun 24, 2022 that will close this issue