Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sample naming #248

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Add sample naming #248

wants to merge 1 commit into from

Conversation

gkobeaga
Copy link
Contributor

It might be useful to have named samples for debugging and exploring the data, e.g. identifying outliers.

Comments:

  • I have my doubts with the method sample_names (and feature_names) returning a non-empty vector even if the names for the samples are not set? As it is, when sample_names is empty, it returns vec!["sample-1", "sample-2",..,"sample-{nsamples}"]. The same happens for feature_names. However, in some methods where new dataset are created transforming the input dataset, the output datasets have a non empty sample_names field even if the input dataset has an empty sample_names field. For instance, it happens for the preprocessing methods. An alternative choice could be to make public the feature_names and the sample_names fields (as for the weights field), and to clone them if needed.
  • I have added sample names for the example datasets (diabetes...). I wanted to implement test_retain_sample_names for the preprocessing methods. Is it ok?
  • The method with_feature_names does not check if the given vector has length nfeatures. Do you want to implement it in this PR, as I have done for with_sample_names?

@YuhanLiin
Copy link
Collaborator

YuhanLiin commented Oct 19, 2022

I don't think having named samples is a good idea, because there's almost always way more samples than features, and dedicating one string per sample can get out of hand very quickly. Also, sample names don't play well with methods that reorder or extract samples from the dataset (this isn't done for features).

I do agree about feature_names returning non-empty vector if when there are no names. It should instead just return empty list. Ideally the signature should be fn feature_names(&self) -> &[String] to prevent cloning, but this can be in a separate MR. Also with_feature_names should definitely check the length of the input vector. The input should be Vec<String> (again, this change can be in another MR).

In #70 it mentions target names, so you could take a crack at that instead.

@gkobeaga
Copy link
Contributor Author

I understand your doubts, but I think that we can resolve them.

Regarding the memory requirements, the sample names are optional. If for some use cases these might be memory-consuming, they can just be ignored. We can also set a string capacity for sample names if we anticipate performance issues in large datasets.

Regarding the reordering of the samples, I agree that it should be done carefully when sample names are present. But we already do it for the targets. Anyway, as far as I know, the reordering of samples is only done in the cross-validation, and in this case, we can ignore the sample names, since the intermediate databases are not returned by the method.

Thanks for your suggestions for features_names, I will implement them. (When you say MR do you mean PR? sorry I don't know what MR means).

@YuhanLiin
Copy link
Collaborator

The memory usage would be too large for most real-world datasets. Capping the string capacity doesn't solve this because the problem is the number of strings, which must match the number of samples, leading to an unacceptable number of allocations. The usefulness of sample names is also limited IMO, since you can easily track down an outlier by finding it in the original dataset. Also, most datasets for ML don't come with sample names that you can just import, so users will need to go through the trouble of creating their own generic names for every sample, making it hard to use.

Sample names also cause trouble with methods that extracts subsets of samples from a dataset, such as bootstrap_samples. This happens a lot more with samples than with targets, and the sample names will need to be returned. I don't want to add the maintenance overhead of splitting sample names correctly in those functions.

Finally, in the scikit-learn dataset API, feature and target names are available but sample names are not. Sample names have no precedent in the popular ML library, are not easy to use (in fact they are easy to misuse, since they'd consume too much memory on most datasets), and their usecase is ultimately too niche to justify the maintenance overhead IMO.

MR is Gitlab-speak for PR. I mixed it up since I'm using it at my day job lol.

@YuhanLiin YuhanLiin mentioned this pull request Nov 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants