Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cost-Efficient Gradient Boosting (GreedyMiser) for XGBoost #11010

Open
eugeneyarovoi opened this issue Nov 20, 2024 · 2 comments
Open

Cost-Efficient Gradient Boosting (GreedyMiser) for XGBoost #11010

eugeneyarovoi opened this issue Nov 20, 2024 · 2 comments

Comments

@eugeneyarovoi
Copy link

eugeneyarovoi commented Nov 20, 2024

I'm interested in having support for cost-efficient gradient boosting in XGBoost. Glossing over non-essential details, CEBG is the application of a GreedyMiser-like strategy to second-order XGBoost-style optimization. LightGBM has support for this under parameters beginning with "cegb" (see https://lightgbm.readthedocs.io/en/latest/Parameters.html)

To summarize the idea, we maintain a set of indicator variables indicating whether each feature has already been used somewhere in the forest that has been trained. Whenever a new feature that has not yet been used anywhere in the forest is being considered for a split, it has to pay a penalty (that can be specified per-feature, or is the same for all features) that can be seen as lowering the gain of the split. The goal is to nudge the tree towards re-using features it has already used, unless the unused features result in significantly better splits.

The purpose of this feature is to enable the training of models that are parsimonious in terms of how many features they use. In industry, models using thousands of features are not uncommon, but many of the features likely have low importance and could be removed without much impact. Models that use fewer features are cheaper to serve in production and require smaller payloads, reducing networking and data marshaling costs.

There are many ways to do feature selection, including methods that are done independently of the model, but XGBoost itself is a decent feature selection algorithm. Therefore one approach to feature selection can look like this: 1) train an XGBoost model on all features; 2) get the feature importance; 3) train another XGBoost model with only the top K features; 4) check that the new model is not much worse than the model having all the features. However, this feature selection strategy is not ideal, because highly-correlated but important features are likely to all be used during training. With the proposed feature-use penalty, if features A and B are correlated, and only A has been used thus far, then B will only be used if gain(best split of B) > gain(best split of A) + penalty_term. So, training with this feature turns XGBoost into a more powerful feature selector based on a fairly theoretically-sound basis: a feature is admitted when it provides a contribution to reducing loss than cannot be captured by existing features.

I want to get feedback on this idea:

  • If I were to contribute this change, and it could be done neatly in the code, would it be accepted?
  • Is there any reason this would be especially difficult to implement? It is straightforward to implement in a naive sequential implementation of XGBoost learning, but I'm not sure if having something like indicator variables would prove difficult in a distributed context.
@trivialfis
Copy link
Member

Thank you for raising the issue and for the very nice summary. I will look into the paper (I think I read it before but couldn't recall).

If I were to contribute this change, and it could be done neatly in the code, would it be accepted?

I think so. Will update this once I go through the paper and see if there's an easy way to do it, at least for one tree method and for one hardware architecture.

Is there any reason this would be especially difficult to implement?

I can't answer that until I fully understand the algorithm.

@eugeneyarovoi
Copy link
Author

eugeneyarovoi commented Dec 6, 2024

Just to clarify, the CEGB paper actually implements three kinds of cost penalties:

  • a penalty for the evaluation time taken to evaluate each tree
  • a penalty for the first time a feature is used for a given data row
  • a penalty for the first time a feature is used while constructing a tree

I'm only proposing to implement the last of these three. The first one was designed to trade off tree depth and evaluation speed (but in practice XGBoost inference is pretty fast, so I don't think it's needed), and the second one only makes sense if features are fetched as-needed as a model is being evaluated (only if used in a path that's taken in a tree).

The third one, the penalty for the first time a feature is used in tree construction, should be the simplest and the one I think has the most value, so it is the only one I'm proposing to implement. I figure almost all systems that do inference with XGBoost compute all the features used up-front without attempting to determine which features are strictly needed because they appear on a path that will be taken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants