
Segmentation fault when trying to get feature importance of multilabel binary classifier #10686

Open
shreyaspuducheri23 opened this issue Aug 9, 2024 · 11 comments


@shreyaspuducheri23

  • Operating System: linux
  • Python Version: 3.10.14
  • XGBoost Version: 2.1.0

I am experiencing a segmentation fault with XGBoost 2.1.0 when accessing feature importances in a multi-label binary classification model. The model trains and predicts as expected; however, when I attempt to retrieve feature importances using either xgb_model.feature_importances_ or xgb_model.get_score(importance_type='weight'), the process crashes. In a Jupyter kernel this kills the kernel, and when run from the terminal it prints "Segmentation fault". Other operations, such as fitting and predicting, work without any problems.

@trivialfis
Member

Thank you for sharing! Will try to reproduce it.

@trivialfis
Member

trivialfis commented Aug 12, 2024

Hi @shreyaspuducheri23 , could you please share a reproducible example? I tried the following toy example and did not observe a segfault:

from sklearn.datasets import make_multilabel_classification
import xgboost as xgb


X, y = make_multilabel_classification()
clf = xgb.XGBClassifier()
clf.fit(X, y)
clf.feature_importances_
clf.get_booster().get_score(importance_type='weight')

@shreyaspuducheri23
Author

Hi @trivialfis, the issue arises when using the vector-leaf option:

from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

X, y = make_multilabel_classification(n_classes=2, n_labels=2,
                                      allow_unlabeled=False,
                                      random_state=1)

# Vector-leaf trees: a single tree produces outputs for all labels.
clf = xgb.XGBClassifier(multi_strategy='multi_output_tree')
clf.fit(X, y)
clf.feature_importances_  # segfaults here
clf.get_booster().get_score(importance_type='weight')  # likewise segfaults
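
For comparison, the default strategy handles the same data without crashing (a minimal sketch, reusing X and y from the snippet above):

# Sketch: the default one-output-per-tree strategy returns importances
# normally on the same data.
clf_default = xgb.XGBClassifier(multi_strategy='one_output_per_tree')
clf_default.fit(X, y)
print(clf_default.feature_importances_)
print(clf_default.get_booster().get_score(importance_type='weight'))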

@trivialfis
Member

Ah, that parameter is still a work in progress. I will implement feature importance for it after sorting out some current work.

@shreyaspuducheri23
Copy link
Author

I see, thank you! Do you have an estimated time frame, i.e. weeks, months, etc.? Just wondering whether it would be in my best interest to wait for the feature or to switch to one_output_per_tree for my current project.

@trivialfis
Member

Opened a PR to add support for weight: #10700 . Other importance types may take some time; I don't have an ETA yet.

If the PR is approved, you can use the nightly build for testing.
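
For reference, a quick sanity check before re-testing on a nightly wheel (a sketch; the exact version string of dev builds may differ):

import xgboost as xgb

# Release wheels report a plain version such as "2.1.0"; nightly/dev
# builds carry a pre-release suffix, so this confirms which is installed.
print(xgb.__version__)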

@abseejp

abseejp commented Sep 5, 2024

@trivialfis, I'm here because of the same issue @shreyaspuducheri23 has. I can see that your change (#10700) has been approved and merged, but I still can't get feature importances properly when I set multi_strategy to multi_output_tree: when I tried, it returned 0.0 as the importance for every feature.

On a separate note, when I set multi_strategy to one_output_per_tree, I get a single 1D array of feature importances even though I have 3 labels. What's going on under the hood? I was expecting a feature importance for each label, since three independent models are built.

@AnthonyYao7

I would like to work on this

@trivialfis
Member

I was expecting to get feature importance for each label since three different independent models are built.

They are combined to represent the whole model rather than the individual per-label models.

I would like to work on this

Thank you for volunteering! Maybe #10700 can be a good start for looking into where it's calculated?
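
If you need per-label numbers with one_output_per_tree in the meantime, one possible workaround is to split the per-tree statistics yourself. A sketch, not an official API, assuming trees are laid out round-robin across targets so that tree i belongs to label i % n_labels:

import xgboost as xgb
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(n_classes=3, random_state=1)
clf = xgb.XGBClassifier(multi_strategy='one_output_per_tree', n_estimators=10)
clf.fit(X, y)

df = clf.get_booster().trees_to_dataframe()
n_labels = y.shape[1]
splits = df[df['Feature'] != 'Leaf']  # keep only split nodes

# 'weight'-style importance per label: count splits on each feature,
# attributing tree i to label i % n_labels (assumed round-robin layout).
per_label = (
    splits.assign(label=splits['Tree'] % n_labels)
          .groupby(['label', 'Feature'])
          .size()
          .unstack(fill_value=0)
)
print(per_label)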

@abseejp

abseejp commented Sep 19, 2024

Thanks @trivialfis for your response. When you say they are combined, what combination method is used? Is it the average of the feature importances across all the models for each feature?

@trivialfis
Member

Either the total or the average, depending on the importance type you specified (e.g. total_gain vs. gain).
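
For illustration, the average/total relationship is easy to check on a single-label model (a sketch): 'gain' is the average gain per split, 'total_gain' the sum, so gain * weight recovers total_gain:

from sklearn.datasets import make_classification
import xgboost as xgb

X, y = make_classification(random_state=0)
booster = xgb.XGBClassifier(n_estimators=10).fit(X, y).get_booster()

weight = booster.get_score(importance_type='weight')          # number of splits
gain = booster.get_score(importance_type='gain')              # average gain per split
total_gain = booster.get_score(importance_type='total_gain')  # summed gain

# average * count == total, up to float rounding
for f in gain:
    assert abs(gain[f] * weight[f] - total_gain[f]) <= 1e-6 * max(1.0, total_gain[f])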
