Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What does max_cat_threshold actually control? #10844

Open
oliviermeslin opened this issue Sep 24, 2024 · 1 comment
Open

What does max_cat_threshold actually control? #10844

oliviermeslin opened this issue Sep 24, 2024 · 1 comment

Comments

@oliviermeslin
Copy link

TL;DR: I'd like to know what exactly max_cat_threshold controls and I may suggest marginal improvements of the documentation.


I'm quite interested in XGBoost's support for categorical features. I dived into the documentation, but can't understand the exact effect of max_cat_threshold. By reading the C++ code (here), I understand that it is used to determine the begin oand end points of the double scan of the sorted histogram. Here is an example:

Case with max_cat_threshold = 1

In this case all partitions are considered.

     | Forward  | Backward |
     |----------+----------|
     | [BCDEF] A | F [ABCDE] |
     | [CDEF] AB | EF [ABCD] |
     | [DEF] ABC | DEF [ABC] |
     | [EF] ABCD | CDEF [AB] |
     | [F] ABCDE | BCDEF [A] |

Case with max_cat_threshold = 2

In this case only partitions with 2+ categories are considered.

      | Forward  | Backward |
      |----------+----------|
      | [CDEF] AB | EF [ABCD] |
      | [DEF] ABC | DEF [ABC] |
      | [EF] ABCD | CDEF [AB] |

Is this the way max_cat_threshold? If yes, I might open a PR to add a paragraph here. Does it sound like a good idea?

@trivialfis
Copy link
Member

trivialfis commented Sep 25, 2024

Hi, apologies for the ambiguity there. I wrote the description in https://xgboost.readthedocs.io/en/latest/parameter.html#cat-param

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting.

It means the split enumeration would stop once it hit the max_cat_threshold, MAXimum number of CATegories considered for each split.

Case with max_cat_threshold = 1
In this case all partitions are considered.

No, the split enumeration would consider 1 partition for backward and forward enumeration. So:

| [BCDEF] A | F [ABCDE] |

then stop. So, only one category is considered for each scan direction. As shown in the definition of iteration end:

it_end = it_begin + n_bins - 1;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants