
[WIP] Optimal intercept initialization for simple objectives #10298

Open · wants to merge 22 commits into master

Conversation

@david-cortes (Contributor) commented May 18, 2024

ref #9899

This PR modifies the intercept initialization for simple objectives (logistic, poisson, gamma, tweedie) to use their closed-form optimal solutions (as in: the number that minimizes the objective function) instead of a non-optimal one-step Newton.

For these objectives, the optimal intercept is simply the link function applied to the mean of the response variable. Since base_score already undergoes this transformation, the PR just changes the calculation to use the mean of the response variable in those cases.
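For reference, a quick sketch of why this holds in the unweighted logistic case (the other listed objectives work analogously through their link functions, as described above):

$$
\frac{d}{db}\sum_i \big[\log(1 + e^{b}) - y_i\, b\big] \;=\; \sum_i \big[\sigma(b) - y_i\big] \;=\; 0
\quad\Longrightarrow\quad
\sigma(b) = \bar{y}
\quad\Longrightarrow\quad
b = \mathrm{logit}(\bar{y}),
$$

where $b$ is the raw-score intercept, $\sigma$ the sigmoid, and $\bar{y}$ the mean of the labels.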

For multi-target versions of these objectives, the intercept is set to zero instead, since applying a common intercept across targets might not make much sense for the given problem.

Note that there's still room for improvement:

  • Custom user-defined objectives would most likely be better served by a default score of zero or by a 1D Newton estimation. I wasn't sure where in the code to detect that a user-defined objective was passed, though.
  • Other objectives would likely benefit from using more than one Newton step for the intercept estimation.

Note 1: I wasn't sure how to calculate a weighted sample mean here (I'm not familiar with GPU computing and the 'devices' logic). It would be helpful to have a WeightedMean function under stats, if possible, to use when sample weights are present.
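For the CPU side, the computation I have in mind is just this (a rough sketch; the name and placement of the helper are only a suggestion, and it ignores the device dispatch entirely):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: weighted sample mean sum(w_i * y_i) / sum(w_i),
// accumulated in double to reduce rounding error with many rows.
double WeightedMean(std::vector<float> const& y, std::vector<float> const& w) {
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    num += static_cast<double>(w[i]) * static_cast<double>(y[i]);
    den += static_cast<double>(w[i]);
  }
  return num / den;  // assumes the total weight is positive
}
```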

Note 2: The compiler checks here don't like turning a linalg::Tensor<T, 2> into a linalg::Tensor<T, 1> via reinterpret_cast, and I'm not sure what the right way to do it without a data copy would be.
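To illustrate the kind of thing I'm after (generic standard C++, not xgboost's linalg API): a non-owning flat view over the same buffer avoids both the cast and the copy.

```cpp
#include <cstddef>
#include <span>
#include <vector>

int main() {
  // A 2D buffer stored contiguously (n_rows x n_cols), standing in for the
  // tensor's underlying storage.
  std::size_t n_rows = 4, n_cols = 3;
  std::vector<float> data(n_rows * n_cols, 0.0f);

  // A flat, non-owning 1D view over the same memory: no copy, no
  // reinterpret_cast, and the element type is unchanged.
  std::span<float> flat{data.data(), data.size()};
  flat[0] = 1.0f;  // writes through to `data`
  return 0;
}
```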

Note 3: I wasn't sure where to add tests for the changes here. For example, it would be ideal to test that binary:logistic and binary:logitraw produce the same raw scores, but I'm not sure where the right place for such a test would be.

@david-cortes changed the title from "Optimal intercept initialization for simple objectives" to "[WIP] Optimal intercept initialization for simple objectives" on May 18, 2024
@trivialfis (Member) commented May 20, 2024

Thank you for working on this! I will look into the changes.

@trivialfis self-assigned this on Jul 15, 2024
@trivialfis force-pushed the intercepts branch 2 times, most recently from eaa8453 to 70ac561 on July 31, 2024
@trivialfis (Member) commented Jul 31, 2024

Hi @david-cortes, I have added sample mean and weighted sample mean functions. However, I removed the new inv-link-zero function: since we are moving away from the binary model and will be dropping the capability of saving the binary model in this release, I don't think it's necessary to add new behavior to the codebase to work around it.

> Custom user-defined objectives would most likely be better served by a default score of zero or by a 1D Newton estimation. I wasn't sure where in the code to detect that a user-defined objective was passed, though.

We will work on this in a different PR. At the moment, the custom objective is quite primitive, and there is a lot of work we need to do to bring it closer to the built-in objectives in terms of functionality.

> Other objectives would likely benefit from using more than one Newton step for the intercept estimation.

It's gradient boosting; we can always stack more models on top instead of making one model perfect.

@trivialfis (Member) commented:

Please let me know if you want me to take over from here.

@david-cortes (Contributor, Author) commented Jul 31, 2024

@trivialfis Thanks for looking into this.

Since the binary format will be dropped, it would be better to do this more thoroughly by making deeper changes:

  • Making the base_score work on the untransformed / raw prediction scale, so that the user can output any real number regardless of what the link function is, and so that it would have the same interpretation for built-in and custom objectives.
  • Defaulting to zero in the untransformed scale for custom objective functions.
  • Not converting the base_score to a string, thereby avoiding potential losses of precision.
  • Adding a compatibility workaround for transforming models that use the old json/ubjson base_score to the new one (guess it would involve adding transformations for each link function).

So yes, please take over.

By the way, while you're at it, I understand that the removal of the old binary format should also pave the way for vector-valued intercepts. Once those get added, it would also be nice to arrange the intercepts for the multinomial logistic objective so that they sum to zero, like GLMNET does.

@trivialfis (Member) commented:

These are excellent suggestions.

> Since the binary format will be dropped

We will remove the capability of saving it in this release, and then the next release will remove the loading part. This can take some time.

> Making the base_score work on the untransformed / raw prediction scale

To implement this, we will have to add a new parameter, probably called intercept. Otherwise, it's a breaking change for which it's impossible to show a warning.

> Not converting the base_score to a string, thereby avoiding potential losses of precision

This one is more difficult: currently, all parameters are passed as strings. Using JSON might help, but floating-point serialization requires the encoder and decoder to match; otherwise there will be precision losses.
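For what it's worth, a generic sketch of the precision-preserving direction (not what the codebase does today): printing with max_digits10 makes the text representation round-trip exactly, so the remaining question is only whether every decoder in the chain parses it faithfully.

```cpp
#include <iostream>
#include <limits>
#include <sstream>
#include <string>

int main() {
  float base_score = 0.3f;

  // max_digits10 (9 for float) guarantees that parsing the string back
  // recovers exactly the same value.
  std::ostringstream os;
  os.precision(std::numeric_limits<float>::max_digits10);
  os << base_score;
  std::string serialized = os.str();

  float round_tripped = std::stof(serialized);
  std::cout << serialized << " round-trips exactly: "
            << (round_tripped == base_score) << "\n";
  return 0;
}
```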

I will take over the PR; please let me know if there are other things I can help with in the R package. I'm hoping to include this in the next release (2.2). We still have plenty of time.

@trivialfis (Member) commented Aug 1, 2024

Hmm, I ran into an issue with the weighted sample mean. If the weights do not strictly form a probability simplex summing to 1, we can generate an invalid mean value, such as a mean greater than 1 for logistic regression.

@david-cortes (Contributor, Author) commented:

> Hmm, I ran into an issue with the weighted sample mean. If the weights do not strictly form a probability simplex summing to 1, we can generate an invalid mean value, such as a mean greater than 1 for logistic regression.

I'm not sure what kind of impact it'd have, but there's always the option of doing a more precise sequential calculation like this:
https://github.com/david-cortes/isotree/blob/f8983ae2170c8c675b4ccda1f6ed394678c26a0a/src/mult.hpp#L183

Or to use compensated sums.
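A minimal sketch of the sequential variant I mean (a generic illustration, not the linked code verbatim):

```cpp
#include <cstddef>
#include <vector>

// Running weighted mean: each step is a convex update, so the result stays
// (up to rounding) within the range of the observations seen so far.
double RunningWeightedMean(std::vector<double> const& y,
                           std::vector<double> const& w) {
  double mean = 0.0, wsum = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    if (w[i] <= 0.0) continue;  // zero-weight rows contribute nothing
    wsum += w[i];
    mean += (y[i] - mean) * (w[i] / wsum);
  }
  return mean;  // assumes at least one strictly positive weight
}
```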

Although from a look at the code, could it perhaps be that some matrix is in the wrong memory layout? I see it uses:

```cpp
auto cidx = i / n_rows;
auto ridx = i % n_rows;
```

But it doesn't validate that the matrix is column-major.

@trivialfis (Member) commented Aug 2, 2024

> I'm not sure what kind of impact it'd have

It violates the range of logistic regression, [0, 1]: the prediction (after applying the inverse link) cannot exceed 1, but with inappropriate weights we can make the weighted mean greater than 1. At the moment, we only require the weights to be positive, which is insufficient to guarantee a valid sample mean for logistic regression. One such example (for illustration) would be simply reusing the target y as the weight, resulting in a weight of 0 for the 0 labels and a weight of 1 for the 1 labels.

> could it perhaps be that some matrix is in the wrong memory layout

I will add a restriction once the weight issue is resolved. The matrix view supports different memory layout types, so we can dispatch on them.

@david-cortes (Contributor, Author) commented Aug 2, 2024

> One such example (for illustration) would be simply reusing the target y as the weight, resulting in a weight of 0 for the 0 labels and a weight of 1 for the 1 labels.

But that is not an issue with the computation: if the data contains only examples of one class, then the optimal solution for a regularized method like XGBoost's is to make the intercept infinite and not use any feature for generating scores.

If a user wishes to use XGBoost as a one-class classifier, the correct behavior would be to manually set the intercept to zero.

I guess the most logical course of action in such cases would be to throw an error explaining that the response is constant.

> At the moment, we only require the weights to be positive, which is insufficient to guarantee a valid sample mean for logistic regression

If the weights are non-negative and the floating-point computations had infinite precision, then it wouldn't be possible to arrive at a value greater than the maximum or smaller than the minimum of the inputs that go into that mean. But I can see it going wrong with fp32 when the number of rows is large, whether the weights sum to 1 or not.

Some other ideas:

  • Perhaps you could divide the weights by their sum outside of the function that goes into the transformation, so as to force the operation order to be y * (w / sum(w)) rather than (y * w) / sum(w) (or perhaps fma(y, w / sum(w), agg)); I'm not sure that's the issue here, though.
  • For the specific case of logistic regression, another possibility could be to sum the weights by class instead (fewer numbers to sum = more precise computations); see the sketch after this list.
  • Maybe the aggregate could be kept in fp64 precision?
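As an illustration of the second idea (a sketch, assuming the labels are exactly 0 or 1 and the weights are non-negative): the mean reduces to a ratio of two weight sums, and since the division happens last, the result stays in [0, 1] up to rounding.

```cpp
#include <cstddef>
#include <vector>

// Weighted mean of {0,1} labels, computed as the summed weight of the
// positive class over the total weight.
double WeightedPositiveRate(std::vector<float> const& y,
                            std::vector<float> const& w) {
  double w_pos = 0.0, w_total = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    w_total += w[i];
    if (y[i] == 1.0f) {
      w_pos += w[i];
    }
  }
  return w_pos / w_total;  // assumes the total weight is positive
}
```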

Are you able to provide an example where the result is outside the range [0,1] when the inputs are all {0,1}?

@trivialfis (Member) commented:

> I guess the most logical course of action in such cases would be to throw an error explaining that the response is constant.

Makes sense.

> Some other ideas:

Thank you for sharing; I will look into them. I spent a bit of time today proving that the sample mean is the minimizer. I will continue to work on the PR next week.

@trivialfis (Member) commented Aug 9, 2024

> could it perhaps be that some matrix is in the wrong memory layout

Added notes about why the reduction is written in this way.

todo for myself:

  • add allreduce synchronization.
  • add tests.
  • edge cases, including empty labels for binary classification.
