
[WIP] Optimal intercept initialization for simple objectives #10298

Open · wants to merge 22 commits into master

Conversation

@david-cortes (Contributor) commented May 18, 2024

ref #9899

This PR modifies the intercept initialization for simple objectives (logistic, poisson, gamma, tweedie) to use their closed-form optimal solutions (as in: the number that minimizes the objective function) instead of a non-optimal one-step Newton.

For these objectives, the optimal intercept is simply the link function applied to the mean of the response variable. Since base_score already undergoes this transformation, the PR just changes the calculation to use the mean of the response variable in those cases.
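For reference, a quick sketch of why this holds in the unweighted logistic case (the other listed objectives work analogously through their link functions, as described above):

$$
\frac{d}{db}\sum_i \big[\log(1 + e^{b}) - y_i\, b\big] \;=\; \sum_i \big[\sigma(b) - y_i\big] \;=\; 0
\quad\Longrightarrow\quad
\sigma(b) = \bar{y}
\quad\Longrightarrow\quad
b = \mathrm{logit}(\bar{y}),
$$

where $b$ is the raw-score intercept, $\sigma$ the sigmoid, and $\bar{y}$ the mean of the labels.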

For multi-target versions of these objectives, the intercept is set to zero instead, since applying a common intercept across targets might not make much sense for the given problem.

Note that there's still room for improvement:

  • Custom user-defined objectives would most likely be better served by a default score of zero or by a 1D Newton estimation. I wasn't sure where in the code to detect that a user-defined objective was passed, though.
  • Other objectives would likely benefit from using more than one Newton step for the intercept estimation.

Note 1: I wasn't sure how to calculate a weighted sample mean here (I'm not familiar with GPU computing and the 'devices' logic). It would be helpful to have a WeightedMean function under stats, if possible, to use when sample weights are present.
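For the CPU side, the computation I have in mind is just this (a rough sketch; the name and placement of the helper are only a suggestion, and it ignores the device dispatch entirely):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: weighted sample mean sum(w_i * y_i) / sum(w_i),
// accumulated in double to reduce rounding error with many rows.
double WeightedMean(std::vector<float> const& y, std::vector<float> const& w) {
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    num += static_cast<double>(w[i]) * static_cast<double>(y[i]);
    den += static_cast<double>(w[i]);
  }
  return num / den;  // assumes the total weight is positive
}
```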

Note 2: The compiler checks here don't like turning a linalg::Tensor<T, 2> into a linalg::Tensor<T, 1> via reinterpret_cast, and I'm not sure what the right way to do it without a data copy would be.
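To illustrate the kind of thing I'm after (generic standard C++, not xgboost's linalg API): a non-owning flat view over the same buffer avoids both the cast and the copy.

```cpp
#include <cstddef>
#include <span>
#include <vector>

int main() {
  // A 2D buffer stored contiguously (n_rows x n_cols), standing in for the
  // tensor's underlying storage.
  std::size_t n_rows = 4, n_cols = 3;
  std::vector<float> data(n_rows * n_cols, 0.0f);

  // A flat, non-owning 1D view over the same memory: no copy, no
  // reinterpret_cast, and the element type is unchanged.
  std::span<float> flat{data.data(), data.size()};
  flat[0] = 1.0f;  // writes through to `data`
  return 0;
}
```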

Note 3: I wasn't sure where to add tests for the changes here. For example, it would be ideal to test that binary:logistic and binary:logitraw produce the same raw scores, but I'm not sure where the right place for such a test would be.

@david-cortes changed the title from "Optimal intercept initialization for simple objectives" to "[WIP] Optimal intercept initialization for simple objectives" on May 18, 2024
@trivialfis (Member) commented May 20, 2024

Thank you for working on this! I will look into the changes.

@trivialfis self-assigned this on Jul 15, 2024
@trivialfis force-pushed the intercepts branch 2 times, most recently from eaa8453 to 70ac561 on July 31, 2024
@trivialfis (Member) commented Jul 31, 2024

Hi @david-cortes, I have added sample mean and weighted sample mean functions. However, I removed the new inv-link-zero function: since we are moving away from the binary model and will be dropping the capability of saving the binary model in this release, I don't think it's necessary to add new behavior to the codebase to work around it.

> Custom user-defined objectives would most likely be better served by a default score of zero or by a 1D Newton estimation. I wasn't sure where in the code to detect that a user-defined objective was passed, though.

We will work on this in a different PR. At the moment, the custom objective is quite primitive, and there is a lot of work we need to do to bring it closer to the built-in objectives in terms of functionality.

> Other objectives would likely benefit from using more than one Newton step for the intercept estimation.

It's gradient boosting; we can always stack more models on top instead of making one model perfect.

@trivialfis (Member) commented:

Please let me know if you want me to take over from here.

@david-cortes (Contributor, Author) commented Jul 31, 2024

@trivialfis Thanks for looking into this.

Since the binary format will be dropped, it would be better to do this more thoroughly by making deeper changes:

  • Making the base_score work on the untransformed / raw prediction scale, so that the user can output any real number regardless of what the link function is, and so that it would have the same interpretation for built-in and custom objectives.
  • Defaulting to zero in the untransformed scale for custom objective functions.
  • Not converting the base_score to a string, thereby avoiding potential losses of precision.
  • Adding a compatibility workaround for transforming models that use the old json/ubjson base_score to the new one (guess it would involve adding transformations for each link function).

So yes, please take over.

By the way, while you're at it, I understand that the removal of the old binary format should also pave the way for vector-valued intercepts. Once those get added, it would also be nice to arrange the intercepts for the multinomial logistic objective so that they sum to zero, like GLMNET does.

@trivialfis (Member) commented:

These are excellent suggestions.

> Since the binary format will be dropped

We will remove the capability of saving it in this release, and then the next release will remove the loading part. This can take some time.

> Making the base_score work on the untransformed / raw prediction scale

To implement this, we will have to add a new parameter, probably called intercept. Otherwise, it's a breaking change for which it's impossible to show a warning.

> Not converting the base_score to a string, thereby avoiding potential losses of precision

This one is more difficult: currently, all parameters are passed as strings. Using JSON might help, but floating-point serialization requires the encoder and decoder to match; otherwise there will be precision losses.
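For what it's worth, a generic sketch of the precision-preserving direction (not what the codebase does today): printing with max_digits10 makes the text representation round-trip exactly, so the remaining question is only whether every decoder in the chain parses it faithfully.

```cpp
#include <iostream>
#include <limits>
#include <sstream>
#include <string>

int main() {
  float base_score = 0.3f;

  // max_digits10 (9 for float) guarantees that parsing the string back
  // recovers exactly the same value.
  std::ostringstream os;
  os.precision(std::numeric_limits<float>::max_digits10);
  os << base_score;
  std::string serialized = os.str();

  float round_tripped = std::stof(serialized);
  std::cout << serialized << " round-trips exactly: "
            << (round_tripped == base_score) << "\n";
  return 0;
}
```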

I will take over the PR; please let me know if there are other things I can help with in the R package. I'm hoping to include this in the next release (2.2). We still have plenty of time.

@trivialfis (Member) commented Aug 1, 2024

Hmm, I ran into an issue with the weighted sample mean. If the weights do not strictly form a probability simplex summing to 1, we can generate an invalid mean value, such as a mean greater than 1 for logistic regression.

@david-cortes (Contributor, Author) commented:

> Hmm, I ran into an issue with the weighted sample mean. If the weights do not strictly form a probability simplex summing to 1, we can generate an invalid mean value, such as a mean greater than 1 for logistic regression.

I'm not sure what kind of impact it'd have, but there's always the option of doing a more precise sequential calculation like this:
https://github.com/david-cortes/isotree/blob/f8983ae2170c8c675b4ccda1f6ed394678c26a0a/src/mult.hpp#L183

Or to use compensated sums.
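A minimal sketch of the sequential variant I mean (a generic illustration, not the linked code verbatim):

```cpp
#include <cstddef>
#include <vector>

// Running weighted mean: each step is a convex update, so the result stays
// (up to rounding) within the range of the observations seen so far.
double RunningWeightedMean(std::vector<double> const& y,
                           std::vector<double> const& w) {
  double mean = 0.0, wsum = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    if (w[i] <= 0.0) continue;  // zero-weight rows contribute nothing
    wsum += w[i];
    mean += (y[i] - mean) * (w[i] / wsum);
  }
  return mean;  // assumes at least one strictly positive weight
}
```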

Although from a look at the code, could it perhaps be that some matrix is in the wrong memory layout? I see it uses:

```cpp
auto cidx = i / n_rows;
auto ridx = i % n_rows;
```

But it doesn't validate that the matrix is column-major.

@trivialfis (Member) commented Aug 2, 2024

> I'm not sure what kind of impact it'd have

It violates the range of logistic regression, [0, 1]: the prediction (after applying the inverse link) cannot exceed 1, but with inappropriate weights we can make the weighted mean greater than 1. At the moment, we only require the weights to be positive, which is insufficient to guarantee a valid sample mean for logistic regression. One such example (for illustration) would be simply reusing the target y as the weight, resulting in a weight of 0 for the 0 labels and a weight of 1 for the 1 labels.

> could it perhaps be that some matrix is in the wrong memory layout

I will add a restriction once the weight issue is resolved. The matrix view supports different memory layout types, so we can dispatch on them.

@david-cortes (Contributor, Author) commented Aug 2, 2024

> One such example (for illustration) would be simply reusing the target y as the weight, resulting in a weight of 0 for the 0 labels and a weight of 1 for the 1 labels.

But that is not an issue with the computation: if the data contains only examples of one class, then the optimal solution for a regularized method like XGBoost's is to make the intercept infinite and not use any feature for generating scores.

If a user wishes to use XGBoost as a one-class classifier, the correct behavior would be to manually set the intercept to zero.

I guess the most logical course of action in such cases would be to throw an error explaining that the response is constant.

> At the moment, we only require the weights to be positive, which is insufficient to guarantee a valid sample mean for logistic regression

If the weights are non-negative and the floating-point computations had infinite precision, then it wouldn't be possible to arrive at a value greater than the maximum or smaller than the minimum of the inputs that go into that mean. But I can see it going wrong with fp32 when the number of rows is large, whether the weights sum to 1 or not.

Some other ideas:

  • Perhaps you could divide the weights by their sum outside of the function that goes into the transformation, so as to force the operation order to be y * (w / sum(w)) rather than (y * w) / sum(w) (or perhaps fma(y, w / sum(w), agg)); I'm not sure that's the issue here, though.
  • For the specific case of logistic regression, another possibility could be to sum the weights by class instead (fewer numbers to sum = more precise computations); see the sketch after this list.
  • Maybe the aggregate could be kept in fp64 precision?
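As an illustration of the second idea (a sketch, assuming the labels are exactly 0 or 1 and the weights are non-negative): the mean reduces to a ratio of two weight sums, and since the division happens last, the result stays in [0, 1] up to rounding.

```cpp
#include <cstddef>
#include <vector>

// Weighted mean of {0,1} labels, computed as the summed weight of the
// positive class over the total weight.
double WeightedPositiveRate(std::vector<float> const& y,
                            std::vector<float> const& w) {
  double w_pos = 0.0, w_total = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    w_total += w[i];
    if (y[i] == 1.0f) {
      w_pos += w[i];
    }
  }
  return w_pos / w_total;  // assumes the total weight is positive
}
```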

Are you able to provide an example where the result is outside the range [0,1] when the inputs are all {0,1}?

@trivialfis (Member) commented:

> I guess the most logical course of action in such cases would be to throw an error explaining that the response is constant.

Makes sense.

> Some other ideas:

Thank you for sharing; I will look into them. I spent a bit of time today proving that the sample mean is the minimizer. I will continue to work on the PR next week.

@trivialfis (Member) commented Aug 9, 2024

> could it perhaps be that some matrix is in the wrong memory layout

Added notes about why the reduction is written in this way.

todo for myself:

  • add allreduce synchronization.
  • add tests.
  • edge cases, including empty labels for binary classification.
