Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jvm-packages][breaking]Rework xgboost4j jvm packages for spark #10617

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wbo4958
Copy link
Contributor

@wbo4958 wbo4958 commented Jul 22, 2024

About this PR

Features and Fixes

  • Introduce an abstract XGBoost Estimator:

    • Create an abstract base class that encapsulates common functionality for all XGBoost estimators.
    • This will promote code reuse and maintain consistency across different estimators.
  • Update to the latest XGBoost parameters

    • Add all XGBoost parameters supported in XGBoost4j-spark base on this
    • Add setter and getter for these parameters.
    • Remove the deprecated parameters
  • Implement support for XGBoostRanker:

    • Add a new XGBoostRanker estimator for ranking problem.
  • Address the missing value handling:

    • 0 is a meaningful value in machine learning case, the old version enforces 0 as the missing value which may result in XGBoost model inaccurate.
  • Consolidate xgboost4j-gpu and xgboost4j-spark-gpu:

    • Move the functionality of xgboost4j-gpu into xgboost4j-spark-gpu.
    • Remove any soft links in xgboost4j-spark-gpu to ensure a clean structure, Instead, make xgboost4j-spark-gpu extend from xgboost4j-spark and xgboost4j
  • Remove any ETL operations in XGBoost

    • As a thin wrapper of XGBoost4j, it shouldn't wrap much ETL in XGBoost4j itself to make code hard to maintain and read.
  • Create uber jars for xgboost4j-spark and xgboost4j-spark-gpu:

    • Create aggregator module to build uber jars (fat jars) that include all the necessary dependencies for xgboost4j-spark
    • Create aggregator-gpu module to build uber jars for xgboost4j-spark-gpu.
  • Rework the GPU plugin:

    • Rework the GPU plugin to improve its stability and make it more readable and maintainable.
  • Expand sanity tests for CPU and GPU consistency:

    • Increase the coverage of sanity tests to verify that the training and transform results are consistent between CPU and GPU implementations.
    • This will help identify and prevent any discrepancies or bugs related to the GPU support.

Breaking changes

  • Remove some unused parameters like Rabit related parameters and train_test_ratio. etc.

  • Separate XGBoost-spark parameter from the whole XGBoost parameters, and make the constructor of XGBoost estimator only accept the XGBoost parameters instead of parameters for xgboost4j-spark like,

    val xgbParams: Map[String, Any] = Map(
          "max_depth" -> 5,
          "eta" -> 0.2,
          "objective" -> "binary:logistic"
        )
    val est = new XGBoostClassifier(xgbParams)
          .setNumRound(10)
          .setDevice("cpu")
          .setNumWorkers(2)

    or

    val est = new XGBoostClassifier(xgbParams)
          .setMaxDepth(5)
          .setEta(0.2)
          .setObjective("binary:logistic")
          .setNumRound(10)
          .setDevice("cpu")
          .setNumWorkers(2)
  • XGBoostRegressor doesn't support ranking problem anymore, the functionality of ranking has been moved to the new estimator XGBoostRanker. And remove the ETL (like grouping) for ranking problem.

  • Removed the xgboost4j-gpu module and move the functionality into xgboost4j-spark-gpu

Test this PR

  • Unit tests For scala 2.12
mvn clean package
mvn -P gpu clean package
  • Unit tests For scala 2.13
cd xgboost
python dev/change_scala_version.py --scala-version 2.13

cd jvm-packages
mvn clean package
mvn -P gpu clean package

@wbo4958 wbo4958 marked this pull request as draft July 22, 2024 04:41
@wbo4958 wbo4958 changed the title Rework xgboost4j jvm packages for spark [jvm-packages][breaking]Rework xgboost4j jvm packages for spark Jul 22, 2024
@wbo4958 wbo4958 marked this pull request as ready for review July 22, 2024 06:12
@wbo4958
Copy link
Contributor Author

wbo4958 commented Jul 22, 2024

hey @trivialfis @hcho3 @dotbg, please help review this PR, thx very much.

@hcho3
Copy link
Collaborator

hcho3 commented Jul 22, 2024

Would you please split this pull requests into smaller chunks so that it is easier to review?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Jul 22, 2024

It's hard to separate it, since it's completely jvm rewrite. The whole code architecture has been re-orged. and totally different implementation.

- Remove xgboost4j-gpu and move its functionality to xgboost4j-spark-gpu
- Remove any soft links in xgboost4j-spark-gpu
- Abstract an XGBoost Estimator which handles the common functionality for all xgboost estimators.
- Support XGBoostRanker
- Remove any unnecessary ETL in XGBoost
- Fix the missing value usage
- Support uber jar for xgboost4j-spark
- Support uber jar for xgboost4j-spark-gpu
- Rework GPU plugin
- More sannity test to ensure the training/transform results are same between CPU and GPU.
@hcho3
Copy link
Collaborator

hcho3 commented Jul 22, 2024

I won't review the code, since I don't think I am qualified to review such a large change in the Java/Scala code base. The best I can do is to review the changes in the CI/CD pipelines.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Jul 22, 2024

I won't review the code, since I don't think I am qualified to review such a large change in the Java/Scala code base. The best I can do is to review the changes in the CI/CD pipelines.

Thx

@dotbg
Copy link
Contributor

dotbg commented Jul 22, 2024

@wbo4958 thank you for the PR. I shall find some time to go through it. Unfortunately, I was very busy lately :(
@hcho3 thank you for reviewing the CI/CD part. this should easen the review for me.

Copy link
Contributor

@dotbg dotbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a few comments, but I am rather far from done yet

@@ -133,7 +133,7 @@
<property name="braceAdjustment" value="0"/>
<property name="caseIndent" value="2"/>
<property name="throwsIndent" value="4"/>
<property name="lineWrappingIndentation" value="4"/>
<property name="lineWrappingIndentation" value="2"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this change in another PR

<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<flink.version>1.19.0</flink.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<flink.version>1.19.0</flink.version>
<flink.version>1.19.1</flink.version>

<junit.version>4.13.2</junit.version>
<spark.version>3.5.1</spark.version>
<spark.version.gpu>3.5.1</spark.version.gpu>
<fasterxml.jackson.version>2.15.2</fasterxml.jackson.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<fasterxml.jackson.version>2.15.2</fasterxml.jackson.version>
<fasterxml.jackson.version>2.17.2</fasterxml.jackson.version>

<maven.wagon.http.retryHandler.count>5</maven.wagon.http.retryHandler.count>
<log.capi.invocation>OFF</log.capi.invocation>
<use.cuda>OFF</use.cuda>
<spark.rapids.version>24.04.1</spark.rapids.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<spark.rapids.version>24.04.1</spark.rapids.version>
<spark.rapids.version>24.06.0</spark.rapids.version>

<use.cuda>OFF</use.cuda>
<spark.rapids.version>24.04.1</spark.rapids.version>
<cudf.classifier>cuda12</cudf.classifier>
<scalatest.version>3.2.18</scalatest.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<scalatest.version>3.2.18</scalatest.version>
<scalatest.version>3.2.19</scalatest.version>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>3.4.1</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<version>3.4.1</version>
<version>3.4.2</version>

@@ -82,19 +82,27 @@ This file is divided into 3 sections:
</check>

<check level="error" class="org.scalastyle.scalariform.ClassNamesChecker" enabled="true">
<parameters><parameter name="regex"><![CDATA[[A-Z][A-Za-z]*]]></parameter></parameters>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's have these changes in another PR.

"device" -> device
)
)
).setNumRound(10).setNumWorkers(numWorkers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it still work if we specify the map values instead?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Jul 22, 2024

HI @dotbg, Thx for review, some other reviewers complained this PR is quite huge PR, So I will break it down to smaller PRs. Will let you know. Thx again.

@wbo4958 wbo4958 marked this pull request as draft July 23, 2024 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants