Merge remote-tracking branch 'upstream/master' into per-thread-default-stream
dmlc · Jul 18, 2023 · 0bff0e4
2 parents b019d60 + 0897477

Showing 28 changed files with 319 additions and 579 deletions.
diff --git a/demo/README.md b/demo/README.md
@@ -145,7 +145,7 @@ Send a PR to add a one sentence description:)
 ## Tools using XGBoost
 
 - [BayesBoost](https://github.com/mpearmain/BayesBoost) - Bayesian Optimization using xgboost and sklearn API
-- [FLAML](https://github.com/microsoft/FLAML) - An open source AutoML library 
+- [FLAML](https://github.com/microsoft/FLAML) - An open source AutoML library
 designed to automatically produce accurate machine learning models with low computational cost. FLAML includes [XGBoost as one of the default learners](https://github.com/microsoft/FLAML/blob/main/flaml/model.py) and can also be used as a fast hyperparameter tuning tool for XGBoost ([code example](https://microsoft.github.io/FLAML/docs/Examples/AutoML-for-XGBoost)).
 - [gp_xgboost_gridsearch](https://github.com/vatsan/gp_xgboost_gridsearch) - In-database parallel grid-search for XGBoost on [Greenplum](https://github.com/greenplum-db/gpdb) using PL/Python
 - [tpot](https://github.com/rhiever/tpot) - A Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.

diff --git a/doc/tutorials/spark_estimator.rst b/doc/tutorials/spark_estimator.rst
@@ -35,13 +35,13 @@ We can create a ``SparkXGBRegressor`` estimator like:
   )
 
 
-The above snippet creates a spark estimator which can fit on a spark dataset,
-and return a spark model that can transform a spark dataset and generate dataset
-with prediction column. We can set almost all of xgboost sklearn estimator parameters
-as ``SparkXGBRegressor`` parameters, but some parameter such as ``nthread`` is forbidden
-in spark estimator, and some parameters are replaced with pyspark specific parameters
-such as ``weight_col``, ``validation_indicator_col``, ``use_gpu``, for details please see
-``SparkXGBRegressor`` doc.
+The above snippet creates a spark estimator which can fit on a spark dataset, and return a
+spark model that can transform a spark dataset and generate dataset with prediction
+column. We can set almost all of xgboost sklearn estimator parameters as
+``SparkXGBRegressor`` parameters, but some parameter such as ``nthread`` is forbidden in
+spark estimator, and some parameters are replaced with pyspark specific parameters such as
+``weight_col``, ``validation_indicator_col``, for details please see ``SparkXGBRegressor``
+doc.
 
 The following code snippet shows how to train a spark xgboost regressor model,
 first we need to prepare a training dataset as a spark dataframe contains
@@ -88,7 +88,7 @@ XGBoost PySpark fully supports GPU acceleration. Users are not only able to enab
 efficient training but also utilize their GPUs for the whole PySpark pipeline including
 ETL and inference. In below sections, we will walk through an example of training on a
 PySpark standalone GPU cluster. To get started, first we need to install some additional
-packages, then we can set the ``use_gpu`` parameter to ``True``.
+packages, then we can set the ``device`` parameter to ``cuda`` or ``gpu``.
 
 Prepare the necessary packages
 ==============================
@@ -128,7 +128,7 @@ Write your PySpark application
 ==============================
 
 Below snippet is a small example for training xgboost model with PySpark. Notice that we are
-using a list of feature names and the additional parameter ``use_gpu``:
+using a list of feature names and the additional parameter ``device``:
 
 .. code-block:: python
 
@@ -148,12 +148,12 @@ using a list of feature names and the additional parameter ``use_gpu``:
   # get a list with feature column names
   feature_names = [x.name for x in train_df.schema if x.name != label_name]
 
-  # create a xgboost pyspark regressor estimator and set use_gpu=True
+  # create a xgboost pyspark regressor estimator and set device="cuda"
   regressor = SparkXGBRegressor(
     features_col=feature_names,
     label_col=label_name,
     num_workers=2,
-    use_gpu=True,
+    device="cuda",
   )
 
   # train and return the model
@@ -163,6 +163,7 @@ using a list of feature names and the additional parameter ``use_gpu``:
   predict_df = model.transform(test_df)
   predict_df.show()
 
+Like other distributed interfaces, the ``device`` parameter doesn't support specifying ordinal as GPUs are managed by Spark instead of XGBoost (good: ``device=cuda``, bad: ``device=cuda:0``).
 
 Submit the PySpark application
 ==============================
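
Note on the hunk above: the new documentation line says that the ``device`` parameter takes no GPU ordinal under Spark (``device=cuda`` is fine, ``device=cuda:0`` is not). The rule can be sketched as a small pre-flight check in plain Python. This is a minimal sketch, not XGBoost's implementation: the helper name `check_spark_device` is hypothetical, and treating `cpu` as an accepted value is an assumption, since the diff only mentions `cuda` and `gpu`.

```python
import re

# Hypothetical helper for illustration only; it is not part of XGBoost or PySpark.
# Assumption: "cpu" is accepted alongside the "cuda" and "gpu" values named in the diff.
ALLOWED_DEVICES = {"cpu", "cuda", "gpu"}


def check_spark_device(device: str) -> str:
    """Return `device` unchanged if it carries no ordinal, else raise ValueError."""
    if device in ALLOWED_DEVICES:
        return device
    # Reject forms such as "cuda:0" or "gpu:1": Spark, not XGBoost, assigns GPUs.
    if re.fullmatch(r"(cuda|gpu):\d+", device):
        raise ValueError(
            f"device={device!r} specifies an ordinal; use 'cuda' or 'gpu' instead"
        )
    raise ValueError(f"unrecognized device: {device!r}")


if __name__ == "__main__":
    print(check_spark_device("cuda"))  # accepted form
    try:
        check_spark_device("cuda:0")   # rejected form
    except ValueError as err:
        print(err)
```

Running such a check before constructing the estimator would surface a leftover ordinal early, instead of at cluster submission time.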

diff --git a/jvm-packages/README.md b/jvm-packages/README.md
@@ -3,161 +3,15 @@
 [![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](https://xgboost.readthedocs.org/en/latest/jvm/index.html)
 [![GitHub license](http://dmlc.github.io/img/apache2.svg)](../LICENSE)
 
-[Documentation](https://xgboost.readthedocs.org/en/latest/jvm/index.html) |
+[Documentation](https://xgboost.readthedocs.org/en/stable/jvm/index.html) |
 [Resources](../demo/README.md) |
 [Release Notes](../NEWS.md)
 
-XGBoost4J is the JVM package of xgboost. It brings all the optimizations
-and power xgboost into JVM ecosystem.
+XGBoost4J is the JVM package of xgboost. It brings all the optimizations and power xgboost
+into JVM ecosystem.
 
-- Train XGBoost models in scala and java with easy customizations.
-- Run distributed xgboost natively on jvm frameworks such as
-Apache Flink and Apache Spark.
+- Train XGBoost models in scala and java with easy customization.
+- Run distributed xgboost natively on jvm frameworks such as Apache Flink and Apache
+Spark.
 
-You can find more about XGBoost on [Documentation](https://xgboost.readthedocs.org/en/latest/jvm/index.html) and [Resource Page](../demo/README.md).
-
-## Add Maven Dependency
-
-XGBoost4J, XGBoost4J-Spark, etc. in maven repository is compiled with g++-4.8.5.
-
-### Access release version
-
-<b>Maven</b>
-
-```
-<dependency>
-    <groupId>ml.dmlc</groupId>
-    <artifactId>xgboost4j_2.12</artifactId>
-    <version>latest_version_num</version>
-</dependency>
-<dependency>
-    <groupId>ml.dmlc</groupId>
-    <artifactId>xgboost4j-spark_2.12</artifactId>
-    <version>latest_version_num</version>
-</dependency>
-```
-or 
-```
-<dependency>
-    <groupId>ml.dmlc</groupId>
-    <artifactId>xgboost4j_2.13</artifactId>
-    <version>latest_version_num</version>
-</dependency>
-<dependency>
-    <groupId>ml.dmlc</groupId>
-    <artifactId>xgboost4j-spark_2.13</artifactId>
-    <version>latest_version_num</version>
-</dependency>
-```
-
-<b>sbt</b>
-```sbt
-libraryDependencies ++= Seq(
-  "ml.dmlc" %% "xgboost4j" % "latest_version_num",
-  "ml.dmlc" %% "xgboost4j-spark" % "latest_version_num"
-)
-```
-
-For the latest release version number, please check [here](https://github.com/dmlc/xgboost/releases).
-
-
-### Access SNAPSHOT version
-
-First add the following Maven repository hosted by the XGBoost project:
-
-<b>Maven</b>:
-
-```xml
-<repository>
-  <id>XGBoost4J Snapshot Repo</id>
-  <name>XGBoost4J Snapshot Repo</name>
-  <url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/</url>
-</repository>
-```
-
-<b>sbt</b>:
-
-```sbt
-resolvers += "XGBoost4J Snapshot Repo" at "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/"
-```
-
-Then add XGBoost4J as a dependency:
-
-<b>Maven</b>
-
-```
-<dependency>
-    <groupId>ml.dmlc</groupId>
-    <artifactId>xgboost4j_2.12</artifactId>
-    <version>latest_version_num-SNAPSHOT</version>
-</dependency>
-<dependency>
-    <groupId>ml.dmlc</groupId>
-    <artifactId>xgboost4j-spark_2.12</artifactId>
-    <version>latest_version_num-SNAPSHOT</version>
-</dependency>
-```
-or with scala 2.13 
-```
-<dependency>
-    <groupId>ml.dmlc</groupId>
-    <artifactId>xgboost4j_2.13</artifactId>
-    <version>latest_version_num-SNAPSHOT</version>
-</dependency>
-<dependency>
-    <groupId>ml.dmlc</groupId>
-    <artifactId>xgboost4j-spark_2.13</artifactId>
-    <version>latest_version_num-SNAPSHOT</version>
-</dependency>
-```
-
-<b>sbt</b>
-```sbt
-libraryDependencies ++= Seq(
-  "ml.dmlc" %% "xgboost4j" % "latest_version_num-SNAPSHOT",
-  "ml.dmlc" %% "xgboost4j-spark" % "latest_version_num-SNAPSHOT"
-)
-```
-
-For the latest release version number, please check [the repository listing](https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/list.html).
-
-### GPU algorithm
-To enable the GPU algorithm (`tree_method='gpu_hist'`), use artifacts `xgboost4j-gpu_2.12` and `xgboost4j-spark-gpu_2.12` instead.
-Note that scala 2.13 is not supported by the [NVIDIA/spark-rapids#1525](https://github.com/NVIDIA/spark-rapids/issues/1525) yet, so the GPU algorithm can only be used with scala 2.12.
-
-## Examples
-
-Full code examples for Scala, Java, Apache Spark, and Apache Flink can
-be found in the [examples package](https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example).
-
-**NOTE on LIBSVM Format**:
-
-There is an inconsistent issue between XGBoost4J-Spark and other language bindings of XGBoost.
-
-When users use Spark to load trainingset/testset in LIBSVM format with the following code snippet:
-
-```scala
-spark.read.format("libsvm").load("trainingset_libsvm")
-```
-
-Spark assumes that the dataset is 1-based indexed. However, when you do prediction with other bindings of XGBoost (e.g. Python API of XGBoost), XGBoost assumes that the dataset is 0-based indexed. It creates a pitfall for the users who train model with Spark but predict with the dataset in the same format in other bindings of XGBoost.
-
-## Development
-
-You can build/package xgboost4j locally with the following steps:
-
-**Linux:**
-1. Ensure [Docker for Linux](https://docs.docker.com/install/) is installed.
-2. Clone this repo: `git clone --recursive https://github.com/dmlc/xgboost.git`
-3. Run the following command:
-  - With Tests: `./xgboost/jvm-packages/dev/build-linux.sh`
-  - Skip Tests: `./xgboost/jvm-packages/dev/build-linux.sh --skip-tests`
-
-**Windows:**
-1. Ensure [Docker for Windows](https://docs.docker.com/docker-for-windows/install/) is installed.
-2. Clone this repo: `git clone --recursive https://github.com/dmlc/xgboost.git`
-3. Run the following command:
-  - With Tests: `.\xgboost\jvm-packages\dev\build-linux.cmd`
-  - Skip Tests: `.\xgboost\jvm-packages\dev\build-linux.cmd --skip-tests`
-
-*Note: this will create jars for deployment on Linux machines.*
+You can find more about XGBoost on [Documentation](https://xgboost.readthedocs.org/en/stable/jvm/index.html) and [Resource Page](../demo/README.md).
diff --git a/jvm-packages/dev/.gitattributes b/jvm-packages/dev/.gitattributes
diff --git a/jvm-packages/dev/.gitignore b/jvm-packages/dev/.gitignore
diff --git a/jvm-packages/dev/Dockerfile b/jvm-packages/dev/Dockerfile
diff --git a/jvm-packages/dev/build-linux.cmd b/jvm-packages/dev/build-linux.cmd
diff --git a/jvm-packages/dev/build-linux.sh b/jvm-packages/dev/build-linux.sh