
Commit

Merge remote-tracking branch 'upstream/master' into colsplit-gpu-predictor
rongou committed Jun 27, 2023
2 parents 9dc920a + f479871 commit 5710a46
Showing 39 changed files with 1,590 additions and 690 deletions.
43 changes: 26 additions & 17 deletions R-package/tests/testthat/test_basic.R
@@ -85,9 +85,18 @@ test_that("dart prediction works", {
rnorm(100)

set.seed(1994)
booster_by_xgboost <- xgboost(data = d, label = y, max_depth = 2, booster = "dart",
rate_drop = 0.5, one_drop = TRUE,
eta = 1, nthread = 2, nrounds = nrounds, objective = "reg:squarederror")
booster_by_xgboost <- xgboost(
data = d,
label = y,
max_depth = 2,
booster = "dart",
rate_drop = 0.5,
one_drop = TRUE,
eta = 1,
nthread = 2,
nrounds = nrounds,
objective = "reg:squarederror"
)
pred_by_xgboost_0 <- predict(booster_by_xgboost, newdata = d, ntreelimit = 0)
pred_by_xgboost_1 <- predict(booster_by_xgboost, newdata = d, ntreelimit = nrounds)
expect_true(all(matrix(pred_by_xgboost_0, byrow = TRUE) == matrix(pred_by_xgboost_1, byrow = TRUE)))
@@ -97,19 +106,19 @@ test_that("dart prediction works", {

set.seed(1994)
dtrain <- xgb.DMatrix(data = d, info = list(label = y))
booster_by_train <- xgb.train(params = list(
booster = "dart",
max_depth = 2,
eta = 1,
rate_drop = 0.5,
one_drop = TRUE,
nthread = 1,
tree_method = "exact",
objective = "reg:squarederror"
),
data = dtrain,
nrounds = nrounds
)
booster_by_train <- xgb.train(
params = list(
booster = "dart",
max_depth = 2,
eta = 1,
rate_drop = 0.5,
one_drop = TRUE,
nthread = 1,
objective = "reg:squarederror"
),
data = dtrain,
nrounds = nrounds
)
pred_by_train_0 <- predict(booster_by_train, newdata = dtrain, ntreelimit = 0)
pred_by_train_1 <- predict(booster_by_train, newdata = dtrain, ntreelimit = nrounds)
pred_by_train_2 <- predict(booster_by_train, newdata = dtrain, training = TRUE)
@@ -399,7 +408,7 @@ test_that("colsample_bytree works", {
xgb.importance(model = bst)
# If colsample_bytree works properly, a variety of features should be used
# in the 100 trees
expect_gte(nrow(xgb.importance(model = bst)), 30)
expect_gte(nrow(xgb.importance(model = bst)), 28)
})

test_that("Configuration works", {
5 changes: 4 additions & 1 deletion R-package/tests/testthat/test_update.R
@@ -13,7 +13,10 @@ test_that("updating the model works", {
watchlist <- list(train = dtrain, test = dtest)

# no-subsampling
p1 <- list(objective = "binary:logistic", max_depth = 2, eta = 0.05, nthread = 2)
p1 <- list(
objective = "binary:logistic", max_depth = 2, eta = 0.05, nthread = 2,
updater = "grow_colmaker,prune"
)
set.seed(11)
bst1 <- xgb.train(p1, dtrain, nrounds = 10, watchlist, verbose = 0)
tr1 <- xgb.model.dt.tree(model = bst1)
2 changes: 2 additions & 0 deletions doc/c.rst
@@ -33,6 +33,8 @@ DMatrix
.. doxygengroup:: DMatrix
:project: xgboost

.. _c_streaming:

Streaming
---------

3 changes: 3 additions & 0 deletions doc/tutorials/dask.rst
@@ -54,6 +54,9 @@ on a dask cluster:
y = da.random.random(size=(num_obs, 1), chunks=(1000, 1))
dtrain = xgb.dask.DaskDMatrix(client, X, y)
# or
# dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)
# `DaskQuantileDMatrix` is available for the `hist` and `gpu_hist` tree methods.
output = xgb.dask.train(
client,
81 changes: 70 additions & 11 deletions doc/tutorials/external_memory.rst
@@ -22,6 +22,15 @@ GPU-based training algorithm. We will introduce them in the following sections.

The feature is still experimental as of 2.0. The performance is not well optimized.

The external memory support has gone through multiple iterations and is still under heavy
development. As with the :py:class:`~xgboost.QuantileDMatrix` backed by a
:py:class:`~xgboost.DataIter`, XGBoost loads data batch-by-batch using a custom iterator
supplied by the user. However, unlike the :py:class:`~xgboost.QuantileDMatrix`, external
memory does not concatenate the batches unless the GPU is used (it uses a hybrid approach;
more details follow). Instead, it caches all batches in external memory and fetches them
on demand. See the end of this document for a comparison between the `QuantileDMatrix`
and external memory.
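
As a rough illustration of the iterator pattern described above, a user-defined iterator
might look like the following sketch. The batch file paths are hypothetical; check the
:py:class:`~xgboost.DataIter` reference for the exact interface.

.. code-block:: python

  import os

  import numpy as np
  import xgboost

  class Iterator(xgboost.DataIter):
      """Yield pre-split batches stored as ``.npy`` files (paths are made up)."""

      def __init__(self, file_prefixes):
          self._file_prefixes = file_prefixes
          self._it = 0
          # XGBoost places its on-disk cache files under this prefix.
          super().__init__(cache_prefix=os.path.join(".", "cache"))

      def next(self, input_data):
          """Feed the next batch to XGBoost; return 0 once the data is exhausted."""
          if self._it == len(self._file_prefixes):
              return 0
          prefix = self._file_prefixes[self._it]
          input_data(data=np.load(prefix + ".X.npy"), label=np.load(prefix + ".y.npy"))
          self._it += 1
          return 1

      def reset(self):
          """Called by XGBoost before each new pass over the batches."""
          self._it = 0

  # Passing the iterator to DMatrix (rather than QuantileDMatrix) enables external memory:
  # Xy = xgboost.DMatrix(Iterator(["batch-0", "batch-1", "batch-2"]))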

*************
Data Iterator
*************
@@ -113,10 +122,11 @@ External memory is supported by GPU algorithms (i.e. when ``tree_method`` is set
``gpu_hist``). However, the algorithm used for GPU is different from the one used for
CPU. When training on a CPU, the tree method iterates through all batches from external
memory for each step of the tree construction algorithm. On the other hand, the GPU
algorithm concatenates all batches into one and stores it in GPU memory. To reduce overall
memory usage, users can utilize subsampling. The good news is that the GPU hist tree
method supports gradient-based sampling, enabling users to set a low sampling rate without
compromising accuracy.
algorithm uses a hybrid approach. It iterates through the data at the beginning of each
iteration and concatenates all batches into one in GPU memory. To reduce overall memory
usage, users can utilize subsampling. The GPU hist tree method supports `gradient-based
sampling`, enabling users to set a low sampling rate without compromising accuracy.

.. code-block:: python
@@ -134,6 +144,8 @@ see `this paper <https://arxiv.org/abs/2005.09148>`_.
When the GPU runs out of memory during iteration on external memory, the user might
receive a segfault instead of an OOM exception.
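
A hedged sketch of the gradient-based sampling setup mentioned above, assuming ``it`` is a
user-defined :py:class:`~xgboost.DataIter` as in the earlier example and the parameter
values are only illustrative:

.. code-block:: python

  import xgboost as xgb

  # External-memory DMatrix built from a user-defined iterator.
  Xy = xgb.DMatrix(it)

  params = {
      "tree_method": "gpu_hist",
      # Gradient-based sampling keeps accuracy even with a low subsample ratio.
      "sampling_method": "gradient_based",
      "subsample": 0.2,
  }
  booster = xgb.train(params, Xy, num_boost_round=100)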

.. _ext_remarks:

*******
Remarks
*******
@@ -142,17 +154,64 @@ When using external memory with XGBoost, data is divided into smaller chunks so that only
a fraction of it needs to be stored in memory at any given time. It's important to note
that this method only applies to the predictor data (``X``), while other data, like labels
and internal runtime structures are concatenated. This means that memory reduction is most
effective when dealing with wide datasets where ``X`` is larger compared to other data
like ``y``, while it has little impact on slim datasets.
effective when dealing with wide datasets where ``X`` is significantly larger in size
compared to other data like ``y``, while it has little impact on slim datasets.

As one might expect, fetching data on demand puts significant pressure on the storage
device. Today's computing devices can process far more data than a storage device can read
in a single unit of time; the gap is orders of magnitude. A GPU is capable of processing
hundreds of gigabytes of floating-point data in a split second. On the other hand, a
four-lane NVMe device connected to a PCIe-4 slot usually has about 6GB/s of data transfer
rate. As a result, training is likely to be severely bounded by your storage
device. Before adopting the external memory solution, some back-of-envelope calculations
might help you see whether it's viable. For instance, suppose your NVMe drive can transfer
4GB of data per second (a fairly practical number) and you have 100GB of data in the
compressed XGBoost cache (which corresponds to a dense float32 numpy array of roughly
200GB, give or take). A tree of depth 8 needs at least 16 iterations through the data when
the parameters are set appropriately. You need about 14 minutes to train a single tree,
without accounting for other overheads and assuming the computation overlaps with the
IO. If your dataset happens to be TB-scale, you might need thousands of trees to get a
generalized model. These calculations can help you estimate the expected training time.

However, this limitation can sometimes be ameliorated. One should also consider that the
OS (mostly the Linux kernel here) can usually cache the data in host memory. It only
evicts pages when new data comes in and there's no room left. In practice, at least some
portion of the data can persist in host memory throughout the entire training session. We
are aware of this cache when optimizing the external memory fetcher. The compressed cache
is usually smaller than the raw input data, especially when the input is dense without any
missing values. If the host memory can fit a significant portion of this compressed cache,
the performance should be decent after initialization. Our development so far focuses on
two fronts of optimization for external memory:

- Avoid iterating through the data whenever appropriate.
- If the OS can cache the data, the performance should be close to in-core training.

Starting with XGBoost 2.0, the implementation of external memory uses ``mmap``. It is not
yet tested against system errors like disconnected network devices (`SIGBUS`). Also, it's
worth noting that most tests have been conducted on Linux distributions.
tested against system errors like disconnected network devices (`SIGBUS`). In the face of
a bus error, you will see a hard crash and need to clean up the cache files. If the
training session might take a long time and you are using solutions like NVMe-oF, we
recommend checkpointing your model periodically. Also, it's worth noting that most tests
have been conducted on Linux distributions.
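
A minimal sketch of such periodic checkpointing, assuming ``params`` and an
external-memory ``DMatrix`` named ``Xy`` already exist; the file names and round counts
are illustrative:

.. code-block:: python

  import xgboost as xgb

  booster = None
  for checkpoint in range(10):
      # Continue training from the previous booster, 50 rounds at a time.
      booster = xgb.train(params, Xy, num_boost_round=50, xgb_model=booster)
      # After a crash (e.g. SIGBUS), resume from the most recent saved file.
      booster.save_model(f"checkpoint-{checkpoint}.json")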

Another important point to keep in mind is that creating the initial cache for XGBoost may
take some time. The interface to external memory is through custom iterators, which may or
may not be thread-safe. Therefore, initialization is performed sequentially.

Another important point to keep in mind is that creating the initial cache for XGBoost may
take some time. The interface to external memory is through custom iterators, which we
cannot assume to be thread-safe. Therefore, initialization is performed sequentially. Using
`xgboost.config_context` with `verbosity=2` can give you some information on what XGBoost
is doing during the wait, if you don't mind the extra output.
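
For example, a small sketch where ``it`` is the user-defined iterator from before:

.. code-block:: python

  import xgboost

  with xgboost.config_context(verbosity=2):
      # Cache construction happens here; the extra logging shows what XGBoost is doing.
      Xy = xgboost.DMatrix(it)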

*******************************
Compared to the QuantileDMatrix
*******************************

Passing an iterator to the :py:class:`~xgboost.QuantileDMatrix` enables direct
construction of the `QuantileDMatrix` from data chunks. On the other hand, if it's passed
to a :py:class:`~xgboost.DMatrix`, it instead enables the external memory feature. The
:py:class:`~xgboost.QuantileDMatrix` concatenates the data in memory after compression and
doesn't fetch data during training. On the other hand, the external memory `DMatrix`
fetches data batches from external memory on demand. Use the `QuantileDMatrix` (with an
iterator if necessary) when you can fit most of your data in memory. Training with it will
be an order of magnitude faster than using external memory.
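
A small sketch of the two constructions, where ``it`` is a user-defined
:py:class:`~xgboost.DataIter` as in the earlier example:

.. code-block:: python

  import xgboost

  # In-core: batches are compressed and concatenated in memory; nothing is fetched
  # from the iterator during training.
  Xy_in_core = xgboost.QuantileDMatrix(it)

  # External memory: batches are cached on disk and fetched on demand during training.
  Xy_external = xgboost.DMatrix(it)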

****************
Text File Inputs
18 changes: 9 additions & 9 deletions doc/tutorials/index.rst
@@ -11,22 +11,22 @@ See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for mo

model
saving_model
learning_to_rank
dart
monotonic
feature_interaction_constraint
aft_survival_analysis
categorical
multioutput
rf
kubernetes
Distributed XGBoost with XGBoost4J-Spark <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html>
Distributed XGBoost with XGBoost4J-Spark-GPU <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_gpu_tutorial.html>
dask
spark_estimator
ray
dart
monotonic
rf
feature_interaction_constraint
learning_to_rank
aft_survival_analysis
external_memory
c_api_tutorial
input_format
param_tuning
external_memory
custom_metric_obj
categorical
multioutput
5 changes: 3 additions & 2 deletions doc/tutorials/learning_to_rank.rst
@@ -48,8 +48,9 @@ Notice that the samples are sorted based on their query index in a non-decreasing order.
import xgboost as xgb
# Make a synthetic ranking dataset for demonstration
X, y = make_classification(random_state=rng)
rng = np.random.default_rng(1994)
seed = 1994
X, y = make_classification(random_state=seed)
rng = np.random.default_rng(seed)
n_query_groups = 3
qid = rng.integers(0, 3, size=X.shape[0])
Expand Down
43 changes: 43 additions & 0 deletions doc/tutorials/param_tuning.rst
@@ -58,3 +58,46 @@ This can affect the training of the XGBoost model, and there are two ways to improve

- In such a case, you cannot re-balance the dataset
- Set parameter ``max_delta_step`` to a finite number (say 1) to help convergence


*********************
Reducing Memory Usage
*********************

If you are using an HPO library like :py:class:`sklearn.model_selection.GridSearchCV`,
please control the number of threads it can use. It's best to let XGBoost run in parallel
instead of asking `GridSearchCV` to run multiple experiments at the same time (see the
sketch at the end of this section). For instance, creating a fold of data for cross
validation can consume a significant amount of memory:

.. code-block:: python

  # This creates a copy of the dataset. X and X_train are both in memory at the same time.
  # This happens for every thread at the same time if you run `GridSearchCV` with
  # `n_jobs` larger than 1.
  X_train, X_test, y_train, y_test = train_test_split(X, y)
.. code-block:: python

  df = pd.DataFrame()
  # This creates a new copy of the dataframe, even if you specify the inplace parameter
  new_df = df.drop(...)
.. code-block:: python

  array = np.array(...)
  # This may or may not make a copy of the data, depending on the type of the data
  array.astype(np.float32)
.. code-block:: python

  # NumPy uses double precision by default; do you actually need it?
  array = np.array(...)
You can find some more specific memory-reduction practices scattered through the
documentation, for instance: :doc:`/tutorials/dask`, :doc:`/gpu/index`, and
:doc:`/contrib/scaling`. However, before going into these, being conscious about making
data copies is a good starting point. It usually consumes a lot more memory than people
expect.
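
Tying the thread-budget advice above together, a hedged sketch (the parameter values are
illustrative, not recommendations):

.. code-block:: python

  from sklearn.model_selection import GridSearchCV
  from xgboost import XGBClassifier

  # Let XGBoost parallelize internally and keep the search itself single-threaded,
  # so only one copy of the training data is live at a time.
  search = GridSearchCV(
      estimator=XGBClassifier(n_jobs=8, tree_method="hist"),
      param_grid={"max_depth": [4, 6], "learning_rate": [0.1, 0.3]},
      n_jobs=1,
  )
  # search.fit(X, y)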
11 changes: 3 additions & 8 deletions rabit/include/rabit/internal/io.h
@@ -19,8 +19,7 @@
#include "rabit/internal/utils.h"
#include "rabit/serializable.h"

namespace rabit {
namespace utils {
namespace rabit::utils {
/*! \brief re-use definition of dmlc::SeekStream */
using SeekStream = dmlc::SeekStream;
/**
@@ -31,9 +30,6 @@ struct MemoryFixSizeBuffer : public SeekStream {
// similar to SEEK_END in libc
static std::size_t constexpr kSeekEnd = std::numeric_limits<std::size_t>::max();

protected:
MemoryFixSizeBuffer() = default;

public:
/**
* @brief Ctor
@@ -68,7 +64,7 @@ struct MemoryFixSizeBuffer : public SeekStream {
* @brief Current position in the buffer (stream).
*/
std::size_t Tell() override { return curr_ptr_; }
virtual bool AtEnd() const { return curr_ptr_ == buffer_size_; }
[[nodiscard]] virtual bool AtEnd() const { return curr_ptr_ == buffer_size_; }

protected:
/*! \brief in memory buffer */
@@ -119,6 +115,5 @@ struct MemoryBufferStream : public SeekStream {
/*! \brief current pointer */
size_t curr_ptr_;
}; // class MemoryBufferStream
} // namespace utils
} // namespace rabit
} // namespace rabit::utils
#endif // RABIT_INTERNAL_IO_H_