diff --git a/NEWS.md b/NEWS.md
index 135193270d09..c3c4d88ad3b8 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -3,9 +3,166 @@ XGBoost Change Log
 This file records the changes in xgboost library in reverse chronological order.
-## Master (2018.09.30)
-* BREAKING CHANGES
-  - External memory page files have changed, breaking backwards compatibility for temporary storage used during external memory training. This only affects external memory users upgrading their xgboost version - we recommend clearing all *.page files before resuming training. Model serialization is unaffected.
+## v0.81 (2018.11.04)
+### New feature: feature interaction constraints
+* Users are now able to control which features (independent variables) are allowed to interact by specifying feature interaction constraints (#3466).
+* A [tutorial](https://xgboost.readthedocs.io/en/release_0.81/tutorials/feature_interaction_constraint.html) is available, as well as [R](https://github.com/dmlc/xgboost/blob/9254c58e4dfff6a59dc0829a2ceb02e45ed17cd0/R-package/demo/interaction_constraints.R) and [Python](https://github.com/dmlc/xgboost/blob/9254c58e4dfff6a59dc0829a2ceb02e45ed17cd0/tests/python/test_interaction_constraints.py) examples.
+
+### New feature: learning to rank using scikit-learn interface
+* The learning-to-rank task is now available in the scikit-learn interface of the Python package (#3560, #3848). It is now possible to integrate the XGBoost ranking model into the scikit-learn learning pipeline.
+* An example of using the `XGBRanker` class can be found at [demo/rank/rank_sklearn.py](https://github.com/dmlc/xgboost/blob/24a268a2e3cb17302db3d72da8f04016b7d352d9/demo/rank/rank_sklearn.py).
+
+### New feature: R interface for SHAP interactions
+* SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. Previously, this feature was only available from the Python package; now it is available from the R package as well (#3636).
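To make the idea of feature interaction constraints concrete, here is a minimal pure-Python sketch of how a constraint list could restrict the candidate features at a tree node. It does not call XGBoost: the helper `allowed_features` and its rule (a feature is allowed if it shares a constraint group with every feature already used on the path from the root) are assumptions made for illustration, not the library's implementation. The linked tutorial is the reference for the actual behavior and parameter syntax.

```python
from typing import Iterable, List, Set


def allowed_features(path_features: Iterable[int],
                     constraints: List[List[int]],
                     num_features: int) -> Set[int]:
    """Hypothetical helper: features that may be split on next.

    path_features: features already used on the path from the root.
    constraints:   groups of feature indices that are allowed to interact.
    """
    used = set(path_features)
    if not used:
        # Nothing has been split on yet, so every feature is a candidate.
        return set(range(num_features))
    # A feature can always be reused on its own path.
    allowed: Set[int] = set(used)
    for group in constraints:
        # Only groups containing every feature on the path remain valid.
        if used.issubset(group):
            allowed.update(group)
    return allowed


if __name__ == "__main__":
    groups = [[0, 1], [2, 3, 4]]
    print(sorted(allowed_features([], groups, 5)))
    print(sorted(allowed_features([0], groups, 5)))
    print(sorted(allowed_features([2, 4], groups, 5)))
```

Under this simplified rule, a path that has already split on feature 0 can only continue with features 0 and 1, so features 0 and 2 never appear in the same branch.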
+
+### New feature: GPU predictor now uses multiple GPUs to predict
+* The GPU predictor is now able to utilize multiple GPUs at once to accelerate prediction (#3738)
+
+### New feature: Scale distributed XGBoost to large-scale clusters
+* Fix OS file descriptor limit assertion error on large clusters (#3835, dmlc/rabit#73) by replacing the `select()` based AllReduce/Broadcast with a `poll()` based implementation.
+* Mitigate the tracker "thundering herd" issue on large clusters. Add exponential backoff retry when workers connect to the tracker.
+* With this change, we were able to scale to 1.5k executors on a 12 billion row dataset after some minor tweaks.
+
+### Major bug fix: learning to rank with XGBoost4J-Spark
+* Previously, `repartitionForData` would shuffle data and lose the ordering necessary for the ranking task.
+* To fix this issue, data points within each RDD partition are explicitly grouped by their group (query session) IDs (#3654). Empty RDD partitions are also handled carefully (#3750).
+
+### Major bug fix: early stopping fixed in XGBoost4J-Spark
+* The earlier implementation of early stopping had incorrect semantics and did not let users specify the optimization direction (maximize / minimize).
+* A new parameter `maximize_evaluation_metrics` specifies whether a metric should be maximized or minimized as part of the early stopping criteria (#3808). Early stopping now also has correct semantics.
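The early stopping fix above hinges on knowing whether a metric should go up or down. The following pure-Python sketch shows why the direction matters. The function `best_round`, its `patience` argument, and the sample scores are all hypothetical and only illustrate the general technique; this is not the XGBoost4J-Spark implementation.

```python
from typing import List


def best_round(scores: List[float], patience: int, maximize: bool) -> int:
    """Hypothetical helper: index of the best round before stopping.

    Training stops once `patience` consecutive rounds fail to improve
    on the best score seen so far, in the requested direction.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = 0
    for i in range(1, len(scores)):
        if maximize:
            improved = scores[i] > scores[best]
        else:
            improved = scores[i] < scores[best]
        if improved:
            best = i
        elif i - best >= patience:
            # No improvement for `patience` rounds: stop early.
            break
    return best


if __name__ == "__main__":
    auc = [0.70, 0.75, 0.74, 0.73, 0.80]  # made-up scores, higher is better
    print(best_round(auc, patience=2, maximize=True))
    print(best_round(auc, patience=2, maximize=False))
```

With the wrong direction, the same made-up AUC history selects round 0 instead of round 1, which is the kind of mistake a direction flag is meant to prevent.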
+
+### API changes
+* Column sampling by level (`colsample_bylevel`) is now functional for the `hist` algorithm (#3635, #3862)
+* Add `disable_default_eval_metric` parameter to disable the default metric (#3606)
+* Experimental AVX support for gradient computation is removed (#3752)
+* XGBoost4J-Spark
+  - Add `rank:ndcg` and `rank:map` to supported objectives (#3697)
+* Python package
+  - Add `callbacks` argument to `fit()` function of scikit-learn API (#3682)
+  - Add `XGBRanker` to scikit-learn interface (#3560, #3848)
+  - Add `validate_features` argument to `predict()` function of scikit-learn API (#3653)
+  - Allow scikit-learn grid search over parameters specified as keyword arguments (#3791)
+  - Add `coef_` and `intercept_` as properties of scikit-learn wrapper (#3855). Some scikit-learn functions expect these properties.
+
+### Performance improvements
+* Address very high GPU memory usage for large data (#3635)
+* Fix performance regression within `EvaluateSplits()` of `gpu_hist` algorithm (#3680)
+
+### Bug-fixes
+* Fix a problem in GPU quantile sketch with tiny instance weights (#3628)
+* Fix copy constructor for `HostDeviceVectorImpl` to prevent dangling pointers (#3657)
+* Fix a bug in partitioned file loading (#3673)
+* Fix an uninitialized pointer in `gpu_hist` (#3703)
+* Reshard data among GPUs when the number of GPUs is changed (#3721)
+* Add back `max_delta_step` to split evaluation (#3668)
+* Do not round up integer thresholds for integer features in JSON dump (#3717)
+* Use `dmlc::TemporaryDirectory` to handle temporaries in a cross-platform way (#3783)
+* Fix accuracy problem with `gpu_hist` when `min_child_weight` and `lambda` are set to 0 (#3793)
+* Make sure that `tree_method` parameter is recognized and not silently ignored (#3849)
+* XGBoost4J-Spark
+  - Make sure `thresholds` are considered when executing `predict()` method (#3577)
+  - Avoid losing precision when computing probabilities by converting to `Double` early (#3576)
+  - `getTreeLimit()` should return `Int` (#3602)
+  - Fix checkpoint serialization on HDFS (#3614)
+  - Throw `ControlThrowable` instead of `InterruptedException` so that it is properly re-thrown (#3632)
+  - Remove extraneous output to stdout (#3665)
+  - Allow specification of task type for custom objectives and evaluations (#3646)
+  - Fix distributed updater check (#3739)
+  - Fix issue when spark job execution thread cannot return before we execute `first()` (#3758)
+* Python package
+  - Fix accessing `DMatrix.handle` before it is set (#3599)
+  - `XGBClassifier.predict()` should return margin scores when `output_margin` is set to true (#3651)
+  - Early stopping callback should maximize metric of form `NDCG@n-` (#3685)
+  - Preserve feature names when slicing `DMatrix` (#3766)
+* R package
+  - Replace `nround` with `nrounds` to match actual parameter (#3592)
+  - Amend `xgb.createFolds` to handle classes of a single element (#3630)
+  - Fix buggy random generator and make `colsample_bytree` functional (#3781)
+
+### Maintenance: testing, continuous integration, build system
+* Add sanitizer tests to Travis CI (#3557)
+* Add NumPy, Matplotlib, Graphviz as requirements for doc build (#3669)
+* Comply with CRAN submission policy (#3660, #3728)
+* Remove copy-paste error in JVM test suite (#3692)
+* Disable flaky tests in `R-package/tests/testthat/test_update.R` (#3723)
+* Make Python tests compatible with scikit-learn 0.20 release (#3731)
+* Separate out restricted and unrestricted tasks, so that pull requests don't build downloadable artifacts (#3736)
+* Add multi-GPU unit test environment (#3741)
+* Allow plug-ins to be built by CMake (#3752)
+* Test wheel compatibility on CPU containers for pull requests (#3762)
+* Fix broken doc build due to Matplotlib 3.0 release (#3764)
+* Produce `xgboost.so` for XGBoost-R on Mac OSX, so that `make install` works (#3767)
+* Retry Jenkins CI tests up to 3 times to improve reliability (#3769, #3769, #3775, #3776, #3777)
+* Add basic unit tests for `gpu_hist` algorithm (#3785)
+* Fix Python environment for distributed unit tests (#3806)
+* Test wheels on CUDA 10.0 container for compatibility (#3838)
+* Fix JVM doc build (#3853)
+
+### Maintenance: Refactor C++ code for legibility and maintainability
+* Merge generic device helper functions into `GPUSet` class (#3626)
+* Re-factor column sampling logic into `ColumnSampler` class (#3635, #3637)
+* Replace `std::vector` with `HostDeviceVector` in `MetaInfo` and `SparsePage` (#3446)
+* Simplify `DMatrix` class (#3395)
+* De-duplicate CPU/GPU code using `Transform` class (#3643, #3751)
+* Remove obsolete `QuantileHistMaker` class (#3761)
+* Remove obsolete `NoConstraint` class (#3792)
+
+### Other Features
+* C++20-compliant Span class for safe pointer indexing (#3548, #3588)
+* Add helper functions to manipulate multiple GPU devices (#3693)
+* XGBoost4J-Spark
+  - Allow specifying the host IP in the `xgboost-tracker.properties` file (#3833). This comes in handy when the `hosts` file doesn't correctly define localhost.
+
+### Usability Improvements
+* Add reference to GitHub repository in `pom.xml` of JVM packages (#3589)
+* Add R demo of multi-class classification (#3695)
+* Document JSON dump functionality (#3600, #3603)
+* Document CUDA requirement and lack of external memory for GPU algorithms (#3624)
+* Document LambdaMART objectives, both pairwise and listwise (#3672)
+* Document `aucpr` evaluation metric (#3687)
+* Document gblinear parameters: `feature_selector` and `top_k` (#3780)
+* Add instructions for using MinGW-built XGBoost with Python (#3774)
+* Remove nonexistent parameter `use_buffer` from documentation (#3610)
+* Update Python API doc to include all classes and members (#3619, #3682)
+* Fix typos and broken links in documentation (#3618, #3640, #3676, #3713, #3759, #3784, #3843, #3852)
+* Binary classification demo should produce LIBSVM with 0-based indexing (#3652)
+* Process data once for Python and CLI examples of learning to rank (#3666)
+* Include full text of Apache 2.0 license in the repository (#3698)
+* Save predictor parameters in model file (#3856)
+* JVM packages
+  - Let users specify feature names when calling `getModelDump` and `getFeatureScore` (#3733)
+  - Warn the user about the lack of over-the-wire encryption (#3667)
+  - Fix errors in examples (#3719)
+  - Document choice of trackers (#3831)
+  - Document that vanilla Apache Spark is required (#3854)
+* Python package
+  - Document that custom objective can't contain colon (:) (#3601)
+  - Show a better error message for failed library loading (#3690)
+  - Document that feature importance is unavailable for non-tree learners (#3765)
+  - Document behavior of `get_fscore()` for zero-importance features (#3763)
+  - Recommend pickling as the way to save `XGBClassifier` / `XGBRegressor` / `XGBRanker` (#3829)
+* R package
+  - Enlarge variable importance plot to make it more visible (#3820)
+
+### BREAKING CHANGES
+* External memory page files have changed, breaking backwards compatibility for temporary storage used during external memory training. This only affects external memory users upgrading their xgboost version - we recommend clearing all `*.page` files before resuming training. Model serialization is unaffected.
+
+### Known issues
+* Quantile sketcher fails to produce any quantile for some edge cases (#2943)
+* The `hist` algorithm leaks memory when used with learning rate decay callback (#3579)
+* Using a custom evaluation function together with early stopping causes an assertion failure in XGBoost4J-Spark (#3595)
+* Early stopping doesn't work with `gblinear` learner (#3789)
+* Label and weight vectors are not resharded upon the change in number of GPUs (#3794). To get around this issue, delete the `DMatrix` object and re-load.
+* The `DMatrix` Python objects are initialized with incorrect values when given array slices (#3841)
+* The `gpu_id` parameter is broken and not yet properly supported (#3850)
+
+### Acknowledgement
+**Contributors** (in no particular order): Hyunsu Cho (@hcho3), Jiaming Yuan (@trivialfis), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Andy Adinets (@canonizer), Vadim Khotilovich (@khotilov), Sergei Lebedev (@superbobry)
+
+**First-time Contributors** (in no particular order): Matthew Tovbin (@tovbinm), Jakob Richter (@jakob-r), Grace Lam (@grace-lam), Grant W Schneider (@grantschneider), Andrew Thia (@BlueTea88), Sergei Chipiga (@schipiga), Joseph Bradley (@jkbradley), Chen Qin (@chenqin), Jerry Lin (@linjer), Dmitriy Rybalko (@rdtft), Michael Mui (@mmui), Takahiro Kojima (@515hikaru), Bruce Zhao (@BruceZhaoR), Wei Tian (@weitian), Saumya Bhatnagar (@Sam1301), Juzer Shakir (@JuzerShakir), Zhao Hang (@cleghom), Jonathan Friedman (@jontonsoup), Bruno Tremblay (@meztez), @Shiki-H, @mrgutkun, @gorogm, @htgeis, @jakehoare, @zengxy, @KOLANICH
+
+**First-time Reviewers** (in no particular order): Nikita Titov (@StrikerRUS), Xiangrui Meng (@mengxr), Nirmal Borah (@Nirmal-Neel)
+
 ## v0.80 (2018.08.13)
 * **JVM packages received a major upgrade**: To consolidate the APIs and improve the user experience, we refactored the design of XGBoost4J-Spark in a significant manner. (#3387)
diff --git a/jvm-packages/pom.xml b/jvm-packages/pom.xml
index 49de96691fb0..d7bd19649cea 100644
--- a/jvm-packages/pom.xml
+++ b/jvm-packages/pom.xml
@@ -6,7 +6,7 @@
 ml.dmlc
 xgboost-jvm
- 0.81-SNAPSHOT
+ 0.81
 pom
 XGBoost JVM Package
 JVM Package for XGBoost
diff --git a/jvm-packages/xgboost4j-example/pom.xml b/jvm-packages/xgboost4j-example/pom.xml
index 4449d3aa3a98..28c954d134a9 100644
--- a/jvm-packages/xgboost4j-example/pom.xml
+++ b/jvm-packages/xgboost4j-example/pom.xml
@@ -6,10 +6,10 @@
 ml.dmlc
 xgboost-jvm
- 0.81-SNAPSHOT
+ 0.81
 xgboost4j-example
- 0.81-SNAPSHOT
+ 0.81
 jar
@@ -26,7 +26,7 @@
 ml.dmlc
 xgboost4j-spark
- 0.81-SNAPSHOT
+ 0.81
 org.apache.spark
@@ -37,7 +37,7 @@
 ml.dmlc
 xgboost4j-flink
- 0.81-SNAPSHOT
+ 0.81
 org.apache.commons
diff --git a/jvm-packages/xgboost4j-flink/pom.xml b/jvm-packages/xgboost4j-flink/pom.xml
index 47b5f871ff05..bf1a2f42c9fc 100644
--- a/jvm-packages/xgboost4j-flink/pom.xml
+++ b/jvm-packages/xgboost4j-flink/pom.xml
@@ -6,10 +6,10 @@
 ml.dmlc
 xgboost-jvm
- 0.81-SNAPSHOT
+ 0.81
 xgboost4j-flink
- 0.81-SNAPSHOT
+ 0.81
@@ -26,7 +26,7 @@
 ml.dmlc
 xgboost4j
- 0.81-SNAPSHOT
+ 0.81
 org.apache.commons
diff --git a/jvm-packages/xgboost4j-spark/pom.xml b/jvm-packages/xgboost4j-spark/pom.xml
index 601bad1b9b96..a0820faaecb3 100644
--- a/jvm-packages/xgboost4j-spark/pom.xml
+++ b/jvm-packages/xgboost4j-spark/pom.xml
@@ -6,7 +6,7 @@
 ml.dmlc
 xgboost-jvm
- 0.81-SNAPSHOT
+ 0.81
 xgboost4j-spark
@@ -24,7 +24,7 @@
 ml.dmlc
 xgboost4j
- 0.81-SNAPSHOT
+ 0.81
 org.apache.spark
diff --git a/jvm-packages/xgboost4j/pom.xml b/jvm-packages/xgboost4j/pom.xml
index 2a069e7670c1..ad7921ebb64c 100644
--- a/jvm-packages/xgboost4j/pom.xml
+++ b/jvm-packages/xgboost4j/pom.xml
@@ -6,10 +6,10 @@
 ml.dmlc
 xgboost-jvm
- 0.81-SNAPSHOT
+ 0.81
 xgboost4j
- 0.81-SNAPSHOT
+ 0.81
 jar
diff --git a/python-package/xgboost/VERSION b/python-package/xgboost/VERSION
index 885b0568652e..453a698ed5ba 100644
--- a/python-package/xgboost/VERSION
+++ b/python-package/xgboost/VERSION
@@ -1 +1 @@
-0.80
+0.81