[DOC] Discretize: fixes
ajdapretnar committed Apr 6, 2022
1 parent a465ca4 commit 621cf7a
Showing 5 changed files with 11 additions and 11 deletions.
22 changes: 11 additions & 11 deletions doc/visual-programming/source/widgets/data/discretize.md
@@ -1,7 +1,7 @@
<!-- Discretize -->
Discretize
==========

Discretizes continuous attributes from an input dataset.
Converts numeric attributes to categorical.

**Inputs**

@@ -13,22 +13,22 @@ Discretizes continuous attributes from an input dataset.

The **Discretize** widget [discretizes](https://en.wikipedia.org/wiki/Discretization) numeric variables.

![](images/Discretize-All-stamped.png)
![](images/Discretize.png)

1. Set default method for discretization.

2. Select particular variables to set specific discretization methods. Hovering over a variable shows intervals.
2. Select variables to set specific discretization methods for each. Hovering over a variable shows intervals.

3. Discretization methods

- **Keep numeric** keeps the variable as it is.
- **Remove** removes the variable.
- **Natural binning** finds nice thresholds for the variable's range of values, for instance 10, 20, 30 or 0.2, 0.4, 0.6, 0.8. We can set the desired number of bins; the actual number will depend upon the interval.
- **Fixed width** uses a user-defined bin width. Boundaries of bins will be multiples of width. For instance, if the width is 10 and variable's values range from 35 to 68, the resulting bins will be <40, 40-50, 50-60, >60. This method does not work for time variables. If the width is too large (resulting in a single interval) or too small (resulting in more than 100 intervals), the variable is removed.
- **Natural binning** finds nice thresholds for the variable's range of values, for instance 10, 20, 30 or 0.2, 0.4, 0.6, 0.8. We can set the desired number of bins; the actual number will depend on the interval.
- **Fixed width** uses a user-defined bin width. Boundaries of bins will be multiples of width. For instance, if the width is 10 and the variable's values range from 35 to 68, the resulting bins will be <40, 40-50, 50-60, >60. This method does not work for time variables. If the width is too large (resulting in a single interval) or too small (resulting in more than 100 intervals), the variable is removed.
- **Time interval** is similar to Fixed width, but for time variables. We specify the width and a time unit, e.g. 4 months or 3 days. Bin boundaries will be multiples of the interval; e.g. with 4 months, bins will always include Jan-Mar, Apr-Jun, Jul-Sep and Oct-Dec.
- **[Equal-frequency]**(http://www.saedsayad.com/unsupervised_binning.htm) splits the attribute into the given number of intervals with approximately the same number of instances.
- **[Equal-frequency](http://www.saedsayad.com/unsupervised_binning.htm)** splits the attribute into a given number of intervals with approximately the same number of instances.
- [Equal-width](https://en.wikipedia.org/wiki/Data_binning) evenly splits the range between the smallest and the largest observed value.
- [Entropy-MDL](http://ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf), is a top-down discretization invented by Fayyad and Irani, which recursively splits the attribute at a cut maximizing information gain, until the gain is lower than the minimal description length of the cut. This discretization can result in an arbitrary number of intervals, including a single interval, in which case the variable is discarded as useless (removed).
- [Entropy-MDL](http://ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf) is a top-down discretization invented by Fayyad and Irani, which recursively splits the attribute at a cut maximizing information gain, until the gain is lower than the minimal description length of the cut. This discretization can result in an arbitrary number of intervals, including a single interval, in which case the variable is discarded as useless (removed).
- **Custom** allows entering an increasing, comma-separated list of thresholds. This is not applicable to time variables.
- **Use default setting** (enabled for particular settings and not default) sets the method to the one specified as "Default setting".
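
The threshold-based methods in the list above can be illustrated with a minimal Python sketch. This is not the widget's implementation: the function names are hypothetical, and the handling of values that fall exactly on a boundary, and of ties in equal-frequency binning, is an assumption made for illustration.

```python
import bisect
import math


def fixed_width_thresholds(values, width):
    """Thresholds at multiples of `width` strictly between min and max (assumed rule)."""
    lo, hi = min(values), max(values)
    k = math.floor(lo / width) + 1  # first multiple strictly above the minimum
    thresholds = []
    while k * width < hi:
        thresholds.append(k * width)
        k += 1
    return thresholds


def equal_width_thresholds(values, n_bins):
    """Split the range between the smallest and largest value into equal parts."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]


def equal_frequency_thresholds(values, n_bins):
    """Cut the sorted values into groups of roughly equal size (ties are deduplicated)."""
    ordered = sorted(values)
    cuts = {ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)}
    return sorted(cuts)


def bin_index(thresholds, value):
    """Index of the interval that contains `value`; a threshold opens the next bin."""
    return bisect.bisect_right(thresholds, value)


# Width 10 on values ranging from 35 to 68 gives the bins <40, 40-50, 50-60, >60.
ages = [35, 41, 50, 59, 68]
print(fixed_width_thresholds(ages, 10))
print([bin_index(fixed_width_thresholds(ages, 10), a) for a in ages])
```

Each function returns only the list of thresholds, so `bin_index` can be reused for any of the methods, including a custom, increasing list of thresholds.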

@@ -37,10 +37,10 @@ The **Discretize** widget [discretizes](https://en.wikipedia.org/wiki/Discretiza
Example
-------

In the schema below, we tool the *Heart disease* data set and
In the schema below, we took the *Heart disease* data set and
- discretized *age* to a fixed interval of 10 (years),
- *max HR* to approximately 6 bins (this closest match were 7 bins with a width of 25),
- *max HR* to approximately 6 bins (the closest match was 7 bins with a width of 25),
- removed *Cholesterol*,
- and used *entropy-mdl* for remaining variables, which resulted in removing *rest SBP* and in two intervals for *ST by exercise* and *major vessels colored*.
- and used *entropy-mdl* for the remaining variables, which resulted in removing *rest SBP* and in two intervals for *ST by exercise* and *major vessels colored*.

![](images/Discretize-Example.png)
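
To see how asking for approximately 6 bins can yield 7 bins with a width of 25, here is a rough sketch of one way to pick a "nice" width. It is an assumption-laden illustration, not the algorithm the widget uses: the candidate widths (1, 2, 2.5 and 5 times a power of ten), the function names, and the value range of 71 to 202 are all hypothetical.

```python
import math


def count_bins(lo, hi, width):
    """Bins produced when thresholds are the multiples of `width` strictly between lo and hi."""
    first = math.floor(lo / width) + 1
    last = math.ceil(hi / width) - 1
    return max(last - first + 1, 0) + 1


def nice_width(lo, hi, desired_bins):
    """Pick the candidate width whose bin count is closest to `desired_bins`."""
    best = None
    for exponent in range(-3, 7):
        for mantissa in (1, 2, 2.5, 5):
            width = mantissa * 10 ** exponent
            bins = count_bins(lo, hi, width)
            if bins < 2:
                continue  # a single interval is not a useful discretization
            score = abs(bins - desired_bins)
            if best is None or score < best[0]:
                best = (score, width, bins)
    return best[1], best[2]


# Hypothetical range for a heart-rate variable; 6 bins are requested.
width, bins = nice_width(71, 202, 6)
print(width, bins)
```

With this assumed range, a width of 20 gives 8 bins and a width of 50 gives 4, so a width of 25 with 7 bins is the closest match to the requested 6.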
Binary file not shown.
Binary file not shown.