Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Gini impurity: formula and docstring fixed. #1495

Merged
merged 1 commit into from
Sep 16, 2016

Conversation

ajdapretnar
Copy link
Contributor

Gini coefficient and Gini impurity are not the same. Docstring fixed to proper Wikipedia source.

As observed in many sources, the formula for Gini impurity never contains division by 2. @BlazZupan, please check whether this is the right formula or not.

Also, "Gini" was a misleading attribute name in Rank widget. Here we're actually measuring Gini decrease. So even perhaps "Gini Gain" might not be the right name for this. Comments? @lanzagar @astaric

Probably some fixes to names should be made elsewhere, too.

@ajdapretnar
Copy link
Contributor Author

Tests are now fixed as well.

@codecov-io
Copy link

codecov-io commented Aug 2, 2016

Current coverage is 88.86% (diff: 100%)

Merging #1495 into master will not change coverage

@@             master      #1495   diff @@
==========================================
  Files            78         78          
  Lines          8099       8099          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits           7197       7197          
  Misses          902        902          
  Partials          0          0          

Sunburst

Powered by Codecov. Last update abbf69f...c839687

@@ -207,7 +207,7 @@ def _gini(D):
"""Gini index of class-distribution matrix"""
P = D / np.sum(D, axis=0)
return sum((np.ones(1 if len(D.shape) == 1 else D.shape[1]) - np.sum(np.square(P), axis=0))
* 0.5 * np.sum(D, axis=0) / np.sum(D))
* np.sum(D, axis=0) / np.sum(D))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Among the several equivalent representations of the proposed equation, this one seems nicest:

return 1 - (P*P).sum()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it looks the nicest and most concise. Can maybe @lanzagar also comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance it looks to me like at least the second line is needed to get the weighted impurity we want.
It is possible that 1 can be used instead of the complications with np.ones (should be tested).

Copy link
Contributor Author

@ajdapretnar ajdapretnar Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out of my league. Perhaps merge this PR to fix division by two and make another one with improved formula? Otherwise I'll just copy-paste whatever you'll tell me to. 😀

Copy link
Contributor

@lanzagar lanzagar Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy paste:

    P = np.asarray(D / np.sum(D, axis=0))
    return np.sum((1 - np.sum(P**2, axis=0)) *
                  np.sum(D, axis=0) / np.sum(D))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@ajdapretnar ajdapretnar force-pushed the gini-formula branch 2 times, most recently from 2b4a24e to 81b4360 Compare September 16, 2016 10:10
@lanzagar lanzagar changed the title [RFC] Gini impurity: formula and docstring fixed. [FIX] Gini impurity: formula and docstring fixed. Sep 16, 2016
@lanzagar lanzagar merged commit 3f3ebd4 into biolab:master Sep 16, 2016
@ajdapretnar ajdapretnar deleted the gini-formula branch October 14, 2016 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants