Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] OwLouvain: Add normalize data checkbox to PCA preprocessing #3573

Merged
merged 7 commits into from
Feb 15, 2019

Conversation

pavlin-policar
Copy link
Collaborator

@pavlin-policar pavlin-policar commented Feb 2, 2019

Edited on 4.2.2019

Issue

In #3448 we found that normalizing the data can have beneficial effects for t-SNE. This likely also holds for Louvain clustering, when PCA preprocessing is applied. It also makes little sense to enable data normalization for PCA in one widget, but not another.

Description of changes

I tried to separate the PCA specific parameters from the "Enable PCA preprocessing" by adding an indent. IMO it looks better than if there we no indent, but it's not very pretty.

image

Description of changes:

  • 024d125 Add the option to skip zero centering to the normalization preprocessor. This will also be necessary later on, when I add normalization for sparse matrices.

    This is already implemented in Orange.preprocess.preprocess.Scale, which seems to do exactly the same thing as Orange.preprocess.preprocess.Normalize. Indeed, it seems as though Scale actually uses the same normalization implementation. This is very confusing, as Normalize is just a slightly less powerful version of Scale, but having a much better name. It also seems that the only place Scale is used is in the Preprocess widget, while Normalize is used all over the place.

    This is very confusing and IMO Scale should be renamed to Normalize, since they do the same thing, but Scale does it better. In any case, I am using Normalize here, mainly because I found Scale after implementing this.

  • 495dfa8 Moved table_dense_sparse decorator into Orange.widget.testing.utils. It's quite handy.

  • e1d91ef Add normalization support for sparse data. This is included here because this is not PCA specific. We only scale by SD. For PCA, this is fine, because centering is applied in PCA. This is also allright where PCA is not used, because the table is directly used for computing pairwise distances between samples. Euclidean and Manhattan distance are independent of the position i.e. they are the same if the points are centered or not. So applying only scaling is all right here as well.

    Note that this is only correct as long as the distances are independent of absolute position. If we ever add other distances, this needs to be considered e.g. the cosine distance doesn't have this property.

  • bb1f3e0 Disable PCA components slider if Apply PCA is not checked; it makes little sense to be enabled.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov
Copy link

codecov bot commented Feb 2, 2019

Codecov Report

Merging #3573 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master    #3573      +/-   ##
==========================================
- Coverage   83.97%   83.97%   -0.01%     
==========================================
  Files         370      370              
  Lines       66941    66960      +19     
==========================================
+ Hits        56215    56229      +14     
- Misses      10726    10731       +5

@codecov
Copy link

codecov bot commented Feb 2, 2019

Codecov Report

Merging #3573 into master will increase coverage by 0.01%.
The diff coverage is 95.83%.

@@            Coverage Diff            @@
##           master   #3573      +/-   ##
=========================================
+ Coverage   83.99%     84%   +0.01%     
=========================================
  Files         370     370              
  Lines       67023   67098      +75     
=========================================
+ Hits        56298   56369      +71     
- Misses      10725   10729       +4

@janezd
Copy link
Contributor

janezd commented Feb 2, 2019

I tried to run the widget (using its "main"). It erred with

  File "/Users/janez/Dropbox/orange3/Orange/clustering/louvain.py", line 142, in fit
    graph, resolution=self.resolution, random_state=self.random_state
TypeError: best_partition() got an unexpected keyword argument 'random_state'

This is not related to this PR and may be a problem in my environment. However, the label remained "Running...". I think the widget should change the label to "Erred..." (it this the right verb? :) when optimization fail.

Unrelated to this PR, but can be fixed here since it involves a change in the same place: the error was reported in the status bar, yet it's an error in the code and should be shown as crash. Similar to issue #3548.

@pavlin-policar
Copy link
Collaborator Author

Yes, you're completely right. This slipped my mind completely. I'll definitely add something to indicate an error, but I don't think repeating the error message shown in the status bar is needed. It might encourage users to look for error messages somewhere in the widget UI, while this would be completely specific to this widget. I think it's better to enforce one single place for the error message on the widget (despite them being kind of hard to notice sometimes).

As far as the error you got, you likely have an old version of python-louvain, probably 0.11. I added random state support in the library and should be 0.13 or higher. So you probably just need to update the package version.

@pavlin-policar pavlin-policar force-pushed the louvain-pca-normalize branch 3 times, most recently from 39cf5b7 to a7a3b19 Compare February 2, 2019 13:59
@janezd
Copy link
Contributor

janezd commented Feb 2, 2019

Sure, no need to repeat the message. Just set the label to "Error" or something.

My louvain is indeed 0.11. Which is good, otherwise I wouldn't have spotted this glitch. :)

@lanzagar
Copy link
Contributor

lanzagar commented Feb 4, 2019

I would suggest to:

  • rename PCA Preprocessing box to Preprocessing
  • put Normalize data above "Apply PCA preprocessing" and have it independent of PCA
  • maybe change Components -> PCA Components

@lanzagar
Copy link
Contributor

lanzagar commented Feb 4, 2019

And also, same comment as in t-SNE: why is normalization not supported for sparse data?

@pavlin-policar
Copy link
Collaborator Author

PCA Components reads as Principal component analysis components. Principal components would be better but that's too long, and the slider basically disappears. PCs can be unclear. I don't have any other ideas. I'll keep it as PCA Components for now.

@pavlin-policar pavlin-policar force-pushed the louvain-pca-normalize branch 3 times, most recently from 939f9e2 to de07f08 Compare February 4, 2019 16:59
@pavlin-policar
Copy link
Collaborator Author

pavlin-policar commented Feb 4, 2019

I've updated the original PR description describing comments and changes.

Orange/preprocess/preprocess.py Outdated Show resolved Hide resolved
Orange/preprocess/preprocess.py Show resolved Hide resolved
@@ -190,6 +208,7 @@ def cancel(self):
self.__set_state_ready()

def commit(self):
# pylint: disable=too-many-branches
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no reason why this pylint check should be disabled for this function.
And I kind of agree with pylint that this function has a lot of ifs... Can we reduce them?
E.g. looking at the len(...attributes) < 1 check - wouldn't a better place for this be in set_data (and on error just set self.data to None, so it does not have to be checked ever again).

@pavlin-policar pavlin-policar force-pushed the louvain-pca-normalize branch 3 times, most recently from 46d64bf to ee5d4f3 Compare February 11, 2019 13:58
@pavlin-policar pavlin-policar force-pushed the louvain-pca-normalize branch 5 times, most recently from acc2add to 8fc6215 Compare February 15, 2019 10:19
@pavlin-policar pavlin-policar force-pushed the louvain-pca-normalize branch 2 times, most recently from 470625e to fc1fad9 Compare February 15, 2019 10:27
@lanzagar lanzagar merged commit 59228cf into biolab:master Feb 15, 2019
@pavlin-policar pavlin-policar deleted the louvain-pca-normalize branch February 15, 2019 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants