
[ENH] Unified clustering API #3814

Merged · 5 commits merged into biolab:master on Jun 21, 2019
Conversation

PrimozGodec (Contributor) commented May 22, 2019

Issue

Fixes #3805

Description of changes

Clustering algorithms are simplified and unified. Each clustering algorithm (k-Means, DBSCAN, Louvain) inherits from a common Clustering class. Clustering itself is stateless: it does not remember the data or model parameters (such as centroids). When called, it returns the clusters; when fit_storage is called, it creates a new instance of ClusteringModel which holds the model parameters (for k-Means these are the centroids). A model can be created only for k-Means; the other algorithms either have no parameters to store or do not support prediction.

The interface for calling the clustering is now different. Here is an example for k-Means:

d = Table("iris")
km = KMeans(preprocessors=None, n_clusters=3)  #init the clustering
clusters = km(d)  # compute clusters

or, in the case where we need the model:

d = Table("iris")
km = KMeans(preprocessors=None, n_clusters=3)  #init the clustering
model = km.fit_storage(d)  # fit the model

I also removed the silhouette computation from the k-Means class, since I think it is not part of the clustering itself but a separate concern. From now on it will be called directly from the widget.
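
For illustration, computing the silhouette outside the clustering class could look roughly like this with scikit-learn (a sketch, assuming km(data) returns a plain 1-D array of cluster labels; the widget's actual code may differ):

from sklearn.metrics import silhouette_score
from Orange.data import Table
from Orange.clustering import KMeans

data = Table("iris")
clusters = KMeans(n_clusters=3)(data)       # cluster labels, assumed to be a 1-D array
score = silhouette_score(data.X, clusters)  # silhouette computed outside the clustering class
print(score)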

Includes
  • Code changes
  • Tests
  • Documentation

PrimozGodec (Contributor, Author)

@janezd and @markotoplak, are you OK with the proposed scheme? I will push the changes for tests and widgets after we agree on the clustering scheme.

@janezd self-assigned this May 24, 2019
pavlin-policar (Collaborator)

Conceptually, some of this doesn't make sense. This is what I would expect to happen when using clustering classes: I call DBSCAN with some data and I get a model. I would assume the cluster labels are available at this point (via some cluster_assignments field or something). Now I get some new data. I notice the DBSCAN model has a transform method, so why not use it to check which clusters these new samples would belong to? OK, I call predict, and now I am baffled why there may or may not be more clusters in the new data than in the original data. If I overlay the original data with the new data, I notice that the clusters don't overlap at all. What gives!? I look into the code and, lo and behold, transform doesn't actually do what I think it does (or what I would expect it to do from having used scikit-learn) but actually computes a new clustering on this data. Why? After another 5 minutes of searching around the internet, I find that DBSCAN does not support transform.

For k-means, this makes complete sense, but DBSCAN and Louvain don't have a transform method, nor is it straightforward to add one.

I propose that the cluster assignments of the training data be available on the model as a field (so I can get cluster assignments just by calling fit), and if the method supports transform, it can be implemented there. If it doesn't support transform, it should fail loudly with a NotImplementedError.
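
A minimal sketch of this proposal (hypothetical class and attribute names, not the actual Orange code): the model exposes the training assignments as a field, and predict on new data is only implemented where the method truly supports it.

import numpy as np

class ClusteringModel:
    def __init__(self, labels):
        # cluster assignments of the training data, available right after fitting
        self.labels = labels

    def predict(self, X):
        # methods without a real transform/predict fail loudly
        raise NotImplementedError(
            "This clustering method cannot assign new samples to clusters.")

class KMeansModel(ClusteringModel):
    def __init__(self, labels, centroids):
        super().__init__(labels)
        self.centroids = centroids

    def predict(self, X):
        # k-means can assign each new sample to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return np.argmin(dists, axis=1)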

PrimozGodec (Contributor, Author)

@pavlin-policar I agree with your solution. I also do not like that we call clustering twice for the same results, or that we pretend there is a transform function. @janezd, what do you think about Pavlin's idea?

@janezd removed their assignment May 30, 2019
@PrimozGodec added the "needs discussion" label (Core developers need to discuss the issue) May 31, 2019
PrimozGodec (Contributor, Author)

Tests are failing because the widget for Louvain clustering has not been adapted to the new clustering classes yet. I would still like to hear opinions on @pavlin-policar's idea.

lanzagar (Contributor) commented Jun 7, 2019

Main points of some additional discussion:

  • removing the compute_silhouette_score parameter from kMeans should be done by deprecating it first, not by crashing on use of the previous API
  • when wrapping sklearn methods in Orange, we copy the parameters (I guess mostly for introspection, code completion, etc.). This practice can be discussed, but I would prefer to keep it as it is in this PR and handle it separately if someone wants to propose a general change (not just for clustering methods)
  • because ClusteringModel is not really something that is often used/useful, the Clustering.__call__ method could default to the fit_predict behaviour and just return the clusters. For methods that support it (kmeans), whoever wants the model can get it by explicitly calling fit/fit_storage instead of __call__ (see the sketch below)
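
A minimal sketch of the first and last points above, with hypothetical simplified signatures (illustrative only, not the actual Orange implementation):

import warnings

class Clustering:
    def __init__(self, preprocessors=None):
        self.preprocessors = preprocessors

    def fit_storage(self, data):
        # subclasses return a ClusteringModel holding the fitted parameters
        raise NotImplementedError

    def __call__(self, data):
        # default to fit_predict behaviour: fit and return only the cluster labels
        return self.fit_storage(data).labels

class KMeans(Clustering):
    def __init__(self, n_clusters=8, compute_silhouette_score=None, preprocessors=None):
        super().__init__(preprocessors)
        self.n_clusters = n_clusters
        if compute_silhouette_score is not None:
            # keep accepting the old parameter, but deprecate it instead of crashing
            warnings.warn(
                "compute_silhouette_score is deprecated and has no effect; "
                "compute silhouettes separately (e.g. in the widget).",
                DeprecationWarning)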

@lanzagar removed their assignment Jun 7, 2019
@lanzagar changed the title from "Clustering simplified" to "[WIP] Clustering simplified" Jun 7, 2019
@PrimozGodec force-pushed the clustering branch 11 times, most recently from febfdb9 to 055d273 on June 12, 2019 10:05
PrimozGodec (Contributor, Author)

@lanzagar, can you check whether the clustering implementation is what we want? Meanwhile, I will fix lint and add some more tests for clustering.

codecov bot commented Jun 13, 2019

Codecov Report

Merging #3814 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3814      +/-   ##
==========================================
+ Coverage   84.29%   84.31%   +0.01%     
==========================================
  Files         384      385       +1     
  Lines       72749    72758       +9     
==========================================
+ Hits        61327    61343      +16     
+ Misses      11422    11415       -7

Review comments (now resolved) on Orange/clustering/clustering.py and pylintrc
@PrimozGodec removed the "needs discussion" label Jun 14, 2019
@PrimozGodec force-pushed the clustering branch 6 times, most recently from c7e9c4c to e9bcf1e on June 17, 2019 09:44
@PrimozGodec force-pushed the clustering branch 4 times, most recently from e73e590 to 86295cb on June 17, 2019 14:46
@PrimozGodec changed the title from "[WIP] Clustering simplified" to "Clustering simplified" Jun 17, 2019
PrimozGodec (Contributor, Author)

@lanzagar I think it is done now. Can you check if there is anything to correct?

Review comments (now resolved) on Orange/clustering/louvain.py, Orange/clustering/__init__.py, and Orange/tests/test_clustering_kmeans.py
@PrimozGodec changed the title from "Clustering simplified" to "[ENH] Clustering simplified" Jun 21, 2019
@PrimozGodec changed the title from "[ENH] Clustering simplified" to "[ENH] Unified clustering API" Jun 21, 2019
@lanzagar merged commit 6f3aef2 into biolab:master Jun 21, 2019
@PrimozGodec deleted the clustering branch June 21, 2019 12:39
Linked issue: Different interfaces for clustering methods (#3805) · 4 participants