
[ENH] Unified clustering API #3814

Merged · 5 commits merged into biolab:master on Jun 21, 2019
Conversation

PrimozGodec (Contributor) commented May 22, 2019

Issue

Fixes #3805

Description of changes

Clustering algorithms are simplified and unified. Each clustering algorithm (k-Means, DBSCAN, Louvain) inherits from a common Clustering class. Clustering itself is stateless: it does not remember the data or model parameters (such as centroids). When called, it returns the clusters; when fit_storage is called, it creates a new instance of ClusteringModel which holds the model parameters (for k-Means these are the centroids). A model can be created only for k-Means; the other algorithms either have no parameters to store or do not support prediction.

The interface for calling the clustering is now different. Here is an example for k-Means:

d = Table("iris")
km = KMeans(preprocessors=None, n_clusters=3)  #init the clustering
clusters = km(d)  # compute clusters

or, in the case where we need the model:

d = Table("iris")
km = KMeans(preprocessors=None, n_clusters=3)  #init the clustering
model = km.fit_storage(d)  # fit the model

I also removed the silhouette computation from the k-Means class, since I think it is not part of the clustering itself but a separate concern. From now on it will be called directly from the widget.
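
For illustration, computing the silhouette outside the clustering class could look roughly like this with scikit-learn (a sketch, assuming km(data) returns a plain 1-D array of cluster labels; the widget's actual code may differ):

from sklearn.metrics import silhouette_score
from Orange.data import Table
from Orange.clustering import KMeans

data = Table("iris")
clusters = KMeans(n_clusters=3)(data)       # cluster labels, assumed to be a 1-D array
score = silhouette_score(data.X, clusters)  # silhouette computed outside the clustering class
print(score)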

Includes
  • Code changes
  • Tests
  • Documentation

PrimozGodec (Contributor, Author)

@janezd and @markotoplak, are you OK with the proposed scheme? I will push the changes for tests and widgets after we agree on the clustering scheme.

@janezd self-assigned this May 24, 2019
pavlin-policar (Collaborator)

Conceptually, some of this doesn't make sense. This is what I would expect to happen when using clustering classes: I call DBSCAN with some data and I get a model. I would assume the cluster labels are available at this point (via some cluster_assignments field or something). Now I get some new data. I notice the DBSCAN model has a transform method, so why not use it to check which clusters these new samples would belong to? OK, I call predict, and now I am baffled why there may or may not be more clusters in the new data than in the original data. If I overlay the original data with the new data, I notice that the clusters don't overlap at all. What gives!? I look into the code and, lo and behold, transform doesn't actually do what I think it does (or what I would expect it to do from having used scikit-learn) but actually computes a new clustering on this data. Why? After another 5 minutes of searching around the internet, I find that DBSCAN does not support transform.

For k-means, this makes complete sense, but DBSCAN and Louvain don't have a transform method, nor is it straightforward to add one.

I propose that the cluster assignments of the training data be available on the model as a field (so I can get cluster assignments just by calling fit), and if the method supports transform, it can be implemented there. If it doesn't support transform, it should fail loudly with a NotImplementedError.
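
A minimal sketch of this proposal (hypothetical class and attribute names, not the actual Orange code): the model exposes the training assignments as a field, and predict on new data is only implemented where the method truly supports it.

import numpy as np

class ClusteringModel:
    def __init__(self, labels):
        # cluster assignments of the training data, available right after fitting
        self.labels = labels

    def predict(self, X):
        # methods without a real transform/predict fail loudly
        raise NotImplementedError(
            "This clustering method cannot assign new samples to clusters.")

class KMeansModel(ClusteringModel):
    def __init__(self, labels, centroids):
        super().__init__(labels)
        self.centroids = centroids

    def predict(self, X):
        # k-means can assign each new sample to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return np.argmin(dists, axis=1)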

PrimozGodec (Contributor, Author)

@pavlin-policar I agree with your solution. I also do not like that we call clustering twice for the same results, or that we pretend there is a transform function. @janezd, what do you think about Pavlin's idea?

@janezd removed their assignment May 30, 2019
@PrimozGodec added the "needs discussion" label (Core developers need to discuss the issue) May 31, 2019
PrimozGodec (Contributor, Author)

Tests are failing because the widget for Louvain clustering has not been adapted to the new clustering classes yet. I would still like to hear opinions on @pavlin-policar's idea.

lanzagar (Contributor) commented Jun 7, 2019

Main points of some additional discussion:

  • removing the compute_silhouette_score parameter from kMeans should be done by deprecating it first, not by crashing on use of the previous API
  • when wrapping sklearn methods in Orange, we copy the parameters (I guess mostly for introspection, code completion, etc.). This practice can be discussed, but I would prefer to keep it as it is in this PR and handle it separately if someone wants to propose a general change (not just for clustering methods)
  • because ClusteringModel is not really something that is often used/useful, the Clustering.__call__ method could default to the fit_predict behaviour and just return the clusters. For methods that support it (kmeans), whoever wants the model can get it by explicitly calling fit/fit_storage instead of __call__ (see the sketch below)
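
A minimal sketch of the first and last points above, with hypothetical simplified signatures (illustrative only, not the actual Orange implementation):

import warnings

class Clustering:
    def __init__(self, preprocessors=None):
        self.preprocessors = preprocessors

    def fit_storage(self, data):
        # subclasses return a ClusteringModel holding the fitted parameters
        raise NotImplementedError

    def __call__(self, data):
        # default to fit_predict behaviour: fit and return only the cluster labels
        return self.fit_storage(data).labels

class KMeans(Clustering):
    def __init__(self, n_clusters=8, compute_silhouette_score=None, preprocessors=None):
        super().__init__(preprocessors)
        self.n_clusters = n_clusters
        if compute_silhouette_score is not None:
            # keep accepting the old parameter, but deprecate it instead of crashing
            warnings.warn(
                "compute_silhouette_score is deprecated and has no effect; "
                "compute silhouettes separately (e.g. in the widget).",
                DeprecationWarning)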

@lanzagar removed their assignment Jun 7, 2019
@lanzagar changed the title from "Clustering simplified" to "[WIP] Clustering simplified" Jun 7, 2019
@PrimozGodec force-pushed the clustering branch 11 times, most recently from febfdb9 to 055d273 on June 12, 2019 10:05
PrimozGodec (Contributor, Author)

@lanzagar, can you check whether the clustering implementation is what we want? Meanwhile, I will fix lint and add some more tests for clustering.

codecov bot commented Jun 13, 2019

Codecov Report

Merging #3814 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3814      +/-   ##
==========================================
+ Coverage   84.29%   84.31%   +0.01%     
==========================================
  Files         384      385       +1     
  Lines       72749    72758       +9     
==========================================
+ Hits        61327    61343      +16     
+ Misses      11422    11415       -7

Review comments (now resolved) on Orange/clustering/clustering.py and pylintrc
@PrimozGodec removed the "needs discussion" label Jun 14, 2019
@PrimozGodec force-pushed the clustering branch 6 times, most recently from c7e9c4c to e9bcf1e on June 17, 2019 09:44
@PrimozGodec force-pushed the clustering branch 4 times, most recently from e73e590 to 86295cb on June 17, 2019 14:46
@PrimozGodec changed the title from "[WIP] Clustering simplified" to "Clustering simplified" Jun 17, 2019
PrimozGodec (Contributor, Author)

@lanzagar I think it is done now. Can you check if there is anything to correct?

Review comments (now resolved) on Orange/clustering/louvain.py, Orange/clustering/__init__.py, and Orange/tests/test_clustering_kmeans.py
@PrimozGodec changed the title from "Clustering simplified" to "[ENH] Clustering simplified" Jun 21, 2019
@PrimozGodec changed the title from "[ENH] Clustering simplified" to "[ENH] Unified clustering API" Jun 21, 2019
@lanzagar merged commit 6f3aef2 into biolab:master Jun 21, 2019
@PrimozGodec deleted the clustering branch June 21, 2019 12:39
Linked issue: Different interfaces for clustering methods (#3805) · 4 participants