[ENH] Silhouette Plot: Add cosine distance #3176

Merged: 8 commits into biolab:master on Aug 6, 2018

Conversation

lanzagar
Contributor

Added cosine distance to Silhouette Plot.
Handles NaN values in the computed distance matrix (e.g. for all-zero vectors under cosine distance) by omitting the affected instances and showing a warning.

Includes
  • Code changes
  • Tests
  • Documentation
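
The NaN handling described above can be illustrated with a small, dependency-free sketch. This is not the widget's actual code: `cosine_distance` and `omit_invalid` are hypothetical helpers, and the rule "omit an instance when all its distances to other instances are NaN" is an assumption made for the example.

```python
import math
import warnings


def cosine_distance(u, v):
    """Return 1 - cos(u, v), or NaN when either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return math.nan
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def omit_invalid(rows):
    """Return (kept indices, distance matrix restricted to kept rows)."""
    n = len(rows)
    dist = [[0.0 if i == j else cosine_distance(rows[i], rows[j])
             for j in range(n)] for i in range(n)]
    # Assumed rule: an instance is invalid when it has no valid distance
    # to any other instance (true for an all-zero vector under cosine).
    keep = [i for i in range(n)
            if not all(math.isnan(dist[i][j]) for j in range(n) if j != i)]
    if len(keep) < n:
        warnings.warn(f"{n - len(keep)} instances omitted (no valid distances)")
    return keep, [[dist[i][j] for j in keep] for i in keep]


rows = [(0, 0), (0, 1), (1, 1), (0, 1)]  # the first row is all-zero
keep, reduced = omit_invalid(rows)
print(keep)
```

Dropping the all-zero row also removes the NaN column it contributes, so the reduced matrix contains only valid distances.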

@lanzagar lanzagar changed the title Silhouette Plot: Add cosine distance [ENH] Silhouette Plot: Add cosine distance Jul 31, 2018
@codecov-io

codecov-io commented Jul 31, 2018

Codecov Report

Merging #3176 into master will increase coverage by 0.16%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3176      +/-   ##
==========================================
+ Coverage   82.48%   82.64%   +0.16%     
==========================================
  Files         336      342       +6     
  Lines       58338    59016     +678     
==========================================
+ Hits        48118    48774     +656     
- Misses      10220    10242      +22

@ales-erjavec
Contributor

When Cosine distance is selected, can you check the input domain to ensure it has no discrete columns?

Either show an error and stop, or show a warning and drop them from the domain before computing the distance.

The way that Cosine treats discrete columns means that it implicitly depends on the variable reuse™, meaning it can produce different results depending on the history and order of loaded data.

For instance, using these two files:
discrete-confound-a.tab.txt
discrete-confound-b.tab.txt

$ cat discrete-confound-a.tab
A	B	C
d	d	d
		class
a1	b1	+
a1	b2	+
a3	b3	-
a1	b2	-
a2	b3	+
a3	b4	-
$ cat discrete-confound-b.tab
A	B	C
d	d	d
		class
a0	a1	+
a1	a2	+
a3	a3	-
a1	b2	-
a2	b3	+
a3	b4	-

compare

import Orange
# Load file a on its own and print its pairwise Cosine distances.
A = Orange.data.Table("discrete-confound-a.tab")
print(Orange.distance.Cosine(A).round(3))

which prints:

[[0.      nan   nan   nan   nan   nan]
 [  nan 0.    0.293 0.    0.293 0.293]
 [  nan 0.293 0.    0.293 0.    0.   ]
 [  nan 0.    0.293 0.    0.293 0.293]
 [  nan 0.293 0.    0.293 0.    0.   ]
 [  nan 0.293 0.    0.293 0.    0.   ]]

and

import Orange
# Load file b first, so its variables already exist when file a is read.
B = Orange.data.Table("discrete-confound-b.tab")
A = Orange.data.Table("discrete-confound-a.tab")
print(Orange.distance.Cosine(A).round(3))

that produces

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
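
To make the order dependence concrete, here is a minimal sketch of a "first value vs. everything else" encoding applied to columns A and B of discrete-confound-a.tab. The helper `clip_encode` and both sets of value lists are assumptions made for illustration (in particular the sorted order and the reused lists); this is not Orange's implementation.

```python
def clip_encode(column, values):
    """Map each entry to 0 if it equals the first listed value, else 1."""
    first = values[0]
    return [0 if entry == first else 1 for entry in column]


# Columns A and B of discrete-confound-a.tab, as listed above.
col_a = ["a1", "a1", "a3", "a1", "a2", "a3"]
col_b = ["b1", "b2", "b3", "b2", "b3", "b4"]

# Value lists built from file a alone (sorted order is an assumption).
own = list(zip(clip_encode(col_a, ["a1", "a2", "a3"]),
               clip_encode(col_b, ["b1", "b2", "b3", "b4"])))

# Hypothetical value lists inherited from file b, whose first values
# ("a0" and "a1") never occur in the corresponding column of file a.
reused = list(zip(
    clip_encode(col_a, ["a0", "a1", "a2", "a3"]),
    clip_encode(col_b, ["a1", "a2", "a3", "b2", "b3", "b4", "b1"])))

print(own)
print(reused)
```

Under these assumptions the first row of `own` is all-zero, which is consistent with the NaN row in the first matrix, while every row of `reused` is identical, which is consistent with the all-zero second matrix.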

@lanzagar
Contributor Author

lanzagar commented Aug 1, 2018

> The way that Cosine treats discrete columns means that it implicitly depends on the variable reuse™, meaning it can produce different results depending on the history and order of loaded data.

Not just that, it seems to me that the way Cosine currently treats discrete columns is just plain wrong. It basically only differentiates between the first value and all other values (which it treats as equal)?
I am not sure whether this is wanted in some circumstances, or was just a first approach to make it work that was never redone. Maybe @janezd remembers something about this?
What I find strange is that it explicitly implements discrete_to_indicators to do this instead of, e.g., simply calling Continuize on the data. Maybe due to performance concerns? Anyway, we should probably either change supports_discrete to False or switch to one-hot encoding (or something even better?).
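
As a rough illustration of the one-hot alternative mentioned above, the sketch below encodes a single discrete column as indicator vectors and checks that the pairwise dot products (which fully determine cosine distance) do not change when the value list is reordered or padded. `one_hot` and `gram` are hypothetical helpers, not Orange functions, and the second value list is an assumed example of a reused variable.

```python
def one_hot(column, values):
    """Encode each entry as an indicator vector over `values`."""
    return [[1 if entry == v else 0 for v in values] for entry in column]


def gram(vectors):
    """Pairwise dot products; cosine distance depends only on these."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors]
            for u in vectors]


col_a = ["a1", "a1", "a3", "a1", "a2", "a3"]

enc_own = one_hot(col_a, ["a1", "a2", "a3"])
# Hypothetical reused value list with an extra, unused first value.
enc_reused = one_hot(col_a, ["a0", "a1", "a2", "a3"])

# Same dot products regardless of which value list was used.
print(gram(enc_own) == gram(enc_reused))
```

Unlike the first-vs-rest clipping, distinct values such as a2 and a3 map to orthogonal vectors here, and an unused extra value only adds an all-zero column.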

Currently discrete features are only clipped (i.e. first value vs. all
other values), which can give misleading results. Until discrete features
are handled better, Cosine should report that it does not support them,
so that a warning is displayed when it is used.
@lanzagar
Contributor Author

lanzagar commented Aug 3, 2018

I have changed Cosine to no longer advertise support for categorical features until this is resolved. This affects the Distances widget too, since everything above applies there as well. It now shows a warning and ignores categorical features for Cosine distance.
I added the same warning in Silhouette Plot for metrics that do not support categorical features.
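
The "warn and ignore" behaviour described above might look roughly like the following sketch. The column representation, the `usable_columns` helper, and the warning text are all assumptions for illustration and do not mirror the widget code.

```python
import warnings


def usable_columns(columns, supports_discrete):
    """Drop discrete columns (with a warning) if the metric cannot use them.

    `columns` is a list of (name, kind) pairs, where kind is
    "continuous" or "discrete".
    """
    if supports_discrete:
        return list(columns)
    kept = [c for c in columns if c[1] != "discrete"]
    if len(kept) < len(columns):
        warnings.warn("Ignoring categorical features for this metric.")
    return kept


cols = [("x", "continuous"), ("A", "discrete"), ("y", "continuous")]
print(usable_columns(cols, supports_discrete=False))
```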

@lanzagar lanzagar added this to the 3.15 milestone Aug 3, 2018
@ajdapretnar ajdapretnar merged commit e47dc90 into biolab:master Aug 6, 2018
@lanzagar lanzagar deleted the silhouette-distances branch March 14, 2022 14:42
4 participants