
[ENH] Support sparse Jaccard #3657

Merged · 9 commits merged into biolab:master on Mar 15, 2019

Conversation

@ajdapretnar (Contributor) opened this pull request

Issue

Jaccard did not support sparse data, making it useless for text mining.

Description of changes

Custom support for sparse Jaccard.

Includes
  • Code changes
  • Tests
  • Documentation

@ajdapretnar requested a review from janezd on March 4, 2019, 14:10
@ajdapretnar (Contributor, Author) commented on Mar 4, 2019

@lanzagar @thocevar @janezd
This requires some discussion.

  1. Before, if Distances was given sparse discrete data, Jaccard did not work. This is currently fixed with a not-so-nice hack.
  2. DistMatrix cannot be tested with np.testing.assert_array_equal, because max() returns an int instead of numpy.int64. See: DistMatrix: np.testing.assert_array_equal crashes on two different matrices #3658
  3. Distance.Jaccard().fit does not work for any sparse data. How to solve this?

All in all, the code needs to be made nicer and cleaner. Any suggestions welcome.
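
For context, here is a minimal sketch of how row-wise Jaccard distances can be computed directly on a scipy sparse matrix. This only illustrates the general idea and is not the code added in this PR; the function name is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_jaccard_rows(x):
    """Pairwise Jaccard distances between the rows of a sparse binary matrix."""
    x = csr_matrix(x, dtype=bool).astype(int)
    intersection = (x @ x.T).toarray()      # |A & B| for every pair of rows
    row_nnz = x.getnnz(axis=1)              # |A| for every row
    union = row_nnz[:, None] + row_nnz[None, :] - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        dist = 1 - intersection / union
    return np.nan_to_num(dist)              # treat two all-zero rows as identical

x = csr_matrix(np.array([[1, 0, 1],
                         [1, 1, 0],
                         [0, 0, 1]]))
print(sparse_jaccard_rows(x))
```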

```diff
-issparse(data.X) and getattr(metric, "fallback", None)
+issparse(data.X) and getattr(metric, "fallback", None) and metric is not distance.Jaccard
```
Contributor:

Is there a reason for not checking metric.supports_sparse instead of metric is not distance.Jaccard?

The condition that specifically checks for Jaccard a few lines later is needed because there is no specific flag signalling whether a metric supports distances by columns. Here, we have a flag to check, unless I overlooked something.
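
A self-contained sketch of the flag-based check being suggested here; the class, helper name, and the purpose of the condition are illustrative assumptions, not Orange's actual API.

```python
from scipy.sparse import csr_matrix, issparse

class FakeJaccard:
    """Illustrative stand-in for a Distance subclass; not Orange's actual class."""
    supports_sparse = True      # the flag suggested for the check
    fallback = object()         # placeholder for a dense-only fallback implementation

def must_densify(metric, X):
    # Densify only when the metric relies on a fallback and cannot handle
    # sparse input itself, instead of naming Jaccard explicitly.
    return (issparse(X)
            and getattr(metric, "fallback", None) is not None
            and not getattr(metric, "supports_sparse", False))

X = csr_matrix([[1, 0], [0, 1]])
print(must_densify(FakeJaccard, X))   # False: a sparse-capable metric is left alone
```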

@ajdapretnar (Contributor, Author):

Jaccard is the only distance that supports discrete attributes via fallback. Other metrics fall back to sklearn's methods, which do not work with discrete attributes.

@codecov (bot) commented on Mar 12, 2019

Codecov Report

Merging #3657 into master will increase coverage by <.01%.
The diff coverage is 89.09%.

```diff
@@            Coverage Diff             @@
##           master    #3657      +/-   ##
==========================================
+ Coverage   84.31%   84.32%   +<.01%     
==========================================
  Files         370      370              
  Lines       67856    67899      +43     
==========================================
+ Hits        57214    57255      +41     
- Misses      10642    10644       +2
```

@codecov (bot) commented on Mar 12, 2019

Codecov Report

Merging #3657 into master will increase coverage by <.01%.
The diff coverage is 97.82%.

```diff
@@            Coverage Diff             @@
##           master    #3657      +/-   ##
==========================================
+ Coverage   84.43%   84.44%   +<.01%     
==========================================
  Files         372      372              
  Lines       68111    68143      +32     
==========================================
+ Hits        57511    57543      +32     
  Misses      10600    10600
```

@janezd (Contributor) commented on Mar 12, 2019

551effc changes the base class so that it can handle numpy arrays if fallback is not provided. Perhaps we can stop passing numpy arrays to fallbacks and use fallbacks only for sparse data.

fd1fd19 moves @ajdapretnar's code for sparse Jaccard to the proper class and also contains minor fixes in tests.

@ajdapretnar, you still have to compute the probabilities in the sparse-data fitter, for handling missing values when the model is used on dense data.
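
A minimal sketch of what such a sparse fit could compute, assuming per-column frequencies of non-zero values are what is needed; the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.array([[1, 0, 1],
                            [0, 0, 1],
                            [1, 1, 1]]))

counts = x.getnnz(axis=0)     # non-zero entries per column -> array([2, 1, 3])
ps = counts / x.shape[0]      # probability of a non-zero value per column
print(ps)                     # [0.6667 0.3333 1.0], usable for imputing missing values
```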

@janezd (Contributor) commented on Mar 12, 2019

Jaccard.compute_distances is now ugly. Dense data is handled in compute_distances itself, while sparse data is handled in another method. It should be symmetric: one method for dense and one for sparse, with compute_distances calling one or the other. That is, extract most of the code from compute_distances, except for the first two lines, into a separate method.

sparse_jaccard is also not a very good name: jaccard is already the class name. Maybe _compute_sparse and _compute_dense.
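
A sketch of the structure being suggested, with placeholder bodies; it mirrors the naming proposed above and is not the PR's final code.

```python
import scipy.sparse as sp

class Jaccard:
    def compute_distances(self, x1, x2=None):
        # Only the shared setup stays here; the actual work is delegated so
        # that dense and sparse inputs are handled symmetrically.
        if x2 is None:
            x2 = x1
        if sp.issparse(x1):
            return self._compute_sparse(x1, x2)
        return self._compute_dense(x1, x2)

    def _compute_sparse(self, x1, x2):
        ...  # sparse-specific Jaccard computation

    def _compute_dense(self, x1, x2):
        ...  # dense-specific Jaccard computation
```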

@janezd (Contributor) commented on Mar 13, 2019

Cosine tests failed because of wrong clipping: sklearn (correctly) clips to [0, 2], while our function for dense matrices effectively clipped to [0, 1]. This commit doesn't strictly belong in this PR, but this PR adds tests comparing distances on sparse and dense matrices, which revealed the problem, hence it's more practical to fix it here.
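
For reference: cosine distance is 1 − cos(θ), and since cos(θ) ranges over [−1, 1], the distance lies in [0, 2]; clipping to [0, 1] silently truncates distances between vectors pointing in opposite directions. A tiny illustration:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([-1.0, 0.0])       # antiparallel to a

cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dist = 1 - cos_sim              # 2.0 for antiparallel vectors
print(np.clip(dist, 0, 2))      # 2.0 -- the correct upper bound
print(np.clip(dist, 0, 1))      # 1.0 -- what clipping to [0, 1] would have reported
```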

```diff
@@ -450,7 +453,7 @@ def fit_rows(self, attributes, x, n_vals):
         frequencies of non-zero values per each column.
         """
         if sp.issparse(x):
-            ps = None  # wrong!
+            ps = x.getnnz(axis=0)
```
@ajdapretnar (Contributor, Author):

If I understand the Cython code correctly, this should be ok. @janezd

@janezd self-assigned this on Mar 15, 2019
@janezd changed the title from [WIP][RFC] Support sparse Jaccard to [RFC] Support sparse Jaccard on Mar 15, 2019
@janezd changed the title from [RFC] Support sparse Jaccard to [ENH] Support sparse Jaccard on Mar 15, 2019
@janezd merged commit 574f2c2 into biolab:master on Mar 15, 2019