
[ENH] Proper calculation of distances #2454

Merged · 27 commits · Aug 31, 2017

Conversation

@janezd (Contributor) commented Jul 6, 2017

Issue

Orange currently uses the puny scikit-learn functions for computing distances, which do no normalization and handle neither nominal features nor missing values.

Description of changes
Includes
  • Code changes
  • Tests
  • Documentation

@ajdapretnar (Contributor)

I was wondering what's this year's big project after last year's trees. Looks like a logical continuation. 👍

@janezd (Contributor, Author) commented Jul 6, 2017

The code is far from done yet. You can comment on it, but at this moment I wish somebody would take a look at distances.md from this pull request (open it in an editor whose Markdown preview can render LaTeX, e.g. Haroopad, or see the attached PDF). It describes the normalization and treatment of missing values for different cases and distance types. @BlazZupan?

@nikicc, which type of sparse matrices (preferably a single one) do I need to support?

distances.pdf

@janezd (Contributor, Author) commented Jul 6, 2017

@ajdapretnar, yes, it's logical, it's the next most annoying thing in Orange after trees (and before Bayes?).

@nikicc (Contributor) commented Jul 6, 2017

@nikicc, which type of sparse matrices (preferably a single one) do I need to support?

SciPy's csr_matrix and csc_matrix are the only ones we currently use, AFAIK. I believe they support normal NumPy indexing; it's just that one is faster for column operations and the other for row operations. A while ago we decided that we make no guarantees, and whoever prefers one format to the other may change it in place on the table, so feel free to do so if it brings any speed-ups.
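A minimal sketch of the CSR/CSC trade-off described above (the matrix shape and density here are made up for illustration):

```python
import numpy as np
from scipy import sparse

# random sparse matrix in CSR layout
X = sparse.random(1000, 200, density=0.01, format="csr", random_state=0)

row = X.getrow(10)       # row slicing is efficient on CSR
X_csc = X.tocsc()        # convert once if column access dominates
col = X_csc.getcol(5)    # column slicing is efficient on CSC
```

The two formats store the same data; only the compressed axis differs, so a one-time conversion is cheap relative to repeated slicing along the slow axis.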

Also, I suggest that rather than calling NumPy directly — I spotted at least np.bincount and np.nanmean — you consider the functions from Orange/statistics/util.py. They are equivalents of NumPy functions that can also handle sparse data. E.g. bincount and nanmean are already there, but you might need to write others, like nanvar.
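A possible shape for such a sparse-aware helper — a sketch only, not Orange's actual util.py implementation; the name and exact semantics are assumptions:

```python
import numpy as np
import scipy.sparse as sp

def sparse_nanmean(x, axis=0):
    """NaN-ignoring mean along `axis` for a scipy sparse matrix.

    Implicit zeros count toward the mean as zeros; explicit NaN
    entries are excluded from both the sum and the denominator.
    (Hypothetical helper, not Orange's API.)
    """
    x = sp.csc_matrix(x)
    nan_mask = np.isnan(x.data)
    # sum of non-NaN stored values
    clean = x.copy()
    clean.data = np.where(nan_mask, 0.0, x.data)
    sums = np.asarray(clean.sum(axis=axis)).ravel()
    # number of NaNs per column/row, via a sparse indicator matrix
    counts = x.copy()
    counts.data = nan_mask.astype(np.float64)
    nan_per = np.asarray(counts.sum(axis=axis)).ravel()
    return sums / (x.shape[axis] - nan_per)
```

The point of the indicator-matrix trick is that NaNs only ever occur among stored entries, so both the sums and the NaN counts come out of cheap sparse reductions without densifying.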

@janezd force-pushed the distances branch 2 times, most recently from 87f282f to 9d6bdec on July 10, 2017 09:35
@janezd (Contributor, Author) commented Jul 10, 2017

@nikicc, what do you mean by not making any guarantees?

Do I understand correctly that the new code for calculating distances needs to support both matrix types? The thing is that the Python code only covers fitting (that is, computing the means and variances needed for handling missing values and normalization). The actual computation is in Cython. I thus have to write a separate function for each combination of matrix type × distance type × axis, which is, well, lots of functions (eight per matrix type). :)

@nikicc (Contributor) commented Jul 10, 2017

Do I understand correctly that the new code for calculating distances needs to support both matrix types?

IMO: yes. In theory, you could get either CSC or CSR, though you will mostly get a CSR.

The thing is that the Python code only covers fitting (that is, computing the means and variances needed for handling missing values and normalization). The actual computation is in Cython. I thus have to write a separate function for each combination of matrix type × distance type × axis, which is, well, lots of functions (eight per matrix type). :)

I see. But couldn't you always cast it to CSR (or CSC, whichever is faster for the given axis parameter) before passing it to Cython?
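The suggested one-time cast could look roughly like this (hypothetical helper name; the convention that axis=1 means iterating over rows is an assumption):

```python
from scipy import sparse

def as_fast_layout(x, axis):
    """Cast once before handing the matrix to the Cython kernels:
    CSR when iterating rows (axis=1), CSC when iterating columns.
    (Sketch of the reviewer's suggestion, not the PR's code.)"""
    return sparse.csr_matrix(x) if axis == 1 else sparse.csc_matrix(x)
```

With this, the Cython side only ever sees one layout per axis, so a single kernel per distance type and axis suffices instead of one per matrix format.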

@ajdapretnar (Contributor)

I can't get distances.md to render well pretty much anywhere. How can I render it properly? Where will this be shown?

@janezd (Contributor, Author) commented Jul 14, 2017

@ajdapretnar, open it in, say, Haroopad.

This file is just for core developers. Parts of the text, but not entire derivations, may be eventually included in documentation.

@janezd (Contributor, Author) commented Jul 14, 2017

Fails on Python 3.5 with

======================================================================
FAIL: test_target_with_nan (Orange.widgets.model.tests.test_owlogisticregression.TestOWLogisticRegression)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/biolab/orange3/build/travis-test/Orange/widgets/model/tests/test_owlogisticregression.py", line 113, in test_target_with_nan
    self.assertTrue(np.array_equal(coef1, coef2))
AssertionError: False is not true

It seems unrelated to my PR. @jerneju, you added this test in #2392. Could you please check how this PR interferes with it? Could it be that sklearn's implementation of logistic regression is unstable and doesn't necessarily produce the same model on the same data?

@ales-erjavec (Contributor)

That test (along with gh-2450 probably) exposed an existing error. See gh-2470.

@kernc (Contributor) left a review comment

Can we see some performance benchmarks? Like, how many times worse off is distance computation now? 😆

for row2 in range(n_rows2 if two_tables else row1):
    d = 0
    for col in range(n_cols):
        if mads[col] == -2:
Comment (Contributor):

Can we have magic names?

Reply (Contributor, Author):

What do you mean?

val1, val2 = x1[row1, col], x2[row2, col]
if npy_isnan(val1) and npy_isnan(val2):
    d += dist_missing2[col]
elif mads[col] == -1:
Comment (Contributor):

Magic + moving out of inner loop should give some performance benefit.

        d += dist_missing[col, ival1]
    elif ival1 != ival2:
        d += 1
elif normalize:
Comment (Contributor):

Can be moved out of all the loops to benefit performance.

Reply (Contributor, Author):

I considered it, but I can't without replicating the case above, which handles discrete data. Getting rid of one if out of many, at the cost of basically duplicating the function, is not worth it.

But now I see I could do that in euclidean_cols and manhattan_cols. Not sure I will, though; it's like splitting the function in two.

Comment (Contributor):

Modern microprocessors tend to have quite long pipelines so that the branch misprediction delay is between 10 and 20 clock cycles.

You might put any duplications into small helper functions which will be inlined by the compiler.
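The branch hoisting the reviewers describe can be sketched as follows — plain Python standing in for the PR's Cython, and the normalization-by-variance is an assumption, not the PR's verified formula:

```python
def _sq_dist(x1, x2):
    # inner loop with no per-column branching
    d = 0.0
    for a, b in zip(x1, x2):
        d += (a - b) ** 2
    return d

def _sq_dist_normalized(x1, x2, variances):
    # specialized variant; assumed normalization by variance
    d = 0.0
    for a, b, v in zip(x1, x2, variances):
        d += (a - b) ** 2 / v
    return d

def row_sq_distance(x1, x2, variances=None, normalize=False):
    # the flag is constant for a whole call, so test it once per
    # pair of rows instead of once per column inside the hot loop
    if normalize:
        return _sq_dist_normalized(x1, x2, variances)
    return _sq_dist(x1, x2)
```

In Cython, the two small helpers would be cdef functions the compiler can inline, so the duplication stays contained while the inner loop stays branch-free.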

@janezd (Contributor, Author) commented Jul 14, 2017

The speed depends on how and what you measure. It's trivial to be fast if all you do is calculate dot products. If we add the time needed to prepare the data on which you can then just compute the dot product, it gets way slower. This is, however, beside the point, since we could not properly handle missing and discrete data with sklearn's metrics.

If needed, we can still use sklearn for continuous data without missing values. Or handle this ourselves, as special case.
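The "just dot products" computation referred to above is the standard expansion ||a − b||² = ||a||² + ||b||² − 2·a·b, which turns pairwise Euclidean distances into one big matrix product. A dense-only, no-missing-values sketch:

```python
import numpy as np

def euclidean_pairwise(X):
    # squared row norms
    sq = np.einsum("ij,ij->i", X, X)
    # ||a||^2 + ||b||^2 - 2 a.b for every pair of rows
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
    np.maximum(d2, 0, out=d2)   # clip tiny negatives from rounding
    return np.sqrt(d2)
```

This is why a special-cased fast path for clean continuous data is plausible: the whole cost is one BLAS-backed matrix multiply.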

@kernc (Contributor) commented Jul 14, 2017

This is, however, beside the point, since we could not properly handle missing and discrete data with sklearn's metrics.

Which reminds me: some usability metrics would also be nice. Like, how much objectively better / more correct models are now possible. From the user's, not the purist's, perspective. 😺

@janezd (Contributor, Author) commented Jul 14, 2017

Like this? :)

Before: Silhouette grouped data by discrete attributes and computed distances on the continuous ones. In other words: useful for observing silhouettes of classes and clusters, but not for all-discrete or all-numeric data.

After: Silhouette can be used to actually explore data.

Before: Good luck trying to see an MDS or clustering of, say, the heart disease data without ignoring discrete attributes altogether or constructing dubious preprocessing.

After: Orange now supports the basic unsupervised data exploration that it should have always supported.

The only purist thing here is adding variances when essentially imputing means. Imputation of numeric data is not a big deal and could have been done before. Properly handling discrete data and missing discrete values is not purism.

Plus, I don't worry too much about speed. Computation of distances is probably negligible in comparison with whatever you plan to do with them.
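The discrete/missing-value scheme discussed here might be sketched like this. Names such as dist_missing mirror the tables in the reviewed Cython, but the exact formulas are my reading of distances.md, not the PR's verified API:

```python
import numpy as np

def nominal_distance(a, b, dists):
    """Row distance over nominal columns with missing values (sketch).

    dists[col] is the empirical distribution of column col's values.
    Known vs. known contributes a 0/1 mismatch; if one value is
    missing, the expected mismatch 1 - p(known value); if both are
    missing, 1 - sum_i p_i^2 (the chance two random draws differ).
    """
    d = 0.0
    for col, (va, vb) in enumerate(zip(a, b)):
        p = dists[col]
        if np.isnan(va) and np.isnan(vb):
            d += 1 - np.sum(p ** 2)     # cf. dist_missing2[col]
        elif np.isnan(va):
            d += 1 - p[int(vb)]         # cf. dist_missing[col, ival]
        elif np.isnan(vb):
            d += 1 - p[int(va)]
        elif va != vb:
            d += 1
    return d
```

With a uniform two-valued distribution, a known/missing pair contributes 0.5 and a missing/missing pair 0.5, which is the "adding variances when essentially imputing" behaviour described above, transplanted to discrete columns.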

@kernc (Contributor) commented Jul 14, 2017

Can you back the purported advantages with more objective numbers and/or figures to pass peer review?

I don't worry too much about speed.

I know. I meant really simple benchmarks like:

>>> X = np.random.random((10000, 300))
>>> from sklearn.metrics import pairwise_distances
>>> %timeit pairwise_distances(X)
2.17 s ± 40.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If the new implementation took 40 s on such a reasonable dataset, that'd be quite a hit, no?

@janezd (Contributor, Author) commented Jul 15, 2017

If the new implementation took 40 s on such a reasonable dataset, that'd be quite a hit, no?

I hate you.

In [14]: %timeit Euclidean(t)
1 loop, best of 3: 40 s per loop

You haven't tried this on my computer to know what time I would get... Or have you? :)

I'll see what I can get by optimizing the loops. In the worst case, I can remove the improved handling of missing values, but we definitely need discrete values (picture below; bees shouldn't be hairy and dolphins should). In that case, I can remove all ifs from the loops.

(screenshot: screen shot 2017-07-15 at 09 49 48)

@kernc (Contributor) commented Jul 17, 2017

Right.

This is what I get streaming zoo through Continuize, computing distances on indicator columns.
(screenshot: screenshot_2017-07-18_00-12-51)

Dolphins should have hair, but probably so should bees and houseflies and moths. And with hair being made of about the same material as nails and scales and shells, so might tortoises. Informative or not, the above plots are about 98% equivalent? Can you show unit nominal distances make another significant real-world difference? 🐱

without ignoring discrete attrs altogether or constructing dubious preprocessing

Agreed, the transformations might be idempotently auto-applied as necessary. Normalization of continuous features' values needs to be part of any sane distance computation pipeline anyway, right?
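Generic sketches of the two transformations mentioned — indicator-column continuization and normalization of continuous features; these are illustrative helpers, not Orange's actual Continuize or preprocessing classes:

```python
import numpy as np

def to_indicators(col, n_values):
    # one nominal column -> n_values 0/1 indicator columns
    out = np.zeros((col.size, n_values))
    out[np.arange(col.size), col] = 1.0
    return out

def standardize(X):
    # z-score continuous columns so no feature dominates the metric;
    # constant columns are left at zero rather than divided by zero
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma
```

Chaining the two is essentially the "zoo through Continuize" pipeline described above, just without Orange's widget machinery.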

@jerneju (Contributor) commented Jul 18, 2017

@janezd
Python 3.5.2
test_target_with_nan, which was added in #2392, reports this:

Orange/data/table.py:449: VisibleDeprecationWarning: boolean index did not match indexed array along dimension 0; dimension is 150 but corresponding boolean dimension is 145
  self.ids = np.array(source.ids[row_indices])

@codecov-io commented Jul 18, 2017

Codecov Report

Merging #2454 into master will increase coverage by 0.66%.
The diff coverage is 95.68%.

@@            Coverage Diff             @@
##           master    #2454      +/-   ##
==========================================
+ Coverage   74.68%   75.34%   +0.66%     
==========================================
  Files         320      327       +7     
  Lines       56444    57595    +1151     
==========================================
+ Hits        42157    43397    +1240     
+ Misses      14287    14198      -89

@janezd (Contributor, Author) commented Jul 18, 2017

This is what I get streaming zoo through Continuize, computing distances on indicator columns.

How?! (The fact I had to ask this already shows we're doing something wrong. :))

I reached this:

>>> %timeit pairwise_distances(X)
2.02 s ± 49.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit Euclidean(t)
2.69 s ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I still have some room for improvement, but this should already suffice. The current implementation normalizes the data, computes distances as sklearn does (I actually copied/adapted its code) and then fixes the nans due to missing data.

For Manhattan (not yet in the commit I just pushed), the new implementation is about twice as fast as sklearn's.

In [5]: %timeit pairwise_distances(X, metric="cityblock")
27.2 s ± 468 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit Manhattan(t)
16.8 s ± 508 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Euclidean has one multiplication per row × row × column, while Manhattan has a subtraction and fabs. I'll shave some time off here, too.

@janezd (Contributor, Author) commented Jul 20, 2017

Running times (in seconds) on np.random.random((10000, 300)):

            sklearn   Orange
Euclidean      2.06     2.79
Cosine         1.73     2.05
Manhattan      31.3     17.1
Jaccard        45.1     25.2



class Distance:
    def __call__(self, e1, e2=None, axis=1, impute=False):
        # Argument types in docstrings must be in a single line(?), hence
        # pylint: disable=line-too-long
Comment (Contributor):

In numpydoc, they must be on a single logical line.

@janezd (Contributor, Author) commented Jul 22, 2017

Based on #2486 to enable new tests.

@janezd force-pushed the distances branch 2 times, most recently from b02bb0e to 18c9387 on July 22, 2017 09:05
@astaric merged commit c3a7d1a into biolab:master on Aug 31, 2017
@janezd mentioned this pull request on Dec 13, 2018
@janezd deleted the distances branch on April 5, 2019 17:31