Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Distances: Optimize PearsonR/SpearmanR #2852

Merged

Conversation

ales-erjavec
Copy link
Contributor

@ales-erjavec ales-erjavec commented Jan 5, 2018

Issue
  • PearsonR is poorly implemented by a double for loop in python.
  • SpearmanR quadruples it's workload only to throw most of it away.
Description of changes
  • Use numpy.corrcoef for PearsonR
  • Implement a variants of corrcoef/spearmanr which compute correlations between two arrays more efficiently (without also computing all correlations within the two arrays).
Includes
  • Code changes
  • Tests
  • Documentation

@ales-erjavec ales-erjavec force-pushed the distances-correlations-optimization branch from 0e915a6 to 1d5c34f Compare January 11, 2018 08:52
@codecov-io
Copy link

codecov-io commented Jan 11, 2018

Codecov Report

Merging #2852 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2852      +/-   ##
==========================================
+ Coverage   81.91%   81.92%   +0.01%     
==========================================
  Files         326      326              
  Lines       55997    56031      +34     
==========================================
+ Hits        45868    45903      +35     
+ Misses      10129    10128       -1

* Use numpy.corrcoef in PearsonR
* Optimize PearsonR/SpearmanR when computing pairwise distances on
  a single input table
... for the case where computing distances from two tables.
@ales-erjavec ales-erjavec force-pushed the distances-correlations-optimization branch from 1d5c34f to f837920 Compare January 16, 2018 13:25
@lanzagar lanzagar added this to the 3.10 milestone Feb 7, 2018
@thocevar thocevar self-assigned this Feb 23, 2018
@lanzagar lanzagar modified the milestones: 3.10, 3.11 Feb 23, 2018
Copy link
Contributor

@thocevar thocevar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok. Do you have any measurements of how much we gain with this re-implementation of what we previously handed over to numpy and scipy?

rho = rho[:2, :2].copy()
else:
# scalar if n1 == 1
rho = stats.spearmanr(x1, axis=self.axis)[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these two cases (if, else) necessary? At first glance stats.spearmanr seems to (efficiently) handle the case of a missing second attribute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind.

@ales-erjavec
Copy link
Contributor Author

In [1]: import Orange, numpy

In [2]: d = Orange.data.Table(numpy.random.random(size=(200, 200)))

In [3]: %timeit Orange.distance.PearsonR(d)

Before

1.59 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After

614 µs ± 84.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [1]: import Orange, numpy

In [2]: d = Orange.data.Table(numpy.random.random(size=(400, 200)))

In [3]: %timeit Orange.distance.SpearmanR(d[1:], d[:-1])

Before

201 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After

47.9 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@thocevar thocevar merged commit b09b282 into biolab:master Feb 26, 2018
@ales-erjavec ales-erjavec deleted the distances-correlations-optimization branch February 26, 2018 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants