[ENH] PCA: Preserve f32s & reduce memory footprint when computing means #3582

pavlin-policar · 2019-02-04T22:41:31Z

Issue

While working towards getting the improved PCA merged into scikit-learn, I've found two improvements.

Description of changes

np.float32 dtypes are now preserved
Apparently, scipy's sparse x.mean method isn't very memory-efficient, and scikit-learn's utility function mean_variance_axis is much better (see benchmarks here)
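
To illustrate the first point, here is a minimal NumPy sketch of how float32 data can end up as float64 during centering. It is not the code from this PR; the array `X` and the variable names are made up for the example.

```python
import numpy as np

# Hypothetical data: 4 samples, 3 features, stored as float32.
X = np.arange(12, dtype=np.float32).reshape(4, 3)

# np.mean keeps the floating-point dtype of its input.
means32 = np.mean(X, axis=0)

# If the means are held as float64 instead, subtracting them upcasts the result.
means64 = means32.astype(np.float64)

centered32 = X - means32   # stays float32
centered64 = X - means64   # becomes float64, twice the memory

print(means32.dtype, centered32.dtype, centered64.dtype)
print(centered32.nbytes, centered64.nbytes)
```

Keeping every intermediate array in the input dtype is what avoids the doubled footprint in the second case.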

Includes

Code changes
Tests
Documentation

codecov · 2019-02-04T22:52:49Z

Codecov Report

Merging #3582 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3582      +/-   ##
==========================================
+ Coverage   83.98%   83.98%   +<.01%     
==========================================
  Files         370      370              
  Lines       66976    66981       +5     
==========================================
+ Hits        56249    56254       +5     
  Misses      10727    10727

pavlin-policar · 2019-02-04T22:54:53Z

Codecov makes no sense. I added code with no tests, yet still somehow managed to improve coverage. What?

lanzagar · 2019-02-05T09:47:22Z

Orange/projection/pca.py

+    if sp.issparse(A):
+        means, _ = mean_variance_axis(A, axis=0)
+    else:
+        means = np.mean(A, axis=0)


We already have Orange.statistics.util.mean which is supposed to handle dense and sparse matrices, but it does not have the axis parameter (unlike most other functions in that module). Maybe you could improve that function instead and use it here?
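
As a rough sketch of what an axis-aware mean for dense and sparse input could look like: the helper name `mean_with_axis` is hypothetical, this is not the actual Orange.statistics.util.mean, and it uses scipy's sparse sum as a simple stand-in rather than mean_variance_axis.

```python
import numpy as np
import scipy.sparse as sp


def mean_with_axis(x, axis=None):
    """Mean of a dense array or scipy sparse matrix (hypothetical helper)."""
    if not sp.issparse(x):
        return np.mean(x, axis=axis)
    if axis is None:
        # Implicit zeros count towards the number of elements.
        return x.sum() / (x.shape[0] * x.shape[1])
    # The per-axis sum is a small dense result; the sparse matrix itself
    # is not converted to a dense array. Flatten the result to 1-D.
    sums = np.asarray(x.sum(axis=axis)).ravel()
    return sums / x.shape[axis]


dense = np.array([[1.0, 0.0, 3.0], [0.0, 0.0, 5.0]], dtype=np.float32)
sparse = sp.csr_matrix(dense)

print(mean_with_axis(dense, axis=0), mean_with_axis(sparse, axis=0))
print(mean_with_axis(dense), mean_with_axis(sparse))
```

Note that this version does nothing special about NaNs: a NaN in either representation propagates into the result.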

pavlin-policar · 2019-02-10

That's a much better idea. Unfortunately, mean_variance_axis silently ignores NaNs and gives no indication that the data contained any. If mean used it, mean would no longer handle NaNs properly, and we'd have no good way of knowing where they occurred.

However, changing nanmean to use this is completely fine since this is exactly the behaviour we want there. So I did that.
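
To make the "ignore NaNs" behaviour concrete, here is an illustrative sketch of NaN-skipping column means on a sparse matrix. It is written from scratch with NumPy and SciPy, is independent of scikit-learn's implementation and of the actual nanmean in Orange, and the function name is an assumption. Unlike the PR, it does not try to preserve float32 (np.bincount accumulates in float64).

```python
import numpy as np
import scipy.sparse as sp


def sparse_nanmean_columns(x):
    """Column means of a sparse matrix, skipping NaNs (illustrative only)."""
    x = sp.csc_matrix(x)
    n_rows, n_cols = x.shape
    # Column index of every explicitly stored value.
    cols = np.repeat(np.arange(n_cols), np.diff(x.indptr))
    is_nan = np.isnan(x.data)
    # Sum the non-NaN stored values per column; implicit zeros add nothing.
    sums = np.bincount(cols[~is_nan], weights=x.data[~is_nan], minlength=n_cols)
    # NaNs shrink the denominator; implicit zeros still count as observations.
    nan_counts = np.bincount(cols[is_nan], minlength=n_cols)
    with np.errstate(invalid="ignore"):
        # A column made up entirely of NaNs yields NaN (0 / 0).
        return sums / (n_rows - nan_counts)


dense = np.array([[1.0, np.nan, 0.0],
                  [3.0, 2.0, 0.0],
                  [np.nan, 4.0, 0.0]])
print(sparse_nanmean_columns(sp.csr_matrix(dense)))
print(np.nanmean(dense, axis=0))
```

The `nan_counts` array is exactly the feedback a plain mean would need in order to report where NaNs occurred, which is why silently dropping them suits nanmean but not mean.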

lanzagar · 2019-02-05T09:49:48Z

If you add new code with no new tests it can still be covered by existing tests, thus increasing the coverage... ;)

lanzagar reviewed Feb 5, 2019


pavlin-policar force-pushed the randomized-pca-improvements branch 2 times, most recently from e714b48 to 65b43b2, February 11, 2019 12:23

PCA: Preserve f32s & reduce memory footprint when computing means

d974ef4

pavlin-policar force-pushed the randomized-pca-improvements branch from 65b43b2 to d974ef4, February 11, 2019 12:55

janezd assigned lanzagar Feb 14, 2019

lanzagar merged commit 0430bae into biolab:master Feb 15, 2019

pavlin-policar deleted the randomized-pca-improvements branch February 15, 2019 09:23
