Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] PCA: Preserve f32s & reduce memory footprint when computing means #3582

Merged
merged 1 commit into from
Feb 15, 2019

Conversation

pavlin-policar
Copy link
Collaborator

Issue

While working towards getting the improved PCA merged into scikit-learn, I've found two improvements.

Description of changes
  1. np.float32 are now preserved
  2. Apparently, scipys sparse x.mean method isn't the most memory efficient, and scikit-learn's utility function mean_variance_axis is much better (see benchmarks here)
Includes
  • Code changes
  • Tests
  • Documentation

@codecov
Copy link

codecov bot commented Feb 4, 2019

Codecov Report

Merging #3582 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3582      +/-   ##
==========================================
+ Coverage   83.98%   83.98%   +<.01%     
==========================================
  Files         370      370              
  Lines       66976    66981       +5     
==========================================
+ Hits        56249    56254       +5     
  Misses      10727    10727

@pavlin-policar
Copy link
Collaborator Author

Codecov makes no sense. I added code with no tests, yet still somehow managed to improve coverage. What?

if sp.issparse(A):
means, _ = mean_variance_axis(A, axis=0)
else:
means = np.mean(A, axis=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have Orange.statistics.util.mean which is supposed to handle dense and sparse matrices, but it does not have the axis parameter (unlike most other functions in that module). Maybe you could improve that function instead and use it here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a much better idea. Unfortunately, mean_variance_axis just ignores NaNs and provides no feedback if the data had any NaNs. This means that mean wouldn't properly handle NaNs and we'd have no good way of knowing where they occurred.

However, changing nanmean to use this is completely fine since this is exactly the behaviour we want there. So I did that.

@lanzagar
Copy link
Contributor

lanzagar commented Feb 5, 2019

If you add new code with no new tests it can still be covered by existing tests, thus increasing the coverage... ;)

@pavlin-policar pavlin-policar force-pushed the randomized-pca-improvements branch 2 times, most recently from e714b48 to 65b43b2 Compare February 11, 2019 12:23
@lanzagar lanzagar merged commit 0430bae into biolab:master Feb 15, 2019
@pavlin-policar pavlin-policar deleted the randomized-pca-improvements branch February 15, 2019 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants