[ENH] PCA transformation speedup #1539

Merged: 3 commits merged into biolab:master on Sep 9, 2016

Conversation

@markotoplak (Member) commented Sep 2, 2016

Speedup of transformations into the PCA space on my laptop:

  • adult:
    6.07s -> 0.21s
  • an infrared dataset with shape (16384, 1608):
    (more than 40 minutes - cancelled) -> 6.69s
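
A timing along these lines reproduces the measurement (a sketch assuming the current Orange 3 API; n_components and the print format are illustrative):

```python
import time

from Orange.data import Table
from Orange.projection import PCA

# "adult" is the dataset named above; any Orange Table works here
data = Table("adult")
pca = PCA(n_components=10)(data)   # fit the PCA projection model

start = time.time()
transformed = pca(data)            # transform the data into the PCA space
print("transformation took %.2fs" % (time.time() - start))
```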

@codecov-io commented Sep 2, 2016

Current coverage is 88.30% (diff: 97.05%)

Merging #1539 into master will increase coverage by 0.02%

```
@@             master      #1539   diff @@
==========================================
  Files            77         77
  Lines          7629       7654    +25
  Methods           0          0
  Messages          0          0
  Branches          0          0
==========================================
+ Hits           6735       6759    +24
- Misses          894        895     +1
  Partials          0          0
```


Powered by Codecov. Last update ea0173f...935303a

```diff
@@ -28,3 +28,27 @@ def scale(values, min=0, max=1):
     if ptp == 0:
         return np.clip(values, min, max)
     return (-minval + values) / ptp * (max - min) + min
+
+
+class ComputeValueWCommon:
```
Contributor:
Does not belong here. Put it in pca.py, where it's used.

@markotoplak (Member Author):
This will be used for other transformations. It belongs in some base Orange classes, as it is referenced in table.py.

Contributor:
I don't know. Will it be used for other transformations as part of this PR? Utils are tools, like screwdrivers and wrenches. This doesn't look like a tool, rather like a very specific piece of plug which I, unrelatedly, happen to not appreciate at all. 😃

@markotoplak (Member Author) commented Sep 5, 2016:
No, not as part of this PR, but I'll definitely use it a lot in an add-on. This class tries to at least partially solve the problem where many features are based on the same computation. The same thing frequently occurs in text mining; @nikicc was questioning it once, too.

Or do you perhaps suggest putting this in table.py or variable.py?
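
For readers following along, the idea behind the new class is roughly the following (a condensed sketch; only the name ComputeValueWCommon comes from the diff above, the method names and signatures are illustrative):

```python
class ComputeValueWCommon:
    """Sketch of a compute_value with a shared (common) part.

    compute_shared does the expensive, feature-independent work once
    per table; each feature then derives its own column from that
    shared result instead of recomputing it.
    """
    def __init__(self, compute_shared):
        self.compute_shared = compute_shared

    def __call__(self, data, shared_data=None):
        # the caller may pass in a precomputed shared result; otherwise
        # it is computed here (once per feature, i.e. the slow path)
        if shared_data is None:
            shared_data = self.compute_shared(data)
        return self.compute(data, shared_data)

    def compute(self, data, shared_data):
        # subclasses extract their feature's values from shared_data
        raise NotImplementedError
```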

@kernc (Contributor) commented Sep 3, 2016

While I enjoy the performance increase, I don't like how the change introduces considerable branching on the consumer end.

> The key to performance is elegance, not battalions of special cases.
>
> — Jon Bentley and Doug McIlroy

```python
def __call__(self, data):
    if data.domain != self.projection.pre_domain:
        data = data.from_table(self.projection.pre_domain, data)
    return self.projection.transform(data.X)[:, self.feature]
```
Contributor:
The result would be pretty much the same if you just somehow intelligently memoized this `self.projection.transform()` call?

@markotoplak (Member Author) commented Sep 4, 2016:

This could of course be memoized and shared between different compute_value functions for the same domain, but to keep memory use down, that cache would have to be invalidated after the transformation. That is why I exposed the common part to the transformer, which can destroy the intermediate results once the transformation is done.
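
The pattern being described might look roughly like this on the consumer ("transformer") side (illustrative only; transform_columns and the isinstance dispatch are not code from this PR):

```python
def transform_columns(variables, data):
    """Sketch: compute each shared part once, reuse it for every
    column that needs it, then drop the intermediate results."""
    shared_cache = {}
    columns = []
    for var in variables:
        cv = var.compute_value
        if isinstance(cv, ComputeValueWCommon):   # class sketched earlier
            key = id(cv.compute_shared)
            if key not in shared_cache:
                shared_cache[key] = cv.compute_shared(data)
            columns.append(cv(data, shared_cache[key]))
        else:
            columns.append(cv(data))
    shared_cache.clear()   # intermediate results die with the transform
    return columns
```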

@kernc (Contributor) commented Sep 5, 2016:

I don't currently have a better idea. 😕 I just strongly ~~feel~~ know this isn't something a user of the projector should worry about.

@markotoplak (Member Author):

My line of reasoning: I would like to clear the cache after the transformation. The only thing that knows when the transformation ends is its user, so the user has to be aware of it.

An alternative solution would be to call something like clear_cache() on all the variables after the transformation, but this has the same awareness drawback while being less direct. Also, theoretically, such caches would have problems with concurrent transforms...
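
The rejected alternative would be something like this hypothetical sketch (MemoizedComputeValue and clear_cache are illustrative names, not code from this PR):

```python
class MemoizedComputeValue:
    """Sketch of the memoizing alternative: the cache lives inside the
    compute_value, so someone must remember to invalidate it."""
    def __init__(self, compute_shared):
        self.compute_shared = compute_shared
        self._cache = {}                     # id(table) -> shared result

    def __call__(self, data):
        key = id(data)
        if key not in self._cache:
            self._cache[key] = self.compute_shared(data)
        return self._cache[key]

    def clear_cache(self):
        # the caller still has to know when the transformation ends
        # (the same awareness drawback), and concurrent transforms
        # would share and clear a single cache
        self._cache.clear()
```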


```
Parameters
----------
compute_shared: a callable that takes an Orange.data.Table
```
Member:

`Callable[[Orange.data.Table], object]`, per PEP 484.
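
Under that suggestion the annotated signature might read (a sketch; the parameter name comes from the docstring above, the rest is illustrative):

```python
from typing import Callable

from Orange.data import Table


class ComputeValueWCommon:
    # annotation per the PEP 484 suggestion above
    def __init__(self, compute_shared: Callable[[Table], object]) -> None:
        self.compute_shared = compute_shared
```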

@astaric astaric merged commit d464b58 into biolab:master Sep 9, 2016