Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Variable.make #3925

Merged
merged 18 commits into from
Sep 20, 2019
Merged

Remove Variable.make #3925

merged 18 commits into from
Sep 20, 2019

Conversation

janezd
Copy link
Contributor

@janezd janezd commented Jul 5, 2019

Implements #3705. Probably also #3521, #3949, #3049.

Includes
  • Code changes
  • Tests
  • Documentation

@janezd
Copy link
Contributor Author

janezd commented Jul 6, 2019

This passes rather lots of tests without significant modifications in code (except in tests that do ugly things, like renaming variables).

Variable.name is not read-only property because it's used in Variable.__hash__.

I think we should deprecate Domain.index and Domain.__contains__ because they are not well-defined.

@janezd janezd force-pushed the remove-variable-make-2 branch 2 times, most recently from f023d17 to 5283234 Compare July 7, 2019 12:43
@janezd
Copy link
Contributor Author

janezd commented Jul 8, 2019

@lanzagar, can you take a look? There are still things to smooth out (like removing all calls of Variable.make, which are currently forwarded to normal constructors), but in principle this seems to work. Comments from others very welcome, too, of course.

@lanzagar
Copy link
Contributor

lanzagar commented Jul 8, 2019

I can look at it on Thursday/Friday.

As for Domain.__contains__ - I think it is ok the way it works / is defined now: checks if a variable is (anywhere) in the domain... What was/is a big inconsistency is that it does not match __iter__, but that is already deprecated and I am very much in favour of changing it - one can already iterate over attributes, class vars, variables (atts + clsvars), or metas in an explicit and well-defined way.
__iter__ could be removed altogether, but it might be useful to have it iterate over all vars in the domain in some agreed upon and defined order? In any case index should almost certainly match __iter__ to make sense, and not abuse negative indices, which confuses everyone that knows python.

@janezd
Copy link
Contributor Author

janezd commented Jul 8, 2019

As for Domain.contains - I think it is ok the way it works / is defined now: checks if a variable is (anywhere) in the domain...

The problem is that two variables are now equal if they are "compatible". As a consequence, var in domain will no longer be equivalent to any(var == var0 for var0 in domain). I don't think Domain.contains` is very useful anyway.

@markotoplak
Copy link
Member

markotoplak commented Jul 10, 2019

I like this approach that now all discrete variable with the same name (without compute_value) are equal and then DomainConversion creates a mapping as a compute_value-type function. I have no counterexamples yet: I do not see a reason why it couldn't work, but I am not sure it will either. :)

@lanzagar lanzagar self-assigned this Jul 19, 2019
@lanzagar
Copy link
Contributor

lanzagar commented Sep 6, 2019

Some bugs will probably need to be ironed out, but I think this is good conceptually.
The first one to investigate, which produces a crash in the current version is:
File (iris) -> Logistic Regression -> Predictions <- File (iris12),
where iris12 is a file with only 2 class values from iris, e.g. first 100 instances

  File "/home/lan/dev/orange3/Orange/widgets/evaluate/owpredictions.py", line 612, in predict
    return self.predict_discrete(predictor, data)
  File "/home/lan/dev/orange3/Orange/widgets/evaluate/owpredictions.py", line 618, in predict_discrete
    return predictor(data, Model.ValueProbs)
  File "/home/lan/dev/orange3/Orange/classification/base_classification.py", line 47, in __call__
    probs_ext[:, c, cv] = probs[:, c, i]
IndexError: index 2 is out of bounds for axis 2 with size 2

@codecov
Copy link

codecov bot commented Sep 6, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@2fc7ca5). Click here to learn what that means.
The diff coverage is 87.03%.

@@            Coverage Diff            @@
##             master    #3925   +/-   ##
=========================================
  Coverage          ?   84.66%           
=========================================
  Files             ?      371           
  Lines             ?    64892           
  Branches          ?        0           
=========================================
  Hits              ?    54940           
  Misses            ?     9952           
  Partials          ?        0

@codecov
Copy link

codecov bot commented Sep 6, 2019

Codecov Report

Merging #3925 into master will decrease coverage by 0.01%.
The diff coverage is 94.49%.

@@            Coverage Diff            @@
##           master   #3925      +/-   ##
=========================================
- Coverage   85.42%   85.4%   -0.02%     
=========================================
  Files         385     385              
  Lines       68873   68871       -2     
=========================================
- Hits        58832   58821      -11     
- Misses      10041   10050       +9

@janezd
Copy link
Contributor Author

janezd commented Sep 7, 2019

@lanzagar: the problem was caused by inconsistency in SklModelClassification, which overrode __call__ instead of predict and so my new code was called from the wrong place. These are good news: the concept is still sound, it's broken by existing bad code. (I'm not pointing fingers, its author was here for three weeks when writing it. :)

More good news: the schema I used (logistic regression trained on two-class Iris from Select Rows is tested on three-class directly from File) does not work on master branch. Here it now works, after the last fixes.

@janezd janezd force-pushed the remove-variable-make-2 branch 4 times, most recently from f2f9288 to 7c45427 Compare September 8, 2019 20:23
@janezd janezd changed the title [WIP] [RFC] Remove Variable.make Remove Variable.make Sep 8, 2019
@janezd
Copy link
Contributor Author

janezd commented Sep 10, 2019

@lanzagar, @markotoplak, now would be a good time to check and merge -- if OK. This will surely introduce some bugs, we must use it for some time before releasing it.

"47 changed files" looks scary, but most are just fixes in tests -- removal of clear_all_caches and giving names to test variables with empty names.

@markotoplak
Copy link
Member

I tried this with tests from Spectroscopy. Your changes do not break anything significant. Great work!

What worries me that now tests of spectroscopy took 302 sec. On master they finish in 61 seconds. These tests are pretty heavy on data processing and classification where the test data has to be transformed. I'll try to narrow this down a bit.

@@ -254,6 +258,13 @@ def empty(self):
"""True if the domain has no variables of any kind"""
return not self.variables and not self.metas

def _get_equivalent(self, var):
if isinstance(var, Variable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this:

import time
import Orange

data = Orange.data.Table([list(range(5000))])

t = time.time()
pdata = Orange.preprocess.SklImpute()(data)
print("first transform", time.time()-t, "s")

t = time.time()
data.transform(pdata.domain)
print("Table.tranform", time.time()-t, "s")

On master I get

first transform 0.6564505100250244 s
Table.tranform 0.5023694038391113 s

but on this branch

first transform 11.536584615707397 s
Table.tranform 0.5088396072387695 s

My profiler led me here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in the "Removal of Variable.make: _get_equivalent in O(1)" commit.

@markotoplak
Copy link
Member

I pushed a commit that makes _get_equivalent faster. Feel free to make it prettier. Performance issues now seem to be resolved.

return hasattr(other, "master") and self.master is other.master
return type(self) is type(other) \
and self.name == other.name \
and self._compute_value is other._compute_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get self._compute_value == other._compute_value so that compute_values could define their own __eq__?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not. I changed it.

@janezd
Copy link
Contributor Author

janezd commented Sep 12, 2019

Thanks for fixing _get_equivalent. My version needed O(n^2) for transforming a domain with n variables!

@markotoplak
Copy link
Member

I very much approve this PR. Thanks a lot for this, Janez.

@markotoplak
Copy link
Member

markotoplak commented Sep 18, 2019

@janezd, I agree that removing backmapping does not belong to this PR.

Forward mapping is necessary for models to work, but back mapping is not. We only need it when we compute accuracy - it therefore seems to belong there. But yes, a potential compatibility breaking change and perhaps best handled separately.

@janezd janezd changed the title [RFC] Remove Variable.make Remove Variable.make Sep 20, 2019
@lanzagar lanzagar merged commit 6c81328 into biolab:master Sep 20, 2019
@janezd janezd mentioned this pull request Feb 10, 2020
ales-erjavec added a commit to ales-erjavec/orange3 that referenced this pull request Nov 11, 2020
`ReplaceUnknownsModel` did not work on input dataset with `class_vars`
since biolabgh-3925 due to error raised by backmapping.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants