Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Naive Bayes is confounded by variable reuse. #2943

Closed
ales-erjavec opened this issue Mar 6, 2018 · 3 comments
Closed

Naive Bayes is confounded by variable reuse. #2943

ales-erjavec opened this issue Mar 6, 2018 · 3 comments
Labels
bug A bug confirmed by the core team needs discussion Core developers need to discuss the issue

Comments

@ales-erjavec
Copy link
Contributor

Orange version

3.11.dev, 3.10, 3.9, 3.8 ...

Expected behavior

Can run learner performance evaluation sequentially in the same session on different datasets and get predictable deterministic results (not withstanding explicit randomness sources).

Actual behavior

Cannot run Test & Score on anneal.tab and glass.tab (both included in the distribution) sequentially without one affecting the other.

A by product of the Variable reuse.

Also related gh-2500

Steps to reproduce the behavior

screen shot 2018-03-06 at 09 53 28

In a fresh session load the anneal.tab in the File widget. Observe and record the reported AUC (10 fold cross validation, stratified, averaged over all classes); reload as many times as necessary to establish that the AUC is stable: 0.979.

Load glass.tab and then anneal.tab again. The AUC for Naive Bayes is different: 0.844.

Presumably the problem is in variable reuse. The first time anneal.tab is loaded y has values 1, 2, 3, 5, U the second time 1, 2, 3, 5, U, 6, 7 (the values from glass.tab#y are merged).

This is not a problem in general for skl-learners that use LabelEncoder to encode targets (and ignore the values not actually present in the target vector), but NaiveBayes uses the full y domain for Laplace estimation of class probabilities.

import Orange
anneal = Orange.data.Table("anneal")
l1 = Orange.classification.NaiveBayesLearner()(anneal)
print(l1.class_prob)
_ = Orange.data.Table("glass")
l2 = Orange.classification.NaiveBayesLearner()(anneal)
print(l2.class_prob)

In Orange 2 at least there was on option to disable the variable reuse. Never the less, the default should be no reuse.

Additional info (worksheets, data, screenshots, ...)
@janezd janezd added bug A bug confirmed by the core team needs discussion Core developers need to discuss the issue labels Jan 24, 2019
@lanzagar
Copy link
Contributor

lanzagar commented Feb 1, 2019

Variable reuse can be problematic in unexpected places and should be carefully (re)considered.

But until something is changed on that front (might take some time?) we could/should change naive bayes to not use all variable values, but just those found in training data (as all other learners?).
Looks like class_freq is obtained with get_contingency so it might be as easy as filtering out class values with 0 representation after that is computed? Might need a fix in Model as well to not crash on unseen values?

@janezd
Copy link
Contributor

janezd commented Feb 1, 2019

+1 for solution for NB.

Is it true that variable reuse is needed only by the file widget, and all other uses may be problematic? We removed the option for reuse because users didn't understand it. What if we re-add it, but with a clear long explanation (perhaps visible only when relevant). Something like "This data seems to have the same variables as the data currently loaded by other data widget." And then a checkbox to confirm?

This setting would appear/disappear and be synchronized in all simultaneously opened file widgets. How ugly is that?

@janezd
Copy link
Contributor

janezd commented Feb 15, 2019

Fixed in #3575

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug confirmed by the core team needs discussion Core developers need to discuss the issue
Projects
None yet
Development

No branches or pull requests

4 participants