Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Neighbors: improve exclusion of references, checkbox to (un)limit output data #4997

Merged

Conversation

borondics
Copy link
Member

Issue

OWNeighbors was limited to output distances only a subset of the input data. This change enables outputting distances to the full dataset.

Description of changes

For Orange Spectroscopy we would like to be able to compare the full dataset to a selected reference instance and visualize the distances to it.

TODO:

The widget needs some work in my opinion, right now, every instance that is in an arbitrarily defined 1E-5 distance from the reference is excluded from the output. This should be handled dynamically based on the values of the data instances.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov
Copy link

codecov bot commented Sep 22, 2020

Codecov Report

Merging #4997 into master will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4997      +/-   ##
==========================================
- Coverage   84.62%   84.62%   -0.01%     
==========================================
  Files         285      285              
  Lines       59427    59430       +3     
==========================================
+ Hits        50288    50290       +2     
- Misses       9139     9140       +1     

@janezd
Copy link
Contributor

janezd commented Sep 24, 2020

If the user wants to have all instances, (s)he now has to limit it to a high-enough number. Wouldn't it be more elegant to just put a checkbox in front of the spin?

As for 1e-5, what about this: if there are some instances that are (almost) equal to the reference instance, we would expect the distribution of distances to have a small group of values below 1e-5 (I think this is still reasonably above numeric errors) and then nothing for a long time (but perhaps not all the way to 1e-5).

On the other hand, we may not need to go through all this: values are normalized, hence 1e-5 is already dynamic in a sense. Have you encountered cases when it removed instances that you'd prefer kept?

I wouldn't expose 1e-5 in the user interface. Beside it looking to technical and taking too much space, the number is meaningless to the user because variables are normalized. I guess.

@janezd janezd added the needs discussion Core developers need to discuss the issue label Sep 24, 2020
@borondics
Copy link
Member Author

borondics commented Sep 24, 2020

I did put a checkbox in front of the spin using the checked='spin_checked' option of the gui object. I also set the max to some ridiculously high number so that there wouldn't be any problems with manual inputs. This could be calculated or maybe the opbject takes None as a parameter, I haven't checked.

What if the values of the instances are very small? I don't think they should be automatically normalized, this should be an option similar to PCA. In that case, if the user selected no normalization (which makes sense for spectroscopy) there could be cases (not only for spectroscopy datasets) where all values are below 1E5 and thus the difference as well, while they would be actually significantly different instances.

The Jaccard distance always (? needs more testing) gives me the "all instances are the same" message for spectroscopy data. I haven't looked into the actual calculation though. I put 1E-5 to the GUI hoping that the hardcoded comparison will be removed in the new version of the calculation.

@janezd
Copy link
Contributor

janezd commented Sep 25, 2020

I did put a checkbox in front of the spin

I must be blind. Sorry. .)

The ridiculously high number could be replaced (there's no reason not to...), though it won't change anything.

Jaccard is a different kind of distance, not applicable to spectroscopy. It computes differences between sets.

Discussion results:

  • The checkbox for removal of similar instances was a workaround for a bug. We just want to exclude the reference example if it appears in the data. The bug has been fixed, so the plan now is to remove the checkbox entirely. If the user wants to remove instances that are too similar, (s)he can use "Select Rows".
  • For normalization, there should be a checkbox similar to the one in Distances.

Would this be OK?

Would you continue working on this or should somebody else take over?

@janezd janezd removed the needs discussion Core developers need to discuss the issue label Sep 25, 2020
@borondics
Copy link
Member Author

borondics commented Sep 25, 2020

@janezd, thanks for the update!

I'd be happy with any solution as long as we can output the full incoming dataset. For the normalization, it would be great to have it as an option.

If not too slow, I think you should take over because I just saw @BlazZupan just reported something in #5000 that I'm not sure I can fix. I started working on it because a user requested the feature, so there is a time factor. :)

@markotoplak markotoplak self-assigned this Sep 29, 2020
@janezd janezd assigned janezd and unassigned markotoplak Oct 2, 2020
@janezd janezd force-pushed the OWNeighbors_output_all_distances branch 4 times, most recently from 3c8b04a to 4c0cbcc Compare October 2, 2020 13:13
@janezd janezd changed the title Enable OWNeighbors output all distances [ENH] Neighbors: improve exclusion of references, checkbox to (un)limit output data Oct 2, 2020
- Adjust the maximal number of neighbours to the data size
- Enable OWNeighbors to output all distances relative to reference
- Exclude references based on ids, not distances
- Remove the checkbox for excluding references; exclude always
- Removed unused setting
@janezd janezd force-pushed the OWNeighbors_output_all_distances branch from 4c0cbcc to ee6591e Compare October 2, 2020 13:17
@janezd janezd removed their assignment Oct 2, 2020
@markotoplak
Copy link
Member

@janezd, this looks good. The only question is whether we need to keep backward compatibility for this one. As this is more like a bugfix (and the changes would be slight), I'd say there is no need to make a "compatibility mode". What do you think?

@janezd
Copy link
Contributor

janezd commented Oct 5, 2020

I think this is too trivial for complications with "compatibility mode". It would be just dead, garbage code.

@markotoplak markotoplak merged commit 7d04ab1 into biolab:master Oct 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants