Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Distances: prevent inf numbers #2380

Merged
merged 2 commits into from
Jun 30, 2017
Merged

[FIX] Distances: prevent inf numbers #2380

merged 2 commits into from
Jun 30, 2017

Conversation

jerneju
Copy link
Contributor

@jerneju jerneju commented Jun 6, 2017

Issue

screenshot_20170606_112022

---------------------------- ValueError Exception -----------------------------
Traceback (most recent call last):
File "pyqtgraph/graphicsItems/AxisItem.py", line 525, in paint specs = self.generateDrawSpecs(painter)
File "pyqtgraph/graphicsItems/AxisItem.py", line 819, in generateDrawSpecs tickLevels = self.tickValues(self.range[0], self.range[1], lengthInPixels)
File "pyqtgraph/graphicsItems/AxisItem.py", line 691, in tickValues num = int((maxVal-start) / spacing) + 1
ValueError: cannot convert float NaN to integer

Description of changes
Includes
  • Code changes
  • Tests
  • Documentation

@jerneju
Copy link
Contributor Author

jerneju commented Jun 6, 2017

@lanzagar lanzagar self-assigned this Jun 9, 2017
@codecov-io
Copy link

codecov-io commented Jun 9, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@942877a). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #2380   +/-   ##
=========================================
  Coverage          ?   74.25%           
=========================================
  Files             ?      320           
  Lines             ?    55836           
  Branches          ?        0           
=========================================
  Hits              ?    41459           
  Misses            ?    14377           
  Partials          ?        0

@jerneju
Copy link
Contributor Author

jerneju commented Jun 9, 2017

Rebased.

@@ -160,6 +161,9 @@ def checks(metric, data):
except MemoryError:
self.Error.distances_memory_error()
return

if np.isinf(met).any():
met = np.nan_to_num(met)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the conditional, this would make one linear pass instead of three.

@@ -160,6 +161,8 @@ def checks(metric, data):
except MemoryError:
self.Error.distances_memory_error()
return

met = np.nan_to_num(met)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This is just a quick local patch, not a solution.

It reminds me of a guy (whom we all know and) who - years ago - programmed some machine learning algorithm. He occasionally got probabilities above 1, so he fixed the problem by adding

if (p > 1) {
    p = 1;
}

If we get an error report, we shouldn't find the simplest way to just fix it somewhere. If some distances are infinite, they are infinite. We can't just set them to 42 and pretend everything is cool. If we believe that distance matrix should never contain infinite values, we should raise an exception. If we decide that infinite values can arise and are legal, other functions/widgets should be able to handle those.

In this case, I'd say that infinite values are allowed and hierarchical clustering should do something about them. Replacing them with large values (perhaps len(data) * np.nanmax(data)) could make sense within the hierarchical clustering. Or it may be even more complex since it depends upon linkage.

So: if there are infinite distances, we need to think what would we like the clustering to do with them, not what is the easiest way to get rid of them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps len(data) * np.nanmax(data)

If you opt for this suggestion, please, just don't use np.nanmax(data). We should never use np.nanmax(data) since it doesn't work on sparse data! We have our own implementation of nanmax in Orange/statistics/util.py.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to write that while I totally agree with @janezd, most solutions for inf distances will probably be equivalent to substitution with large numbers. In which case maybe we should consider if we want every widget to duplicate this substitution.
But there is always the possibility that somewhere inf numbers would be illegal (and we would want a warning). And even worse - this substitution will probably not allow downstream methods to ignore the problem, since you still cannot multiply the new "large values", or add two of them etc. which could happen in average linkage and so on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@janezd, Distance Transformation widget could have a check box where user could select to impute infinite distances. @lanzagar suggested something like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds more like clipping. With lower and upper range?

@jerneju jerneju changed the title [FIX] Distances: prevent inf numbers [WIP][FIX] Distances: prevent inf numbers Jun 12, 2017
@lanzagar lanzagar removed their assignment Jun 13, 2017
@@ -763,6 +763,9 @@ class Outputs:
cluster_roles = ["Attribute", "Class variable", "Meta variable"]
basic_annotations = ["None", "Enumeration"]

class Error(widget.OWWidget.Error):
not_finite_distances = Msg("Not all distances are finite")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrase: Some distances are infinite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why some error messages have dot at the end and some have not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, we have to decide. @BlazZupan, @astaric, @lanzagar, @ajdapretnar? I don't have a clear preference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for dots in sentences (verb present).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. No verb, no dot. Verb, dot. Period. :)

However, some error messages are not sentences, like "Out of memory" or "Error during optimization". What about those?

Message without verb bad, message with verb good? Makes them longer, though. Sometimes long not good, when widget narrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, I would keep those dot-less. Short, too.

@@ -970,6 +974,9 @@ def set_distances(self, matrix):

self.matrix = matrix
if matrix is not None:
if not np.all(np.isfinite(matrix)):
self.Error.not_finite_distances()
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have overlooked something, but I think this is wrong: it aborts the function without reseting the items (and related models). Note that the check above sets matrix to None. In line with this, you should probably have something like:

    if matrix is not None:
        if not np.all(np.isfinite(matrix)):
            self.Error.not_finite_distances()
            matrix = None

    self.matrix = matrix
    if matrix is not None:
        self._set_items(matrix.row_items, matrix.axis)
    else:
        self._set_items(None)

The line self.matrix = matrix is moved from above to the place after the check for infinite values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, fixed.

if matrix is not None:
N, _ = matrix.shape
if N < 2:
self.error("Empty distance matrix")
matrix = None
if not np.all(np.isfinite(matrix)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not like that. If it comes here, matrix can be set to None in the line above. Another if matrix is not None is needed here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Ok now.

@@ -1082,6 +1090,8 @@ def _update_labels(self):
self.labels.setMinimumWidth(1 if labels else -1)

def _invalidate_clustering(self):
if self.matrix is not None and not np.all(np.isfinite(self.matrix)):
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, this check may be redundant since np.all(np.isfinite(self.matrix)) is already checked above and self.matrix is set to None in this case. Hence, the situation in which this conjunction is true cannot arise, I guess.

Second -- if this condition can be fulfilled -- is skipping these calls really OK?

  • _update() clears the plot and then performs further cleanup if self.matrix is None. I guess this must be done if the widget receives invalid data.
  • _update_labels() checks for self.root (which is set to None in _update()) and works correctly when matrix is None.
  • _invalidate_output() has to be called pass None to outputs.

I believe that these two lines have to be removed. But please check whether there are any other side-effects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are now removed.

@jerneju jerneju changed the title [WIP][FIX] Distances: prevent inf numbers [FIX] Distances: prevent inf numbers Jun 28, 2017
if matrix is not None:
N, _ = matrix.shape
if N < 2:
self.error("Empty distance matrix")
matrix = None
if not np.all(np.isfinite(matrix)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Ok now.

@janezd janezd merged commit 47647c2 into biolab:master Jun 30, 2017
@jerneju jerneju deleted the inf-distances branch June 30, 2017 09:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants