Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory use in stats() for dtype=object #5702

Merged
merged 1 commit into from
Dec 3, 2021

Conversation

markotoplak
Copy link
Member

Issue

Before, .astype("str") was applied to the input array. This created an output array of fixed-length strings of the size of the longest string in the table. Converting numpy object arrays to string arrays should be, therefore, avoided.

This bug stems from #3722, which fixed #3671, but introduced this issue. :)

ORANGE3-311 on Sentry.

Description of changes

A combination of pandas.isnull and direct comparisons is used instead.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov
Copy link

codecov bot commented Nov 23, 2021

Codecov Report

Merging #5702 (5dc49d1) into master (3aae878) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5702   +/-   ##
=======================================
  Coverage   86.12%   86.12%           
=======================================
  Files         315      315           
  Lines       66074    66074           
=======================================
+ Hits        56905    56906    +1     
+ Misses       9169     9168    -1     

@markotoplak
Copy link
Member Author

The failed tests are unrelated.

For a dramatic effect, from Sentry ORANGE3-311:

  File "itemdelegates.py", line 573, in paint
    ratio = self.barFillRatioData(index)
  File "itemdelegates.py", line 539, in barFillRatioData
    return cast_(float, self.cachedData(index, self.barFillRatioRole))
  File "itemdelegates.py", line 291, in cachedData
    return self.__cache.data(index, role)
  File "itemdelegates.py", line 139, in data
    data[role] = model.data(index, role)
  File "owtable.py", line 78, in data
    dist = super().data(index, TableModel.VariableStatsRole)
  File "itemmodels.py", line 1000, in data
    return self._stats_for_column(col)
  File "itemmodels.py", line 1078, in _stats_for_column
    self.__stats = datacaching.getCached(
  File "datacaching.py", line 15, in getCached
    info[funct] = res = funct(*params, **kwparams)
  File "basic_stats.py", line 40, in __init__
    data._compute_basic_stats(include_metas=include_metas)]
  File "table.py", line 1634, in _compute_basic_stats
    rr.append(fast_stats(self.metas, W))
  File "util.py", line 375, in stats
    X_str = X.astype(str)
  numpy.core._exceptions._ArrayMemoryError:  Unable to allocate 109. GiB for an array with shape (47323, 19) and data type <U32502

Before, .astype("str") was applied to the input array. This created
output array of fixed-length strings of the size of the longest string
in the table. Converting numpy object arrays to string arrays should
be, therefore, avoided.
@markotoplak
Copy link
Member Author

Why were empty strings used to denote unknown values in object arrays in Orange? Does anyone remember why did we decide that? Who relies on that?

Would it be hard to change? Why not just use None as an unknown value? See https://pandas.pydata.org/docs/reference/api/pandas.isnull.html

@PrimozGodec PrimozGodec merged commit 2234de6 into biolab:master Dec 3, 2021
@PrimozGodec
Copy link
Contributor

Why were empty strings used to denote unknown values in object arrays in Orange? Does anyone remember why did we decide that? Who relies on that?

Would it be hard to change? Why not just use None as an unknown value? See https://pandas.pydata.org/docs/reference/api/pandas.isnull.html

I was also surprised about that when I discovered that some time ago. As I understand they are used just for string variables. Actually, I think that we should use np.nan as it is used for other data types. The empty string can mean that we have information that is not unknown is just an empty string (maybe it is not useful in practice but it can happen), currently unknown is the same as an empty string.

@markotoplak markotoplak deleted the fix-stats-str branch February 28, 2022 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Orange.statistics.util.py: improper computation of missing meta values
2 participants