Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Merge: work with sparse #2305

Merged
merged 2 commits into from
May 29, 2017
Merged

[FIX] Merge: work with sparse #2305

merged 2 commits into from
May 29, 2017

Conversation

jerneju
Copy link
Contributor

@jerneju jerneju commented May 10, 2017

Issue

Fixes #2155.

Description of changes
Includes
  • Code changes
  • Tests
  • Documentation

@jerneju jerneju changed the title Merge: works with sparse [WIP][FIX] Merge: work with sparse May 10, 2017
@codecov-io
Copy link

codecov-io commented May 10, 2017

Codecov Report

Merging #2305 into master will decrease coverage by 0.04%.
The diff coverage is 92.5%.

@@            Coverage Diff             @@
##           master    #2305      +/-   ##
==========================================
- Coverage   73.33%   73.28%   -0.05%     
==========================================
  Files         317      317              
  Lines       55447    55474      +27     
==========================================
- Hits        40662    40656       -6     
- Misses      14785    14818      +33

@jerneju
Copy link
Contributor Author

jerneju commented May 12, 2017

Waiting #2286 because hstack from util.py is needed.

@nikicc
Copy link
Contributor

nikicc commented May 18, 2017

#2286 is merged. Are there some changes still needed? Can we do something about radon?

@jerneju jerneju changed the title [WIP][FIX] Merge: work with sparse [FIX] Merge: work with sparse May 18, 2017
@jerneju
Copy link
Contributor Author

jerneju commented May 18, 2017

@nikicc : Radon does not complain anymore because this code now uses universal hstack.

Copy link
Contributor

@nikicc nikicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I concatenate two BoW data sets on the output all variables seem to have a name a. Please check.

screen shot 2017-05-19 at 09 50 57

(indices[1], arr2, right)):
known = ind != -1
if sum(known):
to_change[known] = lookup[ind[known]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not sufficient for sparse data. For dense, we initially set all values as missing and hence it's ok to set only known values. For sparse, we initially set all values as zeros. Hence we should, along setting known values, also set unknown values as nans.

tpe = object if object in (left.dtype, right.dtype) else left.dtype
left_width, right_width = left.shape[1], right.shape[1]
arr = np.full((indices.shape[1], left_width + right_width), np.nan, tpe)

sparse = sp.issparse(left) or sp.issparse(right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boolean flags should begin with is_. https://stackoverflow.com/questions/1227998/naming-conventions-what-to-name-a-boolean-variable

is_sparse = sp.issparse(left) or sp.issparse(right)
if is_sparse:
    ...

return np.full((height, width), np.nan, dtype=dtype)
else:
return sp.csr_matrix((height, width), dtype=dtype)

tpe = object if object in (left.dtype, right.dtype) else left.dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.find_common_type([left.dtype, right.dtype])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. In that case all tests fail.

if not sparse:
return np.full((height, width), np.nan, dtype=dtype)
else:
return sp.csr_matrix((height, width), dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emits a warning in the for loop below:

SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.

if not is_sparse:
known = ind != -1
if sum(known):
to_change[known] = lookup[ind[known]]
Copy link
Contributor

@kernc kernc May 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re lint, does this work?

def _join_array_by_indices(left, right, indices, string_cols=None):

    def prepare(arr, inds):
        try:
            newarr = arr[inds]
        except IndexError:
            newarr = np.full_like(arr, np.nan)
        else:
            newarr[inds == -1] = np.full(arr.shape[1], np.nan)
        return newarr

    res = hstack((prepare(left, indices[0]),
                  prepare(right, indices[1])))
    if string_cols:
        res[:, string_cols] = ""  # IDK what this does
    return  res

Copy link
Contributor Author

@jerneju jerneju May 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this does not work.

arr1[:, [sc for sc in string_cols if sc < left_width]] = ""
arr2[:, [sc - left_width for sc in string_cols if sc >= left_width]] = ""

for ind, to_change, lookup in (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works!

    def _join_array_by_indices(left, right, indices, string_cols=None):
        def prepare(arr, inds, str_cols):
            try:
                newarr = arr[inds]
            except IndexError:
                newarr = np.full_like(arr, np.nan)
            else:
                empty = np.full(arr.shape[1], np.nan)
                if str_cols:
                    assert arr.dtype == object
                    empty = empty.astype(object)
                    empty[str_cols] = ''
                newarr[inds == -1] = empty
            return newarr

        left_width = left.shape[1]
        str_left = [i for i in string_cols or () if i < left_width]
        str_right = [i - left_width for i in string_cols or () if i >= left_width]
        res = hstack((prepare(left, indices[0], str_left),
                      prepare(right, indices[1], str_right)))
        return res

@jerneju
Copy link
Contributor Author

jerneju commented May 26, 2017

Well-done, @kernc !

@ajdapretnar
Copy link
Contributor

@nikicc Could you please check and merge if this is done? :)

@nikicc nikicc merged commit 554b885 into biolab:master May 29, 2017
@jerneju jerneju deleted the sparse-merge branch May 29, 2017 10:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants