Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coding matrix/vector variables in tidy form #61

Open
statscats opened this issue Feb 18, 2020 · 15 comments
Open

Coding matrix/vector variables in tidy form #61

statscats opened this issue Feb 18, 2020 · 15 comments
Labels
feature New feature or request

Comments

@statscats
Copy link

When variables are vector/matrix they are stored in a tibble named as a[i,j].

@jgabry and I were talking about whether there was some utility of creating a convenience function that splits the indices in long form i.e.,

param idx1 idx2
a i j

It would be useful for getting things in the right form for Bayesplot.

@mjskay
Copy link
Collaborator

mjskay commented Feb 18, 2020

Unless I'm misunderstanding that is pretty close to what spread_draws / gather_draws are for in tidybayes (amongst other things).

@lauken13
Copy link

Sorry for the late reply - I apparently had logged onto a different Github account.

Yeah that looks perfect! Do you think it's worth wrapping it to jump from the matrix form to this in one step?

@mjskay
Copy link
Collaborator

mjskay commented Feb 21, 2020

Yeah! Long term I'd like to replace tidybayes::tidy_draws() with (or have it be a wrapper around) posterior::as_draws_df(). The other functions in tidybayes use tidy_draws() internally, so once that change happens those other functions should work on any of the formats supported by posterior.

A related question: is there particular functionality in gather/spread_draws that could be abstracted out or which it might be desirable to have exposed in a form that is easier to program against? Currently those functions take in specs for what transformation you want in a non-standard evaluation form, convert them internally to a simpler spec format, then do all the ugly data munging using those. This makes for a nice UI for interactive use but might be a pain to program against. If there's a need/desire for a more programmable variant that shouldn't be a hard change.

The second question is if some of that moves to posterior or not. This might be related to bigger questions of "what to do with tidybayes" --- some time back I mentioned to Jonah that I was contemplating splitting off the geoms from tidybayes into a separate package (maybe "ggdist"). It has evolved from a package just about munging posteriors into a sort of two-headed beast with a bunch of uncertainty vis geoms and stats.

@paul-buerkner
Copy link
Collaborator

One way of differentiating between tidybayes and posterior functions could be whether data or ploting is involved in addition to posterior draws. So whatever only handles draws could (potentially) go into posterior, but anything including data (such as add_predicted_draws etc.) or plotting would be out of scope. Not sure if this is a reasonable threshold though and I don't have a strong opinion here.

@lauken13
Copy link

lauken13 commented Feb 25, 2020

Yeah definitely don't want to feature overlap here!

Do you mean that this is functionality that would be useful for me to work on integrating @mjskay?

@mjskay
Copy link
Collaborator

mjskay commented Feb 26, 2020

One way of differentiating between tidybayes and posterior functions could be whether data or ploting is involved in addition to posterior draws. So whatever only handles draws could (potentially) go into posterior, but anything including data (such as add_predicted_draws etc.) or plotting would be out of scope. Not sure if this is a reasonable threshold though and I don't have a strong opinion here.

Hmm, that's a criterion for splitting the functionality I hadn't considered. My working theory for how to split things was based on whether the functionality is specifically designed for tidy data frames of draws or not, which would suggest to me that the spread_draws/gather_draws stuff might live in tidybayes --- unless there's an analog to the kind of output from those functions in the other formats. I suppose it's a little bit on the line between the two worlds though.

@lauken13 I'd be happy for you to work on this! Personally I am likely to have zero time for intense coding between now and late Spring due to teaching, but could definitely provide advice and whatnot. Would it be helpful to have a chat about what is needed and where it should live?

@paul-buerkner
Copy link
Collaborator

@mjskay this is a good point and likely better way to differentiate the packages. I am happy to be involved in a chat if needed.

@paul-buerkner paul-buerkner added the feature New feature or request label Mar 24, 2020
@paul-buerkner
Copy link
Collaborator

Is there anything to be done for this issue still? I think time has basically told quite well what should be in posterior and what should be in tidybayes. @mjskay do you agree?

@mjskay
Copy link
Collaborator

mjskay commented Mar 26, 2021

I agree the split between posterior / tidybayes is clear now.

Re: the original purpose of this issue (array reshaping utility functions), I wonder if @statscats @lauken13 @jgabry have thoughts about what might be useful based on the new rvar stuff. There is some internal code in posterior now for going between flat and multidimensional array formats used by rvar. So my question would be: are transformations done by rvar/as_draws_rvars sufficient for the needs in bayesplot, or would it be useful to expose some lower level functions for doing those transformations on base arrays?

@jsocolar
Copy link
Contributor

@mjskay here's my attempt to factor out the dimension parsing stuff. Mind having a look whenever you get a chance?

I've tried to return an object that will be versatile enough for whatever people might need it for, including if Stan ever supports ragged arrays, etc.

get_indices <- function(variables){
  vars_indices <- strsplit(variables, "(\\[|\\])")
  vars <- sapply(vars_indices, `[[`, 1)
  var_names <- unique(vars)
  
  var <- var_names[6]
  
  indices_list <- lapply(var_names, function (var) {
    var_i <- vars == var
    # reset class here as otherwise the draws arrays in the output rvars
    # have type draws_matrix, which makes inspecting them hard
    var_length <- sum(var_i)
    
    if (var_length == 1) {
      # single variable, no indices
      out <- 1
      dimnames(out) <- NULL
    } else {
      # variable with indices => we need to reshape the array
      # basically, we're going to do a bunch of work up front to figure out
      # a single array slice that does most of the work for us.
      
      # first, pull out the list of indices into a data frame
      # where each column is an index variable
      indices <- sapply(vars_indices[var_i], `[[`, 2)
      indices <- as.data.frame(do.call(rbind, strsplit(indices, ",")),
                               stringsAsFactors = FALSE)
      
      unique_indices <- vector("list", length(indices))
      .dimnames <- vector("list", length(indices))
      names(unique_indices) <- names(indices)
      for (i in seq_along(indices)) {
        numeric_index <- suppressWarnings(as.numeric(indices[[i]]))
        if (!anyNA(numeric_index) && rlang::is_integerish(numeric_index)) {
          # for integer indices, we need to convert them to integers
          # so that we can sort them in numerical order (not string order)
          if (min(numeric_index) >= 1) {
            # integer indices >= 1 are forced to lower bound of 1 + no dimnames
            indices[[i]] <- as.integer(numeric_index)
            unique_indices[[i]] <- seq.int(1, max(numeric_index))
          } else {
            # indices with values < 1 are sorted but otherwise left as-is, and will create dimnames
            indices[[i]] <- numeric_index
            unique_indices[[i]] <- sort(unique(numeric_index))
            .dimnames[[i]] <- unique_indices[[i]]
          }
        } else {
          # we convert non-numeric indices to factors so that we can force them
          # to be ordered as they appear in the data (rather than in alphabetical order)
          factor_levels <- unique(indices[[i]])
          indices[[i]] <- factor(indices[[i]], levels = factor_levels)
          # these aren't sorted so they appear in original order
          unique_indices[[i]] <- factor(factor_levels, levels = factor_levels)
          .dimnames[[i]] <- unique_indices[[i]]
        }
      }
      
      # sort indices and fill in missing indices as NA to ensure
      # (1) even if the order of the variables is something weird (like 
      # x[2,2] comes before x[1,1]) the result
      # places those columns in the correct cells of the array
      # (2) if some combination of indices is missing (say x[2,1] isn't
      # in the input) that cell in the array gets an NA
      
      # Use expand.grid to get all cells in output array. We reverse indices
      # here because it helps us do the sort after the merge, where
      # we need to sort in reverse order of the indices (because
      # the value of the last index should move slowest)
      all_indices <- expand.grid(rev(unique_indices))
      # merge with all.x = TRUE (left join) to fill in missing cells with NA
      indices <- merge(all_indices, cbind(indices, index = seq_len(nrow(indices))),
                       all.x = TRUE, sort = FALSE)
      # need to do the sort manually after merge because when sort = TRUE, merge
      # sorts factors as if they were strings, and we need factors to be sorted as factors
      indices <- indices[do.call(order, as.list(indices[, -ncol(indices), drop = FALSE])),]
    }
    indices
  })
  indices_list
}

@mjskay
Copy link
Collaborator

mjskay commented May 21, 2021

Thanks for starting this!

get_indices() is a little generic. Maybe parse_variable_indices()?

Also, unless I've missed something the .dimnames information is not returned anywhere. This is needed inside as_draws_rvars to rename non-numeric indices.

More generally, I think that the returned format is probably too specific to the original needs of as_draws_rvars, and for a more generic helper function should be revisited a bit. E.g. current output is:

> get_indices(c("x[0]","x[1]","mu[1]", "mu[2]","mu[5]","sigma[a,b]","sigma[b,b]"))
[[1]]
  V1 index
1  0     1
2  1     2

[[2]]
  V1 index
1  1     1
2  2     2
4  3    NA
5  4    NA
3  5     3

[[3]]
  V2 V1 index
1  b  a     1
2  b  b     2

Some thoughts about this format:

  1. Each entry in the top-level list should probably be named according to the base name of the variable
  2. The VX column names are meaningless and could be dropped
  3. It's awkward to work with the data frame of indices when you don't want the index column in there (as you found in your code when you had to write a get_dims function on top of this), maybe each element of the main list would be better if it were itself a list containing a few items, including a data frame (or matrix) with the VX columns, an index element, and a dimnames element (so that that can be returned as well). I can't remember if there are other useful bits of information parsed out of this that would be good to return (maybe a dim element as well?). This makes the output more generically useful and less tailored specifically to my original use case.
  4. The index column should probably be made to be indexed relative to the original list and not the subset specific to that variable name prefix. I believe that replacing the definition of index with which(var_i) should accomplish that (but haven't tested this).
  5. This function currently fails for variables without indices (e.g. get_indices("x")), but it probably should not (and I realize now that it also fails on things like get_indices("x[1]") and should not either).

@jsocolar
Copy link
Contributor

jsocolar commented May 24, 2021

Hey @mjskay,
The current behavior for get_indices(c("x[-1,1]", "x[3,3]")) is:

> get_indices(c("x[-1,1]", "x[3,3]"))
[[1]]
  V2 V1 index
1  1 -1     1
5  1  3    NA
3  2 -1    NA
6  2  3    NA
4  3 -1    NA
2  3  3     2

Note that it treats the columns independently in the sense that, despite that the first column is not one-indexed (and therefore doesn't get filled in with the full sequence of integers from one to max), it still fills in the second column with integers from one to max.

Likewise, if one column is non-numeric and the other column is integer, it treats the second column like integers rather than factor labels.

What do you think the general-purpose behavior should be here? Does it matter?

@mjskay
Copy link
Collaborator

mjskay commented May 24, 2021

Note that it treats the columns independently in the sense that, despite that the first column is not one-indexed (and therefore doesn't get filled in with the full sequence of integers from one to max), it still fills in the second column with integers from one to max.

Likewise, if one column is non-numeric and the other column is integer, it treats the second column like integers rather than factor labels.

What do you think the general-purpose behavior should be here? Does it matter?

Good questions. I think that it should treat columns independently --- as a user I would expect from "x[foo,1]" and "x[foo,3]" that a "x[foo,2]" should exist.

You're right that the behavior of integer indices < 1 feels inconsistent. Probably it does make sense to fill those in with NAs as well. In any case I expect this to be a rare occurrence, but that the most likely situation it would happen is if someone is importing from something that is 0-indexed, in which case filling in missing indices would be the right behavior, so we should probably just do it. If we don't have a test for that we should add it.

@jsocolar
Copy link
Contributor

Cool. When you get a chance, check out the new parse_variable_indices() over on #152. I think this should do everything we want; hope the code's not too clumsy.

@mjskay
Copy link
Collaborator

mjskay commented May 24, 2021

Awesome, thanks! Will do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants