Coding matrix/vector variables in tidy form #61

statscats · 2020-02-18T20:00:50Z

When variables are vector/matrix they are stored in a tibble named as a[i,j].

@jgabry and I were talking about whether there was some utility of creating a convenience function that splits the indices in long form i.e.,

param idx1 idx2
a i j

It would be useful for getting things in the right form for Bayesplot.

mjskay · 2020-02-18T21:59:35Z

Unless I'm misunderstanding that is pretty close to what spread_draws / gather_draws are for in tidybayes (amongst other things).

lauken13 · 2020-02-21T02:02:45Z

Sorry for the late reply - I apparently had logged onto a different Github account.

Yeah that looks perfect! Do you think it's worth wrapping it to jump from the matrix form to this in one step?

mjskay · 2020-02-21T03:21:22Z

Yeah! Long term I'd like to replace tidybayes::tidy_draws() with (or have it be a wrapper around) posterior::as_draws_df(). The other functions in tidybayes use tidy_draws() internally, so once that change happens those other functions should work on any of the formats supported by posterior.

A related question: is there particular functionality in gather/spread_draws that could be abstracted out or which it might be desirable to have exposed in a form that is easier to program against? Currently those functions take in specs for what transformation you want in a non-standard evaluation form, convert them internally to a simpler spec format, then do all the ugly data munging using those. This makes for a nice UI for interactive use but might be a pain to program against. If there's a need/desire for a more programmable variant that shouldn't be a hard change.

The second question is if some of that moves to posterior or not. This might be related to bigger questions of "what to do with tidybayes" --- some time back I mentioned to Jonah that I was contemplating splitting off the geoms from tidybayes into a separate package (maybe "ggdist"). It has evolved from a package just about munging posteriors into a sort of two-headed beast with a bunch of uncertainty vis geoms and stats.

paul-buerkner · 2020-02-21T12:20:06Z

One way of differentiating between tidybayes and posterior functions could be whether data or ploting is involved in addition to posterior draws. So whatever only handles draws could (potentially) go into posterior, but anything including data (such as add_predicted_draws etc.) or plotting would be out of scope. Not sure if this is a reasonable threshold though and I don't have a strong opinion here.

lauken13 · 2020-02-25T20:03:26Z

Yeah definitely don't want to feature overlap here!

Do you mean that this is functionality that would be useful for me to work on integrating @mjskay?

mjskay · 2020-02-26T03:14:34Z

One way of differentiating between tidybayes and posterior functions could be whether data or ploting is involved in addition to posterior draws. So whatever only handles draws could (potentially) go into posterior, but anything including data (such as add_predicted_draws etc.) or plotting would be out of scope. Not sure if this is a reasonable threshold though and I don't have a strong opinion here.

Hmm, that's a criterion for splitting the functionality I hadn't considered. My working theory for how to split things was based on whether the functionality is specifically designed for tidy data frames of draws or not, which would suggest to me that the spread_draws/gather_draws stuff might live in tidybayes --- unless there's an analog to the kind of output from those functions in the other formats. I suppose it's a little bit on the line between the two worlds though.

@lauken13 I'd be happy for you to work on this! Personally I am likely to have zero time for intense coding between now and late Spring due to teaching, but could definitely provide advice and whatnot. Would it be helpful to have a chat about what is needed and where it should live?

paul-buerkner · 2020-02-26T11:22:57Z

@mjskay this is a good point and likely better way to differentiate the packages. I am happy to be involved in a chat if needed.

paul-buerkner · 2021-03-26T14:23:21Z

Is there anything to be done for this issue still? I think time has basically told quite well what should be in posterior and what should be in tidybayes. @mjskay do you agree?

mjskay · 2021-03-26T15:32:48Z

I agree the split between posterior / tidybayes is clear now.

Re: the original purpose of this issue (array reshaping utility functions), I wonder if @statscats @lauken13 @jgabry have thoughts about what might be useful based on the new rvar stuff. There is some internal code in posterior now for going between flat and multidimensional array formats used by rvar. So my question would be: are transformations done by rvar/as_draws_rvars sufficient for the needs in bayesplot, or would it be useful to expose some lower level functions for doing those transformations on base arrays?

jsocolar · 2021-05-21T19:10:58Z

@mjskay here's my attempt to factor out the dimension parsing stuff. Mind having a look whenever you get a chance?

I've tried to return an object that will be versatile enough for whatever people might need it for, including if Stan ever supports ragged arrays, etc.

get_indices <- function(variables){
  vars_indices <- strsplit(variables, "(\\[|\\])")
  vars <- sapply(vars_indices, `[[`, 1)
  var_names <- unique(vars)
  
  var <- var_names[6]
  
  indices_list <- lapply(var_names, function (var) {
    var_i <- vars == var
    # reset class here as otherwise the draws arrays in the output rvars
    # have type draws_matrix, which makes inspecting them hard
    var_length <- sum(var_i)
    
    if (var_length == 1) {
      # single variable, no indices
      out <- 1
      dimnames(out) <- NULL
    } else {
      # variable with indices => we need to reshape the array
      # basically, we're going to do a bunch of work up front to figure out
      # a single array slice that does most of the work for us.
      
      # first, pull out the list of indices into a data frame
      # where each column is an index variable
      indices <- sapply(vars_indices[var_i], `[[`, 2)
      indices <- as.data.frame(do.call(rbind, strsplit(indices, ",")),
                               stringsAsFactors = FALSE)
      
      unique_indices <- vector("list", length(indices))
      .dimnames <- vector("list", length(indices))
      names(unique_indices) <- names(indices)
      for (i in seq_along(indices)) {
        numeric_index <- suppressWarnings(as.numeric(indices[[i]]))
        if (!anyNA(numeric_index) && rlang::is_integerish(numeric_index)) {
          # for integer indices, we need to convert them to integers
          # so that we can sort them in numerical order (not string order)
          if (min(numeric_index) >= 1) {
            # integer indices >= 1 are forced to lower bound of 1 + no dimnames
            indices[[i]] <- as.integer(numeric_index)
            unique_indices[[i]] <- seq.int(1, max(numeric_index))
          } else {
            # indices with values < 1 are sorted but otherwise left as-is, and will create dimnames
            indices[[i]] <- numeric_index
            unique_indices[[i]] <- sort(unique(numeric_index))
            .dimnames[[i]] <- unique_indices[[i]]
          }
        } else {
          # we convert non-numeric indices to factors so that we can force them
          # to be ordered as they appear in the data (rather than in alphabetical order)
          factor_levels <- unique(indices[[i]])
          indices[[i]] <- factor(indices[[i]], levels = factor_levels)
          # these aren't sorted so they appear in original order
          unique_indices[[i]] <- factor(factor_levels, levels = factor_levels)
          .dimnames[[i]] <- unique_indices[[i]]
        }
      }
      
      # sort indices and fill in missing indices as NA to ensure
      # (1) even if the order of the variables is something weird (like 
      # x[2,2] comes before x[1,1]) the result
      # places those columns in the correct cells of the array
      # (2) if some combination of indices is missing (say x[2,1] isn't
      # in the input) that cell in the array gets an NA
      
      # Use expand.grid to get all cells in output array. We reverse indices
      # here because it helps us do the sort after the merge, where
      # we need to sort in reverse order of the indices (because
      # the value of the last index should move slowest)
      all_indices <- expand.grid(rev(unique_indices))
      # merge with all.x = TRUE (left join) to fill in missing cells with NA
      indices <- merge(all_indices, cbind(indices, index = seq_len(nrow(indices))),
                       all.x = TRUE, sort = FALSE)
      # need to do the sort manually after merge because when sort = TRUE, merge
      # sorts factors as if they were strings, and we need factors to be sorted as factors
      indices <- indices[do.call(order, as.list(indices[, -ncol(indices), drop = FALSE])),]
    }
    indices
  })
  indices_list
}

mjskay · 2021-05-21T22:38:41Z

Thanks for starting this!

get_indices() is a little generic. Maybe parse_variable_indices()?

Also, unless I've missed something the .dimnames information is not returned anywhere. This is needed inside as_draws_rvars to rename non-numeric indices.

More generally, I think that the returned format is probably too specific to the original needs of as_draws_rvars, and for a more generic helper function should be revisited a bit. E.g. current output is:

> get_indices(c("x[0]","x[1]","mu[1]", "mu[2]","mu[5]","sigma[a,b]","sigma[b,b]"))
[[1]]
  V1 index
1  0     1
2  1     2

[[2]]
  V1 index
1  1     1
2  2     2
4  3    NA
5  4    NA
3  5     3

[[3]]
  V2 V1 index
1  b  a     1
2  b  b     2

Some thoughts about this format:

Each entry in the top-level list should probably be named according to the base name of the variable
The VX column names are meaningless and could be dropped
It's awkward to work with the data frame of indices when you don't want the index column in there (as you found in your code when you had to write a get_dims function on top of this), maybe each element of the main list would be better if it were itself a list containing a few items, including a data frame (or matrix) with the VX columns, an index element, and a dimnames element (so that that can be returned as well). I can't remember if there are other useful bits of information parsed out of this that would be good to return (maybe a dim element as well?). This makes the output more generically useful and less tailored specifically to my original use case.
The index column should probably be made to be indexed relative to the original list and not the subset specific to that variable name prefix. I believe that replacing the definition of index with which(var_i) should accomplish that (but haven't tested this).
This function currently fails for variables without indices (e.g. get_indices("x")), but it probably should not (and I realize now that it also fails on things like get_indices("x[1]") and should not either).

jsocolar · 2021-05-24T17:06:29Z

Hey @mjskay,
The current behavior for get_indices(c("x[-1,1]", "x[3,3]")) is:

> get_indices(c("x[-1,1]", "x[3,3]"))
[[1]]
  V2 V1 index
1  1 -1     1
5  1  3    NA
3  2 -1    NA
6  2  3    NA
4  3 -1    NA
2  3  3     2

Note that it treats the columns independently in the sense that, despite that the first column is not one-indexed (and therefore doesn't get filled in with the full sequence of integers from one to max), it still fills in the second column with integers from one to max.

Likewise, if one column is non-numeric and the other column is integer, it treats the second column like integers rather than factor labels.

What do you think the general-purpose behavior should be here? Does it matter?

mjskay · 2021-05-24T18:14:22Z

Note that it treats the columns independently in the sense that, despite that the first column is not one-indexed (and therefore doesn't get filled in with the full sequence of integers from one to max), it still fills in the second column with integers from one to max.

Likewise, if one column is non-numeric and the other column is integer, it treats the second column like integers rather than factor labels.

What do you think the general-purpose behavior should be here? Does it matter?

Good questions. I think that it should treat columns independently --- as a user I would expect from "x[foo,1]" and "x[foo,3]" that a "x[foo,2]" should exist.

You're right that the behavior of integer indices < 1 feels inconsistent. Probably it does make sense to fill those in with NAs as well. In any case I expect this to be a rare occurrence, but that the most likely situation it would happen is if someone is importing from something that is 0-indexed, in which case filling in missing indices would be the right behavior, so we should probably just do it. If we don't have a test for that we should add it.

jsocolar · 2021-05-24T18:46:19Z

Cool. When you get a chance, check out the new parse_variable_indices() over on #152. I think this should do everything we want; hope the code's not too clumsy.

mjskay · 2021-05-24T19:01:35Z

Awesome, thanks! Will do

paul-buerkner added the feature New feature or request label Mar 24, 2020

mjskay mentioned this issue May 21, 2021

summary with one line per _named_ parameter #43

Open

jsocolar mentioned this issue May 21, 2021

Roll up summaries of nonscalar variables #152

Open

4 tasks

mjskay mentioned this issue Jan 30, 2024

extract_variable_array and the with_indices parameter for variables() #342

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coding matrix/vector variables in tidy form #61

Coding matrix/vector variables in tidy form #61

statscats commented Feb 18, 2020

mjskay commented Feb 18, 2020

lauken13 commented Feb 21, 2020

mjskay commented Feb 21, 2020

paul-buerkner commented Feb 21, 2020

lauken13 commented Feb 25, 2020 •

edited

Loading

mjskay commented Feb 26, 2020

paul-buerkner commented Feb 26, 2020

paul-buerkner commented Mar 26, 2021

mjskay commented Mar 26, 2021

jsocolar commented May 21, 2021

mjskay commented May 21, 2021

jsocolar commented May 24, 2021 •

edited

Loading

mjskay commented May 24, 2021

jsocolar commented May 24, 2021

mjskay commented May 24, 2021

Coding matrix/vector variables in tidy form #61

Coding matrix/vector variables in tidy form #61

Comments

statscats commented Feb 18, 2020

mjskay commented Feb 18, 2020

lauken13 commented Feb 21, 2020

mjskay commented Feb 21, 2020

paul-buerkner commented Feb 21, 2020

lauken13 commented Feb 25, 2020 • edited Loading

mjskay commented Feb 26, 2020

paul-buerkner commented Feb 26, 2020

paul-buerkner commented Mar 26, 2021

mjskay commented Mar 26, 2021

jsocolar commented May 21, 2021

mjskay commented May 21, 2021

jsocolar commented May 24, 2021 • edited Loading

mjskay commented May 24, 2021

jsocolar commented May 24, 2021

mjskay commented May 24, 2021

lauken13 commented Feb 25, 2020 •

edited

Loading

jsocolar commented May 24, 2021 •

edited

Loading