function for calculating scales #126

Open
2 of 3 tasks
wibeasley opened this issue Oct 17, 2022 · 4 comments

Comments

@wibeasley
Member

wibeasley commented Oct 17, 2022

inputs:

  • vector of column names
  • minimum count of nonmissing columns
  • weights vector
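One way the three inputs above could fit together, as a minimal base-R sketch. The function name `weighted_row_mean()` and its argument names are hypothetical and not part of any package; it loops over columns rather than casting the data.frame to a matrix, and it assumes the weights of missing items should be dropped from the denominator.

```r
# Hypothetical helper (illustration only): weighted row mean that returns NA
# when a row has fewer than `min_nonmissing` observed items.
weighted_row_mean <- function(
    d,
    columns,
    min_nonmissing,
    weights = rep(1, length(columns))
) {
  stopifnot(length(weights) == length(columns))

  numerator   <- numeric(nrow(d)) # running sum of weight * value
  denominator <- numeric(nrow(d)) # running sum of weights for observed items
  n_observed  <- integer(nrow(d)) # running count of observed items

  for (i in seq_along(columns)) {
    x           <- d[[columns[i]]]
    observed    <- !is.na(x)
    numerator   <- numerator + ifelse(observed, x * weights[i], 0)
    denominator <- denominator + observed * weights[i]
    n_observed  <- n_observed + observed
  }

  score <- numerator / denominator
  score[n_observed < min_nonmissing] <- NA_real_
  score
}

d <- data.frame(
  q1 = c(1, 2, NA),
  q2 = c(3, NA, NA),
  q3 = c(5, 4, 6)
)

res <- weighted_row_mean(
  d,
  columns        = c("q1", "q2", "q3"),
  min_nonmissing = 2,
  weights        = c(1, 1, 2)
)
res
# Row 1: (1*1 + 3*1 + 5*2) / 4 = 3.5
# Row 2: (2*1 + 4*2) / 3
# Row 3: only one item observed, so NA
```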
@wibeasley wibeasley self-assigned this Oct 17, 2022
@wibeasley
Member Author

@genevamarshall, @yutiantang, and others are using sjstats::mean_n(). It doesn't support nonuniform weights, and (at least currently) it uses a slow approach that casts the data.frame to a matrix.

@wibeasley
Member Author

wibeasley commented Oct 22, 2023

I've been working on something that meets all these requirements except for the nonuniform weights.
https://github.com/LiveOak/vasquez-border-reentry-1

row_sum <- function(
    d,
    columns_to_average        = character(0),
    pattern,
    new_column_name           = "row_sum",
    threshold_proportion      = .75,
    verbose                   = FALSE
) {

  if (length(columns_to_average) == 0L) {
    columns_to_average <-
      d |>
      colnames() |>
      grep(
        x         = _,
        pattern   = pattern,
        value     = TRUE,
        perl      = TRUE
      )

    if (verbose) {
      message(
        "The following columns will be summed:\n- ",
        paste(columns_to_average, collapse = "\n- ")
      )
    }
  }

  d |>
    dplyr::mutate(
      # Temporary columns are dot-prefixed so they can't collide with
      # `new_column_name` (whose default is "row_sum") and get dropped below.
      .row_sum = # Finding the sum (used by m4)
        rowSums(
          dplyr::across(dplyr::all_of(columns_to_average)),
          na.rm = TRUE
        ),
      .nonmissing_count =
        rowSums(
          dplyr::across(
            dplyr::all_of(columns_to_average),
            .fns = \(x) { !is.na(x) }
          )
        ),
      .nonmissing_proportion = .nonmissing_count / length(columns_to_average),
      {{new_column_name}} :=
        dplyr::if_else(
          threshold_proportion <= .nonmissing_proportion,
          .row_sum,
          # .row_sum / .nonmissing_count,
          NA_real_
        )
    ) |>
    dplyr::select(
      -.row_sum,
      -.nonmissing_count,
      -.nonmissing_proportion
    )
  # Alternatively, return just the new columns
  # dplyr::pull({{new_column_name}})
}
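
To make the thresholding rule concrete, here is a small worked example in base R. It does not call the function above; it only reproduces the same arithmetic on a made-up data.frame (`items` and its column names are hypothetical). With four items and a threshold of .75, a row needs at least three nonmissing items to receive a sum.

```r
# Made-up data: row 1 has 4 of 4 items, row 2 has 3 of 4, row 3 has 2 of 4.
items <- data.frame(
  item_1 = c(1, 1, NA),
  item_2 = c(2, NA, NA),
  item_3 = c(3, 3, 3),
  item_4 = c(4, 4, 4)
)

threshold_proportion  <- 0.75
nonmissing_proportion <- rowMeans(!is.na(items))     # 1.00, 0.75, 0.50
raw_sum               <- rowSums(items, na.rm = TRUE) # 10, 8, 7

# Keep the sum only when enough items were answered.
scale_sum <- ifelse(
  threshold_proportion <= nonmissing_proportion,
  raw_sum,
  NA_real_
)
unname(scale_sum)
# 10  8 NA

# Assuming dplyr is installed and the function above has been sourced, the
# equivalent call would presumably look like:
# row_sum(items, pattern = "^item_\\d$", new_column_name = "scale_sum")
```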

wibeasley added a commit that referenced this issue Oct 28, 2023
I'm not sure why it was producing errors before

ref #126
@wibeasley wibeasley mentioned this issue Oct 28, 2023
@DavidBard
Member

@wibeasley Feature request and questions:
FR: Would be nice to have a row_mean function as well, which averages across all nonmissing items.
Q1: For row_sum, should 'columns_to_average' argument be 'columns_to_sum' instead?
Q2: Can you provide an example of how this function might be used inside a dplyr::mutate statement?
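
As a rough illustration of the FR and Q2, here is a sketch of a vector-returning helper that could be called inside dplyr::mutate. The name `row_mean_sketch()` is hypothetical and is not the package's function. It assumes the same thresholding rule as row_sum and averages over the nonmissing items only. The dplyr portion runs only if dplyr 1.1.0 or later is installed, since it relies on `dplyr::pick()`.

```r
# Hypothetical helper (illustration only): takes a data.frame of item columns
# and returns one mean per row, or NA when too few items are nonmissing.
row_mean_sketch <- function(items, threshold_proportion = 0.75) {
  items            <- as.data.frame(items)
  nonmissing_count <- Reduce(`+`, lapply(items, \(x) !is.na(x)))
  total            <- Reduce(`+`, lapply(items, \(x) ifelse(is.na(x), 0, x)))

  ifelse(
    threshold_proportion <= nonmissing_count / ncol(items),
    total / nonmissing_count,
    NA_real_
  )
}

d <- data.frame(
  id     = 1:3,
  item_1 = c(1, 1, NA),
  item_2 = c(2, NA, NA),
  item_3 = c(3, 3, 3),
  item_4 = c(4, 4, 4)
)

# Base-R call: row 1 is 10/4, row 2 is 8/3, row 3 is NA (2 of 4 items).
scale_mean <- row_mean_sketch(d[c("item_1", "item_2", "item_3", "item_4")])

# Inside a mutate statement, assuming dplyr >= 1.1.0 is available.
if (
  requireNamespace("dplyr", quietly = TRUE) &&
    utils::packageVersion("dplyr") >= "1.1.0"
) {
  d <- d |>
    dplyr::mutate(
      scale_mean = row_mean_sketch(dplyr::pick(dplyr::starts_with("item_")))
    )
}
```

Returning a plain vector (rather than a modified data.frame) is what lets the helper sit on the right-hand side of a mutate assignment.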

2 participants