
I have a dataframe where missingness is indicated by "Z" (there may also be some "z" and NA entries present in the data), and values are entered as characters ("0", "1", etc). I need to create scores ("updrs1", "updrs2", "updrs3") that add up the non-missing values across columns selected by colname prefix ("NP1", "NP2", "NP3").

dummy data:

dummy_df <- data.frame(
  subject_id = seq(1,6,1), 
  OTHERV1 = c(1,1,0,0,1,1),
  NP1VAR1 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR3 = c("Z","3","Z","Z","Z","Z"),
  NP2VAR1 = c("Z","2","Z","Z","Z","Z"), 
  NP2VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP2VAR3 = c("Z","0","Z","Z","Z","Z"),
  NP3VAR1 = c("Z","4","Z","Z","Z","Z"),
  NP3VAR2 = c("Z","0","Z","Z","z","Z"),
  NP3VAR3 = c("Z","0","Z","Z","Z",NA),
  OTHERV2 = c(NA,NA,NA,NA,NA,NA)
)

desired output:

subject_id updrs1 updrs2 updrs3
1 1 Z Z Z
2 2 3 2 4
3 3 Z Z Z
4 4 Z Z Z
5 5 Z Z Z
6 6 Z Z Z

NOTE: all values in the output are characters

NOTE: treating NA/Z/z as 0 (i.e., transforming all values with as.numeric()) is problematic.
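
To illustrate the note above, here is a small base R sketch. The two vectors and the helper name to_num are made up for illustration and are not part of the real data. Once the missing codes are converted, summing with na.rm = TRUE makes an all-missing subject indistinguishable from one whose items are all genuinely 0:

```r
# Hypothetical item scores for two subjects (illustration only)
all_missing <- c("Z", "z", NA)
all_zero    <- c("0", "0", "0")

# Hypothetical helper: recode "Z"/"z" to NA first, so as.numeric() does not warn
to_num <- function(x) as.numeric(replace(x, x %in% c("Z", "z"), NA))

sum(to_num(all_missing), na.rm = TRUE)   # 0, looks like a real score of 0
sum(to_num(all_zero),    na.rm = TRUE)   # 0
sum(to_num(all_missing), na.rm = FALSE)  # NA, missingness is preserved
```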

I've tried variations on this answer, with no luck.

I have used a combination of tidyr::pivot_longer(), dplyr::group_by(), and summarize():

library(dplyr)
library(tidyr)

desired_output <- select(dummy_df, c(subject_id, starts_with("NP"))) %>%
  # recode "Z"/"z" to NA so that as.numeric() converts without warnings
  mutate(across(everything(), ~ifelse(. %in% c("Z", "z"), NA, .))) %>%
  pivot_longer(cols = starts_with("NP"), names_to = c(".value", "np_var"), names_sep = "VAR") %>%
  group_by(subject_id) %>%
  # na.rm = FALSE: any missing item makes the whole score NA
  summarize(updrs1 = sum(as.numeric(NP1), na.rm = FALSE),
            updrs2 = sum(as.numeric(NP2), na.rm = FALSE), 
            updrs3 = sum(as.numeric(NP3), na.rm = FALSE), .groups = "drop") %>%
  # convert back to character and restore "Z" for missing scores
  mutate(across(everything(), as.character)) %>%
  replace(is.na(.), "Z")

This works. L Tyrone's answer also works, so I've accepted it. Thanks all.

  • Are the various indicators of missing data handled differently or can all be treated as NA?
    – Seth
    Commented Jul 9 at 23:35
  • They can all be treated as NA.
    – jbmchls
    Commented Jul 10 at 14:31

3 Answers

0

If you don't care whether:

  • the missingness values are upper or lower case
  • Z/z are the only missingness values
  • existing NA values can be represented as Z

then this works. There is likely a more direct way to do this, but I find a step-wise approach easier to follow. Note that it will return a warning ("NAs introduced by coercion"), because if_else() evaluates as.integer() on every value, including "Z". You can ignore it:

library(dplyr)
library(tidyr)

dummy_df |>
  pivot_longer(
    cols = starts_with("NP"),
    names_to = c("grp", "var"),
    names_pattern = "^(NP\\d+)(VAR\\d+)$",
    values_to = "value") |>
  mutate(value = if_else(grepl("^[0-9]+$", value), as.integer(value), NA)) |>
  summarise(value = sum(value), .by = c(subject_id, grp)) |>
  pivot_wider(id_cols = subject_id,
              names_from = grp,
              values_from = value) |>
  rename(updrs1 = NP1, updrs2 = NP2, updrs3 = NP3) |>
  mutate(across(where(is.numeric), as.character),
         across(everything(), ~replace_na(.x, "Z")))
    
# # A tibble: 6 × 4
#   subject_id updrs1 updrs2 updrs3
#   <chr>      <chr>  <chr>  <chr> 
# 1 1          Z      Z      Z     
# 2 2          3      2      4     
# 3 3          Z      Z      Z     
# 4 4          Z      Z      Z     
# 5 5          Z      Z      Z     
# 6 6          Z      Z      Z
  • Can you please explain the "names_pattern" attribute of pivot_longer()?
    – jbmchls
    Commented Jul 10 at 15:05
  • This works, so I will accept it as the best answer.
    – jbmchls
    Commented Jul 10 at 18:05
  • @jbmchls - because you want to combine (sum) multiple instances of the NP groups, you need some way of creating common values to group by. However, the NP groups have different suffixes. names_pattern = is useful in your case as it allows you to break up the column names into two or more columns. The brackets in the regular expression (regex) define the boundaries of the patterns to 'capture' from the column names. If you need an explanation of the regex, let me know.
    – L Tyrone
    Commented Jul 10 at 19:46
  • @jbmchls - also, to work out what the code does, you can run it up to each pipe "|>" and examine the result of each step. So dummy_df |> ... values_to = "value"), dummy_df |> ... as.integer(value), NA)) etc.
    – L Tyrone
    Commented Jul 10 at 20:28
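
As a small illustration of the names_pattern behaviour described in the comments above, here is a sketch on a one-row data frame. The data frame toy and its values are hypothetical. Each bracketed group in the regex fills one of the columns listed in names_to:

```r
library(tidyr)

# Two hypothetical item columns, for illustration only
toy <- data.frame(subject_id = 1, NP1VAR1 = "0", NP2VAR3 = "2")

long <- pivot_longer(
  toy,
  cols = starts_with("NP"),
  names_to = c("grp", "var"),             # one new column per capture group
  names_pattern = "^(NP\\d+)(VAR\\d+)$",  # group 1 -> grp, group 2 -> var
  values_to = "value"
)

long$grp  # "NP1" "NP2"
long$var  # "VAR1" "VAR3"
```

With the prefix isolated in grp, columns such as NP1VAR1, NP1VAR2 and NP1VAR3 all share the value "NP1", which is what makes the later grouped sum possible.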
0

You may also use rowwise() with sum():

library(tidyverse)
df_new <- dummy_df %>%
  mutate(across(starts_with("NP"), ~ as.numeric(.x))) %>%
  rowwise() %>%
  mutate(subject_id,
         updrs1 = sum(c_across(starts_with("NP1")), na.rm = TRUE) %>% as.character(),
         updrs2 = sum(c_across(starts_with("NP2")), na.rm = TRUE) %>% as.character(),
         updrs3 = sum(c_across(starts_with("NP3")), na.rm = TRUE) %>% as.character(),
         .keep = "none") %>%
    mutate(across(starts_with("updrs"), ~ifelse(. == "0", "Z", .)))
df_new

Or, without rowwise(), use rowSums():

df_new <- dummy_df %>%
  mutate(across(starts_with("NP"), ~ as.numeric(.x))) %>%
  mutate(subject_id,
         updrs1 = rowSums(across(starts_with("NP1")), na.rm = TRUE) %>% as.character(),
         updrs2 = rowSums(across(starts_with("NP2")), na.rm = TRUE) %>% as.character(),
         updrs3 = rowSums(across(starts_with("NP3")), na.rm = TRUE) %>% as.character(),
         .keep = "none") %>%
    mutate(across(starts_with("updrs"), ~ifelse(. == "0", "Z", .)))
df_new
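
Note that both versions above turn every total of "0" into "Z", so a subject whose items are all genuinely 0 is reported as missing. The following is a sketch of one possible way around that, not part of the original answer. It assumes that a partially missing row should still be summed over its observed items and that "Z" should appear only when every item is missing. The helper name score_or_z and the toy data are hypothetical, and pick() needs dplyr 1.1.0 or later:

```r
library(dplyr)

# Hypothetical helper: row-wise sum of character item columns,
# returning "Z" only when every item in the row is missing
score_or_z <- function(cols) {
  m <- as.matrix(cols)
  m[m %in% c("Z", "z")] <- NA                   # recode missing codes first
  num <- matrix(as.numeric(m), nrow = nrow(m))  # no coercion warning now
  n_observed <- rowSums(!is.na(num))
  ifelse(n_observed == 0, "Z", as.character(rowSums(num, na.rm = TRUE)))
}

# Made-up data: all missing, all zero, partially missing
toy <- data.frame(
  subject_id = 1:3,
  NP1VAR1 = c("Z", "0", "2"),
  NP1VAR2 = c("z", "0", "Z"),
  NP1VAR3 = c(NA,  "0", "1")
)

res <- toy |>
  transmute(subject_id,
            updrs1 = score_or_z(pick(starts_with("NP1"))))

res$updrs1  # "Z" "0" "3"
```

If any missing item should instead invalidate the whole score, comparing n_observed with ncol(num) rather than 0 would give that behaviour.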
  • When applied to my real data, the str_replace() function turned a 20 into "2Z". Can you fix, please?
    – jbmchls
    Commented Jul 10 at 14:59
  • I have updated my answer. Use ifelse() instead of str_replace(). Please let me know if you still have the same problem with the real data.
    Commented Jul 10 at 19:56
0
library(dplyr)
library(tidyr)

# Sum the non-missing values; a total of 0 is reported as "Z"
f <- function(x) {
  x <- as.character(sum(x, na.rm=TRUE))
  if(x=="0") return("Z") else return(x)
}

dummy_df |>
  select(subject_id, matches("^NP[1-3]")) |>
  pivot_longer(-subject_id, 
               names_pattern="(NP[1-3])(VAR[1-3])",
               names_to=c(".value", "set"),
               # "Z"/"z" become NA here, with a coercion warning
               values_transform = as.numeric) |>
  summarise(across(NP1:NP3, f), .by=subject_id) 

# A tibble: 6 × 4
  subject_id NP1   NP2   NP3  
       <dbl> <chr> <chr> <chr>
1          1 Z     Z     Z    
2          2 3     2     4    
3          3 Z     Z     Z    
4          4 Z     Z     Z    
5          5 Z     Z     Z    
6          6 Z     Z     Z
  • My output doesn't match yours. I have 0 values instead of "Z".
    – jbmchls
    Commented Jul 10 at 15:34
  • Yeah - me too. Sorry about that. Fixed.
    – Edward
    Commented Jul 10 at 23:20
