Add values across dataframe columns

Question

I have a dataframe where missingness is indicated by "Z" (there may also be some "z" and NA entries present in the data), and values are entered as characters ("0", "1", etc). I need to create scores ("updrs1", "updrs2", "updrs3") that add up the non-missing values across columns selected by colname prefix ("NP1", "NP2", "NP3").

dummy data:

dummy_df <- data.frame(
  subject_id = seq(1,6,1), 
  OTHERV1 = c(1,1,0,0,1,1),
  NP1VAR1 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR3 = c("Z","3","Z","Z","Z","Z"),
  NP2VAR1 = c("Z","2","Z","Z","Z","Z"), 
  NP2VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP2VAR3 = c("Z","0","Z","Z","Z","Z"),
  NP3VAR1 = c("Z","4","Z","Z","Z","Z"),
  NP3VAR2 = c("Z","0","Z","Z","z","Z"),
  NP3VAR3 = c("Z","0","Z","Z","Z",NA),
  OTHERV2 = c(NA,NA,NA,NA,NA,NA)
)

desired output:

	subject_id	updrs1	updrs2	updrs3
1	1	Z	Z	Z
2	2	3	2	4
3	3	Z	Z	Z
4	4	Z	Z	Z
5	5	Z	Z	Z
6	6	Z	Z	Z

NOTE: all values in the output are characters

NOTE: treating NA/Z/z as 0 (i.e., transforming all values with as.numeric()) is problematic.

I've tried variations on this answer, with no luck.

I have used a combination of tidyr::pivot_longer(), dplyr::group_by(), and summarize():

library(dplyr)
library(tidyr)

desired_output <- select(dummy_df, c(subject_id, starts_with("NP"))) %>%
  # Recode the "Z"/"z" missingness markers as NA
  mutate(across(everything(), ~ifelse(. %in% c("Z", "z"), NA, .))) %>%
  pivot_longer(cols = starts_with("NP"), names_to = c(".value", "np_var"), names_sep = "VAR") %>%
  group_by(subject_id) %>%
  # na.rm = FALSE: any missing item makes the whole score NA
  summarize(updrs1 = sum(as.numeric(NP1), na.rm = FALSE),
            updrs2 = sum(as.numeric(NP2), na.rm = FALSE), 
            updrs3 = sum(as.numeric(NP3), na.rm = FALSE), .groups = "drop") %>%
  mutate(across(everything(), as.character)) %>%
  replace(is.na(.), "Z")

This works. L Tyrone's answer also works, so I've accepted it. Thanks all.

Are the various indicators of missing data handled differently or can all be treated as NA? — Seth, Commented Jul 9 at 23:35

L Tyrone · Accepted Answer · 2024-07-10 20:31:54Z


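If you can accept that:

- upper- and lower-case missingness values ("Z" and "z") are treated the same
- Z/z are the only missingness values
- existing NA values are represented as "Z" in the output

then this works. There is likely to be a more direct way to do this, but I find a step-wise approach easier to follow. Note that it will return an "NAs introduced by coercion" warning, because as.integer() is evaluated on the whole column (including the "Z" entries) before if_else() picks the values; you can ignore it: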

library(dplyr)
library(tidyr)

dummy_df |>
  pivot_longer(
    cols = starts_with("NP"),
    names_to = c("grp", "var"),
    names_pattern = "^(NP\\d+)(VAR\\d+)$",
    values_to = "value") |>
  mutate(value = if_else(grepl("^[0-9]+$", value), as.integer(value), NA)) |>
  summarise(value = sum(value), .by = c(subject_id, grp)) |>
  pivot_wider(id_cols = subject_id,
              names_from = grp,
              values_from = value) |>
  rename(updrs1 = NP1, updrs2 = NP2, updrs3 = NP3) |>
  mutate(across(where(is.numeric), as.character),
         across(everything(), ~replace_na(.x, "Z")))
    
# # A tibble: 6 × 4
#   subject_id updrs1 updrs2 updrs3
#   <chr>      <chr>  <chr>  <chr> 
# 1 1          Z      Z      Z     
# 2 2          3      2      4     
# 3 3          Z      Z      Z     
# 4 4          Z      Z      Z     
# 5 5          Z      Z      Z     
# 6 6          Z      Z      Z
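
If the coercion warning is unwanted, one option is to blank out the non-digit entries before converting. This is only a sketch on a small made-up data frame (toy is a hypothetical name, not part of the question's data); it keeps the same rule as the code above, where any missing item makes the group sum NA:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same shape as dummy_df
toy <- data.frame(
  subject_id = 1:3,
  NP1VAR1 = c("Z", "0", "1"),
  NP1VAR2 = c("z", "3", NA),
  NP2VAR1 = c("Z", "2", "4")
)

scores <- toy |>
  pivot_longer(
    cols = starts_with("NP"),
    names_to = c("grp", "var"),
    names_pattern = "^(NP\\d+)(VAR\\d+)$",
    values_to = "value") |>
  # Set anything that is not all digits (Z, z, NA) to NA first,
  # so as.integer() never sees a string it cannot convert
  mutate(value = as.integer(replace(value, !grepl("^[0-9]+$", value), NA))) |>
  # No na.rm here: a single missing item makes the group sum NA
  summarise(value = sum(value), .by = c(subject_id, grp))

scores
```

From here the pivot_wider(), rename() and replace_na() steps can follow unchanged.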

edited Jul 10 at 20:31

answered Jul 10 at 0:04


Can you please explain the "names_pattern" argument of pivot_longer()? – jbmchls, Commented Jul 10 at 15:05
This works, so I will accept it as the best answer. – jbmchls, Commented Jul 10 at 18:05
@jbmchls - because you want to combine (sum) multiple instances of the NP groups, you need some way of creating common values to group by. However, the NP groups have different suffixes. names_pattern = is useful in your case as it allows you to break up the column names into two or more columns. The brackets in the regular expression (regex) define the boundaries of the patterns to 'capture' from the column names. If you need an explanation of the regex, let me know. – L Tyrone, Commented Jul 10 at 19:46
@jbmchls - also, to work out what the code does, you can run it up to each pipe "|>" and examine the result of each step. So dummy_df |> ... values_to = "value"), dummy_df |> ... as.integer(value), NA)) etc. – L Tyrone, Commented Jul 10 at 20:28
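
To make the names_pattern explanation concrete, here is a minimal sketch that runs only the first step of the pipeline on a small made-up data frame (toy is a hypothetical name). The first bracketed group of the regex fills the "grp" column and the second fills "var":

```r
library(tidyr)

# Hypothetical toy data with the same column naming scheme
toy <- data.frame(
  subject_id = 1:2,
  NP1VAR1 = c("Z", "0"),
  NP1VAR2 = c("Z", "3"),
  NP2VAR1 = c("Z", "2")
)

long <- pivot_longer(
  toy,
  cols = starts_with("NP"),
  names_to = c("grp", "var"),            # one new column per capture group
  names_pattern = "^(NP\\d+)(VAR\\d+)$", # (NP + digits)(VAR + digits)
  values_to = "value"
)

# "NP1VAR2" has been split into grp = "NP1" and var = "VAR2"
long
```

Because every NP1 column now shares grp == "NP1", the later summarise() can sum them as one group.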


Zhiqiang Wang · Answer · 2024-07-10 19:52:39Z

You may also use rowwise() with sum():

library(tidyverse)
df_new <- dummy_df %>%
  mutate(across(starts_with("NP"), ~ as.numeric(.x))) %>%
  rowwise() %>%
  mutate(subject_id,
         updrs1 = sum(c_across(starts_with("NP1")), na.rm = TRUE) %>% as.character(),
         updrs2 = sum(c_across(starts_with("NP2")), na.rm = TRUE) %>% as.character(),
         updrs3 = sum(c_across(starts_with("NP3")), na.rm = TRUE) %>% as.character(),
         .keep = "none") %>%
    mutate(across(starts_with("updrs"), ~ifelse(. == "0", "Z", .)))
df_new

Or, without rowwise(), use rowSums():

df_new <- dummy_df %>%
  mutate(across(starts_with("NP"), ~ as.numeric(.x))) %>%
  mutate(subject_id,
         updrs1 = rowSums(across(starts_with("NP1")), na.rm = TRUE) %>% as.character(),
         updrs2 = rowSums(across(starts_with("NP2")), na.rm = TRUE) %>% as.character(),
         updrs3 = rowSums(across(starts_with("NP3")), na.rm = TRUE) %>% as.character(),
         .keep = "none") %>%
    mutate(across(starts_with("updrs"), ~ifelse(. == "0", "Z", .)))
df_new
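
Note that with na.rm = TRUE followed by recoding "0" to "Z", a row whose observed values genuinely add up to 0 is reported the same way as a row with no data at all. The following base-R sketch separates the two cases, assuming the score should be the sum of the non-missing values and "Z" only when every value in the group is missing. That rule is an assumption about the intended scoring, and score_by_prefix() and toy are hypothetical names that are not part of the answer:

```r
# Sum the columns whose names start with `prefix`, row by row.
# Returns a character vector: "Z" if all values in a row are missing,
# otherwise the sum of the non-missing values.
score_by_prefix <- function(df, prefix) {
  cols <- df[startsWith(names(df), prefix)]
  num <- as.data.frame(lapply(cols, function(x) {
    x <- as.character(x)
    x[!grepl("^[0-9]+$", x)] <- NA  # Z, z and NA all become NA
    as.numeric(x)
  }))
  total <- rowSums(num, na.rm = TRUE)
  all_missing <- rowSums(!is.na(num)) == 0
  unname(ifelse(all_missing, "Z", as.character(total)))
}

# Hypothetical toy data: row 2 is a true zero, row 1 has no data
toy <- data.frame(
  NP1VAR1 = c("Z", "0", "2"),
  NP1VAR2 = c("z", "0", NA)
)

score_by_prefix(toy, "NP1")
```

On the question's data this could be called as score_by_prefix(dummy_df, "NP1") and so on for each prefix. If any missing item should invalidate the whole score instead, the all_missing test would need to become an any-missing test.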

When applied to my real data, the str_replace() function turned a 20 into "2Z". Can you fix, please? — jbmchls, Commented Jul 10 at 14:59
I have updated my answer. Use ifelse() instead of str_replace(). Please let me know if you still have the same problem with the real data. — Zhiqiang Wang, Commented Jul 10 at 19:56
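
The "2Z" result described in these comments comes from pattern matching: str_replace() replaces the first "0" found anywhere in the string, while a comparison with == only matches the whole value. A short illustration on made-up values (x is a hypothetical vector):

```r
library(stringr)

x <- c("0", "20", "105")

# Pattern match: the first "0" anywhere in each string is replaced
replaced <- str_replace(x, "0", "Z")

# Exact comparison: only the value "0" itself is recoded
exact <- ifelse(x == "0", "Z", x)

# Anchoring the pattern gives str_replace() the same behaviour
anchored <- str_replace(x, "^0$", "Z")

replaced
exact
anchored
```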

Edward · Answer · 2024-07-10 23:19:59Z


library(dplyr)
library(tidyr)

# Sum one group of values, ignoring NAs; a total of 0 is reported as "Z"
f <- function(x) {
  x <- as.character(sum(x, na.rm=TRUE))
  if(x=="0") return("Z") else return(x)
}

dummy_df |>
  select(subject_id, matches("^NP[1-3]")) |>
  pivot_longer(-subject_id, 
               names_pattern="(NP[1-3])(VAR[1-3])",
               names_to=c(".value", "set"),
               values_transform = as.numeric) |>
  summarise(across(NP1:NP3, f), .by=subject_id)

# # A tibble: 6 × 4
#   subject_id NP1   NP2   NP3  
#        <dbl> <chr> <chr> <chr>
# 1          1 Z     Z     Z    
# 2          2 3     2     4    
# 3          3 Z     Z     Z    
# 4          4 Z     Z     Z    
# 5          5 Z     Z     Z    
# 6          6 Z     Z     Z
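
Both this answer and the question's own solution pass ".value" in names_to. As a sketch of what that does, on a small made-up data frame (toy is a hypothetical name): the part of each column name captured for ".value" becomes the name of an output column, so NP1 and NP2 stay as separate columns and no pivot_wider() step is needed afterwards.

```r
library(tidyr)

# Hypothetical toy data with two NP groups of two items each
toy <- data.frame(
  id = 1:2,
  NP1VAR1 = c("1", "Z"),
  NP1VAR2 = c("2", "Z"),
  NP2VAR1 = c("3", "Z"),
  NP2VAR2 = c("4", "Z")
)

long <- pivot_longer(
  toy,
  cols = -id,
  # First capture group names the value columns, second goes into "set"
  names_to = c(".value", "set"),
  names_pattern = "^(NP\\d+)(VAR\\d+)$"
)

# One row per id and set, with one column per NP group
long
```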

edited Jul 10 at 23:19

answered Jul 10 at 2:59


My output doesn't match yours. I have 0 values instead of "Z". – jbmchls, Commented Jul 10 at 15:34
Yeah - me too. Sorry about that. Fixed. – Edward, Commented Jul 10 at 23:20
