v1.1.0 runtime for case_when with grouping variable is slow #6674

fawda123 · 2023-02-01T01:53:22Z

Using case_when in a mutate call with a grouping variable is much, much slower in v1.1.0 compared to v1.0.10. The code works but it's causing a tremendous slowdown in many of the packages I maintain (see here, many examples have elapsed time >5s).

Here's a reprex for v1.1.0.

library(dplyr, warn.conflicts = F)
library(microbenchmark)

n <- 1000
dat <- data.frame(
    x = seq(1:n), 
    y = rnorm(n)
)

microbenchmark(
    dat %>% 
        group_by(x) %>% 
        mutate(
                 z = case_when(
                    y < 0 ~ '-',
                    T ~ '+', 
                 )
        ), 
    times = 100
)
#> Unit: seconds
#>                                                                        expr
#>  dat %>% group_by(x) %>% mutate(z = case_when(y < 0 ~ "-", T ~      "+", ))
#>       min       lq     mean   median       uq      max neval
#>  2.376748 2.537896 2.650869 2.625663 2.723655 3.170204   100

Created on 2023-02-01 with reprex v2.0.2

Session info

sessioninfo::session_info()
#> - Session info  --------------------------------------------------------------
#>  hash: person in steamy room: medium-dark skin tone, goat, black small square
#> 
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2023-02-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package        * version date (UTC) lib source
#>  cli              3.6.0   2023-01-09 [1] CRAN (R 4.1.3)
#>  digest           0.6.31  2022-12-11 [1] CRAN (R 4.1.3)
#>  dplyr          * 1.1.0   2023-01-29 [1] CRAN (R 4.1.3)
#>  evaluate         0.20    2023-01-17 [1] CRAN (R 4.1.3)
#>  fansi            1.0.4   2023-01-22 [1] CRAN (R 4.1.3)
#>  fastmap          1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs               1.6.0   2023-01-23 [1] CRAN (R 4.1.3)
#>  generics         0.1.3   2022-07-05 [1] CRAN (R 4.1.3)
#>  glue             1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  htmltools        0.5.4   2022-12-07 [1] CRAN (R 4.1.3)
#>  knitr            1.42    2023-01-25 [1] CRAN (R 4.1.3)
#>  lifecycle        1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  microbenchmark * 1.4.9   2021-11-09 [1] CRAN (R 4.1.3)
#>  pillar           1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  purrr            1.0.1   2023-01-10 [1] CRAN (R 4.1.3)
#>  R.cache          0.15.0  2021-04-30 [1] CRAN (R 4.1.3)
#>  R.methodsS3      1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo             1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils          2.11.0  2021-09-26 [1] CRAN (R 4.1.3)
#>  R6               2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  reprex           2.0.2   2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang            1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown        2.20    2023-01-19 [1] CRAN (R 4.1.3)
#>  rstudioapi       0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo      1.2.1   2021-11-02 [1] CRAN (R 4.1.2)
#>  styler           1.7.0   2022-03-13 [1] CRAN (R 4.1.3)
#>  tibble           3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>  tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.1.3)
#>  utf8             1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs            0.5.2   2023-01-23 [1] CRAN (R 4.1.3)
#>  withr            2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun             0.36    2022-12-21 [1] CRAN (R 4.1.3)
#>  yaml             2.3.7   2023-01-23 [1] CRAN (R 4.1.3)
#> 
#>  [1] C:/Users/mbeck/R/win-library
#>  [2] C:/Program Files/R/R-4.1.3/library
#> 
#> ------------------------------------------------------------------------------

And here's a reprex for v1.0.10 (note that these times are in milliseconds, whereas the times above are in seconds).

library(dplyr, warn.conflicts = F)
library(microbenchmark)

n <- 1000
dat <- data.frame(
    x = seq(1:n), 
    y = rnorm(n)
)

microbenchmark(
    dat %>% 
        group_by(x) %>% 
        mutate(
                 z = case_when(
                    y < 0 ~ '-',
                    T ~ '+', 
                 )
        ), 
    times = 100
)
#> Unit: milliseconds
#>                                                                        expr
#>  dat %>% group_by(x) %>% mutate(z = case_when(y < 0 ~ "-", T ~      "+", ))
#>       min       lq     mean  median       uq      max neval
#>  114.9103 120.9102 126.9423 123.889 128.7439 167.7735   100

Created on 2023-02-01 with reprex v2.0.2

Session info

sessioninfo::session_info()
#> - Session info  --------------------------------------------------------------
#>  hash: open mailbox with raised flag, love-you gesture: medium skin tone, snowboarder: light skin tone
#> 
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2023-02-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package        * version date (UTC) lib source
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 4.1.2)
#>  cli              3.6.0   2023-01-09 [1] CRAN (R 4.1.3)
#>  DBI              1.1.3   2022-06-18 [1] CRAN (R 4.1.3)
#>  digest           0.6.31  2022-12-11 [1] CRAN (R 4.1.3)
#>  dplyr          * 1.0.10  2022-09-01 [1] CRAN (R 4.1.3)
#>  evaluate         0.20    2023-01-17 [1] CRAN (R 4.1.3)
#>  fansi            1.0.4   2023-01-22 [1] CRAN (R 4.1.3)
#>  fastmap          1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs               1.6.0   2023-01-23 [1] CRAN (R 4.1.3)
#>  generics         0.1.3   2022-07-05 [1] CRAN (R 4.1.3)
#>  glue             1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  htmltools        0.5.4   2022-12-07 [1] CRAN (R 4.1.3)
#>  knitr            1.42    2023-01-25 [1] CRAN (R 4.1.3)
#>  lifecycle        1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  microbenchmark * 1.4.9   2021-11-09 [1] CRAN (R 4.1.3)
#>  pillar           1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  purrr            1.0.1   2023-01-10 [1] CRAN (R 4.1.3)
#>  R.cache          0.15.0  2021-04-30 [1] CRAN (R 4.1.3)
#>  R.methodsS3      1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo             1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils          2.11.0  2021-09-26 [1] CRAN (R 4.1.3)
#>  R6               2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  reprex           2.0.2   2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang            1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown        2.20    2023-01-19 [1] CRAN (R 4.1.3)
#>  rstudioapi       0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo      1.2.1   2021-11-02 [1] CRAN (R 4.1.2)
#>  styler           1.7.0   2022-03-13 [1] CRAN (R 4.1.3)
#>  tibble           3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>  tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.1.3)
#>  utf8             1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs            0.5.2   2023-01-23 [1] CRAN (R 4.1.3)
#>  withr            2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun             0.36    2022-12-21 [1] CRAN (R 4.1.3)
#>  yaml             2.3.7   2023-01-23 [1] CRAN (R 4.1.3)
#> 
#>  [1] C:/Users/mbeck/R/win-library
#>  [2] C:/Program Files/R/R-4.1.3/library
#> 
#> ------------------------------------------------------------------------------


hadley · 2023-02-01T14:44:26Z

When benchmarking a problem like this, you really want to separate the pieces. Is this a problem with mutate(), or is this a problem with case_when()? Your example requires case_when() to work on a single observation at a time, which is not its strength because it's designed to be vectorised. That suggests to me that a meaningful comparison would use a few vector lengths:

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.1.0'
y1 <- rnorm(1)
y1e3 <- rnorm(1000)
y1e6 <- rnorm(1e6)

bench::mark(
  y1 = case_when(y1 < 0 ~ "-", T ~ "+"),
  y1e3 = case_when(y1e3 < 0 ~ "-", T ~ "+"),
  y1e6 = case_when(y1e6 < 0 ~ "-", T ~ "+"),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 y1            892µs  977.4µs     949.     1.01MB     34.0
#> 2 y1e3        934.7µs  991.7µs     939.    65.37KB     34.0
#> 3 y1e6         50.1ms   75.7ms      14.6   61.04MB     23.7

Created on 2023-02-01 with reprex v2.0.2

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.10'
y1 <- rnorm(1)
y1e3 <- rnorm(1000)
y1e6 <- rnorm(1e6)

bench::mark(
  y1 = case_when(y1 < 0 ~ "-", T ~ "+"),
  y1e3 = case_when(y1e3 < 0 ~ "-", T ~ "+"),
  y1e6 = case_when(y1e6 < 0 ~ "-", T ~ "+"),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 y1           38.6µs   41.2µs   21814.      296KB     45.8
#> 2 y1e3         67.7µs   78.9µs   10627.     98.8KB     24.0
#> 3 y1e6         66.3ms   94.7ms      11.0    95.4MB     38.7

Created on 2023-02-01 with reprex v2.0.2

So that suggests that yes, using case_when() with a single observation has gotten significantly slower (maybe 800µs extra overhead), but it gets faster as the length of the vector increases.

I don't think your specific use case is a particularly compelling reason to re-consider case_when() performance, but the drop in speed at 1000 elements might suggest we should take a quick look to try and reduce some of the setup overhead.
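
For readers hitting this with grouped data, here is a minimal sketch of the vectorised alternative for the reprex at the top of the thread. It assumes, as in that reprex, that the condition does not depend on the grouping variable; the object names `grouped` and `vectorised` are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)
n <- 1000
dat <- data.frame(x = seq_len(n), y = rnorm(n))

# Per-group evaluation: case_when() is called once for each of the n groups,
# so any fixed per-call overhead is paid n times.
grouped <- dat %>%
  group_by(x) %>%
  mutate(z = case_when(y < 0 ~ "-", TRUE ~ "+")) %>%
  ungroup()

# Vectorised evaluation: the condition only looks at y, so case_when()
# can run once on the whole column before any grouping happens.
vectorised <- dat %>%
  mutate(z = case_when(y < 0 ~ "-", TRUE ~ "+"))

# Both approaches label every row the same way.
identical(grouped$z, vectorised$z)
```

This only applies when the conditions are row-wise; anything that uses a per-group summary still needs the grouping for that part.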

r2evans · 2023-02-01T14:54:43Z

edit: @hadley, I was writing this before I saw your comment, sorry for the repetition. However, I argue even with 1000-long vectors (ungrouped), the 10x decrease (by n_itr) in case_when is significant.

I think it might be helpful to isolate this as two distinct slow-downs: case_when in isolation, and case_when within mutate. I think the use of group_by()/.by= is either a red herring (exacerbating the problem) or another change in performance.

Starting with data,

set.seed(42)
n <- 1000
y <- rnorm(n)
df <- tibble(y2 = y)

we see the following comparative performance:

packageVersion("dplyr")
# [1] '1.0.10'
bench::mark(
  "dplyr-1.0.10-case_when" = case_when(y < 0 ~ "-", TRUE ~ "+"),
  "dplyr-1.0.10-mutate" = mutate(df, z = case_when(y2 < 0 ~ "-", TRUE ~ "+")),
  min_iterations=500, check=FALSE)
# # A tibble: 2 × 13
#   expression                  min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_t…¹ result memory     time       gc      
#   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>  <bch:tm> <list> <list>     <list>     <list>  
# 1 dplyr-1.0.10-case_when   89.5µs  102.9µs     8511.    98.8KB     6.50  3928     3     462ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 dplyr-1.0.10-mutate      1.24ms   1.39ms      620.   100.3KB     2.49   498     2     803ms <NULL> <Rprofmem> <bench_tm> <tibble>
# # … with abbreviated variable name ¹total_time

### different R instance, same laptop, same R
packageVersion("dplyr")
# [1] '1.1.0'
bench::mark(
  "dplyr-1.1.0-case_when" = case_when(y < 0 ~ "-", .default = "+"),
  "dplyr-1.1.0-mutate" = mutate(df, z = case_when(y2 < 0 ~ "-", .default = "+")),
  min_iterations=500, check=FALSE)
# # A tibble: 2 × 13
#   expression                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dplyr-1.1.0-case_when    1.5ms   1.63ms      595.    49.6KB     9.67   492     8   827.47ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 dplyr-1.1.0-mutate      2.76ms   3.08ms      317.    58.5KB     8.46   487    13      1.54s <NULL> <Rprofmem> <bench_tm> <tibble>

I find it very interesting that the only code difference between the two dplyr versions is the change from TRUE ~ "+" to .default = "+", yet (a) case_when has a 10x performance difference, and (b) mutate + case_when is much less different. The n_itr is high enough that I suggest these results are credible (and I repeated each several times to make sure).
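
As a sanity check on that comparison, here is a small sketch showing that the two spellings return the same values for this input, so the benchmarks compare equivalent work. It assumes dplyr >= 1.1.0, where `.default` is available; `old_style` and `new_style` are illustrative names.

```r
library(dplyr, warn.conflicts = FALSE)

# .default was added to case_when() in dplyr 1.1.0.
stopifnot(packageVersion("dplyr") >= "1.1.0")

set.seed(42)
y <- rnorm(1000)

# Fallback written as a final always-TRUE condition (works in both versions).
old_style <- case_when(y < 0 ~ "-", TRUE ~ "+")

# Fallback written with the .default argument (1.1.0 and later).
new_style <- case_when(y < 0 ~ "-", .default = "+")

identical(old_style, new_style)
```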

hadley · 2023-02-01T14:59:02Z

@r2evans I think mutate() is entirely a red herring. It just looks like we've gained ~800µs of overhead in case_when(), and that's impacting the run-time at smaller lengths (given the other evidence I'm pretty sure this is an additive change, not a multiplicative one). I agree it's worth looking into.
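
A rough worked version of the "additive, not multiplicative" reading, using only the minimum timings from the two bench::mark() tables above. The 1000-group projection is an illustrative assumption (one case_when() call per group, as in the original reprex), not a measurement.

```r
# Minimum timings in microseconds, copied from the bench::mark() output above.
min_us <- data.frame(
  n       = c(1, 1e3, 1e6),
  v1_0_10 = c(38.6, 67.7, 66300),
  v1_1_0  = c(892, 934.7, 50100)
)

# Extra time per call in v1.1.0 relative to v1.0.10.
min_us$extra <- min_us$v1_1_0 - min_us$v1_0_10
min_us

# An additive overhead shows up as a near-constant difference at small n
# (about 853 and 867 microseconds here), while at n = 1e6 v1.1.0 is faster.
overhead_us <- mean(min_us$extra[min_us$n <= 1e3])

# Projected extra cost when case_when() is called once per group.
n_groups <- 1000
projected_s <- overhead_us * n_groups / 1e6
projected_s  # roughly 0.86 seconds of added overhead
```

The slowdown reported at the top of the thread is larger than this projection, but it was measured on a different machine, so only the order of magnitude is comparable.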

r2evans · 2023-02-01T15:03:12Z

I think the slowdown in mutate may be interesting by itself, but the initial reason for my comment (that trailed yours by moments) was to isolate what is likely the larger component. I'm hopeful that a much wider net of users (now that 1.1.0 has been formally released) will provide more context and use-cases to consider if/when/how this slowdown is approached. Thanks for the package, effort, and discourse @hadley

charliejhadley · 2023-02-06T23:32:11Z

I've just updated to {dplyr} v1.1.0 and have hit a very big slowdown due to this issue. I think I have a useful demonstration of the issue and have included a reprex.

I have data on the Top 100 UK songs every week from 2000 to 2023 which is 1119,000 rows of data with this format and 17,275 groups when grouped by id_title_artist.

# A tibble: 4 × 5
  date_week_start position_current position_next title                id_title_artist                status
  <date>                     <dbl>         <dbl> <chr>                          <int>                <chr>
1 1999-12-26                    49            40 1999                              89                "Re-release"
2 1999-12-26                    52            52 2 TIMES                          105                "New release"

My code was slowed by this issue because of the following bit of code:

the_data %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))

To give some proper context to this, let's generate fake data for the top 10

library(tidyverse)
library(lubridate) # for ymd(); tidyverse 1.3.2 does not attach lubridate
dates <- seq(ymd("1999-12-26"), ymd("2023-01-01"), "7 days")
n_dates <- length(dates)

fake_data <- tibble(
  date_week_start = rep(dates,10),
) %>% 
  arrange(date_week_start) %>% 
  mutate(position_current = rep(1:10, n_dates),
         position_next = sample(c(NA, 1:100), 10 * n_dates, replace = TRUE),
         id_title_artist = sample(1:17275, 10 * n_dates),
         status = sample(c(rep("Consecutive", n_dates*0.7), rep("New release", n_dates*0.15), rep("Re-release", n_dates*0.1), rep(NA, n_dates*0.05)), 10 * n_dates, replace = TRUE)) 

fake_data
# # A tibble: 12,020 × 5
#    date_week_start position_current position_next id_title_artist status
#    <date>                     <int>         <int>           <int> <chr>
#  1 1999-12-26                     1            67           11930 Consecutive
#  2 1999-12-26                     2            38            5950 Consecutive
#  3 1999-12-26                     3            NA            4878 Consecutive
#  4 1999-12-26                     4            33            4589 New release
#  5 1999-12-26                     5            86           13923 New release
#  6 1999-12-26                     6            42           16232 Consecutive
#  7 1999-12-26                     7            13            6975 Consecutive
#  8 1999-12-26                     8            81            5723 Consecutive
#  9 1999-12-26                     9            58            3404 Consecutive
# 10 1999-12-26                    10            50           13796 Re-release
# # … with 12,010 more rows
# # ℹ Use `print(n = ...)` to see more rows

Now my code is looking for re-releases but needs to make sure that songs released in the first week of data are handled differently. As this code is then functionalised to look at different ranges of data that's particularly important:

fake_data %>% 
  arrange(date_week_start) %>% 
  group_by(id_title_artist) %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))
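
One possible workaround while the overhead exists, sketched under stated assumptions: compute the per-group minimum inside the grouped mutate(), then ungroup and call case_when() once on the whole column. The `toy` data, the `first_week` helper column, and the `per_group`/`once` names are all made up for illustration; this is not the original dataset or a recommendation from the dplyr maintainers.

```r
library(dplyr, warn.conflicts = FALSE)

# Small made-up stand-in for fake_data (values are illustrative only).
toy <- tibble(
  date_week_start = as.Date("1999-12-26") + 7 * c(0, 1, 2, 0, 1, 2),
  id_title_artist = c(1L, 1L, 2L, 2L, 3L, 3L),
  status = c("Consecutive", "Re-release", "Re-release",
             "Consecutive", "New release", "Re-release")
)

# Original pattern: case_when() is evaluated once per group.
per_group <- toy %>%
  group_by(id_title_artist) %>%
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0,
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  )) %>%
  ungroup()

# Alternative: only min() runs per group; case_when() runs once overall.
once <- toy %>%
  group_by(id_title_artist) %>%
  mutate(first_week = min(date_week_start)) %>%
  ungroup() %>%
  mutate(check_rerelease = case_when(
    date_week_start == first_week ~ 0,
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  )) %>%
  select(-first_week)

# Both versions should produce the same flags.
identical(per_group$check_rerelease, once$check_rerelease)
```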

Reprex

library(tidyverse)
library(lubridate)

dates <- seq(ymd("1999-12-26"), ymd("2023-01-01"), "7 days")
n_dates <- length(dates)

set.seed(1)
fake_data <- tibble(
  date_week_start = rep(dates,10),
) %>% 
  arrange(date_week_start) %>% 
  mutate(position_current = rep(1:10, n_dates),
         position_next = sample(c(NA, 1:100), 10 * n_dates, replace = TRUE),
         id_title_artist = sample(1:17275, 10 * n_dates),
         status = sample(c(rep("Consecutive", n_dates*0.7), rep("New release", n_dates*0.15), rep("Re-release", n_dates*0.1), rep(NA, n_dates*0.05)), 10 * n_dates, replace = TRUE)) 


fake_data %>% 
  arrange(date_week_start) %>% 
  group_by(id_title_artist) %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Monterey 12.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/London
#>  date     2023-02-06
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom           1.0.3   2023-01-25 [1] CRAN (R 4.2.0)
#>  cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli             3.4.1   2022-09-23 [1] CRAN (R 4.2.0)
#>  colorspace      2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon          1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr          2.3.0   2023-01-16 [1] CRAN (R 4.2.0)
#>  digest          0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.1.0   2023-01-29 [1] CRAN (R 4.2.0)
#>  ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate        0.17    2022-10-07 [1] CRAN (R 4.2.0)
#>  fansi           1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats       * 1.0.0   2023-01-29 [1] CRAN (R 4.2.0)
#>  fs              1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  gargle          1.2.1   2022-09-08 [1] CRAN (R 4.2.0)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.4.0   2022-11-04 [1] CRAN (R 4.2.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  googledrive     2.0.0   2021-07-08 [1] CRAN (R 4.2.0)
#>  googlesheets4   1.0.1   2022-08-13 [1] CRAN (R 4.2.0)
#>  gtable          0.3.1   2022-09-01 [1] CRAN (R 4.2.0)
#>  haven           2.5.1   2022-08-22 [1] CRAN (R 4.2.0)
#>  highr           0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms             1.1.2   2022-08-19 [1] CRAN (R 4.2.0)
#>  htmltools       0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr            1.4.4   2022-08-17 [1] CRAN (R 4.2.0)
#>  jsonlite        1.8.4   2022-12-06 [1] CRAN (R 4.2.0)
#>  knitr           1.39.6  2022-08-04 [1] Github (yihui/knitr@bebf67e)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  lubridate     * 1.9.1   2023-01-24 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr          0.1.10  2022-11-11 [1] CRAN (R 4.2.0)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar          1.8.1   2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         * 1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache         0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils         2.12.0  2022-06-28 [1] CRAN (R 4.2.0)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr         * 2.1.3   2022-10-01 [1] CRAN (R 4.2.0)
#>  readxl          1.4.1   2022-08-17 [1] CRAN (R 4.2.0)
#>  reprex          2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang           1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.17    2022-10-07 [1] CRAN (R 4.2.0)
#>  rstudioapi      0.14    2022-08-22 [1] CRAN (R 4.2.0)
#>  rvest           1.0.3   2022-08-19 [1] CRAN (R 4.2.0)
#>  scales          1.2.1   2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi         1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr       * 1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
#>  styler          1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        * 3.1.8   2022-07-22 [1] CRAN (R 4.2.0)
#>  tidyr         * 1.3.0   2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.2.1)
#>  tidyverse     * 1.3.2   2022-07-18 [1] CRAN (R 4.2.0)
#>  timechange      0.2.0   2023-01-11 [1] CRAN (R 4.2.0)
#>  tzdb            0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.5.2   2023-01-23 [1] CRAN (R 4.2.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun            0.35    2022-11-16 [1] CRAN (R 4.2.0)
#>  xml2            1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml            2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

courtiol · 2023-02-08T09:07:05Z

Although the following reprex combines different issues, it illustrates a slowdown of more than 50x between dplyr 1.0.10 and 1.1.0, and makes this simple code take more than 2 seconds to run.

d <- data.frame(grp = rep(paste(1:500), each = 2),
                x = rep(c("A", "B"), each = 500))

library(dplyr)

d |> 
  group_by(grp) |> 
  summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))

LiamDBailey · 2023-02-08T10:59:27Z

To expand on the reprex from @courtiol, we can compare two approaches: calling case_when() after group_by()/summarise() (case_when() runs once on a single vector, so it is more efficient), or calling case_when() inside group_by()/summarise() (case_when() runs on many smaller vectors, so it is less efficient). In v1.0.10, the difference in speed was modest (~9x). In v1.1.0, the difference is now >50x.

In v1.0.10, case_when() inside group_by()/summarise() was a less efficient but viable approach and was likely used quite often. The speed hit with case_when() for smaller vectors makes this approach seem no longer viable.

v1.0.10

d <- data.frame(grp = rep(paste(1:1000), each = 2),
                x = rep(c("A", "B"), each = 1000))

library(dplyr)
library(bench)

packageVersion("dplyr")
#> [1] '1.0.10'

mark(grouped = {d |> 
       group_by(grp) |> 
       summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))},
     ungrouped = {d |> 
       group_by(grp) |> 
       summarise(firstX = first(x), .groups = "drop") |>
       mutate(x = case_when(firstX == "A" ~ "bar", TRUE ~ "foo")) |>
       select(-firstX)})
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 grouped       116ms  122.5ms      7.99    2.67MB     22.0
#> 2 ungrouped      12ms   13.1ms     73.4     1.44MB     15.9

v1.1.0

d <- data.frame(grp = rep(paste(1:1000), each = 2),
                x = rep(c("A", "B"), each = 1000))

library(dplyr)
library(bench)

packageVersion("dplyr")
#> [1] '1.1.0'

mark(grouped = {d |> 
       group_by(grp) |> 
       summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))},
     ungrouped = {d |> 
       group_by(grp) |> 
       summarise(firstX = first(x), .groups = "drop") |>
       mutate(x = case_when(firstX == "A" ~ "bar", TRUE ~ "foo")) |>
       select(-firstX)})
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 grouped       4.32s    4.32s     0.231    5.73MB     21.7
#> 2 ungrouped   79.02ms  84.06ms    12.0      1.51MB     22.0

System info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 16299)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_World.1252
#>  ctype    English_World.1252
#>  tz       Europe/Berlin
#>  date     2023-02-08
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  bench       * 1.1.2   2021-11-30 [1] CRAN (R 4.2.2)
#>  cli           3.6.0   2023-01-09 [1] CRAN (R 4.2.2)
#>  digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr       * 1.1.0   2023-01-29 [1] CRAN (R 4.2.2)
#>  evaluate      0.20    2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.2)
#>  fs            1.6.0   2023-01-23 [1] CRAN (R 4.2.2)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.2)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.2)
#>  htmltools     0.5.4   2022-12-07 [1] CRAN (R 4.2.2)
#>  knitr         1.42    2023-01-25 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.2)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.3)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown     2.20    2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.2)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.2)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.2)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.2)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.2.2)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun          0.36    2022-12-21 [1] CRAN (R 4.2.2)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.2)
#> 
#>  [1] C:/Users/bailey/Documents/R/win-library/4.0
#>  [2] C:/Program Files/R/R-4.2.2/library
#> 
#> ------------------------------------------------------------------------------

hadley · 2023-02-08T13:11:09Z

Yes, we know it’s slow and we’ll work on it. No need to keep providing reprexes that don’t add new insight to the problem.


r2evans · 2023-02-13T20:27:45Z

Thanks @DavisVaughan !


DavisVaughan added this to the 1.1.1 milestone Feb 7, 2023

DavisVaughan mentioned this issue Feb 7, 2023

Mutate and summarize speed related to parentheses #6681

Closed

lionel- added a commit to r-lib/rlang that referenced this issue Feb 10, 2023

Add private option to disable special infix labelling

33db700

To help with tidyverse/dplyr#6674 tidyverse/dplyr#6681

DavisVaughan mentioned this issue Feb 10, 2023

Improve performance of case_when() and case_match() #6711

Merged

DavisVaughan closed this as completed in #6711 Feb 13, 2023

joranE mentioned this issue Mar 14, 2023

case_when is very slow in version 1.1.0 #6788

Closed
