v1.1.0 runtime for case_when with grouping variable is slow #6674

Closed
fawda123 opened this issue Feb 1, 2023 · 12 comments · Fixed by #6711

@fawda123

fawda123 commented Feb 1, 2023

Using case_when in a mutate call with a grouping variable is much, much slower in v1.1.0 compared to v1.0.10. The code works but it's causing a tremendous slowdown in many of the packages I maintain (see here, many examples have elapsed time >5s).

Here's a reprex for v1.1.0.

library(dplyr, warn.conflicts = F)
library(microbenchmark)

n <- 1000
dat <- data.frame(
    x = seq(1:n), 
    y = rnorm(n)
)

microbenchmark(
    dat %>% 
        group_by(x) %>% 
        mutate(
                 z = case_when(
                    y < 0 ~ '-',
                    T ~ '+', 
                 )
        ), 
    times = 100
)
#> Unit: seconds
#>                                                                        expr
#>  dat %>% group_by(x) %>% mutate(z = case_when(y < 0 ~ "-", T ~      "+", ))
#>       min       lq     mean   median       uq      max neval
#>  2.376748 2.537896 2.650869 2.625663 2.723655 3.170204   100

Created on 2023-02-01 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> - Session info  --------------------------------------------------------------
#>  hash: person in steamy room: medium-dark skin tone, goat, black small square
#> 
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2023-02-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package        * version date (UTC) lib source
#>  cli              3.6.0   2023-01-09 [1] CRAN (R 4.1.3)
#>  digest           0.6.31  2022-12-11 [1] CRAN (R 4.1.3)
#>  dplyr          * 1.1.0   2023-01-29 [1] CRAN (R 4.1.3)
#>  evaluate         0.20    2023-01-17 [1] CRAN (R 4.1.3)
#>  fansi            1.0.4   2023-01-22 [1] CRAN (R 4.1.3)
#>  fastmap          1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs               1.6.0   2023-01-23 [1] CRAN (R 4.1.3)
#>  generics         0.1.3   2022-07-05 [1] CRAN (R 4.1.3)
#>  glue             1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  htmltools        0.5.4   2022-12-07 [1] CRAN (R 4.1.3)
#>  knitr            1.42    2023-01-25 [1] CRAN (R 4.1.3)
#>  lifecycle        1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  microbenchmark * 1.4.9   2021-11-09 [1] CRAN (R 4.1.3)
#>  pillar           1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  purrr            1.0.1   2023-01-10 [1] CRAN (R 4.1.3)
#>  R.cache          0.15.0  2021-04-30 [1] CRAN (R 4.1.3)
#>  R.methodsS3      1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo             1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils          2.11.0  2021-09-26 [1] CRAN (R 4.1.3)
#>  R6               2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  reprex           2.0.2   2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang            1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown        2.20    2023-01-19 [1] CRAN (R 4.1.3)
#>  rstudioapi       0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo      1.2.1   2021-11-02 [1] CRAN (R 4.1.2)
#>  styler           1.7.0   2022-03-13 [1] CRAN (R 4.1.3)
#>  tibble           3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>  tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.1.3)
#>  utf8             1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs            0.5.2   2023-01-23 [1] CRAN (R 4.1.3)
#>  withr            2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun             0.36    2022-12-21 [1] CRAN (R 4.1.3)
#>  yaml             2.3.7   2023-01-23 [1] CRAN (R 4.1.3)
#> 
#>  [1] C:/Users/mbeck/R/win-library
#>  [2] C:/Program Files/R/R-4.1.3/library
#> 
#> ------------------------------------------------------------------------------

And here's a reprex for v1.0.10 (note that the times for this one are in milliseconds, above was seconds).

library(dplyr, warn.conflicts = F)
library(microbenchmark)

n <- 1000
dat <- data.frame(
    x = seq(1:n), 
    y = rnorm(n)
)

microbenchmark(
    dat %>% 
        group_by(x) %>% 
        mutate(
                 z = case_when(
                    y < 0 ~ '-',
                    T ~ '+', 
                 )
        ), 
    times = 100
)
#> Unit: milliseconds
#>                                                                        expr
#>  dat %>% group_by(x) %>% mutate(z = case_when(y < 0 ~ "-", T ~      "+", ))
#>       min       lq     mean  median       uq      max neval
#>  114.9103 120.9102 126.9423 123.889 128.7439 167.7735   100

Created on 2023-02-01 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> - Session info  --------------------------------------------------------------
#>  hash: open mailbox with raised flag, love-you gesture: medium skin tone, snowboarder: light skin tone
#> 
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2023-02-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package        * version date (UTC) lib source
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 4.1.2)
#>  cli              3.6.0   2023-01-09 [1] CRAN (R 4.1.3)
#>  DBI              1.1.3   2022-06-18 [1] CRAN (R 4.1.3)
#>  digest           0.6.31  2022-12-11 [1] CRAN (R 4.1.3)
#>  dplyr          * 1.0.10  2022-09-01 [1] CRAN (R 4.1.3)
#>  evaluate         0.20    2023-01-17 [1] CRAN (R 4.1.3)
#>  fansi            1.0.4   2023-01-22 [1] CRAN (R 4.1.3)
#>  fastmap          1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs               1.6.0   2023-01-23 [1] CRAN (R 4.1.3)
#>  generics         0.1.3   2022-07-05 [1] CRAN (R 4.1.3)
#>  glue             1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  htmltools        0.5.4   2022-12-07 [1] CRAN (R 4.1.3)
#>  knitr            1.42    2023-01-25 [1] CRAN (R 4.1.3)
#>  lifecycle        1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>  magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  microbenchmark * 1.4.9   2021-11-09 [1] CRAN (R 4.1.3)
#>  pillar           1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  purrr            1.0.1   2023-01-10 [1] CRAN (R 4.1.3)
#>  R.cache          0.15.0  2021-04-30 [1] CRAN (R 4.1.3)
#>  R.methodsS3      1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo             1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils          2.11.0  2021-09-26 [1] CRAN (R 4.1.3)
#>  R6               2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  reprex           2.0.2   2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang            1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown        2.20    2023-01-19 [1] CRAN (R 4.1.3)
#>  rstudioapi       0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo      1.2.1   2021-11-02 [1] CRAN (R 4.1.2)
#>  styler           1.7.0   2022-03-13 [1] CRAN (R 4.1.3)
#>  tibble           3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>  tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.1.3)
#>  utf8             1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs            0.5.2   2023-01-23 [1] CRAN (R 4.1.3)
#>  withr            2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun             0.36    2022-12-21 [1] CRAN (R 4.1.3)
#>  yaml             2.3.7   2023-01-23 [1] CRAN (R 4.1.3)
#> 
#>  [1] C:/Users/mbeck/R/win-library
#>  [2] C:/Program Files/R/R-4.1.3/library
#> 
#> ------------------------------------------------------------------------------
@jonspring

This comment was marked as outdated.

@dpprdan

This comment was marked as outdated.

@fawda123

This comment was marked as outdated.

@hadley
Member

hadley commented Feb 1, 2023

When benchmarking a problem like this, you really want to separate the pieces. Is this a problem with mutate(), or is this a problem with case_when()? Your example requires case_when() to work on a single observation at a time, which is not its strength because it's designed to be vectorised. That suggests to me that a meaningful comparison would use a few vector lengths:

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.1.0'
y1 <- rnorm(1)
y1e3 <- rnorm(1000)
y1e6 <- rnorm(1e6)

bench::mark(
  y1 = case_when(y1 < 0 ~ "-", T ~ "+"),
  y1e3 = case_when(y1e3 < 0 ~ "-", T ~ "+"),
  y1e6 = case_when(y1e6 < 0 ~ "-", T ~ "+"),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 y1            892µs  977.4µs     949.     1.01MB     34.0
#> 2 y1e3        934.7µs  991.7µs     939.    65.37KB     34.0
#> 3 y1e6         50.1ms   75.7ms      14.6   61.04MB     23.7

Created on 2023-02-01 with reprex v2.0.2

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.10'
y1 <- rnorm(1)
y1e3 <- rnorm(1000)
y1e6 <- rnorm(1e6)

bench::mark(
  y1 = case_when(y1 < 0 ~ "-", T ~ "+"),
  y1e3 = case_when(y1e3 < 0 ~ "-", T ~ "+"),
  y1e6 = case_when(y1e6 < 0 ~ "-", T ~ "+"),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 y1           38.6µs   41.2µs   21814.      296KB     45.8
#> 2 y1e3         67.7µs   78.9µs   10627.     98.8KB     24.0
#> 3 y1e6         66.3ms   94.7ms      11.0    95.4MB     38.7

Created on 2023-02-01 with reprex v2.0.2

So that suggests that yes, using case_when() with a single observation has gotten significantly slower (maybe 800µs extra overhead), but it gets faster as the length of the vector increases.

I don't think your specific use case is a particularly compelling reason to re-consider case_when() performance, but the drop in speed at 1000 elements might suggest we should take a quick look to try and reduce some of the setup overhead.
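A minimal sketch of the vectorisation point, assuming a data frame shaped like the one in the original reprex (the seed and the object names `grouped` and `ungrouped` are made up for illustration). Because the conditions only look at the current row, the grouping is not needed for correctness here, and the ungrouped form calls case_when() once instead of once per group:

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)
n <- 1000
dat <- data.frame(x = seq_len(n), y = rnorm(n))

# Grouped: case_when() is called once per group (1000 calls on length-1 input).
grouped <- dat %>%
  group_by(x) %>%
  mutate(z = case_when(y < 0 ~ "-", TRUE ~ "+")) %>%
  ungroup()

# Ungrouped: case_when() is called once on the full length-1000 vector.
ungrouped <- dat %>%
  mutate(z = case_when(y < 0 ~ "-", TRUE ~ "+"))

# The conditions are row-wise, so both forms give the same column.
stopifnot(identical(grouped$z, ungrouped$z))
```

This only applies when no condition depends on a per-group value such as min() or first().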

@r2evans

r2evans commented Feb 1, 2023

edit: @hadley, I was writing this before I saw your comment, sorry for the repetition. However, I argue even with 1000-long vectors (ungrouped), the 10x decrease (by n_itr) in case_when is significant.


I think it might be helpful to isolate this as two distinct slow-downs: case_when in isolation, and case_when within mutate. I think the use of group_by()/.by= is either a red herring (exacerbating the problem) or another change in performance.

Starting with data,

set.seed(42)
n <- 1000
y <- rnorm(n)
df <- tibble(y2 = y)

we see the following comparative performance:

packageVersion("dplyr")
# [1] '1.0.10'
bench::mark(
  "dplyr-1.0.10-case_when" = case_when(y < 0 ~ "-", TRUE ~ "+"),
  "dplyr-1.0.10-mutate" = mutate(df, z = case_when(y2 < 0 ~ "-", TRUE ~ "+")),
  min_iterations=500, check=FALSE)
# # A tibble: 2 × 13
#   expression                  min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_t…¹ result memory     time       gc      
#   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>  <bch:tm> <list> <list>     <list>     <list>  
# 1 dplyr-1.0.10-case_when   89.5µs  102.9µs     8511.    98.8KB     6.50  3928     3     462ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 dplyr-1.0.10-mutate      1.24ms   1.39ms      620.   100.3KB     2.49   498     2     803ms <NULL> <Rprofmem> <bench_tm> <tibble>
# # … with abbreviated variable name ¹​total_time

### different R instance, same laptop, same R
packageVersion("dplyr")
# [1] '1.1.0'
bench::mark(
  "dplyr-1.1.0-case_when" = case_when(y < 0 ~ "-", .default = "+"),
  "dplyr-1.1.0-mutate" = mutate(df, z = case_when(y2 < 0 ~ "-", .default = "+")),
  min_iterations=500, check=FALSE)
# # A tibble: 2 × 13
#   expression                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dplyr-1.1.0-case_when    1.5ms   1.63ms      595.    49.6KB     9.67   492     8   827.47ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 dplyr-1.1.0-mutate      2.76ms   3.08ms      317.    58.5KB     8.46   487    13      1.54s <NULL> <Rprofmem> <bench_tm> <tibble>

I find it very interesting that the only code difference between the two dplyr versions is the change between TRUE ~ "+" and .default = "+", yet (a) case_when has a 10x performance difference, and (b) mutate + case_when is much less different. The n_itr is high enough that I suggest these results are credible (and I repeated each several times to make sure).
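For readers comparing the two spellings, here is a small sketch, assuming dplyr 1.1.0 or later is installed (the `.default` argument is not available in 1.0.10). It only checks that both forms return the same values and says nothing about their speed; the names `old_style` and `new_style` are illustrative:

```r
library(dplyr, warn.conflicts = FALSE)  # assumes dplyr >= 1.1.0 for `.default`

set.seed(42)
y <- rnorm(1000)

# Fallback written as a final always-TRUE condition.
old_style <- case_when(y < 0 ~ "-", TRUE ~ "+")

# Fallback written with the `.default` argument.
new_style <- case_when(y < 0 ~ "-", .default = "+")

stopifnot(identical(old_style, new_style))
```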

@hadley
Member

hadley commented Feb 1, 2023

@r2evans I think mutate() is entirely a red herring. It just looks like we've gained ~800µs of overhead in case_when(), and that's impacting the run-time at smaller lengths (given the other evidence I'm pretty sure this is an additive change, not a multiplicative one). I agree it's worth looking into.
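A back-of-the-envelope sketch of the additive model, using the median timings for a single length-1 call from the bench::mark() output earlier in this thread. The variable names are made up, and the absolute numbers depend on the machine, so the original report's timings will not match exactly:

```r
# Median time of one length-1 case_when() call, in microseconds,
# copied from the benchmark tables above.
median_v1_0_10_us <- 41.2
median_v1_1_0_us  <- 977.4

# Extra fixed cost per call under the additive model.
overhead_us <- median_v1_1_0_us - median_v1_0_10_us

# A grouped mutate() with one row per group calls case_when() once per group.
n_groups <- 1000
extra_seconds <- n_groups * overhead_us / 1e6

overhead_us    # roughly 936 microseconds per call
extra_seconds  # roughly 0.94 seconds added for 1000 groups
```

Under this model the cost scales with the number of case_when() calls (groups), not with the number of rows, which is consistent with the long-vector timings being no slower in 1.1.0.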

@r2evans

r2evans commented Feb 1, 2023

I think the slowdown in mutate may be interesting by itself, but the initial reason for my comment (that trailed yours by moments) was to isolate what is likely the larger component. I'm hopeful that a much wider net of users (now that 1.1.0 has been formally released) will provide more context and use-cases to consider if/when/how this slowdown is approached. Thanks for the package, effort, and discourse @hadley

@charliejhadley
Contributor

I've just updated to {dplyr} v1.1.0 and have hit a very big slow down due to this issue. I think I have a useful demonstration issue and have presented a reprex.

I have data on the Top 100 UK songs every week from 2000 to 2023 which is 1119,000 rows of data with this format and 17,275 groups when grouped by id_title_artist.

# A tibble: 4 × 5
  date_week_start position_current position_next title                id_title_artist                status
  <date>                     <dbl>         <dbl> <chr>                          <int>                <chr>
1 1999-12-26                    49            40 1999                              89                "Re-release"
2 1999-12-26                    52            52 2 TIMES                          105                "New release"

My code was slowed by this issue because of the following bit of code:

the_data %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))

To give some proper context to this, let's generate fake data for the top 10

library(tidyverse)
library(lubridate) # provides ymd(); not attached by tidyverse 1.3.2
dates <- seq(ymd("1999-12-26"), ymd("2023-01-01"), "7 days")
n_dates <- length(dates)

fake_data <- tibble(
  date_week_start = rep(dates,10),
) %>% 
  arrange(date_week_start) %>% 
  mutate(position_current = rep(1:10, n_dates),
         position_next = sample(c(NA, 1:100), 10 * n_dates, replace = TRUE),
         id_title_artist = sample(1:17275, 10 * n_dates),
         status = sample(c(rep("Consecutive", n_dates*0.7), rep("New release", n_dates*0.15), rep("Re-release", n_dates*0.1), rep(NA, n_dates*0.05)), 10 * n_dates, replace = TRUE)) 

fake_data
# # A tibble: 12,020 × 5
#    date_week_start position_current position_next id_title_artist status
#    <date>                     <int>         <int>           <int> <chr>
#  1 1999-12-26                     1            67           11930 Consecutive
#  2 1999-12-26                     2            38            5950 Consecutive
#  3 1999-12-26                     3            NA            4878 Consecutive
#  4 1999-12-26                     4            33            4589 New release
#  5 1999-12-26                     5            86           13923 New release
#  6 1999-12-26                     6            42           16232 Consecutive
#  7 1999-12-26                     7            13            6975 Consecutive
#  8 1999-12-26                     8            81            5723 Consecutive
#  9 1999-12-26                     9            58            3404 Consecutive
# 10 1999-12-26                    10            50           13796 Re-release
# # … with 12,010 more rows
# # ℹ Use `print(n = ...)` to see more rows

Now my code is looking for re-releases but needs to make sure that songs released in the first week of data are handled differently. As this code is then functionalised to look at different ranges of data that's particularly important:

fake_data %>% 
  arrange(date_week_start) %>% 
  group_by(id_title_artist) %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))
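One possible workaround, sketched on a tiny made-up data frame (`toy`, `per_group`, `single_call` and the helper column `first_week` are illustrative names, not part of the original data): compute only the per-group minimum inside the grouped step, then ungroup and call case_when() once over the whole column. This is a sketch of the idea under those assumptions, not a tested replacement for the real pipeline:

```r
library(dplyr, warn.conflicts = FALSE)

# Small stand-in for the chart data; column names follow the post above.
toy <- data.frame(
  id_title_artist = c(1L, 1L, 1L, 2L, 2L),
  date_week_start = as.Date(c("2000-01-02", "2000-01-09", "2000-01-16",
                              "2000-01-09", "2000-01-16")),
  status = c("Re-release", "Consecutive", "Re-release",
             "New release", "Re-release")
)

# Original pattern: case_when() runs once per group.
per_group <- toy %>%
  group_by(id_title_artist) %>%
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0,
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  )) %>%
  ungroup()

# Alternative: only min() is computed by group; case_when() runs once.
single_call <- toy %>%
  group_by(id_title_artist) %>%
  mutate(first_week = min(date_week_start)) %>%
  ungroup() %>%
  mutate(check_rerelease = case_when(
    date_week_start == first_week ~ 0,
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  )) %>%
  select(-first_week)

stopifnot(identical(per_group$check_rerelease, single_call$check_rerelease))
```

The grouped min() is still evaluated per group, but it is a much cheaper call than case_when(), so the per-call overhead discussed above is paid only once.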

Reprex

library(tidyverse)
library(lubridate)

dates <- seq(ymd("1999-12-26"), ymd("2023-01-01"), "7 days")
n_dates <- length(dates)

set.seed(1)
fake_data <- tibble(
  date_week_start = rep(dates,10),
) %>% 
  arrange(date_week_start) %>% 
  mutate(position_current = rep(1:10, n_dates),
         position_next = sample(c(NA, 1:100), 10 * n_dates, replace = TRUE),
         id_title_artist = sample(1:17275, 10 * n_dates),
         status = sample(c(rep("Consecutive", n_dates*0.7), rep("New release", n_dates*0.15), rep("Re-release", n_dates*0.1), rep(NA, n_dates*0.05)), 10 * n_dates, replace = TRUE)) 


fake_data %>% 
  arrange(date_week_start) %>% 
  group_by(id_title_artist) %>% 
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  ))

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Monterey 12.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/London
#>  date     2023-02-06
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom           1.0.3   2023-01-25 [1] CRAN (R 4.2.0)
#>  cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli             3.4.1   2022-09-23 [1] CRAN (R 4.2.0)
#>  colorspace      2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon          1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr          2.3.0   2023-01-16 [1] CRAN (R 4.2.0)
#>  digest          0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.1.0   2023-01-29 [1] CRAN (R 4.2.0)
#>  ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate        0.17    2022-10-07 [1] CRAN (R 4.2.0)
#>  fansi           1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats       * 1.0.0   2023-01-29 [1] CRAN (R 4.2.0)
#>  fs              1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  gargle          1.2.1   2022-09-08 [1] CRAN (R 4.2.0)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.4.0   2022-11-04 [1] CRAN (R 4.2.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  googledrive     2.0.0   2021-07-08 [1] CRAN (R 4.2.0)
#>  googlesheets4   1.0.1   2022-08-13 [1] CRAN (R 4.2.0)
#>  gtable          0.3.1   2022-09-01 [1] CRAN (R 4.2.0)
#>  haven           2.5.1   2022-08-22 [1] CRAN (R 4.2.0)
#>  highr           0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms             1.1.2   2022-08-19 [1] CRAN (R 4.2.0)
#>  htmltools       0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr            1.4.4   2022-08-17 [1] CRAN (R 4.2.0)
#>  jsonlite        1.8.4   2022-12-06 [1] CRAN (R 4.2.0)
#>  knitr           1.39.6  2022-08-04 [1] Github (yihui/knitr@bebf67e)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  lubridate     * 1.9.1   2023-01-24 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr          0.1.10  2022-11-11 [1] CRAN (R 4.2.0)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar          1.8.1   2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         * 1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache         0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils         2.12.0  2022-06-28 [1] CRAN (R 4.2.0)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr         * 2.1.3   2022-10-01 [1] CRAN (R 4.2.0)
#>  readxl          1.4.1   2022-08-17 [1] CRAN (R 4.2.0)
#>  reprex          2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang           1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.17    2022-10-07 [1] CRAN (R 4.2.0)
#>  rstudioapi      0.14    2022-08-22 [1] CRAN (R 4.2.0)
#>  rvest           1.0.3   2022-08-19 [1] CRAN (R 4.2.0)
#>  scales          1.2.1   2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi         1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr       * 1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
#>  styler          1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        * 3.1.8   2022-07-22 [1] CRAN (R 4.2.0)
#>  tidyr         * 1.3.0   2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.2.1)
#>  tidyverse     * 1.3.2   2022-07-18 [1] CRAN (R 4.2.0)
#>  timechange      0.2.0   2023-01-11 [1] CRAN (R 4.2.0)
#>  tzdb            0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.5.2   2023-01-23 [1] CRAN (R 4.2.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun            0.35    2022-11-16 [1] CRAN (R 4.2.0)
#>  xml2            1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml            2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────




@courtiol
Contributor

courtiol commented Feb 8, 2023

Although the following reprex combines different issues, it illustrates a slowdown of more than 50x between dplyr 1.0.10 and 1.1.0, and this simple code now takes more than 2 seconds to run.

d <- data.frame(grp = rep(paste(1:500), each = 2),
                x = rep(c("A", "B"), each = 500))

library(dplyr)

d |> 
  group_by(grp) |> 
  summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))
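Since the condition in this reprex is a single TRUE/FALSE per group, one alternative is base R's scalar if/else inside summarise(). The sketch below assumes the same `d` as above; `with_case_when` and `with_if` are illustrative names. Note one behavioural difference: if x[1] were NA, `if` would raise an error, whereas case_when() would fall through to "foo".

```r
library(dplyr, warn.conflicts = FALSE)

d <- data.frame(grp = rep(paste(1:500), each = 2),
                x = rep(c("A", "B"), each = 500))

# Pattern from the reprex: case_when() is called once per group.
with_case_when <- d |>
  group_by(grp) |>
  summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))

# x[1] == "A" is a length-1 logical here, so scalar if/else applies.
with_if <- d |>
  group_by(grp) |>
  summarise(x = if (x[1] == "A") "bar" else "foo")

stopifnot(identical(with_case_when$x, with_if$x))
```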

@LiamDBailey

To expand on the reprex from @courtiol, we can compare two approaches: calling case_when() after group_by()/summarise() (case_when() runs once on a single vector, so it is more efficient), or calling case_when() inside group_by()/summarise() (case_when() runs on many smaller vectors, so it is less efficient). In v1.0.10, the difference in speed was modest (~9x). In v1.1.0, it is now >50x.

In v1.0.10, case_when() inside group_by()/summarise() was a less efficient but viable approach, and it was likely used quite often. The speed hit with case_when() for smaller vectors makes this approach seem no longer viable.

v1.0.10

d <- data.frame(grp = rep(paste(1:1000), each = 2),
                x = rep(c("A", "B"), each = 1000))

library(dplyr)
library(bench)

packageVersion("dplyr")
#> [1] '1.0.10'

mark(grouped = {d |> 
       group_by(grp) |> 
       summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))},
     ungrouped = {d |> 
       group_by(grp) |> 
       summarise(firstX = first(x), .groups = "drop") |>
       mutate(x = case_when(firstX == "A" ~ "bar", TRUE ~ "foo")) |>
       select(-firstX)})
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 grouped       116ms  122.5ms      7.99    2.67MB     22.0
#> 2 ungrouped      12ms   13.1ms     73.4     1.44MB     15.9

v1.1.0

d <- data.frame(grp = rep(paste(1:1000), each = 2),
                x = rep(c("A", "B"), each = 1000))

library(dplyr)
library(bench)

packageVersion("dplyr")
#> [1] '1.1.0'

mark(grouped = {d |> 
       group_by(grp) |> 
       summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))},
     ungrouped = {d |> 
       group_by(grp) |> 
       summarise(firstX = first(x), .groups = "drop") |>
       mutate(x = case_when(firstX == "A" ~ "bar", TRUE ~ "foo")) |>
       select(-firstX)})
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 grouped       4.32s    4.32s     0.231    5.73MB     21.7
#> 2 ungrouped   79.02ms  84.06ms    12.0      1.51MB     22.0
System info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 16299)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_World.1252
#>  ctype    English_World.1252
#>  tz       Europe/Berlin
#>  date     2023-02-08
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  bench       * 1.1.2   2021-11-30 [1] CRAN (R 4.2.2)
#>  cli           3.6.0   2023-01-09 [1] CRAN (R 4.2.2)
#>  digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr       * 1.1.0   2023-01-29 [1] CRAN (R 4.2.2)
#>  evaluate      0.20    2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.2)
#>  fs            1.6.0   2023-01-23 [1] CRAN (R 4.2.2)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.2)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.2)
#>  htmltools     0.5.4   2022-12-07 [1] CRAN (R 4.2.2)
#>  knitr         1.42    2023-01-25 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.2)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.3)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown     2.20    2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.2)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.2)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.2)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.2)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.2.2)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun          0.36    2022-12-21 [1] CRAN (R 4.2.2)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.2)
#> 
#>  [1] C:/Users/bailey/Documents/R/win-library/4.0
#>  [2] C:/Program Files/R/R-4.2.2/library
#> 
#> ------------------------------------------------------------------------------

@hadley
Member

hadley commented Feb 8, 2023

Yes, we know it’s slow and we’ll work on it. No need to keep providing reprexes that don’t add new insight to the problem.

@r2evans

r2evans commented Feb 13, 2023

Thanks @DavisVaughan !

9 participants