v1.1.0 runtime for case_when with grouping variable is slow #6674
When benchmarking a problem like this, you really want to separate the pieces. Is this a problem with
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.1.0'
y1 <- rnorm(1)
y1e3 <- rnorm(1000)
y1e6 <- rnorm(1e6)
bench::mark(
y1 = case_when(y1 < 0 ~ "-", T ~ "+"),
y1e3 = case_when(y1e3 < 0 ~ "-", T ~ "+"),
y1e6 = case_when(y1e6 < 0 ~ "-", T ~ "+"),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 y1 892µs 977.4µs 949. 1.01MB 34.0
#> 2 y1e3 934.7µs 991.7µs 939. 65.37KB 34.0
#> 3 y1e6 50.1ms 75.7ms 14.6 61.04MB 23.7
Created on 2023-02-01 with reprex v2.0.2
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.10'
y1 <- rnorm(1)
y1e3 <- rnorm(1000)
y1e6 <- rnorm(1e6)
bench::mark(
y1 = case_when(y1 < 0 ~ "-", T ~ "+"),
y1e3 = case_when(y1e3 < 0 ~ "-", T ~ "+"),
y1e6 = case_when(y1e6 < 0 ~ "-", T ~ "+"),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 y1 38.6µs 41.2µs 21814. 296KB 45.8
#> 2 y1e3 67.7µs 78.9µs 10627. 98.8KB 24.0
#> 3 y1e6 66.3ms 94.7ms 11.0 95.4MB 38.7
Created on 2023-02-01 with reprex v2.0.2
So that suggests that yes, using I don't think your specific use case is a particularly compelling reason to re-consider |
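One rough way to read the two tables above is to treat the `y1` row as the fixed cost of a single `case_when()` call and multiply it by a number of groups. The sketch below does only that arithmetic. The group count is a made-up value for illustration, and the projection assumes that a grouped `mutate()` calls `case_when()` once per group and that each group is small enough for the fixed cost to dominate.

```r
# Median time of one case_when() call on a length-1 vector, copied from
# the two bench::mark() tables above and converted to seconds.
per_call_sec <- c("1.0.10" = 41.2e-6, "1.1.0" = 977.4e-6)

# Hypothetical number of groups (an assumption, not taken from the thread).
n_groups <- 10000

# Projected time if the fixed per-call cost is paid once per group.
projected_sec <- per_call_sec * n_groups

# Ratio of the per-call medians between the two versions.
slowdown <- per_call_sec[["1.1.0"]] / per_call_sec[["1.0.10"]]

print(round(projected_sec, 2))
print(round(slowdown, 1))
```

Under these assumptions the projection is roughly 0.4 s for 1.0.10 against roughly 9.8 s for 1.1.0, which would explain why the slowdown only becomes painful with many small groups.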
edit: @hadley, I was writing this before I saw your comment, sorry for the repetition. However, I argue even with 1000-long vectors (ungrouped), the 10x decrease (by I think it might be helpful to isolate this as two distinct slow-downs: Starting with data,
set.seed(42)
n <- 1000
y <- rnorm(n)
df <- tibble(y2 = y)
we see the following comparative performance:
packageVersion("dplyr")
# [1] '1.0.10'
bench::mark(
"dplyr-1.0.10-case_when" = case_when(y < 0 ~ "-", TRUE ~ "+"),
"dplyr-1.0.10-mutate" = mutate(df, z = case_when(y2 < 0 ~ "-", TRUE ~ "+")),
min_iterations=500, check=FALSE)
# # A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_t…¹ result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 dplyr-1.0.10-case_when 89.5µs 102.9µs 8511. 98.8KB 6.50 3928 3 462ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 dplyr-1.0.10-mutate 1.24ms 1.39ms 620. 100.3KB 2.49 498 2 803ms <NULL> <Rprofmem> <bench_tm> <tibble>
# # … with abbreviated variable name ¹total_time
### different R instance, same laptop, same R
packageVersion("dplyr")
# [1] '1.1.0'
bench::mark(
"dplyr-1.1.0-case_when" = case_when(y < 0 ~ "-", .default = "+"),
"dplyr-1.1.0-mutate" = mutate(df, z = case_when(y2 < 0 ~ "-", .default = "+")),
min_iterations=500, check=FALSE)
# # A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 dplyr-1.1.0-case_when 1.5ms 1.63ms 595. 49.6KB 9.67 492 8 827.47ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 dplyr-1.1.0-mutate 2.76ms 3.08ms 317. 58.5KB 8.46 487 13 1.54s <NULL> <Rprofmem> <bench_tm> <tibble>
I find it very interesting that the only code difference between the two |
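To make the "two distinct slow-downs" reading concrete, the medians from the two tables above can be compared directly. This is only arithmetic on the reported numbers. The timings come from separate R sessions and are noisy, so the result is a rough indication and not a measurement.

```r
# Median timings copied from the two tables above, in milliseconds.
median_ms <- rbind(
  "1.0.10" = c(case_when = 0.1029, mutate = 1.39),
  "1.1.0"  = c(case_when = 1.63,   mutate = 3.08)
)

# Slowdown factor of each expression between the two versions.
slowdown <- median_ms["1.1.0", ] / median_ms["1.0.10", ]

# Time spent in mutate() beyond the bare case_when() call, per version.
mutate_extra_ms <- median_ms[, "mutate"] - median_ms[, "case_when"]

print(round(slowdown, 1))
print(round(mutate_extra_ms, 2))
```

On these numbers the bare `case_when()` call is about 16x slower and the `mutate()` call about 2.2x slower, while the time `mutate()` adds on top of `case_when()` moves only from about 1.29 ms to 1.45 ms.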
@r2evans I think |
I think the slowdown in |
I've just updated to {dplyr} v1.1.0 and have hit a very big slow down due to this issue. I think I have a useful demonstration issue and have presented a reprex. I have data on the Top 100 UK songs every week from 2000 to 2023 which is 1119,000 rows of data with this format and 17,275 groups when grouped by id_title_artist.
# A tibble: 4 × 5
date_week_start position_current position_next title id_title_artist status
<date> <dbl> <dbl> <chr> <int> <chr>
1 1999-12-26 49 40 1999 89 "Re-release"
2 1999-12-26 52 52 2 TIMES 105 "New release"
My code was slowed by this issue because of the following bit of code:
To give some proper context to this, let's generate fake data for the top 10
library(tidyverse)
library(lubridate) # needed for ymd(), as in the reprex below
dates <- seq(ymd("1999-12-26"), ymd("2023-01-01"), "7 days")
n_dates <- length(dates)
fake_data <- tibble(
date_week_start = rep(dates,10),
) %>%
arrange(date_week_start) %>%
mutate(position_current = rep(1:10, n_dates),
position_next = sample(c(NA, 1:100), 10 * n_dates, replace = TRUE),
id_title_artist = sample(1:17275, 10 * n_dates),
status = sample(c(rep("Consecutive", n_dates*0.7), rep("New release", n_dates*0.15), rep("Re-release", n_dates*0.1), rep(NA, n_dates*0.05)), 10 * n_dates, replace = TRUE))
fake_data
## A tibble: 12,020 × 5
#date_week_start position_current position_next id_title_artist status
#<date> <int> <int> <int> <chr>
# 1 1999-12-26 1 67 11930 Consecutive
#2 1999-12-26 2 38 5950 Consecutive
#3 1999-12-26 3 NA 4878 Consecutive
#4 1999-12-26 4 33 4589 New release
#5 1999-12-26 5 86 13923 New release
#6 1999-12-26 6 42 16232 Consecutive
#7 1999-12-26 7 13 6975 Consecutive
#8 1999-12-26 8 81 5723 Consecutive
#9 1999-12-26 9 58 3404 Consecutive
#10 1999-12-26 10 50 13796 Re-release
## … with 12,010 more rows
## ℹ Use `print(n = ...)` to see more rows
Now my code is looking for re-releases but needs to make sure that songs released in the first week of data are handled differently. As this code is then functionalised to look at different ranges of data that's particularly important:
fake_data %>%
arrange(date_week_start) %>%
group_by(id_title_artist) %>%
mutate(check_rerelease = case_when(
date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
status == "Re-release" ~ 1,
TRUE ~ NA_real_
))
Reprex
library(tidyverse)
library(lubridate)
dates <- seq(ymd("1999-12-26"), ymd("2023-01-01"), "7 days")
n_dates <- length(dates)
set.seed(1)
fake_data <- tibble(
date_week_start = rep(dates,10),
) %>%
arrange(date_week_start) %>%
mutate(position_current = rep(1:10, n_dates),
position_next = sample(c(NA, 1:100), 10 * n_dates, replace = TRUE),
id_title_artist = sample(1:17275, 10 * n_dates),
status = sample(c(rep("Consecutive", n_dates*0.7), rep("New release", n_dates*0.15), rep("Re-release", n_dates*0.1), rep(NA, n_dates*0.05)), 10 * n_dates, replace = TRUE))
fake_data %>%
arrange(date_week_start) %>%
group_by(id_title_artist) %>%
mutate(check_rerelease = case_when(
date_week_start == min(date_week_start) ~ 0, # handle re-release in first week of data
status == "Re-release" ~ 1,
TRUE ~ NA_real_
))
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os macOS Monterey 12.5
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/London
#> date 2023-02-06
#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> broom 1.0.3 2023-01-25 [1] CRAN (R 4.2.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
#> cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.0)
#> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
#> dbplyr 2.3.0 2023-01-16 [1] CRAN (R 4.2.0)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
#> dplyr * 1.1.0 2023-01-29 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.17 2022-10-07 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> gargle 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.0)
#> googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.0)
#> gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
#> haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
#> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
#> httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
#> jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
#> knitr 1.39.6 2022-08-04 [1] Github (yihui/knitr@bebf67e)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> lubridate * 1.9.1 2023-01-24 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.2.0)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.0 2022-06-28 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.0)
#> readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
#> rmarkdown 2.17 2022-10-07 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
#> rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0)
#> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
#> stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.0)
#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
#> tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.1)
#> tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.0)
#> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.2.0)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.35 2022-11-16 [1] CRAN (R 4.2.0)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
|
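A possible workaround for code shaped like the example above, while the grouped path is slow: compute the per-group value in a cheap grouped step, then call `case_when()` once on the ungrouped data. The sketch below uses a small made-up data set with the same column names; `first_week` is a helper column introduced here for illustration. It only checks that both forms give the same result and makes no timing claim.

```r
library(dplyr)

# Small made-up data set (hypothetical values), same column names as above.
songs <- tibble(
  date_week_start = as.Date(c("2000-01-02", "2000-01-09", "2000-01-02",
                              "2000-01-09", "2000-01-16")),
  id_title_artist = c(1L, 1L, 2L, 3L, 3L),
  status = c("New release", "Re-release", "Re-release", "Re-release", NA)
)

# Original form: case_when() is evaluated once per group.
grouped <- songs %>%
  group_by(id_title_artist) %>%
  mutate(check_rerelease = case_when(
    date_week_start == min(date_week_start) ~ 0,
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  )) %>%
  ungroup()

# Workaround: only min() runs per group; case_when() runs once overall.
workaround <- songs %>%
  group_by(id_title_artist) %>%
  mutate(first_week = min(date_week_start)) %>%
  ungroup() %>%
  mutate(check_rerelease = case_when(
    date_week_start == first_week ~ 0,
    status == "Re-release" ~ 1,
    TRUE ~ NA_real_
  )) %>%
  select(-first_week)

# Both forms should produce the same column.
print(identical(grouped$check_rerelease, workaround$check_rerelease))
```

The rewrite applies whenever the only group-dependent part of a condition is a summary such as `min()`, since that summary can be stored in a column first.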
Although the following reprex combines different issues, it illustrates a slowdown of more than 50x between dplyr 1.0.10 and 1.1.0, and makes this simple code take more than 2 seconds to run.
d <- data.frame(grp = rep(paste(1:500), each = 2),
x = rep(c("A", "B"), each = 500))
library(dplyr)
d |>
group_by(grp) |>
summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo")) |
To expand on the reprex from @courtiol. If we compare two approaches where we either use In v1.0.10,
v1.0.10
d <- data.frame(grp = rep(paste(1:1000), each = 2),
x = rep(c("A", "B"), each = 1000))
library(dplyr)
library(bench)
packageVersion("dplyr")
#> [1] '1.0.10'
mark(grouped = {d |>
group_by(grp) |>
summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))},
ungrouped = {d |>
group_by(grp) |>
summarise(firstX = first(x), .groups = "drop") |>
mutate(x = case_when(firstX == "A" ~ "bar", TRUE ~ "foo")) |>
select(-firstX)})
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 grouped 116ms 122.5ms 7.99 2.67MB 22.0
#> 2 ungrouped 12ms 13.1ms 73.4 1.44MB 15.9
v1.1.0
d <- data.frame(grp = rep(paste(1:1000), each = 2),
x = rep(c("A", "B"), each = 1000))
library(dplyr)
library(bench)
packageVersion("dplyr")
#> [1] '1.1.0'
mark(grouped = {d |>
group_by(grp) |>
summarise(x = case_when(x[1] == "A" ~ "bar", TRUE ~ "foo"))},
ungrouped = {d |>
group_by(grp) |>
summarise(firstX = first(x), .groups = "drop") |>
mutate(x = case_when(firstX == "A" ~ "bar", TRUE ~ "foo")) |>
select(-firstX)})
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 grouped 4.32s 4.32s 0.231 5.73MB 21.7
#> 2 ungrouped 79.02ms 84.06ms 12.0 1.51MB 22.0
System info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.2.2 (2022-10-31 ucrt)
#> os Windows 10 x64 (build 16299)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_World.1252
#> ctype English_World.1252
#> tz Europe/Berlin
#> date 2023-02-08
#> pandoc 2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#> package * version date (UTC) lib source
#> bench * 1.1.2 2021-11-30 [1] CRAN (R 4.2.2)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.2)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.2)
#> dplyr * 1.1.0 2023-01-29 [1] CRAN (R 4.2.2)
#> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.2)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.2)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.2)
#> fs 1.6.0 2023-01-23 [1] CRAN (R 4.2.2)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.2)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.2)
#> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.2)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.2.2)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.2)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.2)
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.3)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.2)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.2)
#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.2)
#> rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.2.2)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.2)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.2)
#> tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.2)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.2)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.2)
#> vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.2.2)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.2)
#> xfun 0.36 2022-12-21 [1] CRAN (R 4.2.2)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.2)
#>
#> [1] C:/Users/bailey/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.2.2/library
#>
#> ------------------------------------------------------------------------------ |
Yes, we know it’s slow and we’ll work on it. No need to keep providing reprexes that don’t add new insight to the problem. |
Thanks @DavisVaughan ! |
Using case_when in a mutate call with a grouping variable is much, much slower in v1.1.0 compared to v1.0.10. The code works but it's causing a tremendous slowdown in many of the packages I maintain (see here, many examples have elapsed time >5s).
Here's a reprex for v1.1.0.
Created on 2023-02-01 with reprex v2.0.2
Session info
And here's a reprex for v1.0.10 (note that the times for this one are in milliseconds, above was seconds).
Created on 2023-02-01 with reprex v2.0.2
Session info