Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Memoization" on elements in a vector input. #149

Closed
orgadish opened this issue Jul 27, 2023 · 4 comments
Closed

"Memoization" on elements in a vector input. #149

orgadish opened this issue Jul 27, 2023 · 4 comments

Comments

@orgadish
Copy link

orgadish commented Jul 27, 2023

I'm not sure if memoise is the appropriate place for this but when I suggested it in purrr, Hadley suggested memoization would be a better approach for the issue. However, memoization currently acts on the entire input to a function, without accounting for repeats in the input.

This issue came about when I discovered that fs::path_file and fs::path_dir run very slowly on Windows (see r-lib/fs#424), and since most of my use case of these functions is after using readr::read_csv(files, .id="file_path"), most of the vector is duplicated. As such, I found that I could save a significant amount of time by deduplicating the vector (2x on Mac, 40x on Windows). This approach is not just helpful for fs::path_ functions.

The most straightforward approach is:

with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    f(ux, ...)[match(x, ux)]
  }
} 

I've also submitted a PR into vctrs to speed this up (see r-lib/vctrs#1857 and r-lib/vctrs#1858).

While traditional "Memoization" is typically performed blindly on the inputs, most programming languages aren't inherently vectorized like R. Therefore, I think it would make sense for memoise to add this extra feature to its memoization, such that it cached any input that matches the unique input. Or at least a new function, say memoise_unique since calculating unique every time takes some extra time.

@wch
Copy link
Member

wch commented Jul 27, 2023

I've also encountered issues with slow fs performance in the past, although I wonder why it's so slow on Windows in the example at r-lib/fs#424, given that it's not actually doing any filesystem operations. In some cases I've had to move completely away from using fs, and instead use base R file operations for higher performance.

I think your proposed function is more specific and narrow than makes sense for the memoise package -- for example, instead of allowing any kind of input object, it requires that the input x is a vector.

@orgadish
Copy link
Author

I guess what I was thinking is it's just a form of memoization on the elements of the input themselves. So memoize_unique would always memoize on unique(x) rather than x. It would then also handle de- and re-duplication back to the input passed in.

Perhaps both the original x and the unique x can be stored to save the unique(x) call every time, but I'm not sure that would help the primary use case.

@orgadish
Copy link
Author

@wch To follow up — the fs functions are just wrappers around the base R functions and so unsurprisingly, the 30x difference in performance is maintained using the base basename, for example. See my comment in r-lib/fs#424 for the specific benchmarking.

@orgadish orgadish closed this as not planned Won't fix, can't repro, duplicate, stale Sep 16, 2023
@orgadish
Copy link
Author

To anyone that comes across this and is looking for a solution, I've written a separate package, deduped to implement this:

# if(!requireNamespace("deduped")) install.packages("deduped")

library(deduped)

N_TOTAL <- 1e4
repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:10), "inner") |> 
  rep(N_TOTAL/10) |> 
  sample()

bench::mark(
  direct = repeated_paths |> fs::path_dir(),
  indirect = repeated_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       51.8ms   52.5ms      18.9  749.02KB     2.10
#> 2 indirect    206.3µs  213.5µs    4574.     6.13MB     0

all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:N_TOTAL), "inner")

bench::mark(
  direct = all_unique_paths |> fs::path_dir(),
  indirect = all_unique_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.6ms   54.6ms      18.3  901.88KB     0   
#> 2 indirect     53.6ms   54.9ms      18.2    1.03MB     2.02

Created on 2023-10-26 with reprex v2.0.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants