"Memoization" on elements in a vector input. #149

orgadish · 2023-07-27T03:35:51Z

I'm not sure if memoise is the appropriate place for this, but when I suggested it in purrr, Hadley suggested that memoization would be a better approach to the issue. However, memoization currently acts on the entire input to a function, without accounting for repeats within the input.

This issue came about when I discovered that fs::path_file and fs::path_dir run very slowly on Windows (see r-lib/fs#424). Since I mostly use these functions after readr::read_csv(files, id = "file_path"), most of the vector is duplicated. I found that I could save a significant amount of time by deduplicating the vector first (2x on Mac, 40x on Windows). This approach is helpful beyond the fs::path_ functions.

The most straightforward approach is:

with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    f(ux, ...)[match(x, ux)]
  }
}
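
As a minimal usage sketch (the example paths and the name dedup_basename are made up for illustration, and base basename stands in for any vectorized function), the wrapper can be checked against a direct call like this:

```r
# Wrapper from above, repeated so this example is self-contained.
with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)               # compute on the unique values only
    f(ux, ...)[match(x, ux)]      # expand back to the original positions
  }
}

# Hypothetical input: a few distinct paths, each repeated several times.
paths <- rep(c("a/b/file1.csv", "a/c/file2.csv"), times = c(3, 2))

dedup_basename <- with_deduplication(basename)
result <- dedup_basename(paths)

# The deduplicated call should give the same answer as the direct call.
identical(result, basename(paths))
```

This only works when f is vectorized element by element, so that the result for one element does not depend on the others.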

I've also submitted a PR into vctrs to speed this up (see r-lib/vctrs#1857 and r-lib/vctrs#1858).

Traditional memoization treats the inputs as an opaque whole, but most programming languages aren't inherently vectorized like R. I therefore think it would make sense for memoise to add this as an extra feature, so that it caches results for the unique elements of the input. Alternatively, it could be a new function, say memoise_unique, since calculating unique(x) on every call takes some extra time.


wch · 2023-07-27T16:31:12Z

I've also encountered issues with slow fs performance in the past, although I wonder why it's so slow on Windows in the example at r-lib/fs#424, given that it's not actually doing any filesystem operations. In some cases I've had to move completely away from using fs, and instead use base R file operations for higher performance.

I think your proposed function is more specific and narrow than makes sense for the memoise package -- for example, instead of allowing any kind of input object, it requires that the input x is a vector.

orgadish · 2023-07-27T16:36:14Z

I guess what I was thinking is that it's just a form of memoization on the elements of the input themselves. So memoise_unique would always memoise on unique(x) rather than x. It would then also handle the de- and re-duplication back to the input that was passed in.

Perhaps both the original x and the unique x can be stored to save the unique(x) call every time, but I'm not sure that would help the primary use case.
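
A rough sketch of what such per-element caching could look like is below. memoise_unique here is hypothetical and not part of the memoise package; the sketch assumes x is an atomic vector, that f is vectorized element by element, and it ignores extra arguments so the cache key stays simple. The helper counted_dirname exists only to count how many elements actually reach the wrapped function.

```r
# Hypothetical sketch, NOT the memoise API: cache results per unique element.
memoise_unique <- function(f) {
  seen <- NULL      # unique inputs encountered so far
  results <- NULL   # f(seen), kept in the same order as `seen`
  function(x) {
    new <- setdiff(unique(x), seen)   # elements not yet cached
    if (length(new) > 0) {
      results <<- c(results, f(new))  # compute only the new elements
      seen <<- c(seen, new)
    }
    results[match(x, seen)]           # re-expand to the shape of x
  }
}

# Illustration helper: counts how many elements f is actually asked to process.
n_computed <- 0
counted_dirname <- function(p) {
  n_computed <<- n_computed + length(p)
  dirname(p)
}

m_dirname <- memoise_unique(counted_dirname)
first <- m_dirname(c("x/a", "x/a", "y/b"))   # computes "x/a" and "y/b"
second <- m_dirname(c("y/b", "z/c"))         # computes only "z/c"
```

Unlike the plain deduplicating wrapper, this keeps the cache across calls, at the cost of memory that grows with the number of distinct elements seen.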

orgadish · 2023-09-16T03:58:53Z

@wch To follow up: the fs functions are just wrappers around the base R functions, so unsurprisingly the 30x difference in performance persists when using base basename, for example. See my comment in r-lib/fs#424 for the specific benchmarking.

orgadish · 2023-10-27T04:00:35Z

To anyone who comes across this and is looking for a solution, I've written a separate package, deduped, to implement this:

# if(!requireNamespace("deduped")) install.packages("deduped")

library(deduped)

N_TOTAL <- 1e4
repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:10), "inner") |> 
  rep(N_TOTAL/10) |> 
  sample()

bench::mark(
  direct = repeated_paths |> fs::path_dir(),
  indirect = repeated_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       51.8ms   52.5ms      18.9  749.02KB     2.10
#> 2 indirect    206.3µs  213.5µs    4574.     6.13MB     0

all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:N_TOTAL), "inner")

bench::mark(
  direct = all_unique_paths |> fs::path_dir(),
  indirect = all_unique_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.6ms   54.6ms      18.3  901.88KB     0   
#> 2 indirect     53.6ms   54.9ms      18.2    1.03MB     2.02

Created on 2023-10-26 with reprex v2.0.2

orgadish closed this as completed Sep 16, 2023

orgadish closed this as not planned Sep 16, 2023
