[Proof of Concept] Polyester.@batch for threading #104

Draft · wants to merge 1 commit into main
Conversation

@efaulhaber (Contributor) commented Jul 5, 2024

Have you considered using Polyester.jl for threading instead of Threads?
I always use the code in this PR when I want to do fair benchmarks of my implementations against CellListMap.jl. I consistently get better performance with this PR than with main.
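
For context, a minimal standalone sketch of the pattern (not CellListMap code, just an illustration of how Polyester.@batch is used in place of Threads.@threads; the function name is made up):

using Polyester

# Polyester.@batch threads a loop much like Threads.@threads, but it reuses a
# pool of lightweight, statically scheduled tasks, so the per-call scheduling
# overhead is lower.
function inv_squares!(out)
    @batch for i in eachindex(out)
        out[i] = 1 / i^2   # each iteration writes a distinct element, so no races
    end
    return out
end

sum(inv_squares!(zeros(10^6)))   # ≈ π^2/6 ≈ 1.6449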

Here is a very cheap benchmark that runs one time step of an n-body simulation. This is the code:
https://github.com/trixi-framework/PointNeighbors.jl/blob/main/benchmarks/n_body.jl#L24-L33

This is on an AMD Ryzen Threadripper 3990X using 128 threads:
[benchmark plot]
This is a minimum speedup of 1.6x and a speedup of 4.5x for the largest problem size.

Here is a real-life benchmark, running one particle interaction loop of the Weakly Compressible SPH method:
[benchmark plot]
This is a speedup of 2x for most problem sizes.

@@ -158,8 +158,8 @@ function map_pairwise_serial!(
     show_progress::Bool=false
 ) where {F,N,T}
     p = show_progress ? Progress(length(cl.ref), dt=1) : nothing
-    for i in eachindex(cl.ref)
+    @batch for i in eachindex(cl.ref)
         output = inner_loop!(f, output, i, box, cl)
@lmiq (Member) commented Jul 6, 2024

This insertion of @batch is in the serial implementation, so I'm not sure whether that was what you intended to do.

@efaulhaber (Contributor Author):

That was the easiest way to try this out.

@lmiq (Member):

In a few tests I don't see the difference in performance you report. I might have to dig into your example to see.

One thing that might be important (I didn't look closely at your example) is the density of particles. My tests use a density of 10^5/(100^3), which is the atomic density of water and in the range of the applications I'm interested in.

The tradeoffs in other density ranges are different.

@efaulhaber (Contributor Author):

How is the density relevant? Because of the cutoff? I used a cutoff of 3 times the spacing of a grid of particles. For more realistic benchmarks, I randomly perturbed the particle positions.

@lmiq (Member) commented Jul 8, 2024

Yes, of course, in relation to the cutoff. A typical molecular dynamics simulation has a density of 10^5/100^3 particles and uses a cutoff of around 8 to 15, in the same units.

In some analyses of galaxy distributions (not my specialty), people use densities of 10^5/20.274^3 with a cutoff of 5.0 (much denser systems).

These define how many particles end up in each cell, and thus what is worth doing in the inner loops that compute the interactions within each cell or among neighboring cells. One parameter is the cell side relative to the cutoff (lcell in CellListMap); for very dense systems a cell side smaller than the cutoff is beneficial. In CellListMap I also project the particles onto the vector connecting cell centers and then sort the projected distances to avoid computing even more spurious distances.
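
As a hedged illustration of the lcell parameter mentioned above (the values here are arbitrary, not from any benchmark in this thread):

using CellListMap

# lcell controls the cell side relative to the cutoff: lcell=2 means cells of
# side cutoff/2 instead of cutoff, which can pay off for very dense systems.
box = Box([100.0, 100.0, 100.0], 12.0; lcell=2)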

@lmiq (Member) commented Jul 6, 2024

I've experimented with Polyester a few times and, indeed, if the mapped loop is tight it is faster. However, my understanding is that Polyester does not compose well if the function is called from within other threaded code. I think it will just fill the threads and compete for the resources. That's why I never stuck with it.
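
A sketch of the nesting pattern this concern is about (illustrative only, with made-up function names; it is not meant to demonstrate the actual scheduling behavior):

using Polyester

# Inner kernel threaded with Polyester.@batch ...
function inner!(out, v)
    @batch for i in eachindex(out, v)
        out[i] = sqrt(v[i])
    end
    return out
end

# ... called from an outer loop that is itself threaded with Threads.@threads.
# The worry is that every outer task drives @batch work onto the same worker
# threads, so the two levels compete for resources instead of composing.
function outer!(outs, chunks)
    Threads.@threads for k in eachindex(outs, chunks)
        inner!(outs[k], chunks[k])
    end
    return outs
end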

Concerning the benchmark: is that using CellListMap? That was not clear to me.

One thing is that the way the inner loop of map_pairwise in CellListMap is implemented, which is very general with respect to the type of function provided by the user, certainly cannot be optimal in most specific cases. I have previously thought of exposing the inner loop as part of the interface, so the user could implement it at will, possibly taking advantage of vectorization if the mapped function allows it. But for my applications that was not a priority, so it stayed as it is.

@efaulhaber (Contributor Author):

> Concerning the benchmark: is that using CellListMap? That was not clear to me.

Yes, foreach_point_neighbor in the code I linked above just forwards to map_pairwise! in this case.

@lmiq (Member) commented Jul 6, 2024

Now that I think about it, probably what @batch is doing there is adding @inbounds and @fastmath. I don't think the performance difference is associated with the threading per se.

@efaulhaber (Contributor Author):

But wouldn't we see a similar speedup on a single thread then?
main:

CellListMap.jl with 10x10x10 = 1000 particles finished in 1.900 ms
CellListMap.jl with 16x16x16 = 4096 particles finished in 9.661 ms
CellListMap.jl with 25x25x25 = 15625 particles finished in 41.616 ms
CellListMap.jl with 40x40x40 = 64000 particles finished in 184.198 ms
CellListMap.jl with 63x63x63 = 250047 particles finished in 757.062 ms
CellListMap.jl with 101x101x101 = 1030301 particles finished in 3.271 s
CellListMap.jl with 160x160x160 = 4096000 particles finished in 13.483 s

Polyester.jl:

CellListMap.jl with 10x10x10 = 1000 particles finished in 1.894 ms
CellListMap.jl with 16x16x16 = 4096 particles finished in 9.653 ms
CellListMap.jl with 25x25x25 = 15625 particles finished in 41.475 ms
CellListMap.jl with 40x40x40 = 64000 particles finished in 183.926 ms
CellListMap.jl with 63x63x63 = 250047 particles finished in 754.712 ms
CellListMap.jl with 101x101x101 = 1030301 particles finished in 3.264 s
CellListMap.jl with 160x160x160 = 4096000 particles finished in 13.458 s

@lmiq (Member) commented Jul 8, 2024

I don't really see the performance difference you report here. For example, I'm running the examples from this page: https://m3g.github.io/CellListMap.jl/stable/LowLevel/#Histogram-of-distances

The forces example, which is probably the most similar to what you are computing in your tests, results in:

Current stable CellListMap 0.9.4:

julia> @btime map_pairwise!(
           $((x,y,i,j,d2,forces) -> calc_forces!(x,y,i,j,d2,mass,forces)),
           $forces,$box,$cl
       );
  34.576 ms (76 allocations: 13.74 MiB)

and with your PR:

julia> @btime map_pairwise!(
           $((x,y,i,j,d2,forces) -> calc_forces!(x,y,i,j,d2,mass,forces)),
           $forces,$box,$cl
       );
  34.173 ms (82 allocations: 13.75 MiB)

In other examples there might be a slight improvement in performance with @batch, but it is very minor. Maybe this is very dependent on the CPU used?

Could you test one of these examples on your machine?
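
For reference, the setup assumed by the @btime calls above is roughly the following (a sketch adapted from the linked documentation page; the exact values and the calc_forces! implementation may differ in detail):

using BenchmarkTools, CellListMap, StaticArrays

# Particles in a cubic box of side 250 with cutoff 10, as in the docs example
N = 100_000
cutoff = 10.0
box = Box([250.0, 250.0, 250.0], cutoff)
x = [250.0 * rand(SVector{3,Float64}) for _ in 1:N]
cl = CellList(x, box)

# Masses and preallocated force accumulator
mass = [5.0 * rand() for _ in 1:N]
forces = [zeros(SVector{3,Float64}) for _ in 1:N]

# Pairwise "gravitational" force kernel used by the mapped closure
function calc_forces!(x, y, i, j, d2, mass, forces)
    G = 9.8 * mass[i] * mass[j] / d2
    df = (G / sqrt(d2)) * (y - x)
    forces[i] -= df
    forces[j] += df
    return forces
end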

@lmiq (Member) commented Jul 13, 2024

Off-topic: could you please send me the link to your JuliaCon talk about this when it is available? Thanks.

@efaulhaber (Contributor Author):

https://www.youtube.com/live/V7FWl4YumcA?si=XIKhTj8dfmon0rwk&t=4669

But we only quickly mentioned the neighborhood search.

@efaulhaber (Contributor Author):

Back to the topic:
I was confused as well, since Polyester.@batch is actually slower in our code for large problems.
So I repeated the benchmark with Threads.@threads and Threads.@threads :static and found this:
[benchmark plot]
So Threads.@threads :static is just as fast as Polyester.@batch, except for very small problems, exactly as I would expect.
Interestingly, slapping @threads :static onto the serial loop is significantly faster than the nbatches approach in main. This is a 2x speedup for 1M particles and a 4.7x speedup for 65M particles in the cheap n-body benchmark.
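
For clarity, this is roughly what that variant of the serial loop looks like (a sketch only, not the exact benchmark code; note that the reassignment of output races across threads, which only "works" here because the benchmark skips the output reduction, as mentioned further down):

Threads.@threads :static for i in eachindex(cl.ref)
    output = inner_loop!(f, output, i, box, cl)
end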

But as you already noticed, the difference becomes smaller with more neighbors. In this example, I have a search radius of 3x the particle spacing, which gives me ~100 neighbors with ~3 particles per cell. When I increase the search radius to 10x the particle spacing, the speedup for 1M particles goes down from 2x to 1.11x.

@lmiq (Member) commented Jul 15, 2024

Well, I'll have to investigate what's going on. The issue is that I don't see these performance differences at all on the hardware I have, with my test functions. For instance, this is what I get with the Polyester version:

julia> using CellListMap

julia> x, box = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(positions=x, unitcell=box.input_unit_cell.matrix, output=0.0, cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> out += inv(d2)), $sys)
  18.444 ms (87 allocations: 16.14 KiB)
75140.7942005164

With current version:

julia> using CellListMap

julia> x, box = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(positions=x, unitcell=box.input_unit_cell.matrix, output=0.0, cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> out += inv(d2)), $sys)
  18.692 ms (87 allocations: 16.14 KiB)
77202.5687995922

@efaulhaber (Contributor Author):

This is not using the map_pairwise! method that takes a CellListPair, right? I only changed that definition.

@lmiq (Member) commented Jul 15, 2024

Oh, right... sorry. But in that case I get the opposite result; the current version is faster:

with polyester:

julia> x, box = CellListMap.xatomic(10^4);

julia> y, _ = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(xpositions=x, ypositions=y, unitcell=2*box.input_unit_cell.matrix, output=zeros(length(x)), cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  46.463 ms (43 allocations: 7.25 KiB)

julia> Threads.nthreads()
6

with current version:

julia> x, box = CellListMap.xatomic(10^4);

julia> y, _ = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(xpositions=x, ypositions=y, unitcell=2*box.input_unit_cell.matrix, output=zeros(length(x)), cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  23.328 ms (93 allocations: 16.33 KiB)

So I really will have to dig further into your example to understand what's going on.

@efaulhaber (Contributor Author):

Your benchmark on a Threadripper 3990X:
map_pairwise_parallel!:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  8.930 ms (286 allocations: 54.41 KiB)

map_pairwise_serial! with @batch:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  6.020 ms (53 allocations: 8.59 KiB)

map_pairwise_serial! with Threads.@threads :static:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  3.438 ms (698 allocations: 132.58 KiB)

@lmiq (Member) commented Jul 16, 2024

Well, that is definitely not what I see here. This is on another machine (a fairly standard one):

With @batch:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); out end), $sys);
  63.056 ms (1 allocation: 448 bytes)

with current stable:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); out end), $sys);
  21.881 ms (117 allocations: 19.95 KiB)

With:

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD EPYC 7282 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 8 default, 0 interactive, 4 GC (on 16 virtual cores)

(With only @threads :static in the serial loop, the result is probably wrong, unless you did something to accumulate the result per thread and reduce it later.)
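
Something along these lines would be needed for a correct result (a hedged sketch only; reducer stands in for the user-supplied reduction function, and CellListMap's actual parallel path manages its per-batch output copies differently):

outputs = [deepcopy(output) for _ in 1:Threads.nthreads()]
Threads.@threads :static for i in eachindex(cl.ref)
    tid = Threads.threadid()  # stable here because :static pins iterations to threads
    outputs[tid] = inner_loop!(f, outputs[tid], i, box, cl)
end
output = reduce(reducer, outputs)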

@efaulhaber (Contributor Author):

> (With only @threads :static in the serial loop, the result is probably wrong, unless you did something to accumulate the result per thread and reduce it later.)

I skipped the reduce in all versions and only returned 0.
