Sharding Functors don't support overdecomposition #810

Open · ipdemes opened this issue Aug 7, 2023 · 1 comment
@ipdemes (Contributor) commented Aug 7, 2023

Here is the description from @lightsighter on this issue:

The bugs look like this:

legion_python: /home/idemeshko/development/legate.core_2309/src/core/runtime/shard.cc:126: virtual Legion::ShardID legate::LegateShardingFunctor::shard(const DomainPoint&, const Domain&, size_t): Assertion `start_node_id_ <= shard_id && shard_id < end_node_id_' failed.

They look like a legate issue; I see them when running the test_ingest.py test. In that case the problem is that we're doing an index launch of 15 points (0,0)-(4,2) over two shards, but the sharding functor has a preset per_node_count_ member variable, which I think says how many points get assigned to each shard, and it is set to 4. With 15 points, some of them get sharded to a non-existent shard (hence the assertion). I feel like legate either needs to register a different sharding functor for these cases, or it needs to compute per_node_count_ based on the size of the launch bounds:
https://github.com/nv-legate/legate.core/blob/branch-23.09/src/core/runtime/shard.cc#L111
The result of calling linearize on this line with point (3,0) is 9, which is then divided by 4 to get 2, and that is bigger than your choice of shard IDs, which is either 0 or 1:
https://github.com/nv-legate/legate.core/blob/branch-23.09/src/core/runtime/shard.cc#L124
It's unclear to me where per_node_count_ is coming from, since it gets passed in from Python when the LegateShardingFunctor is registered.
Note that we weren't seeing this problem before because Legion wasn't invoking the sharding functors in some cases where it should have, so this is just a latent bug that's being uncovered by Legion calling the sharding functors more often.
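
To make the arithmetic above concrete, here is a minimal Python sketch (not the actual shard.cc code; the row-major linearization and the derived per-shard count are assumptions) of the failing computation and of one possible fix that computes the per-shard count from the launch bounds instead of the preset per_node_count_:

# Minimal sketch of the failing arithmetic, assuming row-major
# linearization of the launch domain (names are illustrative,
# not the actual shard.cc symbols).

def linearize(point, lo, hi):
    """Row-major linearization of `point` within the inclusive rectangle [lo, hi]."""
    index = 0
    for p, l, h in zip(point, lo, hi):
        index = index * (h - l + 1) + (p - l)
    return index

num_shards = 2          # two ranks in the failing run
per_node_count = 4      # the preset value passed in at registration time

# Index launch of 15 points over (0,0)-(4,2), as in test_ingest.py
lo, hi = (0, 0), (4, 2)
idx = linearize((3, 0), lo, hi)     # 3 * 3 + 0 = 9
shard = idx // per_node_count       # 9 // 4 = 2
print(shard < num_shards)           # False: mirrors the failed Legion assertion

# One possible fix, as suggested above: derive the per-shard count from
# the launch bounds rather than the preset per_node_count_.
total_points = linearize(hi, lo, hi) + 1                    # 15
per_shard = (total_points + num_shards - 1) // num_shards   # ceil(15 / 2) = 8
print(idx // per_shard)             # 1, within the valid shard range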

ipdemes self-assigned this Aug 7, 2023
@manopapad (Contributor) commented:

I believe this is a general issue with our sharding functors: they don't support overdecomposition, which the user can cause by doing a manual task launch. I can trigger it by changing cunumeric like so:

diff --git a/cunumeric/deferred.py b/cunumeric/deferred.py
index 5cbea74b..8d97849d 100644
--- a/cunumeric/deferred.py
+++ b/cunumeric/deferred.py
@@ -35,7 +35,7 @@ from typing import (

 import legate.core.types as ty
 import numpy as np
-from legate.core import Annotation, Future, ReductionOp, Store
+from legate.core import Annotation, Future, Rect, ReductionOp, Store
 from numpy.core.numeric import (  # type: ignore [attr-defined]
     normalize_axis_tuple,
 )
@@ -1927,8 +1927,13 @@ class DeferredArray(NumPyThunk):
                 shape=(1,),
             ).base

-        task = self.context.create_auto_task(CuNumericOpCode.ARANGE)
-        task.add_output(self.base)
+        p_output = self.base.partition_by_tiling((4,))
+        task = self.context.create_manual_task(
+            CuNumericOpCode.ARANGE,
+            launch_domain=Rect(lo=(0,), hi=(6,)),
+        )
+        task.add_output(p_output)
+
         task.add_input(create_scalar(start, self.dtype))
         task.add_input(create_scalar(stop, self.dtype))
         task.add_input(create_scalar(step, self.dtype))

and running the following code:

import cunumeric as cn
x = cn.arange(24)

as follows:

legate --launcher mpirun --cpus 2 --nodes 1 --ranks-per-node 2 a.py
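
For reference, a back-of-the-envelope sketch of why this manual launch overdecomposes; the per-shard point count used below is an assumption about what gets registered for this machine, not a value read from the runtime:

# Hypothetical arithmetic only; assumes the registered per-shard point count
# comes from the machine (e.g. 2 CPUs per rank) rather than from the 6-point
# launch domain of this manual launch.

num_elements = 24
tile_size = 4
launch_points = num_elements // tile_size   # 6 tiles -> 6 launch points
num_shards = 2                              # --nodes 1 --ranks-per-node 2

assumed_per_shard_count = 2                 # assumption: CPUs per rank
for p in range(launch_points):
    shard = p // assumed_per_shard_count
    print(p, shard, shard < num_shards)     # points 4 and 5 land on shard 2
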

manopapad assigned magnatelee and unassigned ipdemes Aug 8, 2023
manopapad changed the title from "Fix per_node_count_ in sgarding function" to "Sharding Functor don't support overdecomposition" Aug 8, 2023
manopapad changed the title from "Sharding Functor don't support overdecomposition" to "Sharding Functors don't support overdecomposition" Aug 8, 2023