Freeze in S3D TDB on Marlowe #1788
Comments
There's nothing interesting in the backtraces. Do another run with detailed Legion Spy logging and
Logs are here: I confirmed at the point where I killed the job that the count of
Realm hang. There is a DMA copy that started but did not finish. There are 62361 copies that started running, but only 62360 that completed:

Here is the bad copy:
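(The log excerpt itself isn't captured above. As an illustrative sketch of the counting approach just described, here is a minimal Python script that diffs copy-start and copy-finish events in a log dump and reports the IDs that never completed. The `copy start` / `copy end` markers and the `id=` field are made-up assumptions, not the actual Realm or Legion Spy log format.)

```python
import re
import sys

# Hypothetical markers; the real Realm / Legion Spy log format differs.
START_RE = re.compile(r"copy start .*\bid=(\d+)")
END_RE = re.compile(r"copy end .*\bid=(\d+)")

def unfinished_copies(path):
    """Return IDs of copies that logged a start but never logged a finish."""
    started, finished = set(), set()
    with open(path) as f:
        for line in f:
            m = START_RE.search(line)
            if m:
                started.add(m.group(1))
                continue
            m = END_RE.search(line)
            if m:
                finished.add(m.group(1))
    return started - finished

if __name__ == "__main__":
    stuck = sorted(unfinished_copies(sys.argv[1]))
    print(f"{len(stuck)} copies started but never finished: {stuck}")
```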
@artempriakhin or @muraj, can one of you guys take a look at the logs?
@muraj @artempriakhin not sure if you have access to the machine. Here are the logs related to this bad copy:
Channel 12 is the XFER_REMOTE_WRITE; I feel like it is related to the obcount issue of gasnetex. @elliottslaughter, could you please try to increase the obcount?

@lightsighter, just curious: why do you issue a copy from node 4 while src is on node 0 and dst is on node 7? It does not sound efficient.
With
Backtrace may not be especially helpful, but it's:
Turn off detailed Legion Spy logging and see if it goes away. Also, which branch are you on (because it's definitely not the
Oh, right. I forgot this was a Legion Spy build. Thanks. I'm running
Legion's dependence analysis often does not happen on the node where the data is, because the data is used in multiple different places (more than just two nodes). It is a very common pattern to request a copy between two nodes from a third node.
Ok, after backing out Legion Spy, I can confirm that
So this is a duplicate, then, of #1262.
I thought we were following the formula in #1508 (comment)? Is
It's unclear to me if that is actually the default right now in the GASNetEX module or whether it was just aspirational, as in something we could do. That can add up to a lot of memory as you scale: the number of buffers you need on each node grows as O(N) in the number of nodes.
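(To put rough numbers on that O(N) scaling, a quick sketch with illustrative assumptions; the buffer count and size below are made up and are not Realm/GASNet-EX defaults.)

```python
def per_node_buffer_memory(nodes, obcount, buf_size_bytes):
    """Rough per-node memory if each node keeps `obcount` buffers per peer.

    All parameters are illustrative assumptions, not Realm/GASNet-EX defaults.
    """
    peers = nodes - 1
    return peers * obcount * buf_size_bytes

# Example: 64 outstanding buffers of 4 MiB per peer.
for nodes in (8, 64, 512):
    gib = per_node_buffer_memory(nodes, obcount=64, buf_size_bytes=4 << 20) / (1 << 30)
    print(f"{nodes:4d} nodes -> {gib:6.1f} GiB per node")
```

Under these made-up numbers the per-node footprint goes from a couple of GiB at 8 nodes to well over 100 GiB at 512 nodes, which is the concern about static, per-peer buffer provisioning.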
Just FYI that a run with the formula
Running with
Is there anything else I can shut off to reduce the required
Ok, running with

At 3 nodes, we run 160 timesteps successfully and then fail with:

I notice further down in the log some lines that say:

Not sure if that's just a reflection of the crash or if "lost" means something more serious happened. Anyway, this doesn't look immediately related to the obcount issue?
For comparison, with

So perhaps the obcount is related. I haven't tried to do any NIC binding, and I know there are a lot of NICs on this machine, so maybe that would help.
I also tried
Had to do a bit of catching up on this obcount problem before responding with anything meaningful. As I understand it, the

It's possible that we will never completely avoid the static tuning of

@elliottslaughter, since we're running on InfiniBand and we have UCX now that is fully operational, I would be open to discussing and understanding why we can't use it here. It would be good to understand why we don't have this type of problem in UCX, how it is designed, and whether it's more efficient. @SeyedMir, correct me here please.

We certainly ran some benchmarks comparing UCX and GASNet-EX, and the performance shouldn't be worse from what I remember. In case it's actually worse, perhaps it would be reasonable to make changes to UCX instead to make sure it matches up with GASNet.
@syamajala I believe these issues should be fixed in master; can you give it a shot?
@apryakhin If we cannot completely get rid of the
That looks like a real crash. Most likely on the application side, but could potentially be a Realm DMA kernel. I would bet on an application kernel though.
That is not related to the obcount issue.
Regent has been fixed in

It is possible that, since we also crashed in #1788 (comment), we're looking at a genuine application bug that is merely hidden by various settings (CUDA hijack and/or disabling GPUDirect). But if so, then I'll probably need help from the original application authors to chase it down.
I agree that if this is a detectable condition, it's definitely worth an error, or at least a warning, informing people that we hit it.
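(As a generic illustration of that detect-and-warn idea, and not anything to do with Realm's actual internals, here is a toy Python sketch of the pattern: check the bounded resource up front and surface the exhaustion loudly instead of stalling silently. The class and its parameters are purely hypothetical.)

```python
import warnings

class OutstandingBufferPool:
    """Toy pool with a fixed number of slots, standing in for any bounded resource."""

    def __init__(self, slots):
        self.slots = slots
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.slots:
            # The condition is detectable here, so report it rather than blocking forever.
            warnings.warn(
                "out of outstanding-buffer slots; increase the pool size or expect stalls",
                RuntimeWarning)
            return False
        self.in_use += 1
        return True

    def release(self):
        self.in_use -= 1

pool = OutstandingBufferPool(slots=2)
print([pool.acquire() for _ in range(3)])  # third acquire warns and returns False
```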
I am running S3D's TDB branch on Marlowe and have encountered a freeze on 2 nodes.
Note that this is an NVIDIA machine, so we're talking about an InfiniBand network. I have built Legion with GASNet-EX and the `ibv` conduit.

There are two sets of backtraces below, taken after the application was frozen for about 10 minutes (and 5 minutes between each set of backtraces):
/scratch/eslaught/marlowe_s3d_tdb_2024-10-31/DBO_Test_2/bt2-1
/scratch/eslaught/marlowe_s3d_tdb_2024-10-31/DBO_Test_2/bt2-2
Flags for this run include `-ll:force_kthreads -lg:inorder 1 -lg:safe_ctrlrepl 1 -lg:no_tracing`, so you will see the index launch where the application froze on the stack.