More parallel+serial discussion #1470

Closed
germasch opened this issue Jun 4, 2019 · 7 comments
Comments

@germasch
Contributor

germasch commented Jun 4, 2019

This is a follow-up to #1443 and PR #1466. The PR addresses the fundamentals of the issue, but the work isn't complete yet.

Open issues:

  • python bindings support (I think @pnorbert is working on this, should be relatively straightforward).
  • which parts of adios2 make sense to support in serial mode? (InsituMPI probably not, but maybe all or most of the rest.) Right now, BP3/BP4 and HDF5 are supported, if only lightly tested.
  • SST should be supported, but is harder because it's a separate lib, and uses C.
  • corner cases with the current implementation.

The current approach is to replace MPI_* functions by their SMPI_* wrapper equivalents. These wrappers call through to the real MPI if it's initialized (and not finalized); otherwise they fall back to the mpidummy behavior. I somewhat deliberately changed only selected pieces to use those wrappers, since I'm not convinced it's a really clean approach, though I'm torn about that. One could, however, consider doing a global search-and-replace and using the wrappers everywhere. Some things still wouldn't work (e.g., there's no easy way to implement MPI_Send/MPI_Recv properly on a single proc). The current state isn't horrible, i.e., if someone were to try to use some part which hasn't been converted to SMPI_* wrappers without calling MPI_Init, they'd get MPI complaining about just that, rather than some mysterious subtle problems.
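As an illustration (not the actual adios2 code; the flags and real_MPI_Comm_size below are stand-ins for MPI_Initialized(), MPI_Finalized(), and the real MPI_Comm_size), the wrapper pattern might look roughly like this:

```c
#include <stddef.h>

/* Hypothetical sketch of the SMPI_* wrapper pattern: call through to
 * real MPI only while MPI is initialized and not yet finalized;
 * otherwise fall back to mpidummy-style single-process behavior. */

static int mpi_initialized = 0; /* stand-in for MPI_Initialized() */
static int mpi_finalized = 0;   /* stand-in for MPI_Finalized()   */

static int real_MPI_Comm_size(int comm, int *size)
{
    (void)comm;
    *size = 4; /* pretend the job runs on 4 ranks */
    return 0;  /* MPI_SUCCESS */
}

int SMPI_Comm_size(int comm, int *size)
{
    if (mpi_initialized && !mpi_finalized)
        return real_MPI_Comm_size(comm, size);
    *size = 1; /* mpidummy: a single "process" */
    return 0;
}
```

Note that the check runs on every call, and its result can change at runtime as MPI is initialized and finalized, which is exactly the ambiguity discussed below.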

What I think would be cleaner, but would potentially involve some code duplication, would be to have separate code paths for the serial and parallel cases where both are supported; e.g., in the serial case, AggregateMetadata probably shouldn't even be called. This, however, requires being able to distinguish whether a function is meant to execute serially. I think there is one straightforward way to make that distinction, i.e., define MPI_COMM_NULL as meaning "don't use MPI, this executes serially". That's clearer than the current reliance on MPI_Initialized/MPI_Finalized; that status, one should note, might actually change along the way. So it's possible to have SMPI_Comm_dup return the communicator itself when called before MPI_Init (because that's what mpidummy does), then initialize MPI, and then eventually have SMPI_Comm_free call the real MPI_Comm_free, which will rightfully complain that the communicator hadn't been dup'd in the first place. This actually happens in real life, which is why I didn't change the MPI_Comm_dup handling inside of core::ADIOS for now.

Switching between real MPI and mpidummy based on the communicator (MPI_COMM_NULL vs not) resolves a lot of ambiguity -- it's also cheaper than having to check MPI_Initialized and MPI_Finalized on every MPI call, so I'd be in favor of going that way. That's also a pretty small/localized change.
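A rough sketch of communicator-based dispatch (hypothetical names throughout; SMPI_Comm and SMPI_COMM_NULL stand in for MPI_Comm and MPI_COMM_NULL, and real_MPI_Comm_rank for the real MPI call):

```c
/* Hypothetical sketch of dispatching on the communicator instead of
 * on MPI_Initialized()/MPI_Finalized(): a null communicator means
 * "execute serially via mpidummy", anything else goes to real MPI. */

typedef int SMPI_Comm;
#define SMPI_COMM_NULL ((SMPI_Comm)-1)

static int real_MPI_Comm_rank(SMPI_Comm comm, int *rank)
{
    (void)comm;
    *rank = 2; /* pretend we are rank 2 of a real parallel job */
    return 0;  /* MPI_SUCCESS */
}

int SMPI_Comm_rank(SMPI_Comm comm, int *rank)
{
    if (comm == SMPI_COMM_NULL) { /* serial: mpidummy behavior */
        *rank = 0;
        return 0;
    }
    return real_MPI_Comm_rank(comm, rank);
}
```

The dispatch decision here is a single comparison against the communicator the caller already passed in, independent of the global MPI lifecycle state.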

On the bigger issue (wrap everything with SMPI vs separate code paths), I think one would have to actually try it and take a look to decide what's better.

@germasch
Contributor Author

germasch commented Jun 4, 2019

Separately, for @eisenhauer @philip-davis, what's the reason that SST is built as a separate library? This, together with the fact that it partially uses C sources (so the using namespace stuff doesn't work), makes it harder to support the serial+parallel scenario, as it cannot use the wrappers from the main library. It's all doable, but with a decent amount of code duplication and ugliness, so I figured I'd start by asking whether it needs to be built as a separate lib.

@eisenhauer
Member

eisenhauer commented Jun 4, 2019

(Again, please forgive my lateness in joining discussions. I've unsubscribed from most GitHub notifications to preserve sanity, so I miss things.) WRT SST's status as a library (and, indirectly, its representation in C): mostly, it's for historical/arbitrary reasons. In the early stages, it was easier to play with SST concepts and prototype an implementation outside of ADIOS. We had a pretty good idea of what the interfaces were, both between ADIOS2 and SST, and inside SST between the control plane and the data plane, but we needed a framework in which to exercise things. So, SST was prototyped as a library with a well-defined interface to the test-harness levels and between the control plane and data plane. Philip and I were both most comfortable in C, so that's what we used. (The fact that ADIOS2 development at the time was mostly focused on the writer side, with the reader side receiving little attention, reinforced the practicality of outside-ADIOS2 development. Writer-side-first development made sense for files, but was unhelpful for a staging system.) When it came time to actually pull SST into ADIOS2, @chuckatkins and I made the decision to leave SST as a library. I don't have strong feelings about that per se. It doesn't matter much if we link objects directly or just link the library. But I imagine the more awkward bit is that SST is in C and can't directly use the wrappers from the main library. That's unrelated to the library thing...

@germasch
Contributor Author

germasch commented Jun 4, 2019

Thanks @eisenhauer for the insights. You're right, the use of C causes the bigger hassles. I think it's perfectly reasonable to support C code in adios2, though, even if it means hassles :(

Still, before I go there, I guess this can also serve as a guinea pig for my other consideration, namely whether it's worth having separate code paths for the serial case. I took one file (cp_common.c) and looked at what it would take. Appended below is the patch I ended up with.

So on its face, that doesn't seem worth it -- quite a bit of code added for virtually no gain vs handling this through mpidummy. However, I'd like to ask first whether there might be a point in doing these separate paths for other reasons. You can see that what I ended up with is basically: encode, copy the encoded buffer, decode again. I don't really have an understanding of what's going on at a deeper level, though. If it were possible to eliminate the encode->decode roundtrip, things might look simpler, and maybe more importantly, it could also potentially speed things up. I suppose that at least the "single reader" case might actually be relevant in real life, so if it's possible to optimize that case, maybe this could all be worth it after all.

diff --git a/source/adios2/toolkit/sst/cp/cp_common.c b/source/adios2/toolkit/sst/cp/cp_common.c
index 99cc4375..2e00643c 100644
--- a/source/adios2/toolkit/sst/cp/cp_common.c
+++ b/source/adios2/toolkit/sst/cp/cp_common.c
@@ -1,5 +1,6 @@
 #include <ctype.h>
 #include <stdarg.h>
+#include <stdbool.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -532,9 +533,41 @@ static FMStructDescList combineCpDpFormats(FMStructDescList top,
     return CombinedFormats;
 }

+static inline bool isSerial(SstStream Stream)
+{
+    return Stream->CohortSize == 1;
+}
+
+static void **CPSerial_consolidateDataToRankZero(SstStream Stream,
+                                                 void *LocalInfo,
+                                                 FFSTypeHandle Type,
+                                                 void **RetDataBlock)
+{
+    FFSBuffer Buf = create_FFSBuffer();
+    int DataSize;
+    char *Buffer =
+        FFSencode(Buf, FMFormat_of_original(Type), LocalInfo, &DataSize);
+
+    char *RecvBuffer = malloc((DataSize + 7) & ~7);
+    memcpy(RecvBuffer, Buffer, DataSize);
+    free_FFSBuffer(Buf);
+
+    FFSContext context = Stream->CPInfo->ffs_c;
+    struct _CP_DP_init_info **Pointers = malloc(sizeof(Pointers[0]));
+    FFSdecode_in_place(context, RecvBuffer, (void **)&Pointers[0]);
+
+    *RetDataBlock = RecvBuffer;
+    return (void **)Pointers;
+}
+
 void **CP_consolidateDataToRankZero(SstStream Stream, void *LocalInfo,
                                     FFSTypeHandle Type, void **RetDataBlock)
 {
+    if (isSerial(Stream))
+    {
+        return CPSerial_consolidateDataToRankZero(Stream, LocalInfo, Type,
+                                                  RetDataBlock);
+    }
     FFSBuffer Buf = create_FFSBuffer();
     int DataSize;
     int *RecvCounts = NULL;
@@ -609,9 +642,35 @@ void **CP_consolidateDataToRankZero(SstStream Stream, void *LocalInfo,
     return (void **)Pointers;
 }

+static void *CPSerial_distributeDataFromRankZero(SstStream Stream,
+                                                 void *root_info,
+                                                 FFSTypeHandle Type,
+                                                 void **RetDataBlock)
+{
+    FFSBuffer Buf = create_FFSBuffer();
+    int DataSize;
+    char *tmp =
+        FFSencode(Buf, FMFormat_of_original(Type), root_info, &DataSize);
+    char *Buffer = malloc(DataSize);
+    memcpy(Buffer, tmp, DataSize);
+    free_FFSBuffer(Buf);
+
+    FFSContext context = Stream->CPInfo->ffs_c;
+
+    void *RetVal;
+    FFSdecode_in_place(context, Buffer, &RetVal);
+    *RetDataBlock = Buffer;
+    return RetVal;
+}
+
 void *CP_distributeDataFromRankZero(SstStream Stream, void *root_info,
                                     FFSTypeHandle Type, void **RetDataBlock)
 {
+    if (isSerial(Stream))
+    {
+        return CPSerial_distributeDataFromRankZero(Stream, root_info, Type,
+                                                   RetDataBlock);
+    }
     int DataSize;
     char *Buffer;
     void *RetVal;
@@ -647,6 +706,12 @@ void *CP_distributeDataFromRankZero(SstStream Stream, void *root_info,
 void **CP_consolidateDataToAll(SstStream Stream, void *LocalInfo,
                                FFSTypeHandle Type, void **RetDataBlock)
 {
+    // FIXME, actually consolidateDataToAll is entirely unused
+    if (isSerial(Stream))
+    {
+        return CPSerial_consolidateDataToRankZero(Stream, LocalInfo, Type,
+                                                  RetDataBlock);
+    }
     FFSBuffer Buf = create_FFSBuffer();
     int DataSize;
     int *RecvCounts;

@eisenhauer
Member

eisenhauer commented Jun 4, 2019

Unfortunately, I think it's hard to optimize out the (seemingly) redundant single-processor encode/decode. The complexity here is that the encode/decode also has a side effect of rearranging the memory layout of any data that is more complex than a flat structure with no pointers. Essentially it consolidates those disparate malloc blocks into a single large block (with internal pointers). (This allows EVpath and message-passing systems built on FFS to efficiently manage incoming complex structured messages as a single entity while simplifying their memory management. This single large block is represented by the RetDataBlock parameter in the routines above.) So, if we got rid of the encode/decode pair, we'd either have to provide similar functionality (likely at a similar cost), or rewrite all the single-processor memory handling as a special case. That's likely a much bigger mess than we want to tackle...

@germasch
Contributor Author

germasch commented Jun 4, 2019

Thanks @eisenhauer for the clarification. So that leads me to the conclusion that separate code paths aren't the right solution here, so I'll work on making the MPI wrapping thing available to plain old C.

@chuckatkins
Contributor

@bradking just keeping you in the loop on this discussion

@germasch
Contributor Author

With #1480 merged, I'll close this.
