
Encapsulate multi-process communication #1691

Merged: 42 commits merged into ornladios:master on Aug 26, 2019

Conversation

bradking
Collaborator

Much of ADIOS's functionality is applicable to any multi-process communication environment, not just MPI. Disconnect the non-MPI-specific parts of our implementation from MPI. Replace direct use of MPI_Comm and MPI functions with a helper::Comm wrapper that encapsulates MPI functionality. This makes room for future abstraction of multi-process communication behind the encapsulated API.

Start by encapsulating MPI functions and the types MPI_Comm, MPI_Request, and MPI_Status behind wrappers in helper::Comm. For now leave encapsulation of MPI constants to future work.

@germasch
Contributor

Much of ADIOS's functionality is applicable to any multi-process communication environment, not just MPI.

So I have some concerns, in particular: what is "any multi-process communication environment"? I think it's pretty clear that MPI is going to stay with us (i.e., in HPC) at least through the exascale era, and other things I have heard of, like Spark or Hadoop, don't have an API that's anything like this abstraction.

I totally agree that wrapping and abstracting an API is the right way to go if one wants to support multiple implementations or variants of something. But I also think that is something that should be done at the point where one has an actual need. Because if there is only one API, which is itself already built on a certain abstraction (like MPI is), there's really no way to know which abstractions are the right ones to achieve the future goal of abstracting away differences. So I think all this really does here is provide a C++ interface to MPI -- it doesn't actually even solve the case that originally kicked off this discussion, i.e., the case where MPI is not present.

In terms of a C++ API -- well, it's been tried, and was eventually dropped, because people preferred to stick with the familiar C API and the C++ bindings didn't offer much value. This is kind of my concern with this one -- everyone who does HPC is already familiar with MPI. An additional wrapping layer obscures that, so everyone contributing code has to familiarize themselves with this new abstraction, and potentially implement more missing wrappers. And there is still C code in adios2, which can't use this new abstraction, since it'd need yet another layer of C binding wrappers.

I guess if one really wants to go with a C++ wrapping layer, I'd say one should at least take advantage of C++ lifetime semantics, i.e., implement a copy constructor that does MPI_Comm_dup and a destructor that does MPI_Comm_free, so one source of bugs would be eliminated (along the lines of how you're already nicely using move semantics).

Anyway, this is just my personal opinion -- the implementation looks perfectly good, but I don't really see the rationale for adding a layer of wrappers. It's very possible that there is some distributed-memory communication environment that I'm not aware of, though.

@bradking
Collaborator Author

it doesn't actually even solve the case that originally kicked off this discussion, i.e., the case where MPI is not present.

This is just the first step. Additional work will follow. However, I can't continue to maintain this branch outside of master because every rebase takes hours of catching up with new uses of MPI. This plan was discussed in a call months ago.

In terms of a C++ API -- well, it's been tried, and was eventually dropped, because people preferred to stick with the familiar C API and the C++ bindings didn't offer much value.

They don't add much value to upstream MPI implementations, but here we need an abstraction layer.

implement a copy constructor that does MPI_Comm_dup and a destructor that does MPI_Comm_free

The destructor already frees it if an explicit Free has not been called already. I intentionally deleted the copy constructor and copy-assignment operator of helper::Comm to avoid accidental duplication. A dedicated Duplicate method could be added when needed. There is already a Split method.
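For illustration, here is a minimal sketch of what such an RAII communicator wrapper can look like -- my own sketch of the semantics described above (deleted copy operations, move semantics, a Duplicate entry point, a Split method, and a destructor that frees the communicator unless Free was already called), not the actual helper::Comm source:

// Illustrative sketch only, not the adios2 implementation.
#include <mpi.h>
#include <utility>

namespace helper
{
class Comm
{
public:
    Comm() = default;

    // Copying is deleted to avoid accidental communicator duplication.
    Comm(const Comm &) = delete;
    Comm &operator=(const Comm &) = delete;

    // Moves transfer ownership and leave the source empty.
    Comm(Comm &&other) noexcept
    : m_Comm(std::exchange(other.m_Comm, MPI_COMM_NULL))
    {
    }
    Comm &operator=(Comm &&other) noexcept
    {
        if (this != &other)
        {
            Free();
            m_Comm = std::exchange(other.m_Comm, MPI_COMM_NULL);
        }
        return *this;
    }

    // The destructor frees the communicator if Free() was not called already.
    ~Comm() { Free(); }

    void Free()
    {
        if (m_Comm != MPI_COMM_NULL && m_Comm != MPI_COMM_WORLD &&
            m_Comm != MPI_COMM_SELF)
        {
            MPI_Comm_free(&m_Comm);
        }
        m_Comm = MPI_COMM_NULL;
    }

    // Take ownership of a duplicate of the given communicator.
    static Comm Duplicate(MPI_Comm mpiComm)
    {
        MPI_Comm dup;
        MPI_Comm_dup(mpiComm, &dup);
        return Comm(dup);
    }

    // Split into a new, separately owned communicator.
    Comm Split(int color, int key) const
    {
        MPI_Comm split;
        MPI_Comm_split(m_Comm, color, key, &split);
        return Comm(split);
    }

private:
    explicit Comm(MPI_Comm comm) : m_Comm(comm) {}
    MPI_Comm m_Comm = MPI_COMM_NULL;
};
} // namespace helper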

@pnorbert
Contributor

@bradking Just to clarify: did you stop duplicating the incoming communicator in the ADIOS constructor, and then always duplicate it in the io.Open() method? The move semantics confuse me. What will happen if you call multiple Opens?

@bradking
Collaborator Author

@pnorbert the ADIOS constructor still duplicates the incoming communicator. The duplication is now done via helper::Comm::Duplicate(mpiComm), which is called in ADIOS's constructor.

@chuckatkins
Contributor

@germasch So to discuss at a high level where this is going, the effort here is multi-pronged and this is the first portion of it, eventually to achieve the following:

  • Allow serial and parallel operations in the same library
    • This is admittedly already somewhat manageable with the SMPI wrappers recently put in place
  • Allow an app to use the serial interface without even needing a link-time dependency on MPI.
    • This can be achieved by factoring the underlying MPI components out into a separate library. The details of this are somewhat nuanced but @bradking can speak more to it.
  • Primarily, this is to allow for alternate communication libraries that are not MPI.
    • MPI is the optimal use case for HPC environments but not really for non-HPC environments outside of an RDMA cluster. The communication patterns in the MPI API are pretty basic and, if viewed from a high level, can be implemented in a reasonably straightforward way with a variety of backends, some of which are much more suitable to distributed non-HPC environments (direct 0MQ, Redis-based message broker, etc.). Rather than have these need to be engine specific, if the comm architecture in adios core is adaptable then we can leverage common implementations for things like BP aggregation outside of just the BP engines.
    • This also opens up some good research opportunities for exploring the interprocess communication associated with I/O in different environments.

This is the first step in the adaptation and likely the most disruptive but it provides the framework for the above to proceed in a less disruptive way. Think of it as adding another dimension to the Adaptable in ADIOS.

@eisenhauer
Member

FYI on the occasional KillWriter failure. That may be related to load on the CI machines. Sometimes the writer side gets killed before the reader spins up completely, and then the reader fails because it finds no writer (as opposed to what we're testing, where the writer should exist and be killed so we can test that the reader handles the failure). In current master, I've doubled the time we wait before killing the writer, so if you rebase that particular failure may be less likely.

@germasch
Contributor

Thanks for your reply, @chuckatkins.

  • Allow an app to use the serial interface without even needing a link-time dependency on MPI.

    • This can be achieved by factoring the underlying MPI components out into a separate library. The details of this are somewhat nuanced but @bradking can speak more to it.

So I would generally be curious to know how that'd work. All I can come up with is dynamically loading the MPI library on demand. Fortunately, you guys are the cmake gurus, so you can hopefully figure out how to make that work robustly across systems.

But I'd caution that this will open up a whole new set of potential pitfalls for users. For example, you may easily compile your code against a different version of MPI than the one adios2 then loads. You will still not get rid of the dependency on mpi.h, it needs to be included by a generic (parallel+serial) adios.h. You may pick up the wrong version of the header, and if sizeof(MPI_Comm) is not the same, as is the case with openmpi vs mpich, this leads to mystery crashes. (Even otherwise, picking up two versions of MPI is not a fun problem to debug. @jychoi-hpc may remember the XGC/adios meeting back in May, where the lady got mysterious "MPI not initialized" errors even though MPI_Init was called, which I tracked down to her code ending up linked to two different versions of MPI.)

  • MPI is the optimal use case for HPC environments but not really for non-HPC environments outside of an RDMA cluster. The communication patterns in the MPI API are pretty basic and, if viewed from a high level, can be implemented in a reasonably straightforward way with a variety of backends, some of which are much more suitable to distributed non-HPC environments (direct 0MQ, Redis-based message broker, etc.). Rather than have these need to be engine specific, if the comm architecture in adios core is adaptable then we can leverage common implementations for things like BP aggregation outside of just the BP engines.

I think having an internal communication abstraction might well be very useful, to be used in engines. I also think it's really hard to do this and not end up with something very similar to MPI in the first place. Anyway, my point is that this new abstraction isn't really suited for anything but MPI; it doesn't map to something like 0MQ at all, as far as I can tell.

  • This also opens up some good research opportunities for exploring the interprocess communication associated with I/O in different environments.

As I said, there might well be a point in dealing with different communication environments internally, i.e., in engines, as I think is already being done -- I think I've even seen something 0MQ-based being merged lately, and obviously there's other code doing RDMA. I think those would be the starting points for figuring out what the proper abstractions would look like...

Anyway, one thing I definitely do agree with is that maintaining this externally for a long time can't be fun...

@chuckatkins
Contributor

  • Allow an app to use the serial interface without even needing a link-time dependency on MPI.

    • This can be achieved by factoring the underlying MPI components out into a separate library. The details of this are somewhat nuanced but @bradking can speak more to it.

So I would generally be curious to know how that'd work. All I can come up with is dynamically loading the MPI library on demand. Fortunately, you guys are the cmake gurus, so you can hopefully figure out how to make that work robustly across systems.

@bradking has a pretty good idea of how it needs to be implemented, although outside the scope of this pr. I expect the next push will be adding the serial implementation to the abstraction and then after that splitting the MPI dependency.

You will still not get rid of the dependency on mpi.h, it needs to be included by a generic (parallel+serial) adios.h.

The header could have something like the following, where you could explicitly enable or disable mpi.h with #define ADIOS2_USE_MPI 1 (or 0), or leave it unset for the default to be whatever ADIOS2_HAVE_MPI is (which is the current behavior):

// ADIOS2Config.h (@ADIOS2_HAVE_MPI01@ is generated at configure time as explicitly 0 or 1)
#ifndef ADIOS2_USE_MPI
  #define ADIOS2_USE_MPI @ADIOS2_HAVE_MPI01@
#endif

// adios2.h
#include "ADIOS2Config.h"
...
#if ADIOS2_USE_MPI
  #ifndef ADIOS2_HAVE_MPI
    #error "ADIOS2 was not built with MPI support"
  #endif
  #include <mpi.h>
#endif

That's just an example, anyway. The logic might not be quite right, and details of the use case would need to be worked out more, but you get the idea of how it's doable to manage the interface in a single library where a serial application can disable the MPI interface entirely.

You may pick up the wrong version of the header, and if sizeof(MPI_Comm) is not the same, as is the case with openmpi vs mpich, this leads to mystery crashes.

You might, but that's no different than the way any MPI code is currently set up. It wouldn't "hide" the MPI dependency, just split it out so you would have something like:

find_package(ADIOS2 REQUIRED COMPONENTS MPI)
...
# serial usage
target_link_libraries(foo PRIVATE ADIOS2::ADIOS2)

# parallel usage
target_link_libraries(foo PRIVATE ADIOS2::ADIOS2_MPI)

or something of the sort.

@germasch
Contributor

germasch commented Aug 23, 2019

You will still not get rid of the dependency on mpi.h, it needs to be included by a generic (parallel+serial) adios.h.

The header could have something like the following, where you could explicitly enable or disable mpi.h with #define ADIOS2_USE_MPI 1 (or 0), or leave it unset for the default to be whatever ADIOS2_HAVE_MPI is (which is the current behavior):

Yeah, if you do that, I guess it would be a nicer interface to have adios2_no_mpi.h, which does #undef ADIOS2_USE_MPI and #include <adios2.h>.
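For what it's worth, a hypothetical adios2_no_mpi.h along those lines (purely illustrative, nothing in this PR) would need to pin the macro to 0 rather than just #undef it, so that the #ifndef default in the ADIOS2Config.h sketch above does not switch MPI back on:

// Hypothetical convenience header, not part of this PR: force the serial
// interface regardless of how ADIOS2 was configured.
#ifndef ADIOS2_NO_MPI_H
#define ADIOS2_NO_MPI_H

#ifdef ADIOS2_USE_MPI
#undef ADIOS2_USE_MPI
#endif
// Define to 0 (instead of leaving it undefined) so the #ifndef default in
// ADIOS2Config.h does not re-enable MPI.
#define ADIOS2_USE_MPI 0

#include <adios2.h>

#endif // ADIOS2_NO_MPI_H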

You may pick up the wrong version of the header, and if sizeof(MPI_Comm) is not the same, as is the case with openmpi vs mpich, this leads to mystery crashes.

You might, but that's no different than the way any MPI code is currently set up. It wouldn't "hide" the MPI dependency, just split it out so you would have something like:

You're right on that. Mixing MPIs can and does happen today, so that doesn't really get changed.

find_package(ADIOS2 REQUIRED COMPONENTS MPI)
...
# serial usage
target_link_libraries(foo PRIVATE ADIOS2::ADIOS2)

# parallel usage
target_link_libraries(foo PRIVATE ADIOS2::ADIOS2_MPI)

Well, if using cmake or shared libs, I would say that's actually a step backwards, since today target_link_libraries(foo PRIVATE ADIOS2::ADIOS2) will just work for both serial and parallel programs, with serial or parallel adios2 -- except in the one case where the program is parallel and wants to use parallel adios2, but adios2 has been built serially (which just isn't fixable).

So I did, however, come to think of an issue that I'm really not sure is solvable: The typical situation of an MPI app using the MPI-parallel adios will go like this:

  MPI_Init(...);
  adios2::ADIOS ad(MPI_COMM_WORLD);

So by the time control passes to adios2, MPI has already been initialized, but the adios2 library does not yet have access to its symbols (because it is not directly linked to it). At the time the constructor is called, adios2 figures out that it's being used with an MPI communicator, so it now definitely needs to link to the MPI library -- but it doesn't have access to the MPI symbols yet. If MPI has been linked statically into the executable, there is just no way to get access to its symbols. What I had been imagining was that adios2 at this point would dlopen the MPI library and then access its symbols through dlsym. But wouldn't this mechanism load a new instance of the MPI library, which hasn't even been initialized? Is it possible to get access to symbols from shared libraries that were loaded normally at startup (rather than via dlopen)?

ETA: Some googling seems to say that it should be possible to query symbols from previously loaded libraries even if they were not originally dynamically loaded. So I guess it probably is possible, though it won't work on systems that statically link executables (like many Crays by default).
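For reference, the mechanism in that last paragraph looks roughly like this on POSIX systems: dlsym against the handle returned by dlopen(NULL, ...) searches the symbols of the main program and the shared libraries it was loaded with. A minimal sketch (my own illustration, assuming a Linux-style dynamic loader; not code from this PR):

// Illustrative only: look up an MPI symbol that the executable (or one of its
// already-loaded shared libraries) exports, without linking this file against
// MPI.  Build with -ldl on Linux.
#include <dlfcn.h>
#include <cstdio>

int main()
{
    // dlopen(NULL, ...) returns a handle for the main program; dlsym on that
    // handle searches the global symbol scope, i.e. libraries that were
    // loaded when the program started.
    void *self = dlopen(nullptr, RTLD_LAZY | RTLD_GLOBAL);
    using InitializedFn = int (*)(int *);
    auto mpiInitialized =
        reinterpret_cast<InitializedFn>(dlsym(self, "MPI_Initialized"));
    if (!mpiInitialized)
    {
        std::printf("no MPI symbols loaded in this process\n");
        return 0;
    }
    int flag = 0;
    mpiInitialized(&flag);
    std::printf("MPI_Initialized flag = %d\n", flag);
    return 0;
}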

@chuckatkins
Contributor

That was really just an example of how it might work. The header and cmake mechanism for having a single library supporting both serial and parallel without forcing MPI linkage is TBD, but there are some options on how it can be done. We'll certainly iterate on it to make sure it's a clean interface and can maintain backwards compatibility with existing integrations. Either way, the details for the yet-to-be-implemented mechanisms are out of scope here; we can delve into detail in the appropriate PR when it's ready.

What I had been imagining was that adios2 at this point would dlopen the MPI library and then access its symbols through dlsym. But wouldn't this mechanism load a new instance of the MPI library, which hasn't even been initialized? Is it possible to get access to symbols from previously regularly loaded shared libraries at runtime?

There won't be any need for dlopen, as it can be done entirely at link time in a way that will work for both static and shared. Again though, more details to follow in a later PR.

  • Encapsulate a pending async operation and its result status. This may require multiple `MPI_Request` instances in case we split large operations into a batch of multiple MPI operations. In the case of a batched operation, present the results as a combined status.
  • Make it a private implementation detail.
  • All of the functions have been migrated to the Comm encapsulation.
  • Also add a `Comm::ReduceInPlace` variant for `MPI_IN_PLACE`.
@germasch
Contributor

Well, maybe we've been talking past each other. But it's always been possible to install separate parallel and serial versions of the lib, and obviously those could be combined into one build. What started this discussion originally was the desire to have a single library in which the serial adios2 works without having to call MPI_Init, which previously didn't work for an adios2 that was built with MPI. That is, have the parallel lib include the serial functionality. That's now possible, but that parallel adios2 lib will need to be linked with MPI, and mpi.h needs to be available when #include <adios2.h> is used. What I was thinking is that the only way to avoid the link dependency on libmpi is to use dlopen.

Of course, the other way is to link to a different version of libadios2, but that's essentially always been possible; it just causes more hassle. That's the original point from @ax3l, I think.

A specific example is the Python interface. What do you do when you import adios2? Previously, if adios2 was built with MPI, one would first have to initialize MPI before it became usable. Currently, as long as one uses the serial interface, it'll just work, and the parallel interface will work after mpi4py/MPI_Init. But if you have to link to different libadios2 libraries that either link with MPI or not, you'll need to have two Python adios2 modules, too, right?

I dug out the original issue: #1443


@ax3l
Contributor

ax3l commented Aug 26, 2019

Uff, that's a long thread and now I got pinged on it ;) First of all: awesome to abstract the communication layer away, I am totally for it. I see a lot of potential in swapping it, e.g. to ZeroMQ, GASNet or Takyon, who knows.

As long as I am able to use a "parallel (multi-process) library" downstream for serial (single-process) purposes as well, I am happy. The PR looks like it addresses that need well. Here is my original issue: ornladios/ADIOS#183. In other words: "parallel" features should be a compatible extension to serial I/O, in terms of API & ABI functionality. (Yes, even if we mainly care about the latter.) Just because we compiled parallel features does not imply a user uses them downstream (aka MPI_Init should neither be expected nor implicitly called unless explicitly requested/included; MPI/multi-process comm. defines should only be included if explicitly requested, etc.).

Ideally, it's one library (or a compatible, to-be-linked-together set of libraries) with a varying set of compatible includes/imports. The latter aspect is still missing from what I see in the new helper, but this PR is definitely another good step in that direction.

@chuckatkins
Contributor

chuckatkins commented Aug 26, 2019

Given that there don't seem to be any issues with this PR itself, but rather with how the split works after it, I'm calling the PR good to merge; we can iterate through the remaining issues in the next phase as they come up.

@chuckatkins chuckatkins merged commit 86e0a96 into ornladios:master Aug 26, 2019
@bradking bradking deleted the mpi-encapsulation branch August 27, 2019 13:17