
Encapsulate multi-process communication #1691

Merged: 42 commits merged into ornladios:master on Aug 26, 2019

Conversation

bradking
Collaborator

Much of ADIOS's functionality is applicable to any multi-process communication environment, not just MPI. Disconnect the non-MPI-specific parts of our implementation from MPI. Replace direct use of MPI_Comm and MPI functions with a helper::Comm wrapper that encapsulates MPI functionality. This makes room for future abstraction of multi-process communication behind the encapsulated API.

Start by encapsulating MPI functions and the types MPI_Comm, MPI_Request, and MPI_Status behind wrappers in helper::Comm. For now leave encapsulation of MPI constants to future work.

@germasch
Contributor

Much of ADIOS's functionality is applicable to any multi-process communication environment, not just MPI.

So I have some concerns, in particular: what is "any multi-process communication environment"? I think it's pretty clear that MPI is going to stay with us (i.e., in HPC) at least through the exascale era, and other things I have heard of, like Spark or Hadoop, don't have an API that's anything like this abstraction.

I totally agree that wrapping and abstracting an API is the right way to go if one wants to support multiple implementations or variants of something. But I also think that is something that should be done at the point where one has an actual need. Because if there is only one API, which is itself already built on a certain abstraction (like MPI is), there's really no way to know which abstractions are the right ones to achieve the future goal of abstracting away differences. So I think all this really does here is provide a C++ interface to MPI -- it doesn't actually even solve the case that originally kicked off this discussion, i.e., the case where MPI is not present.

In terms of a C++ API -- well, it's been tried, and was eventually dropped, because people preferred to stick with the familiar C API and the C++ bindings didn't offer much value. This is kind of my concern with this one -- everyone who does HPC is already familiar with MPI. An additional wrapping layer obscures that, so everyone contributing code has to familiarize themselves with this new abstraction, and potentially implement more missing wrappers. And there is still C code in adios2, which can't use this new abstraction, since it'd need yet another layer of C binding wrappers.

I guess if one really wants to go with a C++ wrapping layer, I'd say one should at least take advantage of C++ lifetime semantics, i.e., implement a copy constructor that does MPI_Comm_dup and a destructor that does MPI_Comm_free, so one source of bugs would be eliminated (along the lines of how you're already nicely using move semantics).

Anyway, this is just my personal opinion -- the implementation looks perfectly good, but I don't really see the rationale for adding a layer of wrappers. It's very possible that there is some distributed-memory communication environment that I'm not aware of, though.

@bradking
Collaborator Author

it doesn't actually even solve the case that originally kicked off this discussion, i.e., the case where MPI is not present.

This is just the first step. Additional work will follow. However, I can't continue to maintain this branch outside of master because every rebase takes hours of catching up with new uses of MPI. This plan was discussed in a call months ago.

In terms of a C++ API -- well, it's been tried, and was eventually dropped, because people preferred to stick with the familiar C API and the C++ bindings didn't offer much value.

They don't add much value to upstream MPI implementations, but here we need an abstraction layer.

implement a copy constructor that does MPI_Comm_dup and a destructor that does MPI_Comm_free

The destructor already frees it if an explicit Free has not been called already. I intentionally deleted the copy constructor and copy-assignment operator of helper::Comm to avoid accidental duplication. A dedicated Duplicate method could be added when needed. There is already a Split method.
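For illustration, here is a minimal sketch of what such an RAII communicator wrapper can look like -- my own sketch of the semantics described above (deleted copy operations, move semantics, a Duplicate entry point, a Split method, and a destructor that frees the communicator unless Free was already called), not the actual helper::Comm source:

// Illustrative sketch only, not the adios2 implementation.
#include <mpi.h>
#include <utility>

namespace helper
{
class Comm
{
public:
    Comm() = default;

    // Copying is deleted to avoid accidental communicator duplication.
    Comm(const Comm &) = delete;
    Comm &operator=(const Comm &) = delete;

    // Moves transfer ownership and leave the source empty.
    Comm(Comm &&other) noexcept
    : m_Comm(std::exchange(other.m_Comm, MPI_COMM_NULL))
    {
    }
    Comm &operator=(Comm &&other) noexcept
    {
        if (this != &other)
        {
            Free();
            m_Comm = std::exchange(other.m_Comm, MPI_COMM_NULL);
        }
        return *this;
    }

    // The destructor frees the communicator if Free() was not called already.
    ~Comm() { Free(); }

    void Free()
    {
        if (m_Comm != MPI_COMM_NULL && m_Comm != MPI_COMM_WORLD &&
            m_Comm != MPI_COMM_SELF)
        {
            MPI_Comm_free(&m_Comm);
        }
        m_Comm = MPI_COMM_NULL;
    }

    // Take ownership of a duplicate of the given communicator.
    static Comm Duplicate(MPI_Comm mpiComm)
    {
        MPI_Comm dup;
        MPI_Comm_dup(mpiComm, &dup);
        return Comm(dup);
    }

    // Split into a new, separately owned communicator.
    Comm Split(int color, int key) const
    {
        MPI_Comm split;
        MPI_Comm_split(m_Comm, color, key, &split);
        return Comm(split);
    }

private:
    explicit Comm(MPI_Comm comm) : m_Comm(comm) {}
    MPI_Comm m_Comm = MPI_COMM_NULL;
};
} // namespace helper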

@pnorbert
Contributor

@bradking Just to clarify: did you stop duplicating the incoming communicator in the ADIOS constructor, and then always duplicate it in the io.Open() method? The move semantics confuse me. What will happen if you call multiple Opens?

@bradking
Collaborator Author

@pnorbert the ADIOS constructor still duplicates the incoming communicator. The duplication is now done via helper::Comm::Duplicate(mpiComm), which is called in ADIOS's constructor.

@chuckatkins
Contributor

@germasch So to discuss at a high level where this is going, the effort here is multi-pronged and this is the first portion of it, eventually to achieve the following:

  • Allow serial and parallel operations in the same library
    • This is admittedly already somewhat manageable with the SMPI wrappers recently put in place
  • Allow an app to use the serial interface without even needing a link-time dependency on MPI.
    • This can be achieved by factoring the underlying MPI components out into a separate library. The details of this are somewhat nuanced but @bradking can speak more to it.
  • Primarily, this is to allow for alternate communication libraries that are not MPI.
    • MPI is the optimal use case for HPC environments but not really for non-HPC environments outside of an RDMA cluster. The communication patterns in the MPI API are pretty basic and, if viewed from a high level, can be implemented in a reasonably straightforward way with a variety of backends, some of which are much more suitable to distributed non-HPC environments (direct 0MQ, Redis-based message broker, etc.). Rather than have these need to be engine specific, if the comm architecture in adios core is adaptable then we can leverage common implementations for things like BP aggregation outside of just the BP engines.
    • This also opens up some good research opportunities for exploring the interprocess communication associated with I/O in different environments.

This is the first step in the adaptation and likely the most disruptive but it provides the framework for the above to proceed in a less disruptive way. Think of it as adding another dimension to the Adaptable in ADIOS.

@eisenhauer
Member

FYI on the occasional KillWriter failure. That may be related to load on the CI machines. Sometimes the writer side gets killed before the reader spins up completely, and then the reader fails because it finds no writer (as opposed to what we're testing, where the writer should exist and be killed so we can test that the reader handles the failure). In current master, I've doubled the time we wait before killing the writer, so if you rebase that particular failure may be less likely.

@germasch
Contributor

Thanks for your reply, @chuckatkins.

  • Allow an app to use the serial interface without even needing a link-time dependency on MPI.

    • This can be achieved by factoring the underlying MPI components out into a separate library. The details of this are somewhat nuanced but @bradking can speak more to it.

So I would generally be curious to know how that'd work. All I can come up with is dynamically loading the MPI library on demand. Fortunately, you guys are the cmake gurus, so you can hopefully figure out how to make that work robustly across systems.

But I'd caution that this will open up a whole new set of potential pitfalls for users. For example, you may easily compile your code against a different version of MPI than the one adios2 then loads. You will still not get rid of the dependency on mpi.h, it needs to be included by a generic (parallel+serial) adios.h. You may pick up the wrong version of the header, and if sizeof(MPI_Comm) is not the same, as is the case with openmpi vs mpich, this leads to mystery crashes. (Even otherwise, picking up two versions of MPI is not a fun problem to debug. @jychoi-hpc may remember the XGC/adios meeting back in May, where the lady got mysterious "MPI not initialized" errors even though MPI_Init was called, which I tracked down to her code ending up linked to two different versions of MPI.)

  • MPI is the optimal use case for HPC environments but not really for non-HPC environments outside of an RDMA cluster. The communication patterns in the MPI API are pretty basic and, if viewed from a high level, can be implemented in a reasonably straightforward way with a variety of backends, some of which are much more suitable to distributed non-HPC environments (direct 0MQ, Redis-based message broker, etc.). Rather than have these need to be engine specific, if the comm architecture in adios core is adaptable then we can leverage common implementations for things like BP aggregation outside of just the BP engines.

I think having an internal communication abstraction might well be very useful, to be used in engines. I also think it's really hard to do this and not end up with something very similar to MPI in the first place. Anyway, my point is that this new abstraction isn't really suited for anything but MPI; it doesn't map to something like 0MQ at all, as far as I can tell.

  • This also opens up some good research opportunities for exploring the interprocess communication associated with I/O in different environments.

As I said, there might well be a point in dealing with different communication environments internally, i.e., in engines, as I think is already being done -- I think I've even seen something 0MQ-based being merged lately, and obviously there's other code doing RDMA. I think those would be the starting points for figuring out what the proper abstractions would look like...

Anyway, one thing I definitely do agree with is that maintaining this externally for a long time can't be fun...

@chuckatkins
Contributor

  • Allow an app to use the serial interface without even needing a link-time dependency on MPI.

    • This can be achieved by factoring the underlying MPI components out into a separate library. The details of this are somewhat nuanced but @bradking can speak more to it.

So I would generally be curious to know how that'd work. All I can come up with is dynamically loading the MPI library on demand. Fortunately, you guys are the cmake gurus, so you can hopefully figure out how to make that work robustly across systems.

@bradking has a pretty good idea of how it needs to be implemented, although outside the scope of this pr. I expect the next push will be adding the serial implementation to the abstraction and then after that splitting the MPI dependency.

You will still not get rid of the dependency on mpi.h, it needs to be included by a generic (parallel+serial) adios.h.

The header could have something like the following, where you could explicitly enable or disable mpi.h with #define ADIOS2_USE_MPI 1 (or 0), or leave it unset for the default to be whatever ADIOS2_HAVE_MPI is (which is the current behavior):

// ADIOS2Config.h (@ADIOS2_HAVE_MPI01@ is generated at configure time as explicitly 0 or 1)
#ifndef ADIOS2_USE_MPI
  #define ADIOS2_USE_MPI @ADIOS2_HAVE_MPI01@
#endif

// adios2.h
#include "ADIOS2Config.h"
...
#if ADIOS2_USE_MPI
  #ifndef ADIOS2_HAVE_MPI
    #error "ADIOS2 was not built with MPI support"
  #endif
  #include <mpi.h>
#endif

That's just an example, anyway. The logic might not be quite right, and details of the use case would need to be worked out more, but you get the idea of how it's doable to manage the interface in a single library where a serial application can disable the MPI interface entirely.

You may pick up the wrong version of the header, and if sizeof(MPI_Comm) is not the same, as is the case with openmpi vs mpich, this leads to mystery crashes.

You might, but that's no different than the way any MPI code is currently set up. It wouldn't "hide" the MPI dependency, just split it out so you would have something like:

find_package(ADIOS2 REQUIRED COMPONENTS MPI)
...
# serial usage
target_link_libraries(foo PRIVATE ADIOS2::ADIOS2)

# parallel usage
target_link_libraries(foo PRIVATE ADIOS2::ADIOS2_MPI)

or something of the sort.

@germasch
Contributor

germasch commented Aug 23, 2019

You will still not get rid of the dependency on mpi.h, it needs to be included by a generic (parallel+serial) adios.h.

The header could have something like the following, where you could explicitly enable or disable mpi.h with #define ADIOS2_USE_MPI 1 (or 0), or leave it unset for the default to be whatever ADIOS2_HAVE_MPI is (which is the current behavior):

Yeah, if you do that, I guess it would be a nicer interface to have adios2_no_mpi.h, which does #undef ADIOS2_USE_MPI and #include <adios2.h>.
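For what it's worth, a hypothetical adios2_no_mpi.h along those lines (purely illustrative, nothing in this PR) would need to pin the macro to 0 rather than just #undef it, so that the #ifndef default in the ADIOS2Config.h sketch above does not switch MPI back on:

// Hypothetical convenience header, not part of this PR: force the serial
// interface regardless of how ADIOS2 was configured.
#ifndef ADIOS2_NO_MPI_H
#define ADIOS2_NO_MPI_H

#ifdef ADIOS2_USE_MPI
#undef ADIOS2_USE_MPI
#endif
// Define to 0 (instead of leaving it undefined) so the #ifndef default in
// ADIOS2Config.h does not re-enable MPI.
#define ADIOS2_USE_MPI 0

#include <adios2.h>

#endif // ADIOS2_NO_MPI_H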

You may pick up the wrong version of the header, and if sizeof(MPI_Comm) is not the same, as is the case with openmpi vs mpich, this leads to mystery crashes.

You might, but that's no different than the way any MPI code is currently set up. It wouldn't "hide" the MPI dependency, just split it out so you would have something like:

You're right on that. Mixing MPIs can and does happen today, so that doesn't really get changed.

find_package(ADIOS2 REQUIRED COMPONENTS MPI)
...
# serial usage
target_link_libraries(foo PRIVATE ADIOS2::ADIOS2)

# parallel usage
target_link_libraries(foo PRIVATE ADIOS2::ADIOS2_MPI)

Well, if using cmake or shared libs, I would say that's actually a step backwards, since today target_link_libraries(foo PRIVATE ADIOS2::ADIOS2) will just work for both serial and parallel programs, with serial or parallel adios2 -- except in the one case where the program is parallel and wants to use parallel adios2, but adios2 has been built serially (which just isn't fixable).

So I did, however, come to think of an issue that I'm really not sure is solvable: The typical situation of an MPI app using the MPI-parallel adios will go like this:

  MPI_Init(...);
  adios2::ADIOS ad(MPI_COMM_WORLD);

So by the time control passes to adios2, MPI has already been initialized, but the adios2 library does not yet have access to its symbols (because it is not directly linked to it). At the time the constructor is called, adios2 figures out that it's being used with an MPI communicator, so it now definitely needs to link to the MPI library -- but it doesn't have access to the MPI symbols yet. If MPI has been linked statically into the executable, there is just no way to get access to its symbols. What I had been imagining was that adios2 at this point would dlopen the MPI library and then access its symbols through dlsym. But wouldn't this mechanism load a new instance of the MPI library, which hasn't even been initialized? Is it possible to get access to symbols from shared libraries that were loaded normally at startup (rather than via dlopen)?

ETA: Some googling seems to say that it should be possible to query symbols from previously loaded libraries even if they were not originally dynamically loaded. So I guess it probably is possible, though it won't work on systems that statically link executables (like many Crays by default).
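For reference, the mechanism in that last paragraph looks roughly like this on POSIX systems: dlsym against the handle returned by dlopen(NULL, ...) searches the symbols of the main program and the shared libraries it was loaded with. A minimal sketch (my own illustration, assuming a Linux-style dynamic loader; not code from this PR):

// Illustrative only: look up an MPI symbol that the executable (or one of its
// already-loaded shared libraries) exports, without linking this file against
// MPI.  Build with -ldl on Linux.
#include <dlfcn.h>
#include <cstdio>

int main()
{
    // dlopen(NULL, ...) returns a handle for the main program; dlsym on that
    // handle searches the global symbol scope, i.e. libraries that were
    // loaded when the program started.
    void *self = dlopen(nullptr, RTLD_LAZY | RTLD_GLOBAL);
    using InitializedFn = int (*)(int *);
    auto mpiInitialized =
        reinterpret_cast<InitializedFn>(dlsym(self, "MPI_Initialized"));
    if (!mpiInitialized)
    {
        std::printf("no MPI symbols loaded in this process\n");
        return 0;
    }
    int flag = 0;
    mpiInitialized(&flag);
    std::printf("MPI_Initialized flag = %d\n", flag);
    return 0;
}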

@chuckatkins
Contributor

That was really just an example of how it might work. The header and cmake mechanism for having a single library supporting both serial and parallel without forcing MPI linkage is TBD, but there are some options on how it can be done. We'll certainly iterate on it to make sure it's a clean interface and can maintain backwards compatibility with existing integrations. Either way, the details for the yet-to-be-implemented mechanisms are out of scope here; we can delve into detail in the appropriate PR when it's ready.

What I had been imagining was that adios2 at this point would dlopen the MPI library and then access its symbols through dlsym. But wouldn't this mechanism load a new instance of the MPI library, which hasn't even been initialized? Is it possible to get access to symbols from previously regularly loaded shared libraries at runtime?

There won't be any need for dlopen, as it can be done entirely at link time in a way that will work for both static and shared. Again though, more details to follow in a later PR.

  • Encapsulate a pending async operation and its result status. This may require multiple `MPI_Request` instances in case we split large operations into a batch of multiple MPI operations. In the case of a batched operation, present the results as a combined status.
  • Make it a private implementation detail.
  • All of the functions have been migrated to the Comm encapsulation.
  • Also add a `Comm::ReduceInPlace` variant for `MPI_IN_PLACE`.
@germasch
Contributor

Well, maybe we've been talking past each other. But it's always been possible to install separate parallel and serial versions of the lib, and obviously those could be combined into one build. What started this discussion originally was the desire to have a single library in which the serial adios2 works without having to call MPI_Init, which previously didn't work for an adios2 that was built with MPI. That is, have the parallel lib include the serial functionality. That's now possible, but that parallel adios2 lib will need to be linked with MPI, and mpi.h needs to be available when #include <adios2.h> is used. What I was thinking is that the only way to avoid the link dependency on libmpi is to use dlopen.

Of course, the other way is to link to a different version of libadios2, but that's essentially always been possible; it just causes more hassle. That's the original point from @ax3l, I think.

A specific example is the Python interface. What do you do when you import adios2? Previously, if adios2 was built with MPI, one would first have to initialize MPI before it became usable. Currently, as long as one uses the serial interface, it'll just work, and the parallel interface will work after mpi4py/MPI_Init. But if you have to link to different libadios2 libraries that either link with MPI or not, you'll need to have two Python adios2 modules, too, right?

I dug out the original issue: #1443


@ax3l
Contributor

ax3l commented Aug 26, 2019

Uff, that's a long thread and now I got pinged on it ;) First of all: awesome to abstract the communication layer away, I am totally for it. I see a lot of potential in swapping it, e.g. to ZeroMQ, GASNet or Takyon, who knows.

As long as I am able to use a "parallel (multi-process) library" downstream for serial (single-process) purposes as well, I am happy. The PR looks like it addresses that need well. Here is my original issue: ornladios/ADIOS#183. In other words: "parallel" features should be a compatible extension to serial I/O, in terms of API & ABI functionality. (Yes, even if we mainly care about the latter.) Just because we compiled parallel features does not imply a user uses them downstream (aka MPI_Init should neither be expected nor implicitly called unless explicitly requested/included; MPI/multi-process comm. defines should only be included if explicitly requested, etc.).

Ideally, it's one library (or a compatible, to-be-linked-together set of libraries) with a varying set of compatible includes/imports. The latter aspect is still missing from what I see in the new helper, but this PR is definitely another good step in that direction.

@chuckatkins
Contributor

chuckatkins commented Aug 26, 2019

Given that there don't seem to be any issues with this PR itself, but rather with how the split works after it, I'm calling the PR good to merge; we can iterate through the remaining issues in the next phase as they come up.

@chuckatkins chuckatkins merged commit 86e0a96 into ornladios:master Aug 26, 2019
@bradking bradking deleted the mpi-encapsulation branch August 27, 2019 13:17