Child start handling for `Process ()` is broken. #8

Comments
Some background (from Jira): When running `addChild`/`removeChild` in a loop using the Supervisor API, we noticed a memory leak. I don't know if this is related to DPP-72.
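To make the scenario concrete, here is a sketch of the kind of churn loop that exhibits the leak. It is illustrative only: the child spec is assumed to exist, and the client calls (`startNewChild`, `terminateChild`, `deleteChild`, `childKey`) are the Supervisor API names used later in this thread.

```haskell
import Control.Monad (replicateM_)
import Control.Distributed.Process (Process, ProcessId)
import Control.Distributed.Process.Supervisor

-- Hypothetical reproduction: every add/remove cycle of a child whose
-- ChildStart was built from a bare Process () action leaves its starter
-- process behind, so resident memory grows with the iteration count.
churn :: ProcessId -> ChildSpec -> Process ()
churn sup spec = replicateM_ 10000 $ do
  _ <- startNewChild sup spec
  _ <- terminateChild sup (childKey spec)
  _ <- deleteChild sup (childKey spec)
  return ()
```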
Going back to the original ask which caused us to introduce `StarterProcess`:

```haskell
runPubSubActor :: PubSub -> Process ()
runPubSubActor = ...

init :: PubSub -> Process ()
init pubSubCtx = do
    -- replacing Supervisor.run
    runLocal (restartAll ..) [pubSubActorSpec]
  where
    pubSubActorSpec = LocalChildSpec
      { childKey = "pubsub-bridge", ..., childRun = runPubSubActor pubSubCtx }
```

I think now that haskell-distributed/distributed-process-client-server#9 would solve this nicely, since you could encapsulate the […]
After spending a few hours searching, I cannot for the life of me find out how to "turn a `Process ()` expression into a closure without template haskell", ignoring the fact that the […]
For future reference, there is a detailed outline about how to make […]
Is anyone (@hyperthunk @tdietert?) able to summarise where this is at and if there's a path forward?
@mcfilib the best I can do is point you to my workaround: converting a `Process ()` into a `Closure (Process ())` […]
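For anyone landing here with the same question, the usual template-haskell route looks something like the sketch below. The module and names are illustrative, and this is not necessarily the exact workaround referred to above.

```haskell
{-# LANGUAGE TemplateHaskell #-}
module Worker where

import Control.Distributed.Process
import Control.Distributed.Process.Closure (mkStaticClosure, remotable)
import Control.Monad (forever)

-- A plain Process () action we want to hand to spawn (or a supervisor).
echoChild :: Process ()
echoChild = forever $ do
  msg <- expect :: Process String
  say msg

remotable ['echoChild]

-- Note: the generated Worker.__remoteTable must be folded into the
-- node's RemoteTable when the LocalNode is created.
echoChildClosure :: Closure (Process ())
echoChildClosure = $(mkStaticClosure 'echoChild)
```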
I'll take a look soon, I've been away from the project a while but will try and get back to you ASAP...
Okay folks, this is a bit convoluted. So I previously fixed this bug by refactoring the `ToChildStart` implementation. The current implementation works fine, there aren't any races for […]. Now I personally think we can close this bug, however I'm happy to discuss further.

The original code looked like this:

```haskell
instance ToChildStart (Process ()) where
  toChildStart proc = do
      starterPid <- spawnLocal $ do
        -- note [linking]: the first time we see the supervisor's pid,
        -- we must link to it, but only once, otherwise we simply waste
        -- time and resources creating duplicate links
        (supervisor, _, sendPidPort) <- expectTriple
        link supervisor
        spawnIt proc supervisor sendPidPort
        tcsProcLoop proc
      return (StarterProcess starterPid)
    where
      tcsProcLoop :: Process () -> Process ()
      tcsProcLoop p = forever' $ do
        (supervisor, _, sendPidPort) <- expectTriple
        spawnIt p supervisor sendPidPort

      spawnIt :: Process ()
              -> SupervisorPid
              -> SendPort ChildPid
              -> Process ()
      spawnIt proc' supervisor sendPidPort = do
        supervisedPid <- spawnLocal $ do
          link supervisor
          self <- getSelfPid
          (proc' `catches` [ Handler $ filterInitFailures supervisor self
                           , Handler $ logFailure supervisor self ])
            `catchesExit` [\_ m -> handleMessageIf m (== ExitShutdown)
                                                    (\_ -> return ())]
        sendChan sendPidPort supervisedPid
```

As you can probably tell by correlating that with my original bug report, those pids can leak: each starter spins in `tcsProcLoop` forever, and nothing reaps it once the corresponding child spec is deleted.
@mcfilib I'm London based, so if you're not too far off in the UK, we could do a co-lo and talk through any issues on this and other platform libraries you're working on/with. I don't have the ability to contribute much by way of coding at the moment (though I will get back to this eventually), but I'm more than happy to help get people up to speed with what's there.
Yep, that's a deliberate design decision. You cannot pass a `Process ()` action over the wire, since it isn't serializable.

Note that it is possible to work around this issue, sort of. There are two places where I cheat and use STM to communicate with intra-node peer processes, but obviously this isn't a good idea for the Supervisor module, since you end up with a supervisor that cannot support remote clients. Worse than that, if the API supported this notion (using STM) for local clients but not remote ones, then there's no type safe way to enforce the notion that remote clients can't use your supervisor if they're supplying a `Process ()` rather than a `Closure (Process ())`.
Here are some notable examples of using STM to communicate with intra-node (but not truly remote) processes. They can be useful for coordinating with code that runs outside the `Process` monad.

The first is in distributed-process-client-server. This code allows us to wrap an input handler for an arbitrary STM action, which will be executed in the server loop along with all the inbound message handlers supplied in the server definition. The salient bits of the documentation start here. A few snippets of code:

```haskell
demoExternal = do
  inChan <- liftIO newTQueueIO
  replyQ <- liftIO newTQueueIO
  let procDef = statelessProcess {
                  apiHandlers = [
                    handleExternal
                      (readTQueue inChan)
                      (\s (m :: String) -> do
                        liftIO $ atomically $ writeTQueue replyQ m
                        continue s)
                  ]
                }
  let txt = "hello 2-way stm foo"
  pid <- spawnLocal $ serve () (statelessInit Infinity) procDef
  echoTxt <- liftIO $ do
    -- firstly we write something that the server can receive
    atomically $ writeTQueue inChan txt
    -- then sit and wait for it to write something back to us
    atomically $ readTQueue replyQ
  say (show $ echoTxt == txt)
```

and

```haskell
data StmServer = StmServer { serverPid  :: ProcessId
                           , writerChan :: TQueue String
                           , readerChan :: TQueue String
                           }

instance Resolvable StmServer where
  resolve = return . Just . serverPid

echoStm :: StmServer -> String -> Process (Either ExitReason String)
echoStm StmServer{..} = callSTM serverPid
                                (writeTQueue writerChan)
                                (readTQueue readerChan)

launchEchoServer :: CallHandler () String String -> Process StmServer
launchEchoServer handler = do
  (inQ, replyQ) <- liftIO $ do
    cIn <- newTQueueIO
    cOut <- newTQueueIO
    return (cIn, cOut)
  let procDef = statelessProcess {
                  apiHandlers = [
                    handleCallExternal
                      (readTQueue inQ)
                      (writeTQueue replyQ)
                      handler
                  ]
                }
  pid <- spawnLocal $ serve () (statelessInit Infinity) procDef
  return $ StmServer pid inQ replyQ

testExternalCall :: TestResult Bool -> Process ()
testExternalCall result = do
  let txt = "hello stm-call foo"
  srv <- launchEchoServer (\st (msg :: String) -> reply msg st)
  echoStm srv txt >>= stash result . (== Right txt)
```

And finally we've got this, which uses the `pCopy` machinery (see [note: pcopy] in the code) to hand an unserialisable STM binding to a local peer:

```haskell
broadcastClient :: Exchange -> Process (InputStream Message)
broadcastClient ex@Exchange{..} = do
  myNode <- getSelfNode
  us     <- getSelfPid
  if processNodeId pid == myNode -- see [note: pcopy]
     then do (sp, rp) <- newChan
             configureExchange ex $ pCopy (BindSTM us sp)
             mRef <- P.monitor pid
             P.finally (receiveWait [ matchChanP rp
                                    , handleServerFailure mRef ])
                       (P.unmonitor mRef)
     else do (sp, rp) <- newChan :: Process (Channel Message)
             configureExchange ex $ BindPort us sp
             mRef <- P.monitor pid
             P.finally (receiveWait [
                          match (\(_ :: BindOk)   -> return $ newInputStream $ Left rp)
                        , match (\(f :: BindFail) -> die f)
                        , handleServerFailure mRef
                        ])
                       (P.unmonitor mRef)
```

I believe that I used […]. IIRC I was also considering using […]
So... Does anyone object to this issue being closed, since the offending code is no longer present?
@hyperthunk @tdietert thank you both for your replies - immensely helpful. I don't have any objections to this being closed (genuinely wrote "closured" there and corrected myself). @hyperthunk thank you too for the offer of an IRL meet-up. I'm in Cardiff at the minute and I think I have everything I need, but I'll let you know if that changes.
Although this issue is now closed, please consider the proposal in haskell-distributed/distributed-process#400. I may implement this in a branch, once I've resolved the outstanding pull request and updated dependencies to support the latest stack ecosystem.
For the benefit of future readers, and @tdietert in particular, with the new static pointers capability, you can make your closures very easily...

```haskell
awaitsRegistration :: Process ()
awaitsRegistration = do
    self <- getSelfPid
    nid <- expect :: Process NodeId
    runUntilRegistered nid self
    say $ regName ++ " registered to " ++ show self
    expect :: Process ()
  where
    runUntilRegistered nid us = do
      whereisRemoteAsync nid regName
      receiveWait [
          matchIf (\(WhereIsReply n (Just p)) -> n == regName && p == us)
                  (const $ return ())
        ]

regName :: String
regName = "testRegisterRemote"

awaitsRegStatic :: Static (Process ())
awaitsRegStatic = staticPtr $ static awaitsRegistration

awaitsRegClosure :: Closure (Process ())
awaitsRegClosure = staticClosure awaitsRegStatic

testRegistryMonitoring :: LocalNode -> NodeId -> Assertion
testRegistryMonitoring node nid =
  runProcess node $ do
    pid <- spawn nid awaitsRegClosure
    -- etc etc...
```
@hyperthunk nice! Where does this capability come from? How do we get our hands on it?
I'm going to be looking at rolling it into a release shortly. Various bits of distributed-process get an update to handle the static pointer support that's already available in distributed-static. :)
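For reference, a minimal sketch of what the example above relies on. The `static` keyword comes from GHC's `StaticPointers` extension (GHC 7.10+), while `Static`, `staticPtr` and `staticClosure` live in distributed-static's `Control.Distributed.Static`; the `hello` example itself is ours, not from the thread.

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Process (Process, say)
import Control.Distributed.Static (Closure, Static, staticClosure, staticPtr)

hello :: Process ()
hello = say "hello from a static closure"

-- static is a GHC keyword; staticPtr lifts the StaticPtr it produces
-- into distributed-static's Static, and staticClosure wraps that as a
-- Closure that spawn can ship to a remote node.
helloStatic :: Static (Process ())
helloStatic = staticPtr $ static hello

helloClosure :: Closure (Process ())
helloClosure = staticClosure helloStatic
```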
I'm eager to see this! Is there a separate issue to track its progress?
I'll set one up shortly. I'm recovering from a cold at the moment!
@hyperthunk where does the `staticPtr` function come from?
First reported as part of haskell-distributed/distributed-process-platform#77, there is a serious fault in the `ToChildStart` instance for `Process ()`. Not only do we potentially spin up and subsequently leak starter processes, the whole premise of this approach is wrong, since it can lead to a supervisor waiting indefinitely for a child to start. This breaks the contract between parent and child processes and goes against the design principles of supervision as laid out in the original OTP implementation.

I propose we remove this instance and leave it to implementors to define, but also that we remove `StarterPid` from the data type, since we have no clean solutions for using these without hitting the issues mentioned above. Specifically, spawn should be asynchronous, and the indication that a child has died should come from monitor signals, such that once the child has spawned, the monitor signal should be established before the code has a chance to proceed (and potentially crash prior to monitoring being properly established); a sketch of this handshake follows.
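To make that protocol concrete, a minimal sketch assuming only the core distributed-process primitives (`spawnAndMonitor` and the unit "go" message are illustrative, not part of any released API):

```haskell
import Control.Distributed.Process

-- The child blocks until the parent confirms that monitoring is in
-- place, so it cannot crash before the monitor has been established.
spawnAndMonitor :: Process () -> Process (ProcessId, MonitorRef)
spawnAndMonitor child = do
  pid <- spawnLocal $ do
    () <- expect           -- wait for the go-ahead from the parent
    child
  mRef <- monitor pid      -- monitoring is set up before the child runs
  send pid ()              -- only now may the child proceed
  return (pid, mRef)
```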
Some things I think we can/should rule out...

The original fix (from @tavisrudd, before the repos were split). Here's a playback of the commentary there... When the `StarterProcess` `ChildStart` variant is used with dynamic supervision, each cycle of `startNewChild` + `terminateChild`/`deleteChild` leaks a process. The proposed fix was to kill the started process, for which we have an ID.
This was broken for two reasons. The first is that we do not evaluate `toChildStart` in the supervisor process, thus killing the starter process means we can no longer restart the child. The second reason is that we break location transparency, as per this part of the thread:

... So it is safe for us to kill this `ProcessId` from our point of view, because we're deleting the child spec (and therefore won't require this "re-starter process" to continue living). But what if the code that created this re-starter tries to then send it to another supervisor? The `ProcessId` given to that supervisor will be dead/stale and the supervisor will reject it (when trying to start the child spec) with `StartFailureDied DiedUnknownId`, which is confusing. (UPDATE: note that I'd missed the fact this supervisor won't be able to start children either...)

At first glance, I thought what we actually wanted here is a finalizer that is guaranteed to kill off the re-starter process "at some point" after the `ChildSpec` becomes unreachable (and is gc'ed). However, the `System.Mem.Weak` documentation isn't clear on whether or not this is guaranteed for an ADT such as `ProcessId`. In particular, this comment: […]

The alternative to this would be to document this behaviour of `ChildSpec`, explaining that once deleted from a supervisor, the `ChildSpec` becomes invalid. But I really really dislike this idea. The problem is that the `ChildSpec` is now behaving like a shared, mutable (unsafe) data structure. Even if you serialize the `ChildSpec` and send it to another node, if the same `ChildSpec` (or even another `ChildSpec` that shares the same `ToChildStart` thunk!) is removed from any supervisor in the system, we've just invalidated that same data structure across all nodes, because the `ProcessId` is no longer valid. That just seems wrong to me, no matter how much documentation we throw at it.

One approach that might work here would be to convert the `StarterProcess` constructor to take a `Weak ProcessId`, and in the `ToChildStart` instance create a weak reference to the process id with a finalizer that kills the process once all its clients - the supervisors using it - go away, i.e., once the `ChildStart` datum becomes garbage. Problem solved? Well, no.....

Firstly, we'd need to test that this finalization business works properly for a `Weak ProcessId`. You've already written profiling code to detect the leak, so that shouldn't be too difficult, though relying on finalization will probably mean having to increase the time bounds of the tests to give the `System.Mem.Weak` infrastructure time to do the cleanup - maybe even forcing a GC at some point.
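As a sketch, the `Weak ProcessId` idea might have looked roughly like this (`mkStarterRef` is hypothetical, and, as noted above, it is not certain that a finalizer on a `ProcessId` is ever guaranteed to run):

```haskell
import Control.Distributed.Process (ProcessId, kill)
import Control.Distributed.Process.Node (LocalNode, runProcess)
import System.Mem.Weak (Weak, mkWeakPtr)

-- Hypothetical: pair the pid with a finalizer that reaps the starter
-- process once the ChildStart datum holding the weak pointer is garbage.
mkStarterRef :: LocalNode -> ProcessId -> IO (Weak ProcessId)
mkStarterRef node pid =
  mkWeakPtr pid (Just $ runProcess node (kill pid "starter-gc"))
```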
Secondly, and this is a serious problem: finalisation is useless if/when the `ChildStart` datum escapes the local node. Which means, in practice, that you can't serialize and `send` it remotely, otherwise this whole finalization thing will go wrong - we'll never detect that a remote peer is using the `ChildStart` at all (i.e., we will lose track of it as soon as it's gone over the wire) and therefore - assuming that finalizing a `ProcessId` works at all - we'll end up killing the re-starter process whilst remote supervisors are still using it. Nasty. Confusing. Bug. :(

You can see now why I was a fussy little kitty when @roman and I were debating how to support local children in the original ticket that introduced `StarterProcess`. This issue is fraught with edge cases and we're changing a piece of critical fault-tolerance infrastructure, so we can't afford to screw it up.

I think we have three choices here, as I see it - feel free to suggest alternatives though, as always:

1. Remove `StarterProcess` and make people use the `Closure Process` constructor instead
2. Change `StarterProcess` to reference count its clients/supervisors
3. Remove the `ToChildStart` instance for `Process ()` and export `StarterProcess`, then make people implement this themselves

The idea of (1) is that we're avoiding the issue by making the API less friendly. Not a great option IMO.

UPDATE: I actually think (1) is what we should do now. I went on to say the following, which I'll recant shortly...
Process ()
expression into a closure without template haskell, and therefore we're jumping through hoops for hardly any reason at all. But there's more: The way thatStarterProcess
works is fundamentally broken, because the supervisor has to interact with this other known process to spawn its children, and get back a process id and monitor ref. That is broken in a fundamental way: if we do not own the code for that starter process then we've no way of ensuring we won't block indefinitely waiting for a reply. Even if we do, if the starter resides on a foreign node, network congestion could cause almost indefinite blocking (even in the presence of heartbeats), so this is a bit no-no. We cannot have the supervisor process blocked waiting like that - spawning a child has to be asynchronous, and there needs to be a guarantee that monitoring has been set up correctly before the child proceeds to run, which requires a wrapper that we need to have written. We cannot leave this up to 3rd parties.Because of this (above), we have to have a thunk that we can execute, which we can wrap in the relevant spawn/monitor code for safely. Since we cannot send
Process a
over the wire, onlyClosure (Process a)
is possible - allowing for theCreateHandle
instance obviously, which takesClosure (SupervisorPid -> Process (ChildPid, Message))
- and we cannot operate in terms ofProcess a
. A concession would be to break upChildSpec
into those used to define the supervisor and those used for dynamic supervision, allowingProcess ()
in a child spec used to boot the supervisor but not in subsequent calls toaddChild
etc. I really do not like this idea however, since not only to it create an imbalanced API, it could also prevent us sending supervisor setup data over the wire (since it could potentially containProcess ()
instead ofClosure (Process ())
thunks), and that is a pretty huge disadvantage.The final concession we might make, would be to separate out the supervisor runtime implementation into two parts, one that handles local children and another for remote. I still don't like this idea though, because we end up with a fractured API, but I will consider it if people shout asking for
Process ()
to be supported as aToChildStart
instance again. The approach I would take to achieve this would involve (a) breaking up the server so that the child start handling could be modified, (b) segregating the client APIs and having two supervisor client handles, one for local only and another for remote only supervisors. The client handle for a local only supervisor would carry anSTM.TChan
used for sending unserialisable thunks to the server. This would require us to implement haskell-distributed/distributed-process-client-server#9 first.The text was updated successfully, but these errors were encountered:
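For what it's worth, a rough sketch of that local-only handle idea (entirely hypothetical - `LocalSupervisor` and `addLocalChild` are illustrative names, not part of the current API):

```haskell
import Control.Concurrent.STM (TChan, atomically, writeTChan)
import Control.Distributed.Process (Process, ProcessId, liftIO)

-- Hypothetical: a client handle that is only valid on the supervisor's
-- own node, carrying a TChan over which unserialisable Process () thunks
-- are handed to the server loop.
data LocalSupervisor = LocalSupervisor
  { supPid    :: ProcessId
  , startChan :: TChan (String, Process ())  -- (child key, thunk)
  }

-- Adding a local child never crosses the wire, so no Closure is needed.
addLocalChild :: LocalSupervisor -> String -> Process () -> Process ()
addLocalChild sup key thunk =
  liftIO . atomically $ writeTChan (startChan sup) (key, thunk)
```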