
Investigate large differences between number of data transfers and number of active graphsync requests in Estuary #279

hannahhoward opened this issue Oct 26, 2021 · 0 comments

For some time, Estuary has been holding tens of thousands of data transfers in the requested or ongoing state (~2,500 ongoing, 10,000+ requested), but it has at most a few hundred active or pending Graphsync requests at any given time. Since these are push channels, that means the other peer has simply never sent us a Graphsync request in response to our push or restart message. Upon restart, Estuary attempts to restart every one of its transfers, but despite never receiving new Graphsync requests, they don't seem to fail either.

I haven't been able to determine the root cause of the problem. At first, I thought it might be that channel monitoring was not getting added to requests that restarted (see #268). When I discovered this issue, I thought because there was no monitoring, networking errors would not trigger restarts, and therefore channels would never reach the "exhausted restarts" point that would trigger a failure. However, that issue has been fixed without a change to the number of active requests.

It may be that there are multiple causes. I've identified a few definite issues and some possible hypotheses:

  1. The transfer does not fail even when the restart message cannot be sent after 5 attempts. Estuary logs show that RestartDataTransferChannel returns an error for several hundred channels because the initial restart message never sends, as no route to the other peer can be found. But the transfer itself never moves to a failed state. It seems to me that if you can't even route to a peer, you ought to fail the transfer. Possibly this should be a separate call in Estuary, or possibly, if the restart network message doesn't send (https://github.com/filecoin-project/go-data-transfer/blob/master/impl/restart.go#L105), one should simply fail the whole transfer. Solving this should address several hundred of the stuck channels.

  2. For transfers in the "requested" stage, the accept timeout on Estuary is 24 hours. Estuary gets restarted relatively frequently, so we may simply never reach that timeout for all the requested transfers that never get a response.

  3. Does restarting in the context of a transfer that's only reached the "requested" stage even work? I've read the code paths and it looks like it should, but we don't really have tests of this behavior.
