
Investigate large differences between number of data transfers and number of active graphsync requests in Estuary #279

hannahhoward opened this issue Oct 26, 2021 · 0 comments

For some time, Estuary has been holding tens of thousands of data transfers in the requested or ongoing state (~2,500 ongoing, 10,000+ requested), but it has at most a few hundred active or pending Graphsync requests at any given time. Since these are push channels, that means the other peer has simply never sent us a Graphsync request in response to our push or restart message. Upon restart, Estuary attempts to restart every one of its transfers, but despite never receiving new Graphsync requests, they don't seem to fail either.

I haven't been able to determine the root cause of the problem. At first, I thought it might be that channel monitoring was not getting added to requests that restarted (see #268). When I discovered this issue, I thought because there was no monitoring, networking errors would not trigger restarts, and therefore channels would never reach the "exhausted restarts" point that would trigger a failure. However, that issue has been fixed without a change to the number of active requests.

It may be that there are multiple causes. I've identified a few definite issues and some possible hypotheses:

  1. The transfer does not fail even when the restart message cannot be sent after 5 attempts. Estuary logs show that RestartDataTransferChannel returns an error for several hundred channels because the initial restart message never sends, as no route to the other peer can be found. But the transfer itself never moves to a failed state. It seems to me that if you can't even route to a peer, you ought to fail the transfer. Possibly this should be a separate call in Estuary, or possibly, if the restart network message doesn't send (https://github.com/filecoin-project/go-data-transfer/blob/master/impl/restart.go#L105), one should simply fail the whole transfer. Solving this should address several hundred of the stuck channels.

  2. For transfers in the "requested" stage, the accept timeout on Estuary is 24 hours. Estuary gets restarted relatively frequently, so we may simply never reach that timeout for all the requested transfers that never get a response.

  3. Does restarting in the context of a transfer that's only reached the "requested" stage even work? I've read the code paths and it looks like it should, but we don't really have tests of this behavior.
