-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Translator restart if disconnected from upstream #1001
Translator restart if disconnected from upstream #1001
Conversation
6094b69
to
b092514
Compare
Bencher
🚨 1 ALERT: Threshold Boundary Limit exceeded!
Click to view all benchmark results
Bencher - Continuous Benchmarking View Public Perf Page Docs | Repo | Chat | Help |
Bencher
🚨 2 ALERTS: Threshold Boundary Limits exceeded!
Click to view all benchmark results
Bencher - Continuous Benchmarking View Public Perf Page Docs | Repo | Chat | Help |
Bencher
🚨 3 ALERTS: Threshold Boundary Limits exceeded!
Click to view all benchmark results
Bencher - Continuous Benchmarking View Public Perf Page Docs | Repo | Chat | Help |
Bencher
🚨 3 ALERTS: Threshold Boundary Limits exceeded!
Click to view all benchmark results
Bencher - Continuous Benchmarking View Public Perf Page Docs | Repo | Chat | Help |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK
This approach can be optimized. Currently, we are spawning a new task immediately following the task we intend to terminate. A more efficient method would involve incorporating the receiver as one of the branches in the select! block within the task we wish to terminate. This adjustment would eliminate the need for an additional asynchronous task, thereby facilitating a more graceful shutdown rather than an abrupt one.
A possible code snippet for one such task can be:
let socket_reader_task = tokio::task::spawn(async move {
let reader = BufReader::new(&*socket_reader);
let mut messages = FramedRead::new(
async_compat::Compat::new(reader),
LinesCodec::new_with_max_length(MAX_LINE_LENGTH),
);
loop {
tokio::select! {
res = messages.next().fuse() => {
match res {
Some(Ok(incoming)) => {
debug!("Receiving from Mining Device {}: {:?}", &host_, &incoming);
let incoming: json_rpc::Message = handle_result!(tx_status_reader, serde_json::from_str(&incoming));
if let v1::Message::StandardRequest(standard_req) = incoming {
if let Ok(Submit{..}) = standard_req.try_into() {
handle_result!(tx_status_reader, Self::save_share(self_.clone()));
}
}
let res = Self::handle_incoming_sv1(self_.clone(), incoming).await;
handle_result!(tx_status_reader, res);
},
Some(Err(_)) => {
handle_result!(tx_status_reader, Err(Error::Sv1MessageTooLong));
},
None => {
handle_result!(tx_status_reader, Err(
std::io::Error::new(
std::io::ErrorKind::ConnectionAborted,
"Connection closed by client"
)
));
break;
}
}
},
_ = cancellation_token_mining_device.cancelled().fuse() => {
warn!("Cancellation token triggered: Initiating shutdown of sv1 downstream reader");
break;
},
_ = rx_shutdown_clone.recv().fuse() => {
warn!("Shutdown signal received: Initiating shutdown of sv1 downstream reader");
break;
}
}
}
warn!("Downstream: Reader task is terminating. Performing cleanup.");
kill(&tx_shutdown_clone).await;
});
That is a particular task, but how would you do with this task? I could not find a better solution. let task = tokio::task::spawn(async move {
loop {
let msg = handle_result!(tx_status, rx_sv1_downstream.clone().recv().await);
match msg {
DownstreamMessages::SubmitShares(share) => {
handle_result!(
tx_status,
Self::handle_submit_shares(self_.clone(), share).await
);
}
DownstreamMessages::SetDownstreamTarget(new_target) => {
handle_result!(
tx_status,
Self::handle_update_downstream_target(self_.clone(), new_target)
);
}
};
}
});
tokio::task::spawn(async move {
tokio::select! {
_ = cancellation_token_handle_downstream.cancelled() => {
task.abort();
warn!("Shutting down handle_result task");
},
}
}); and what do you think about using a mutex as JoinaHandler` collector for spawned tasks and close all of them in task in the main? |
Ah, I missed it! In that case, aggregating JoinHandles with simultaneous termination would make more sense. |
b092514
to
c8d8472
Compare
Applied the changes suggested by @Shourya742 |
We can discuss it at the call, when I realized using a mutex as a joinhandler collector for the tasks, I liked it but I do not want to change it if not necessary. |
b158a03
to
d4800cd
Compare
test/message-generator/test/change-upstream/change-upstream.json
Outdated
Show resolved
Hide resolved
@lorbax is it ready for a final review? |
yes! |
in this PR is fixed the fallback to solo-mining in the case that the upstream sends a `SubmitShareError` on a valid share AND there are no other available upstreams in the JDC config. To check this PR you should follow the same procedure as described in the commit messages of the branch of this PR (which deals with a similar problem). stratum-mining#1001 The only exception is that the JDC should use this config test/config/change-upstream/jdc-config-local-example-change-upstream-solo-fallback.toml panics: 2024-07-11T15:57:25.769622Z INFO roles_logic_sv2::handlers::template_distribution: Received NewTemplate with id: 694, is future: true thread 'tokio-runtime-worker' panicked at jd-client/src/lib/downstream.rs:379:74: called `Option::unwrap()` on a `None` value note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace 2024-07-11T15:57:35.319832Z INFO roles_logic_sv2::handlers::mining: Received UpdateChannel->Extended message thread 'tokio-runtime-worker' panicked at jd-client/src/lib/downstream.rs:525:13: not yet implemented 2024-07-11T15:57:41.633138Z ERROR network_helpers_sv2::noise_connection_tokio: Disconnected from client while reading : early eof - 127.0.0.1:56278 ^C2024-07-11T15:57:57.023091Z INFO jd_client: Interrupt received
d4800cd
to
47fd420
Compare
yesterday we had a PR review club, where we dived into this PR we found some issues ( after that, I was able to manually reproduce #1004, but only with the trick described by @GitGab19 here: #844 (comment) basically I launched one pool with this patch (listening on port 44254), and another pool without any patches (listening on port 34254) then I modified JDC config so it would have two
finally, I was able to verify that this PR is allowing tProxy to restart after the patched pool rejects a valid share unfortunately the MG mock for a bad pool isn't working reliably... I tried following the instructions from 3d8793a multiple times but the behavior isn't deterministic given that:
I think we can probably drop 3d8793a and only use this PR to fix the bug we are keeping track of this bug on #1077 and it will eventually be added to CI (via MG or integration tests, to be defined) |
I already had included a jdc config for testing the PR
The only thing is that the MG should need a new MG feature in order to work flawlessly, but I used the MG test yesterday to replicate the issue and to test the fix (also after the review club) and it does the job. Can you provide more information about the difficulties or inconsistencies that you got using that test? "isn't working reliably" is a bit vague. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Since it seems not reliable (without the patch you mentioned us some days ago @lorbax), and we decided to move those integration tests in another place, I would just remove the MG test and the relative configs from this PR.
In the next days I will properly test it and review it again.
8228883
to
9a696b3
Compare
this is the first time that I see this error. Usually the channel ID is fixed for the entire duration of the connection and starts from 1, but also in the MG test the message |
in this PR is fixed the fallback to solo-mining in the case that the upstream sends a `SubmitShareError` on a valid share AND there are no other available upstreams in the JDC config. To check this PR you should follow the same procedure as described in the commit messages of the branch of this PR (which deals with a similar problem). stratum-mining#1001 The only exception is that the JDC should use this config test/config/change-upstream/jdc-config-local-example-change-upstream-solo-fallback.toml panics: 2024-07-11T15:57:25.769622Z INFO roles_logic_sv2::handlers::template_distribution: Received NewTemplate with id: 694, is future: true thread 'tokio-runtime-worker' panicked at jd-client/src/lib/downstream.rs:379:74: called `Option::unwrap()` on a `None` value note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace 2024-07-11T15:57:35.319832Z INFO roles_logic_sv2::handlers::mining: Received UpdateChannel->Extended message thread 'tokio-runtime-worker' panicked at jd-client/src/lib/downstream.rs:525:13: not yet implemented 2024-07-11T15:57:41.633138Z ERROR network_helpers_sv2::noise_connection_tokio: Disconnected from client while reading : early eof - 127.0.0.1:56278 ^C2024-07-11T15:57:57.023091Z INFO jd_client: Interrupt received
in this PR is fixed the fallback to solo-mining in the case that the upstream sends a `SubmitShareError` on a valid share AND there are no other available upstreams in the JDC config. To check this PR you should follow the same procedure as described in the commit messages of the branch of this PR (which deals with a similar problem). stratum-mining#1001 The only exception is that the JDC should use this config test/config/change-upstream/jdc-config-local-example-change-upstream-solo-fallback.toml panics: 2024-07-11T15:57:25.769622Z INFO roles_logic_sv2::handlers::template_distribution: Received NewTemplate with id: 694, is future: true thread 'tokio-runtime-worker' panicked at jd-client/src/lib/downstream.rs:379:74: called `Option::unwrap()` on a `None` value note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace 2024-07-11T15:57:35.319832Z INFO roles_logic_sv2::handlers::mining: Received UpdateChannel->Extended message thread 'tokio-runtime-worker' panicked at jd-client/src/lib/downstream.rs:525:13: not yet implemented 2024-07-11T15:57:41.633138Z ERROR network_helpers_sv2::noise_connection_tokio: Disconnected from client while reading : early eof - 127.0.0.1:56278 ^C2024-07-11T15:57:57.023091Z INFO jd_client: Interrupt received
9a696b3
to
a88ebed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nack. I dont think the translator should look like that. the start
function is far from anything we can work with.
please look here https://github.com/stratum-mining/stratum/blob/fbb846450a3053f7e57c9bdf90d407d4e10449da/roles/translator/src/lib/mod.rs I think this should be more or less the approach.
If you need to persist data, then we should add write/read to (disk?) abilities and save the data and recover it when we need it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approved this earlier but then I noticed two MG tests are failing deterministically, so unfortunately I have to revert my approval.
Basically, now we're making tProxy attempt to reconnect to Upstream
every time there's a State::UpstreamShutdown
, regardless of what caused it.
pool-sri-test-close-channel
To be honest, I'm not sure what this MG tries to achieve in the first place.
There's nothing in the specs stating that a proxy must close the entire connection when it receives a CloseChannel
from Upstream
.
This feels like a naive test. I would lean towards completely removing this test from the repo (ideally on a separate PR, to be merged before this one), but it would be good to get feedback from others.
translation-proxy-broke-pool
This MG test asserts that the tProxy closes the connection when it receives a bad extranonce from Upstream
.
That feels like a reasonable behavior, but it kind of conflicts with the current approach of this PR, because a bad extrononce also triggers a State::UpstreamShutdown
, which now causes tProxy to attempt to reconnect.
So if Upstream
sends a bad extranonce, should tProxy really keep trying to reconnect forever?
I'm not sure how to proceed here.
after further investigating the MG tests described above, I came up with a solution: plebhash@e63c249 basically, tProxy should only try to reconnect if the error coming from with this approach, both MG tests listed above are passing I still think that @lorbax please have a look at the solution on the suggested commit... if you agree, feel free to just port it to your branch and squash it into the single commit of this PR, no need to worry about credit/authorship |
looks good |
Hi! translator_sv2::error::Error::ChannelErrorReceiver but you suggests to send translator_sv2::error::Error::TokioChannelErrorRecv ? match e {
Error::ChannelErrorReceiver(_) => {
tx.send(Status {
state: State::UpstreamTryReconnect(e),
})
.await
.unwrap_or(());
},
_ => {
tx.send(Status {
state: State::UpstreamShutdown(e),
})
.await
.unwrap_or(());
},
} BTW, it appears to me that the mixed use of std_async channels and tokio channels is red flag. what do you think? |
@lorbax so I just tried to cover for both flavors but feel free to only do |
my idea is to introduce changes only when strictly needed in order to avoid unexpected behavior. What do you think about? |
a88ebed
to
63d37dc
Compare
yes those are very good points and I agree it's kind of a red flag... I'm not sure what are the historical motivations for having two different flavors of channels, but I think that when we start making deeper refactoring into the application layer we should discuss the possibility of getting rid of some of the two kinds of channels in favor of standardization but that's a discussion for the future for now, on the context of this PR, I agree we should move forward only covering stratum/roles/translator/Cargo.toml Line 22 in 2541703
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested twice and it works as expected, good work @lorbax !
Just a few minor nits and it's ready to go
- Add `start` function and put starting logic there. - Every AbortHandle of each task is stored in a collector, which is a mutex. - Added `kill_tasks` function that takes in input this mutex, access it, pop each JoinHandle and kill the corresponding task. When receive an UpstreamShutdown does the following: 1. waits a random amount of time s 0<s<3 secs (if there 100000 TProxies, then 100000 proxies will connect again _at the same time_). 2. calls `kill_tasks` 3. calls `start` use tokio::task in favor of async_std::task
63d37dc
to
a74262a
Compare
with channel I mean channels that allows a task to send data to another task, like mpsc channels. These are not things that are exported by network-helpers (or at least it appears to me like this). Perhaps is a good idea to make it uniform and substitute the usage of |
@GitGab19 lets not merge unless we have all CI checks in the green please |
When the jdc changes upstream, it disconnects the TProxy.
This PR addresses this issue by making the TProxy reconnecting to the JDC.
Removed the part related to the MG test.
How to test this PR:
The idea is to have a pool that rejects a share. It is possible to do that by tweaking the part of the code needed to check the target, forcing the pool to calculate a wrong hash for a share.
Following @GitGab19 suggestion, the part of the code responsible for that is
roles-logc-sv2::channel_logic::channel_factory::on_submit_shares_extended(...)
. Since other roles check the share using the same function, directly modifying this would have the consequence that no share at all is ever sent to the pool. If I understood correctly, the way to fix this issue is to apply @GitGab19 suggestion to a clone of functionon_submit_shares_extended
, which can be called callon_submit_shares_extended_pool
. After that, modify the implementation ofhandle_submit_share_extended
trait function ofParseDownstreamMiningMessages
trait in the following way