Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

heartbeat is sent even after io loop panicked #401

Closed
slavik-pastushenko opened this issue Apr 18, 2024 · 9 comments · Fixed by #403
Closed

heartbeat is sent even after io loop panicked #401

slavik-pastushenko opened this issue Apr 18, 2024 · 9 comments · Fixed by #403

Comments

@slavik-pastushenko
Copy link
Contributor

slavik-pastushenko commented Apr 18, 2024

Hello!
I am using tokio and trying to implement the reconnection flow.
I found out that even if your app panicked, the heartbeat is still sent.

2024-04-18T12:17:15.514560Z DEBUG send heartbeat
2024-04-18T12:17:19.316301Z DEBUG received heartbeat from server
2024-04-18T12:17:22.219640Z ERROR Socket was readable but we read 0. This usually means that the connection is half closed this mark it as broken
2024-04-18T12:17:22.219744Z ERROR error doing IO error=IOError(Kind(ConnectionAborted))
2024-04-18T12:17:22.219765Z ERROR Connection error error=IO error: connection aborted
thread 'lapin-io-loop' panicked at /xxx:144:42:
IO error: connection aborted
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
...

after it panicked, you can see that there is no response but it still sends it.

2024-04-18T12:17:25.517414Z DEBUG send heartbeat
2024-04-18T12:17:35.520383Z DEBUG send heartbeat
2024-04-18T12:17:45.522635Z DEBUG send heartbeat
2024-04-18T12:17:55.525602Z DEBUG send heartbeat

The way I got it is:

channel.on_error(|error| panic!("{}", error));
connection.on_error(|error| panic!("{}", error));
@slavik-pastushenko
Copy link
Contributor Author

slavik-pastushenko commented Apr 18, 2024

I also implemented the way @mjanssen suggested here: #389 (comment)

it works correctly when you gracefully shutdown your rmq cluster.

if you force shutdown a rmq cluster, either:

  • it says reconnected, no errors, but I cannot see my connection/exchange/queue in the rabbitmq management console.
  • does not fetch an error and continues to send a heartbeat (without any response from the server). After 10-12 minutes it throws an error below, reconnects and works fine:
2024-04-18T09:47:17.930807Z ERROR error doing IO error=IOError(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }) 2024-04-18T09:47:17.930854Z ERROR Connection error error=IO error: Connection reset by peer (os error 104)

@slavik-pastushenko
Copy link
Contributor Author

any ideas, @Keruspe

@slavik-pastushenko
Copy link
Contributor Author

2024-04-19T10:02:40.575129Z DEBUG send heartbeat
2024-04-19T10:02:40.575247Z TRACE send_frame channel=0
2024-04-19T10:02:40.575255Z TRACE wake channel=0
2024-04-19T10:02:40.575329Z TRACE Got event for socket event=Wake
2024-04-19T10:02:40.575393Z TRACE will write to buffer next_msg=AMQPFrame::Heartbeat
2024-04-19T10:02:40.575499Z TRACE wrote 8 bytes
2024-04-19T10:02:40.575560Z TRACE io_loop do_run done can_read=false can_write=true has_data=false status=Connected
2024-04-19T10:02:40.575571Z TRACE io_loop run
2024-04-19T10:02:40.575576Z TRACE io_loop do_run can_read=false can_write=true has_data=false

while my rmq cluster was completely off. if I restart my pod it will fetch that a connection cannot be established and will wait until my rmq cluster is back.

@slavik-pastushenko
Copy link
Contributor Author

after 6-7 minutes it got an error:

2024-04-19T10:29:53.277538Z TRACE io_loop do_run can_read=false can_write=true has_data=false
2024-04-19T10:29:56.852285Z TRACE Got event for socket event=Readable
2024-04-19T10:29:56.852424Z ERROR Socket was readable but we read 0. This usually means that the connection is half closed this mark it as broken
2024-04-19T10:29:56.852431Z ERROR error doing IO error=IOError(Kind(ConnectionAborted))
2024-04-19T10:29:56.852443Z ERROR Connection error error=IO error: connection aborted
2024-04-19T10:29:56.852472Z TRACE set_error consumer_tag=my_tag
2024-04-19T10:29:56.852487Z TRACE cancel consumer_tag=my_tag

and successfully restarted.

@slavik-pastushenko
Copy link
Contributor Author

After running tcpdump on the client we noticed that heartbeats get occasionally sent (to the pre-termination IP) but there is no response and the reconnection happens at completely random intervals eg. 1 service reconnects straight away and 3 others wait for up to 25 minutes.
Can we somehow validate a heartbeat response or throw an exception in the case where we don’t receive a heartbeat response from the server in let's say around after 2 or 3 heartbeats?

@Keruspe
Copy link
Collaborator

Keruspe commented Apr 19, 2024

I thought it was already the case.
Will give it a proper look.

If you have a pure testing environment to which I could get tmp access it could help if the issue turns out to be... not that obvious

@slavik-pastushenko
Copy link
Contributor Author

slavik-pastushenko commented Apr 20, 2024

the easiest way to reproduce it locally (using rabbitmq in docker with the default configuration):

  1. setup a connection and set the on_error handler:
let connection = Connection::connect(
                &addr,
                ConnectionProperties::default()
                    .with_executor(tokio_executor_trait::Tokio::current())
                    .with_reactor(tokio_reactor_trait::Tokio),
            )
            .await?;

connection.on_error(|error| panic!("{}", error));
  1. create a channel and set the on_error handler:
let channel = connection.create_channel().await?;
channel.on_error(|error| panic!("{}", error));
  1. stop the rabbitmq container
  2. receive an error:
2024-04-20T13:05:15.522848Z ERROR error doing IO error=IOError(Kind(ConnectionAborted))
2024-04-20T13:05:15.522870Z ERROR Connection error error=IO error: connection aborted
thread 'lapin-io-loop' panicked at /my-path.../x.rs:
IO error: connection aborted
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_display
  1. see we still send a heartbeat without a response:
2024-04-20T13:05:16.763136Z DEBUG send heartbeat
2024-04-20T13:05:16.763422Z TRACE send_frame channel=0
2024-04-20T13:05:16.763460Z TRACE wake channel=0
2024-04-20T13:05:26.766356Z DEBUG send heartbeat
2024-04-20T13:05:26.766625Z TRACE send_frame channel=0
2024-04-20T13:05:26.766658Z TRACE wake channel=0

2024-04-20T13:05:36.768294Z DEBUG send heartbeat
2024-04-20T13:05:36.768708Z TRACE send_frame channel=0
2024-04-20T13:05:36.768760Z TRACE wake channel=0
2024-04-20T13:05:46.770914Z DEBUG send heartbeat
2024-04-20T13:05:46.771068Z TRACE send_frame channel=0
2024-04-20T13:05:46.771093Z TRACE wake channel=0
2024-04-20T13:05:56.774126Z DEBUG send heartbeat
2024-04-20T13:05:56.774387Z TRACE send_frame channel=0
2024-04-20T13:05:56.774492Z TRACE wake channel=0

@Keruspe
Copy link
Collaborator

Keruspe commented Apr 22, 2024

Can you give #404 a try?

Keruspe added a commit that referenced this issue Apr 22, 2024
@slavik-pastushenko
Copy link
Contributor Author

#403 solved the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants