Previously clustered instances incorrectly cached #196

Open
aksabg opened this issue Jun 18, 2024 · 0 comments

aksabg commented Jun 18, 2024

Questions

It seems that Vert.x caches the addresses of previously shut-down instances when using the clustered event bus. Some event bus send calls fail with a timeout exception because no listener exists on the address anymore.

Version

Vert.x 4.5.5
hazelcast-kubernetes 3.2.3

Context

We are using clustered Vert.x with Hazelcast Kubernetes discovery. We have multiple Kubernetes pods, each containing one Verticle.
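
For reference, each pod creates its clustered Vert.x instance and deploys its Verticle roughly like this (class names, the address, and the payload are illustrative, not our exact code; the Kubernetes discovery settings live in the Hazelcast cluster.xml on the classpath):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class NotifyServiceMain {

  // Illustrative Verticle: registers one consumer on a well-known event bus address
  public static class NotifyVerticle extends AbstractVerticle {
    @Override
    public void start() {
      vertx.eventBus().<String>consumer("status/check/NotifyService", msg -> msg.reply("OK"));
    }
  }

  public static void main(String[] args) {
    // The no-arg constructor picks up the Hazelcast config (with Kubernetes discovery) from the classpath
    HazelcastClusterManager clusterManager = new HazelcastClusterManager();
    VertxOptions options = new VertxOptions().setClusterManager(clusterManager);

    Vertx.clusteredVertx(options)
      .onSuccess(vertx -> vertx.deployVerticle(new NotifyVerticle()))
      .onFailure(err -> System.err.println("Failed to join cluster: " + err.getMessage()));
  }
}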

If multiple nodes are killed for any reason and then started again, the cluster is re-established, but Vert.x doesn't seem to be aware of that. It looks like Vert.x keeps trying to send event bus messages to addresses that no longer exist in the cluster.

The only way we have found to resolve this is to shut down all Verticles and then start them again. If at least one Verticle stays up, it propagates the stale addresses to every other new one.
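
On the sending side, the calls that time out look roughly like this (the address is taken from the logs below; the client class and payload are illustrative):

import io.vertx.core.Vertx;
import io.vertx.core.eventbus.DeliveryOptions;
import io.vertx.core.eventbus.ReplyException;
import io.vertx.core.eventbus.ReplyFailure;

public class StatusCheckClient {

  // Sends a status-check request and logs when the reply times out
  static void checkStatus(Vertx vertx) {
    DeliveryOptions options = new DeliveryOptions().setSendTimeout(30_000);
    vertx.eventBus().<String>request("status/check/NotifyService", "ping", options, ar -> {
      if (ar.succeeded()) {
        System.out.println("Status reply: " + ar.result().body());
      } else if (ar.cause() instanceof ReplyException
          && ((ReplyException) ar.cause()).failureType() == ReplyFailure.TIMEOUT) {
        // This is the failure we observe: the old recipient node is gone, so no reply ever arrives
        System.err.println("Timed out waiting for reply: " + ar.cause().getMessage());
      } else {
        System.err.println("Request failed: " + ar.cause().getMessage());
      }
    });
  }
}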

Potentially relevant log statements


{
  "time": "2024-05-12T10:13:04.811512654Z",
  "level": "ERROR",
  "class": "com.myorg.MyClass",
  "message": "Received unknown error code. Error code received is -1, message received is Timed out after waiting 30000(ms) for a reply. address: __vertx.reply.d9179ccb-e25f-4e4e-9189-ff04368e4abb, repliedAddress: status/check/NotifyService."
}

{
  "time": "2024-05-12T10:06:34.812700633Z",
  "level": "WARN",
  "class": "io.vertx.core.eventbus.impl.clustered.ConnectionHolder",
  "requestId": "req-PvHlXydYJAhS7Run6wd4",
  "message": "Connecting to server d9f36bb4-029f-44d6-8f7c-304be8285b22 failed",
  "stacktrace": "io.vertx.core.impl.NoStackTraceThrowable: Not a member of the cluster\n"
}

{
  "time": "2024-05-12T10:14:34.810948660Z",
  "level": "WARN",
  "class": "io.vertx.core.eventbus.impl.clustered.ConnectionHolder",
  "requestId": "req-PvHlXydYJAhS7Run6wd4",
  "message": "Connecting to server 43e35345-2b5a-4b89-bb59-f5bf67f01780 failed",
  "stacktrace": "io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /172.21.128.0:36519\nCaused by: java.net.ConnectException: Connection refused\n\tat java.base/sun.nio.ch.Net.pollConnect(Native Method)\n\tat java.base/sun.nio.ch.Net.pollConnectNow(Unknown Source)\n\tat java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)\n\tat io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:335)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n"
}

We cannot reproduce the issue consistently; it only happens occasionally.

In Vert.x core, the exception occurs in io.vertx.core.eventbus.impl.clustered.ConnectionHolder, in the following piece of code:

  void connect() {
    Promise<NodeInfo> promise = Promise.promise();
    eventBus.vertx().getClusterManager().getNodeInfo(remoteNodeId, promise);
    promise.future()
      .flatMap(info -> eventBus.client().connect(info.port(), info.host()))
      .onComplete(ar -> {
        if (ar.succeeded()) {
          connected(ar.result());
        } else {
          log.warn("Connecting to server " + remoteNodeId + " failed", ar.cause());
          close(ar.cause());
        }
      });
  }

It seems that the piece of code in vertx-hazelcast that fails is in the class io.vertx.spi.cluster.hazelcast.HazelcastClusterManager:

  @Override
  public void getNodeInfo(String nodeId, Promise<NodeInfo> promise) {
    vertx.<NodeInfo>executeBlocking(() -> {
      HazelcastNodeInfo value = nodeInfoMap.get(nodeId);
      if (value != null) {
        return value.unwrap();
      } else {
        throw new VertxException("Not a member of the cluster", true);
      }
    }, false).onComplete(promise);
  }
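
A minimal sketch of how cluster membership could be compared against the node ID from the warning logs (assuming the HazelcastClusterManager reference from startup is still available; the logging is illustrative):

import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class ClusterDiagnostics {

  // Logs whether the node ID from the warning is still known to the cluster manager
  static void checkNode(HazelcastClusterManager clusterManager, String remoteNodeId) {
    // getNodes() returns the IDs of the members currently in the cluster
    boolean member = clusterManager.getNodes().contains(remoteNodeId);
    System.out.println("Node " + remoteNodeId + " member of cluster: " + member
        + ", current members: " + clusterManager.getNodes());
  }
}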

Any ideas on what might be going on?

aksabg added the bug label Jun 18, 2024