
Wrong node roles after reconnecting to Redis #2581

Closed
jedevelop opened this issue May 16, 2023 · 19 comments
Labels
status: invalid An issue that we don't feel is valid


@jedevelop

Hello. After restarting the Docker containers with Redis, the leader changes (if another node was the leader at that moment) and the application reconnects to Redis, but MasterReplicaConnectionProvider still contains the nodes with their old roles (in the knownNodes property).

Because the roles changed during the reconnect, MasterReplicaConnectionProvider.getConnectionAsync("WRITE") returns the old master and a RedisReadOnlyException occurs.

There is SentinelTopologyRefresh, which tracks topology updates, but it doesn't work after reconnecting.

How can I force the node roles to be updated? The application is written in Java with Spring Boot.

Any ideas?
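
For illustration, a minimal sketch of the kind of client-side guard I have in mind, assuming a LettuceConnectionFactory bean is available (the helper and the single retry are hypothetical, not an existing API):

import io.lettuce.core.RedisReadOnlyException
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory

// Hypothetical guard: if a write hits the stale master, force the factory to
// drop its shared connection so roles are rediscovered, then retry once.
// Note: through RedisTemplate the Lettuce exception may arrive wrapped in a
// RedisSystemException, so the catch may need to inspect the cause instead.
fun <T> withStaleMasterRetry(factory: LettuceConnectionFactory, action: () -> T): T =
    try {
        action()
    } catch (e: RedisReadOnlyException) {
        factory.resetConnection() // closes shared connections; next use reconnects
        action()
    }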

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label May 16, 2023
@mp911de
Member

mp911de commented May 16, 2023

Do you have a reproducer handy that we can use to reproduce the issue without having to set up a ton of infrastructure? Likely, that is a Lettuce driver issue.

@mp911de mp911de added the status: waiting-for-feedback We need additional information before we can continue label May 16, 2023
@jedevelop
Author

Unfortunately, I don't have a tool to reproduce the problem, but I can provide sample code (Kotlin, Spring Data Redis).

I have 3 Redis nodes and 3 Sentinels, one of each on the same hosts. In other words, each host exposes two ports, one for Redis and one for Sentinel (for example: host1:6379, host1:26379; host2:6379, host2:26379; host3:6379, host3:26379).

@Bean
fun lettuceClientOptionsBuilderCustomizer(): LettuceClientConfigurationBuilderCustomizer {
    return LettuceClientConfigurationBuilderCustomizer { builder ->
        builder.readFrom(ReadFrom.REPLICA_PREFERRED)
            .clientOptions(
                ClientOptions.builder()
                    .protocolVersion(ProtocolVersion.RESP3)
                    .publishOnScheduler(true)
                    .build()
            )
            .useSsl()
            .disablePeerVerification()
    }
}

@Bean
fun redisSentinelConfiguration(properties: RedisProperties): RedisSentinelConfiguration {
    return RedisSentinelConfiguration(
        "mymaster",
        properties.sentinel.nodes   //sentinelHost1:26379,sentinelHost2:26379,sentinelHost3:26379
    )
}

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels May 16, 2023
@jxblum
Contributor

jxblum commented May 16, 2023

My question is...

If you were to code a simple Java application without Spring (Boot/Data Redis), that is, use only Java and Lettuce with Redis, would you have the same problem?

NOTE: FTR, I suspect you would.

DISCLAIMER: Please forgive me in advance, but I am thinking through your problem out loud as I try to understand the issue here. So, I am also just talking through my thoughts on this matter.

Given only my cursory knowledge of Spring Data Redis (yet), as well as my limited knowledge and experience with Redis in various capacities/contexts (e.g. Redis Sentinel with Docker), it seems to me that very little logic exists in Spring Data Redis for managing topology (role) changes of Redis Sentinel nodes where Master/Replicas are concerned.

For instance, if you are referring to the StaticMasterReplicaConnectionProvider (not simply MasterReplicaConnectionProvider as you referred to above) then you will notice that most of the contained logic delegates to the Lettuce driver classes, which maintain the connection/topology "state".

In addition, the RedisClient and RedisURI classes (here and here), along with MasterReplica (here), used to establish a stateful Master/Replica connection, are Lettuce driver classes.

Even SentinelTopologyRefresh is an "internal" class provided by the Lettuce driver (source). Furthermore, it seems to me that this component, and this component alone, is solely responsible for topology changes in Redis within the Lettuce driver, and it should notify clients, whether Spring or otherwise, of changes, as clearly documented and stated.

Therefore, outside of being able to pass configuration down from Spring, or any Java client for that matter, to (ultimately) the (Lettuce) Redis (client) driver when the StatefulConnection (Master/Replica variants) is established, there is very little Spring Data Redis can affect in this case AFAICT.

Upon closer inspection of the Lettuce driver in particular, it seems that maybe you need a dynamic topology change detection mechanism (rather than "static"), as described here. Although, maybe Docker, much like AWS, needs an explicit list of nodes (i.e. RedisURIs) when establishing the topology from the client (i.e. the topology needs to be known in advance).

The Lettuce documentation on the StaticMasterReplicaTopologyProvider is pretty clear:

"StaticMasterReplicaTopologyProvider: Topology is defined by the list of URIs and the ROLE output. MasterReplica uses only the supplied nodes and won’t discover additional nodes in the setup. The connection needs to be re-established outside of Lettuce in case of a Master/Replica failover or topology changes."

In particular: "The connection needs to be re-established outside of Lettuce in case of a Master/Replica failover or topology changes."

However, the documentation is not very clear on "roles" in any case, and in particular, the "static" one.

The best thing I can think of at the moment is to re-establish your connection(s) by closing the existing connections and re-opening them. Worst case scenario, this might even require the application to be restarted.
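
For illustration only, a minimal sketch of that "close and re-open" workaround with plain Lettuce (the helper and its wiring are hypothetical, not an existing API; the Sentinel URI is an example):

import io.lettuce.core.ReadFrom
import io.lettuce.core.RedisClient
import io.lettuce.core.RedisURI
import io.lettuce.core.codec.StringCodec
import io.lettuce.core.masterreplica.MasterReplica
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection

// Hypothetical helper: discard the stale connection and build a fresh one,
// so the roles (knownNodes) are discovered again via Sentinel.
fun reconnect(
    client: RedisClient,
    // e.g. RedisURI.create("redis-sentinel://host1:26379,host2:26379,host3:26379?sentinelMasterId=mymaster")
    sentinelUri: RedisURI,
    stale: StatefulRedisMasterReplicaConnection<String, String>?
): StatefulRedisMasterReplicaConnection<String, String> {
    stale?.close() // drop the connection holding the outdated roles
    val fresh = MasterReplica.connect(client, StringCodec.UTF8, sentinelUri)
    fresh.readFrom = ReadFrom.REPLICA_PREFERRED
    return fresh
}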

There is very little (automated or otherwise) Spring Data Redis can do other than what is possible and allowed by the underlying (Lettuce) Redis (client) driver API. And, after browsing through the API, I am coming up a bit short here.

@jxblum
Contributor

jxblum commented May 16, 2023

Also from the Lettuce driver documentation:

"Redis Standalone Master/Replica will discover the roles of the supplied RedisURIs and issue commands to the appropriate node."

@jxblum
Contributor

jxblum commented May 16, 2023

I am also thinking: if "strong" consistency is not required by your application's use case and requirements (SLA), then what would be the harm in writing to any available node (not just the "master"), if that is even possible to do with Redis Sentinel.

Also, given the Redis-provided documentation on Redis Sentinel, it seems to me that "Sentinels" are responsible for notifying clients of topology changes, such as the new master (a "role" (??) in the HA topology).

@jxblum
Contributor

jxblum commented May 16, 2023

I was also just reading, from "High availability with Redis Sentinel" documentation, in the section, "Fundamental things to know about Sentinel before deploying", # 6, that:

"Sentinel, Docker, or other forms of Network Address Translation or Port Mapping should be mixed with care: Docker performs port remapping, breaking Sentinel auto discovery of other Sentinel processes and the list of replicas for a master. Check the section about Sentinel and Docker later in this document for more information."

Unfortunately, the information is less than complete with respect to "roles".

You don't seem to be having problems with "connections" from the client (even after the Docker containers hosting the nodes in the Redis Sentinel setup come back online), only that the "roles" of the "listed" (static) nodes are stale.

In fact, there is no mention of "roles" in the Redis documentation at all. :(

@jxblum jxblum added status: waiting-for-feedback We need additional information before we can continue and removed status: waiting-for-triage An issue we've not yet triaged labels May 16, 2023
@jedevelop
Author

jedevelop commented May 17, 2023

If you were to code a simple Java application without Spring (Boot/Data Redis), that is, use only Java and Lettuce with Redis, would you have the same problem?

I think yes, because MasterReplicaConnectionProvider.knownNodes() is part of lettuce-core.

@jedevelop
Author

Do you have a reproducer handy that we can use to reproduce the issue without having to set up a ton of infrastructure?

@mp911de I have all the necessary infrastructure set up, and I can change or tweak something in the starter or in the driver if that helps you.

@spring-projects-issues spring-projects-issues removed the status: waiting-for-feedback We need additional information before we can continue label May 17, 2023
@jedevelop
Author

jedevelop commented May 17, 2023

Even SentinelTopologyRefresh is an "internal" class provided by the Lettuce driver (source)

Yes, I use SentinelTopologyRefresh, and when the master changes after a failover I can see that the application updates the topology.
SentinelConnector emits events:
"New topology: [RedisMasterReplicaNode [redisURI=rediss://@Host1:6379?timeout=1s, role=REPLICA], RedisMasterReplicaNode [redisURI=rediss://@host2:6379?timeout=1s, role=REPLICA], RedisMasterReplicaNode [redisURI=rediss://*******@host3:6379?timeout=1s, role=UPSTREAM]]"}

Among them I see the current master, but SentinelTopologyRefresh ONLY works as long as there is no disconnect. That is, I can send the DEBUG SLEEP {timeout} command to the current master in its Docker container and SentinelTopologyRefresh will indeed work; but if all containers stop, as in the case of a disconnect, the topology is not updated. After the disconnect there are no events from the SentinelConnector, and the master is out of date.

It turns out that Sentinel events are not delivered during the disconnect, so the topology does not change in the application.

When all three containers are restarted, the application sees them again, but with their old roles (MasterReplicaConnectionProvider.knownNodes()).
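
For illustration, one way to detect such a reconnect on the client side would be Lettuce's event bus; a minimal sketch (the refresh callback is hypothetical, not an existing API):

import io.lettuce.core.RedisClient
import io.lettuce.core.event.connection.ConnectionActivatedEvent

// Hypothetical watchdog: whenever a connection is (re)activated after an
// outage, trigger an application-level refresh of the Master/Replica roles
// (e.g. by re-establishing the connection, as suggested earlier in this thread).
fun watchReconnects(client: RedisClient, onReconnect: () -> Unit) {
    client.resources.eventBus().get()
        .filter { it is ConnectionActivatedEvent }
        .subscribe { onReconnect() }
}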

NOTE: I restart the entire infrastructure (stop the Docker containers, deploy certificates, start the containers), and only then do I see this error when trying to write to the old master. I use Ansible to restart the whole infrastructure.

@mp911de please take a look at the above.

@jedevelop
Author

Also, given the Redis-provided documentation on Redis Sentinel, it seems to me that "Sentinels" are responsible for notifying clients of topology changes, such as the new master (a "role" (??) in the HA topology).

Sentinel does not emit events when the Docker containers with Redis and Sentinel are stopped, and the topology does not change.

@mp911de
Member

mp911de commented May 17, 2023

Can you provide us with a bit of guidance regarding the docker containers for Redis? Do you have an image/multiple images that we can easily spin up to have a similar setup to yours? We would like to understand more about what is happening and therefore we need to have a way to reproduce the issue.

@jedevelop
Author

Yes, but I need time

@jxblum
Contributor

jxblum commented May 17, 2023

I am still not convinced this is a Spring Data Redis problem, but perhaps rather a Redis (client) driver, like Lettuce, problem specifically.

@jedevelop
Author

jedevelop commented May 18, 2023

@jxblum Maybe it's a problem with the Lettuce driver, but I don't know how to check and fix it yet

@jedevelop
Author

@mp911de

j2 template for the Redis configuration:

bind 0.0.0.0
protected-mode yes
requirepass {{ redis_password }}
masterauth {{ redis_password }}
user {{ redis_exporter_user }} +client +ping +info +config|get +cluster|info +slowlog +latency +memory +select +get +scan +xinfo +type +pfcount +strlen +llen +scard +zcard +hlen +xlen +eval allkeys on >{{ redis_exporter_password }}
{% if redis_expose_socket | bool %}
unixsocket {{ redis_socket_dir }}/redis.sock
unixsocketperm 755
{% endif %}

{% if redis_with_tls %}
port 0
tls-port {{ redis_port }}
tls-cert-file /opt/bitnami/redis/certs/{{ redis_tls_cert_file }}
tls-key-file /opt/bitnami/redis/certs/{{ redis_tls_key_file }}
tls-ca-cert-file /opt/bitnami/redis/certs/{{ redis_tls_ca_cert_file }}
tls-dh-params-file /opt/bitnami/redis/certs/{{ redis_tls_dh_params_file }}
tls-auth-clients no
tls-replication yes
{% if redis_cluster_mode %}
tls-cluster yes
{% endif %}
tls-protocols "TLSv1.2 TLSv1.3"
tls-prefer-server-ciphers yes
{% else %}
port {{ redis_port }}
{% endif %}

{% if redis_cluster_mode %}
cluster-enabled yes
cluster-node-timeout {{ redis_cluster_node_timeout }}
cluster-config-file /bitnami/redis/data/nodes.conf
replica-ignore-maxmemory no
{% endif %}

{% if redis_is_replica | bool %}
replicaof {{ redis_master_host }} {{ redis_port }}
{% endif %}

maxmemory {{ redis_cache_memory_mb }}mb
maxmemory-policy {{ redis_eviction_policy }}
appendonly {{ redis_appendonly }}
dir /bitnami/redis/data
loglevel {{ redis_loglevel }}

{% if redis_snapshot_disable | bool %}
save ""
{% endif %}

j2 template for the HAProxy configuration:

global
  log /dev/log local0 err alert
  log /dev/log local1 err alert
  spread-checks 5
  max-spread-checks 15000
  maxconn 50000

defaults
  mode tcp
  timeout connect 3s
  timeout client 120s
  timeout server 120s
  timeout tunnel 3600s

frontend stats
  bind *:8081
  mode http
  default_backend stats

backend stats
  mode http
  balance roundrobin
  stats enable
  stats uri /haproxy/stats
  stats refresh 1s
  stats show-legends

frontend ft_redis
  bind *:6380 name redis
  mode tcp
  default_backend bk_redis

backend bk_redis
  mode tcp
  option tcp-check
  tcp-check connect
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
{% for host in groups[redis_group_name] %}
  server {{ host }} {{ hostvars[host]['ansible_default_ipv4']['address'] }}:{{ redis_port }} check inter 1s
{% endfor %}

And for start, stop, and restart we use systemd with the following configurations.

j2 template for the Redis systemd unit:

[Unit]
Description=Redis container
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm {{ redis_service_name }}
ExecStart=/usr/bin/docker run --rm \
  -m {{ redis_container_memory_mb }}m \
  -p {{ redis_port }}:{{ redis_port }} \
  -v {{ redis_data_dir }}:/bitnami/redis/data \
  -v {{ redis_conf_dir }}/redis.conf:/opt/bitnami/redis/mounted-etc/redis.conf \
{% if redis_expose_socket | bool %}
  -v {{ redis_conf_dir }}/redis_socket/:{{ redis_socket_dir }}:rw,z \
{% endif %}
{% if redis_is_master | bool %}
  -e REDIS_REPLICATION_MODE=master \
{% elif redis_is_replica | bool %}
  -e REDIS_REPLICATION_MODE=slave \
  -e REDIS_MASTER_HOST={{ redis_master_host }} \
  -e REDIS_MASTER_PORT_NUMBER={{ redis_port }} \
  -e REDIS_MASTER_PASSWORD={{ redis_password }} \
{% endif %}
{% if redis_password is defined %}
  -e REDIS_PASSWORD={{ redis_password }} \
  -e REDISCLI_AUTH={{ redis_password }} \
{% else %}
  -e ALLOW_EMPTY_PASSWORD=yes \
{% endif %}
  -e REDIS_AOF_ENABLED={{ redis_appendonly }} \
{% if redis_with_tls %}
  -v {{ redis_data_secure_dir }}:/opt/bitnami/redis/certs \
  -e REDIS_TLS_ENABLED=yes \
  -e REDIS_PORT_NUMBER=0 \
  -e REDIS_TLS_PORT={{ redis_port }} \
  -e REDIS_TLS_CERT_FILE=/opt/bitnami/redis/certs/{{ redis_tls_cert_file }} \
  -e REDIS_TLS_KEY_FILE=/opt/bitnami/redis/certs/{{ redis_tls_key_file }} \
  -e REDIS_TLS_CA_FILE=/opt/bitnami/redis/certs/{{ redis_tls_ca_cert_file }} \
  -e REDIS_TLS_DH_PARAMS_FILE=/opt/bitnami/redis/certs/{{ redis_tls_dh_params_file }} \
  -e REDIS_TLS_AUTH_CLIENTS=no \
{% else %}
  -e REDIS_PORT_NUMBER={{ redis_port }} \
{% endif %}
  --net host \
  --name {{ redis_service_name }} {{ redis_docker_image_name }}

ExecStop=/usr/bin/docker stop -t 10 {{ redis_service_name }}

[Install]
WantedBy=multi-user.target

j2 template for the Redis proxy (HAProxy) systemd unit:

[Unit]
Description=Redis Proxy container
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm {{ redis_proxy_service_name }}
ExecStart=/usr/bin/docker run --rm \
  -m {{ redis_proxy_container_memory_mb }}m \
  -v {{ redis_proxy_conf_dir }}:/usr/local/etc/haproxy \
  --net host \
  --name {{ redis_proxy_service_name }} {{ redis_proxy_image_name }}

ExecStop=/usr/bin/docker stop -t 10 {{ redis_proxy_service_name }}

[Install]
WantedBy=multi-user.target

j2 template for the Redis Sentinel systemd unit:

[Unit]
Description=Redis Sentinel container
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm {{ redis_sentinel_service_name }}
ExecStart=/usr/bin/docker run --rm \
  -m {{ redis_sentinel_container_memory_mb }}m \
  -p {{ redis_sentinel_port }}:{{ redis_sentinel_port }} \
  -e REDIS_MASTER_HOST={{ redis_master_host }} \
  -e REDIS_MASTER_PORT_NUMBER={{ redis_port }} \
  -e REDIS_MASTER_PASSWORD={{ redis_password }} \
  -e REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS={{ redis_sentinel_down_after_milliseconds }} \
  -e REDIS_SENTINEL_FAILOVER_TIMEOUT={{ redis_sentinel_failover_timeout }} \
{% if redis_password is defined %}
  -e REDIS_SENTINEL_PASSWORD={{ redis_password }} \
{% else %}
  -e ALLOW_EMPTY_PASSWORD=yes \
{% endif %}
{% if redis_with_tls | bool %}
  -v {{ redis_data_secure_dir }}:/opt/bitnami/redis/certs \
  -e REDIS_SENTINEL_TLS_ENABLED=yes \
  -e REDIS_SENTINEL_PORT_NUMBER={{ redis_sentinel_port_when_tls_enabled }} \
  -e REDIS_SENTINEL_TLS_PORT_NUMBER={{ redis_sentinel_port }} \
  -e REDIS_SENTINEL_TLS_CERT_FILE=/opt/bitnami/redis/certs/{{ redis_tls_cert_file }} \
  -e REDIS_SENTINEL_TLS_KEY_FILE=/opt/bitnami/redis/certs/{{ redis_tls_key_file }} \
  -e REDIS_SENTINEL_TLS_CA_FILE=/opt/bitnami/redis/certs/{{ redis_tls_ca_cert_file }} \
  -e REDIS_SENTINEL_TLS_AUTH_CLIENTS=no \
{% else %}
  -e REDIS_SENTINEL_PORT_NUMBER={{ redis_sentinel_port }} \
{% endif %}
  --net host \
  --name {{ redis_sentinel_service_name }} {{ redis_sentinel_image_name }}

ExecStop=/usr/bin/docker stop -t 10 {{ redis_sentinel_service_name }}

[Install]
WantedBy=multi-user.target

@jedevelop
Author

@mp911de Hello. Is there anything I can do to fix the driver? I have all the necessary infrastructure and I can try to help; I need advice on which part of the code the problem could be in.

@mp911de
Member

mp911de commented Jun 1, 2023

I don't even have an idea what J2 is. The pasted code bits are overly complex, and by no means are we able to reproduce the issue from them. We're looking for a way to easily reproduce the scenario, without spending hours or days understanding the infrastructure, so that we can diagnose what is going wrong.

@jedevelop
Author

Could you stop all Redis nodes while clients are running, wait longer than the configured failure detection time (down-after-milliseconds) so that the failover process occurs, and then start Redis again? The issue is likely to be reproducible then, as the client will still have the old master.

@mp911de
Member

mp911de commented Sep 13, 2023

Closing this ticket because it isn't going anywhere and we cannot act on it.

@mp911de mp911de closed this as not planned Won't fix, can't repro, duplicate, stale Sep 13, 2023
@mp911de mp911de added status: invalid An issue that we don't feel is valid and removed status: feedback-provided Feedback has been provided labels Sep 13, 2023