[BUG] org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock flaky #10006

Closed
sohami opened this issue Sep 12, 2023 · 10 comments
Labels
bug (Something isn't working), Cluster Manager, flaky-test (Random test failure that succeeds on second run)

Comments

@sohami
Collaborator

sohami commented Sep 12, 2023

Describe the bug
Test org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock is flaky

To Reproduce

Sep 11, 2023 1:44:23 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#339,opensearch[node_t2][clusterApplierService#updateTask][T#1],5,TGRP-MinimumClusterManagerNodesIT]
java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode [test][2], node[IADuWGkCTpuWEnWUFcbkSQ], [P], s[STARTED], a[id=oar4Dv6STMWSzO-FDH4bMA]
	at __randomizedtesting.SeedInfo.seed([7E7C985F304948B0]:0)
	at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:752)
	at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:710)
	at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:650)
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:293)
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606)
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593)
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561)
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484)
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.seed=7E7C985F304948B0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-SD -Dtests.timezone=Europe/Lisbon -Druntime.java=20
NOTE: leaving temporary files on disk at: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.cluster.MinimumClusterManagerNodesIT_7E7C985F304948B0-001
NOTE: test params are: codec=Asserting(Lucene95), sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=ar-SD, timezone=Europe/Lisbon
NOTE: Linux 5.15.0-1039-aws amd64/Eclipse Adoptium 20.0.2 (64-bit)/cpus=32,threads=1,free=204825744,total=536870912
NOTE: All tests run in this JVM: [PendingTasksBlocksIT, GetIndexIT, ActiveShardsObserverIT, MinimumClusterManagerNodesIT]

Expected behavior
The test should always pass.

Plugins
Standard

Screenshots

Host/Environment (please complete the following information):
https://build.ci.opensearch.org/job/gradle-check/25287/testReport/junit/org.opensearch.cluster/MinimumClusterManagerNodesIT/testThreeNodesNoClusterManagerBlock/

Additional context
https://build.ci.opensearch.org/job/gradle-check/25287/


I (@andrross) am adding the content from this comment to the description here because it has now been buried in the comment stream:

I believe I have traced this back to the commit that introduced the flakiness: 9119b6d (#9105)

The following command will reliably reproduce the failure for me:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100

If I check out the commit immediately preceding 9119b6d, the failure does not reproduce.

This is a bit concerning because the commit in question is related to the remote store feature, yet MinimumClusterManagerNodesIT does not do anything related to remote store, so it is possible there is a significant regression here.
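
For anyone repeating the investigation, the bisection described above could be automated with git bisect run. This is only a sketch: <known-good-commit> is a placeholder for any commit known to predate the regression, and the test command is the reproducer quoted above, kept at -Dtests.iters=100 so that a single lucky pass is less likely to mislabel a bad commit as good.

# Sketch only; git bisect run treats a non-zero exit from the command as "bad"
git bisect start
git bisect bad 9119b6d
git bisect good <known-good-commit>
git bisect run ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100
git bisect reset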

@sohami added the bug and flaky-test labels Sep 12, 2023
@andrross
Member

#10519 (comment)

@reta
Collaborator

reta commented May 17, 2024

java.lang.AssertionError: Missing cluster-manager, expected nodes: [{node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}{127.0.0.1}{127.0.0.1:35601}{dimr}{shard_indexing_pressure_enabled=true}, {node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}{127.0.0.1}{127.0.0.1:44999}{dimr}{shard_indexing_pressure_enabled=true}, {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}] and actual cluster states [cluster uuid: y97dDapYSby5Tqr8dZbPZA [committed: true]
version: 10
state uuid: R5lx6pcBSomVZDCx7jW65Q
from_diff: false
meta data version: 7
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,4EedRXkRQVKI0fmGCb6Y1Q,NDGm--CAR-6KZLASPentjg}
      last_accepted_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,4EedRXkRQVKI0fmGCb6Y1Q,NDGm--CAR-6KZLASPentjg}
      voting tombstones: []
   [test/5EwrhsdDT2ShsBqHn77r-A]: v[7], mv[2], sv[1], av[1]
      0: p_term [1], isa_ids [wYcVyMvcQdOEOfu8mNcWfg, J1Y7tiTKQuCNpyQDskxqMQ]
      1: p_term [1], isa_ids [wMFwg3TiQV6Bq7tS1dmSVQ, zs4uh3wmRqiQs8HO7VEzSQ]
      2: p_term [1], isa_ids [dUkUHeD9QGuCY8do7NSPJg, q-h4jojQSkyi2z-2EroIlQ]
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      2,no cluster-manager, blocks WRITE,METADATA_WRITE
nodes: 
   {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}, local
   {node_t1}{NDGm--CAR-6KZLASPentjg}{XHgAw-wyToy7ZtQ8rsTQ9g}{127.0.0.1}{127.0.0.1:41977}{dimr}{shard_indexing_pressure_enabled=true}
   {node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}{127.0.0.1}{127.0.0.1:37015}{dimr}{shard_indexing_pressure_enabled=true}
routing_table (version 7):
-- index [[test/5EwrhsdDT2ShsBqHn77r-A]]
----shard_id [test][0]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
----shard_id [test][1]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][1], node[NDGm--CAR-6KZLASPentjg], [R], s[STARTED], a[id=wMFwg3TiQV6Bq7tS1dmSVQ]
----shard_id [test][2]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [R], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
--------[test][2], node[NDGm--CAR-6KZLASPentjg], [P], s[STARTED], a[id=dUkUHeD9QGuCY8do7NSPJg]

routing_nodes:
-----node_id[ymARqza7Q0eocUFwC_3sbQ][V]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [R], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
-----node_id[NDGm--CAR-6KZLASPentjg][V]
--------[test][2], node[NDGm--CAR-6KZLASPentjg], [P], s[STARTED], a[id=dUkUHeD9QGuCY8do7NSPJg]
--------[test][1], node[NDGm--CAR-6KZLASPentjg], [R], s[STARTED], a[id=wMFwg3TiQV6Bq7tS1dmSVQ]
---- unassigned
, cluster uuid: y97dDapYSby5Tqr8dZbPZA [committed: true]
version: 11
state uuid: KWBexCcJSmiAcjbR6BWO7w
from_diff: false
meta data version: 8
   coordination_metadata:
      term: 2
      last_committed_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,NDGm--CAR-6KZLASPentjg,4EedRXkRQVKI0fmGCb6Y1Q}
      last_accepted_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,NDGm--CAR-6KZLASPentjg,4EedRXkRQVKI0fmGCb6Y1Q}
      voting tombstones: []
   [test/5EwrhsdDT2ShsBqHn77r-A]: v[7], mv[2], sv[1], av[1]
      0: p_term [1], isa_ids [wYcVyMvcQdOEOfu8mNcWfg, J1Y7tiTKQuCNpyQDskxqMQ]
      1: p_term [1], isa_ids [wMFwg3TiQV6Bq7tS1dmSVQ, zs4uh3wmRqiQs8HO7VEzSQ]
      2: p_term [1], isa_ids [dUkUHeD9QGuCY8do7NSPJg, q-h4jojQSkyi2z-2EroIlQ]
metadata customs:
   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}, cluster-manager
   {node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}{127.0.0.1}{127.0.0.1:37015}{dimr}{shard_indexing_pressure_enabled=true}
   {node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}{127.0.0.1}{127.0.0.1:44999}{dimr}{shard_indexing_pressure_enabled=true}, local
routing_table (version 8):
-- index [[test/5EwrhsdDT2ShsBqHn77r-A]]
----shard_id [test][0]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
----shard_id [test][1]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
----shard_id [test][2]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
--------[test][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]

routing_nodes:
-----node_id[ymARqza7Q0eocUFwC_3sbQ][V]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
-----node_id[NDGm--CAR-6KZLASPentjg][V]
---- unassigned
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
--------[test][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
, cluster uuid: _na_ [committed: false]
version: 0
state uuid: ceEdRNQoSp2wgPupsMUYdA
from_diff: false
meta data version: 0
   coordination_metadata:
      term: 0
      last_committed_config: VotingConfiguration{}
      last_accepted_config: VotingConfiguration{}
      voting tombstones: []
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      1,state not recovered / initialized, blocks READ,WRITE,METADATA_READ,METADATA_WRITE,CREATE_INDEX
      2,no cluster-manager, blocks WRITE,METADATA_WRITE
nodes: 
   {node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}{127.0.0.1}{127.0.0.1:35601}{dimr}{shard_indexing_pressure_enabled=true}, local
routing_table (version 0):
routing_nodes:
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
---- unassigned
]

@dblock mentioned this issue May 20, 2024
@andrross
Member

Closing in favor of the autocut #14289

@github-project-automation bot moved this from Now (This Quarter) to ✅ Done in the Cluster Manager Project Board Jun 17, 2024