Is it possible to retry failing specs on another worker? #157
We use rspec-retry to deal with flaky browser specs. Unfortunately, we're currently running into an issue where firefox / selenium / geckodriver crashes and is unrecoverable (we believe it's related to Docker memory / too many file handles). All the retry attempts of the spec fail, and any further specs that use the browser also fail.
We're working towards solving the underlying issue, but is there a mechanism in knapsack to retry failed specs on another worker?
We've had a few other issues that have caused an entire worker to be inoperable (like postgres crashing and specs not waiting for it to recover), so I think this is applicable to more than just our current issue.

Comments
No. Even if there were such a mechanism, there would be edge cases. When could we consider that all parallel jobs completed their work properly? Let's say you have a problematic CI job where the firefox process fails and you can't run tests there, and the tests are put back in the Queue in knapsack_pro Queue Mode so that other parallel CI jobs can consume them. There are probably ways to solve this edge case, for example forcing CI jobs to wait for some time until all test files are acknowledged by CI jobs, so that we know all jobs executed their tests. There would also be an edge case where tests are never acknowledged by any CI job, and we would have to handle that as well, maybe with some timeout. There are probably more edge cases to consider; these are just the ones off the top of my head.
Most likely there is no simple solution right away. We would have to collect more feedback from other users on whether they would find it useful to auto-assign tests to other jobs when a CI job can't run tests, and then try to find as simple a solution as possible to avoid edge cases. The simplest action for you to take for now could be:
What is your organization ID or email? You can send it to [email protected] and I can review your account.
Thanks @ArturT - sorry for the delay, this slipped off my radar. This all makes sense to me, thanks for detailing everything 👍. For our specific case:
I'll send in that email and try to find a few examples of runs exhibiting the crashes, thanks!
I'm pasting here an idea that might be useful for others looking at this issue: you could collect the test file paths of failing tests from all parallel nodes, generate a file with that list, and then use this list of test files to run only the failed tests.
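A rough sketch of that idea, assuming RSpec's JSON formatter and `jq` are available; the file names, the node-index variable, and the final rerun commands are illustrative, not taken from this thread:

```bash
# Hypothetical helper, not from the thread: collect failed spec files per node,
# merge them, and rerun only those files.

# 1. On each parallel CI node, write RSpec results as JSON and extract failed files.
bundle exec rspec --format progress --format json --out tmp/rspec_results.json || true
jq -r '.examples[] | select(.status == "failed") | .file_path' tmp/rspec_results.json \
  | sort -u > "tmp/failed_specs_node_${CIRCLE_NODE_INDEX}.txt"

# 2. After all nodes finish, gather the per-node lists (e.g. via CI artifacts)
#    and merge them into a single file.
cat tmp/failed_specs_node_*.txt | sort -u > tmp/failed_specs.txt

# 3. Rerun only those files, e.g. with plain RSpec...
xargs bundle exec rspec < tmp/failed_specs.txt

# ...or hand the list to knapsack_pro so the rerun is still split across nodes
# (KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE is the gem's option for running a fixed list of test files).
KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE=tmp/failed_specs.txt \
  bundle exec rake knapsack_pro:queue:rspec
```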
The idea of running failed tests on another worker/CI node is part of the idea of improving the Queue API.
We're not running into the original problem I posted about anymore (feel free to close this issue), but as an FYI: CircleCI has started experimental support for "rerun failed tests only". We're trying this out, and here's how we combine CircleCI failure retries with knapsack:
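One way such a setup can be wired is sketched below, assuming CircleCI's `circleci tests glob` / `circleci tests run` commands and knapsack_pro Queue Mode; the wrapper script name `bin/ci_rspec` and the branching logic are illustrative, not the commenter's actual setup:

```bash
#!/usr/bin/env bash
# bin/ci_rspec -- hypothetical wrapper passed to `circleci tests run --command`.
# CircleCI pipes the test files to run on stdin: the full globbed list on a normal
# run, or only the previously failed files on a "rerun failed tests only" run.
set -euo pipefail

FILES="$(cat)"                                          # spec files piped in by CircleCI
ALL_FILES="$(circleci tests glob "spec/**/*_spec.rb")"

if [ "$(echo "$FILES" | wc -l)" -eq "$(echo "$ALL_FILES" | wc -l)" ]; then
  # Normal run: all spec files were passed, so let knapsack_pro Queue Mode split the work.
  bundle exec rake knapsack_pro:queue:rspec
else
  # Failed-tests-only rerun: run exactly the files CircleCI handed us.
  echo "$FILES" | xargs bundle exec rspec
fi

# In .circleci/config.yml the step would then look roughly like:
#   circleci tests glob "spec/**/*_spec.rb" | circleci tests run --command="bin/ci_rspec" --verbose
```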
@MarkyMarkMcDonald Thanks for sharing the example.
SOLUTION: Here is an example of how to rerun only failed tests on CircleCI: