[24.0] More efficient change_state queries, maybe fix deadlock #17632

Merged

@mvdbeek mvdbeek commented Mar 7, 2024

Here's the deadlock:

Traceback (most recent call last):
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 317 waits for ShareLock on transaction 1057; blocked by process 318.
Process 318 waits for ShareLock on transaction 1056; blocked by process 317.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (0,7) in relation "dataset"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/galaxy/galaxy/galaxy root/lib/galaxy/jobs/runners/__init__.py", line 203, in put
    queue_job = job_wrapper.enqueue()
  File "/home/runner/work/galaxy/galaxy/galaxy root/lib/galaxy/jobs/__init__.py", line 1589, in enqueue
    self.change_state(model.Job.states.QUEUED, flush=False, job=job)
  File "/home/runner/work/galaxy/galaxy/galaxy root/lib/galaxy/jobs/__init__.py", line 1547, in change_state
    job.update_output_states(self.app.application_stack.supports_skip_locked())
  File "/home/runner/work/galaxy/galaxy/galaxy root/lib/galaxy/model/__init__.py", line 2053, in update_output_states
    sa_session.execute(statement, params)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1717, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1710, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1577, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1953, in _execute_context
    self._handle_dbapi_exception(
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
    util.raise_(
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 317 waits for ShareLock on transaction 1057; blocked by process 318.
Process 318 waits for ShareLock on transaction 1056; blocked by process 317.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (0,7) in relation "dataset"

[SQL:
            UPDATE dataset
            SET
                state = %(state)s,
                update_time = %(update_time)s
            WHERE id IN (
                SELECT hda.dataset_id FROM history_dataset_association hda
                INNER JOIN job_to_output_dataset jtod
                ON jtod.dataset_id = hda.id AND jtod.job_id = %(job_id)s
            );
        ]
[parameters: {'state': 'queued', 'update_time': datetime.datetime(2024, 3, 7, 12, 29, 10, 229364), 'job_id': 3}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

The likely culprit for the deadlock is that __EXTRACT_DATASET__ deals with the same dataset as the tool that created the collection __EXTRACT_DATASET__ is running on, so both jobs might be attempting to update the output state at the same time.
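
For readers unfamiliar with the DETAIL lines in the error, here is a minimal sketch that parses them into a wait-for graph and confirms the two processes block each other. The regular expression and the helper names (`wait_for_graph`, `find_cycle`) are illustrative assumptions and not part of Galaxy or PostgreSQL.

```python
import re

# DETAIL lines copied from the error message above.
DETAIL = """\
DETAIL:  Process 317 waits for ShareLock on transaction 1057; blocked by process 318.
Process 318 waits for ShareLock on transaction 1056; blocked by process 317.
"""

# Hypothetical pattern for this message shape: waiting pid, then blocking pid.
WAIT_RE = re.compile(
    r"Process (\d+) waits for \w+ on transaction \d+; blocked by process (\d+)\."
)


def wait_for_graph(detail):
    """Map each waiting process id to the process id that blocks it."""
    return {int(waiter): int(blocker) for waiter, blocker in WAIT_RE.findall(detail)}


def find_cycle(graph, start):
    """Follow 'blocked by' edges from start; return the cycle through start, or []."""
    seen = []
    node = start
    while node in graph and node not in seen:
        seen.append(node)
        node = graph[node]
    return seen if node == start else []


graph = wait_for_graph(DETAIL)
print(graph)
print(find_cycle(graph, 317))
```

Each process holds a row lock the other one needs, so neither can proceed until PostgreSQL aborts one of the transactions.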

My thinking is that by filtering on the job_id, the change_state call of the __EXTRACT_DATASET__ job will no longer change the state of that shared dataset.
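
The reasoning above can be illustrated with a toy SQLite schema. This is a sketch under stated assumptions, not the actual change: it assumes "filtering on the job_id" means matching a job id stored directly on the dataset row (a hypothetical `dataset.job_id` column recording the creating job), and the table contents are invented for illustration. The join-based statement is the one from the error above, with named parameters adapted for SQLite.

```python
import sqlite3


def build_db():
    """Toy schema: one dataset shared by two HDAs, each an output of a different job."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- job_id on dataset is an ASSUMPTION: the job that created the dataset
        CREATE TABLE dataset (id INTEGER PRIMARY KEY, state TEXT, job_id INTEGER);
        CREATE TABLE history_dataset_association (id INTEGER PRIMARY KEY, dataset_id INTEGER);
        CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
        INSERT INTO dataset VALUES (7, 'new', 1);
        INSERT INTO history_dataset_association VALUES (10, 7), (11, 7);
        -- job 1 created the dataset; job 2 stands in for the extracting job
        INSERT INTO job_to_output_dataset VALUES (1, 10), (2, 11);
        """
    )
    return conn


# Statement from the traceback: reaches the dataset through any output HDA of the job.
JOIN_UPDATE = """
UPDATE dataset SET state = :state
WHERE id IN (
    SELECT hda.dataset_id FROM history_dataset_association hda
    INNER JOIN job_to_output_dataset jtod
    ON jtod.dataset_id = hda.id AND jtod.job_id = :job_id
)
"""

# Hypothetical alternative: only the creating job matches the dataset row.
DIRECT_UPDATE = "UPDATE dataset SET state = :state WHERE job_id = :job_id"


def rows_touched(sql, job_id):
    """Run the update against a fresh toy database and return the affected row count."""
    conn = build_db()
    count = conn.execute(sql, {"state": "queued", "job_id": job_id}).rowcount
    conn.close()
    return count


for job_id in (1, 2):
    print(job_id, rows_touched(JOIN_UPDATE, job_id), rows_touched(DIRECT_UPDATE, job_id))
```

In this toy setup both jobs update dataset 7 through the join, while only the creating job does so with the direct filter, which removes the second writer on the shared row.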

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@mvdbeek mvdbeek added kind/bug area/database Galaxy's database or data access layer labels Mar 7, 2024
@github-actions github-actions bot added this to the 24.1 milestone Mar 7, 2024
jdavcs commented Mar 7, 2024

Unfortunately, they are not equivalent. I haven't looked at the details, but here's what I've found: the first UPDATE statement is executed 5 times in the failing test. The 3rd and 4th pass result in 1 row selected for updating in the old version and 0 rows in the new version.

mvdbeek commented Mar 7, 2024

because we didn't add the job association in the right place.

martenson commented Mar 7, 2024

These test failures look related:

FAILED tests/app/managers/test_JobConnectionsManager.py::test_graph_manager_inputs_for_hda - AttributeError: 'NoneType' object has no attribute 'job'
FAILED tests/app/managers/test_JobConnectionsManager.py::test_graph_manager_outputs_for_hda - AttributeError: 'NoneType' object has no attribute 'job'
FAILED tests/app/managers/test_JobConnectionsManager.py::test_graph_manager_inputs_for_hdca - AttributeError: 'NoneType' object has no attribute 'job'
FAILED tests/app/managers/test_JobConnectionsManager.py::test_graph_manager_outputs_for_hdca - AttributeError: 'NoneType' object has no attribute 'job'
FAILED tests/app/managers/test_JobConnectionsManager.py::test_graph_manager_hda - AttributeError: 'NoneType' object has no attribute 'job'
FAILED tests/app/managers/test_JobConnectionsManager.py::test_graph_manager_hdca - AttributeError: 'NoneType' object has no attribute 'job'
= 6 failed, 683 passed, 85 skipped, 1 xfailed, 1 xpassed, 5960 warnings in 213.66s (0:03:33) =

@jdavcs jdavcs left a comment

Looks great! Holding fingers crossed :)

@jdavcs jdavcs merged commit 53c2bc7 into galaxyproject:release_24.0 Mar 7, 2024
49 of 50 checks passed
@jdavcs jdavcs modified the milestones: 24.1, 24.0 Mar 25, 2024