
Optimize iteration in DatasetInstance model + SA2.0 fix #16776

Merged
merged 3 commits into from
Oct 10, 2023

Conversation

jdavcs
Member

@jdavcs jdavcs commented Oct 2, 2023

This is both an optimization and a SQLAlchemy 2.0 fix.

There's no need to prepopulate two lists by pulling all records from the database, concatenate them into one list, and then iterate. Instead, we simply chain the iterators: that way we don't pull everything from the database up front, and we can exit the iteration early if the condition is met.
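The idea can be illustrated with plain `itertools.chain` (a minimal sketch, not the actual Galaxy code; `query1()`/`query2()` are hypothetical stand-ins for the two database queries):

```python
from itertools import chain

def query1():
    # Stand-in for the first database query; yields rows lazily.
    for i in range(3):
        yield i

def query2():
    # Stand-in for the second database query; never consumed if we exit early.
    for i in range(3, 6):
        yield i

def first_match(condition):
    # Chaining the iterators avoids materializing both result sets up front,
    # and lets the loop terminate as soon as a match is found.
    for row in chain(query1(), query2()):
        if condition(row):
            return row
    return None

print(first_match(lambda r: r == 1))  # stops inside query1; query2 is never run
```

With prepopulated lists, both queries would have run to completion before the loop even started; with chained iterators, rows are fetched only as the loop asks for them.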

Ref: #12541

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

No need to make lists before iteration given possible early termination
@jdavcs jdavcs added kind/refactoring cleanup or refactoring of existing code, no functional changes area/database Galaxy's database or data access layer labels Oct 2, 2023
@jdavcs jdavcs added this to the 23.2 milestone Oct 2, 2023
@jdavcs jdavcs requested a review from a team October 2, 2023 14:17
@@ -4487,10 +4488,9 @@ def ok_to_edit_metadata(self):
# prevent modifying metadata when dataset is queued or running as input/output
# This code could be more efficient, i.e. by using mappers, but to prevent slowing down loading a History panel, we'll leave the code here for now
Member

should this comment now go away?

Member Author

Well now that you mention it... I've looked at the generated queries. Given that this needs to be efficient, we can do better.

Currently, this is what's sent to the database before the loop (using dataset_id=42, job_id=1 as fillers):

SELECT
	job_to_input_dataset.id,
	job_to_input_dataset.job_id,
	job_to_input_dataset.dataset_id,
	job_to_input_dataset.dataset_version,
	job_to_input_dataset.name,
	dataset_1.id AS id_1,
	dataset_1.job_id AS job_id_1,
	dataset_1.create_time,
	dataset_1.update_time,
	dataset_1.state,
	dataset_1.deleted,
	dataset_1.purged,
	dataset_1.purgable,
	dataset_1.object_store_id,
	dataset_1.external_filename,
	dataset_1._extra_files_path,
	dataset_1.created_from_basename,
	dataset_1.file_size,
	dataset_1.total_size,
	dataset_1.uuid,
	history_dataset_association_1.id AS id_2,
	history_dataset_association_1.history_id,
	history_dataset_association_1.dataset_id AS dataset_id_1,
	history_dataset_association_1.create_time AS create_time_1,
	history_dataset_association_1.update_time AS update_time_1,
	history_dataset_association_1.state AS state_1,
	history_dataset_association_1.copied_from_history_dataset_association_id,
	history_dataset_association_1.copied_from_library_dataset_dataset_association_id,
	history_dataset_association_1.name AS name_1,
	history_dataset_association_1.info,
	history_dataset_association_1.blurb,
	history_dataset_association_1.peek,
	history_dataset_association_1.tool_version,
	history_dataset_association_1.extension,
	history_dataset_association_1.metadata_deferred,
	history_dataset_association_1.parent_id,
	history_dataset_association_1.designation,
	history_dataset_association_1.deleted AS deleted_1,
	history_dataset_association_1.visible,
	history_dataset_association_1.extended_metadata_id,
	history_dataset_association_1.version,
	history_dataset_association_1.hid,
	history_dataset_association_1.purged AS purged_1,
	history_dataset_association_1.validated_state,
	history_dataset_association_1.validated_state_message,
	history_dataset_association_1.hidden_beneath_collection_instance_id
FROM
	job_to_input_dataset
	LEFT JOIN history_dataset_association AS history_dataset_association_1 ON
			history_dataset_association_1.id = job_to_input_dataset.dataset_id
	LEFT JOIN dataset AS dataset_1 ON dataset_1.id = history_dataset_association_1.dataset_id
WHERE
	job_to_input_dataset.dataset_id = 42;

SELECT
	job_to_output_dataset.id,
	job_to_output_dataset.job_id,
	job_to_output_dataset.dataset_id,
	job_to_output_dataset.name,
	dataset_1.id AS id_1,
	dataset_1.job_id AS job_id_1,
	dataset_1.create_time,
	dataset_1.update_time,
	dataset_1.state,
	dataset_1.deleted,
	dataset_1.purged,
	dataset_1.purgable,
	dataset_1.object_store_id,
	dataset_1.external_filename,
	dataset_1._extra_files_path,
	dataset_1.created_from_basename,
	dataset_1.file_size,
	dataset_1.total_size,
	dataset_1.uuid,
	history_dataset_association_1.id AS id_2,
	history_dataset_association_1.history_id,
	history_dataset_association_1.dataset_id AS dataset_id_1,
	history_dataset_association_1.create_time AS create_time_1,
	history_dataset_association_1.update_time AS update_time_1,
	history_dataset_association_1.state AS state_1,
	history_dataset_association_1.copied_from_history_dataset_association_id,
	history_dataset_association_1.copied_from_library_dataset_dataset_association_id,
	history_dataset_association_1.name AS name_1,
	history_dataset_association_1.info,
	history_dataset_association_1.blurb,
	history_dataset_association_1.peek,
	history_dataset_association_1.tool_version,
	history_dataset_association_1.extension,
	history_dataset_association_1.metadata_deferred,
	history_dataset_association_1.parent_id,
	history_dataset_association_1.designation,
	history_dataset_association_1.deleted AS deleted_1,
	history_dataset_association_1.visible,
	history_dataset_association_1.extended_metadata_id,
	history_dataset_association_1.version,
	history_dataset_association_1.hid,
	history_dataset_association_1.purged AS purged_1,
	history_dataset_association_1.validated_state,
	history_dataset_association_1.validated_state_message,
	history_dataset_association_1.hidden_beneath_collection_instance_id
FROM
	job_to_output_dataset
	LEFT JOIN history_dataset_association AS history_dataset_association_1 ON
			history_dataset_association_1.id = job_to_output_dataset.dataset_id
	LEFT JOIN dataset AS dataset_1 ON dataset_1.id = history_dataset_association_1.dataset_id
WHERE
	job_to_output_dataset.dataset_id = 42;

Then, at each iteration step this is sent to the database:

SELECT
	job.id AS job_id,
	job.create_time AS job_create_time,
	job.update_time AS job_update_time,
	job.history_id AS job_history_id,
	job.library_folder_id AS job_library_folder_id,
	job.tool_id AS job_tool_id,
	job.tool_version AS job_tool_version,
	job.galaxy_version AS job_galaxy_version,
	job.dynamic_tool_id AS job_dynamic_tool_id,
	job.state AS job_state,
	job.info AS job_info,
	job.copied_from_job_id AS job_copied_from_job_id,
	job.command_line AS job_command_line,
	job.dependencies AS job_dependencies,
	job.job_messages AS job_job_messages,
	job.param_filename AS job_param_filename,
	job.runner_name AS job_runner_name_1,
	job.job_stdout AS job_job_stdout,
	job.job_stderr AS job_job_stderr,
	job.tool_stdout AS job_tool_stdout,
	job.tool_stderr AS job_tool_stderr,
	job.exit_code AS job_exit_code,
	job.traceback AS job_traceback,
	job.session_id AS job_session_id,
	job.user_id AS job_user_id,
	job.job_runner_name AS job_job_runner_name,
	job.job_runner_external_id AS job_job_runner_external_id,
	job.destination_id AS job_destination_id,
	job.destination_params AS job_destination_params,
	job.object_store_id AS job_object_store_id,
	job.imported AS job_imported,
	job.params AS job_params,
	job.handler AS job_handler,
	job.preferred_object_store_id AS job_preferred_object_store_id,
	job.object_store_id_overrides AS job_object_store_id_overrides,
	EXISTS(
		SELECT
			history_dataset_collection_association.id
		FROM
			history_dataset_collection_association, job_to_output_dataset_collection
		WHERE
			job.id = job_to_output_dataset_collection.job_id
			AND history_dataset_collection_association.id
				= job_to_output_dataset_collection.dataset_collection_id
			AND history_dataset_collection_association.deleted = true
	)
		AS anon_1,
	EXISTS(
		SELECT
			history_dataset_association.id
		FROM
			history_dataset_association, job_to_output_dataset
		WHERE
			job.id = job_to_output_dataset.job_id
			AND history_dataset_association.id = job_to_output_dataset.dataset_id
			AND history_dataset_association.deleted = true
	)
		AS anon_2
FROM
	job
WHERE
	job.id = 1;

So, instead, before the loop:

SELECT
	job_to_input_dataset.job_id
FROM
	job_to_input_dataset
WHERE
	job_to_input_dataset.dataset_id = 42;

SELECT
	job_to_output_dataset.job_id
FROM
	job_to_output_dataset
WHERE
	job_to_output_dataset.dataset_id = 42;

And at each iteration step:

SELECT job.state FROM job WHERE job.id = 1;

Fix coming up in next commit.
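The lighter query flow described above can be mimicked with the stdlib sqlite3 module (a sketch with toy tables and the same filler ids, not Galaxy's real schema or its SQLAlchemy code):

```python
import sqlite3
from itertools import chain

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE job_to_input_dataset (job_id INTEGER, dataset_id INTEGER);
    CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
    INSERT INTO job VALUES (1, 'ok'), (2, 'queued');
    INSERT INTO job_to_input_dataset VALUES (1, 42);
    INSERT INTO job_to_output_dataset VALUES (2, 42);
    """
)

def ok_to_edit_metadata(dataset_id):
    # Before the loop: only job ids are selected; no full ORM entities.
    inputs = conn.execute(
        "SELECT job_id FROM job_to_input_dataset WHERE dataset_id = ?", (dataset_id,)
    )
    outputs = conn.execute(
        "SELECT job_id FROM job_to_output_dataset WHERE dataset_id = ?", (dataset_id,)
    )
    for (job_id,) in chain(inputs, outputs):
        # At each step: one narrow SELECT, with early exit on a blocking state.
        (state,) = conn.execute(
            "SELECT state FROM job WHERE id = ?", (job_id,)
        ).fetchone()
        if state in ("queued", "running"):
            return False  # metadata must not be edited
    return True

print(ok_to_edit_metadata(42))  # False: job 2 is queued
```

Each `conn.execute` here corresponds to one of the slim queries shown above; nothing resembling the wide multi-join SELECTs from the original version is ever issued.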

@jdavcs
Member Author

jdavcs commented Oct 2, 2023

Selenium test failure unrelated.

Comment on lines 4490 to 4493
stmt1 = select(JobToInputDatasetAssociation.job_id).filter_by(dataset_id=self.id)
stmt2 = select(JobToOutputDatasetAssociation.job_id).filter_by(dataset_id=self.id)
for job_id in chain(sa_session.scalars(stmt1), sa_session.scalars(stmt2)):
    state = sa_session.scalar(select(Job.state).where(Job.id == job_id))
Member

If we're already doing this, why not join on the job table and filter the state using exists?
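The EXISTS shape being suggested can be sketched in raw SQL against toy versions of the same tables (a stdlib sqlite3 illustration only; the actual fix would use SQLAlchemy's exists()/select() constructs, and the UNION of the two association tables is one possible way to merge them, not necessarily the merged query the PR ended up with):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE job_to_input_dataset (job_id INTEGER, dataset_id INTEGER);
    CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
    INSERT INTO job VALUES (1, 'ok'), (2, 'queued');
    INSERT INTO job_to_input_dataset VALUES (1, 42);
    INSERT INTO job_to_output_dataset VALUES (2, 42);
    """
)

# A single round trip: does any job linked to this dataset sit in a
# blocking state? The database can stop scanning at the first hit.
sql = """
SELECT EXISTS (
    SELECT 1 FROM job
    JOIN job_to_input_dataset jtid ON jtid.job_id = job.id
    WHERE jtid.dataset_id = :ds AND job.state IN ('queued', 'running')
    UNION
    SELECT 1 FROM job
    JOIN job_to_output_dataset jtod ON jtod.job_id = job.id
    WHERE jtod.dataset_id = :ds AND job.state IN ('queued', 'running')
)
"""
(blocked,) = conn.execute(sql, {"ds": 42}).fetchone()
print(bool(blocked))  # True: job 2 is queued
```

This replaces the Python-side loop entirely: the early-exit logic moves into the database, which is why the reviewer's suggestion makes the check "very fast".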

Member Author

Done. This should be very fast now. Thanks for the suggestion!

Member Author

and the test failures seem relevant...

Member Author

Fixed. I had missed a final select.

@jdavcs jdavcs force-pushed the dev_sa20_optimize_iter branch from 43094af to 003ae50 Compare October 3, 2023 15:57
@mvdbeek mvdbeek merged commit 5909481 into galaxyproject:dev Oct 10, 2023
40 of 42 checks passed