
Feature - stdout live reporting #16975

Merged
merged 36 commits into galaxyproject:dev from feature_stdout_live_reporting
Nov 18, 2024

Conversation

gecage952
Contributor

@gecage952 gecage952 commented Nov 3, 2023

Hi,
As part of our work at Oak Ridge National Lab, we've been using Galaxy for quite a while (some of us have attended GCC as well). We've also been doing some internal development on features that users have requested here. One of the most common requests was the ability to see live console output as jobs are running. This issue has been brought up before, for example in #2332, but given that our users wanted it now, I took a stab at an implementation. My main goal was to minimize impact to the way Galaxy works today, so that there aren't any compatibility issues. I'll try to provide details about each part that was touched.

Overview

The overall idea here was to add logic to allow the job manager to read the tool_stdout and tool_stderr files that are saved in the job directory, and return them as part of a status. The reason I put it into the status is that the UI already polls the status regularly from the JobInformation page, so I wouldn't have to make a new thread or anything. It also just made sense to me that you might want it as part of the status of the job. To facilitate this, the API endpoint for getting job status was adjusted to accept parameters selecting which part of the stdout/stderr you want (both work the same, so I'll just refer to stdout from here). There's stdout_position, which is the starting index in the stdout file, and stdout_length, which is how much of the stdout you want (in characters). Because stdout could potentially be a relatively large file, I didn't want to force people to read the whole file every time status is called.
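
To make the polling concrete, here is a rough sketch of how a client might request an incremental chunk of stdout. The parameter names follow the description above and the review diff (stdout_position/stdout_length); the endpoint path, header, and response field are assumptions, since the API shape changed during review.

```python
import requests

GALAXY_URL = "https://galaxy.example.org"   # hypothetical server
API_KEY = "your-api-key"                    # hypothetical key
JOB_ID = "encoded-job-id"                   # hypothetical encoded job id

# Ask for up to 1000 characters of tool stdout, starting where the last poll stopped.
params = {"stdout_position": 0, "stdout_length": 1000}
response = requests.get(
    f"{GALAXY_URL}/api/jobs/{JOB_ID}",      # job status endpoint described above
    params=params,
    headers={"x-api-key": API_KEY},
)
response.raise_for_status()
chunk = response.json().get("tool_stdout", "")
# On the next poll, advance stdout_position by len(chunk) so only new output is read.
```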

I then adjusted the UI for the job information view in a few different ways. First, I made the code blocks scrollable and set a max height for them. Then I moved the expand-on-click functionality to only the expand icons, rather than the whole table row (previously, if users tried to highlight a part of the stdout or clicked the scroll bar to scroll, the view would collapse). Lastly, I added an autoscroll feature that automatically scrolls the code blocks while the user is at the bottom of the stdout. If the user scrolls up, this is disabled. If they scroll back to the bottom, it starts again.

The last thing I want to note is that, as far as compatibility with job runners goes, we almost exclusively use Pulsar for running jobs. As is, this PR will only work for job runners that save their stdout to the job directory inside of Galaxy. Internally, we've added functionality for Pulsar to do this (which is the purpose of the lib/galaxy/webapps/galaxy/api/job_files.py changes). I did not include those changes here, because that would require an additional PR to the Pulsar repository. Let me know if there's interest in seeing that, however. It would also be nice to get some feedback on testing this.

I understand this is a pretty big change, and I imagine there are a lot of areas for improvement. Please let me know if this is something people are interested in helping with, or if this is a terrible way to try to do this, or whatever. It's worked for us so far internally, but I would love to have some feedback from here.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. The best way to test is to start a tool that will run for some time and produce output on stdout.
    2. Start the tool and go to the JobInformation page for the job.
    3. Expand the Tool stdout by clicking on the expand icon to the right.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@dannon
Member

dannon commented Nov 6, 2023

@gecage952 Very cool! I'll defer to others for feedback on the API design, but I went ahead and pushed a minor change fixing up the linting and client test failures, and updating the client API schema.

@dannon dannon force-pushed the feature_stdout_live_reporting branch from e5b887e to 201f556 on November 6, 2023 14:03
@dannon dannon force-pushed the feature_stdout_live_reporting branch from 201f556 to 24e7d15 on November 6, 2023 14:05
@bgruening
Member

That is very cool and a feature we also get asked about every now and then. As a Pulsar deployer, I'm also interested in the Pulsar part and how this works in practice. Are you using the MQ deployment of Pulsar?

Comment on lines 192 to 195
- stdout_position: The index of the character to begin reading stdout from
- stdout_length: How many characters of stdout to read
- stderr_position: The index of the character to begin reading stderr from
- stderr_length: How many characters of stderr to read
Member


Can you add two separate endpoints for the job's stdout and stderr?

@gecage952
Contributor Author

That is very cool and a feature we also get asked every now and then. As a Pulsar deployer, I'm also interested in the Pulsar part and how this works in practice. Are you using the MQ deployment of Pulsar?

Yeah, rabbitmq.
The general idea of the pulsar piece is to send the stdout/stderr files to the job_files api endpoint periodically. To make sure we aren't sending the entire file every time, we keep track of the last position in a map. Then when it's time to send the next chunk, we seek to that position in the files, post it to the job_files endpoint which appends it to the file in the Galaxy job directory. We also considered using the message queue for this instead of the api, but ended up not going that direction.
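
As a rough illustration of that loop (function and parameter names here are hypothetical; the real implementation lives in galaxyproject/pulsar#345), the Pulsar side might look something like this:

```python
import os
import requests

# Last byte offset already posted to Galaxy, per (job_id, stream) pair.
_sent_positions: dict = {}

def post_new_output(job_id: str, stream: str, local_path: str, job_files_url: str, job_key: str) -> None:
    """Send any stdout/stderr produced since the last call to Galaxy's job_files endpoint."""
    last_pos = _sent_positions.get((job_id, stream), 0)
    if not os.path.exists(local_path):
        return
    with open(local_path, "rb") as fh:
        fh.seek(last_pos)                     # skip the part that was already sent
        chunk = fh.read()
    if not chunk:
        return
    # Galaxy's job_files endpoint appends this chunk to the matching file
    # in the job directory (the parameters shown here are illustrative).
    requests.post(
        job_files_url,
        params={"job_key": job_key, "path": stream},
        files={"file": chunk},
    )
    _sent_positions[(job_id, stream)] = last_pos + len(chunk)
```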

@mvdbeek
Member

mvdbeek commented Nov 7, 2023

This is very cool, thanks a lot!

Let me know if there's interest in seeing that however.

that'd be great!

It would also be nice to get some feedback on testing this.

The integration tests are going to run with the local job runner by default (as opposed to the API tests, which could run against external Galaxy servers where the stdio streams may not be available). What you can do is submit a tool job against such an instance that prints something to stdout, then sleeps and prints something at the end; in your test you can then assert that you saw the first message but not the second. Take a look at https://github.com/galaxyproject/galaxy/blob/dev/test/integration/test_job_recovery.py#L30-L36 for running tools in integration tests. Let me know if you need any help with this.
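
For reference, a hedged sketch of such a test (the tool id and the two underscore-prefixed helper methods are hypothetical placeholders, not existing Galaxy test utilities):

```python
import time

def test_live_stdout_reporting(self):
    history_id = self.dataset_populator.new_history()
    # Hypothetical tool that echoes "first message", sleeps, then echoes "second message".
    run = self._submit_tool_without_waiting("live_output_tool", history_id)
    job_id = run["jobs"][0]["id"]
    time.sleep(5)  # give the job time to start and print its first line
    stdout = self._fetch_job_stdout(job_id)  # hypothetical helper hitting the new endpoint
    assert "first message" in stdout
    assert "second message" not in stdout  # the job is still sleeping
```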

@gecage952
Contributor Author

gecage952 commented Nov 17, 2023

So, I opened a PR in the Pulsar repo with that code: galaxyproject/pulsar#345
I'll try to get to the suggestions here in the coming weeks (busy time of year).

@martenson martenson marked this pull request as draft November 17, 2023 20:02
@gecage952
Contributor Author

Ok, I added a new endpoint for getting stdout and stderr, and then updated everything to use the new endpoint. I see it was mentioned to have them separate, and I can still do that if necessary; I just combined them here for now for my own testing purposes.

@gecage952
Contributor Author

I'll also take a look at the merge conflicts.

@gecage952
Contributor Author

Updated so that it will check the job destination params to see if the assigned destination has the live_tool_output_reporting param set to true. One thing I've noticed is that the way the page refreshes can be a little jarring. I know in previous versions of Galaxy it was different.
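
Roughly, the kind of check being described might look like the sketch below (attribute and key names are assumptions for illustration, not necessarily the code in this PR):

```python
def live_output_enabled(job_wrapper) -> bool:
    # Read the destination the job was mapped to and look for the opt-in flag.
    params = job_wrapper.job_destination.params or {}
    return str(params.get("live_tool_output_reporting", False)).lower() == "true"
```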

@gecage952
Contributor Author

Fixed the current merge conflicts.

@bgruening
Member

Can you please run make update-client-api-schema?

@gecage952
Contributor Author

Noticed the api schema needed to be updated again, so I went ahead and did that. Should be good to review if tests pass.

@nsoranzo
Member

Looks like you need to run 'make update-client-api-schema' and commit results.

@gecage952
Contributor Author

Gotcha thanks, just did it.

@mvdbeek mvdbeek requested a review from jmchilton September 24, 2024 13:11
@jmchilton
Member

Pushed a merge commit to resolve conflicts. This looks really great, nice work and I'm so sorry for the delay. I think we will get this into the forthcoming 24.2 release.

@gecage952
Contributor Author

Awesome! No worries on the delay. I totally get it.

@bernt-matthias
Contributor

Was just wondering what happens in real user setups, where a chown happens before the job runs (i.e. the Galaxy user won't be able to access the files while the job runs).

@jmchilton
Member

Was just wondering what happens in real user setups, where a chown happens before the job runs (i.e. the Galaxy user won't be able to access the files while the job runs).

I imagine it won't work - it is off by default in my testing though, so I think it isn't a blocker. I don't have a setup for testing that - but it might be feasible to fix, if it doesn't work, by ensuring the relevant files are readable by the Galaxy user (maybe a job destination option for setting group- or world-readable permissions).

@bernt-matthias
Contributor

I think it isn't a blocker.

Me too.

Wondering if we really need the chgrp here:

If we can drop this, the Galaxy user would still have read access.

I could test this (but unlikely before the release).

@gecage952
Contributor Author

In my testing, those types of errors get caught, and it just appears the same as the default behavior to users. It does log the error though.

@jmchilton
Member

jmchilton commented Nov 14, 2024

I've rerun the failing integration tests and I think

test/integration/test_pulsar_embedded_mq.py::TestEmbeddedMessageQueuePulsarPurge::test_purge_while_job_running

is a valid failure. I'm not getting much from digging through the debug logging, and this is kind of a pain to test locally because of the MQ. I'll try to keep digging; we're trying to branch very soon and I really want to get this in before then.

From the logs:

So the error is that we're waiting on a history that we expect to end up fine, but there are datasets in the "failed_metadata" state.

The job logs are verbose with all the file transfers up and down... but toward the end... they mostly look fine and have no indications about this dataset as far as I can tell.

pulsar.client.staging.down INFO 2024-11-14 15:17:00,865 [pN:main,p:9546,tN:PulsarJobRunner.work_thread-1] collecting output outputs_new/implicit_dataset_conversions.txt with action FileAction[path=/tmp/tmpztun6q_f/tmplbeuc894/tmp9x0bam0_/database/job_working_directory_py8fi90v/000/1/metadata/outputs_new/implicit_dataset_conversions.txt,action_type=remote_transfer,url=http://localhost:8199/api/jobs/adb5f5c93f827949/files?job_key=56a9d5119a62bf92&path=%2Ftmp%2Ftmpztun6q_f%2Ftmplbeuc894%2Ftmp9x0bam0_%2Fdatabase%2Fjob_working_directory_py8fi90v%2F000%2F1%2Fmetadata%2Foutputs_new%2Fimplicit_dataset_conversions.txt&file_type=output_metadata]
pulsar.client.staging.down DEBUG 2024-11-14 15:17:00,865 [pN:main,p:9546,tN:PulsarJobRunner.work_thread-1] Cleaning up job (failed [False], cleanup_job [onsuccess])
galaxy.tool_util.provided_metadata DEBUG 2024-11-14 15:17:00,887 [pN:main,p:9546,tN:PulsarJobRunner.work_thread-1] unnamed outputs [{'output_tool_supplied_metadata': {'name': 'my dynamic name', 'ext': 'txt', 'info': 'my dynamic info'}}]
galaxy.model.store.discover DEBUG 2024-11-14 15:17:00,890 [pN:main,p:9546,tN:PulsarJobRunner.work_thread-1] (1) Created dynamic collection dataset for path [/tmp/tmpztun6q_f/tmplbeuc894/tmp9x0bam0_/database/job_working_directory_py8fi90v/000/1/working/output.txt] with element identifier [output] for output [discovered_list] (0.756 ms)
galaxy.model.store.discover DEBUG 2024-11-14 15:17:00,893 [pN:main,p:9546,tN:PulsarJobRunner.work_thread-1] (1) Add dynamic collection datasets to history for output [discovered_list] (2.630 ms)
galaxy.jobs INFO 2024-11-14 15:17:00,953 [pN:main,p:9546,tN:PulsarJobRunner.work_thread-1] Collecting metrics for Job 1 in /tmp/tmpztun6q_f/tmplbeuc894/tmp9x0bam0_/database/job_working_directory_py8fi90v/000/1/metadata
galaxy.jobs DEBUG 2024-11-14 15:17:00,964 [pN:main,p:9546,tN:PulsarJobRunner.work_thread-1] job_wrapper.finish for job 1 executed (98.738 ms)
INFO:     127.0.0.1:46454 - "GET /api/histories/adb5f5c93f827949 HTTP/1.1" 200 OK
Problem in history with id adb5f5c93f827949 - summary of history's datasets and jobs below.
INFO:     127.0.0.1:46470 - "GET /api/histories/adb5f5c93f827949/contents HTTP/1.1" 200 OK
--------------------------------------

The history in console output logging does show the failed_metadata dataset:

--------------------------------------
| 6 - all_output_types (HID - NAME) 
INFO:     127.0.0.1:46556 - "GET /api/histories/adb5f5c93f827949/contents/f3f73e481f432006 HTTP/1.1" 200 OK
| Dataset State:
|  failed_metadata
| Dataset Blurb:
|  1 line
| Dataset Info:
|  *Dataset info is empty.*
| Peek:
|  <table cellspacing="0" cellpadding="3"><tr><td>hi</td></tr></table>
INFO:     127.0.0.1:46560 - "GET /api/histories/adb5f5c93f827949/contents/f3f73e481f432006/provenance HTTP/1.1" 200 OK
| Dataset Job Standard Output:
|  *Standard output was empty.*
| Dataset Job Standard Error:
|  *Standard error was empty.*
|
--------------------------------------

Most of the outputs are fine. They are defined here: https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/all_output_types.xml#L21. I guess one of the more esoteric ways to catch datasets is failing here - likely discover_datasets, but I cannot tell from the logs exactly.

Got permission from Marius in exchange for pledge to fix it before the release.
So let's restrict this new append behavior in the job files API to just the tool_stdout and tool_stderr files.
@jmchilton jmchilton merged commit 8c30a87 into galaxyproject:dev Nov 18, 2024
55 of 56 checks passed
@jdavcs jdavcs added the highlight Included in user-facing release notes at the top label Nov 20, 2024
@martenson
Member

This is awesome, thanks @gecage952 ! 🎉

Labels
area/API, area/jobs, highlight, kind/feature