Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQM test TestDQMGUIUpload times out #46682

Open
nothingface0 opened this issue Nov 12, 2024 · 12 comments
Open

DQM test TestDQMGUIUpload times out #46682

nothingface0 opened this issue Nov 12, 2024 · 12 comments

Comments

@nothingface0
Copy link
Contributor

The recently added TestDQMGUIUpload (#46551) has shown to fail even after 10 minutes of waiting, for recent PR tests and an IB:

After checking the logs of the target DQMGUI, the first impression I get is that during periods of heavy dev DQMGUI activity (upload of tier0 replays, PR root files), it looks like it might take a significant amount of time for the file uploaded by the test to be properly registered, meaning that the test fails. If this is the only problem of the test, we could increase the max waiting time.

Unfortunately, I forgot to add %H in the timestamp that is added to the file, so I don't know exactly how much time it takes the DQMGUI to discover each uploaded file, since I only know what time it arrived and was imported, but not when the test started.

I will keep this issue updated as I investigate from the DQM side.

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 12, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @nothingface0.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign dqm

@cmsbuild
Copy link
Contributor

New categories assigned: dqm

@antoniovagnerini,@rseidita you have been requested to review this Pull request/Issue and eventually sign? Thanks

@nothingface0
Copy link
Contributor Author

nothingface0 commented Nov 12, 2024

While debugging, we faced another issue and had to restart the dev DQMGUI, which led to another issue appearing. We are investigating.

Cms-talk post here

@smuzaffar
Copy link
Contributor

@nothingface0 , any idea why only the unit test fail while the dqm bin-by-bin comparison works [a]. dqm bin-bin comparison also uses visDQMUpload.py to upload many root files to https://cmsweb.cern.ch/dqm/dev https://github.com/cms-sw/cmssw/blob/master/DQMServices/FileIO/scripts/compareDQMOutput.py#L73-L99

[a]

Uploading output:Uploading output:

visDQMUpload.py https://cmsweb.cern.ch/dqm/dev /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/cms-bot/dqm-comparison/dqmComparisonOutput/pr/DQM_V0001_R000000001__RelVal_wf10224_0_pr__CMSSW_14_2_X-PRcmssw_46662-65580__DQMIO.root
visDQMUpload.py https://cmsweb.cern.ch/dqm/dev /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/cms-bot/dqm-comparison/dqmComparisonOutput/pr/DQM_V0001_R000000001__RelVal_wf13034_0_pr__CMSSW_14_2_X-PRcmssw_46662-65580__DQMIO.root
Uploading output:
visDQMUpload.py https://cmsweb.cern.ch/dqm/dev /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/cms-bot/dqm-comparison/dqmComparisonOutput/pr/DQM_V0001_R000165121__wf1000_0_pr__CMSSW_14_2_X-PRcmssw_46662-65580__DQMIO.root
Uploading output:Uploading output:


@nothingface0
Copy link
Contributor Author

@smuzaffar Regarding the bin-by-bin comparison, from what I understand it's done locally, where the test is running, and then the results are uploaded. In the script you link, there's no validation that the upload itself worked, e.g. by checking the GUI after the upload finished: it's just comparing and uploading.

@smuzaffar
Copy link
Contributor

smuzaffar commented Nov 13, 2024

the unit test is failing at the time of upload [a] in visDQMUpload.py ... right ? And this upload is working for DQM bin-by-bin otherwise we should have seen visDQMUpload.py failing too for dqm bin-bin .... right?

[a]

+ visDQMUpload.py https://cmsweb.cern.ch/dqm/dev DQM_V0001_R000000001__Harvesting__DQMTests202411134029559212184__DQMIO.root
DQM_V0001_R000000001__Harvesting__DQMTests202411134029559212184__DQMIO.root
Using SSL private key /data/cmsbld/jenkins/workspace/ib-run-qa/x509up_u501
Using SSL public key /data/cmsbld/jenkins/workspace/ib-run-qa/x509up_u501
ERROR HTTP Error 500: Internal Server Error
Status code:  None
Message:      None
Detail:       None

nothingface0 added a commit to nothingface0/cmssw that referenced this issue Nov 13, 2024
I.e., when the cmsbot is running the test for PRs or IBs,
where `CMSBOT_CI_TESTS` is set.

Also fix the timestamp of the renamed file to include the hour,
which was missing before, it will help with debugging.

See also: cms-sw#46682
nothingface0 added a commit to nothingface0/cmssw that referenced this issue Nov 13, 2024
I.e., when the cmsbot is running the test for PRs or IBs,
where `CMSBOT_CI_TESTS` is set.

- Also fix the timestamp of the renamed file to include the hour,
which was missing before, it will help with debugging.
- Added a `--help` argument, just in case.

See also: cms-sw#46682
nothingface0 added a commit to nothingface0/cmssw that referenced this issue Nov 13, 2024
I.e., when the cmsbot is running the test for PRs or IBs,
where `CMSBOT_CI_TESTS` is set.

- Also fix the timestamp of the renamed file to include the hour,
which was missing before, it will help with debugging.
- Added a `--help` argument, just in case.

See also: cms-sw#46682
@nothingface0
Copy link
Contributor Author

nothingface0 commented Nov 14, 2024

Taking this failed test as an example, judging from the logs I found in DQMGUI and the test's logs:

  1. The test file to be uploaded to the DQMGUI was created on 2024-11-12, XX:29:01 (Hour missing due to me forgetting the %H in the date command, run by the test).
  2. The file first appears in DQMGUI's logs at 2024-11-12, 12:29:02, CET (hence I'm assuming XX = 12, meaning that the file was sent within a second of being created).
  3. The file was processed by DQMGUI on 2024-11-12 14:43, CET, more than 2 hours later. It took 6 seconds to do it.
  4. DQMGUI logs around the time of the file's arrival include errors opening root files regarding different PRs (e.g. DQM_V0001_R000000001__RelVal_wf2500_201_base__CMSSW_14_2_X-PRcmssw_46659-65571__DQMIO.root).
  5. There is also a big chunk of time (~2 hours, from 12:45 to 14:41) where DQMGUI was occupied by the processing of a single root file: DQM_V0001_R000380306__wf2024_202001_base__CMSSW_14_2_X-PRcmssw_46666-65573__DQMIO.root. The file itself is small (27K) so I'm thinking something must have broken there.

Takeaway points:

  • The file uploaded by the test in question is small, arrives almost instantaneously, and is processed quickly, once processing starts.
  • We definitely can't increase the timeout of the test to 2 hours to include all cases.
  • We'll need to do some more investigation to find the reasons for the delay in root file processing in the DQMGUI.

@nothingface0
Copy link
Contributor Author

nothingface0 commented Nov 14, 2024

Another instance of the failure here.

  1. The file is generated 2024/11/14 14:30:16 CET.
  2. It's uploaded and acknowledged by DQMGUI at 14:30:19 CET (within 3 seconds).
  3. It's received by the GUI's background workers at 15:31:50 CET (at least 1 hour later),
  4. Starts being processed at 15:32:27 and processing finishes by 15:32:33 (6 seconds).

On the other hand, for this successful test:

  1. The file is generated 2024/11/13 18:03:37 (17:03:37 UTC, which is what the date command used to name the test file returns).
  2. It's uploaded and acknowledged by DQMGUI at 18:03:37 CET (within a second).
  3. It's received by the GUI's background workers at 18:03:42 CET (5 seconds later).
  4. Starts being processed at 18:03:50 and processing finishes by 18:03:56 (6 seconds).

@smuzaffar
Copy link
Contributor

smuzaffar commented Nov 14, 2024

How about we disable this test for PRs/IBs. We run it as a special test for each IB ( just like we run tests for crab and hlt) and there we can increase the wait time to few hours (we can run it on lxplus so it will not waste our build resources). If it does not get the processing after let say 6 hours then we can mark it failed?

@nothingface0
Copy link
Contributor Author

How about we disable this test for PRs/IBs. We run it as a special test for each IB ( just like we run tests for crab and hlt) and there we can increase the wait time to few hours (we can run it on lxplus so it will not waste our build resources). If it does not get the processing after let say 6 hours then we can mark it failed?

I didn't know there was such an option, sounds good to me! Let me know if any modifications are required for the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants