Test Building at UGent: Zen3/A100 #842

Open
wants to merge 8 commits into
base: 2023.06-software.eessi.io
from

Conversation

laraPPr
Collaborator

@laraPPr laraPPr commented Dec 17, 2024

No description provided.

@laraPPr laraPPr added the tests (Related to software testing) and accel:nvidia labels Dec 17, 2024
@riscv-eessi-io-bot

Instance eessi-bot-riscv is configured to build for:

  • architectures: riscv64/generic
  • repositories: riscv.eessi.io-20240402

eessi-bot bot commented Dec 17, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

eessi-bot bot commented Dec 17, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

bot: help

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot
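
The command syntax described in the help text above can be sketched in a few lines of Python. This is only an illustration of the stated rules (one command per line, each starting with `bot:` at the beginning of the line), not the bot's actual parser; the function name `parse_bot_commands` and the regular expression are assumptions made for this example.

```python
import re

# Assumption: this mirrors the syntax from the help text, not the bot's real code.
COMMAND_RE = re.compile(r"^bot:\s*(?P<command>\S+)(?P<args>.*)$")
SUPPORTED = {"help", "build", "show_config", "status"}


def parse_bot_commands(comment: str) -> list[tuple[str, list[str]]]:
    """Return (command, arguments) pairs for every line that starts with 'bot:'."""
    commands = []
    for line in comment.splitlines():
        match = COMMAND_RE.match(line)
        if match is None:
            continue  # e.g. 'bot show_config' lacks the colon and is skipped
        command = match.group("command")
        if command in SUPPORTED:
            commands.append((command, match.group("args").split()))
    return commands


print(parse_bot_commands("bot: help"))
print(parse_bot_commands("bot show_config"))
```

A line such as `bot show_config` (without the colon) yields no command in this sketch, which is consistent with that comment getting no bot response later in this thread.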

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

1 similar comment
@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

bot: help

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

1 similar comment
@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

@bedroge the riscv bot seems to be doing double duty; maybe check that out. #842 (comment)

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

bot: help

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

1 similar comment
@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@gpu-bot-ugent

gpu-bot-ugent bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

bot show_config

@trz42
Collaborator

trz42 commented Dec 17, 2024

bot show_config instance:eessi-bot-vsc-ugent

missing ':' after 'bot'?

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

bot: show_config instance:eessi-bot-vsc-ugent

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command show_config instance:eessi-bot-vsc-ugent from laraPPr

    • expanded format: show_config instance:eessi-bot-vsc-ugent
  • handling command show_config instance:eessi-bot-vsc-ugent resulted in:

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

1 similar comment
@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command show_config instance:eessi-bot-vsc-ugent from laraPPr

    • expanded format: show_config instance:eessi-bot-vsc-ugent
  • handling command show_config instance:eessi-bot-vsc-ugent resulted in:

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software
# This is a test to see whether build.sh finds the modules installed on the local cluster and therefore skips the build

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted
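
The "expanded format" lines show the bot rewriting `repo:` to `repository:` (and, later in this thread, `accel:` to `accelerator:`). A plausible reading of "no jobs were submitted" is that this instance does not match the `instance:` filter. The sketch below illustrates both ideas under that assumption; the names `ABBREVIATIONS`, `expand_filters` and `instance_matches` are hypothetical, and only the two abbreviations visible in this conversation are mapped.

```python
# Assumption: only abbreviations that appear in this thread are listed here.
ABBREVIATIONS = {"repo": "repository", "accel": "accelerator"}


def expand_filters(arguments: list[str]) -> list[str]:
    """Rewrite abbreviated filter keys, e.g. 'repo:x' becomes 'repository:x'."""
    expanded = []
    for argument in arguments:
        key, sep, value = argument.partition(":")
        expanded.append(f"{ABBREVIATIONS.get(key, key)}{sep}{value}")
    return expanded


def instance_matches(arguments: list[str], instance_name: str) -> bool:
    """True when no instance filter is given, or when a filter names this instance."""
    wanted = [a.split(":", 1)[1] for a in arguments if a.startswith("instance:")]
    return not wanted or instance_name in wanted


args = expand_filters(["instance:eessi-bot-vsc-ugent", "repo:eessi.io-2023.06-software"])
print(args)
print(instance_matches(args, "eessi-bot-mc-aws"))     # this instance would skip the build
print(instance_matches(args, "eessi-bot-vsc-ugent"))  # this instance would act on it
```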

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted

@gpu-bot-ugent

gpu-bot-ugent bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software resulted in:

@gpu-bot-ugent

gpu-bot-ugent bot commented Dec 17, 2024

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2024.12/pr_842/15437096

date | job status | comment
Dec 17 16:00:46 UTC 2024 | submitted | job id 15437096 awaits release by job manager
Dec 17 16:01:58 UTC 2024 | released | job awaits launch by Slurm scheduler
Dec 17 16:46:15 UTC 2024 | running | job 15437096 is running
Dec 17 17:14:59 UTC 2024 | finished | 🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job15437096.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
Dec 17 17:14:59 UTC 2024 | test result | 🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job15437096.test does not exist in job directory, or parsing it failed.
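
As a small worked example, the timestamps reported for job 15437096 can be turned into a queue wait and a run time. The snippet below only does the arithmetic on the four timestamps listed above; the format string assumes an English (or C) locale for the month abbreviation.

```python
from datetime import datetime

# Timestamp layout used in the bot's status table.
FORMAT = "%b %d %H:%M:%S UTC %Y"

events = {
    "submitted": "Dec 17 16:00:46 UTC 2024",
    "released": "Dec 17 16:01:58 UTC 2024",
    "running": "Dec 17 16:46:15 UTC 2024",
    "finished": "Dec 17 17:14:59 UTC 2024",
}
times = {status: datetime.strptime(stamp, FORMAT) for status, stamp in events.items()}

queue_wait = times["running"] - times["submitted"]  # time before the job started
run_time = times["finished"] - times["running"]     # time the job was running

print(queue_wait)  # 0:45:29
print(run_time)    # 0:28:44
```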

@bedroge
Collaborator

bedroge commented Dec 17, 2024

@bedroge the riscv bot seems to be doing double duty; maybe check that out. #842 (comment)

I restarted the event handler and smee, looks like that solved the issue.

@casparvl
Collaborator

I would try Sam's handiwork in EESSI/eessi-bot-software-layer#281 and set REFRAME_ARGS="--tag CI --tag 1_4_node --nocolor ${REFRAME_NAME_ARGS}". Then it is still not running the GPU tests, but that could be a workaround, I think?

True. For the sake of this PR, you could even adapt https://github.com/EESSI/software-layer/blob/2023.06-software.eessi.io/reframe_config_bot.py.tmpl temporarily to add the GPU feature as well. Run the build. Then change it back before merging. At least it'd allow you to prove the GPU-based tests pass.

But we need a more permanent fix, clearly :)
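
To make the "add the GPU feature temporarily" idea concrete, here is a minimal sketch of patching a ReFrame-style partition dictionary. The contents of reframe_config_bot.py.tmpl are not shown in this thread, so the partition below is hypothetical, as are the helper name `with_gpu_feature` and the feature name `gpu`; `features` and `devices` are used here as partition keys on the assumption that the template follows ReFrame's usual site configuration layout, and the real template may need more than this.

```python
import copy


def with_gpu_feature(partition: dict, num_gpus: int = 1) -> dict:
    """Return a copy of a partition config with a GPU feature and device added."""
    patched = copy.deepcopy(partition)  # leave the original untouched
    features = patched.setdefault("features", [])
    if "gpu" not in features:  # assumption: 'gpu' is the feature name tests look for
        features.append("gpu")
    patched.setdefault("devices", []).append({"type": "gpu", "num_devices": num_gpus})
    return patched


# Hypothetical partition; the real template's contents are not shown in this thread.
partition = {
    "name": "default",
    "scheduler": "local",
    "launcher": "mpirun",
    "features": ["cpu"],
}

print(with_gpu_feature(partition))
```

Because the helper returns a copy, reverting before merging amounts to dropping the call again.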

@casparvl
Collaborator

ReFrame also finds our local modules, like OSU-Micro-Benchmarks/7.4-gompi-2024a, TensorFlow, and GROMACS (they are skipped), but they can't be reached from the rfm.job, so it fails. I need to find the best place to insert a module purge --force, as early as possible, since I think this could also cause trouble with build.sh.

This is weird by the way, since with the local spawner, the rfm.job essentially runs locally. So how can the runtime find your local modules, but then the job doesn't, even though that is spawned locally? :|

@casparvl
Collaborator

casparvl commented Dec 17, 2024

It is probably good to mask all local modules in some way for the entire job (i.e. build and test phase). Not sure if that's easily possible. We've never really had this problem since we were building on the Magic Castle clusters that didn't have any modules other than the EESSI ones...

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2024.12/pr_842/15437096

date | job status | comment
Dec 17 16:00:46 UTC 2024 | submitted | job id 15437096 awaits release by job manager
Dec 17 16:01:58 UTC 2024 | released | job awaits launch by Slurm scheduler
Dec 17 16:46:15 UTC 2024 | running | job 15437096 is running
Dec 17 17:14:59 UTC 2024 | finished | 🤷 UNKNOWN (click triangle for detailed information)
Dec 17 17:14:59 UTC 2024 | test result | 🤷 UNKNOWN (click triangle for detailed information)

It started the build, so in the build.sh script it does not seem to find the local builds. I will add a check for the module path in the two scripts and will also apply the quick-and-dirty solution for the test suite.

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software
Building pmt; added a check to see what is in $MODULEPATH and made small changes to the test suite so that the tests will run

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software resulted in:

    • no jobs were submitted

@gpu-bot-ugent

gpu-bot-ugent bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software resulted in:

@gpu-bot-ugent

gpu-bot-ugent bot commented Dec 17, 2024

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2024.12/pr_842/15437121

date | job status | comment
Dec 17 17:34:40 UTC 2024 | submitted | job id 15437121 awaits release by job manager

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80
Building pmt; added a check to see what is in $MODULEPATH and made small changes to the test suite so that the tests will run

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@gpu-bot-ugent

gpu-bot-ugent bot commented Dec 17, 2024

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

@gpu-bot-ugent

gpu-bot-ugent bot commented Dec 17, 2024

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2024.12/pr_842/15437122

date | job status | comment
Dec 17 17:35:37 UTC 2024 | submitted | job id 15437122 awaits release by job manager
Dec 17 17:37:07 UTC 2024 | released | job awaits launch by Slurm scheduler
Dec 17 17:39:13 UTC 2024 | running | job 15437122 is running
Dec 17 17:53:37 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-15437122.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1734457462.tar.gz
size: 0 MiB (409562 bytes)
entries: 39
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
pmt/1.2.0-GCCcore-12.3.0-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
pmt/1.2.0-GCCcore-12.3.0-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
no other files in tarball
Dec 17 17:53:37 UTC 2024 | test result | 😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 3/3 test case(s) from 3 check(s) (3 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-15437122.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

@laraPPr
Collaborator Author

laraPPr commented Dec 17, 2024

The local module path should not be set.
Where I put the check right now also seems early enough to do the module purge --force, which unsets all the relevant module paths. @boegel and I were discussing that we want to use the local awscli module, but I don't think that is needed within the job environment?
bot/build.sh: MODULEPATH='/apps/gent/RHEL9/zen3-ampere-ib/modules/all:/etc/modulefiles/vsc'
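
The kind of check described here can be illustrated by splitting a MODULEPATH value into EESSI and site-local entries. This is a sketch, not the check that was added to bot/build.sh: the function name `split_modulepath` is hypothetical, and treating everything outside `/cvmfs/` as local is an assumption made for this example.

```python
# Assumption: EESSI module directories live under /cvmfs/, everything else is site-local.
EESSI_PREFIX = "/cvmfs/"


def split_modulepath(modulepath: str) -> tuple[list[str], list[str]]:
    """Split a colon-separated MODULEPATH into (EESSI entries, local entries)."""
    entries = [entry for entry in modulepath.split(":") if entry]
    eessi = [entry for entry in entries if entry.startswith(EESSI_PREFIX)]
    local = [entry for entry in entries if not entry.startswith(EESSI_PREFIX)]
    return eessi, local


# Value reported by bot/build.sh in the comment above.
reported = "/apps/gent/RHEL9/zen3-ampere-ib/modules/all:/etc/modulefiles/vsc"
eessi, local = split_modulepath(reported)
print("local entries that should be masked:", local)
```

With the reported value, both entries count as local, which matches the observation that the local module path is still set inside the job.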

@trz42
Collaborator

trz42 commented Dec 17, 2024

The local module path should not be set.
Where I put the check right now also seems early enough to do the module purge --force, which unsets all the relevant module paths. @boegel and I were discussing that we want to use the local awscli module, but I don't think that is needed within the job environment?
bot/build.sh: MODULEPATH='/apps/gent/RHEL9/zen3-ampere-ib/modules/all:/etc/modulefiles/vsc'

The aws command is used by the event handler on the host where it is running (not inside a job, that's correct). The bot expects aws to be in $PATH, and aws is actually called by another script, scripts/eessi-upload-to-staging. So a quick-and-dirty workaround could be to modify that script to load the local awscli module.
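
Since the bot expects aws to be found via $PATH, a pre-flight check on the event handler host could look roughly like this. It does not implement the module-loading workaround mentioned above; it only verifies that the tool is reachable. The helper name `require_tool` is hypothetical.

```python
import shutil
from typing import Optional


def require_tool(name: str, path: Optional[str] = None) -> str:
    """Return the full path of an executable, or raise if it is not on the search path."""
    location = shutil.which(name, path=path)  # path=None means: use $PATH
    if location is None:
        raise RuntimeError(f"'{name}' not found; is it in $PATH (or is its module loaded)?")
    return location


try:
    print(require_tool("aws"))
except RuntimeError as error:
    print(error)
```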

Labels
accel:nvidia tests Related to software testing
4 participants