Fix failing test step on AWS #678

casparvl
Collaborator

Currently, the test step on AWS fails because we fail to get a memory limit from the cgroup. I'll add some more verbose output as a first step to debugging this.

eessi-bot bot commented Aug 20, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

eessi-bot bot commented Aug 20, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:zen2

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/16702

date job status comment
Aug 20 09:03:49 UTC 2024 submitted job id 16702 awaits release by job manager
Aug 20 09:04:01 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 09:05:04 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-16702.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 20 09:05:04 UTC 2024 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-16702.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

ERROR: Both files /sys/fs/cgroup/memory//slurm/uid_60008/job_16702/step_batch/memory.limit_in_bytes and /sys/fs/cgroup//slurm/uid_60008/job_16702/step_batch/memory.max couldn't be found. Failed to get the memory limit from the current cgroup

Interactively, I got the correct output, from file /sys/fs/cgroup/memory//slurm/uid_60014/job_16701/step_interactive/memory.limit_in_bytes. Now, I don't know why step_batch doesn't seem to have the memory.limit_in_bytes file.

I'll try to manually submit some batch jobs on AWS to figure this out. It's tricky that we don't see this interactively...

My bet is that this is due to the replacement with $(</proc/self/cpuset). That seemed like a more robust way of doing things, but maybe it isn't. We can always revert that choice, at least for AWS.
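
The lookup behind the error message above can be sketched roughly as follows. This is a minimal sketch, not the script from the PR: the function name `read_cgroup_mem_limit` and its two arguments are hypothetical, and only the two file paths (cgroup v1 `memory.limit_in_bytes`, cgroup v2 `memory.max`) are taken from the error message.

```shell
# Hypothetical helper (name and arguments are assumptions, not from the PR).
# Prints the memory limit of a cgroup, trying the cgroup v1 file first and
# the cgroup v2 file second.
#   $1: sysfs root, e.g. /sys
#   $2: cgroup path, e.g. the contents of /proc/self/cpuset
read_cgroup_mem_limit() {
    local sysfs_root="$1"
    local cgroup_path="$2"
    local v1="${sysfs_root}/fs/cgroup/memory/${cgroup_path}/memory.limit_in_bytes"
    local v2="${sysfs_root}/fs/cgroup/${cgroup_path}/memory.max"

    if [ -f "$v1" ]; then
        cat "$v1"
    elif [ -f "$v2" ]; then
        # Note: on cgroup v2 this file may contain the literal string "max".
        cat "$v2"
    else
        echo "ERROR: Both files ${v1} and ${v2} couldn't be found." >&2
        return 1
    fi
}

# Possible usage inside a Slurm job (bash syntax for reading the file):
#   read_cgroup_mem_limit /sys "$(</proc/self/cpuset)"
```

Passing the sysfs root as an argument makes it easy to point the same lookup at a different mount of the cgroup hierarchy while debugging.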

@casparvl
Collaborator Author

Very strange, this small test job just works correctly:

#!/bin/bash

#SBATCH -p x86-64-amd-zen3-node
#SBATCH -t 10:00
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --exclusive

cpuset_file=$(</proc/self/cpuset)
echo "CPUset file: $cpuset_file"

echo "ls -al /sys/fs/cgroup/memory"
ls -al /sys/fs/cgroup/memory/

echo "ls -al /sys/fs/cgroup/memory/${cpuset_file}/"
ls -al /sys/fs/cgroup/memory/${cpuset_file}/

cgroup_mem_limit="/sys/fs/cgroup/memory/${cpuset_file}/memory.limit_in_bytes"
echo "${cgroup_mem_limit}"
cat "${cgroup_mem_limit}"

With output:

CPUset file: /slurm/uid_60014/job_16706/step_batch
ls -al /sys/fs/cgroup/memory
total 0
dr-xr-xr-x.  6 root root   0 Aug 20 07:57 .
drwxr-xr-x. 14 root root 360 Aug 20 07:57 ..
-rw-r--r--.  1 root root   0 Aug 20 09:30 cgroup.clone_children
--w--w--w-.  1 root root   0 Aug 20 09:30 cgroup.event_control
-rw-r--r--.  1 root root   0 Aug 20 08:03 cgroup.procs
-r--r--r--.  1 root root   0 Aug 20 09:30 cgroup.sane_behavior
drwxr-xr-x.  2 root root   0 Aug 20 07:57 init.scope
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.failcnt
--w-------.  1 root root   0 Aug 20 09:30 memory.force_empty
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.failcnt
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.limit_in_bytes
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.max_usage_in_bytes
-r--r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.slabinfo
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.tcp.failcnt
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.tcp.limit_in_bytes
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.tcp.max_usage_in_bytes
-r--r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.tcp.usage_in_bytes
-r--r--r--.  1 root root   0 Aug 20 09:30 memory.kmem.usage_in_bytes
-rw-r--r--.  1 root root   0 Aug 20 07:57 memory.limit_in_bytes
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.max_usage_in_bytes
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.memsw.failcnt
-rw-r--r--.  1 root root   0 Aug 20 08:00 memory.memsw.limit_in_bytes
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.memsw.max_usage_in_bytes
-r--r--r--.  1 root root   0 Aug 20 09:30 memory.memsw.usage_in_bytes
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.move_charge_at_immigrate
-r--r--r--.  1 root root   0 Aug 20 09:30 memory.numa_stat
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.oom_control
----------.  1 root root   0 Aug 20 09:30 memory.pressure_level
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.soft_limit_in_bytes
-r--r--r--.  1 root root   0 Aug 20 09:30 memory.stat
-rw-r--r--.  1 root root   0 Aug 20 09:30 memory.swappiness
-r--r--r--.  1 root root   0 Aug 20 09:30 memory.usage_in_bytes
-rw-r--r--.  1 root root   0 Aug 20 07:57 memory.use_hierarchy
-rw-r--r--.  1 root root   0 Aug 20 09:30 notify_on_release
-rw-r--r--.  1 root root   0 Aug 20 09:30 release_agent
drwxr-xr-x.  4 root root   0 Aug 20 09:36 slurm
drwxr-xr-x. 85 root root   0 Aug 20 07:57 system.slice
-rw-r--r--.  1 root root   0 Aug 20 08:00 tasks
drwxr-xr-x.  2 root root   0 Aug 20 07:57 user.slice
ls -al /sys/fs/cgroup/memory//slurm/uid_60014/job_16706/step_batch/
total 0
drwxr-xr-x. 3 casparvl casparvl 0 Aug 20 09:36 .
drwxr-xr-x. 4 root     root     0 Aug 20 09:36 ..
-rw-r--r--. 1 root     root     0 Aug 20 09:36 cgroup.clone_children
--w--w--w-. 1 root     root     0 Aug 20 09:36 cgroup.event_control
-rw-r--r--. 1 root     root     0 Aug 20 09:36 cgroup.procs
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.failcnt
--w-------. 1 root     root     0 Aug 20 09:36 memory.force_empty
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.failcnt
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.limit_in_bytes
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.max_usage_in_bytes
-r--r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.slabinfo
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.tcp.failcnt
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.tcp.limit_in_bytes
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.tcp.max_usage_in_bytes
-r--r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.tcp.usage_in_bytes
-r--r--r--. 1 root     root     0 Aug 20 09:36 memory.kmem.usage_in_bytes
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.limit_in_bytes
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.max_usage_in_bytes
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.memsw.failcnt
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.memsw.limit_in_bytes
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.memsw.max_usage_in_bytes
-r--r--r--. 1 root     root     0 Aug 20 09:36 memory.memsw.usage_in_bytes
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.move_charge_at_immigrate
-r--r--r--. 1 root     root     0 Aug 20 09:36 memory.numa_stat
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.oom_control
----------. 1 root     root     0 Aug 20 09:36 memory.pressure_level
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.soft_limit_in_bytes
-r--r--r--. 1 root     root     0 Aug 20 09:36 memory.stat
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.swappiness
-r--r--r--. 1 root     root     0 Aug 20 09:36 memory.usage_in_bytes
-rw-r--r--. 1 root     root     0 Aug 20 09:36 memory.use_hierarchy
-rw-r--r--. 1 root     root     0 Aug 20 09:36 notify_on_release
drwxr-xr-x. 2 casparvl casparvl 0 Aug 20 09:36 task_0
-rw-r--r--. 1 root     root     0 Aug 20 09:36 tasks
/sys/fs/cgroup/memory//slurm/uid_60014/job_16706/step_batch/memory.limit_in_bytes
32967229440
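
For readability, the raw byte count printed above can be converted to MiB with shell integer arithmetic. The helper name `bytes_to_mib` is hypothetical and not part of the PR; this is only a sketch of the arithmetic.

```shell
# Hypothetical helper (not part of the PR): convert a byte count to whole MiB
# using integer division.
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

bytes_to_mib 32967229440   # prints 31440
```

A caller reading a cgroup v2 `memory.max` file would need to handle the literal value `max` separately, since it is not a number.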

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:zen2

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/16708

date job status comment
Aug 20 09:51:03 UTC 2024 submitted job id 16708 awaits release by job manager
Aug 20 09:51:14 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 09:52:15 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-16708.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 20 09:52:15 UTC 2024 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-16708.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

Caspar van Leeuwen added 2 commits August 20, 2024 11:56
@casparvl
Collaborator Author

casparvl commented Aug 20, 2024

The difference is that we are in a container, so we should use the info from the mounted directories of the host, not from the container's /sys and /proc. I.e., we should use /hostsys, as we did prior to #670 (the PR that broke things on AWS).
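
One way to confirm this kind of difference is to check, from inside the container, under which sysfs root the job's cgroup files are actually visible. The following is a debugging sketch only: the function name `report_cgroup_visibility` is hypothetical, and it checks just the cgroup v1 memory limit file.

```shell
# Hypothetical debugging aid (not part of the PR): for each candidate sysfs
# root, report whether the cgroup v1 memory limit file for the given cgroup
# path is visible there.
#   $1:  cgroup path, e.g. /slurm/uid_60008/job_16702/step_batch
#   $2+: sysfs roots to check, e.g. /sys /hostsys
report_cgroup_visibility() {
    local cgroup_path="$1"
    shift
    local root limit_file
    for root in "$@"; do
        limit_file="${root}/fs/cgroup/memory/${cgroup_path}/memory.limit_in_bytes"
        if [ -f "$limit_file" ]; then
            echo "${root}: found"
        else
            echo "${root}: missing"
        fi
    done
}

# Possible usage inside the container (bash syntax for reading the file):
#   report_cgroup_visibility "$(</proc/self/cpuset)" /sys /hostsys
```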

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:zen2

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/16709

date job status comment
Aug 20 10:00:50 UTC 2024 submitted job id 16709 awaits release by job manager
Aug 20 10:01:19 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 10:02:20 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-16709.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 20 10:02:20 UTC 2024 test result
😢 FAILURE (click triangle for details)
Reason
Failed for unknown reason
Details
✅ job output file slurm-16709.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:zen2

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/16710

date job status comment
Aug 20 10:02:44 UTC 2024 submitted job id 16710 awaits release by job manager
Aug 20 10:03:22 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 10:04:24 UTC 2024 running job 16710 is running
Aug 20 10:20:57 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-16710.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 20 10:20:57 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-16710.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

…it from the containers' /proc is fine. So let's do that
@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:zen2

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/16712

date job status comment
Aug 20 10:07:38 UTC 2024 submitted job id 16712 awaits release by job manager
Aug 20 10:08:32 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 10:17:46 UTC 2024 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job16712.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.
Aug 20 10:17:46 UTC 2024 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job16712.test does not exist in job directory or reading it failed.

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:zen3

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen3 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen3
  • handling command build repository:eessi.io-2023.06-software architecture:zen3 resulted in:

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen3 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen3
  • handling command build repository:eessi.io-2023.06-software architecture:zen3 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/16713

date job status comment
Aug 20 10:15:55 UTC 2024 submitted job id 16713 awaits release by job manager
Aug 20 10:16:42 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 10:17:45 UTC 2024 running job 16713 is running
Aug 20 10:33:34 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-16713.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 20 10:33:34 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-16713.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:zen2

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen2 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen2
  • handling command build repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/16714

date job status comment
Aug 20 10:19:51 UTC 2024 submitted job id 16714 awaits release by job manager
Aug 20 10:20:55 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 10:21:59 UTC 2024 running job 16714 is running
Aug 20 10:39:49 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-16714.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 20 10:39:49 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-16714.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:zen4

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen4 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen4
  • handling command build repository:eessi.io-2023.06-software architecture:zen4 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:zen4 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen4
  • handling command build repository:eessi.io-2023.06-software architecture:zen4 resulted in:

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-azure for architecture x86_64-amd-zen4 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/227

date job status comment
Aug 20 10:21:57 UTC 2024 submitted job id 227 awaits release by job manager
Aug 20 10:22:11 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 11:54:19 UTC 2024 running job 227 is running
Aug 20 12:27:17 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-227.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 20 12:27:17 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 13/13 test case(s) from 13 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-227.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

bot: build repo:eessi.io-2023.06-software arch:x86_64/intel/haswell

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/intel/haswell from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/intel/haswell
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/intel/haswell resulted in:

eessi-bot bot commented Aug 20, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/intel/haswell from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/intel/haswell
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/intel/haswell resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 20, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-intel-haswell for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_678/16715

date job status comment
Aug 20 10:23:00 UTC 2024 submitted job id 16715 awaits release by job manager
Aug 20 10:23:03 UTC 2024 released job awaits launch by Slurm scheduler
Aug 20 10:29:20 UTC 2024 running job 16715 is running
Aug 20 10:45:57 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-16715.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 20 10:45:57 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-16715.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl casparvl marked this pull request as ready for review August 20, 2024 11:11
@casparvl casparvl changed the title Try to fix failing test step on AWS Fix failing test step on AWS Aug 20, 2024
Collaborator

@trz42 trz42 left a comment

Looks fine overall. I added a small question about whether the script could be made a little 'smarter' (using /sys or /hostsys), and a suggestion to explain what /hostsys is (and where it comes from).

Comment on lines +167 to +168
cgroup_v1_mem_limit="/hostsys/fs/cgroup/memory/$(</proc/self/cpuset)/memory.limit_in_bytes"
cgroup_v2_mem_limit="/hostsys/fs/cgroup/$(</proc/self/cpuset)/memory.max"
Collaborator

This path probably only makes sense when run in a very specific environment (e.g., testing software built for EESSI). While this is fine, how about checking whether /sys or /hostsys is available and using that?

A comment explaining what /hostsys is and how it is made available might make debugging a little easier.

Collaborator Author

Yes, we bind-mount this additional path in bot/test.sh. You're absolutely right about the commenting part: I'll make that clear.

Regarding a fallback on /sys, I'm not sure we want to do that. If /hostsys isn't there, it means the bind-mount failed or was not executed. I'd prefer a hard error here over a silent success that may extract the wrong amount of memory.
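
The "hard error instead of silent fallback" choice could look roughly like this. It is a sketch under assumptions: the function name `require_host_sysfs` and the exact check (presence of a `fs/cgroup` directory under the mount point) are hypothetical and not taken from the PR.

```shell
# Hypothetical guard (not part of the PR): fail loudly if the host's /sys
# does not appear to be bind-mounted at the expected location, instead of
# silently falling back to the container's own /sys.
#   $1: expected mount point, e.g. /hostsys
require_host_sysfs() {
    local mount_point="$1"
    if [ ! -d "${mount_point}/fs/cgroup" ]; then
        echo "ERROR: ${mount_point}/fs/cgroup not found; was the host's /sys bind-mounted?" >&2
        return 1
    fi
}

# Possible usage before reading any memory limit:
#   require_host_sysfs /hostsys || exit 1
```

Failing early keeps a missing bind-mount visible in the job output rather than letting a limit read from the wrong cgroup hierarchy pass unnoticed.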

Collaborator Author

OK, I've added the description now.

Collaborator

@trz42 trz42 left a comment

Thanks for the update. Makes sense that we use /hostsys and don't fall back to /sys.

@trz42 trz42 merged commit 8c3ba5d into EESSI:2023.06-software.eessi.io Aug 21, 2024
33 checks passed
@casparvl casparvl deleted the fix_memory_detection_testsuite_aws branch September 18, 2024 18:47