Permission denied errors when rebuilding (Python?) bundles with extensions that install binaries #556

Open
bedroge opened this issue Apr 30, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@bedroge
Collaborator

bedroge commented Apr 30, 2024

When trying to rebuild Python, hatchling, and Python-bundle-PyPI in #546, we ran into weird permission issues for both hatchling and Python-bundle-PyPI. It's not clear yet what's causing them, but they seem to occur for (Python) bundles whose extensions install files not only to lib but also to bin. The removal step seems to work fine and successfully removes the existing installation, but in the build phase the extension suddenly sees the old bin directory again (with read-only permissions) and fails with errors like:

Successfully built hatchling
Installing collected packages: hatchling
ERROR: Could not install packages due to an OSError.
Consider using the `--user` option or check the permissions.
Traceback (most recent call last):
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_internal/commands/install.py", line 449, in run
    installed = install_given_reqs(
                ^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_internal/req/__init__.py", line 72, in install_given_reqs
    requirement.install(
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_internal/req/req_install.py", line 800, in install
    install_wheel(
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_internal/operations/install/wheel.py", line 731, in install_wheel
    _install_wheel(
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_internal/operations/install/wheel.py", line 648, in _install_wheel
    generated_console_scripts = maker.make_multiple(scripts_to_generate)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_vendor/distlib/scripts.py", line 436, in make_multiple
    filenames.extend(self.make(specification, options))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_internal/operations/install/wheel.py", line 429, in make
    return super().make(specification, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_vendor/distlib/scripts.py", line 425, in make
    self._make_script(entry, filenames, options=options)
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_vendor/distlib/scripts.py", line 325, in _make_script
    self._write_script(scriptnames, shebang, script, filenames, ext)
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_vendor/distlib/scripts.py", line 293, in _write_script
    self._fileop.write_binary_file(outname, script_bytes)
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/pip/_vendor/distlib/util.py", line 555, in write_binary_file
    os.remove(path)
PermissionError: [Errno 13] Permission denied: '/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/hatchling/1.18.0-GCCcore-12.3.0/bin/hatchling'
 (at easybuild/tools/run.py:682 in parse_cmd_output)
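
The traceback ends in `os.remove`, and on POSIX systems unlinking a file requires write permission on the directory that contains it, not on the file itself. A minimal sketch for checking that from Python, assuming only the standard library. The helper name `parent_allows_unlink` is made up for illustration, and it inspects only the owner write bit, so it ignores ACLs, root privileges, and read-only mounts:

```python
import os
import stat


def parent_allows_unlink(path):
    """Return True if the directory containing `path` has the owner write bit set.

    Hypothetical helper: it only looks at mode bits, which is enough to spot
    a read-only bin directory like the one described in this issue.
    """
    parent = os.path.dirname(os.path.abspath(path))
    mode = os.stat(parent).st_mode
    return bool(mode & stat.S_IWUSR)
```

Calling it on the `bin/hatchling` path from the error message would be expected to return `False` if the old read-only `bin` directory has reappeared.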

I've tried a lot of possible workarounds in both #546 and #555:

  • add write permissions (chmod -R u+w) instead of removing the existing installation dir
  • add write permissions and remove the existing installation dir
  • also remove the parent dir (e.g. hatchling) instead of only removing the installation dir of the particular version
  • only remove the contents of the installation dir and not the dir itself
  • move the existing installation dir instead of removing it
  • wipe the CVMFS cache between the removal and build steps
  • use a newer fuse-overlayfs
  • unsetting EASYBUILD_READ_ONLY_INSTALLDIR before starting the build

None of them solved the issue, though. So, in the end, I opted for working around it by adding write permissions to the affected installation directories on the Stratum 0, and then the rebuilds completed successfully.
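
The first workaround in the list, and the Stratum 0 fix, both come down to adding owner write permission recursively. A rough Python equivalent of `chmod -R u+w`, as a sketch only: the function name `add_user_write` is hypothetical, and symlinks are skipped so that targets outside the tree are not touched.

```python
import os
import stat


def add_user_write(root):
    """Add the owner write bit to `root` and everything below it.

    Returns the list of paths whose mode was changed.
    """
    # Collect all paths first, so the walk is finished before any mode changes.
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    changed = []
    for path in paths:
        if os.path.islink(path):
            continue  # os.chmod would follow the link; leave link targets alone
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if not mode & stat.S_IWUSR:
            os.chmod(path, mode | stat.S_IWUSR)
            changed.append(path)
    return changed
```

Note that, per the list above, this did not help when applied inside the build container; it only resolved the rebuilds when done on the Stratum 0.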

@bedroge
Collaborator Author

bedroge commented Jun 7, 2024

We're now seeing the same issue for our EESSI-extend module (see #578), which is a bundle.

bedroge@x86-64-amd-zen3-node2 ~ $ ls -la /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/EESSI-extend/2023.06-easybuild/    
total 0
drwxrwxr-x 2 bedroge bedroge 42 Jun  7 11:54 .
drwxr-xr-x 3 bedroge bedroge 31 Jun  7 11:54 ..

# so it looks like the software was successfully removed, but the old easybuild subdir is somehow still there:

bedroge@x86-64-amd-zen3-node2 ~ $ ls -la /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/EESSI-extend/2023.06-easybuild/easybuild/
total 62
dr-xr-xr-x 3 bedroge bedroge  4096 May  7 13:41 .
drwxrwxr-x 2 bedroge bedroge    42 Jun  7 11:54 ..
-r--r--r-- 1 bedroge bedroge  2754 May  7 13:41 EESSI-extend-2023.06-easybuild-easybuild-devel
-r--r--r-- 1 bedroge bedroge  7976 May  7 13:28 EESSI-extend-2023.06-easybuild.eb
-r--r--r-- 1 bedroge bedroge 22875 May  7 13:41 easybuild-EESSI-extend-2023.06-20240507.134135.log.bz2
-rw-rw-r-- 1 bedroge bedroge 19051 May  7 13:41 easybuild-EESSI-extend-2023.06-20240507.134135_test_report.md
dr-xr-xr-x 4 bedroge bedroge  4096 May  7 13:41 reprod

I still don't have a clue why it only happens for some easyconfigs...

@bedroge
Collaborator Author

bedroge commented Oct 15, 2024

Also seeing similar issues for LAMMPS in #788. This one is not a bundle, but a CMakeMake easyblock.

== installing...
== ... (took 6 secs)
== FAILED: Installation ended unsuccessfully (build directory: /tmp/bot/easybuild/build/LAMMPS/2Aug2023_update2/foss-2023a-kokkos): build failed (first 300 chars): Failed to remove directory /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos even after 3 attempts.
Reasons: [OSError(39, 'Directory not empty'), OSError(39, 'Directory not empty'), OSError(39, 'Directory not empty')] (took 9 mins 59 secs)
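
One plausible reading of the repeated `OSError(39, 'Directory not empty')` is that a recursive removal deletes everything it can list and then fails on the final `rmdir`, because entries that were not listed still exist. The sketch below shows a generic retry loop of that kind. It is not EasyBuild's actual implementation, and the function name and parameters are assumptions made for illustration:

```python
import shutil
import time


def remove_dir_with_retries(path, attempts=3, wait=0.0):
    """Try to remove `path` up to `attempts` times.

    Returns the list of errors seen before success; raises OSError if every
    attempt fails.
    """
    errors = []
    for _ in range(attempts):
        try:
            shutil.rmtree(path)
            return errors
        except FileNotFoundError:
            return errors  # already gone, nothing left to do
        except OSError as err:
            errors.append(err)  # e.g. errno 39 (ENOTEMPTY) on Linux
            time.sleep(wait)
    raise OSError(f"Failed to remove directory {path} even after {attempts} attempts: {errors}")
```

If the directory contains entries that a listing does not return, retrying like this cannot succeed, which would match the three identical errors in the log.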

I've debugged this a bit interactively, but still can't find the issue. After the directory gets recreated in the build container, it looks empty at first, but when you do an ls on a file that previously existed, it suddenly shows up... The upper dir looks like this after the directory gets removed:

c---------. 1 bedroge bedroge 0, 0 15 okt 12:53 /tmp/eessi.ugqk5zhaDW/software.eessi.io/overlay-upper/versions/2023.06/software/linux/aarch64/generic/software/LAMMPS/29Aug2024-foss-2023b-kokkos

and like this when it's recreated:

crwx------. 1 bedroge bedroge 0, 0 15 okt 12:54 .wh..opq
-rwx------. 1 bedroge bedroge    0 15 okt 12:54 .wh..wh..opq

Might also be related to containers/fuse-overlayfs#324?
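
For reading upper-dir listings like the ones above: in overlay filesystems a character device with device number 0,0 marks a deleted ("whited-out") entry, and the `.wh.` prefix and `.wh..wh..opq` name come from the AUFS-style convention for whiteouts and opaque directories. The sketch below classifies an entry from its name, mode, and device number. The function name is hypothetical, and how fuse-overlayfs treats each case internally is not verified here:

```python
import stat

OPAQUE_MARKER = ".wh..wh..opq"  # AUFS-style "opaque directory" marker file
WHITEOUT_PREFIX = ".wh."        # AUFS-style whiteout file prefix


def classify_upper_entry(name, mode, rdev):
    """Classify an overlay upper-dir entry from its name, st_mode and st_rdev."""
    if stat.S_ISCHR(mode) and rdev == 0:
        return "whiteout (char device 0:0)"
    if name == OPAQUE_MARKER:
        return "opaque-directory marker"
    if name.startswith(WHITEOUT_PREFIX):
        return "whiteout (.wh. file)"
    return "regular entry"
```

On a real upper dir the inputs would come from `os.lstat(path)`, using its `st_mode` and `st_rdev` fields. Under this classification, the first listing is a whiteout for the whole installation directory, and the second shows a whiteout plus an opaque-directory marker inside the recreated directory.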

@bedroge
Copy link
Collaborator Author

bedroge commented Oct 15, 2024

I've also tried some workarounds here, for instance explicitly removing every individual file in the installation directory (using find ..... -exec rm ...), as suggested by @casparvl, but that didn't work either.

As an alternative I was thinking of bind mounting a host directory to the installation prefix in the container, but that doesn't seem possible on top of fuse-mounted CVMFS repos. The only way I could make it work is by also bind mounting /cvmfs itself from the host, and then bind mounting an empty host directory on top of it. Though it works, it would mean that we have to assume that /cvmfs is available on the host, which is probably not always true. But perhaps we could still consider implementing this, and only use it in case /cvmfs/software.eessi.io is available on the host. If not, we can fall back to using the fakeroot approach (and hope it works 😅 ). It will make the code in the build script quite complex, though.
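
The fallback logic described here could be as small as the sketch below. The function name and the two strategy labels are invented for illustration, and the default path is the repository mount point mentioned above; the actual bind-mount and fakeroot handling in the build script is not shown:

```python
import os


def choose_removal_strategy(host_repo="/cvmfs/software.eessi.io"):
    """Pick "bind-mount" if the CVMFS repo is available on the host, else "fakeroot"."""
    if os.path.isdir(host_repo):
        return "bind-mount"
    return "fakeroot"
```

With autofs-managed CVMFS the repository may only get mounted on first access, so a real check would probably need to access the path rather than just look for it in `/cvmfs`; `os.path.isdir` does access the path, but that behaviour is untested here.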

@boegel
Copy link
Contributor

boegel commented Oct 15, 2024

@bedroge This smells a lot like a bug in fuse-overlayfs, no?

Which version are we using, when have we last tried to update it?

@bedroge
Copy link
Collaborator Author

bedroge commented Oct 15, 2024

Forgot to mention it, but I also tried a newer version of fuse-overlayfs (1.14); that didn't help either.

It could be a bug, though we also have a bit of a complex setup here with read-only dirs in a fuse-mounted CVMFS repo, where we delete stuff with fakeroot, and then reuse the upper dir for a fuse-mounted writable overlay on top of it.

@bedroge
Copy link
Collaborator Author

bedroge commented Oct 15, 2024

And I really don't understand why we only see it sometimes. It worked fine for the GCC and OpenMPI rebuilds, for instance.

@boegel
Copy link
Contributor

boegel commented Oct 15, 2024

That complicates making a small reproducer to be able to report this to fuse-overlayfs upstream, of course... :(

Any luck with strace?

@bedroge
Copy link
Collaborator Author

bedroge commented Oct 18, 2024

Based on @casparvl's suggestion, I've been trying some other workarounds, and at some point it suddenly seemed to work when I removed all files and directories manually and individually. But when I tried to do the same thing with some smart find commands, I couldn't make it work anymore. Even worse, the manual procedure then suddenly didn't seem to work anymore either... After lots of attempts, I think I found out why: it seems to matter what you do in the second container session (i.e. the build step), in a very strange and unexpected way.

So I tried this on the EESSI-extend/2023.06-easybuild installation, which is known to fail for rebuilds.
Removal step, with --fakeroot:

Apptainer> rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild/

Then resume the container without --fakeroot and run:

Apptainer> ls /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild
Apptainer> ls /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild/easybuild/
EESSI-extend-2023.06-easybuild-easybuild-devel	easybuild-EESSI-extend-2023.06-20241009.203845.log.bz2	       reprod
EESSI-extend-2023.06-easybuild.eb		easybuild-EESSI-extend-2023.06-20241009.203845_test_report.md

This is the issue that we often see in PRs, where the installation prefix seems empty, but if you know what to look for, files are suddenly still there.

Now I redo the same steps in a clean container: the removal is done in the same way, and I just do one additional ls in the second step:

Apptainer> mkdir /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild
Apptainer> ls /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild/easybuild/
ls: cannot access '/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild/easybuild/': No such file or directory
Apptainer> ls /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild
Apptainer> ls /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild/easybuild/
ls: cannot access '/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EESSI-extend/2023.06-easybuild/easybuild/': No such file or directory

So, the issue only pops up when you first do an ls on the (empty) installation directory itself 🤯 😵‍💫
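
The symptom from the first session (a directory that lists as empty but still resolves known names) can be checked with a few lines of Python. This is a sketch: `find_hidden_entries` is a hypothetical name, and the candidate names have to be known in advance, e.g. `easybuild` or `bin`. It deliberately lists the directory first and looks up the names afterwards, which is the order that triggered the problem above:

```python
import os


def find_hidden_entries(directory, candidate_names):
    """Return the candidate names that resolve inside `directory` but are missing from its listing."""
    listed = set(os.listdir(directory))  # equivalent of the first `ls` on the directory
    hidden = []
    for name in candidate_names:
        # lexists() does a direct lookup of the name, like `ls` on the full path
        if name not in listed and os.path.lexists(os.path.join(directory, name)):
            hidden.append(name)
    return hidden
```

On a healthy filesystem this always returns an empty list; a non-empty result would reproduce the inconsistency shown in the session above.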

@bedroge
Collaborator Author

bedroge commented Oct 18, 2024

And though the fix seems easy, it may not be trivial. The affected subdir seems to differ per installation: sometimes it's easybuild, sometimes it's bin. We could dump a list of all subdirs of the installation prefix before actually removing them, and then do an ls on each of them in the build script, but I really don't like that 😅 I've tried several simpler workarounds, but haven't found one that works yet. For instance, doing the extra ls in the removal container doesn't solve it: it looks like it's solved, but whenever you recreate the removed directory, you have to pull the same trick again. 😞
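
The "dump all subdirs, then ls each of them" idea could look roughly like the sketch below. Both function names are hypothetical, and whether a plain lookup (`lstat`) is enough to avoid the problem, rather than a full `ls`, is an assumption that has not been tested in this thread:

```python
import os


def record_subdirs(prefix):
    """Before removal: return the sorted relative paths of all subdirectories of `prefix`."""
    subdirs = []
    for dirpath, dirnames, _ in os.walk(prefix):
        for name in dirnames:
            subdirs.append(os.path.relpath(os.path.join(dirpath, name), prefix))
    return sorted(subdirs)


def probe_subdirs(prefix, subdirs):
    """In the build step: look up each recorded subdir before anything lists `prefix`.

    Returns the recorded paths that (unexpectedly) still exist.
    """
    return [rel for rel in subdirs if os.path.lexists(os.path.join(prefix, rel))]
```

The list from `record_subdirs` would have to be stored somewhere that survives between the removal container and the build container, which is part of what makes this approach unattractive.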
