Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from SJLJ exceptions to DWARF-2 #23

Closed
vinriviere opened this issue Sep 4, 2023 · 85 comments
Closed

Switch from SJLJ exceptions to DWARF-2 #23

vinriviere opened this issue Sep 4, 2023 · 85 comments

Comments

@vinriviere
Copy link
Member

SJLJ exceptions slowly become obsolete, in favor to DWARF2 exceptions. m68k-elf and m68k-linux already use DWARF2 exceptions. So we should do the same in the new m68k-atari-mintelf toolchain.

It is stated there that DWARF2 .eh_frame section contains relocations to odd addresses:

/* Disable DWARF 2 frame unwind support for exceptions.
Currently, .eh_frame sections contain relocations at odd addresses. Bug? */
#define DWARF2_UNWIND_INFO 0

Is this really true? Why? Is there a way to avoid that?
I've just asked to the GCC mailing list. Let's see.
https://gcc.gnu.org/pipermail/gcc/2023-September/242414.html

@th-otto
Copy link
Contributor

th-otto commented Sep 4, 2023

To be more precise, problematic here are only relocations at odd addresses, not relocations to odd addresses. Relocations to odd addresses will happen also eg with char *array[] = ....

But something like

.byte 0xff
.long somevar

wont work.

@mikrosk
Copy link
Member

mikrosk commented Sep 4, 2023

Thanks for creating / noticing this issue. It would be very unfortunate if we ended up with yet another poorly supported, non-standard and soon-to-be-killed feature.

@vinriviere
Copy link
Member Author

To be more precise, problematic here are only relocations at odd addresses, not relocations to odd addresses.

@th-otto Very true. Thanks for the precision.

And we are lucky, because we got an answer from Andreas Schwab on the GCC mailing list:
https://gcc.gnu.org/pipermail/gcc/2023-September/242415.html

He proposes to configure GCC by with DW_EH_PE_aligned instead of DW_EH_PE_absptr in the ASM_PREFERRED_EH_DATA_FORMAT setting. I gave a quick try, and indeed, that solves the relocation alignment issue 😃 So DWARF-2 exceptions become not-impossible for the PRG/ELF format.

However, even configured that way, exceptions don't work out of the box. In my quick test, it failed as soon as the first exception was thrown. Full DWARF-2 exception support has some prerequisites. As I understand, it needs support for crtbegin.o/crtend.o. Maybe also .init/.fini sections. That would require coordinated support in binutils/gcc/MiNTLib.

In conclusion, adding DWARF-2 exception support is quite complex, and impact several components. Not the kind of thing you can easily turn on with a single option. We will have to do that to modernize the mint* targets. But as SJLJ exceptions are not yet completely obsolete, work well, and are and much simpler than DWARF-2 exceptions, I advise to keep SJLJ for now. So we can focus to the stabilization of binutils/gcc/gdb for the m68k-atari-mintelf target. Then we could start the DWARF-2 migration project in a second time.

Also, the switch from SJLJ to DWARF-2 exceptions will be an ABI change. So that will require a full rebuild of all the C++ libraries.

@vinriviere vinriviere changed the title Switch from SJLJ exceptions to DWARF2 Switch from SJLJ exceptions to DWARF-2 Sep 5, 2023
@vinriviere
Copy link
Member Author

vinriviere commented Sep 5, 2023

A good introduction to difference between SJLJ / DWARF-2 exceptions can be found here. There are pointers to more detailed information.
https://gcc.gnu.org/wiki/WindowsGCCImprovements

@th-otto
Copy link
Contributor

th-otto commented Sep 5, 2023

Hm, are you sure? I've read that message from Andreas already, and from what i understand, DW_EH_PE_aligned only has an influence about what values are written for relocations, and how they are interpreted when later read, but not where they are written.

We should maybe try to create a testcase first where the default setting of DW_EH_PE_absptr really produces relocations at odd addresses.

About crtbegin/crtend: i have to look, but i don't think that we need them. It's possible that unwind info has to be initialized somehow (in glibc that seems to be register_frame_info), but imho that can be done also at the same place as do_global_ctors_aux is called in libgcc.a. If more is needed for this, that would indeed be a bigger task.

@vinriviere
Copy link
Member Author

Hm, are you sure?

Well, I can just say that I encountered the odd-alignment issue during my early tests with PRG/ELF. Then yesterday, I looked at the m68k-elf relocations with my test program, and there was indeed relocations at odd addresses. Then I used DW_EH_PE_aligned with m68k-atari-mintelf, and there wasn't any odd relocation anymore. Maybe this was just by chance, no idea. Indeed, this should be checked with a proper testcase

@vinriviere
Copy link
Member Author

I've just added usage of crtbegin.o / crtend.o. That's a step forward.
GCC: 4e3a38a
binutils: freemint/m68k-atari-mint-binutils-gdb@40b0722

@th-otto
Copy link
Contributor

th-otto commented Sep 6, 2023

The patch to crtstuff.c looks wrong to me. CTOR_LIST/DTOR_LIST are already defined in libgcc2.c. A function to run the global constructors is also already defined there. Also all the the other stuff like __EH_FRAME_BEGIN__ would already be defined there.

That's why i meant that we currently don't not need crtbegin/crtend. They are only for shared libraries.

Changes to binutils are therefor also wrong. They must be usable by any compiler, and not rely on crtbegin/crtend being present.

@vinriviere
Copy link
Member Author

The patch to crtstuff.c looks wrong to me.

No. It works well, as expected.

CTOR_LIST/DTOR_LIST are already defined in libgcc2.c. A function to run the global constructors is also already defined there. Also all the the other stuff like __EH_FRAME_BEGIN__ would already be defined there.

No. They are used by libgcc2.c (compiled as __main.o), but defined elsewhere. In my initial mintelf patch (as well as in the a.out linker) they were defined in the linker script. Now they are defined in crtbegin.o / crtend.o, like any other modern target.

I must admit that, due to the many defines to handle all the special cases, GCC's crtstuff is a big mess.

That's why i meant that we currently don't not need crtbegin/crtend. They are only for shared libraries.

No. See above.

Changes to binutils are therefor also wrong. They must be usable by any compiler, and not rely on crtbegin/crtend being present.

No. They are good. You can even link an empty assembler file with -nostartfiles, and see that the link succeeds even without crtbegin.o / crtend.o. This is possible because the reference to those files is done with a wildcard like *crtbegin.o to make the dependency optional. I haven't invented that, that's just a copy/paste from m68kelf.

Be sure that I carefully test my patches before pushing. Did you test them, or my binaries, before declaring this was wrong?

BTW, if your real intent (that I must guess, as you don't report real-world issues) is to use the mintelf linker script with the a.out linker, that might indeed not work. But for a completely different reason. The original m68kelf linker uses a CONSTRUCTORS statement in the linker script. As it's an a.out-only thing (ignored by ELF), I didn't use it in the mintelf script. But it you really want to use that linker script with the a.out linker, just add that CONSTRUCTORS line to the external linker script m68kmintelf.x then it should work well.

@th-otto
Copy link
Contributor

th-otto commented Sep 6, 2023

No. They are used by libgcc2.c (compiled as __main.o), but defined elsewhere.

They are also defined there:

$ m68k-atari-mintelf-nm -A libgcc.a | grep CTOR
libgcc.a:__main.o:         U ___CTOR_LIST__
libgcc.a:_ctors.o:00000008 B ___CTOR_LIST__
$ m68k-atari-mintelf-nm -A libgcc.a | grep global_ctors
libgcc.a:__main.o:00000028 T ___do_global_ctors

Now they are defined in crtbegin.o / crtend.o, like any other modern target.

Your patches using __attribute__((used)) and disabling the error are also an indication that this was the wrong place to define them.

Be sure that I carefully test my patches before pushing. Did you test them, or my binaries, before declaring this was wrong?

No, i did not test them yet. But having the same definition in different files can't be right.

In any case, the function do_global_ctors_aux from crtstuff.c is not run at all. That would only be the case when we have support for an .init section.

BTW, if your real intent (that I must guess, as you don't report real-world issues) is to use the mintelf linker script with the a.out linker, that might indeed not work.

No, that would not work indeed. That toolchain uses different linker scripts for a.out and elf.

@th-otto
Copy link
Contributor

th-otto commented Sep 6, 2023

No, i did not test them yet.

Now i did, and it does not work. __CTOR_LIST__ and __DTOR__LIST__ are linked from libgcc.a(_ctors.o), not crtbegin.o. And constructors are not run.

(for testing i use

#include <stdio.h>
#include <mintbind.h>

void __attribute__((__constructor__)) cons(void)
{
	Cconws("C constructor\r\n");
}


void __attribute__((__destructor__)) dest(void)
{
	Cconws("C destructor\r\n");
}


int main(void)
{
	return 0;
}

It doesn't make much sense anyway. If we activate DWARF2 unwind info, then __EH_FRAME_BEGIN__ will also be defined in libgcc(__main.o).

@vinriviere
Copy link
Member Author

They are also defined there:

$ m68k-atari-mintelf-nm -A libgcc.a | grep CTOR
...
libgcc.a:_ctors.o:00000008 B ___CTOR_LIST__

This is a hack from the GCC team:
https://github.com/freemint/m68k-atari-mint-gcc/blob/gcc-13-mintelf/libgcc/libgcc2.c#L2434-L2455
The ___CTOR_LIST__ is defined in the BSS segment, just to satisfy the linker, as a fallback.
But in normal operation, that libgcc.a:_ctors.o should never be linked at all.

  • If you provide your own startup code, and no standard main(), there won't be any ___CTOR_LIST__ list at all : not used, not defined.
  • If you compile a normal program, with standard startup code and main(), ___CTOR_LIST__ will be provided by crtbegin.o, so libgcc.a:_ctors.o won't be used at all.
    So that libgcc.a:_ctors.o fallback should will only appear if someone uses the standard startup code and main(), but crtbegin.o / crtend.o are missing from the linker command line. And this shouldn't happen, as I've also updated the gcc specs.

Proof is that I successfully compiled and run both a standard C++ program, and also an empty .S file with -nostartfiles.

Your patches using __attribute__((used)) and disabling the error are also an indication that this was the wrong place to define them.

No. I added that __attribute__((used)) to silence a warning. This fix was already present on other variables. I added it on a separate line to make the patch easier to rebase.
In the same spirit, I also commented out the #error in that file.

I needed those workarounds because it's very uncommon to use crtbegin.o / crtend.o without support for the .init section. But that's a temporary situation. I will remove those workarounds as soon as the .init section is usable in mintelf. Then we won't use any uncommon configuration at all. Just progress step by step.

In any case, the function do_global_ctors_aux from crtstuff.c is not run at all. That would only be the case when we have support for an .init section.

Yes. That will be a further step.

Now i did, and it does not work. CTOR_LIST and DTOR__LIST are linked from libgcc.a(_ctors.o), not crtbegin.o. And constructors are not run.

This is unexpected. Did you also update GCC? I have changed the specs to provide crtbegin.o and crtend.o on the command line. Particularly: m68k-atari-mintelf-gcc -dumpspecs, look for startfile and endfile.

Both gcc and binutils must be updated at the same time. I've explicitly added a constraint in the Ubuntu packages.

@th-otto
Copy link
Contributor

th-otto commented Sep 6, 2023

This is unexpected. Did you also update GCC?

Yes, but i failed to realize that one of the patches did not apply cleanly, and the line that changed the STARTFILE_SPEC was rejected, so crtbegin.o was not used in the link. Apologies.

The #error only triggers because we have neither an .init nor an .init_array, but only need the definitions for the constructors. Apparently the source was not prepared for that, although exactly such a scenario is mentioned in libgcc2.c.

After i fixed that, constructors do work now. So in general, using crtbegin/crtend seems to be ok, but i still did a few more changes (in my own branch only for now), that makes sure that only one of the two __CTOR_LIST__ is compiled. In fact, using EH_FRAME_LIST and its end symbol only seem to work when using crtbegin, since no attempt is made in the linker script to link them in the right order (for ctors, that is done by linking the .ctors section from crtbegin first, then all others).

I've also protected all the differences by ifdefs now. Maybe not strictly neccessary for the elf toolchain, but makes it easier for me to sync the sources.

Now i have to do the same checks again, using the a.out toolchain. Sigh.

@th-otto
Copy link
Contributor

th-otto commented Sep 7, 2023

I've just pushed a patch that makes it possible to enable dwarf2 exceptions at configure time (default is still sjlj). It also changes the ASM_PREFERRED_EH_DATA_FORMAT as suggested by Andreas. Then i did some tests, using this sample:

#include <stdio.h>

  class exception
  {
  public:
    exception() throw() { }
    virtual ~exception() throw();
    virtual const char* what() const throw();
  };

exception::~exception() throw() { }
const char* exception::what() const throw() { return "hello"; }



void func_exception(int i)
{
	printf("in %s\n", __func__);
	if (i == 5)
		throw exception();
}

int main(void)
{
	int i;
	
	printf("Hello, C++\n");
	try {
		for (i = 0; i < 10; i++)
			func_exception(i);
	} catch (const exception &e)
	{
		printf("got exception %d %s\n", i, e.what());
	}
	return 0;
}

With the old setting of ASM_PREFERRED_EH_DATA_FORMAT = DW_EH_PE_absptr, i get a link error because of an relocation at an odd address. With the changed setting, i don't get that error, but i get a message

libstdc++.a(eh_terminate.o)(.eh_frame); no .eh_frame_hdr table will be created

instead. The link still succeeds with error status 0. If i run the program, the exception is not catched, and the program exits with a exit value of 1536 instead.

There are also other strange differences. With the old setting, the generated assembler source contains .cfi_startproc, .cfi_endproc and .cfi_offset statements, even without -g. It also does not have any .eh_frame section at all.

@th-otto
Copy link
Contributor

th-otto commented Sep 7, 2023

I found the cause for the link error. In my fork i don't include m68kelf.h, and for ASM_OUTPUT_ALIGNED the definition from m68k.h was used, which just emits .even. But for the dwarf2 and exception tables sometimes an alignment to a 4-byte address is needed. I think this branch is not affected by this, but it has be taken into account when inclusion of m68kelf.h is removed here, too (which is still not done, but needed to avoid the other ABI changes).

The exit-code 1536 is from an abort() call in unwind-pe.h in read_encoded_value_with_base(), so there still seems to be something wrong in the generated tables.

@vinriviere
Copy link
Member Author

With the changed setting, i don't get that error, but i get a message

libstdc++.a(eh_terminate.o)(.eh_frame); no .eh_frame_hdr table will be created

I don't remember having seen any warning. And certainly not an error, as the link succeeded. But at runtime, exceptions didn't work as expected. I don't remember actual message, something like invalid system call, and maybe a bus error at address 0 in the ARAnyM's log. Not completely surprising, as the startup code was incomplete.

I found the cause for the link error. In my fork i don't include m68kelf.h, and for ASM_OUTPUT_ALIGNED

Ah. Fine.

FYI, I'm preparing support for .init_array / .fini_array in binutils / gcc / mintlib. It's really straightforward, and that's the modern way to go. We will see if that's enough to get DWARF-2 exceptions working, or if there is yet another issue behind.

@th-otto
Copy link
Contributor

th-otto commented Sep 7, 2023

What for? Those arrays just contain the same pointer list as __CTOR_LIST__/__DTOR_LIST. Only difference is that mintlib will be responsible to call them, and that can only be achieved by changing the startup code, and making it much more complicated. As long as we don't have shared libraries, we don't need them.

@vinriviere
Copy link
Member Author

.init_array / .fini_array is the modern way to go. That's just another way to add hooks at startup. Concretely, crtinit.o use it to initialize the .eh_frame section.

@th-otto
Copy link
Contributor

th-otto commented Sep 7, 2023

??? That section is readonly, what do you want to initialize there? .init_array has nothing to do with exception handling.

And yes, it is modern, because needed for shared libs. Other than that, it is exactly the same as all the .ctors sections, just without the leading and trailing elements.

@vinriviere
Copy link
Member Author

FYI, I'm preparing support for .init_array / .fini_array in binutils / gcc / mintlib.

Done, I've pushed my work for the above (NOT the DWARF-2 experiments).
binutils: freemint/m68k-atari-mint-binutils-gdb@13147f4
gcc: 44322d4
mintlib: freemint/mintlib@40afd69

Ubuntu binaries are currently building.

As expected, now constructors use the .init_array section (instead of .ctors) and destructors use the .fini_array section (instead of .dtors). Also, the main() function in user programs doesn't implicitly call __main() anymore at the beginning. Any program expecting .init_array/.fini_array sections, starting by GCC itself, will work out of the box.

Key point is that modern ELF systems use .init_array/.fini_array for intitialization. For dynamic libraries, yes, but also for static libraries and executables. Combined with crtbegin.o/crtend.o support, the mintelf target is now closer to the Linux behaviour. So regarding to GCC's mess of #ifdef's, we now use a much more standard ELF configuration, keeping away from obscure and poorly tested code paths. Finally, that will require less patches. Some of them could already be removed. So that's a step forward.

Anyway, my initial motivation for this .init_array stuff was the ability to easily enable DWARF-2 exceptions and their .eh_frame stuff. Combined with @th-otto's --disable-sjlj-exceptions, it's very convenient. Everything compiles well and produce good-looking binaries.

Alas, it still fails like previously. As soon as the program throws an exception, it displays Bad System Call and exits immediately. I can see BAD SYSTEM CALL: User PC=0 in the ARAnyM log, so I guess that's something related to a NULL function pointer. I will have a closer look to what happens during initialization. Our brand new gdb will be of great help.

@th-otto
Copy link
Contributor

th-otto commented Sep 8, 2023

I still don't see why this should put us forward. Especially it is a step backward regarding exception handling, since the initial call to __register_frame_info

#ifdef __LIBGCC_EH_FRAME_SECTION_NAME__
{
static struct object object;
__register_frame_info (__EH_FRAME_BEGIN__, &object);
}
#endif
is now not done anymore.

It is also now impossible to omit that additional code when you know that you don't need it, by providing a dummy __main() function. This is done by several small tools, and often also in projects that use libcmini.

I also don't see that this saves any patches. Quite contrary, it needs additional patches in other projects.

MiNT is not linux, and we don't should not pretend to be like it. And btw, init_array is not used at all by linux, which uses the .init section instead.

Apart from that, your patch in gcc is misplaced, where it is only done for cross-compilers. If at all, it must go into config.gcc.

But really, i would ask to revert that. It is not going to help us in any way.

@th-otto
Copy link
Contributor

th-otto commented Sep 8, 2023

Well my above comments where not completely right, there is an entry made into the .init_array that calls frame_dummy which in turn calls __register_frame_info. But there are still problems with this approach. First off, it also calls register_tm_clone, which in turn generates references to _ITM_registerTMCloneTable from libitm.a, something that is definitely not needed, and the library does not even exist.

And it will also pull in all the exception handling stuff, even for C-compiled code that does not need it.

@th-otto
Copy link
Contributor

th-otto commented Sep 8, 2023

I've pushed fixes for this to gcc and mintlib now. But i'm still not happy with this. Amongst others: in your previous commit we got rid of the __CTOR_LIST__ etc. symbols in the linker script. But now we add back __init_array_start etc. again. Bad thing about this is, that those symbol names depend on the compiler generating an underscore or not. And those linker scripts are the only place where this is the case, all other tools are independent of this.

The binutils should be independent of the compiler. We might be using different compilers lke gcc-4.6.4, gcc-7, gcc-13 etc, but all use the same binutils. One solution to this would be to PROVIDE those symbols in both flavours.

@vinriviere
Copy link
Member Author

Well my above comments where not completely right, there is an entry made into the .init_array that calls frame_dummy which in turn calls __register_frame_info.

This was my initial motivation.

But there are still problems with this approach. First off, it also calls register_tm_clone, which in turn generates references to _ITM_registerTMCloneTable from libitm.a, something that is definitely not needed, and the library does not even exist.

Indeed, I noticed that TM clone stuff in the sources. I saw on the web this was related to threading/atomic stuff, so most likely useless for our platform. It seems that most embedded platflorms use --disable-tm-clone-registry, so that should also be a solution for us.

And it will also pull in all the exception handling stuff, even for C-compiled code that does not need it.

Ah, this is very bad. BTW, I wonder why this is different from the __main() solution. I would have said that such difference could come when switching from SJLJ to DWARF-2.

Anyway, I haven't strong opinions. If, after in-depth tests, it appears that the old method is clearly better, then go for it. We have seen that it's easy to configure gcc for one or other method. However, I'm keen to keep the new method for some time, to give us the time familiarize with the new method, and see precisely what is better or worse. When time comes to make a final choice, we will know why, regarding to pros and cons.

@th-otto
Copy link
Contributor

th-otto commented Sep 8, 2023

Ah, this is very bad. BTW, I wonder why this is different from the __main() solution.

In libgcc2.c that function is also called, but only when the target defines EH_FRAME_SECTION_NAME, which is not the case.

to give us the time familiarize with the new method

Yes, i'll keep that for now. Switching the methods is a pain, because you have to rebuild & install all of binutils, mintlib and gcc. But whatever method is choosen, we'll have to find a way to move the call to register_frame_info to a place where it is only used by the g++ driver. Maybe by adding an object with a static constructor to libcstd++.a, but we have to make sure it is the first one being called.

@vinriviere
Copy link
Member Author

Amongst others: in your previous commit we got rid of the __CTOR_LIST__ etc. symbols in the linker script.

Yes, it's the benefit of using crtbegin.o.

But now we add back __init_array_start etc. again. Bad thing about this is, that those symbol names depend on the compiler generating an underscore or not. And those linker scripts are the only place where this is the case, all other tools are independent of this.

True. And if you want my 2 cents, the GNU people should also have used crtbegin.o/crtend.o for those __init_array_start labels.

The binutils should be independent of the compiler. We might be using different compilers lke gcc-4.6.4, gcc-7, gcc-13 etc, but all use the same binutils.

Using the same ld for different gcc versions, yes. But I wouldn't have imagined using the same ld binary for both the mint and mintelf gcc. Or you mean using the same ld for ELF-underscore and ELF-not-underscore variants? As a big fan of minimalistic patches, I wouldn't have imagined that.

One solution to this would be to PROVIDE those symbols in both flavours.

Yes, that's a solution. Or provide aliases in C files, like EmuTOS.

If you want other ideas: the linker script is generated by a shell script called genscripts.sh. The standard ELF template is completely unreadable, due to the countless variants. This is why I chose to write a new template for scratch for PRG/ELF. But it might be easy (or not) to instantiate that template twice, with underscore or not. Then using some kind of parameter or environment variable to to switch between a linker script or the other.

Maybe by adding an object with a static constructor to libcstd++.a, but we have to make sure it is the first one being called.

No problem. The .init_array scheme has support for priorities. And if that's not enough, there is also .preinit_array.

@th-otto
Copy link
Contributor

th-otto commented Sep 8, 2023

But I wouldn't have imagined using the same ld binary for both the mint and mintelf gcc.

That actually might work, if you tell m68k-atari-mintelf to use the m68kmint (old a.out-mintprg) emulation. But that would be strange.

Or you mean using the same ld for ELF-underscore and ELF-not-underscore variants?

Yes, this is what i had in mind. Not that i'm a fan of supporting both (we still have to finally decide which one we use), we certainly don't want to confuse people. But it might take some time until we have backported the change to old compilers. And that should certainly be done for atleast 4.6.4 (for EmuTOS) and 7.5.0 (currently used for freemint).
Certainly you'll need binutils >= 2.41 when you want to make use of the new format, but that should then also be usable by other compilers.

As a big fan of minimalistic patches, I wouldn't have imagined that.

Its only a few lines more in the linker script, which is completely new anyway. No other patches to existing code.

Yes, that's a solution. Or provide aliases in C files, like EmuTOS.

In Emutos that was only needed for functions from libgcc. But it would be strange to require such "hacks" in every project. In fact, in most cases only a few symbols are affected, the newly introduced __init_array etc that are now referenced by mintlib, and etext() that was already referenced before.

But it might be easy (or not) to instantiate that template twice, with underscore or not. Then using some kind of parameter or environment variable to to switch between a linker script or the other.

That would require different, non-standard compiler switches depending on which compiler is used. Not a real option imho, and difficult to understand by casual users why this is needed.

@th-otto
Copy link
Contributor

th-otto commented Sep 8, 2023

Our brand new gdb will be of great help.

I've tracked it down to _Unwind_Find_FDE returning a NULL pointer, which should not happen. I also wonder whether code like

static inline unsigned int
read_4u (const void *p) { const union unaligned *up = p; return up->u4; }
will eventually crash with address-error on 68k.

Also dubios:

static inline unsigned long
read_8s (const void *p) { const union unaligned *up = p; return up->s8; }
that assumes that 8 bytes fit into an unsigned long.

@vinriviere
Copy link
Member Author

Seems to be fixed by freemint/m68k-atari-mint-binutils-gdb@90d470c in the linker. But further tests should be done.

Excellent job, @th-otto 😃 This immediately fixes my general testcase.

I'm in the mood of definitely enabling --disable-sjlj-exceptions in my toolchain.

@mikrosk
Copy link
Member

mikrosk commented Oct 2, 2023

I have been using it for a week now without something suspicious but my tests have been very limited so far.

@vinriviere
Copy link
Member Author

vinriviere commented Oct 2, 2023

Done. I've rebuilt all my mintelf binaries with current sources, and used --disable-sjlj-exceptions. I plan to keep that setting, unless we encounter a blocking issue.

BTH, there is still a bug with DWARF-2 exceptions. Basic testcase.

  • Run with ARAnyM + FreeMiNT: OK
  • Run with Hatari or Steem + plain TOS: OK
  • Run with Hatari or Steem + plain TOS + MonST2: fails with trap #7 on throw!

We know that trap #7 is generally the result of gcc_assert(), as I haven't used --with-sysroot yet (that will come).

Why is there a difference with or without MonST2? Is that trap #7 just ignored by plain TOS or ARAnyM+FreeMiNT? Or is some code path different?
Again, I will have to add traces. Sigh. It's crazy how that feature is full of trouble.

@vinriviere
Copy link
Member Author

vinriviere commented Oct 2, 2023

New tests:

  • Hatari + plain TOS 1.62 + MonST2: fails with trap #7
  • But same with EmuTOS works well!!

How possible?

@th-otto
Copy link
Contributor

th-otto commented Oct 2, 2023

I plan to keep that setting, unless we encounter a blocking issue.

Yes, i plan to use that setting too. If some problems come up, we have to fix it, but that way more users will test it.

Is that trap #7 just ignored by plain TOS or ARAnyM+FreeMiNT?

TOS atleast sets all vectors to an rte instruction upon boot, before initializing specific vectors like trap #13 etc.

Maybe MonST2 overwrites that vector to catch errors? It certainly does so for buserror etc.

As a first step, i would configure gcc with --sysroot. That way, you won't have to worry about trap #7, and gcc_assert() calls abort instead.

@vinriviere
Copy link
Member Author

I finally compiled gcc with --with-sysroot (not published). The test program still works when run from TOS 1.62 desktop. With MONST2, as soon as the exception is thrown, destructors are called in an infinite loop. It's crazy.
TBH, I preferred the trap #7. At least, it failed cleanly.

What's the difference between MONST2 and without it? Alignment? Relocation? TPA zero-initialization? Vectors initialization? Or just a MONST2 bug? Sigh.

@vinriviere
Copy link
Member Author

At least, I can use the MonST2 debugger. I see that abort() is hit at the end of read_encoded_value_with_base(), certainly from __gxx_abort().
And that reminds me that big 7562d09 commit without know testcase. Coincidence?

@vinriviere
Copy link
Member Author

@th-otto I've reverted the 7562d09 patch, and now everything works perfectly well. With or without MonST2. With Steem and its 68000 emulation, where unaligned access should cause address errors. I suggest you to reconsider your patch. I would still be interested to see a testcase justifying its need.

@vinriviere
Copy link
Member Author

BTW, generally speaking, the evil __attribute__ ((packed)) has 2 effects:

  • it removes any struct padding.
  • it allows the struct itself to be misaligned.

Consequence of the second point: all accesses to members, including short and long types, are forced to use a series of move.b. So packed structs are evil, because they are slow, even when properly aligned.

As illustration, here is a testcase:

union unal
{
    long val;
} __attribute__ ((packed));

union unal g;

void f(void)
{
    g.val = 0x12345678;
}
$ m68k-atari-mintelf-gcc -S unal.c -o - -Os
#NO_APP
        .file   "unal.c"
        .text
        .align  2
        .globl  f
        .type   f, @function
f:
        move.b #18,g
        move.b #52,g+1
        move.b #86,g+2
        move.b #120,g+3
        rts
        .size   f, .-f
        .globl  g
        .section        .bss
        .type   g, @object
        .size   g, 4
g:
        .zero   4
        .ident  "GCC: (GCC MiNT ELF 20230910) 13.2.0"

So if the goal of 7562d09 was to avoid unaligned access, it's useless because __attribute__ ((packed)) already does the job.

@th-otto
Copy link
Contributor

th-otto commented Oct 3, 2023

@th-otto I've reverted the 7562d09 patch, and now everything works perfectly well

Uh oh. That's bad. I'll have a look.

it's useless because attribute ((packed)) already does the job.

For 68020+ its not useless, because it avoids the slowness of byte moves.

So packed structs are evil, because they are slow, even when properly aligned.

Yes, but in this case we cannot avoid using packed structs. They are defined that way in the dwarf-2 spec, and access to them must match how they are generated by the compiler.

@th-otto
Copy link
Contributor

th-otto commented Oct 3, 2023

There are a few things that irritate me about that issue

  • when i compile a throwing test for -m68000, it works as expected. How is it possible that this problem only occurs when running inside MonST2?
  • i can't see any obvious error in the patch. However i don't have a running environment Monst2.
  • your code sample above uses byte moves even when compiling for m68020. TARGET_STRICT_ALIGNMENT is defined as ((target_flags & MASK_STRICT_ALIGNMENT) != 0), but that flag is only unset for coldfire:
    if (!TARGET_COLDFIRE)
    target_mask |= MASK_STRICT_ALIGNMENT;

So at least, the patch currently does not help for 68020, only for coldfire.

@mikrosk
Copy link
Member

mikrosk commented Oct 3, 2023

I've reverted the 7562d09 patch, and now everything works perfectly well.

Oh but you didn't push it. I should have checked before starting recompilation. :)

I'm going to revert & push it then until we have a 100% working version.

mikrosk added a commit that referenced this issue Oct 3, 2023
This reverts commit 7562d09.

There are some issues to be investigated: #23 (comment)
@vinriviere
Copy link
Member Author

@th-otto Indeed, we are far from having explained anything. Yesterday, I tried to revert the patch, as a desperate attempt, and it worked immediately. Before that (with patch enabled) I experienced some randomness, as it worked fine from desktop but failed with MonST2. I'm not aware of anything that could cause simple user programs to behave differently from desktop or inside the debugger. One possible cause could be uninitialized variables, causing randomness. In that case, fact that the reversion of the patch solved the issue could be pure chance. Real cause could very well be something else. I will have a look again.

@vinriviere
Copy link
Member Author

I'm back and kicking.

@mikrosk I see that you've reverted and pushed @th-otto's "unaligned access" patch on 03/10/2023. That's fine, as that patch doesn't seem to fix any real-world issue. Less patches we have, less is the probability we introduce new issues. So let's keep away from that, at least for now.

I've rebuilt everything again (pushed on my Ubuntu PPA ELF). But even without the above patch, the MonST2 issue is back! I still haven't added proper --with-sysroot to my official packages. But finally, for debugging I prefer that because it causes a clean trap #7 fault instead of a bogus error loop. Of course, after this bug is solved, I will use --with-sysroot to get a clean abort(), but for now I prefer to keep the trap #7.

I made 3 tests:

  • run testcase from TOS 1.62 desktop: works well.
  • run from Omikron Basic 3.01 "exec a.tos": fails with nice 39 bombs on first throw. This is the expected default behaviour for trap #7.
  • run from MonST2: fails with trap #7. If I revert to the default trap #7 handler (namely, reset $9c to $27e00ca2), I get the 39 bombs instead, as expected.

So there's something new: Omikron Basic behaves just like MonST2. It fails. While it works well from the desktop.

Definitely, there is some randomness. I can only imagine:

  • something related to alignment. Basepage alignment may be different depending on which program runs the testcase.
  • some uninitialized variable. Specially, the unused heap space after the BSS segment, not initialized when the FASTLOAD bit is set. I will check that.

Anyway, I'm determined to find the root cause of that random issue. Whatever it is, I will find it. Using debuggers, traces, or whatever possible.

@vinriviere
Copy link
Member Author

The problem is here:

if (encoding == DW_EH_PE_aligned)
{
_Unwind_Internal_Ptr a = (_Unwind_Internal_Ptr) p;
a = (a + sizeof (void *) - 1) & - sizeof(void *);
result = *(_Unwind_Internal_Ptr *) a;
p = (const unsigned char *) (_Unwind_Internal_Ptr) (a + sizeof (void *));
}

The DW_EH_PE_aligned code assumes that 4-byte alignment is respected. But TOS 1.x only honors 2-byte alignment.
By chance, in my case, the desktop loaded the test program at 4-byte alignment. This is why it worked.
But, also by chance, MonST2 loaded it at an address multiple of 2 but not not multiple of 4. This is why it failed.

I added a quick and dirty fix to deal with that wrong alignment at runtime. It seems to work on first hit, but crashes a bit further. I will continue investigation.

But now we know why we have more trouble than other m68k targets:

  • TOS doesn't support relocations at odd addresses. So the uncommon DW_EH_PE_aligned needs to to be used.
  • TOS 1.x doesn't respect 4-byte alignment, while DW_EH_PE_aligned requires it.

It's incredible to see that, whenever our runtime environment is different from UNIX or embedded standards, we get into major trouble.

@th-otto
Copy link
Contributor

th-otto commented Oct 19, 2023

Great find. I saw that code already, too bad i didn't realize that it could cause trouble.

But instead of fixing anything at runtime, maybe you can try the attached patch. It fixed the code above, and the counterpart where that alignment is generated.

pe_aligned.patch.txt

@vinriviere
Copy link
Member Author

I added a quick and dirty fix to deal with that wrong alignment at runtime. It seems to work on first hit, but crashes a bit further. I will continue investigation.

Ah, I've been severely fooled 😩 Fact is that unwind-pe.h is compiled twice:

  • once in libgcc.a (with clear-text name)
  • once in libstdc++.a (with mangled name)

After having added my patch, I only rebuilt libgcc.a, but not libstdc++.a. So my test program was only half-baked. After rebuilding libstdc++.a, it works perfectly, even when the program is loaded at a not long-aligned address.

For memory, here is my simple runtime patch. Could be refined, but perfectly working.
runtime.patch

We also have @th-otto's solution to force GCC to produce 2-byte aligned .eh_frame. I haven't tried that patch yet, but that's also a good solution. I also considered it while writing my simple runtime patch, as an alternate solution.

So we have 2 potential solutions:

  1. @th-otto's proposed solution: the compile-time patch.
    Pros:
    - optimal size
    - optimal speed at runtime
    Cons:
    - nonstandard DW_EH_PE_aligned packing (2-byte aligned instead of 4-byte). So our .eh_frame sections are nonstandard.
    - the patch is a bit complicated, and I fear to get trouble in future gcc updates.

  2. @vinriviere's proposed solution: the run-time patch.
    Pros:
    - use standard DW_EH_PE_aligned packing
    - extremely simple
    Cons:
    - use 4-byte aligned .eh_frame, while 2 bytes would be enough
    - insignificant performance loss

So what to do? Basically:

  • use nonstandard DW_EH_PE_aligned packing, with drawbacks of nonstandard structure and more complicated patch?
  • or simple runtime patch, with bigger (but standard) structure, and insignificant performance loss?

Note that the difference is an ABI change. So if that changes, everything has to be rebuilt again. I don't mind, but this has to be taken in consideration for the future.

Another option would probably be to introduce a new nonstandard 2-byte alignment define such as DW_EH_PE_aligned_16 to avoid breaking compatibility with DW_EH_PE_aligned, but that would be overkill.

@th-otto
Copy link
Contributor

th-otto commented Oct 22, 2023

We also have @th-otto's solution to force GCC to produce 2-byte aligned .eh_frame. I haven't tried that patch yet, but that's also a good solution. I also considered it while writing my simple runtime patch, as an alternate solution.

Would be nice if you could try it. Your patch has the problem that it generates references to _base. That is not very clean, since libgcc.a is supposed to be handled last, and should not have references to the c-library (yes i know, this is already the case because of atexit used somewhere else). But it will cause trouble when you try to link with --nostartfiles and/or --nostdlib.

   - the patch is a bit complicated, and I fear to get trouble in future gcc updates.

Huh? Whats so complicated about this?

Note that the difference is an ABI change. So if that changes, everything has to be rebuilt again. I don't mind, but this has to be taken in consideration for the future.

Yes, that is indeed the case. But only for an ABI that did not exist until now ;) And it is an ABI that is solely used by gcc itself, and not directly accessible to applications.

Another option would probably be to introduce a new nonstandard 2-byte alignment define such as DW_EH_PE_aligned_16 to avoid breaking compatibility with DW_EH_PE_aligned, but that would be overkill.

That's also possible, but it does not make the patch simpler. It would infact make it more complicated, because you then have to change some other locations where DW_EH_PE_aligned is checked.

@vinriviere
Copy link
Member Author

First of all, I have no strong opinion between the 2 solutions. But it would be better to do the best choice right now, if there is one, by examining the pros and cons.

But instead of fixing anything at runtime, maybe you can try the attached patch. It fixed the code above, and the counterpart where that alignment is generated.

pe_aligned.patch.txt

The patch doesn't apply to the GCC 13 branch. One reason is that gcc/dwarf2out.c doesn't exist anymore (was renamed to dwarf2out.cc).

Anyway, I slightly adjusted the patch to make it compile.

Unfortunately, it doesn't work. The test program fails with a Bus Error due to an invalid pointer (address 0x0e IIRC), even when run from desktop. So that's independent of the program's alignment.

However, I still have good news. The patch works partially. I traced read_encoded_value_with_base() and I saw that the pointer was aligned at a multiple of 2, as expected, and that the value was always read read correctly. So the alignment encoding/decoding works. But it fails somewhere later. This time, I have recompiled the whole gcc, so I expect my test to be valid (even if I could have been fooled again).

So there is something else to fix.
NB: This isn't an argument against that solution. I'm sure it will finally work after debugging.

Huh? Whats so complicated about this?

Look at the number of files and lines modified by the patch, and you will have a clue.

Concretely, I'm worried about patches which changes several places over the code. It's easy to forget some places (could even be the current issue - or not). If the GCC people add new code about that in newer GCC versions, we might again miss it, causing again hidden issues hard to debug, etc. This is really what worries me. regarding to the mint* situation, I would prefer simple patches unlikely to cause trouble in the future. If we are confident enough that any patch won't cause trouble in the future, then go for it.

Yes, that is indeed the case. But only for an ABI that did not exist until now ;) And it is an ABI that is solely used by gcc itself, and not directly accessible to applications.

I wonder if that .eh_frame ABI is also used by gdb. I have no clue, just an interrogation.

Your patch has the problem that it generates references to _base. That is not very clean, since libgcc.a is supposed to be handled last, and should not have references to the c-library (yes i know, this is already the case because of atexit used somewhere else). But it will cause trouble when you try to link with --nostartfiles and/or --nostdlib.

Yes, I know that. That was a quick and dirty patch, as proof of concept. I used _base as it was easy. To avoid that dependency, it would be possible to define a dummy variable in that same file, aligned on 4 bytes. And use it instead of _base. Another layer of uglyness, for the sake of avoiding a dependency.

Anyway @th-otto, I suggest you to test your own patch. You should see a Bus Error when running a testcase, whenever the program is long-aligned or not.
For non-aligned test, I use MonST2 on TOS 1.62. By chance, it always misaligns my test program. But I guess that any TOS/EmuTOS with patched Pexec would also do that reliably.

@vinriviere
Copy link
Member Author

The test program fails with a Bus Error due to an invalid pointer

By thinking twice, my test may not have been accurate. This is because such compile-time patch is an ABI change. So to test it accurately, everything must be recompiled, including all the libs.

To ease testing gcc patches, I only recompile gcc. Then I use a shell script that transparently redirects gcc/g++ to the newly compiled xgcc/xg++ with proper -B options. This is perfectly valid as long as gcc/libgcc are the only affected components. But as said earlier, unwind-pe.h patches affect both libgcc and libstdc++. I spent more than a few days trying to understand what was wrong with my patch. Finally, the patch was perfectly correct, but the old libstdc++.a was still used, causing incomprehensible issue.

As a rule of thumb, we must be very careful when testing gcc patches, specially those involving ABI changes. It's easy to get fooled if everything hasn't been recompiled with the new compiler.

@th-otto
Copy link
Contributor

th-otto commented Oct 23, 2023

The patch doesn't apply to the GCC 13 branch.

Apologies. Looks i did the changes in the gcc-7 branch.

I'm worried about patches which changes several places over the code.

The patches changes 3 files. One of them is mint.h, which will certainly not be changed by GNU people because it isn't even in the distribution. Another is unwind-pe.h, patching the same location as your patch. And (except for the changed filename) the patch can still be used for gcc-13, it applies cleanly only at some different line numbers.
Apart from that, your patch is simpler because you patch the source unconditionally, without checking for MiNT. My patch could have been also much simpler if i don't care about that. But i wanted to do it "right", in the hope that it might be acceptable in the upstream repo.

it would be possible to define a dummy variable in that same file, aligned on 4 bytes.

... unless gcc is too smart and optimizes away (addr & 3) when it "knows" that the variable is aligned on 4 bytes ;) And indeed, it does optimize it, so that variable can't be in the same source file atleast:

long x __attribute__((aligned(4)));

int test(void)
{
	if ((((long)&x) & 3) == 0)
		return 0;
	return 1;
}

results in

        .type   test, @function
test:
        moveq #0,%d0
        rts

I only recompile gcc. Then I use a shell script that transparently redirects gcc/g++ to the newly compiled xgcc/xg++ with proper -B options.

Its dangerous to do such things. I once did similar things, but you will sometimes shoot yourself in the foot. The build process it quite complex, and it is sometimes not even safe to re-run "make" in the build tree after changing source-files, depending on what components have already been compiled. I have given up on that, and do the "make install" instead into 2 different directories, one empty to generate the tarball, and other which is in on my $PATH so i can immediately invoke it.

But as said earlier, unwind-pe.h patches affect both libgcc and libstdc++

I have no idea why that header file is used in libstdc++, but that library has to be recompiled anyway, because of the different layout of the generated .eh_frame structures.

You should see a Bus Error when running a testcase, whenever the program is long-aligned or not.

I don't get a bus-error, but a call to abort() instead. Have to check where that comes from.

@th-otto
Copy link
Contributor

th-otto commented Oct 24, 2023

Problem with my patch seems to be the linker, which also needs to know how many bytes to skip for DW_EH_PE_aligned:
https://github.com/freemint/m68k-atari-mint-binutils-gdb/blob/90d470c102e6f79a8625df90706109849404d86c/bfd/elf-eh-frame.c#L823-L829

So maybe really stick to your solution, but please clean it up a bit and put it inside ifdef __MINT__. And also find a way to use a different variable than _base to avoid the reference.

Edit: maybe something like attached patch will do.

PS: how did you attach your patch above? When i try to use such a filename, i get we don't support that file type

runtime.patch.txt

@vinriviere
Copy link
Member Author

Problem with my patch seems to be the linker, which also needs to know how many bytes to skip for DW_EH_PE_aligned:

Oh! Good catch @th-otto, I wouldn't have imagined that.

Sadly, this is what I feared. Even if the compile-time patch is theoretically better, it potentially requires patches at several places. And having to patch both gcc and the linker is a severe drawback.

So maybe really stick to your solution,

OK, let's go for it.

Edit: maybe something like attached patch will do.

That looks fine, and minimal. So please push it, with a reference to this issue in the commit message, and that will be perfect.

PS: how did you attach your patch above? When i try to use such a filename, i get we don't support that file type

Hmm, nothing special. This might be related to the MIME type. This could be different among systems/browsers. I don't remember if I used Firefox or Chrome, on Windows or Linux.

th-otto added a commit that referenced this issue Oct 25, 2023
gcc & binutils both assume that the pointer for such addresses is aligned
at a 4-byte boundary, but this cannot be guaranteed by TOS at runtime

See also #23 (comment)
@th-otto
Copy link
Contributor

th-otto commented Oct 25, 2023

BTW, there must be something that we have missed when updating from gcc 4.6.4 to gcc-7 and above.

With the following code:

typedef long uint32_t;

typedef __attribute__((aligned(1))) uint32_t unalign32;

uint32_t MEM_read32(const void* ptr) { return *(const unalign32*)ptr; }

gcc-4.6.4 -m68000 produces

_MEM_read32:
        move.l 4(%sp),%a0
        move.l (%a0),%d0
        rts

So it does not care about the alignment.
gcc-7 produces the same for coldfire, but for m68000 or m68020 it produces

_MEM_read32:
        move.l 4(%sp),%a0
        moveq #0,%d1
        move.b (%a0),%d1
        lsl.w #8,%d1
        swap %d1
        clr.w %d1
        moveq #0,%d0
        move.b 1(%a0),%d0
        swap %d0
        clr.w %d0
        or.l %d0,%d1
        moveq #0,%d0
        move.b 2(%a0),%d0
        lsl.l #8,%d0
        or.l %d1,%d0
        or.b 3(%a0),%d0
        rts

So it uses single byte moves, shifts etc. That's why the access to unaligned pointer works. As already mentioned above, i think the distinct behaviour is because of

if (!TARGET_COLDFIRE)
target_mask |= MASK_STRICT_ALIGNMENT;
, which sets the flag also for m68020. Question is why gcc-4.6.4 behaves differently there. The two definitions that have an influence on this (the code above, and
#define STRICT_ALIGNMENT (TARGET_STRICT_ALIGNMENT)
#define M68K_HONOR_TARGET_STRICT_ALIGNMENT 1
) atleast is the same.

@vinriviere
Copy link
Member Author

Many thanks @th-otto for your commit 0e0cacb. I've rebuilt my Ubuntu Binaries, that works fine.

Well, that DWARF-2 exception stuff has been incredibly hard to setup. We spent many weeks on it. But we finally understood all the issues, and fixed them. So I propose to close this issue. We could always open new Issues for future specific cases.

BTW, there must be something that we have missed when updating from gcc 4.6.4 to gcc-7 and above.

Well, that alignment question is a different topic. So I propose to open another Issue for that. I don't remember well how ColdFire behaves. IIRC it can vary among different cores and hardware.

@mikrosk
Copy link
Member

mikrosk commented Nov 30, 2023

Nobody objects, closing. ;)

@mikrosk mikrosk closed this as completed Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants