Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple instruction decompiles to incorrect and overly verbose code #7

Open
flarn2006 opened this issue Aug 4, 2020 · 2 comments
Open

Comments

@flarn2006
Copy link

flarn2006 commented Aug 4, 2020

Here's an example:

                             //
                             // ram 
                             // fileOffset=0, length=30
                             // ram: 000000-00001d
                             //
          000000 e6 f4 00 01     mov        r4,0x100
                             LAB_000004                                      XREF[1]:     00000a(j)  
          000004 d7 00 09 00     exts       #0x9,#0x1
          000008 98 84           mov        r8,[r4+]
          00000a 2d fc           jmpr       cc_EQ,LAB_000004
          00000c 26 f4 02 01     sub        r4,#0x102
          000010 5c 44           shl        r4,#0x4
          000012 2b 58           prior      r5,r8
          000014 5c 15           shl        r5,#0x1
          000016 00 45           add        r4,r5
          000018 06 f4 00 10     add        r4,#0x1000
          00001c 9c 04           jmpi       cc_UC,[r4]

The lines

exts #0x9,#0x1
mov r8,[r4+]

should decompile to something like:

short foo, bar;  // r8, r4
foo = *(short* far)(0x90000 + bar);
bar += 2;

Instead, they decompile to:

uVar2 = uVar2 + 2;
uVar2 = *(ushort *)
         ((uint3)((ushort)((uVar2 & 0xc000) == 0) * unaff_DPP0 |
                  (ushort)((uVar2 & 0xc000) == 1) * unaff_DPP1 |
                  (ushort)((uVar2 & 0xc000) == 2) * unaff_DPP2 |
                 (ushort)((uVar2 & 0xc000) == 3) * unaff_DPP3) << 0xe | (uint3)(uVar2 & 0x3fff))
;

First of all, it's incrementing the variable representing r4 before accessing it, but the instruction set manual says [r4+] means to access the address in r4 and then increment it.

Second, it appears to be ignoring the exts line entirely, instead interpreting r4 as a 16-bit address. exts s, n means that the next n instructions should use 32-bit addressing, using s as the upper word for all addresses. Instead, it's using the DPP registers, which define the 16-bit address space.

Less importantly, even if the decompilation were correct, it seems far too verbose. Code using a lot of 16-bit addressing (which is extremely common) becomes very hard to read due to all the repetitive lines of code. Wouldn't it make more sense to use near/far pointer syntax, and just treat all the DPP stuff as implied whenever a near pointer is dereferenced? Or if Ghidra doesn't support that, you could probably define pointers as 24-bit (or 32-bit if it must be a power of two) and define an intrinsic for converting 16-bit addresses, like *__dpp(address) = value;

@esaulenka
Copy link
Owner

Hello Flarn!

incrementing the variable representing r4 before accessing it

It is very strange thing. Slaspec code written in correct order:

:mov Rwn1215, [Rwm0811+] is op0007=0x98 & Rwm0811 & Rwn1215 {
	local tmp:2;
	read_w(tmp, Rwm0811);
	mov_w(Rwn1215, tmp);
	Rwm0811 = Rwm0811 + 2;
}

it appears to be ignoring the exts line entirely

Yes, you are right, exts extp opcodes not supported now.
Also, I have no ideas, how to do it only with SLEIGH. Looks like there is need special analyzer plugin, written in java. I have no expirence with it...

define an intrinsic for converting 16-bit addresses, like *__dpp(address) = value

It is not good idea, because it breaks all cross-references, even for variables with constant addresses. I think, it should be done in that plugin too.

Right now I dont have enought time to investigate it.
I can just add some README, that describes flaws in current implementation.

@flarn2006
Copy link
Author

About the first thing, it's possible that a later decompilation step changes the order, and it only leaves it like that because it knows the order doesn't matter in this case. Like how an optimizing compiler may write instructions in a different order than a literal start-to-finish reading of the source code would indicate, if it knows it won't make a difference. I haven't tested, but maybe it was an instruction where the order mattered (e.g. mov r4, [r4+]) it would generate the correct code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants