# Investigating crashes on self-modifying code
I wanted to run some self-modifying code on my "cluster", which includes RISC-V and AArch64 SBCs, but as soon as I ran the project on these boards, I noticed that it would sometimes crash with `SIGBUS` or `SIGSEGV`.
The execution flow of the program:

```
function();
*function = new_code;
function();
*function = original_code;
function(); // sometimes crashes here
```
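To make the pattern concrete, here is a minimal, self-contained sketch of it (my illustration, not the project's code): AArch64-only, with two hand-assembled `mov`/`ret` stubs written into an RWX scratch page, and error handling omitted:

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

static const uint32_t code_one[2] = {0x52800020, 0xd65f03c0}; // mov w0, #1; ret
static const uint32_t code_two[2] = {0x52800040, 0xd65f03c0}; // mov w0, #2; ret

int main(void) {
    // An RWX scratch page stands in for the real patched function.
    uint32_t *fn = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int (*call)(void) = (int (*)(void))fn;

    memcpy(fn, code_one, sizeof code_one);
    call();                                // function();
    memcpy(fn, code_two, sizeof code_two); // *function = new_code;
    call();                                // function();
    memcpy(fn, code_one, sizeof code_one); // *function = original_code;
    call();                                // may execute stale instructions
    return 0;
}
```

Whether any given call runs the bytes just written or whatever the instruction cache still holds is up to the hardware, which is exactly why the crashes were intermittent.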
I dumped both the original function's memory and the restored memory area; they were always identical, so I was not restoring bad values onto the original addresses.
To make things weirder:

- The crash is less likely (but still happens) if there's a `sleep(..)` after restoring the original code.
- The crash 100% goes away if there's a breakpoint after restoring the original code.
Analyzing the crash, the test application logs:

```
writing 0x2a0 bytes to 0xfffff7ffc2c0
```

and proceeds to crash:

```
Program received signal SIGSEGV, Segmentation fault.
0x52800022540005e8 in ?? ()
```
The code in the function's segment looks right:
```
(gdb) x/5i 0xfffff7ffc2c0
   0xfffff7ffc2c0 <__kernel_clock_gettime>:    paciasp
   0xfffff7ffc2c4 <__kernel_clock_gettime+4>:  cmp w0, #0xf
   0xfffff7ffc2c8 <__kernel_clock_gettime+8>:  b.hi 0xfffff7ffc384 <__kernel_clock_gettime+196>  // b.pmore
   0xfffff7ffc2cc <__kernel_clock_gettime+12>: mov w2, #0x1    // #1
   0xfffff7ffc2d0 <__kernel_clock_gettime+16>: mov w3, #0x883  // #2179
```
But when looking at the same code as hex values:
```
(gdb) x/10x 0xfffff7ffc2c0
0xfffff7ffc2c0 <__kernel_clock_gettime>:    0xd503233f 0x71003c1f 0x540005e8 0x52800022
0xfffff7ffc2d0 <__kernel_clock_gettime+16>: 0x52811063 0x1ac02042 0x10fee944 0x6a030043
0xfffff7ffc2e0 <__kernel_clock_gettime+32>: 0x54000640 0x8b20d089
```
I noticed that the program counter is set to the data at `__kernel_clock_gettime+8` -- the program tried to jump there, but there's no `br` instruction in this code!
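As a quick sanity check of that claim: reading the two dumped words at +8 and +12 back as a single little-endian qword reproduces the faulting PC exactly. A throwaway snippet (not part of the project):

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint32_t words[2] = {0x540005e8, 0x52800022}; // the words at +8 and +12
    uint64_t pc;
    memcpy(&pc, words, sizeof pc);     // reassemble them as one LE qword
    printf("0x%016" PRIx64 "\n", pc);  // prints 0x52800022540005e8
    return 0;
}
```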
The trampoline code does `br` to the address loaded from PC+8:

```asm
ldr x0, 8           ; load to x0 the value at PC+8
br  x0              ; jmp
.dword 0xffffffffff ; 8 bytes of data
```
but that's not the code we are currently executing!
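For concreteness, here is one way such a trampoline might be materialized from C. The struct layout and encodings are my illustration, not necessarily how the project assembles it:

```c
#include <stdint.h>

// ldr x0, #8 loads the qword sitting 8 bytes past the ldr itself,
// i.e. the `target` field below; br x0 then jumps to it.
struct trampoline {
    uint32_t ldr;    // 0x58000040 = ldr x0, #8
    uint32_t br;     // 0xd61f0000 = br x0
    uint64_t target; // the address we actually want to jump to
};

static struct trampoline make_trampoline(uint64_t target) {
    struct trampoline t = {0x58000040, 0xd61f0000, target};
    return t; // memcpy this 16-byte blob over the target function
}
```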
To validate this hypothesis, I updated the trampoline code with some padding:

```asm
ldr x0, 16          ; load to x0 the value at PC+16
br  x0              ; jmp
nop                 ; padding
nop                 ; padding
.dword 0xffffffffff ; 8 bytes of data
```
The program still crashed:

```
Program received signal SIGBUS, Bus error.
0x1ac0204252811063 in ?? ()
```
```
(gdb) x/5i 0xfffff7ffc2c0
   0xfffff7ffc2c0 <__kernel_clock_gettime>:    paciasp
   0xfffff7ffc2c4 <__kernel_clock_gettime+4>:  cmp w0, #0xf
   0xfffff7ffc2c8 <__kernel_clock_gettime+8>:  b.hi 0xfffff7ffc384 <__kernel_clock_gettime+196>  // b.pmore
   0xfffff7ffc2cc <__kernel_clock_gettime+12>: mov w2, #0x1    // #1
   0xfffff7ffc2d0 <__kernel_clock_gettime+16>: mov w3, #0x883  // #2179
(gdb) x/10x 0xfffff7ffc2c0
0xfffff7ffc2c0 <__kernel_clock_gettime>:    0xd503233f 0x71003c1f 0x540005e8 0x52800022
0xfffff7ffc2d0 <__kernel_clock_gettime+16>: 0x52811063 0x1ac02042 0x10fee944 0x6a030043
0xfffff7ffc2e0 <__kernel_clock_gettime+32>: 0x54000640 0x8b20d089
```
But now the value in the PC is the data at `__kernel_clock_gettime+16`! The data slot moved from +8 to +16 along with the padding, so the PC is tracking whatever qword sits in the trampoline's literal slot.

This implies that we are executing the old instructions (the trampoline) with the updated data!
## Cache coherency
At this point I suspected that the D-cache and I-cache were not coherent. I'd have expected both to read either the old or the new data in memory, but instead the CPU kept fetching the stale trampoline instructions while its data loads saw the freshly restored bytes.
My expectation was that the kernel would do this automatically every time `mprotect` is called on a given memory region, but after some research I found this discussion, which contains this relevant snippet:

> Subsequent changes to this mapping or writes to it are entirely the responsibility of the user. So if the user plans to execute instructions, it better explicitly flush the caches
While looking into how to flush these caches, I found an ARM blog post which mentions that GCC has a built-in (`__clear_cache`) specifically designed to clear the D/I caches in a given range.
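On AArch64, what the built-in does boils down to a clean-then-invalidate pass over each cache line in the range. Here's a simplified sketch of the sequence, based on the ARM documentation (the real implementation also special-cases CPUs whose caches don't need one or both steps):

```c
#include <stddef.h>
#include <stdint.h>

// Simplified, AArch64-only sketch of what __clear_cache performs.
static void flush_icache_range(char *start, char *end) {
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    size_t dline = 4u << ((ctr >> 16) & 0xf); // smallest D-cache line (bytes)
    size_t iline = 4u << (ctr & 0xf);         // smallest I-cache line (bytes)

    // Clean the D-cache lines to the point of unification...
    for (char *p = (char *)((uintptr_t)start & ~(uintptr_t)(dline - 1));
         p < end; p += dline)
        __asm__ volatile("dc cvau, %0" : : "r"(p) : "memory");
    __asm__ volatile("dsb ish" : : : "memory");

    // ...then invalidate the corresponding I-cache lines.
    for (char *p = (char *)((uintptr_t)start & ~(uintptr_t)(iline - 1));
         p < end; p += iline)
        __asm__ volatile("ic ivau, %0" : : "r"(p) : "memory");
    __asm__ volatile("dsb ish" : : : "memory");
    __asm__ volatile("isb" : : : "memory");
}
```

The `dsb`/`isb` barriers matter as much as the cache operations themselves: without them the CPU is free to keep executing instructions it has already fetched.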
## Calling `__clear_cache`

I needed to call a GCC/LLVM built-in, but `rustc` does not expose it. I could/should have yoinked the code from LLVM/GCC's implementation, but instead I chose to rely on it by creating a small library (cacheflush-sys) which only exports the compiler built-in:
```c
void clear_cache(void* start, void* end) {
    __builtin___clear_cache(start, end);
}
```
Having the ability to make the data and instruction caches coherent again, I changed the `overwrite` implementation to be basically:

```
mprotect(READ | WRITE | EXECUTE);
memcpy(__kernel_clock_gettime, ...);
mprotect(READ | EXECUTE);
clear_cache(__kernel_clock_gettime, __kernel_clock_gettime + trampoline_len);
```
With this change, the program stopped crashing on both AArch64 and RISC-V.
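Fleshed out in C, the fixed write path might look roughly like this (a sketch: the `overwrite` signature and page arithmetic are illustrative rather than the project's actual code, and error handling is omitted):

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void overwrite(void *fn, const void *new_code, size_t len) {
    long pagesz = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)fn & ~(uintptr_t)(pagesz - 1));
    size_t span = ((uintptr_t)fn + len) - (uintptr_t)page;

    mprotect(page, span, PROT_READ | PROT_WRITE | PROT_EXEC);
    memcpy(fn, new_code, len);
    mprotect(page, span, PROT_READ | PROT_EXEC);

    // The crucial step: make the I-cache coherent with what we just
    // wrote through the D-cache, before anyone executes these bytes.
    __builtin___clear_cache((char *)fn, (char *)fn + len);
}
```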