Tuesday, May 22, 2007

Thunking in x64 -- one assembly approach

As noted in an earlier post, thunking in x64 isn't nearly as trivial as in x86. All calls use a fastcall-style convention, which passes the first four parameters in registers and any remaining ones on the stack, so inserting an extra parameter means some data shifting is necessary. It's not that the thunk itself is conceptually hard (at least if you ignore floating point detection, covered later), but it's a whole lot more complex in terms of instruction count, memory impact and so forth. To give you an impression of said impact, here's my first approach to thunking a function with eight parameters in total:

mov rax, qword ptr [rsp]        ; snatch return address
sub rsp, 8h                     ; expand the stack
mov qword ptr [rsp + 20h], r9   ; save fourth parameter in next to last spill slot
mov r9, r8                      ; shift register parameter 3->4
mov r8, rdx                     ; shift register parameter 2->3
mov rdx, rcx                    ; shift register parameter 1->2
mov rcx, 4h                     ; number of qwords to displace
mov r10, rdi                    ; save nonvolatile rdi in scratch
mov r11, rsi                    ; save nonvolatile rsi in scratch
lea rdi, QWORD PTR [rsp + 28h]  ; destination = last spill position on stack
lea rsi, QWORD PTR [rsp + 30h]  ; source = fifth param on stack
rep movsq                       ; move params up
mov qword ptr [rsp + 48h], rax  ; save old return address at the end of the frame
mov rcx, THIS POINTER           
mov rdi, r10                    ; restore from scratch
mov rsi, r11                    ; restore from scratch
mov rax, FUNCTION ADDRESS       
call rax                        ; call thunked function
mov rdx, qword ptr [rsp + 48h]  ; restore old return address to scratch
mov rcx, 8h                     ; number of qwords to displace
mov r10, rdi                    ; save nonvolatile rdi in scratch
mov r11, rsi                    ; save nonvolatile rsi in scratch
lea rdi, qword ptr [rsp + 48h]  ; destination = beyond last param on stack
lea rsi, qword ptr [rsp + 40h]  ; source = last param on stack
std                             ; set direction flag
rep movsq                       ; restore stack position of params
cld                             ; clear direction flag
mov qword ptr [rsp + 8h], rdx   ; place old return address in the right spot
mov rdi, r10                    ; restore from scratch
mov rsi, r11                    ; restore from scratch
add rsp, 8h                     ; shrink stack
ret                             ; return to caller

What this does is shift the register parameters one position to the right, working in reverse order so that nothing gets overwritten: R9 goes onto the stack, then R8 => R9, RDX => R8, and RCX => RDX. The stack is also expanded to make room for the new parameter (the 'this' pointer), with four spill slots preserved for the four register-passed params:
  • RCX = 'this'
  • RDX = param 1
  • R8 = param 2
  • R9 = param 3

Param 4, which used to be in R9, now resides just after the four spill slots, ahead of params 5 through 8.
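To make the resulting layout easier to picture, here is a minimal C++ sketch that models the shuffle in plain data rather than on a real stack. The `CallState` struct and the `insertThis` function are hypothetical names made up for illustration; they only mimic where each argument ends up and say nothing about return addresses or alignment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of the argument state of one call: the four register
// arguments plus the stack arguments that follow the four spill slots.
struct CallState {
    std::uint64_t rcx, rdx, r8, r9;
    std::array<std::uint64_t, 8> stack;  // stack arguments, lowest address first
    std::size_t stackCount;              // number of stack entries in use
};

// Mirrors what the thunk does before the call: every original argument moves
// one position to the right and 'self' takes over the first slot (RCX).
CallState insertThis(const CallState& in, std::uint64_t self) {
    assert(in.stackCount < in.stack.size());  // need room for one more entry
    CallState out{};
    out.stackCount = in.stackCount + 1;
    out.stack[0] = in.r9;                     // old param 4 becomes first stack arg
    for (std::size_t i = 0; i < in.stackCount; ++i)
        out.stack[i + 1] = in.stack[i];       // old stack params follow it
    out.r9  = in.r8;
    out.r8  = in.rdx;
    out.rdx = in.rcx;
    out.rcx = self;
    return out;
}
```

Feeding in params 1 through 8 with `self` set to 0xAA gives RCX = 0xAA, RDX = 1, R8 = 2, R9 = 3, and stack arguments 4, 5, 6, 7, 8.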

After adjusting the stack, the thunk calls the original target (member) function, which runs as normal. When that function returns, the thunk restores the stack, and the original caller is none the wiser: as far as it can tell a non-member function was called, but in reality our thunk seamlessly adjusted the call to match a member function.
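At the source level, the effect is roughly the same as a hand-written forwarding function. The sketch below is only an analogy, not the runtime-generated thunk: `Accumulator`, `g_target`, `sum8Thunk` and `invoke` are hypothetical names, and the global pointer stands in for the THIS POINTER immediate that the real thunk carries in its own code.

```cpp
#include <cstdint>

// Hypothetical class whose member function should be reachable as a callback.
struct Accumulator {
    std::int64_t bias;
    std::int64_t sum8(std::int64_t a, std::int64_t b, std::int64_t c, std::int64_t d,
                      std::int64_t e, std::int64_t f, std::int64_t g, std::int64_t h) {
        return bias + a + b + c + d + e + f + g + h;
    }
};

// Stand-in for the object pointer baked into the generated thunk.
static Accumulator* g_target = nullptr;

// Has the non-member signature the caller expects and forwards the call
// with the object pointer prepended as 'this'.
std::int64_t sum8Thunk(std::int64_t a, std::int64_t b, std::int64_t c, std::int64_t d,
                       std::int64_t e, std::int64_t f, std::int64_t g, std::int64_t h) {
    return g_target->sum8(a, b, c, d, e, f, g, h);
}

// The caller only knows about a plain eight-parameter function pointer.
using Callback = std::int64_t (*)(std::int64_t, std::int64_t, std::int64_t, std::int64_t,
                                  std::int64_t, std::int64_t, std::int64_t, std::int64_t);

std::int64_t invoke(Callback cb) {
    return cb(1, 2, 3, 4, 5, 6, 7, 8);
}
```

The obvious limitation of this version is that it supports one object per function, which is exactly what a per-object generated thunk avoids.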

So that's it for the new thunk approach. I still haven't decided whether or not to make an adapted Thunk64 library, similar to the Thunk32 library I've previously released. The reason is simply that with the added complexity, the thunk is getting too large to defend, efficiency-wise. Not to mention the fact that there would have to be a bunch of different thunk variations, depending on the parameter count and types. The x64 calling convention passes floating point parameters among the first four in XMM registers (XMM0 through XMM3) rather than in the integer registers, and that obviously isn't handled by the thunk shown above. Some template magic would have to deal with all of that, but again: I'm not really sure if it's worth it.
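As a rough idea of what the detection part of that template magic could look like, here is a small C++14 sketch. It is not taken from any existing library: `floatRegisterMask` is a hypothetical helper that reports which of the first four parameter positions hold floating point types, and would therefore arrive in an XMM register instead of RCX, RDX, R8 or R9.

```cpp
#include <cstddef>
#include <type_traits>

// Bit i of the result is set when parameter i (0-based, first four only) is a
// floating point type. Parameters beyond the fourth live on the stack and are
// ignored here.
template <typename... Args>
constexpr unsigned floatRegisterMask() {
    // The trailing 'false' keeps the array non-empty for an empty pack.
    constexpr bool isFloat[] = { std::is_floating_point<Args>::value..., false };
    unsigned mask = 0;
    for (std::size_t i = 0; i < sizeof...(Args) && i < 4; ++i) {
        if (isFloat[i]) {
            mask |= 1u << i;
        }
    }
    return mask;
}

// Example: positions 1 and 3 are floating point, so bits 1 and 3 are set.
static_assert(floatRegisterMask<int, double, void*, float, double>() == 10u,
              "only the first four positions are considered");
```

A thunk generator could use such a mask to pick a variant that shifts the XMM registers as well; how many variants that would take is exactly the cost I'm unsure about.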

I think we'll leave it at "we'll see" for now :)

1 comment:

peterchen said...

You seem to be "the guy for thunks" :)

I've picked up a WNDPROC thunking mechanism for Win32 that replaces the HWND with the object pointer (rather than adding the pointer).

The idea is that the WNDPROC/Object combination is unique and each gets its own thunk, so the object can as well store the HWND.

Would that be easier, as you don't need to shift the remaining parameters?