RCE Endeavors 😅

July 23, 2015

Common Types of Disassemblers

Filed under: General x86,General x86-64,Programming — admin @ 8:44 AM

The point of a disassembler is to take an input series of bytes and output an architecture-specific interpretation of those bytes. For example, a typical disassembler targeting the x86 architecture will take the following bytes: 55 8B EC B8 FF 00 00 00 33 DB 93, and produce a readable representation of those bytes similar to below:

55                   push        ebp  
8B EC                mov         ebp, esp  
B8 FF 00 00 00       mov         eax, 0FFh  
33 DB                xor         ebx,ebx  
93                   xchg        eax,ebx  

The process involves looking at the opcode(s), getting the instruction length, and parsing out extra information in the instruction such as displacements, relative/absolute destinations, and the registers or memory affected; basically a large number of lookups and parsing. Fortunately, there are libraries for this. The disassembly engine used in this example will be BeaEngine due to its simplicity. Capstone Engine is also a great engine that offers support for many architectures, a clean and thread-safe API, and a permissive license, among other things. Once decoding is taken care of, the actual challenge of parsing executable files comes into play. This issue will be the topic of this post.
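To make the "lookups and parsing" concrete, here is a minimal sketch of a hand-written decoder that understands only the five encodings from the example bytes above. It is not how BeaEngine or Capstone work internally; the names DecodedInstruction, DecodeOne, and DecodeAll are made up for illustration, and a real engine covers the full opcode map, prefixes, and ModRM/SIB forms.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// One decoded instruction: its length in bytes and a readable form.
struct DecodedInstruction
{
    std::size_t length;
    std::string text;
};

// Decodes the instruction at bytes[offset]. Only the encodings from the
// example are recognized; a length of 0 means "unknown or truncated".
DecodedInstruction DecodeOne(const std::vector<std::uint8_t> &bytes, std::size_t offset)
{
    if (offset >= bytes.size())
    {
        return { 0, "(end of input)" };
    }

    const std::size_t remaining = bytes.size() - offset;
    const std::uint8_t opcode = bytes[offset];

    // Single-byte instructions: the opcode alone identifies them.
    if (opcode == 0x55) return { 1, "push ebp" };
    if (opcode == 0x93) return { 1, "xchg eax, ebx" };

    // Opcode plus a ModRM byte that selects the two registers.
    if (opcode == 0x8B && remaining >= 2 && bytes[offset + 1] == 0xEC)
        return { 2, "mov ebp, esp" };
    if (opcode == 0x33 && remaining >= 2 && bytes[offset + 1] == 0xDB)
        return { 2, "xor ebx, ebx" };

    // Opcode plus a 32-bit little-endian immediate.
    if (opcode == 0xB8 && remaining >= 5)
    {
        std::uint32_t imm = 0;
        for (int i = 0; i < 4; ++i)
        {
            imm |= static_cast<std::uint32_t>(bytes[offset + 1 + i]) << (8 * i);
        }
        char buffer[32] = { 0 };
        std::snprintf(buffer, sizeof(buffer), "mov eax, 0x%X", static_cast<unsigned int>(imm));
        return { 5, buffer };
    }

    return { 0, "(unknown)" };
}

// Decodes instructions back to back until the input ends or decoding fails.
std::vector<DecodedInstruction> DecodeAll(const std::vector<std::uint8_t> &bytes)
{
    std::vector<DecodedInstruction> result;
    std::size_t offset = 0;
    while (offset < bytes.size())
    {
        const DecodedInstruction instruction = DecodeOne(bytes, offset);
        if (instruction.length == 0) break;
        result.push_back(instruction);
        offset += instruction.length;
    }
    return result;
}
```

Feeding it the bytes 55 8B EC B8 FF 00 00 00 33 DB 93 yields five instructions with lengths 1, 2, 5, 2, and 1. Even in this toy form, the instruction length depends on the opcode, which is why a decoder has to walk the bytes in order.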

There are two common ways of disassembling a file: linearly and recursively. In the case of linear disassembly, the disassembler begins reading the first instruction at an address in the binary and continues reading until some termination condition is met. A termination condition can be a set number of instructions decoded, the end of a block, or an error condition such as an unknown opcode. The code for linear disassembly is straightforward and is shown below. The termination condition in the example code stops the listing when a RET instruction is hit.

DISASM disasm = { 0 };
disasm.EIP = (UIntPtr)pStartingAddress;
 
int iLength = UNKNOWN_OPCODE;
do
{
    iLength = DisasmFnc(&disasm);
    fprintf(stdout, "0x%X -- %s\n",
        disasm.EIP, disasm.CompleteInstr);
 
    disasm.EIP += iLength;
 
} while (!IsRet(disasm.Instruction) && iLength != UNKNOWN_OPCODE);

The “algorithm” is (very) easy to write and, given knowledge of the format of the file being disassembled, proves to be pretty reliable. For example, the Portable Executable (PE) format on Windows provides information on all executable sections and their sizes on disk and in memory with alignment. The ELF format on Linux provides the same relevant information. Using this information, a disassembler knows the exact range to disassemble to produce reliable output. The major drawback with this technique is that there is no reliable way to separate useless code from executing code. Any unused code/data inserted intentionally (or not) into the target area to disassemble will be listed. In an assembly dump this usually sticks out, because the instructions will be nonsensical relative to the surrounding code. Also, any use of instruction interleaving, i.e. a jump into the middle of an instruction (usually for obfuscation purposes), will be missed by the disassembler.

The second way to disassemble a file is to do it recursively, that is to say that the disassembler will (try to) follow the control path of the actual program. This involves analyzing the destinations of any control flow instructions: calls, jumps, and returns. For every CALL instruction encountered, the address of the next instruction must be pushed on a stack, and the disassembly continues on at the CALL address. This continues on, recursively if need be for multiple CALLs, until a RET instruction is hit. Once a RET instruction is hit, the top of the call stack is popped off and disassembly continues on from that point. This is pretty much exactly how execution happens in a program. Also, for every unconditional jump instruction, the disassembly merely continues at the target destination. The sample code is a bit more complex, but not by much:

DISASM disasm = { 0 };
disasm.EIP = (UIntPtr)pStartingAddress;
 
int iLength = UNKNOWN_OPCODE;
 
do
{
    iLength = DisasmFnc(&disasm);
    fprintf(stdout, "0x%X -- %s\n",
        disasm.EIP, disasm.CompleteInstr);
    if (IsCall(disasm.Instruction))
    {
        m_retStack.push(disasm.EIP + iLength);
        disasm.EIP = ResolveAddress(disasm);
    }
    else if (IsJump(disasm.Instruction))
    {
        disasm.EIP = ResolveAddress(disasm);
    }
    else if (IsRet(disasm.Instruction))
    {
        if (!m_retStack.empty())
        {
            disasm.EIP = m_retStack.top();
            m_retStack.pop();
        }
        else
        {
            break;
        }
    }
    else
    {
        disasm.EIP += iLength;
    }
 
} while (iLength != UNKNOWN_OPCODE);

This technique has its own benefits and drawbacks. The major benefit is that (theoretically) only executable code will be disassembled. This means that only relevant and executing code will be shown to the user. Also, the approximate or exact number of instructions to disassemble does not need to be known, unlike in the linear technique. With recursive disassembly, you provide one or more starting addresses and then begin tracing control flow from them. Obfuscation techniques such as instruction interleaving will also be discovered. This technique does have a major drawback, however. CALLs or JMPs made indirectly cannot be deciphered. For example, the destinations of instructions such as JMP [ESI+0x4], CALL EBX, CALL [0xAABBCCDD] where 0xAABBCCDD contains an import fixed up at runtime, and so on, cannot be followed by the disassembler. This means that there are a lot of edge cases to consider when encountering instructions such as these, in terms of knowing where to go next and making sure that the call stack is consistent.
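A common variation on the return-stack loop is to keep a worklist of addresses still to visit plus a set of addresses already decoded, which also keeps backward jumps from looping forever. The sketch below is not the post's sample code: it swaps in that worklist approach, operates on offsets into a byte buffer, and understands only a tiny, assumed subset of x86 (CALL rel32, JMP rel8, RET, NOP, PUSH EBP). The name RecursiveTraversal is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Returns the offsets of every instruction reachable from 'entry'.
// Recognized opcodes: E8 (call rel32), EB (jmp rel8), C3 (ret),
// 90 (nop), 55 (push ebp). Anything else ends the current path.
std::set<std::size_t> RecursiveTraversal(const std::vector<std::uint8_t> &code, std::size_t entry)
{
    std::set<std::size_t> visited;
    std::vector<std::size_t> worklist{ entry };

    while (!worklist.empty())
    {
        std::size_t offset = worklist.back();
        worklist.pop_back();

        // Follow one path until it ends, leaves the buffer, or rejoins
        // code that has already been decoded.
        while (offset < code.size() && visited.insert(offset).second)
        {
            const std::uint8_t opcode = code[offset];

            if (opcode == 0xE8 && offset + 5 <= code.size())
            {
                std::uint32_t raw = 0;
                for (int i = 0; i < 4; ++i)
                {
                    raw |= static_cast<std::uint32_t>(code[offset + 1 + i]) << (8 * i);
                }
                const auto rel = static_cast<std::ptrdiff_t>(static_cast<std::int32_t>(raw));
                worklist.push_back(offset + 5);  // resume here after the callee
                offset = offset + 5 + static_cast<std::size_t>(rel);
            }
            else if (opcode == 0xEB && offset + 2 <= code.size())
            {
                const auto rel = static_cast<std::ptrdiff_t>(static_cast<std::int8_t>(code[offset + 1]));
                offset = offset + 2 + static_cast<std::size_t>(rel);
            }
            else if (opcode == 0x90 || opcode == 0x55)
            {
                offset += 1;  // single-byte instruction, fall through
            }
            else
            {
                break;  // ret or an unrecognized opcode: this path is done
            }
        }
    }

    return visited;
}
```

For a buffer such as 55 / E8 04 00 00 00 / EB 01 / CC / C3 / 90 C3, starting at offset 0, the CC byte at offset 8 is never visited because the short jump skips over it, whereas a linear pass would list it.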

The sample code provides a trivial implementation of both of these techniques, along with two test functions to show how they perform. TestFunction1 demonstrates how a recursive disassembler follows control flow. Compare the two outputs:

Linear

0x1146670 -- call dword ptr [0114B008h]
0x1146676 -- ret

Recursive

0x1146670 -- call dword ptr [0114B008h]
0x754218E0 -- mov eax, dword ptr fs:[00000018h]
0x754218E6 -- mov eax, dword ptr [eax+24h]
0x754218E9 -- ret
0x1146676 -- ret

The second example, TestFunction2, shows how the recursive disassembler skips over instructions that are not executed.

0x66680 -- push ebp
0x66681 -- mov ebp, esp
0x66683 -- mov eax, 000000FFh
0x66688 -- call 000666AAh
0x6668D -- xor ebx, ebx
0x6668F -- xchg eax, ebx
0x66690 -- jmp 000666B1h
0x66692 -- cmp ecx, AABBCCDDh
0x66698 -- push 00000000h
0x6669A -- push 00000000h
0x6669C -- push 00000000h
0x6669E -- push 00000000h
0x666A0 -- call dword ptr [0006B0A0h]
0x666A6 -- pop ebp
0x666A7 -- mov esp, ebp
0x666A9 -- ret

Overall, each approach has its benefits and drawbacks. With good knowledge of an executable file's format, a linear disassembler works perfectly fine for showing a disassembly listing. Typically, disassemblers with a focus on code analysis, e.g. IDA Pro, will use a recursive approach and have a sophisticated analysis engine to complement it.

The Visual Studio 2015 RC project for this example can be found here. The source code is viewable on Github here.

Follow on Twitter for more updates

July 8, 2015

Code Snippet: Safe Objects

Filed under: General x86,General x86-64,Programming — admin @ 8:05 PM

I’ve found that one of the annoying things with using the Windows API is that there is (usually) no automatic cleanup of opened handles. For example, most functions that open a handle, e.g. OpenProcess, CreateFile, LoadLibrary, etc., return an opaque pointer that you are required to close when you’re done using it. Closing the handle is usually done with the generic CloseHandle function, or with another specific cleanup function mentioned in the documentation, e.g. FreeLibrary for the handle returned by LoadLibrary.

This is the traditional C way of doing things, but I found that it can be improved a bit by using RAII with C++. The idea is to have a wrapper class that contains the underlying handle type and performs a cleanup when the lifetime of the object is finished. Thus was born the prototype code for a safe object:

namespace AutoClean
{
 
    namespace SafeObjectCleanupFnc
    {
        bool ClnCloseHandle(const HANDLE &handle) { return BOOLIFY(CloseHandle(handle)); };
        bool ClnFreeLibrary(const HMODULE &handle) { return BOOLIFY(FreeLibrary(handle)); };
        bool ClnLocalFree(const HLOCAL &handle) { return (LocalFree(handle) == nullptr); };
        bool ClnGlobalFree(const HGLOBAL &handle) { return (GlobalFree(handle) == nullptr); };
        bool ClnUnmapViewOfFile(const PVOID &handle) { return BOOLIFY(UnmapViewOfFile(handle)); };
        bool ClnCloseDesktop(const HDESK &handle) { return BOOLIFY(CloseDesktop(handle)); };
        bool ClnCloseWindowStation(const HWINSTA &handle) { return BOOLIFY(CloseWindowStation(handle)); };
        bool ClnCloseServiceHandle(const SC_HANDLE &handle) { return BOOLIFY(CloseServiceHandle(handle)); };
        bool ClnVirtualFree(const PVOID &handle) { return BOOLIFY(VirtualFree(handle, 0, MEM_RELEASE)); };
    }
 
    template <typename T, bool (* Cleanup)(const T &), PVOID InvalidValue>
    class SafeObject final
    {
    public:
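        //Default construction starts out holding the invalid value
        SafeObject() : m_obj{ (T)InvalidValue }
        {
        }
 
        SafeObject(const SafeObject &copy) = delete;
 
        SafeObject(const T &obj) : m_obj{ obj }
        {
        }
 
        //Non-const rvalue reference so that move assignment can be used;
        //m_obj starts out invalid so the assignment does not clean up garbage
        SafeObject(SafeObject &&obj) : m_obj{ (T)InvalidValue }
        {
            *this = std::move(obj);
        }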
 
        ~SafeObject()
        {
            if (IsValid())
            {
                (void)Cleanup(m_obj);
            }
        }
 
        const bool IsValid() const
        {
            return m_obj != (T)InvalidValue;
        }
 
        SafeObject &operator=(const SafeObject &copy) = delete;
 
        SafeObject &operator=(SafeObject &&obj)
        {
            if (IsValid())
            {
                (void)Cleanup(m_obj);
            }
 
            m_obj = std::move(obj.m_obj);
            obj.m_obj = (T)InvalidValue;
 
            return *this;
        }
 
        T * const Ptr()
        {
            return &m_obj;
        }
 
        const T operator()() const
        {
            return m_obj;
        }
 
    private:
        T m_obj;
    };
 
    using SafeHandle = SafeObject<HANDLE, SafeObjectCleanupFnc::ClnCloseHandle, INVALID_HANDLE_VALUE>;
    using SafeLibrary = SafeObject<HMODULE, SafeObjectCleanupFnc::ClnFreeLibrary, nullptr>;
    using SafeLocal = SafeObject<HLOCAL, SafeObjectCleanupFnc::ClnLocalFree, nullptr>;
    using SafeGlobal = SafeObject<HGLOBAL, SafeObjectCleanupFnc::ClnGlobalFree, nullptr>;
    using SafeMapView = SafeObject<PVOID, SafeObjectCleanupFnc::ClnUnmapViewOfFile, nullptr>;
    using SafeDesktop = SafeObject<HDESK, SafeObjectCleanupFnc::ClnCloseDesktop, nullptr>;
    using SafeWindowStation = SafeObject<HWINSTA, SafeObjectCleanupFnc::ClnCloseWindowStation, nullptr>;
    using SafeService = SafeObject<SC_HANDLE, SafeObjectCleanupFnc::ClnCloseServiceHandle, nullptr>;
    using SafeVirtual = SafeObject<PVOID, SafeObjectCleanupFnc::ClnVirtualFree, nullptr>;
}

This is a basic object that supports assignment and moves. Copying has been disabled in this example code since it introduces a lot of extra bookkeeping, but it can be done with the use of DuplicateHandle. A sample usage of this code is shown below:
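For comparison, similar move-only cleanup semantics can be had from the standard library with std::unique_ptr and a custom deleter. The sketch below is an alternative to the SafeObject class, not part of it. To stay portable it uses a stand-in resource type (FakeResource, hypothetical) in place of a Windows handle, and its deleter only counts how often it runs.

```cpp
#include <memory>
#include <utility>

// Stand-in for an OS resource; closeCount lets the example observe cleanup.
struct FakeResource
{
    int closeCount = 0;
};

// Custom deleter: plays the role that CloseHandle or FreeLibrary would play.
struct FakeResourceCloser
{
    void operator()(FakeResource *resource) const noexcept
    {
        ++resource->closeCount;
    }
};

// Move-only owner; copying is disabled by unique_ptr itself.
using ScopedResource = std::unique_ptr<FakeResource, FakeResourceCloser>;

// Wraps two resources, moves the second wrapper into the first, and lets
// both go out of scope. Returns true if each resource was closed exactly once.
bool DemonstrateMoveAndCleanup()
{
    FakeResource first;
    FakeResource second;

    {
        ScopedResource handle1{ &first };
        ScopedResource handle2{ &second };

        // Move assignment closes 'first' and takes ownership of 'second'.
        handle1 = std::move(handle2);

        if (first.closeCount != 1 || second.closeCount != 0 || handle2)
        {
            return false;
        }
    }   // handle1 closes 'second' here; handle2 is empty and does nothing

    return first.closeCount == 1 && second.closeCount == 1;
}
```

One difference to keep in mind is that std::unique_ptr treats only a null pointer as "empty", so handle types whose invalid value is something else (such as INVALID_HANDLE_VALUE) need extra care that the InvalidValue template parameter above handles directly.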

int main(int argc, char *argv[])
{
    AutoClean::SafeHandle handle1 = CreateFile(L"testfile1.txt", GENERIC_READ, 0, nullptr,
        OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
 
    AutoClean::SafeHandle handle2 = CreateFile(L"testfile2.txt", GENERIC_READ, 0, nullptr,
        OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
 
    fprintf(stderr, "Handle 1: %X\n"
        "Handle 2: %X\n",
        handle1(), handle2());
 
    handle1 = std::move(handle2);
 
    fprintf(stderr, "Handle 1: %X\n"
        "Handle 2: %X\n",
        handle1(), handle2());
 
    return 0;
}

The Visual Studio 2015 RC project for this example can be found here. The source code is viewable on Github here.

Edit: Added some additional features to the example code per requests via Twitter.


June 15, 2015

Syscall Hooking Under WoW64: Implementation (2/2)

Filed under: General x86,General x86-64,Programming — admin @ 7:01 PM

Welcome to the second and final installment of the series on WoW64 syscall hooking. This part builds on the background provided by the first part and shows what is required to build a working implementation of a syscall hook under WoW64 for x64 Windows 8.1. The code presented here should be trivially portable across different versions of Windows by simply changing syscall numbers, which can be found in the table of syscalls mentioned in the first part. The code provided will be a basic syscall hook that allows for filtering of one particular syscall. There is not much stopping it from being extended to support hooking an arbitrary number of syscalls.

Getting Started

From the first part in the series, we see that each thread holds a pointer to a block of code responsible for making the switch from x86 to x64 code execution. This is the pathway that WoW64 uses to perform a syscall, and the pointer to this code resides within the thread local storage region allotted to each thread. Since the value (FS:[0xC0]) in all threads points to the same location, the easiest way to go about hooking syscalls is to replace that jump with an inline hook to our filtering code. This code can then check the syscall number, which is stored in EAX, against the desired syscall to hook. If they match, our hook gets called, otherwise execution will continue on as normal.

The process begins by getting this address. With the aid of inline assembly, this is very straightforward:

const DWORD_PTR __declspec(naked) GetWow64Address()
{
    __asm
    {
        mov eax, dword ptr fs:[0xC0]
        ret
    }
}

The return value of this function, stored in EAX, will hold the address of the block of code responsible for performing the jump to 64-bit mode. As previously mentioned, the code at this address is what will be replaced with a jump to the filtering code. This is achieved with a simple overwrite of the instruction bytes at the address:

const void EnableWow64Redirect(const DWORD_PTR dwWow64Address, const LPVOID lpNewJumpLocation)
{
    unsigned char trampolineBytes[] =
    {
        0x68, 0xDD, 0xCC, 0xBB, 0xAA,       /*push 0xAABBCCDD*/
        0xC3,                               /*ret*/
        0xCC, 0xCC, 0xCC                    /*padding*/
    };
    memcpy(&trampolineBytes[1], &lpNewJumpLocation, sizeof(DWORD_PTR));
 
    WriteJump(dwWow64Address, trampolineBytes, sizeof(trampolineBytes));
}
 
const void WriteJump(const DWORD_PTR dwWow64Address, const void *pBuffer, size_t ulSize)
{
    DWORD dwOldProtect = 0;
    (void)VirtualProtect((LPVOID)dwWow64Address, PAGE_SIZE, PAGE_EXECUTE_READWRITE, &dwOldProtect);
    (void)memcpy((void *)dwWow64Address, pBuffer, ulSize);
    (void)VirtualProtect((LPVOID)dwWow64Address, PAGE_SIZE, dwOldProtect, &dwOldProtect);
}
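The byte patching in EnableWow64Redirect can be looked at in isolation. The following portable sketch builds the same nine-byte push/ret sequence for a given 32-bit target address without touching any process memory. BuildPushRetTrampoline is a hypothetical helper name, not something from the project.

```cpp
#include <array>
#include <cstdint>

// Builds "push imm32; ret" followed by three int3 padding bytes.
// Executing this sequence transfers control to 'target', because RET pops
// the value that PUSH just placed on the stack.
std::array<std::uint8_t, 9> BuildPushRetTrampoline(std::uint32_t target)
{
    std::array<std::uint8_t, 9> bytes =
    {
        0x68, 0x00, 0x00, 0x00, 0x00,   // push imm32 (placeholder)
        0xC3,                           // ret
        0xCC, 0xCC, 0xCC                // padding
    };

    // x86 immediates are little-endian: least significant byte first.
    for (int i = 0; i < 4; ++i)
    {
        bytes[1 + i] = static_cast<std::uint8_t>(target >> (8 * i));
    }

    return bytes;
}
```

For the filtering code address 0x13411B0 this produces 68 B0 11 34 01 C3 CC CC CC. Writing each byte explicitly gives the same result as the memcpy of the pointer on a little-endian x86 build, without relying on the host's byte order.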

You can see the hook being installed in a before/after form:

[Image: syscall1, the original bytes at the WoW64 address]

Above are the bytes of the original code. This carries out the inter-segment jump to 64-bit mode, as is standard for all WoW64 calls.

The code with the hook installed is shown below. Here a jump to 0x13411B0 is made, which is the location of the filtering code.

[Image: syscall2, the bytes at the WoW64 address after the hook is installed]

This filtering code is just an implementation of what was described above: it performs a check of the syscall number against the desired target and acts accordingly.

void __declspec(naked) Wow64Trampoline()
{
    __asm
    {
        cmp eax, SYSCALL_INTERCEPT
        jz NtWriteVirtualMemoryHook
        jmp lpJmpRealloc
    }
}

In the case that the syscall number is not the one to filter, execution continues as normal by the jump to lpJmpRealloc, which is just an executable region of memory that contains the bytes of the original inter-segment jump (the jmp 0033:77BE1D84 instruction above). If the syscall number does match, then our hooking function, NtWriteVirtualMemoryHook, gets invoked. The example code just prints a message denoting that the call happened then continues on to performing the syscall. There is also a pushad/popad to properly preserve the stack in the event that another syscall is made from the syscall hook.

void __declspec(naked) NtWriteVirtualMemoryHook()
{
    __asm pushad
 
    fprintf(stderr, "NtWriteVirtualMemory called.\n");
 
    __asm
    {
        popad
        jmp lpJmpRealloc
    }
}
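The control flow of the two naked functions can be modeled in plain C++ to make the dispatch logic easier to follow. This is only a simulation under loud assumptions: a function pointer (g_gate) stands in for the shared transition code, replacing the pointer stands in for overwriting the instruction bytes, and no real syscall is made. All names below are hypothetical.

```cpp
#include <cstdint>

// Syscall number the filter watches for (0x39 is the NtWriteVirtualMemory
// number quoted in this series for Windows 8.1).
constexpr std::uint32_t kSyscallIntercept = 0x39;

// Counters so the effect of the filter can be observed.
int g_forwardedCalls = 0;
int g_hookedCalls = 0;

using SyscallGate = void (*)(std::uint32_t syscallNumber);

// Stand-in for lpJmpRealloc: the relocated copy of the original transition.
void OriginalGate(std::uint32_t /*syscallNumber*/)
{
    ++g_forwardedCalls;
}

// Stand-in for NtWriteVirtualMemoryHook: do extra work, then continue on
// to the real transition so the syscall still happens.
void HookBody(std::uint32_t syscallNumber)
{
    ++g_hookedCalls;
    OriginalGate(syscallNumber);
}

// Stand-in for Wow64Trampoline: compare the number, then branch.
void FilteringGate(std::uint32_t syscallNumber)
{
    if (syscallNumber == kSyscallIntercept)
    {
        HookBody(syscallNumber);
    }
    else
    {
        OriginalGate(syscallNumber);
    }
}

// Stand-in for the transition code that every thread calls through.
SyscallGate g_gate = &OriginalGate;

// Simulates installing the hook.
void InstallFilter()
{
    g_gate = &FilteringGate;
}

// Simulates the x86 stub: the syscall number goes in, the gate is called.
void IssueSyscall(std::uint32_t syscallNumber)
{
    g_gate(syscallNumber);
}
```

After InstallFilter, a call to IssueSyscall(0x39) increments both counters, while any other number only increments g_forwardedCalls. That matches the property the real hook relies on: every syscall still reaches the original transition code, and only the filtered one takes the detour first.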

And that is all that there is to it as far as the example code goes. It can quickly be verified by seeing if the hook gets invoked for a call to NtWriteVirtualMemory.

    //Test syscall
    int i = 0x123;
    int j = 0x678;
    SIZE_T ulBytesWritten = 0;
 
    fprintf(stderr, "i = 0x%X\n", i);
    NtWriteVirtualMemory(GetCurrentProcess(), &i, &j, sizeof(int), &ulBytesWritten);
    fprintf(stderr, "i = 0x%X\n", i);

The output below shows everything working successfully:

i = 0x123
NtWriteVirtualMemory called.
i = 0x678

Get the Code

The Visual Studio 2015 RC project for this example can be found here. The source code is viewable on Github here.


June 5, 2015

Syscall Hooking Under WoW64: Introduction (1/2)

Filed under: General x86,General x86-64,Programming — admin @ 11:04 AM

These next two posts will cover the topic of hooking Windows system calls under WoW64, the Windows subsystem responsible for running x86 code on a 64-bit version of Windows. This post will be a brief introduction to system calls under WoW64 and how the transition between x86 and x64 occurs. It will lay the groundwork for the actual task (in part 2) of inserting a hook into this process to intercept desired system calls.

Anatomy of a System Call

On a native 64-bit version of Windows, there are two ways that system calls get made: natively or through WoW64. The native case is rather straightforward. For example, take the following code snippet, compiled as x64, performing an NtWriteVirtualMemory syscall:

using pNtWriteVirtualMemory = NTSTATUS (NTAPI *)(HANDLE ProcessHandle,
    PVOID BaseAddress, PVOID Buffer, ULONG NumberOfBytesToWrite,
    PULONG NumberOfBytesWritten);
 
pNtWriteVirtualMemory NtWriteVirtualMemory = nullptr;
 
int main(int argc, char *argv[])
{
    HMODULE hModule = GetModuleHandle(L"ntdll.dll");
    NtWriteVirtualMemory = (pNtWriteVirtualMemory)GetProcAddress(hModule,
        "NtWriteVirtualMemory");
 
    int i = 0x321;
    int j = 0x123;
 
    fprintf(stderr, "j = %X\n", j);
 
    ULONG numBytesWritten = 0;
    NTSTATUS success = NtWriteVirtualMemory(GetCurrentProcess(), &j, &i,
        sizeof(int), &numBytesWritten);
 
    fprintf(stderr, "j = %X\n", j);
 
    return 0;
}

The disassembly for the call looks like the following:

00007FF6C49D193A FF 15 C0 C6 00 00    call        qword ptr [__imp_GetCurrentProcess (07FF6C49DE000h)]  
00007FF6C49D1940 48 8D 4D 64          lea         rcx,[numBytesWritten]  
00007FF6C49D1944 48 89 4C 24 20       mov         qword ptr [rsp+20h],rcx  
00007FF6C49D1949 41 B9 04 00 00 00    mov         r9d,4  
00007FF6C49D194F 4C 8D 45 24          lea         r8,[i]  
00007FF6C49D1953 48 8D 55 44          lea         rdx,[j]  
00007FF6C49D1957 48 8B C8             mov         rcx,rax  
00007FF6C49D195A FF 15 00 A8 00 00    call        qword ptr [NtWriteVirtualMemory (07FF6C49DC160h)]  

You can see the first four arguments being put into the RCX, RDX, R8, and R9 registers, per the standard calling convention for x64 on Windows. The fifth parameter is put onto the stack. When the call is made to NtWriteVirtualMemory, the following code is executed:

00007FFC88BB1560 4C 8B D1             mov         r10,rcx  
00007FFC88BB1563 B8 39 00 00 00       mov         eax,39h  
00007FFC88BB1568 0F 05                syscall  
00007FFC88BB156A C3                   ret  

Here RCX (the first parameter) is moved into R10. Then 0x39 is moved into EAX. Afterwards, the syscall instruction is executed, which handles the switch to kernel mode, where the actual call is carried out. The magic value of 0x39 is the syscall number corresponding to NtWriteVirtualMemory on x64 Windows 8.1. A very useful table of syscalls for x86 and x64 Windows versions can be found here. When the syscall finishes, it will return execution to the RET instruction, which subsequently returns execution back to the next instruction from the original call site.

x86 to x64 Transition

For an x86 process running under WoW64 on a 64-bit system, things change a bit. Looking at the x86 disassembly for the call initially shows nothing out of the ordinary:

00B51117 8D 45 EC             lea         eax,[numBytesWritten]  
00B5111A 50                   push        eax  
00B5111B 6A 04                push        4  
00B5111D 8D 4D F4             lea         ecx,[i]  
00B51120 51                   push        ecx  
00B51121 8D 55 F0             lea         edx,[j]  
00B51124 52                   push        edx  
00B51125 FF 15 00 30 B5 00    call        dword ptr ds:[0B53000h] ; GetCurrentProcess
00B5112B 50                   push        eax  
00B5112C FF 15 18 40 B5 00    call        dword ptr ds:[0B54018h] ; NtWriteVirtualMemory

However, looking at the call to NtWriteVirtualMemory shows the following:

77ECC810 B8 39 00 00 00       mov         eax,39h  
77ECC815 64 FF 15 C0 00 00 00 call        dword ptr fs:[0C0h]  
77ECC81C C2 14 00             ret         14h  
77ECC81F 90                   nop  

Here the syscall number is moved into EAX as in the x64 example. Then a call to a special area of memory, the FS segment, is made. This segment contains thread local data and its address is unique per thread. Despite the segment base being unique per thread, the address contained at FS:[0xC0] will always be the same. This can quickly be verified with some test code showing the addresses of the FS segments and the contents of FS:[0xC0] across different threads.

DWORD WINAPI ThreadEntry(LPVOID lpParameter)
{
    WaitForSingleObject((HANDLE)lpParameter, INFINITE);
 
    return 0;
}
 
int main(int argc, char *argv[])
{
    HANDLE hEvent = CreateEvent(nullptr, FALSE, FALSE, L"Useless event");
 
    HANDLE hThread1 = CreateThread(nullptr, 0, &ThreadEntry, hEvent, 0, nullptr);
    HANDLE hThread2 = CreateThread(nullptr, 0, &ThreadEntry, hEvent, 0, nullptr);
    HANDLE hThread3 = CreateThread(nullptr, 0, &ThreadEntry, hEvent, 0, nullptr);
 
    CONTEXT ctxThread1 = { CONTEXT_ALL };
    (void)GetThreadContext(hThread1, &ctxThread1);
 
    CONTEXT ctxThread2 = { CONTEXT_ALL };
    (void)GetThreadContext(hThread2, &ctxThread2);
 
    CONTEXT ctxThread3 = { CONTEXT_ALL };
    (void)GetThreadContext(hThread3, &ctxThread3);
 
    LDT_ENTRY ldtThread1 = { 0 };
    LDT_ENTRY ldtThread2 = { 0 };
    LDT_ENTRY ldtThread3 = { 0 };
 
    (void)GetThreadSelectorEntry(hThread1, ctxThread1.SegFs, &ldtThread1);
    (void)GetThreadSelectorEntry(hThread2, ctxThread2.SegFs, &ldtThread2);
    (void)GetThreadSelectorEntry(hThread3, ctxThread3.SegFs, &ldtThread3);
 
    NT_TIB *pTibMain = (NT_TIB *)__readfsdword(0x18);
 
    DWORD_PTR dwFSBase1 = (ldtThread1.HighWord.Bits.BaseHi << 24) |
        (ldtThread1.HighWord.Bits.BaseMid << 16) |
        ldtThread1.BaseLow;
 
    DWORD_PTR dwFSBase2 = (ldtThread2.HighWord.Bits.BaseHi << 24) |
        (ldtThread2.HighWord.Bits.BaseMid << 16) |
        ldtThread2.BaseLow;
 
    DWORD_PTR dwFSBase3 = (ldtThread3.HighWord.Bits.BaseHi << 24) |
        (ldtThread3.HighWord.Bits.BaseMid << 16) |
        ldtThread3.BaseLow;
 
    fprintf(stderr, "Thread 1 FS Segment base address: %X\n"
        "Thread 2 FS Segment base address : %X\n"
        "Thread 3 FS Segment base address : %X\n",
        dwFSBase1, dwFSBase2, dwFSBase3);
 
    DWORD_PTR dwWOW64Address1 = *(DWORD_PTR *)((unsigned char *)dwFSBase1 + 0xC0);
    DWORD_PTR dwWOW64Address2 = *(DWORD_PTR *)((unsigned char *)dwFSBase2 + 0xC0);
    DWORD_PTR dwWOW64Address3 = *(DWORD_PTR *)((unsigned char *)dwFSBase3 + 0xC0);
 
    fprintf(stderr, "Thread 1 FS:[0xC0] : %X\n"
        "Thread 2 FS:[0xC0] : %X\n"
        "Thread 3 FS:[0xC0] : %X\n",
        dwWOW64Address1, dwWOW64Address2, dwWOW64Address3);
 
    return 0;
}

The output of the above code is

Thread 1 FS Segment base address: 7FDBB000
Thread 2 FS Segment base address : 7FDB8000
Thread 3 FS Segment base address : 7FC8F000
Thread 1 FS:[0xC0] : 77E81218
Thread 2 FS:[0xC0] : 77E81218
Thread 3 FS:[0xC0] : 77E81218

which verifies the original claim.
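The base address arithmetic in the test program is worth a closer look: a segment descriptor stores its 32-bit base split across a 16-bit low field and two 8-bit fields. Here is a small portable sketch of that reassembly. DescriptorBaseFields is a hypothetical stand-in for the three LDT_ENTRY members used above, so it compiles without the Windows headers.

```cpp
#include <cstdint>

// The three pieces of a segment base as stored in a descriptor.
struct DescriptorBaseFields
{
    std::uint16_t baseLow;   // bits 0-15 of the base
    std::uint8_t baseMid;    // bits 16-23 of the base
    std::uint8_t baseHi;     // bits 24-31 of the base
};

// Reassembles the 32-bit segment base from its three fields.
constexpr std::uint32_t SegmentBase(const DescriptorBaseFields &fields)
{
    return (static_cast<std::uint32_t>(fields.baseHi) << 24) |
        (static_cast<std::uint32_t>(fields.baseMid) << 16) |
        static_cast<std::uint32_t>(fields.baseLow);
}

// Linear address of FS:[0xC0] for a thread whose FS base is 'fsBase'.
constexpr std::uint32_t Wow64PointerAddress(std::uint32_t fsBase)
{
    return fsBase + 0xC0;
}
```

With baseLow = 0xB000, baseMid = 0xDB, and baseHi = 0x7F, the result is 0x7FDBB000, the base printed for Thread 1 above. Adding 0xC0 gives 0x7FDBB0C0, the location that the test program reads to obtain 0x77E81218.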

Moving back to the original disassembly, stepping into this CALL instruction leads to the following:

77E81216 00 00                add         byte ptr [eax],al  
77E81218 EA 84 1D E8 77 33 00 jmp         0033:77E81D84  
77E8121F 00 00                add         byte ptr [eax],al  

Ignore some of the nonsense bytes; this is a result of the disassembly listing not being quite correct. The important instruction is the jump to 0x77E81D84. This is an inter-segment jump. Here the 0033 means a jump to 64-bit mode. The value prior to the jump was 0023, which corresponds to x86. An interesting thing is that the (Visual Studio) debugger is unable to step into this address (x64 WinDbg can). Taking a look at where this address resides in memory reveals something interesting:

[Image: wow64, the module containing the jump target]

It resides in wow64cpu.dll, which is one of the three core WoW64 DLLs that get loaded into every process running under WoW64. That particular DLL is the one responsible for handling the transition from x86 to x64. Interestingly enough, the actual DLL is a 64-bit DLL that is loaded into a 32-bit process. The instructions at 0x77E81D84, traced through to the syscall, are the following:

wow64cpu!CpupReturnFromSimulatedCode
00000000`77e81d84 4987e6          xchg    rsp,r14
00000000`77e81d87 458b06          mov     r8d,dword ptr [r14] ds:00000000`008df5e0=77ecc78c
00000000`77e81d8a 4983c604        add     r14,4
00000000`77e81d8e 4589453c        mov     dword ptr [r13+3Ch],r8d ds:00000000`007dfdec=77eea9b0
00000000`77e81d92 45897548        mov     dword ptr [r13+48h],r14d ds:00000000`007dfdf8=008df670
00000000`77e81d96 4d8d5e04        lea     r11,[r14+4]
00000000`77e81d9a 41897d20        mov     dword ptr [r13+20h],edi ds:00000000`007dfdd0=00000000
00000000`77e81d9e 41897524        mov     dword ptr [r13+24h],esi ds:00000000`007dfdd4=00000000
00000000`77e81da2 41895d28        mov     dword ptr [r13+28h],ebx ds:00000000`007dfdd8=7f676000
00000000`77e81da6 41896d38        mov     dword ptr [r13+38h],ebp ds:00000000`007dfde8=00000000
00000000`77e81daa 9c              pushfq
00000000`77e81dab 4158            pop     r8
00000000`77e81dad 45894544        mov     dword ptr [r13+44h],r8d ds:00000000`007dfdf4=00000000
wow64cpu!TurboDispatchJumpAddressStart
00000000`77e81db1 8bc8            mov     ecx,eax
00000000`77e81db3 c1e910          shr     ecx,10h
00000000`77e81db6 41ff24cf        jmp     qword ptr [r15+rcx*8] ds:00000000`77e81b38=0000000077e822d0
wow64cpu!TurboDispatchJumpAddressEnd+0x516
00000000`77e822d0 418b5304        mov     edx,dword ptr [r11+4] ds:00000000`008df5ec=00000000
00000000`77e822d4 458b13          mov     r10d,dword ptr [r11] ds:00000000`008df5e8=008df5fc
00000000`77e822d7 eb3a            jmp     wow64cpu!TurboDispatchJumpAddressEnd+0x559 (00000000`77e82313)
wow64cpu!TurboDispatchJumpAddressEnd+0x559
00000000`77e82313 e838000000      call    wow64cpu!TurboDispatchJumpAddressEnd+0x596 (00000000`77e82350)
wow64cpu!CpupSyscallStub
00000000`77e82350 0f05            syscall
00000000`77e82352 c3              ret

The entry point to the 64-bit code resides at the symbol CpupReturnFromSimulatedCode. This code is responsible for setting up the proper parameters and stack to perform the syscall and then making it. For a fuller explanation of everything involved here, see this post.
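The far jump seen earlier (bytes EA 84 1D E8 77 33 00) can be picked apart by hand: opcode EA is followed by a 32-bit offset and then a 16-bit segment selector, both little-endian. Below is a minimal decoding sketch for that one encoding in 32-bit code. FarJump and DecodeFarJump are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Target of a direct far jump: a segment selector and an offset within it.
struct FarJump
{
    std::uint32_t offset;
    std::uint16_t selector;
};

// Decodes "EA offset32 selector16". Returns false if the buffer is too
// short or does not start with the EA opcode.
bool DecodeFarJump(const std::uint8_t *bytes, std::size_t size, FarJump &out)
{
    if (bytes == nullptr || size < 7 || bytes[0] != 0xEA)
    {
        return false;
    }

    // Bytes 1-4: offset, least significant byte first.
    out.offset = static_cast<std::uint32_t>(bytes[1]) |
        (static_cast<std::uint32_t>(bytes[2]) << 8) |
        (static_cast<std::uint32_t>(bytes[3]) << 16) |
        (static_cast<std::uint32_t>(bytes[4]) << 24);

    // Bytes 5-6: segment selector, least significant byte first.
    out.selector = static_cast<std::uint16_t>(bytes[5] | (bytes[6] << 8));

    return true;
}
```

Decoding the seven bytes above gives offset 0x77E81D84 and selector 0x0033, matching the jmp 0033:77E81D84 in the listing.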

That’s all that is involved as far as performing syscalls under WoW64. This article hopefully elucidated a few things about how x86 code can perform syscalls on an x64 system. With this baseline, the next article will cover what is involved in intercepting these syscalls.

Get the Code

The Visual Studio 2015 RC project for this example can be found here. The source code is viewable on Github here.


May 17, 2015

Nop Hopping: Hiding Functionality in Alignment

Filed under: General x86,General x86-64,Programming — admin @ 1:02 PM

This post will cover the topic of hiding code functionality by taking advantage of compiler alignment. In order to maximize speed of data access, optimizers can try to align loops, function entries, jump destinations, etc., on a native word boundary. One example of this in actual executable code is a large series of NOP bytes after the end of a function. For example, the following was taken from an x64 library:

...
00007FFC676D7207 48 23 4C 24 38       and         rcx,qword ptr [rsp+38h]  
00007FFC676D720C 48 89 4C 24 38       mov         qword ptr [rsp+38h],rcx  
00007FFC676D7211 0F 85 91 A3 02 00    jne         00007FFC677015A8  
00007FFC676D7217 48 8B 5C 24 30       mov         rbx,qword ptr [rsp+30h]  
00007FFC676D721C 48 83 C4 20          add         rsp,20h  
00007FFC676D7220 5F                   pop         rdi  
00007FFC676D7221 C3                   ret  
00007FFC676D7222 90                   nop  
00007FFC676D7223 90                   nop  
00007FFC676D7224 90                   nop  
00007FFC676D7225 90                   nop  
00007FFC676D7226 90                   nop  
...

The NOPs are shown after the RET instruction. The size of these NOP blocks, if present, varies throughout programs. During my experimentation, I found that a majority of them (>95%) were 20 bytes or less. This leaves plenty of room for hiding functionality. One advantage of doing this is that the pages that these NOP blocks are on are already allocated, and they are executable since they’re right next to actual executable code. This can enhance stealth since no extra allocations need to be made inside the program. Additionally, since these blocks are all over the program, it is possible to randomly select blocks to write your code in, which makes things such as signature scanning harder. Overall, it’s a rather nice technique and one that I used to use to bypass anti-cheat detection systems.
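As a rough illustration of what "finding NOP blocks" means at the byte level, the sketch below scans a buffer for runs of 0x90 bytes of at least a given length. It makes several assumptions: the code has already been copied into a std::vector, only the single-byte NOP is considered (compilers may also pad with other bytes), and FindNopRuns is a hypothetical name rather than a function from the project.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns (offset, length) for every run of 0x90 bytes that is at least
// 'minLength' bytes long.
std::vector<std::pair<std::size_t, std::size_t>> FindNopRuns(
    const std::vector<std::uint8_t> &code, std::size_t minLength)
{
    std::vector<std::pair<std::size_t, std::size_t>> runs;

    std::size_t i = 0;
    while (i < code.size())
    {
        if (code[i] != 0x90)
        {
            ++i;
            continue;
        }

        // Found the start of a run; advance to its end.
        const std::size_t start = i;
        while (i < code.size() && code[i] == 0x90)
        {
            ++i;
        }

        if (i - start >= minLength)
        {
            runs.emplace_back(start, i - start);
        }
    }

    return runs;
}
```

A raw byte scan like this can also match 0x90 bytes that are really part of another instruction's operand, so a minimum length and a check of the preceding instruction (for example a RET, as in the listing above) help cut down on false positives.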

Finding the Regions

These NOP blocks are all over the place; they’re inside the main executable, and in each loaded library. This gives a very large search space. To begin, it is easiest to find and store the base address of the image and every library and its size. These will be the starting points for searching for these NOP blocks. This is done in a straightforward manner with the help of the CreateToolhelp32Snapshot API along with Module32First/Module32Next. These will return the base address of the image and its libraries as well as their sizes in memory.

const ModuleMap GetModules(const DWORD dwProcessId)
{
    ModuleMap mapModules;
 
    const HANDLE hToolhelp32 = CreateToolhelp32Snapshot(TH32CS_SNAPMODULE, dwProcessId);
    MODULEENTRY32 moduleEntry = { 0 };
    moduleEntry.dwSize = sizeof(MODULEENTRY32);
 
    const BOOL bSuccess = Module32First(hToolhelp32, &moduleEntry);
    if (!bSuccess)
    {
        fprintf(stderr, "Could not enumerate modules. Error = %X.\n",
            GetLastError());
        exit(-1);
    }
 
    do
    {
        const DWORD_PTR dwBase = (DWORD_PTR)moduleEntry.modBaseAddr;
        const DWORD_PTR dwEnd = dwBase + moduleEntry.modBaseSize;
 
        mapModules[std::wstring(moduleEntry.szModule)] = std::make_pair(dwBase, dwEnd);
 
    } while (Module32Next(hToolhelp32, &moduleEntry));
 
    CloseHandle(hToolhelp32);
 
    return mapModules;
}

Now that all modules and their sizes are stored, the next step involves enumerating them for the proper pages. These will be committed pages which have executable privileges in combination with either read/write or just read. This involves nothing more than enumerating through every module and walking its address range with VirtualQueryEx, which returns regions of pages that share the same permissions. The permission flags of each region are then masked against the ones that are desired.

const ExecutableMap GetExecutableRegions(const HANDLE hProcess, const ModuleMap &mapModules)
{
    ExecutableMap mapExecutableRegions;
    ExecutableRegionsList lstExecutableRegions;
 
    for (auto &module : mapModules)
    {
        MEMORY_BASIC_INFORMATION memBasicInfo = { 0 };
        DWORD_PTR dwBaseAddress = module.second.first;
        const DWORD_PTR dwEndAddress = module.second.second;
 
        while (dwBaseAddress < dwEndAddress)
        {
            const SIZE_T ulReadSize = VirtualQueryEx(hProcess, (LPCVOID)dwBaseAddress, &memBasicInfo, sizeof(MEMORY_BASIC_INFORMATION));
            if (ulReadSize == 0)
            {
                //Stop on failure; otherwise dwBaseAddress never advances and the loop spins forever
                break;
            }
 
            if ((memBasicInfo.State & MEM_COMMIT) &&
                ((memBasicInfo.Protect & PAGE_EXECUTE_READWRITE) || (memBasicInfo.Protect & PAGE_EXECUTE_READ)))
            {
                //BaseAddress is the start of this region; AllocationBase is the base of the whole allocation
                const DWORD_PTR dwRegionStart = (DWORD_PTR)memBasicInfo.BaseAddress;
                const DWORD_PTR dwRegionEnd = dwRegionStart + (DWORD_PTR)memBasicInfo.RegionSize;
                lstExecutableRegions.emplace_back(std::make_pair(dwRegionStart, dwRegionEnd));
            }
            dwBaseAddress += memBasicInfo.RegionSize;
        }
 
        if (lstExecutableRegions.size() > 0)
        {
            mapExecutableRegions[module.first] = lstExecutableRegions;
            lstExecutableRegions.clear();
        }
    }
 
    if (mapExecutableRegions.size() == 0)
    {
        fprintf(stderr, "Could not find any executable regions.\n");
        exit(-1);
    }
 
    return mapExecutableRegions;
}

This filters the original module ranges down and only leaves ranges of pages that are committed and have the PAGE_EXECUTE_READWRITE or PAGE_EXECUTE_READ permission. These will be the ranges that are searched for NOP blocks. Now that the collection is filtered down even further, it is time to find the NOP blocks.
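The filter condition can be looked at in isolation. Below is a minimal sketch of that check, written so that it compiles without windows.h. The constant names and the IsCandidateRegion helper are hypothetical and not part of the original project; the numeric values are the ones the Windows SDK defines for MEM_COMMIT, PAGE_EXECUTE_READ and PAGE_EXECUTE_READWRITE.

```cpp
#include <cstdint>

// Values as defined in the Windows SDK headers, redefined here under
// hypothetical names so the sketch builds without <windows.h>.
constexpr std::uint32_t kMemCommit = 0x1000;          // MEM_COMMIT
constexpr std::uint32_t kPageExecuteRead = 0x20;      // PAGE_EXECUTE_READ
constexpr std::uint32_t kPageExecuteReadWrite = 0x40; // PAGE_EXECUTE_READWRITE

// Hypothetical helper: true when a region is committed and is either
// execute/read or execute/read/write, mirroring the check in the loop above.
bool IsCandidateRegion(std::uint32_t state, std::uint32_t protect)
{
    const bool bCommitted = (state & kMemCommit) != 0;
    const bool bExecutable = (protect & (kPageExecuteRead | kPageExecuteReadWrite)) != 0;
    return bCommitted && bExecutable;
}
```

As in the original condition, regions marked PAGE_EXECUTE (0x10) or PAGE_EXECUTE_WRITECOPY (0x80) are not matched by this check.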

Finding NOP Blocks

Finding the NOP blocks is achieved in multiple steps. Since this code runs in one process and writes into another, the first step involves copying over the bytes from the target process. The bytes copied over will be the executable ranges of each module found in the previous section. Each range will then be scanned for NOP bytes, and the resulting NOP ranges stored. The code for this looks like the following:

const NopRangeList FindNopRanges(const HANDLE hProcess, const ExecutableMap &executableRegions, const size_t ulSize)
{
    NopRangeList nopRangeList;
 
    for (auto &executableRegion : executableRegions)
    {
        for (auto &executableAddressRange : executableRegion.second)
        {
            const DWORD_PTR dwLowerAddress = executableAddressRange.first;
            const DWORD_PTR dwHigherAddress = executableAddressRange.second;
            const DWORD_PTR dwRangeSize = dwHigherAddress - dwLowerAddress;
 
            if (dwRangeSize > ulSize)
            {
                std::unique_ptr<unsigned char[]> pLocalBytes(new unsigned char[dwRangeSize]);
                SIZE_T ulBytesRead = 0;
                const bool bSuccess = BOOLIFY(ReadProcessMemory(hProcess, (LPCVOID)dwLowerAddress,
                    pLocalBytes.get(), dwRangeSize, &ulBytesRead));
                if (bSuccess && ulBytesRead == dwRangeSize)
                {
                    const DWORD_PTR dwOffset = dwLowerAddress - (DWORD_PTR)pLocalBytes.get();
 
                    NopRange nopRange = FindNops(pLocalBytes.get(), dwRangeSize, dwOffset);
                    if (nopRange.size() > 0)
                    {
                        nopRangeList.emplace_back(nopRange);
                    }
                }
                else
                {
                    fprintf(stderr, "Could not read from 0x%X. Error = %X\n",
                        executableAddressRange.first, GetLastError());
                }
            }
        }
    }
 
    return nopRangeList;
}

Here the bytes are copied into a local array with ReadProcessMemory. The offset between the address of this local array and the address that was read is then calculated. This is needed because the instructions are disassembled from the local copy, so any address reported for a NOP (0x90) refers to the local array and not to the target process. Adding the offset back later translates it into the correct address in the target process. At the end of this loop, nopRangeList will contain the NOP ranges for every module in the executable, as shown below:

[Image: nop1]

The topmost index will be a module and the inner index will hold an address range of NOPs within that module. For example, in the image above, nopRangeList[4][0] = [0x7FFC6701185D – 0x7FFC670118FF] is a range of NOPs in the target process found within kernel32.dll. This function also calls FindNops to do the work; the definition for FindNops is below:

const NopRange FindNops(const unsigned char * const pBytes, const size_t ulSize, const DWORD_PTR dwOffset)
{
    //Find all NOPs in the code
    const InstructionList nopList = GetNopList(pBytes, ulSize, dwOffset);
 
    //Merge continuous NOPs into an address range
    NopRange nopListMerged;
    if (nopList.size() > 1)
    {
        auto firstElem = nopList.begin();
        auto nextElem = ++firstElem;
        --firstElem;
        nopListMerged.push_back(std::make_pair(*firstElem, *firstElem));
 
        while (nextElem != nopList.end())
        {
            if (*nextElem == ((*firstElem) + 1))
            {
                auto elem = nopListMerged.back();
                const DWORD_PTR dwRangeStart = elem.first;
                const DWORD_PTR dwRangeEnd = *nextElem;
                nopListMerged.pop_back();
                nopListMerged.push_back(std::make_pair(dwRangeStart, dwRangeEnd));
            }
            else
            {
                nopListMerged.push_back(std::make_pair(*nextElem, *nextElem));
            }
 
            ++firstElem;
            ++nextElem;
        }
    }
 
    //Toss out address ranges that are too small
    NopRange nopListTrimmed;
    const int iMinNops = 20;
    for (auto &nopRange : nopListMerged)
    {
        const DWORD_PTR dwRangeStart = nopRange.first;
        const DWORD_PTR dwRangeEnd = nopRange.second;
 
        if ((dwRangeEnd - dwRangeStart) > iMinNops)
        {
            nopListTrimmed.push_back(std::make_pair(dwRangeStart, dwRangeEnd));
        }
    }
 
    return nopListTrimmed;
}

This function is responsible for finding the NOP ranges via a call to GetNopList, which returns the address of every instruction that was a NOP in the given range. These returned NOPs will be unmerged, as shown below:

[Image: nop2]

Here you can see contiguous addresses (0x…1000, 0x…1001, 0x…1002, …) that contain NOPs. The next loop is responsible for merging these entries into std::pair ranges, each containing the starting address and ending address of a run. The last loop then filters this even further to only include NOP ranges that span more than 20 bytes.
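The merge and the size filter can also be sketched in a more compact form. The version below is an illustration rather than the project's code: MergeAndFilter and AddressRange are hypothetical names, addresses are plain 64-bit integers, and the input is assumed to be sorted. Unlike the listing above, it measures a range as end - start + 1 and keeps ranges of at least minLength bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Inclusive range: first is the first NOP address, second is the last one.
using AddressRange = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical helper: collapses a sorted list of one-byte NOP addresses into
// inclusive ranges, then keeps only the ranges of at least minLength bytes.
std::vector<AddressRange> MergeAndFilter(const std::vector<std::uint64_t> &nops, std::size_t minLength)
{
    std::vector<AddressRange> merged;
    for (const std::uint64_t address : nops)
    {
        if (!merged.empty() && address == merged.back().second + 1)
        {
            merged.back().second = address;         // extends the current run
        }
        else
        {
            merged.emplace_back(address, address);  // starts a new run
        }
    }

    std::vector<AddressRange> kept;
    for (const AddressRange &range : merged)
    {
        const std::uint64_t length = range.second - range.first + 1;
        if (length >= minLength)
        {
            kept.push_back(range);
        }
    }
    return kept;
}
```

Extending the last range in place avoids the pop_back/push_back pair, and a list holding a single NOP address is handled without a special case.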

GetNopList is implemented with the help of the BeaEngine disassembler.

const InstructionList GetInstructionList(const unsigned char * const pBytes, const size_t ulSize, const DWORD_PTR dwOffset,
    const bool bNopsOnly = false)
{
    InstructionList instructionList;
 
    DISASM disasm = { 0 };
#ifdef _M_IX86
    //Do nothing
#elif defined(_M_AMD64)
    disasm.Archi = 64;
#else
#error "Unsupported architecture"
#endif
 
    disasm.EIP = (UIntPtr)pBytes;
    int iLength = 0;
    int iLengthTotal = 0;
    do
    {
        iLength = DisasmFnc(&disasm);
        if (iLength != UNKNOWN_OPCODE)
        {
            const DWORD_PTR dwInstructionStart = (DWORD_PTR)(disasm.EIP);
            if (bNopsOnly)
            {
                if (disasm.Instruction.Opcode == NOP)
                {
                    instructionList.push_back(dwInstructionStart + dwOffset);
                }
            }
            else
            {
                instructionList.push_back(dwInstructionStart + dwOffset);
            }
 
            iLengthTotal += iLength;
            disasm.EIP += iLength;
        }
        else
        {
            ++iLengthTotal;
            ++disasm.EIP;
        }
    } while (iLengthTotal < ulSize);
 
    return instructionList;
}
 
const InstructionList GetNopList(const unsigned char * const pBytes, const size_t ulSize, const DWORD_PTR dwOffset)
{
    return GetInstructionList(pBytes, ulSize, dwOffset, true);
}
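The local-to-remote address translation described earlier is easy to get wrong, so here is a small sketch of the idea on its own. It is a simplification and makes some assumptions: FindNopBytes is a hypothetical helper, and it compares raw bytes against 0x90 instead of disassembling, so a 0x90 that is really part of another instruction's operand would also be reported. The point is only to show how a position in the local copy maps back to an address in the target process.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the original project: reports the address
// in the target process of every 0x90 byte found in a local copy of its memory.
// remoteBase is the address in the target that the bytes were read from.
std::vector<std::uint64_t> FindNopBytes(const unsigned char *pLocal, std::size_t size, std::uint64_t remoteBase)
{
    std::vector<std::uint64_t> remoteAddresses;
    for (std::size_t i = 0; i < size; ++i)
    {
        if (pLocal[i] == 0x90)
        {
            // i is the position inside the local copy; adding the remote base
            // gives the address of the same byte in the target process
            remoteAddresses.push_back(remoteBase + i);
        }
    }
    return remoteAddresses;
}
```

For a buffer read from 0x7FFC67011000 whose second byte is 0x90, the reported address would be 0x7FFC67011001, regardless of where the local copy happens to live.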

Putting Everything Together

Now the collection has been filtered down even further to only the desirable NOP ranges (those >20 bytes). These are the ranges that our instructions will be written into. The algorithm for doing this is as follows:

  1. Select a module at random
  2. Select a NOP range from that module to write into at random that hasn’t been chosen already
  3. Write an instruction to the region
  4. Write an unconditional jump to the next NOP range, which will contain the next instruction and an unconditional jump.
  5. Continue doing steps 3-4 while there are instructions left to write

The code for steps 1-2 is shown below:

InstructionList SelectRegions(const HANDLE hProcess, const NopRangeList &nopRangeList, InstructionList &writeInstructions)
{
    InstructionList writtenList;
 
    auto firstElem = writeInstructions.begin();
    auto nextElem = ++firstElem;
    --firstElem;
    while(nextElem != writeInstructions.end())
    {
        bool bContinueSearching = true;
        do
        {
            const size_t ulCurrentIndexModule = std::rand() % nopRangeList.size();
            const size_t ulCurrentIndexAddressRange = std::rand() % nopRangeList[ulCurrentIndexModule].size();
 
            const DWORD_PTR dwBaseWriteAddress = nopRangeList[ulCurrentIndexModule][ulCurrentIndexAddressRange].first;
            if(std::find(writtenList.begin(), writtenList.end(), dwBaseWriteAddress) == writtenList.end())
            {
                writtenList.push_back(dwBaseWriteAddress);
                bContinueSearching = false;
            }
        } while (bContinueSearching);
 
        ++firstElem;
        ++nextElem;
    }
 
    return writtenList;
}

Here writtenList will contain the addresses in the target process to write instructions to.

[Image: nop3]

The rest of the algorithm involves writing in an instruction and a jump for each instruction that should be written. This is implemented in the WriteJumps function shown below:

const bool WriteJumps(const HANDLE hProcess, const InstructionList &writeInstructions, const InstructionList &selectedRegions)
{
#ifdef _M_IX86
#elif defined (_M_AMD64)
    unsigned char jmpBytes[] =
    {
        0x48, 0xB8, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, /*mov rax, 0xBBBBBBBBBBBBBBBB*/
        0xFF, 0xE0                                                  /*jmp rax*/
    };
#else
#error "Unsupported architecture"
#endif
 
    auto firstElem = selectedRegions.begin();
    auto nextElem = ++firstElem;
    --firstElem;
 
    int i = 0;
    while (nextElem != selectedRegions.end())
    {
        const DWORD_PTR dwInstructionSize = writeInstructions[i + 1] - writeInstructions[i];
        //Both the instruction and the jump that follows it are written, so cover both
        const SIZE_T ulPatchSize = dwInstructionSize + sizeof(jmpBytes);
        DWORD dwOldProtect = 0;
        bool bSuccess = BOOLIFY(VirtualProtectEx(hProcess, (LPVOID)*firstElem, ulPatchSize, PAGE_EXECUTE_READWRITE, &dwOldProtect));
        if (bSuccess)
        {
            SIZE_T ulBytesWritten = 0;
            bSuccess = BOOLIFY(WriteProcessMemory(hProcess, (LPVOID)*firstElem, (LPCVOID)writeInstructions[i++], dwInstructionSize,
                &ulBytesWritten));
 
            DWORD_PTR dwNextAddress = *nextElem;
            memcpy(&jmpBytes[2], &dwNextAddress, sizeof(DWORD_PTR));
 
            bSuccess = BOOLIFY(WriteProcessMemory(hProcess, (LPVOID)(*firstElem + dwInstructionSize), jmpBytes, sizeof(jmpBytes),
                &ulBytesWritten));
 
            bSuccess = BOOLIFY(VirtualProtectEx(hProcess, (LPVOID)*firstElem, ulPatchSize, dwOldProtect, &dwOldProtect));
            if (!bSuccess)
            {
                fprintf(stderr, "Could not put permissions back on address 0x%X. Error = %X\n",
                    *firstElem, GetLastError());
                return false;
            }
 
        }
        else
        {
            fprintf(stderr, "Could not change permissions on address 0x%X. Error = %X\n",
                *firstElem, GetLastError());
            return false;
        }
 
        ++firstElem;
        ++nextElem;
    }
 
    return true;
}

For each region to be written to, this function will begin by changing the page permissions to PAGE_EXECUTE_READWRITE. Then the first WriteProcessMemory call will write the first instruction to the region. Following that, it will write an unconditional jump in the form of mov rax, <address> -> jmp rax. The page permissions will be changed back to what they were and the loop continues until there are no more instructions to write. To begin execution of these bytes, a remote thread can be created with CreateRemoteThread with the base of these instructions as the entry point.
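The 12-byte jump that links one NOP range to the next can be built on its own. The following is a small sketch of that encoding; MakeJumpStub is a hypothetical helper and not part of the original project. It writes the target one byte at a time in little-endian order instead of using memcpy, so the result does not depend on the byte order of the machine building the stub.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: encodes "mov rax, target" followed by "jmp rax".
std::array<unsigned char, 12> MakeJumpStub(std::uint64_t target)
{
    std::array<unsigned char, 12> stub = {{
        0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, // mov rax, imm64 (immediate filled in below)
        0xFF, 0xE0                          // jmp rax
    }};

    // The 64-bit immediate starts at offset 2 and is stored least significant byte first
    for (std::size_t i = 0; i < 8; ++i)
    {
        stub[2 + i] = static_cast<unsigned char>((target >> (8 * i)) & 0xFF);
    }
    return stub;
}
```

Each NOP range has to hold one instruction plus these 12 bytes. The longest instruction in the example that follows is 8 bytes, so 20 bytes per range is enough there.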

An Example

Here is an example of what writing MessageBoxA(0, 0, 0, 0) into another process looks like. The code to be written looks like the following:

    HMODULE hModule = LoadLibrary(L"user32.dll");
    DWORD_PTR dwTargetAddress = (DWORD_PTR)GetProcAddress(hModule, "MessageBoxA");
 
#ifdef _M_IX86
#elif defined(_M_AMD64)
    DWORD dwHigh = (dwTargetAddress >> 32) & 0xFFFFFFFF;
    DWORD dwLow = (dwTargetAddress) & 0xFFFFFFFF;
 
    unsigned char pBytes[] =
    {
        0x45, 0x33, 0xC9,                               /*xor r9d, r9d*/
        0x45, 0x33, 0xC0,                               /*xor r8d, r8d*/
        0x33, 0xD2,                                     /*xor edx, edx*/
        0x33, 0xC9,                                     /*xor ecx, ecx*/
        0x68, 0x11, 0x11, 0x11, 0x11,                   /*push 0x11111111*/
        0xC7, 0x44, 0x24, 0x04, 0xDD, 0xCC, 0xBB, 0xAA, /*mov [rsp+4], 0AABBCCDD*/
        0xC3,                                           /*ret*/
        0xC3, 0xC3, 0xC3                                /*dummy*/
    };
 
    memcpy(&pBytes[11], &dwLow, sizeof(DWORD));
    memcpy(&pBytes[19], &dwHigh, sizeof(DWORD));

At the end of this code segment, pBytes will contain the bytes of a call to MessageBoxA(0, 0, 0, 0). The target process that I chose was an x64 process; it happened to be the 64-bit version of Dependency Walker (depends.exe) for this example. Here is what it looks like in action. The start address was 0x00007ffc67dcc396.

[Image: nop4]

The first instruction was written with a jump to the next NOP range at 0x7FFC6701185D. Then, at 0x7FFC6701185D, the next instruction is written:

[Image: nop5]

This continues on until the call.

[Image: nop6]

[Image: nop7]

[Image: nop8]

[Image: nop9]

[Image: nop10]

Eventually, when the remote thread runs, the following should appear:

[Image: nop11]

Closing the MessageBox will return execution to normal.
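The tail of pBytes in the example transfers control to MessageBoxA with a push followed by a ret. In 64-bit mode push only takes a 32-bit immediate, which is sign-extended to 64 bits on the stack, so the upper half of the address has to be patched in afterwards with a mov to [rsp+4]. A minimal sketch of just that sequence is below; MakePushRet is a hypothetical helper and not part of the original project.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: emits push imm32 / mov dword ptr [rsp+4], imm32 / ret,
// which transfers control to a full 64-bit target address.
std::vector<unsigned char> MakePushRet(std::uint64_t target)
{
    const std::uint32_t low = static_cast<std::uint32_t>(target & 0xFFFFFFFF);
    const std::uint32_t high = static_cast<std::uint32_t>(target >> 32);

    std::vector<unsigned char> code = {
        0x68, 0, 0, 0, 0,                   // push imm32 (low half of the target)
        0xC7, 0x44, 0x24, 0x04, 0, 0, 0, 0, // mov dword ptr [rsp+4], imm32 (high half)
        0xC3                                // ret
    };

    // Immediates are stored least significant byte first
    for (std::size_t i = 0; i < 4; ++i)
    {
        code[1 + i] = static_cast<unsigned char>((low >> (8 * i)) & 0xFF);
        code[9 + i] = static_cast<unsigned char>((high >> (8 * i)) & 0xFF);
    }
    return code;
}
```

The immediates sit at offsets 1 and 9 here. In the example's pBytes the same sequence starts at offset 10, which is why the two memcpy calls patch offsets 11 and 19.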

Issues

The example code works on x64 only, but can be very easily ported to work on x86. This technique also doesn’t seem to work universally on all executables. For example, trying this on a 64-bit Notepad instance will crash it with the following error: “RangeChecks instrumentation code detected an out of range array access.” This is something that I am currently investigating and hope to update soon. Edit: As a commenter pointed out (and I have confirmed), this is caused by Control Flow Guard being used for executables on Windows 8.1 and higher. The example code works without issues for x64 on Windows 7.

Get the code

The Visual Studio 2015 RC project for this example can be found here. The source code is viewable on GitHub here.
