Weaponizing Windows Thread Pool APIs: Proxying DLL Loads Using I/O Completion Callbacks

In today’s post, we are going to cover proxying DLL loads using the Windows thread pool API with C++ and assembly. This specific example uses an I/O completion callback and is a companion article to my GitHub repository here. Before we jump in, let me introduce some background information.

Background

Chetan Nayak has covered this topic in depth here, and his work serves as the inspiration behind this research. To get to the core of what this research is really about, imagine a situation where we have a payload in a position-independent format that will ultimately load DLLs. When we load shellcode, we must allocate a region of memory that is RX or RWX. Better OPSEC would be to use an RX region. Nonetheless, if we load these DLLs using a standard function call such as LoadLibraryA, a call assembly instruction is executed to invoke LoadLibraryA. This places the caller's return address on the stack, and that address points into an RX region of memory (our shellcode). Some of you may be wondering, so what? What is the problem here?

The problem is that an EDR can hook DLL loading functions such as LoadLibraryA and then examine the stack to determine the return address of the caller, which points back to your RX shellcode region. It will also scan this memory region for anything malicious and match it against known payload signatures, which can result in you getting caught simply from loading specific DLLs in this way. So, you have unhooked the DLL containing the DLL loading function, and you should be fine, right? This is not the only thing you have to worry about. Remember those very annoying things called kernel callbacks? The kernel routine PsSetLoadImageNotifyRoutine(Ex) allows EDR drivers to register a callback function that gets called every time an image, such as a DLL, is loaded into a process. When the callback is triggered, the EDR can examine the stack, get the return address of the caller who loaded the DLL, and examine that memory region in the same way as I mentioned before. In comes proxy DLL loading to address these OPSEC issues.

The Windows Thread Pool API

The Windows Thread Pool API provides a high-level abstraction for managing a set of worker threads that can be used to perform various asynchronous tasks or work items. This API simplifies the management of thread resources within applications, allowing developers to focus on application logic rather than the complexities of thread management, synchronization, and concurrency control. The API is part of the Windows operating system, and it enables efficient execution of callbacks on worker threads drawn from a pool managed by the system.

Why Certain Callback Functions Exist

Work Item Callbacks: These functions are executed when a work item is processed by a worker thread. They encapsulate the task or computation that needs to be performed asynchronously, allowing the application to offload work from the main thread and improve responsiveness or throughput.
Timer Callbacks: Timer callbacks are executed when a timer expires. This is useful for periodic updates, maintenance tasks, or delayed execution of code without blocking a thread by sleeping.
I/O Completion Callbacks: These functions are executed upon the completion of asynchronous I/O operations. They allow applications to initiate I/O operations without blocking and to process the results asynchronously, which is critical for maintaining high performance in I/O-intensive applications.
Wait Callbacks: Wait callbacks are executed when a wait object (such as an event or mutex) becomes signaled. This mechanism is used to asynchronously wait for events or conditions without blocking a worker thread, facilitating synchronization among threads or reacting to external events.

You Complete Me

As I previously mentioned, there are multiple callbacks that exist. The example I am sharing with you is going to be an I/O completion callback example. To utilize an I/O completion callback with the Windows Thread Pool API effectively, you essentially need to create a thread pool I/O object via CreateThreadpoolIo, associating it with a file handle that supports overlapped I/O operations. This setup allows for the execution of asynchronous I/O operations, such as file reads or writes, in a manner that does not block the executing thread. The key to this process is the OVERLAPPED structure, which the system uses to track the progress of these operations. When initiating any asynchronous I/O, you must first call StartThreadpoolIo to prepare the thread pool for the incoming operation, ensuring the system is ready to handle the completion callback properly.

Your callback function, defined to match the thread pool API’s expectations, will be invoked upon the completion of the I/O operation. This function will receive details about the operation, including the outcome and the number of bytes transferred, allowing for any necessary post-operation processing. After the operations and their associated callbacks have been completed, cleaning up resources by closing the thread pool I/O object and any open file handles is essential for resource management and to prevent leaks.

In summary, leveraging the Windows Thread Pool API for I/O completion callbacks involves preparing for asynchronous operations with a properly configured file handle and OVERLAPPED structure, managing the lifecycle of the operation with StartThreadpoolIo and CloseThreadpoolIo, and handling the results in a predefined callback function. This approach facilitates efficient, non-blocking I/O operations within Windows applications.

Now that we know what is required, let’s talk about the implementation. As I previously mentioned, the code for this example is in a GitHub repository of mine here, but the C++ code is as follows:

    
#include <windows.h>
#include <stdio.h>

extern "C" void CALLBACK IoCompletionCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, PVOID Overlapped, ULONG IoResult, ULONG_PTR NumberOfBytesTransferred, PTP_IO Io);
void StartRead(HANDLE pipe, PTP_IO tpIo, OVERLAPPED* overlapped, char* buffer);
void CALLBACK ClientWorkCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, PTP_WORK Work);

PVOID pLoadLibraryA;
HANDLE g_WriteCompleteEvent; // Global event to signal completion of write operation

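// Layout matters: the assembly stub reads DllName at offset 0
// and pLoadLibraryA at offset 8 (64-bit build)
struct LOAD_CONTEXT {
    char* DllName;
    PVOID pLoadLibraryA;
};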

int main()
{
    HANDLE pipe;
    PTP_IO tpIo = NULL;
    OVERLAPPED overlapped = { 0 };
    char buffer[128] = { 0 };

    // Get the address of LoadLibraryA
    pLoadLibraryA = GetProcAddress(GetModuleHandleA("kernel32"), "LoadLibraryA");

    // Prepare the LOAD_CONTEXT structure
    LOAD_CONTEXT loadContext;
    loadContext.DllName = (char*)"wininet.dll";
    loadContext.pLoadLibraryA = pLoadLibraryA;

    // Create a global event to signal when the write operation is complete
    g_WriteCompleteEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    if (g_WriteCompleteEvent == NULL) {
        printf("Failed to create write complete event\n");
        return 1;
    }

    // Create a named pipe with FILE_FLAG_OVERLAPPED flag
    pipe = CreateNamedPipe(
        TEXT("\\\\.\\pipe\\MyPipe"),
        PIPE_ACCESS_DUPLEX | FILE_FLAG_OVERLAPPED,
        PIPE_TYPE_BYTE | PIPE_READMODE_BYTE | PIPE_WAIT,
        1,  // Number of instances
        4096,  // Out buffer size
        4096,  // In buffer size
        0,  // Timeout in milliseconds
        NULL); // Default security attributes

    if (pipe == INVALID_HANDLE_VALUE) {
        printf("Failed to create named pipe\n");
        CloseHandle(g_WriteCompleteEvent);
        return 1;
    }

    // Create an event for the OVERLAPPED structure
    overlapped.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    if (overlapped.hEvent == NULL) {
        printf("Failed to create event\n");
        CloseHandle(pipe);
        CloseHandle(g_WriteCompleteEvent);
        return 1;
    }

    // Associate the pipe with the thread pool
    tpIo = CreateThreadpoolIo(pipe, IoCompletionCallback, &loadContext, NULL);
    if (tpIo == NULL) {
        printf("Failed to associate pipe with thread pool\n");
        CloseHandle(overlapped.hEvent);
        CloseHandle(pipe);
        CloseHandle(g_WriteCompleteEvent);
        return 1;
    }

    // Create threadpool work item for the client code
    PTP_WORK clientWork = CreateThreadpoolWork(ClientWorkCallback, NULL, NULL);
    if (clientWork == NULL) {
        printf("Failed to create threadpool work item\n");
        CloseThreadpoolIo(tpIo);
        CloseHandle(overlapped.hEvent);
        CloseHandle(pipe);
        CloseHandle(g_WriteCompleteEvent);
        return 1;
    }

    // Submit the client work item to the thread pool
    SubmitThreadpoolWork(clientWork);

    // Wait for the client work item to signal that the write operation is complete
    WaitForSingleObject(g_WriteCompleteEvent, INFINITE);

    // Start an asynchronous read operation
    StartRead(pipe, tpIo, &overlapped, buffer);

    // Wait for the read operation to complete before touching the buffer
    WaitForSingleObject(overlapped.hEvent, INFINITE);
    printf("Pipe buffer: %s\n", buffer);

    // Wait for client work to complete
    WaitForThreadpoolWorkCallbacks(clientWork, FALSE);
    CloseThreadpoolWork(clientWork);

    // Make sure the I/O completion callback has finished before cleanup
    WaitForThreadpoolIoCallbacks(tpIo, FALSE);

    // Cleanup
    CloseThreadpoolIo(tpIo);
    CloseHandle(overlapped.hEvent);
    CloseHandle(pipe);
    CloseHandle(g_WriteCompleteEvent);

    printf("wininet.dll should be loaded! Input any key to exit...\n");
    getchar();

    return 0;
}

void StartRead(HANDLE pipe, PTP_IO tpIo, OVERLAPPED* overlapped, char* buffer)
{
    DWORD bytesRead = 0;
    StartThreadpoolIo(tpIo);
    if (!ReadFile(pipe, buffer, 128, &bytesRead, overlapped) && GetLastError() != ERROR_IO_PENDING) {
        printf("ReadFile failed, error %lu\n", GetLastError());
        CancelThreadpoolIo(tpIo);
    }
}

void CALLBACK ClientWorkCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, PTP_WORK Work)
{
    // Open the named pipe
    HANDLE pipe = CreateFile(
        TEXT("\\\\.\\pipe\\MyPipe"),
        GENERIC_WRITE,
        0,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (pipe == INVALID_HANDLE_VALUE) {
        printf("Client failed to connect to pipe\n");
        return;
    }

    const char message[] = "Hello from the pipe!";
    DWORD bytesWritten;
    if (!WriteFile(pipe, message, sizeof(message), &bytesWritten, NULL)) {
        printf("Client WriteFile failed, error: %lu\n", GetLastError());
    }
    else {
        printf("Client wrote to pipe\n");
    }

    // Signal that the write operation is complete
    SetEvent(g_WriteCompleteEvent);

    CloseHandle(pipe);
}

Rather than going through the code line by line, I am going to focus on the key parts for this article. The proof of concept also includes some assembly code, which can be seen below:

    
.CODE

; LOAD_CONTEXT* is passed in RDX
IoCompletionCallback PROC
    ; Extract the 'DllName' member (first member of the structure) to RCX
    mov rcx, [rdx]       ; Loads the DllName pointer (address of the DLL name string) into RCX

    ; Extract the 'pLoadLibraryA' member (second member of the structure) into RAX
    mov rax, [rdx + 8]   ; Assumes 64-bit pointers, so offset is 8 bytes

    ; Now RCX contains the address of the dll string,
    ; and RAX contains the address to jump to (pLoadLibraryA)

    xor rdx, rdx        ; Clear RDX

    ; Jump to LoadLibraryA address, avoiding call instruction and return address placement on stack
    jmp rax
IoCompletionCallback ENDP

END

A lot of the code is boilerplate that is necessary to perform an I/O completion callback with the Windows thread pool API using named pipes. The same can be done with other I/O objects that support overlapped I/O, including files and sockets. The parts we want to focus on for this example are the LOAD_CONTEXT structure, how it is passed to CreateThreadpoolIo, and which argument it becomes when it is passed to the callback function. Also notice that our callback function IoCompletionCallback is declared extern "C" and is defined within the assembly code. A pointer to the structure will be passed to this assembly function when the callback is triggered.

When we call CreateThreadpoolIo, we pass it the pipe handle, the callback function, and a pointer to the LOAD_CONTEXT structure. This tells the thread pool to pass a pointer to the loadContext variable as the Context argument, which is the second argument, of the I/O completion callback function. This is important information, as we are about to find out.

If you look at the LOAD_CONTEXT structure it contains two members: char* DllName and PVOID pLoadLibraryA. We have populated DllName with a pointer to the string “wininet.dll” and we have populated pLoadLibraryA with the memory address of LoadLibraryA.

Avoiding the Call Instruction

As I mentioned previously, what ultimately leads to detection when loading a DLL from a shellcode region is the return address that a call assembly instruction leaves on the stack.

The call instruction in assembly language is used to invoke a subroutine (a procedure or function within a program). The primary purpose of this instruction is to transfer control from the calling function to the subroutine, allowing for code reuse, modular programming, and organized control flow within a program. Subroutines can perform tasks and return results without the need to replicate code across various parts of a program.

When a call instruction is executed, the processor does two main things:

1. Pushes the return address onto the stack: The return address is the address of the instruction immediately following the call instruction in the calling function. This address is saved on the stack so that, once the subroutine has completed its execution, the program knows where to return to continue executing the calling function. The stack is used for this purpose because it supports the nested calling of subroutines (functions calling other functions) in a Last In, First Out (LIFO) manner. This is essential for supporting recursive function calls and for managing the return addresses of multiple nested subroutines.

2. Transfers control to the subroutine: The program counter (PC) or instruction pointer (IP) is set to the address of the subroutine being called, causing execution to jump to that location.

After the subroutine has finished executing, a ret (return) instruction is typically used to pop the return address off the stack and jump back to that address, resuming execution of the calling function just after the point where it called the subroutine.

With this information in mind, we want to avoid a call instruction for our objective. We do not want our return address on the stack when we call LoadLibraryA so that anything looking at the stack is not led to our shellcode memory region for analysis. This is where the assembly function and the LOAD_CONTEXT structure come into play.

At the time the callback function (IoCompletionCallback) gets called, a pointer to the LOAD_CONTEXT structure is passed as the second argument to the callback. Under the 64-bit Windows calling convention, the first four integer or pointer arguments are passed in the rcx, rdx, r8, and r9 registers rather than on the stack, which means the pointer to our structure is going to be in the rdx register.

The assembly function first extracts the DllName member of the structure and places it into rcx. This simulates placing the DLL string as the first argument to LoadLibraryA when we later perform a jmp. Next, the assembly function extracts the second member of the LOAD_CONTEXT structure, the memory address of LoadLibraryA, into rax. Before performing the jmp, we clear the rdx register. LoadLibraryA only takes one argument, so it should not depend on rdx, but clearing it ensures no stale Context pointer is left in the second argument register. The last thing the assembly function does is perform a jmp rax instruction with our specially prepared registers. Because jmp does not push a return address, the only return address on the stack is the one left by the thread pool code that invoked the callback, so when LoadLibraryA returns, it returns straight to that code. With the DLL name string inside rcx, we simulate a LoadLibraryA call with the given DLL name as the first argument and we load the DLL.

By loading the DLL in this way, we have essentially “proxied” the load through the callback function. Normally, what you would see in this case would be a stack frame for the I/O completion callback function when examining the stack after the DLL load because the return address would have been placed on the stack. But if we examine the stack in this case, we see no stack frame for the callback function and we have achieved a clean stack with nothing pointing to our shellcode memory region.

If we check the list of loaded modules inside of Process Hacker, we can also see that wininet.dll was properly loaded.

Conclusion

In this article, we covered a method for proxy DLL loading using an I/O completion callback function with the Windows thread pool API and C++/assembly. There are numerous Windows callbacks that can be used in a similar way. We do this for our OPSEC: removing the return address of the function that loads DLLs from the stack keeps EDRs from tracing the load back to our payload. I hope you enjoyed reading, cheers!