Evasion Engineering and Abstraction

This report provides an analysis of the architectural evolution of process injection methodologies, tracing the execution flow from standard user-mode Win32 wrapper functions down to hardware-level direct system calls. It details the structural limitations of relying on high-level APIs against modern Endpoint Detection and Response (EDR) solutions, the mechanics of Native API (NTAPI) evasion, and the low-level implementation of direct assembly system calls to achieve unmonitored execution.


Part I: Windows Internals and the Execution Boundary

The foundational objective of process injection is to allocate memory, write a payload, and manipulate an execution thread within a remote process. In the Windows NT architecture, processes are isolated. User-mode applications (Ring 3) cannot directly interact with hardware or other processes; they must politely ask the Windows Kernel (Ring 0) to perform these actions on their behalf.

This request mechanism is governed by a strict hierarchy of abstraction layers. Understanding how EDRs intercept malicious behavior requires a clinical understanding of how standard API calls traverse these layers before they ever cross the user/kernel boundary.

fig1

Fig 1: The Windows API Execution Flow and Abstraction Layers

1.1 How the Windows API Is Processed

When an operator writes an injector using the standard Microsoft documentation, they rely on the Win32 API. These functions, exported primarily by subsystem DLLs like kernel32.dll and user32.dll, are high-level wrappers designed to make programming easier. For the operations discussed here, they do little work of their own; they validate parameters and forward the request deeper into the operating system.

The execution flow of a standard memory allocation request illustrates this abstraction:

  1. The Subsystem Wrapper (kernel32.dll / kernelbase.dll): A developer calls VirtualAllocEx in their C++ or Rust code. This function resides in kernel32.dll (and subsequently kernelbase.dll). It packages the requested parameters (target process handle, size, memory protection) and forwards them to the Native API.
  2. The Native API (ntdll.dll): The call arrives at NtAllocateVirtualMemory within ntdll.dll. This is the lowest accessible layer in user-mode. ntdll.dll acts as the definitive bridge between Ring 3 and Ring 0.
  3. The Transition (The syscall instruction): Inside the ntdll.dll function stub, the system prepares to transition to the kernel. It moves a specific System Service Number (SSN), a unique identifier for the requested kernel function, into the EAX CPU register. It then executes the syscall instruction.
  4. The Kernel (ntoskrnl.exe): The syscall instruction triggers a hardware privilege-level transition. Execution jumps into Ring 0. The kernel uses the SSN stored in EAX to look up the actual implementation of the function in the System Service Descriptor Table (SSDT) and executes the highly privileged memory allocation.

Because the process and memory calls used for injection must pass through ntdll.dll before reaching the kernel, this specific DLL becomes the primary chokepoint for modern security solutions to monitor process behavior.


Part II: Process Injection Theory

To practically apply the concepts of Windows abstraction, we must examine a standard operational technique. The most foundational technique in malware development is classic Process Injection (often referred to as shellcode injection). The objective is to force a legitimate, executing process (e.g., notepad.exe or calc.exe) to execute our arbitrary malicious code, thereby masquerading our execution telemetry under a trusted process name.

2.1 The Process Injection API Call Flow

Regardless of the programming language used, whether it is C++, C#, or Rust, the theoretical pipeline for classic process injection relies on four distinct operational stages. Each of these stages requires specific permissions from the Windows kernel, meaning each stage must traverse the Win32 -> NTAPI -> Syscall chokepoint.

The standard operational flow is mapped as follows:

  1. Target Acquisition (Handle Creation): Before manipulating a remote process, the injector must request a privileged handle from the operating system.

    • Win32: The operation begins with OpenProcess, requesting PROCESS_ALL_ACCESS (or a more OpSec-safe combination like PROCESS_VM_WRITE | PROCESS_VM_OPERATION).
    • NTAPI: This call is translated in ntdll.dll to NtOpenProcess.
    • Syscall: The SSN for NtOpenProcess is loaded, and the kernel evaluates the request against the target process's security descriptor before granting the handle.
  2. Memory Allocation: With a valid handle, the injector must carve out a section of memory within the target's isolated virtual address space to house the shellcode.

    • Win32: The operator invokes VirtualAllocEx, specifying the size of the payload and requesting memory protections, typically PAGE_EXECUTE_READWRITE (RWX).
    • NTAPI: This translates to NtAllocateVirtualMemory.
    • Syscall: The kernel updates the target process's Virtual Address Descriptor (VAD) tree to reflect the newly allocated, unbacked memory region.
  3. Payload Write: The newly allocated memory is initially empty. The injector must copy the malicious payload from its own memory space into the remote process.

    • Win32: The injector calls WriteProcessMemory.
    • NTAPI: This translates to NtWriteVirtualMemory.
    • Syscall: The memory manager natively copies the buffer across the process boundaries.
  4. Execution (Thread Hijacking/Creation): The shellcode resides in the remote process, but it is inert. The injector must instruct the OS to begin executing instructions at that specific memory address.

    • Win32: The classic approach uses CreateRemoteThread to spin up a new execution thread pointing to the base address of our allocated memory.
    • NTAPI: This translates to NtCreateThreadEx.
    • Syscall: The kernel constructs the necessary thread objects (ETHREAD, KTHREAD) and schedules the new thread for execution within the target process.
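The four stages above can be condensed into a small lookup table. The following is a minimal sketch for illustration only: the `Stage` struct, the `PIPELINE` constant and the `ntapi_for` helper are hypothetical names, and the table contains nothing beyond the Win32 and NTAPI function names already listed in this section.

```rust
/// One stage of the classic injection pipeline described above.
/// (Hypothetical type, introduced only for this illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Stage {
    pub name: &'static str,
    pub win32: &'static str,
    pub ntapi: &'static str,
}

/// The Win32 -> NTAPI translation for each stage, in execution order.
pub const PIPELINE: [Stage; 4] = [
    Stage { name: "Target Acquisition", win32: "OpenProcess", ntapi: "NtOpenProcess" },
    Stage { name: "Memory Allocation", win32: "VirtualAllocEx", ntapi: "NtAllocateVirtualMemory" },
    Stage { name: "Payload Write", win32: "WriteProcessMemory", ntapi: "NtWriteVirtualMemory" },
    Stage { name: "Execution", win32: "CreateRemoteThread", ntapi: "NtCreateThreadEx" },
];

/// Return the Native API export that a given Win32 wrapper forwards to,
/// or None if the wrapper is not part of this pipeline.
pub fn ntapi_for(win32: &str) -> Option<&'static str> {
    PIPELINE.iter().find(|s| s.win32 == win32).map(|s| s.ntapi)
}
```

A table like this is equally useful when reading telemetry: an analyst who sees the four NTAPI names in this order against a foreign process handle is looking at the same pipeline from the other side.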

fig2

Fig 2: The classic Process Injection pipeline traversing the Win32, NTAPI, and Syscall boundaries.

By understanding this exact translation pipeline (VirtualAllocEx -> NtAllocateVirtualMemory -> Syscall), we can begin to see the inherent architectural weakness of relying on high-level wrappers. If an Endpoint Detection and Response (EDR) agent wants to stop this entire attack chain, it does not need to monitor the kernel, nor does it need to understand the attacker's source code. It only needs to watch the bridge.


Part III: The Win32 Implementation

To demonstrate the theoretical pipeline established in Part II, we will examine a standard user-mode injector written in Rust, utilizing the official windows crate to interface with the Win32 API.

This implementation represents the most basic, heavily documented approach to process injection. Consequently, it is also the most heavily monitored.

3.1 Anatomy of the Win32 Injector

The injector operates by sequential authorization. It must successfully request and receive permission for each stage before proceeding to the next.

1. Handle Acquisition: The program first requests a highly privileged handle to the target Process ID (PID).

// Stage 1: Requesting a privileged handle
let process_handle = OpenProcess(
    PROCESS_ALL_ACCESS, 
    false, 
    target_pid
).unwrap();

Note: Requesting PROCESS_ALL_ACCESS leaves a massive telemetry footprint. Modern clinical architectures adhere strictly to the Principle of Least Privilege, requesting only the specific access-right bits required (e.g., PROCESS_VM_WRITE | PROCESS_VM_OPERATION).
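To make the difference between the two masks concrete, the sketch below compares them bit by bit. The constant values are the ones published in the Windows SDK headers; the `NARROW_MASK` name and the two helper functions are hypothetical and exist only for this illustration.

```rust
// Access-right bits as defined in the Windows SDK (winnt.h).
pub const PROCESS_VM_OPERATION: u32 = 0x0008;
pub const PROCESS_VM_WRITE: u32 = 0x0020;
pub const PROCESS_ALL_ACCESS: u32 = 0x001F_FFFF; // value on Windows Vista and later

/// The narrower combination mentioned in the note above (hypothetical name).
pub const NARROW_MASK: u32 = PROCESS_VM_WRITE | PROCESS_VM_OPERATION;

/// True when every bit in `requested` is also present in `granted`.
pub fn is_subset(requested: u32, granted: u32) -> bool {
    requested & !granted == 0
}

/// Number of access-right bits that `mask` carries beyond `baseline`.
pub fn extra_bits(mask: u32, baseline: u32) -> u32 {
    (mask & !baseline).count_ones()
}
```

Under these definitions `NARROW_MASK` evaluates to 0x28, while PROCESS_ALL_ACCESS sets 21 bits, 19 of which the narrow mask never asks for. Those surplus bits are exactly what a handle-creation monitor sees when it inspects the requested access mask.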

2. Memory Allocation: With the handle established, the injector instructs kernelbase.dll to allocate a block of memory inside the remote process large enough to hold the shellcode, requesting PAGE_EXECUTE_READWRITE (RWX) permissions.

// Stage 2: Remote memory allocation via Win32 Wrapper
let remote_memory = VirtualAllocEx(
    process_handle,
    Some(null_mut()),
    SHELLCODE.len(),
    MEM_COMMIT | MEM_RESERVE,
    PAGE_EXECUTE_READWRITE,
);

3. Execution: Once the memory is allocated and the payload is written via WriteProcessMemory, the execution is triggered by requesting the OS kernel to spin up a new thread pointing to the base address of the allocated shellcode.

// Stage 3: Thread Hijacking/Creation
let thread_handle = CreateRemoteThread(
    process_handle,
    Some(null_mut()),
    0,
    Some(std::mem::transmute(remote_memory)),
    Some(null_mut()),
    0,
    Some(null_mut()),
).unwrap();

fig3

Fig 3: Running the Win32 Injector with Process Hacker Open

3.2 API Call Dissection (Using WinDbg)

To truly understand why this Win32 implementation is architecturally flawed against modern defenses, we must observe the execution flow dynamically. By attaching a debugger, we can track the VirtualAllocEx call as it descends through the abstraction layers.

fig4

Fig 4: WinDbg Call Stack demonstrating the transition from kernelbase!VirtualAllocEx to ntdll!NtAllocateVirtualMemory.

If we set a breakpoint on ntdll!NtAllocateVirtualMemory, let the injector invoke VirtualAllocEx, and analyze the call stack when the breakpoint hits, the underlying mechanics are revealed. VirtualAllocEx (housed in kernelbase.dll) does not perform the allocation. It merely repackages the parameters and immediately issues a call to NtAllocateVirtualMemory inside ntdll.dll.

It is within ntdll.dll that the execution reaches the System Call boundary, preparing the SSN to transition into Ring 0.

3.3 Caveats Against EDRs: The Hooking Dilemma

This predictable transition from Win32 -> NTAPI represents a critical vulnerability for the offensive operator. Endpoint Detection and Response (EDR) solutions exploit this chokepoint through a technique known as Userland API Hooking (or Inline Hooking).

When a modern EDR (such as CrowdStrike, SentinelOne, or Microsoft Defender for Endpoint) detects a new process spinning up, it uses a kernel callback (via PsSetCreateProcessNotifyRoutine) to preemptively inject its own monitoring DLL into the newly created process's memory space.

Once injected, the EDR alters the first few bytes (the prologue) of critical NTAPI functions inside ntdll.dll, specifically functions like NtAllocateVirtualMemory, NtWriteVirtualMemory, and NtCreateThreadEx.

  1. The Trampoline: The EDR overwrites the beginning of the NtAllocateVirtualMemory stub with an unconditional JMP (jump) instruction.
  2. The Interception: When our Rust injector calls VirtualAllocEx, it naturally flows into ntdll.dll. However, instead of executing the syscall to reach the kernel, the EDR's JMP instruction forcefully redirects execution to the EDR's own analysis routines.
  3. The Heuristic Block: The EDR inspects the parameters: Is this an untrusted process requesting RWX memory in another process? Yes. The EDR flags the behavior as malicious, generates an alert, and either terminates the offending thread or returns an NTSTATUS failure code (e.g., STATUS_ACCESS_DENIED) to the injector.

The Win32 implementation operates entirely blind to these hooks. It inherently trusts the OS API, blindly walking into the traps set within ntdll.dll. To survive, the operational architecture must abandon the Win32 wrappers and descend a layer deeper.
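The trampoline described above is ordinary x86-64 arithmetic, and decoding it is how an analyst finds out where a hooked stub leads. The sketch below assumes the hook is the 5-byte near jump form (opcode E9 followed by a signed 32-bit displacement); other hook encodings exist and are not handled. The function name `jmp_rel32_target` is hypothetical, and it works on a byte slice that has already been copied out of memory.

```rust
/// Opcode of the x86-64 near relative jump (JMP rel32).
const JMP_REL32: u8 = 0xE9;
/// Encoded length of JMP rel32: one opcode byte plus a 4-byte displacement.
const JMP_REL32_LEN: u64 = 5;

/// If `bytes` (the first bytes of a stub located at `stub_addr`) start with
/// a JMP rel32, return the address the jump lands on; otherwise return None.
pub fn jmp_rel32_target(stub_addr: u64, bytes: &[u8]) -> Option<u64> {
    if bytes.len() < 5 || bytes[0] != JMP_REL32 {
        return None;
    }
    // The displacement is a little-endian signed 32-bit value.
    let rel = i32::from_le_bytes([bytes[1], bytes[2], bytes[3], bytes[4]]);
    // target = address of the next instruction + sign-extended displacement
    Some(
        stub_addr
            .wrapping_add(JMP_REL32_LEN)
            .wrapping_add(rel as i64 as u64),
    )
}
```

For example, the bytes E9 FB 0F 00 00 at address 0x1000 decode to a displacement of 0xFFB, so the jump lands at 0x1000 + 5 + 0xFFB = 0x2000. A stub that still begins with 4C 8B D1 B8 yields None, meaning no E9 hook sits at its first byte.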


Part IV: The NTAPI Implementation

To evade the telemetry generation and monitoring present in high-level Win32 wrappers like kernelbase.dll, an operational architecture must interact directly with the Native API (NTAPI). By interfacing with ntdll.dll directly, the injector significantly reduces its footprint and operates one step closer to the kernel boundary.

However, because Microsoft considers the NTAPI to be an internal, undocumented interface subject to change, standard libraries (including the official Rust windows crate) often lack the necessary structural definitions required to make these calls natively.

4.1 Structural Parity and Dynamic Resolution

To call functions like NtOpenProcess, the engine must manually reconstruct the C-style structures expected by the Windows kernel using Rust's #[repr(C)] attribute. This ensures that field order, alignment, and padding match the layout a C compiler would produce.

// Reconstructing undocumented NTAPI structures
#[repr(C)]
pub struct OBJECT_ATTRIBUTES {
    pub length: ULONG,
    pub root_directory: HANDLE,
    pub object_name: PUNICODE_STRING,
    pub attributes: ULONG,
    pub security_descriptor: PVOID,
    pub security_quality_of_service: PVOID,
}

#[repr(C)]
pub struct CLIENT_ID {
    pub unique_process: HANDLE,
    pub unique_thread: HANDLE,
}
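One way to gain confidence in hand-written definitions like these is to check their size and alignment against the values expected on 64-bit Windows. The sketch below is self-contained and makes several assumptions: the type aliases and the `UNICODE_STRING` definition are supplied here for illustration (the original does not show them), the target has 64-bit pointers, and `layout_matches_x64` is a hypothetical helper.

```rust
use std::ffi::c_void;
use std::mem::{align_of, size_of};

// Assumed aliases; the original relies on equivalents defined elsewhere.
#[allow(non_camel_case_types, dead_code)]
type ULONG = u32;
#[allow(non_camel_case_types, dead_code)]
type HANDLE = *mut c_void;
#[allow(non_camel_case_types, dead_code)]
type PVOID = *mut c_void;

#[allow(non_camel_case_types, dead_code)]
#[repr(C)]
pub struct UNICODE_STRING {
    pub length: u16,
    pub maximum_length: u16,
    pub buffer: *mut u16,
}
#[allow(non_camel_case_types, dead_code)]
type PUNICODE_STRING = *mut UNICODE_STRING;

#[allow(non_camel_case_types, dead_code)]
#[repr(C)]
pub struct OBJECT_ATTRIBUTES {
    pub length: ULONG,                // 4 bytes, followed by 4 bytes of padding
    pub root_directory: HANDLE,       // 8 bytes
    pub object_name: PUNICODE_STRING, // 8 bytes
    pub attributes: ULONG,            // 4 bytes, followed by 4 bytes of padding
    pub security_descriptor: PVOID,   // 8 bytes
    pub security_quality_of_service: PVOID, // 8 bytes
}

#[allow(non_camel_case_types, dead_code)]
#[repr(C)]
pub struct CLIENT_ID {
    pub unique_process: HANDLE,
    pub unique_thread: HANDLE,
}

/// True when the definitions have the sizes expected on a 64-bit target.
pub fn layout_matches_x64() -> bool {
    size_of::<OBJECT_ATTRIBUTES>() == 48
        && align_of::<OBJECT_ATTRIBUTES>() == 8
        && size_of::<CLIENT_ID>() == 16
        && size_of::<UNICODE_STRING>() == 16
}
```

Without #[repr(C)], the Rust compiler is free to reorder fields, so a check like this would no longer be guaranteed to describe the layout the kernel reads.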

With the structures defined, the engine abandons load-time imports for the NTAPI functions. Instead of relying on the Windows loader to populate the Import Address Table (IAT), which is heavily scrutinized by static analysis engines, the injector dynamically resolves the memory addresses of the NTAPI functions at runtime using GetModuleHandleA and GetProcAddress.

// Dynamic resolution of NTAPI exports
let ntdll = GetModuleHandleA(s!("ntdll.dll")).unwrap();

let nt_open_process: FnNtOpenProcess = transmute(
    GetProcAddress(ntdll, s!("NtOpenProcess")).unwrap()
);
let nt_allocate_virtual_memory: FnNtAllocateVirtualMemory = transmute(
    GetProcAddress(ntdll, s!("NtAllocateVirtualMemory")).unwrap()
);

4.2 The RW -> RX OpSec Pattern

This NTAPI implementation also introduces a critical Operational Security (OpSec) improvement over the Win32 variant.

Requesting a block of memory with PAGE_EXECUTE_READWRITE (RWX) permissions is highly anomalous. Legitimate applications rarely require memory that is simultaneously writable and executable. To blend in with normal process behavior, the engine allocates the memory initially as Read/Write (PAGE_READWRITE), writes the payload, and subsequently uses NtProtectVirtualMemory to lock the memory down to Read/Execute (PAGE_EXECUTE_READ).

// 1. Allocate as RW (Stealth)
nt_allocate_virtual_memory(
    process_handle, &mut base_address, 
    0, 
    &mut region_size, 
    0x3000 /* MEM_COMMIT | MEM_RESERVE */, 
    0x04 /* PAGE_READWRITE */
);

// 2. Write Payload
nt_write_virtual_memory(
    process_handle, 
    base_address, 
    SHELLCODE.as_ptr() as PVOID, 
    SHELLCODE.len(), 
    &mut bytes_written
);

// 3. Switch to RX (Execution)
let mut old_protect: ULONG = 0;
nt_protect_virtual_memory(
    process_handle, 
    &mut base_address, 
    &mut region_size, 
    0x20, /* PAGE_EXECUTE_READ */
    &mut old_protect
);
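The magic numbers in the listing above are standard Windows memory constants. The sketch below names them and classifies a protection value as writable, executable, or both. The constant values come from the Windows SDK headers; the helper function names are hypothetical, and masking with 0xFF to strip modifier flags such as PAGE_GUARD is a simplification made for this example.

```rust
// Allocation-type and page-protection constants from the Windows SDK.
pub const MEM_COMMIT: u32 = 0x1000;
pub const MEM_RESERVE: u32 = 0x2000;
pub const PAGE_READWRITE: u32 = 0x04;
pub const PAGE_WRITECOPY: u32 = 0x08;
pub const PAGE_EXECUTE: u32 = 0x10;
pub const PAGE_EXECUTE_READ: u32 = 0x20;
pub const PAGE_EXECUTE_READWRITE: u32 = 0x40;
pub const PAGE_EXECUTE_WRITECOPY: u32 = 0x80;

/// Strip modifier flags (PAGE_GUARD and friends) and keep the base protection.
fn base(protect: u32) -> u32 {
    protect & 0xFF
}

pub fn is_writable(protect: u32) -> bool {
    matches!(
        base(protect),
        PAGE_READWRITE | PAGE_WRITECOPY | PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY
    )
}

pub fn is_executable(protect: u32) -> bool {
    matches!(
        base(protect),
        PAGE_EXECUTE | PAGE_EXECUTE_READ | PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY
    )
}

/// True for regions that are writable and executable at the same time.
pub fn violates_wx(protect: u32) -> bool {
    is_writable(protect) && is_executable(protect)
}

/// True when a protection change matches the RW -> RX pattern described above.
pub fn is_rw_to_rx_flip(old: u32, new: u32) -> bool {
    is_writable(old) && !is_executable(old) && is_executable(new) && !is_writable(new)
}
```

With these definitions, MEM_COMMIT | MEM_RESERVE is 0x3000, PAGE_EXECUTE_READWRITE is the only value in the listing that violates write-xor-execute, and the pair (0x04, 0x20) is recognized as an RW -> RX flip. The last helper shows that the transition has a recognizable shape of its own, visible to any component that observes the protection change.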

fig5

Fig 5: Output demonstrating successful dynamic resolution and execution via NTAPI.

4.3 Caveats Against EDRs: The Illusion of Evasion

While the NTAPI implementation successfully bypasses any inline hooks placed inside kernel32.dll or kernelbase.dll, it operates under a fatal architectural flaw.

By using GetProcAddress to find the location of NtAllocateVirtualMemory, the injector asks the operating system for the memory address of the function as it currently exists in memory. If an EDR agent has injected its monitoring DLL into the process and placed a JMP instruction at the start of that function, GetProcAddress will simply hand the injector a pointer to the hooked, compromised stub.

fig6

Fig 6: WinDbg memory dump of an ntdll.dll export, showing a stub that is free of inline EDR hooks at the export boundary.

Note: If an EDR is active, the execution flow remains identical: The Rust engine calls the dynamically resolved address -> the execution hits the EDR's JMP instruction -> the telemetry is captured, and the thread is terminated.


Part V: Direct Syscalls and the Architecture of Evasion

If navigating the Win32 API is equivalent to walking through the front door of a monitored building, and NTAPI is sneaking through the back door only to find a security camera (JMP instruction) pointing right at you, Direct Syscalls represent tunneling underneath the building entirely.

To truly evade userland hooks installed by Endpoint Detection and Response (EDR) solutions, an operational architecture must completely sever its reliance on ntdll.dll. Instead of asking the operating system to prepare the transition to the kernel, the engine must manually construct the execution stub and issue the hardware transition instruction itself.

5.1 The Theory of System Service Numbers (SSNs)

When standard execution flows through ntdll.dll, the function stub (e.g., NtAllocateVirtualMemory) performs a very specific task before executing the syscall instruction: it loads a System Service Number (SSN) into the EAX CPU register.

The SSN is essentially an index number. When execution drops into Ring 0, the kernel looks at the number in the EAX register, matches it against the System Service Descriptor Table (SSDT), and knows exactly which kernel-level function to execute.

The architectural dilemma for malware developers is that SSNs are not static. Microsoft frequently changes these index numbers between different Windows versions, service packs, and even minor patch builds.

  • On Windows 10 Build 19044, the SSN for NtAllocateVirtualMemory might be 0x18.
  • On Windows 11 Build 22621, it might be 0x19.

If an operator hardcodes 0x18 into their direct syscall injector and executes it on a machine requiring 0x19, the kernel will dispatch a different system service with arguments that were never meant for it. The call typically fails with an unexpected NTSTATUS error or, worse, performs an unintended operation, and the injection chain breaks. Therefore, the engine must dynamically discover the correct SSN at runtime before attempting a direct syscall.

5.2 Dynamic Resolution and "Hell's Gate" Theory

To find the correct SSN without triggering the EDR's hooks, the engine cannot simply rely on executing GetProcAddress. As established in Part IV, GetProcAddress points directly to the compromised memory address where the EDR has placed its JMP instruction.

Instead, the engine must act as a memory scanner. This methodology, pioneered by the offensive security research community as "Hell's Gate", involves reading the raw bytes of ntdll.dll loaded in memory.

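On x64, a pristine, unhooked NTAPI stub begins with a fixed assembly signature (recent Windows builds insert a test/jne pair before the syscall instruction, but the leading bytes are unchanged):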

mov r10, rcx      ; 4C 8B D1
mov eax, <SSN>    ; B8 [SSN bytes]
syscall           ; 0F 05
ret               ; C3

An EDR hook overwrites the beginning of this signature. On an unhooked stub, however, the engine can locate the base address of the function, confirm that the first four bytes match 4C 8B D1 B8, and then read the next 4 bytes as a little-endian value, extracting the SSN directly from the system's own dynamically loaded library.

fig7

Fig 7: Architectural Diagram of Direct Syscall Evasion vs. Standard API Hooking

5.3 Bypassing the Abstraction Layer

Armed with the dynamically resolved SSN, the engine no longer requires ntdll.dll. The operator can write their own assembly routine within their Rust or C++ payload that precisely mimics the behavior of a pristine NTAPI stub.

  1. The engine prepares the function arguments.
  2. The engine loads the dynamically resolved SSN into the EAX register.
  3. The engine natively executes the syscall instruction.

Execution jumps directly from the payload's memory space into Ring 0. Because execution never traverses the hooked ntdll.dll stubs, the EDR's inline hooks are never reached, blinding the userland monitoring components of the security solution.

This is the absolute apex of the user-mode evasion discussion. The raw assembly execution in Rust, specifically handling the x64 calling convention and stack alignment, is exactly what separates conceptual theory from clinical implementation.

5.4 The Rust Implementation: Dynamic Resolution and Raw Assembly

To operationalize the direct syscall theory, the architecture must handle two complex tasks entirely in memory: dynamically parsing the System Service Numbers (SSNs), and explicitly managing the CPU registers and stack to comply with the Windows x64 calling convention.

1. Dynamic SSN Resolution (Hell's Gate Lite): Instead of relying on hardcoded indices, the engine implements a targeted memory scanner. It resolves the base address of the requested NTAPI function in ntdll.dll and reads the raw opcodes byte-by-byte.

// Dynamically extracting the SSN from memory
let ptr = addr.unwrap() as *const u8;

// Match the pristine syscall signature: mov r10, rcx; mov eax, <SSN>
// Opcodes: 4C 8B D1 B8
if *ptr == 0x4C && *ptr.add(1) == 0x8B && *ptr.add(2) == 0xD1 && *ptr.add(3) == 0xB8 {
    let ssn = std::ptr::read_unaligned(ptr.add(4) as *const u32);
    println!("[+] Dynamically resolved {} SSN: {:#X}", function_name, ssn);
    return ssn;
}

If an EDR has placed a JMP instruction (typically opcode E9) at the function base, this signature check fails safely, alerting the operator that the function is hooked without actually triggering the malicious telemetry trap.
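The signature check is easier to reason about, and to unit test, when it operates on a byte slice instead of a raw pointer. The following is a minimal sketch of the same comparison under that assumption: `extract_ssn` and `STUB_PREFIX` are hypothetical names, and the input is a buffer of stub bytes rather than live process memory.

```rust
/// Leading bytes of a pristine x64 stub: mov r10, rcx (4C 8B D1),
/// followed by the opcode of mov eax, imm32 (B8).
const STUB_PREFIX: [u8; 4] = [0x4C, 0x8B, 0xD1, 0xB8];

/// Return the SSN encoded in `stub` if it starts with the pristine
/// signature, or None if the bytes do not match (for example, when a
/// JMP has been written over the start of the stub).
pub fn extract_ssn(stub: &[u8]) -> Option<u32> {
    // Need the 4-byte prefix plus the 4-byte immediate operand.
    if stub.len() < 8 || !stub.starts_with(&STUB_PREFIX) {
        return None;
    }
    // The immediate of mov eax, imm32 is stored little-endian.
    Some(u32::from_le_bytes([stub[4], stub[5], stub[6], stub[7]]))
}
```

Given the bytes 4C 8B D1 B8 18 00 00 00 0F 05 C3, the function returns Some(0x18); given a buffer that starts with E9, it returns None. Returning an Option keeps the "hooked or unrecognized" case explicit, without dereferencing unvalidated memory.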

2. The Assembly Execution Stub: Once the SSN is retrieved, the engine must execute the hardware transition. Using Rust's core::arch::asm! macro, the operator explicitly structures the execution stub.

This requires rigorous adherence to the Windows x64 calling convention. The first four arguments are passed via registers (RCX, RDX, R8, R9). Any subsequent arguments must be pushed to the stack. Furthermore, the caller must allocate 32 bytes (0x20) of "shadow space" on the stack immediately preceding the arguments.

// Raw assembly stub for NtAllocateVirtualMemory
#[cfg(target_arch = "x86_64")]
pub unsafe fn syscall_nt_allocate_virtual_memory(
    ssn: u32, process_handle: HANDLE, base_address: *mut PVOID,
    zero_bits: usize, region_size: PSIZE_T, allocation_type: ULONG, protect: ULONG,
) -> NTSTATUS {
    let mut status: NTSTATUS;
    asm!(
        // Allocate shadow space + stack arguments (0x20 + 0x18 = 0x38)
        "sub rsp, 0x38",
        
        // Pass argument 5 and 6 onto the stack
        "mov [rsp + 0x28], {alloc}",
        "mov [rsp + 0x30], {prot}",
        
        // Prepare syscall transition
        "mov r10, rcx",
        "mov eax, {ssn:e}",
        "syscall",
        
        // Restore stack pointer
        "add rsp, 0x38",
        
        // Register mapping
        ssn = in(reg) ssn,
        alloc = in(reg) allocation_type as u64,
        prot = in(reg) protect as u64,
        inout("rcx") process_handle => _, 
        in("rdx") base_address,
        in("r8") zero_bits as u64,
        in("r9") region_size,
        lateout("eax") status,
        out("r10") _, 
        out("r11") _, 
    );
    status
}

By executing this asm! block, the injector dictates the exact state of the hardware right before dropping into Ring 0.
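The constants 0x38, 0x28 and 0x30 in the stub follow from the Windows x64 calling convention. The sketch below reproduces the arithmetic under one stated assumption: the stack is laid out as it would be on entry to a normally called function, with one 8-byte return-address slot at [rsp] followed by the 32-byte shadow space. The function and constant names are hypothetical.

```rust
/// Size of one stack slot on x64, in bytes.
pub const SLOT: usize = 8;
/// Slot that a `call` instruction would have filled with the return address.
pub const RETURN_ADDRESS_SLOTS: usize = 1;
/// Shadow (home) space reserved for the four register arguments.
pub const SHADOW_SLOTS: usize = 4;
/// Arguments 1 to 4 travel in RCX, RDX, R8 and R9.
pub const REGISTER_ARGS: usize = 4;

/// Offset from RSP of the 1-based argument `n`, or None when that
/// argument is passed in a register.
pub fn stack_arg_offset(n: usize) -> Option<usize> {
    if n <= REGISTER_ARGS {
        return None;
    }
    let index = n - REGISTER_ARGS - 1; // 0 for the fifth argument
    Some((RETURN_ADDRESS_SLOTS + SHADOW_SLOTS + index) * SLOT)
}

/// Bytes to reserve for a call with `arg_count` arguments under the
/// layout assumed above.
pub fn frame_size(arg_count: usize) -> usize {
    let stack_args = arg_count.saturating_sub(REGISTER_ARGS);
    (RETURN_ADDRESS_SLOTS + SHADOW_SLOTS + stack_args) * SLOT
}
```

For the six-argument NtAllocateVirtualMemory this gives a frame of 0x38 bytes, with the fifth argument at offset 0x28 and the sixth at 0x30, matching the values in the stub. Keeping the arithmetic in named functions makes it straightforward to check the offsets for routines with a different argument count.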

fig8

Fig 8: Direct Syscall execution output demonstrating successful dynamic SSN extraction and injection.

5.5 API Call Dissection (Using WinDbg)

When we attach WinDbg to this direct syscall implementation, the profound impact of this evasion technique becomes visually apparent.

If we place a breakpoint directly on the syscall instruction within our Rust payload and observe the call stack, the telemetry footprint is drastically altered.

fig9

Fig 9: WinDbg Call Stack of a Direct Syscall. Note the complete absence of ntdll.dll in the user-mode execution flow.

In the earlier implementations, the debugger revealed a cascading execution flow: Payload -> kernelbase.dll -> ntdll.dll -> Kernel for the Win32 variant, and Payload -> ntdll.dll -> Kernel for the NTAPI variant.

With the direct syscall architecture, the call stack shows a direct, unmediated jump: Payload -> Kernel. The engine never traverses the memory space of ntdll.dll. Consequently, the EDR's inline hooks placed within ntdll.dll are left monitoring an empty hallway. The telemetry sensor is physically bypassed.


Part VI: Conclusion - The Horizon of Evasion

The evolution of process injection from standard Win32 wrappers to direct syscall manipulation highlights the fundamental cat-and-mouse nature of evasion engineering.

  • Win32 implementations rely on convenience, utilizing documented APIs that are heavily monitored and immediately flagged by heuristic engines.
  • NTAPI implementations reduce the application footprint but remain highly vulnerable to Userland API hooking techniques deployed by modern security solutions.
  • Direct Syscalls sever the reliance on the operating system's user-mode abstractions entirely, constructing the hardware execution context manually to blind user-mode telemetry sensors.

However, it is critical to understand that while direct syscalls effectively dismantle user-mode hooking, they are not a silver bullet.

When the syscall instruction executes, the kernel still receives the request. Modern Endpoint Detection and Response platforms rely heavily on kernel-mode telemetry (e.g., ObRegisterCallbacks for handle creation, PsSetCreateThreadNotifyRoutine for thread creation, and the Threat Intelligence ETW provider, commonly called EtwTi, for remote memory allocation and protection changes). While the EDR may not see how the request arrived (because the user-mode hooks were skipped), the kernel still alerts the EDR that a process is requesting RWX memory or spawning a thread in another process.

To defeat these Ring 0 telemetry mechanisms, an operator must escalate privileges beyond standard user-mode architecture. The horizon of evasion inevitably forces clinical architectures to transition from user-mode memory manipulation into kernel-mode subversion, utilizing techniques such as Bring Your Own Vulnerable Driver (BYOVD) to manipulate the OS kernel natively, but that is an architectural discussion for another time.


References