Roberto Paleari's blog

Monday, February 2, 2015

Fast coverage analysis for binary applications

(This post is a joint work with @joystick, see also his blog here)

Despite its simplicity, fuzzing has become a more and more popular technique for finding software bugs (and possibly security vulnerabilities), especially when dealing with complex or closed-source applications. The recipe for a basic fuzzer is well-known: pick your favorite target application and run it on “weird” inputs. Hopefully, one of these inputs will trigger some corner-case behaviors, which produce externally-observable side effects, such as a program crash.

The main drawback of this approach is that, at least in its naive form, it can trigger only very shallow program paths: it may take a humongous number of inputs to reach even slightly convoluted branches. For example, consider a statement such that "if (a == 31337) then { … }"; to reach the "then" block using a completely blind approach, the fuzzer would need to guess the correct value for variable "a" out of 2**32 distinct possibilities (considering 32-bit integers). To address this limitation, "smarter" testing approaches have been proposed. Among these, symbolic and concolic execution techniques have recently become quite popular among security researchers, but their applications to real-world products are still questionable, especially because of the complexity explosion when dealing with the intricacies of real-world software. Thus, simple fuzzing is still widely used, and many researchers strive to find ways to make it more effective while, at the same time, trying to keep its overhead as close as possible to that of a native black-box approach.

Following this trend, recently, @lcamtuf has demonstrated that simple fuzz testing can still identify dangerous bugs: his "american fuzzy lop" (afl) approach consists in running the target program over a carefully selected set of test cases, while observing it via compile-time instrumentation that permits to monitor instruction-level coverage. In a nutshell, test cases that reach novel instructions are "more interesting", as they could also trigger different program behaviors.

We were recently talking about the applicability of the afl approach to binary applications as well. The first problem concerns the monitoring phase: how to calculate the coverage for a given input? Obviously, as the target application is binary-only, source-level instrumentation is not an option.

Coverage analysis of binary applications

Strictly speaking, for the sake of tracking the progress of our fuzzing process, instead of instruction coverage, an analysis at the basic-block level should suffice: as basic blocks are uninterruptible single-entry, single-exit sequences of instructions, the two should be roughly equivalent (at least if we ignore asynchronous events).

As an example, the following x86 assembly snippet encodes a simple loop.

B1:    
    xor %ecx, %ecx

B2:
    inc %ecx
    cmp $0x3, %ecx
    jb B2

B3:
    ...

When monitoring the execution of this code, we would like to produce the basic blocks trace <B1, B2, B2, B2, B3>.

Monitoring via binary instrumentation

The problem of monitoring binary applications for coverage analysis was recently discussed also by some researchers (e.g., see @matalaz presentation here) who suggested to rely on binary instrumentation, e.g., using Pin, DynamoRIO, or Valgrind. More recently, other researchers started to implement similar approaches in order to port "afl" to closed-source applications, both relying on PIN or QEMU (the latter has also been integrated into "afl" since version 1.31b).

But what about performances? One of the main strengths of random fuzzing approaches is their high execution rate, as fuzzed inputs are executed nearly at native speed. But if the monitoring phase costs too much, the whole process is slowed down significantly.

As an example, consider this minimal Pin tool that inserts a callback function before the execution of every program basic block (well, actually at every Pin "trace", i.e., a single-entry sequence of basic blocks, but we can ignore the differences here). Obviously this trivial tool cannot be considered as a real process monitor, but can certainly be used to estimate a lower bound for the performances of this approach.

#include "pin.H"

VOID CallbackTrace(TRACE trace, VOID *v) {
   // Insert callback instructions here
}

int main(int argc, char **argv) {
  PIN_Init(argc, argv);
  TRACE_AddInstrumentFunction(CallbackTrace, 0);
  PIN_StartProgram();
  return 0;
}

Even such a trivial Pin tool introduces a significant overhead. As an example, running it over /bin/ls takes about 340ms on a normal laptop, a 100x overhead with respect to a native execution. Pin experts can certainly further reduce the performance penalty with some tweaks, but the order of magnitude should not change very much: after all, dynamic binary translation tools have to decode, translate, instrument, and finally recompile target code before executing it.

Introducing Intel BTS

Modern processors include sophisticated debug and performance monitoring facilities. Intel introduced these features in early Pentium processors and continued to extend them in subsequent CPU models (see chapters 17 and 18 of Intel manuals for details).

Among these facilities, Intel BTS ("Branch Trace Store") permits to record a trace of executed branch instructions to a memory buffer. In a nutshell, BTS records executed control-flow edges, as (source, destination) pairs. This mechanism is quite customizable, and can be configured to generate an interrupt when the BTS buffer is almost full, monitor only a specified privilege level (e.g., to track only user-space branches), or limit the capture to selected branch types (e.g., indirect/conditional branches, returns, calls). These characteristics make BTS an attractive approach for performing branch-level coverage analysis of a binary application.

BTS is configured by writing settings to dedicated MSR registers (again, see Intel manuals for low-level details). These operations should be carried out by kernel-level code, thus specific OS modules are required to permit the implementation of user-space monitor applications. Fortunately, Intel BTS is already supported by latest versions of the Linux performance monitoring subsystem, and is exposed to user-space via the perf_event_open() system call (for a user-space client see also the perf tool).

Coverage analysis using Intel BTS

Despite Linux performance monitoring is documented quite well, details about how to use perf_event_open() specifically for controlling BTS are scarce, except for few public examples: it is quite easy to invoke the API with improper parameters that force the subsystem to "fall-back" on software-based performance monitoring, with significant performance penalties. Thus, during our experiments we developed a coverage analysis tool that leverages this API to perform hardware-assisted tracing of the basic blocks executed by a target application. Information about executed basic blocks is dumped to a Google protobuf file, for easy post-processing. As an example, the following excerpt show the tracing of /bin/ls and the dump in human-readable form of the resulting protobuf file:

$ ./bts_trace -f /dev/shm/ls.trace -- /bin/ls >/dev/null
[*] Got 50758 events (2486 CFG edges), serializing to /dev/shm/ls.trace

$ python trace.py /dev/shm/ls.trace
#### Trace '/dev/shm/ls.trace' ####
[/dev/shm/ls.trace] cmd: test, data: 0 bytes, time: 2015-01-29 19:21:58
hash: 2dc92a, edges(s): 2486, exception(s): 0

 - CFG edges
 [00402176 -> 0040217d] 1 hit
 [00402181 -> 00412513] 1 hit
 [00402196 -> 7f1d372dc2f0] 28 hit
 [004021c0 -> 004021c6] 1 hit
 [004021c0 -> 7f1d36b281b0] 7 hit
 [004021cb -> 00402190] 1 hit
 ...
 [7f1d372e0ac7 -> 7f1d372e0a30] 2 hit
 [7f1d372e0af7 -> 7f1d372e0a20] 2 hit
 [7f1d372e0b1b -> 7f1d372e0635] 4 hit

 - Memory regions
 [0x00400000, 0x0041bfff] /bin/ls
 [0x7f1d3625d000, 0x7f1d36479fff] /lib/x86_64-linux-gnu/libpthread-2.19.so
 [0x7f1d3647a000, 0x7f1d3667efff] /lib/x86_64-linux-gnu/libattr.so.1.1.0
 ...

Running our tool over /bin/ls takes about 90ms, about 1/4 of the time required by the Pin-based tracer we sketched above. Consider also that this includes the time spent for the creation & serialization of the protobuf stream to file, while none of these operations are performed by the Pin-based tracer: the latter introduces a very high overhead with just the instrumentation required to trace BBs, without auxiliary functions for storing and eventually serializing the execution trace.

Both the BTS tracer and the trace viewer are available here, so feel free to give them a try! The current implementation is quite rough, so we believe there is still room for improvement.

Construction of the basic block "hit map"

Starting from the generated trace file, it is also possible to build the control-flow graph (CFG). Clearly, using this approach we can build a dynamic CFG only, i.e., a graph that includes control-flow edges observed in the concrete execution, but multiple execution traces could be also merged together to better approximate the static CFG. In addition, the information about how many times each basic block has been executed (i.e., the number of "hits") could also be used to generate a "hit map" of basic blocks frequencies, possibly to guide the fuzz testing phase.

As an example, consider the following C implementation of a classical binary search algorithm.

static int bisect(int v[], int size, int key) {
  int start, end, middle, pos;

  start = 0;
  end = size-1;
  pos = -1;
  while (start <= end) {
    middle = (end+start)/2;
    if (v[middle] > key) {
        end = middle-1;
    } else if (v[middle] < key) {
        start = middle+1;
    } else {
        pos = middle;
        break;
    }
  }

  return pos;
}

The CFG in Figure 1 is a "hit map" for a monitored execution of bisect(), on a sorted array of 1024 elements. Despite we previously shown the C source for this method, the CFG has been constructed dynamically from the binary code. Colors reflect the number of node hits, i.e., how many times the node has been executed in the observed execution. Nodes and edges labels indicate the actual hits. It is finally worth noting that for conditional branches, whenever possible, we also add to the graph those edges that have not been taken during the observed execution; this is done just to better approximate the static CFG of the application, but is technically not required to support a subsequent fuzzing phase.

To conclude, Figure 2 shows another example of a "hit map", this time for an execution of a recursive quicksort algorithm, running on a random array of 1024 integers (C code for this test case is available here).

Figure 1 - "Hit map" for the bisect() function (click on the image to enlarge)

Figure 2 - "Hit map" for an implementation of Quicksort (click on the image to enlarge)

Conclusions

Hardware-assisted performance monitoring could be an interesting approach to implement efficient coverage analysis tools, which in turn can be employed to support fuzz testing and other security-related applications. At this aim, we developed a sample tool that leverages the Intel "Branch Trace Store" facility. Our current implementation is just a little more than a proof-of-concept. Currently, its major limitation is that, in some situations, we lose some branch events. This may happen when the branch rate is very high, such as when the monitored application enters a tight loop. We are still investigating these issues, and we hope some interested reader could also contribute with some patches ;-)

Monday, January 12, 2015

Time to fill OS X (Blue)tooth: Local privilege escalation vulnerabilities in Yosemite

(This post is a joint work with @joystick, see also his blog here)

Motivated by our previous findings, we performed some more tests on service IOBluetoothHCIController of the latest version of Mac OS X (Yosemite 10.10.1), and we found five additional security issues. The issues have been reported to Apple Security and, since the deadline we agreed upon with them expired, we now disclose details & PoCs for four of them (the last one was notified few days later and is still under investigation by Apple). All the issues are in class IOBluetoothHCIController, implemented in the IOBluetoothFamily kext (md5 e4123caff1b90b81d52d43f9c47fec8f).

Issue 1 (crash-issue1.c)

Many callback routines handled by IOBluetoothHCIController blindly dereference pointer arguments without checking them. The caller of these callbacks, IOBluetoothHCIUserClient::SimpleDispatchWL(), may actually pass NULL pointers, that are eventually dereferenced.

More precisely, every user-space argument handled by SimpleDispatchWL() consists of a value and a size field (see crash-issue1.c for details). When a user-space client provides an argument with a NULL value but a large size, a subsequent call to IOMalloc(size) fails, returning a NULL pointer that is eventually passed to callees, causing the NULL pointer dereference.

The PoC we provide targets method DispatchHCICreateConnection(), but the very same approach can be used to cause a kernel crash using other callback routines (basically any other callback that receives one or more pointer arguments). At first, we ruled out this issue as a mere local DoS. However, as discussed here, Yosemite only partially prevents mapping the NULL page from user-space, so it is still possible to exploit NULL pointer dereferences to mount LPE attacks. For instance, the following code can be used to map page zero:

Mac:tmp $ cat zeropage.c

#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <mach/vm_map.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  mach_vm_address_t addr = 0;
  vm_deallocate(mach_task_self(), 0x0, 0x1000);
  int r = mach_vm_allocate(mach_task_self(), &addr, 0x1000, 0);
  printf("%08llx %d\n", addr, r);
  *((uint32_t *)addr) = 0x41414141;
  printf("%08x\n", *((uint32_t *)addr));
}

Mac:tmp $ llvm-gcc -Wall -o zeropage{,.c} -Wl,-pagezero_size,0 -m32
Mac:tmp $ ./zeropage
00000000 0
41414141
Mac:tmp $

Trying the same without the -m32 flag results in the 64-bit Mach-O being blocked at load time by the OS with message "Cannot enforce a hard page-zero for ./zeropage" (unless you do it as "root", but then what’s the point?).

Issue 2 (crash-issue2.c)

As shown in the screenshot below, IOBluetoothHCIController::BluetoothHCIChangeLocalName() is affected by an "old-school" stack-based buffer overflow, due to a bcopy(src, dest, strlen(src)) call where src is fully controlled by the attacker. To the best of our knowledge, this bug cannot be directly exploited due to the existing stack canary protection. However, it may still be useful to mount a LPE attack if used in conjunction with a memory leak vulnerability, leveraged to disclose the canary value.

Issue 2, a plain stack-based buffer overflow

Issue 3 (crash-issue3.c)

IOBluetoothHCIController::TransferACLPacketToHW() receives as an input parameter a pointer to an IOMemoryDescriptor object. The function carefully checks that the supplied pointer is non-NULL; however, regardless of the outcome of this test, it then dereferences the pointer (see the figure below, the attacker-controlled input parameter is stored in register r15). The IOMemoryDescriptor object is created by the caller (DispatchHCISendRawACLData()) using the IOMemoryDescriptor::withAddress() constructor. As this constructor is provided with a user-controlled value, it may fail and return a NULL pointer. See Issue 1 discussion regarding the exploitability of NULL pointer dereferences on Yosemite.

Issue 3, the module checks if r15 is NULL, but dereferences it anyway

Issue 4 (lpe-issue1.c)

In this case, the problem is due to a missing sanity check on the arguments of the following function:

IOReturn BluetoothHCIWriteStoredLinkKey(
                           uint32_t req_index, 
                           uint32_t num_of_keys, 
                           BluetoothDeviceAddress *p_device_addresses, 
                           BluetoothKey *p_bluetooth_keys, 
                           BluetoothHCINumLinkKeysToWrite *outNumKeysWritten);

The first parameter, req_index, is used to find an HCI Request in the queue of allocated HCI Requests (thus this exploit requires first to fill this queue with possibly fake requests). The second integer parameter (num_of_keys) is used to calculate the total size of the other inputs, respectively pointed by p_device_addresses and p_bluetooth_keys. As shown in the screenshot below, these values are not checked before being passed to function IOBluetoothHCIController::SendHCIRequestFormatted(), which has the following prototype:

IOReturn SendHCIRequestFormatted(uint32_t req_index, uint16_t inOpCode,
                                 uint64_t outResultsSize,
                                 void *outResultBuf,
                                 const char *inFormat, ...);

Issue 4, an exploitable buffer overflow (click to enlarge)

The passed format string "HbbNN" will eventually cause size_of_addresses bytes to be copied from p_device_addresses to the HCI request object identified by req_index in reverse order (the 'N' format consumes two arguments, the first is a size, the second a pointer to read from). If the calculated size_of_addresses is big enough (i.e., if we provide a big enough num_of_keys parameter), the copy overflows the buffer of the request, thrashing everything above it, including a number of function pointers in the vtable of the request object. These pointers are overwritten with attacker-controlled data (i.e., those pointed by p_bluetooth_keys) and called before returning to userspace, thus we can divert the execution wherever we want.

As a PoC, lpe-issue1.c exploits this bug and attempts to call a function located at the attacker-controller address 0x4141414142424242. Please note that the attached PoC requires some more tuning before it can cleanly return to user-space, since more than one vtable pointer is corrupted during the overflow and needs to be fixed with valid pointers.

Execution of our "proof-of-concept" exploit (Issue 4)

Notes

All the PoCs we provide in this post are not "weaponized", i.e., they do not contain a real payload, nor they attempt to bypass existing security features of Yosemite (e.g., kASLR and SMEP). If you’re interested in bypass techniques (as you probably are, if you made it here), Ian Beer of Google Project Zero covered pretty much all of it in a very thorough blog post. In this case, he used a leak in the IOKit registry to calculate kslide and defeat kASLR, while he used an in-kernel ROP-chain to bypass SMEP. More recently, @i0n1c posted here about how kASLR is fundamentally broken on Mac OS X at the moment.

Conclusions

Along the last issue identified, we shared with Apple our conclusions on this kext: according to the issues we identified, we speculate there are many other crashes and LPE vulnerabilities in it. Ours, however, is just a best-effort analysis done in our spare time, and given the very small effort that took us to identify the vulnerabilities, we would suggest a serious security evaluation of the whole kext code.

Disclosure timeline

02/11: Notification of issues 1, 2 and 3.
23/11: No answer received from Apple, notification of issue 4. As no answer was received since the first contact, propose December 2 as possible disclosure date.
25/11: Apple answers requesting more time. We propose to move the disclosure date to January 12.
27/11: Apple accepts the new deadline.
05/12: Contact Apple asking for the status of the vulnerabilities.
06/12: Apple says they're still "investigating the issue".
23/12: Notification of a new issue (#5), proposing January 23 as a tentative disclosure date.
06/01: Apple asks for more time for issue #5. We propose to move the disclosure date to February 23. We remind our intention to disclose the 4 previous issues on January 12.
12/01: No answer from Apple, disclosing first 4 issues.
27/01: Apple assigned our issues CVE-2014-8837 and patched all of them in Mac OS X 10.10.2.

Thursday, October 30, 2014

Mac OS X local privilege escalation (IOBluetoothFamily)

(This post is a joint work with @joystick, see also his blog here)

Nowadays, exploitation of user-level vulnerabilities is becoming more and more difficult, because of the widespread diffusion of several protection methods, including ASLR, NX, various heap protections, stack canaries, and sandboxed execution. As a natural consequence, instead of extricating themselves with such a plethora of defensive methods, attackers prefer to take the “easy” way and started to move at the kernel-level, where sophisticated protection techniques are still not very common (indeed, things like as KASLR and SMEP are implemented only in the latest versions of the most popular OSes). This trend is also confirmed by the rising number of kernel-level vulnerabilities reported in the last few months in Windows, Linux, and OS X.

Following this trend, we recently looked at few OS X drivers (“KEXT”s) and found a integer signedness bug affecting service IOBluetoothHCIController (implemented by the IOBluetoothFamily KEXT). This vulnerability can be exploited by a local attacker to gain root privileges. The issue is present on the latest versions of OS X Mavericks (tested on 10.9.4 and 10.9.5), but has been “silently” patched by Apple in OS X Yosemite.

Vulnerability overview

In a nutshell, the bug lies in the IOBluetoothHCIUserClient::SimpleDispatchWL() function. The function eventually takes a user-supplied 32-bit signed integer value and uses it to index a global array of structures containing a function pointer. The chosen function pointer is finally called. As the reader can easily imagine, SimpleDispatchWL() fails at properly sanitizing the user-supplied index, thus bad things may happen if a malicious user is able to control the chosen function pointer.

More in detail, the vulnerable part of the function is summarized in the pseudocode below. At line 14, the user-supplied 32-bit integer is casted to a 64-bit value. Then, the "if" statement at line 16 returns an error if the casted (signed) value is greater than the number of methods available in the global _sRoutines array; obviously, due to the signed comparison, any negative value for the method_index variable will pass this test. At line 20 method_index is used to access the _sRoutines array, and the retrieved callback is finally called at line 23.

typedef struct {
  void (*function_pointer)();
  uint64 num_arguments;
} BluetoothMethod;

BluetoothMethod _sRoutines[] = {
  ...
};

uint64 _sRoutineCount = sizeof(_sRoutines)/sizeof(BluetoothMethod);

IOReturn IOBluetoothHCIUserClient::SimpleDispatchWL(IOBluetoothHCIDispatchParams *params) {
  // Here "user_param" is a signed 32-bit integer parameter
  int64 method_index = (int64) user_param;

  if (method_index >= _sRoutineCount) {
    return kIOReturnUnsupported;
  }

  BluetoothMethod method = _sRoutines[method_index];
  ...
  if (method.num_arguments < 8) {
       method.function_pointer(...);
  }
  ...  
}

Exploitation details

Exploitation of this vulnerability is just a matter of supplying the proper negative integer value in order to make IOBluetoothFamily index the global _sRoutines structure out of its bounds, and to fetch an attacker-controlled structure. The supplied value must be negative to index outside the _sRoutines structure while still satisfying the check at line 16.

As a foreword, consider that for our "proof-of-concept" we disabled both SMEP/SMAP and KASLR, so some additional voodoo tricks are required to get a fully weaponized exploit. Thus, our approach was actually very simple: we computed a value for the user-supplied parameter that allowed us to index a BluetoothMethod structure such that BluetoothMethod.function_ptr is a valid user-space address (where we placed our shellcode), while BluetoothMethod.num_arguments is an integer value less than 8 (to satisfy the check performed by SimpleDispatchWL() at line 22).

As shown in the C code fragment above, the user-supplied 32-bit value (user_param) is first casted to a 64-bit signed value, and then used as an index in _sRoutines. Each entry of the global _sRoutines array is 16-byte wide (two 8-byte values). These operations are implemented by the following assembly code:

; r12+70h points to the user-supplied index value
mov     ecx, [r12+70h]
mov     r13d, kIOReturnUnsupported
lea     rdx, _sRoutineCount
cmp     ecx, [rdx]
jge     fail
; Go on and fetch _sRoutine[method_index]
...
movsxd  rax, ecx             ; Sign extension to 64-bit value
shl     rax, 4               ; method_index *= sizeof(BluetoothMethod)
lea     rdx, _sRoutines
mov     esi, [rdx+rax+8]     ; esi = _sRoutines[method_index].num_arguments
cmp     esi, 7               ; Check method.num_arguments < 8
ja      loc_289BA
...

At a higher-level, the address of the BluetoothMethod structure fetched when processing an index value "user_param" is computed by the following formula:

struct_addr = (ext(user_param & 0xffffffff) * 16) + _sRoutine

Where ext() is the sign-extension operation (implemented by the movsxd instruction in the assembly code snipped above).

By solving this formula for user_param and searching inside the kernel address space, we found several candidate addresses that matched our criteria (i.e., a valid user-space pointer followed by an integer value < 8). The rest of the exploit is just a matter of mmap()'ing the shellcode at the proper user-space address, connecting to the IOBluetoothHCIController service and invoking the vulnerable method.

The source code for a (very rough) proof-of-concept implementation of the aforementioned exploit is available here, while the following figure shows the exploit "in action".

Execution of our "proof-of-concept" exploit

Patching

We verified the security issue both on OS X Mavericks 10.9.4 and 10.9.5 (MD5 hash values for the IOBluetoothFamily KEXT bundle on these two OS versions are 2a55b7dac51e3b546455113505b25e75 and b7411f9d80bfeab47f3eaff3c36e128f, respectively). After the release of OS X Yosemite (10.10), we noticed the vulnerability has been silently patched by Apple, with no mention about it in the security change log.

A side-by-side comparison between versions 10.9.x and 10.10 of IOBluetoothFamily confirms Apple has patched the device driver by rejecting negative values for the user-supplied index. In the figure below, the user-supplied index value is compared against _sRoutineCount (orange basic block). Yosemite adds an additional check to ensure the (signed) index value is non-negative (green basic block, on the right).

Comparison of the vulnerable OS X driver (Mavericks, on the left) and patched version (Yosemite, on the right)

Conclusions

We contacted Apple on October 20th, 2014, asking for their intention to back-port the security fix to OS X Mavericks. Unfortunately, we got no reply, so we decided to publicly disclose the details of this vulnerability: Yosemite has now been released since a while and is available for free for Apple customers; thus, we don’t think the public disclosure of this bug could endanger end-users.

Update (31/10/2014)

Yesterday evening, few hours after the publication of our blog post, we received a reply from Apple Product Security. They confirmed the bug has been fixed in Yosemite, and they are still evaluating whether the issue should be addressed in the previous OS versions as well.

Wednesday, March 26, 2014

Introducing QTrace, a "zero knowledge" system call tracer

It has been a while since my last post on this blog (one year!), but I have been quite busy both with my work and personal matters, including my wedding :-). Today, I would like to break the silence to introduce QTrace, a "zero knowledge" system call tracer I have just released open source.

TL;DR: QTrace is yet-another (Windows) system call tracer. Its peculiarity is that it requires no information about the structure of system call arguments: the tracer can infer their format automagically, by observing kernel memory access patterns. QTrace also includes a taint-tracking module to identify data dependencies between system calls.

Please, not another syscall tracer!

Some time ago I was writing an IOCTL fuzzer to test a Windows device driver. The fuzzer itself was pretty standard stuff: intercept the IOCTLs issued by the driver, fuzz the input buffer and monitor what happens. Unfortunately, blindly fuzzing IOCTL input buffers, without any idea of their format is usually ineffective, especially when facing structured arguments, and can reveal only shallow bugs. The typical solution is to undertake a boring reverse engineering session to determine the structure of the input data the kernel driver is expecting, and adapt the fuzzer accordingly. A somehow similar problem arose when I was developing WUSSTrace, a user-space syscall tracer for Windows. That time we relied on some preprocessor macros to recursively serialize system call arguments with as little code as possible. This approach was great to save some coding, but we still had to provide the tracer with the prototypes for all Windows system calls, together with the definitions for all the data types used by their arguments. Even if some of them are well-documented, others aren't, especially when moving to GUI (i.e., win32k.sys) syscalls.

But wouldn't it be nice if it was possible to automatically infer the format of system calls input arguments, in a (almost) OS-independent manner? This is exactly what QTrace tries to do!

QTrace in a nutshell

QTrace is a "zero knowledge" system call tracer. Its main characteristic is that system call arguments are dumped without the need to instruct the tracer about their structure. The only information explicitly provided to QTrace are the names of the system calls, but this is required just for "pretty printing" purposes (saying "NtOpenFile" is way better than just "0xb3"). This also makes QTrace almost OS-independent, at least when moving to different Windows versions.

The basic idea behind QTrace is that arguments structures can be determined by observing kernel memory access patterns: during the execution of a system call, if the kernel accesses a user-space memory address, then this location must belong to a system call argument, with only few exceptions (credits for this nice idea go to Lorenzo). Similarly, user-space data pointers can be recognized by observing the kernel accessing a location whose address has also been previously read from user-space.

Practically speaking, QTrace is implemented on top of QEMU and includes two main modules: the system call tracer itself and a dynamic taint-tracking engine. The former implements the "zero knowledge" syscall tracing technique just described, while the latter is used to track data dependencies (use/def) between system call arguments (more about this later). Traced system calls, together with taint information, are serialized to a protobuf stream and can be parsed off-line. QTrace also includes some basic Python post-processing tools to parse and display syscall traces in a human-readable form; recorded system calls can be even re-executed.

For the rest of this post we are assuming we are dealing with a 32-bit Windows system, but the same approach can be adapted to 64-bit environments as well.

Example: Syscall tracing

Imagine the Windows kernel is reading a UNICODE_STRING syscall argument located at address 0x0205e2ac. The definition for this structure is provided below; inline comments specify field offsets and sample values.

typedef struct _UNICODE_STRING {
  USHORT Length;          // Offset 0, value: 0x0042
  USHORT MaximumLength;   // Offset 2, value: 0x021a
  PWSTR  Buffer;          // Offset 4, value: 0x02bd11a0
} UNICODE_STRING, *PUNICODE_STRING;

To access the Unicode data buffer (field "Buffer") the kernel will first access user-space address 0x0205e2b0 (corresponding to 0x0205e2ac+4) and read a pointer from there (value 0x02bd11a0). Then, kernel will start accessing 0x42 bytes starting at address 0x02bd11a0. By observing this memory access pattern, we can reconstruct the overall layout of the UNICODE_STRING parameter.

To give an idea about the information that can be inferred through this approach, figure below shows an excerpt of a sample QTrace trace file from a Windows 7 system. In this case, process explorer.exe executed a NtOpenFile() system call, passing six arguments. As an example, the third argument (at 0x0205e258) is a pointer to a structure located at address 0x0205e274; this structure includes a data pointer (at offset 8) to 0x0205e2ac, which in turn has a data pointer (at offset 4) pointing to the Unicode string "\??\C:\Windows\System32\pnpui.dll".

Sample QTrace log file for a NtOpenFile system call (Windows 7)

We can assess the correctness of this information by checking Microsoft documentation for the prototype of this system call:

NTSTATUS ZwOpenFile(
  _Out_  PHANDLE FileHandle,
  _In_   ACCESS_MASK DesiredAccess,
  _In_   POBJECT_ATTRIBUTES ObjectAttributes,
  _Out_  PIO_STATUS_BLOCK IoStatusBlock,
  _In_   ULONG ShareAccess,
  _In_   ULONG OpenOptions
);

Third parameter of NtOpenFile() is actually a pointer to a OBJECT_ATTRIBUTES structure, which contains a UNICODE_STRING (at offset 8), that in turn contains a wide-character buffer (at offset 4) that provides the name of the file to be opened. Except for parameter names (which of course cannot be guessed), inferred arguments structure matches the official documentation. Obviously things are slightly more complicated than this, as there are some corner-cases that must be considered, but this should give a rough idea of the approach.

Example: Taint-tracking

Briefly, in our context system call B depends on system call A if one of the output arguments of A is eventually used as an input argument for B. The reason why we could be interested in recording this kind of information is that syscall B cannot be re-executed alone, as it probably leverages some resources created by system call A. Details about the dynamic taint-tracking module will be presented in a future blog post.

A very simple example of a data dependency between two system calls is provided by the next figure. Here NtQuerySymbolicLinkObject() operates on the handle of a symbolic link object; this handle value is provided through its first argument. As can be seen from the "taint" column for this system call, the HANDLE argument is tainted with label 257 (the one highlighted in red): this taint label identifies the system call that opened this handle, in this case NtOpenSymbolicLinkObject().

Data dependency between NtOpenSymbolicLinkObject() and NtQuerySymbolicLinkObject() system calls; the former defines a handle value (0x00000010) that is then used by the latter

Conclusions

This post just sketched out the basic idea behind QTrace and its architecture. QTrace is now available open source, but do not expect it to be bug-free ;-) In some future blog posts I will discuss QTrace internals more in detail, describing the syscall tracer, the taint-tracking engine and providing some use cases.

Tuesday, March 19, 2013

Owning Samsung phones for fun (...but with no profit :-))

I was planning to open a blog since some months, but I decided to do it only now, to summarize some of the findings of a quick look I gave at a couple of Samsung Android devices.

But let's start at the beginning. During last Christmas holidays I finally had some free time to try to better understand the inner workings of some Samsung devices, focusing on the manufacturer's customizations to the Android system. I confess I was quite surprised to see how many Samsung applications are included in the original firmware image, including several customizations to lots of Android core packages.