Monday, February 2, 2015

Fast coverage analysis for binary applications

(This post is a joint work with @joystick, see also his blog here)

Despite its simplicity, fuzzing has become an increasingly popular technique for finding software bugs (and possibly security vulnerabilities), especially when dealing with complex or closed-source applications. The recipe for a basic fuzzer is well known: pick your favorite target application and run it on “weird” inputs. Hopefully, one of these inputs will trigger a corner-case behavior that produces an externally observable side effect, such as a program crash.

The main drawback of this approach is that, at least in its naive form, it can trigger only very shallow program paths: it may take a humongous number of inputs to reach even slightly convoluted branches. For example, consider a statement such as "if (a == 31337) then { … }": to reach the "then" block, a completely blind fuzzer would need to guess the correct value of variable "a" out of 2**32 distinct possibilities (considering 32-bit integers), as in the minimal example below. To address this limitation, "smarter" testing approaches have been proposed. Among these, symbolic and concolic execution techniques have recently become quite popular among security researchers, but their application to real-world products is still questionable, especially because of the complexity explosion when dealing with the intricacies of real-world software. Thus, simple fuzzing is still widely used, and many researchers strive to make it more effective while, at the same time, keeping its overhead as close as possible to that of a native black-box approach.
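A minimal, illustrative version of such a pathological check:

/* a blind fuzzer reaches the then-block with probability 2**-32 per
   attempt, assuming a uniformly random 32-bit input */
void process(int a) {
  if (a == 31337) {
    /* deep, rarely-exercised code path */
  }
}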

Following this trend, @lcamtuf has recently demonstrated that simple fuzz testing can still identify dangerous bugs: his "american fuzzy lop" (afl) approach consists of running the target program over a carefully selected set of test cases, while observing it via compile-time instrumentation that monitors instruction-level coverage. In a nutshell, test cases that reach novel instructions are "more interesting", as they could also trigger different program behaviors.

We were recently discussing whether the afl approach could be applied to binary applications as well. The first problem concerns the monitoring phase: how do we compute the coverage achieved by a given input? Obviously, as the target application is binary-only, source-level instrumentation is not an option.

Coverage analysis of binary applications

Strictly speaking, for the sake of tracking the progress of our fuzzing process, an analysis at the basic-block level should suffice instead of instruction coverage: since basic blocks are uninterruptible, single-entry, single-exit sequences of instructions, knowing which blocks were executed tells us which instructions were executed, so the two are roughly equivalent (at least if we ignore asynchronous events).

As an example, the following x86 assembly snippet encodes a simple loop.

B1:
    xor %ecx, %ecx      # ecx = 0

B2:
    inc %ecx            # ecx++
    cmp $0x3, %ecx
    jb B2               # loop while ecx < 3

B3:
    ...

When monitoring the execution of this code, we would like to produce the basic block trace <B1, B2, B2, B2, B3>.

Monitoring via binary instrumentation

The problem of monitoring binary applications for coverage analysis was also recently discussed by other researchers (e.g., see @matalaz's presentation here), who suggested relying on dynamic binary instrumentation, e.g., using Pin, DynamoRIO, or Valgrind. More recently, other researchers have started to implement similar approaches to port "afl" to closed-source applications, relying on either Pin or QEMU (the latter has also been integrated into "afl" since version 1.31b).

But what about performance? One of the main strengths of random fuzzing approaches is their high execution rate, as fuzzed inputs are executed at nearly native speed. If the monitoring phase costs too much, the whole process slows down significantly.

As an example, consider this minimal Pin tool that inserts a callback before the execution of every program basic block (well, actually before every Pin "trace", i.e., a single-entry sequence of basic blocks, but we can ignore the difference here). Obviously this trivial tool cannot be considered a real process monitor, but it can certainly be used to estimate a lower bound on the overhead of this approach.

#include "pin.H"

VOID CallbackTrace(TRACE trace, VOID *v) {
   // Insert callback instructions here
}

int main(int argc, char **argv) {
  PIN_Init(argc, argv);
  TRACE_AddInstrumentFunction(CallbackTrace, 0);
  PIN_StartProgram();
  return 0;
}

Even such a trivial Pin tool introduces a significant overhead. As an example, running it over /bin/ls takes about 340ms on a normal laptop, a 100x overhead with respect to native execution. Pin experts can certainly reduce the performance penalty further with some tweaks, but the order of magnitude should not change very much: after all, dynamic binary translation tools have to decode, translate, instrument, and finally recompile the target code before executing it.

Introducing Intel BTS

Modern processors include sophisticated debug and performance monitoring facilities. Intel introduced these features with early Pentium processors and has continued to extend them in subsequent CPU models (see chapters 17 and 18 of the Intel manuals for details).

Among these facilities, Intel BTS ("Branch Trace Store") makes it possible to record a trace of the executed branch instructions to a memory buffer. In a nutshell, BTS records executed control-flow edges as (source, destination) pairs. This mechanism is quite customizable: it can be configured to generate an interrupt when the BTS buffer is almost full, to monitor only a specific privilege level (e.g., to track only user-space branches), or to limit the capture to selected branch types (e.g., indirect/conditional branches, returns, calls). These characteristics make BTS an attractive approach for performing branch-level coverage analysis of a binary application.

BTS is configured by writing settings to dedicated MSR registers (again, see the Intel manuals for the low-level details). These operations must be carried out by kernel-level code, so specific OS support is required to implement user-space monitor applications. Fortunately, Intel BTS is already supported by recent versions of the Linux performance monitoring subsystem, and is exposed to user space via the perf_event_open() system call (for a user-space client, see also the perf tool).
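To give a rough idea of the API, the following is a minimal sketch that requests such a branch event from user space. The attribute combination shown (a hardware branch-instructions event with a sample period of 1, sampling the instruction pointer and the target address) reflects our understanding of what lets the kernel service the event in hardware; the ring-buffer handling needed to actually consume the records is omitted.

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for perf_event_open() */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
  return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));

  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;       /* retired branches */
  attr.sample_period = 1;                                /* every single branch... */
  attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR;  /* ...as (source, destination) */
  attr.exclude_kernel = 1;                               /* user-space branches only */

  long fd = perf_event_open(&attr, getpid(), -1, -1, 0);
  if (fd < 0) {
    perror("perf_event_open");
    return 1;
  }

  /* mmap() fd to obtain the sample ring buffer, then parse
     PERF_RECORD_SAMPLE records from it (omitted) */
  printf("perf event fd: %ld\n", fd);
  return 0;
}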

Coverage analysis using Intel BTS

Although the Linux performance monitoring subsystem is documented quite well, details about how to use perf_event_open() specifically for controlling BTS are scarce, except for a few public examples: it is quite easy to invoke the API with improper parameters that force the subsystem to "fall back" on software-based performance monitoring, with significant performance penalties. Thus, during our experiments we developed a coverage analysis tool that leverages this API to perform hardware-assisted tracing of the basic blocks executed by a target application. Information about the executed basic blocks is dumped to a Google protobuf file, for easy post-processing. As an example, the following excerpt shows the tracing of /bin/ls and a human-readable dump of the resulting protobuf file:
$ ./bts_trace -f /dev/shm/ls.trace -- /bin/ls >/dev/null
[*] Got 50758 events (2486 CFG edges), serializing to /dev/shm/ls.trace

$ python trace.py /dev/shm/ls.trace
#### Trace '/dev/shm/ls.trace' ####
[/dev/shm/ls.trace] cmd: test, data: 0 bytes, time: 2015-01-29 19:21:58
hash: 2dc92a, edges(s): 2486, exception(s): 0

 - CFG edges
 [00402176 -> 0040217d] 1 hit
 [00402181 -> 00412513] 1 hit
 [00402196 -> 7f1d372dc2f0] 28 hit
 [004021c0 -> 004021c6] 1 hit
 [004021c0 -> 7f1d36b281b0] 7 hit
 [004021cb -> 00402190] 1 hit
 ...
 [7f1d372e0ac7 -> 7f1d372e0a30] 2 hit
 [7f1d372e0af7 -> 7f1d372e0a20] 2 hit
 [7f1d372e0b1b -> 7f1d372e0635] 4 hit

 - Memory regions
 [0x00400000, 0x0041bfff] /bin/ls
 [0x7f1d3625d000, 0x7f1d36479fff] /lib/x86_64-linux-gnu/libpthread-2.19.so
 [0x7f1d3647a000, 0x7f1d3667efff] /lib/x86_64-linux-gnu/libattr.so.1.1.0
 ...

Running our tool over /bin/ls takes about 90ms, roughly 1/4 of the time required by the Pin-based tracer we sketched above. Consider also that this figure includes the time spent creating and serializing the protobuf stream to file, while the Pin-based tracer performs none of these operations: the latter introduces a very high overhead with just the instrumentation required to trace basic blocks, without any auxiliary function for storing and eventually serializing the execution trace.

Both the BTS tracer and the trace viewer are available here, so feel free to give them a try! The current implementation is quite rough, so we believe there is still room for improvement.

Construction of the basic block "hit map"

Starting from the generated trace file, it is also possible to build the control-flow graph (CFG). Clearly, this approach yields a dynamic CFG only, i.e., a graph that includes just the control-flow edges observed in the concrete execution, but multiple execution traces could also be merged together to better approximate the static CFG. In addition, the information about how many times each basic block has been executed (i.e., the number of "hits") can be used to generate a "hit map" of basic block frequencies, possibly to guide the fuzz testing phase.
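For instance, the per-edge hit counters can be obtained with a simple aggregation pass over the (source, destination) pairs of the trace. The following sketch is purely illustrative (our tool works on the protobuf stream, and a linear scan is obviously not the most efficient choice):

#include <stddef.h>
#include <stdint.h>

#define MAX_EDGES 65536

struct edge {
  uint64_t src, dst;  /* branch source and destination addresses */
  uint64_t hits;      /* how many times this edge was traversed */
};

static struct edge edges[MAX_EDGES];
static size_t n_edges;

/* Record one (source, destination) pair from the branch trace */
static void record_edge(uint64_t src, uint64_t dst) {
  for (size_t i = 0; i < n_edges; i++) {
    if (edges[i].src == src && edges[i].dst == dst) {
      edges[i].hits++;        /* known edge: bump its hit counter */
      return;
    }
  }
  if (n_edges < MAX_EDGES) {  /* new edge: first hit */
    struct edge e = { src, dst, 1 };
    edges[n_edges++] = e;
  }
}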

As an example, consider the following C implementation of a classical binary search algorithm.

static int bisect(int v[], int size, int key) {
  int start, end, middle, pos;

  start = 0;
  end = size-1;
  pos = -1;
  while (start <= end) {
    middle = (end+start)/2;
    if (v[middle] > key) {
        end = middle-1;
    } else if (v[middle] < key) {
        start = middle+1;
    } else {
        pos = middle;
        break;
    }
  }

  return pos;
}

The CFG in Figure 1 is a "hit map" for a monitored execution of bisect() on a sorted array of 1024 elements. Although we have shown the C source for this function above, the CFG was constructed dynamically from the binary code alone. Colors reflect the number of node hits, i.e., how many times each node has been executed in the observed execution. Node and edge labels indicate the actual hit counts. Finally, it is worth noting that for conditional branches, whenever possible, we also add to the graph the edges that have not been taken during the observed execution; this is done just to better approximate the static CFG of the application, but it is technically not required to support a subsequent fuzzing phase.

To conclude, Figure 2 shows another example of a "hit map", this time for an execution of a recursive quicksort algorithm, running on a random array of 1024 integers (C code for this test case is available here).

Figure 1 - "Hit map" for the bisect() function (click on the image to enlarge)
Figure 2 - "Hit map" for an implementation of Quicksort (click on the image to enlarge)

Conclusions

Hardware-assisted performance monitoring could be an interesting approach for implementing efficient coverage analysis tools, which in turn can be employed to support fuzz testing and other security-related applications. To this end, we developed a sample tool that leverages the Intel "Branch Trace Store" facility. Our current implementation is little more than a proof-of-concept. Its major limitation is that, in some situations, we lose some branch events. This may happen when the branch rate is very high, such as when the monitored application enters a tight loop. We are still investigating these issues, and we hope some interested readers could also contribute with some patches ;-)

Monday, January 12, 2015

Time to fill OS X (Blue)tooth: Local privilege escalation vulnerabilities in Yosemite

(This post is a joint work with @joystick, see also his blog here)

Motivated by our previous findings, we performed some more tests on the IOBluetoothHCIController service of the latest version of Mac OS X (Yosemite 10.10.1), and we found five additional security issues. The issues have been reported to Apple Security and, since the deadline we agreed upon with them has expired, we now disclose details & PoCs for four of them (the last one was notified a few days later and is still under investigation by Apple). All the issues lie in class IOBluetoothHCIController, implemented in the IOBluetoothFamily kext (md5 e4123caff1b90b81d52d43f9c47fec8f).

Issue 1 (crash-issue1.c)

Many callback routines handled by IOBluetoothHCIController blindly dereference their pointer arguments without checking them. The caller of these callbacks, IOBluetoothHCIUserClient::SimpleDispatchWL(), may actually pass NULL pointers, which are eventually dereferenced.

More precisely, every user-space argument handled by SimpleDispatchWL() consists of a value and a size field (see crash-issue1.c for details). When a user-space client provides an argument with a NULL value but a large size, a subsequent call to IOMalloc(size) fails, returning a NULL pointer that is eventually passed to callees, causing the NULL pointer dereference.
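In pseudo-C, the faulty pattern looks roughly like the following (an illustrative reconstruction based on our analysis, not the actual kext source):

/* illustrative reconstruction, not the actual kext code */
void *buf = IOMalloc(arg->size);  /* huge size -> allocation fails, buf == NULL */
/* ... the return value of IOMalloc() is never checked ... */
callback(buf);                    /* the callee blindly dereferences buf */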

The PoC we provide targets method DispatchHCICreateConnection(), but the very same approach can be used to cause a kernel crash through other callback routines (basically any other callback that receives one or more pointer arguments). At first, we ruled out this issue as a mere local DoS. However, as discussed here, Yosemite only partially prevents mapping the NULL page from user space, so it is still possible to exploit NULL pointer dereferences to mount LPE attacks. For instance, the following code can be used to map page zero:

Mac:tmp $ cat zeropage.c
#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <mach/vm_map.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  mach_vm_address_t addr = 0;
  vm_deallocate(mach_task_self(), 0x0, 0x1000);
  int r = mach_vm_allocate(mach_task_self(), &addr, 0x1000, 0);
  printf("%08llx %d\n", addr, r);
  *((uint32_t *)addr) = 0x41414141;
  printf("%08x\n", *((uint32_t *)addr));
}
Mac:tmp $ llvm-gcc -Wall -o zeropage{,.c} -Wl,-pagezero_size,0 -m32
Mac:tmp $ ./zeropage
00000000 0
41414141
Mac:tmp $

Trying the same without the -m32 flag results in the 64-bit Mach-O being blocked at load time by the OS with the message "Cannot enforce a hard page-zero for ./zeropage" (unless you do it as "root", but then what’s the point?).

Issue 2 (crash-issue2.c)

As shown in the screenshot below, IOBluetoothHCIController::BluetoothHCIChangeLocalName() is affected by an "old-school" stack-based buffer overflow, due to a bcopy(src, dest, strlen(src)) call where src is fully controlled by the attacker. To the best of our knowledge, this bug cannot be directly exploited because of the existing stack canary protection. However, it may still be useful to mount an LPE attack if used in conjunction with a memory leak vulnerability that can be leveraged to disclose the canary value.

Issue 2, a plain stack-based buffer overflow
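The vulnerable pattern boils down to something like this (an illustrative sketch, not the actual code; the function name, buffer name, and size are hypothetical):

#include <string.h>
#include <strings.h>

/* illustrative sketch, not the actual kext code */
static void change_local_name(const char *src /* attacker-controlled */) {
  char name[256];                 /* hypothetical fixed-size stack buffer */
  bcopy(src, name, strlen(src));  /* the copy length depends on the source,
                                     not on the destination's capacity */
}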

Issue 3 (crash-issue3.c)

IOBluetoothHCIController::TransferACLPacketToHW() receives as an input parameter a pointer to an IOMemoryDescriptor object. The function carefully checks that the supplied pointer is non-NULL; however, regardless of the outcome of this test, it then dereferences the pointer (see the figure below; the attacker-controlled input parameter is stored in register r15). The IOMemoryDescriptor object is created by the caller (DispatchHCISendRawACLData()) using the IOMemoryDescriptor::withAddress() constructor. As this constructor is provided with a user-controlled value, it may fail and return a NULL pointer. See the discussion of Issue 1 regarding the exploitability of NULL pointer dereferences on Yosemite.

Issue 3, the module checks if r15 is NULL, but dereferences it anyway
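In other words, the control flow has roughly the following shape (a reconstruction for illustration; names other than the method itself are hypothetical):

/* illustrative reconstruction, not the actual kext code */
IOReturn TransferACLPacketToHW(IOMemoryDescriptor *md) {  /* md may be NULL */
  if (md != NULL) {
    /* ... bookkeeping performed only on the non-NULL path ... */
  }
  /* reached regardless of the outcome of the check above */
  return do_transfer(md->data, md->length);  /* NULL dereference when md == NULL */
}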

Issue 4 (lpe-issue1.c)

In this case, the problem is due to a missing sanity check on the arguments of the following function:

IOReturn BluetoothHCIWriteStoredLinkKey(
                           uint32_t req_index, 
                           uint32_t num_of_keys, 
                           BluetoothDeviceAddress *p_device_addresses, 
                           BluetoothKey *p_bluetooth_keys, 
                           BluetoothHCINumLinkKeysToWrite *outNumKeysWritten);

The first parameter, req_index, is used to look up an HCI request in the queue of allocated HCI requests (thus this exploit first requires filling the queue with possibly fake requests). The second integer parameter (num_of_keys) is used to calculate the total size of the other inputs, pointed to by p_device_addresses and p_bluetooth_keys, respectively. As shown in the screenshot below, these values are not checked before being passed to function IOBluetoothHCIController::SendHCIRequestFormatted(), which has the following prototype:

IOReturn SendHCIRequestFormatted(uint32_t req_index, uint16_t inOpCode,
                                 uint64_t outResultsSize,
                                 void *outResultBuf,
                                 const char *inFormat, ...);


Issue 4, an exploitable buffer overflow (click to enlarge)

The passed format string "HbbNN" eventually causes size_of_addresses bytes to be copied from p_device_addresses to the HCI request object identified by req_index in reverse order (the 'N' format consumes two arguments: the first is a size, the second a pointer to read from). If the calculated size_of_addresses is big enough (i.e., if we provide a big enough num_of_keys parameter), the copy overflows the buffer of the request, trashing everything above it, including a number of function pointers in the vtable of the request object. These pointers are overwritten with attacker-controlled data (i.e., the bytes pointed to by p_bluetooth_keys) and called before returning to user space, so we can divert execution wherever we want.
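Our mental model of the 'N' directive, as reconstructed while reversing SendHCIRequestFormatted(), is roughly the following (illustrative, not Apple's actual code):

#include <stdarg.h>
#include <stdint.h>

/* illustrative reconstruction: how we believe each 'N' directive consumes
   its two variadic arguments; dst points into the HCI request object
   selected by req_index */
static void handle_N_format(uint8_t *dst, va_list ap) {
  uint64_t size = va_arg(ap, uint64_t);   /* e.g., size_of_addresses  */
  uint8_t *src  = va_arg(ap, uint8_t *);  /* e.g., p_device_addresses */
  for (uint64_t i = 0; i < size; i++)
    dst[size - 1 - i] = src[i];  /* reverse copy, never checked against
                                    the destination buffer's capacity */
}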

As a PoC, lpe-issue1.c exploits this bug and attempts to call a function located at the attacker-controlled address 0x4141414142424242. Please note that the attached PoC requires some more tuning before it can cleanly return to user space, since more than one vtable pointer is corrupted during the overflow and needs to be fixed with valid pointers.

Execution of our "proof-of-concept" exploit (Issue 4)

Notes

All the PoCs we provide in this post are not "weaponized", i.e., they do not contain a real payload, nor do they attempt to bypass existing security features of Yosemite (e.g., kASLR and SMEP). If you’re interested in bypass techniques (as you probably are, if you made it this far), Ian Beer of Google Project Zero covered pretty much all of them in a very thorough blog post. In his case, he used a leak in the IOKit registry to calculate the kslide and defeat kASLR, and an in-kernel ROP chain to bypass SMEP. More recently, @i0n1c posted here about how kASLR is fundamentally broken on Mac OS X at the moment.

Conclusions

Along with the last issue we reported, we also shared with Apple our conclusions on this kext: based on the issues we identified, we speculate that it contains many other crashes and LPE vulnerabilities. Ours, however, is just a best-effort analysis done in our spare time, and given the very small effort it took us to identify these vulnerabilities, we would suggest a serious security evaluation of the whole kext code.

Disclosure timeline

  • 02/11: Notification of issues 1, 2 and 3.
  • 23/11: No answer received from Apple; notification of issue 4. As no answer was received since the first contact, we propose December 2 as a possible disclosure date.
  • 25/11: Apple answers requesting more time. We propose to move the disclosure date to January 12.
  • 27/11: Apple accepts the new deadline.
  • 05/12: Contact Apple asking for the status of the vulnerabilities.
  • 06/12: Apple says they're still "investigating the issue".
  • 23/12: Notification of a new issue (#5), proposing January 23 as a tentative disclosure date.
  • 06/01: Apple asks for more time for issue #5. We propose to move the disclosure date to February 23, and reiterate our intention to disclose the 4 previous issues on January 12.
  • 12/01: No answer from Apple, disclosing first 4 issues.
  • 27/01: Apple assigned CVE-2014-8837 to our issues and patched all of them in Mac OS X 10.10.2.