Monday, December 12, 2011

Stack crawling in DrMemory without frame pointers or debug info

I work on DrMemory, a dynamic instrumentation tool for finding memory bugs in native apps, similar to Memcheck, more widely known as just Valgrind. We have to be able to generate stack traces in a wide variety of interesting situations, and we want to do it quickly and without depending on slow, non-portable, hard-to-integrate libraries.

If you remember your early computer systems classes, you'll recall those exam questions about unwinding the stack by chasing the base pointer. The reality is that for years we've been stuck with x86_32, which has a measly 8 general purpose registers, and compilers have been itching to grab that tender little %ebp register. So, they took it, and now the world is much more complicated because walking the stack is a whole lot harder.

To unwind the stack, we need to answer the question: given the stack pointer (SP) and program counter (PC), what is the offset from the SP to the next return address? With that, we can repeat the question until we hit main (or NULL or whatever).
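
In pseudo-C, the whole walk is just that question asked in a loop. Here's a sketch; retaddr_offset_for, trace, and max_frames are hypothetical names standing in for whatever mechanism answers the question:

/* Sketch: generic unwind loop.  retaddr_offset_for() is hypothetical. */
char *cur_sp = sp;
void *cur_pc = pc;
size_t num_frames = 0;
while (cur_pc != NULL && num_frames < max_frames) {
    trace[num_frames++] = cur_pc;
    size_t offset = retaddr_offset_for(cur_pc, cur_sp); /* SP-to-retaddr offset */
    cur_pc = *(void **)(cur_sp + offset);               /* next return address */
    cur_sp += offset + sizeof(void *);                  /* step past the frame */
}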

Maintaining a huge table mapping any arbitrary module offset to a stack offset would waste a lot of space, so most systems use some form of state machine encoding that can compute the value on the fly, should you need to walk the stack. (Win64 SEH uses an interesting mechanism involving carefully constructed prologues which I'd like to learn more about.) If stack walking is rare, such as for an "exceptional" exception or when in the debugger, then the state machine (DWARF CFI) works great. But if you need to do it all the time and you need to do it on multiple platforms, it will be slow, you will need platform-specific code, or you will need to depend on some other library. Pulling arbitrary code into a tool that runs in the same address space as the application can often break isolation between the tool and the app, so we don't like to do it very often.

However, we're in dynamic binary instrumentation land, so we get to see every instruction as it's executed. The next most obvious solution for unwinding is what we call the "shadow stack" solution. Every time you see a call or ret instruction, you just push or pop the PC on to or off of the current thread's shadow stack. If you need to record lots of stack traces, this may be your best bet, especially since you can log these updates, at which point recording a trace is as easy as logging a timestamp. I believe this is what ThreadSanitizer does, because it needs to produce a stack trace not just for the current memory access, but also for the other memory access that races with it.
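
A minimal sketch of the idea, assuming the framework hands you a callback on every call and ret (all the names here are made up for illustration, and a real implementation would handle overflow more gracefully):

/* Sketch: per-thread shadow stack maintained from call/ret instrumentation. */
#define SHADOW_MAX 4096

typedef struct {
    void *retaddrs[SHADOW_MAX];
    int depth;
} shadow_stack_t;

/* On a call: retaddr is the address of the instruction after the call. */
void on_call(shadow_stack_t *ss, void *retaddr)
{
    if (ss->depth < SHADOW_MAX)                 /* drop frames past the cap */
        ss->retaddrs[ss->depth++] = retaddr;
}

/* On a ret: target is where the app actually returned to.  Popping until we
 * match tolerates longjmp and apps that clobber their return address. */
void on_ret(shadow_stack_t *ss, void *target)
{
    while (ss->depth > 0 && ss->retaddrs[ss->depth - 1] != target)
        ss->depth--;
    if (ss->depth > 0)
        ss->depth--;                            /* pop the matching frame */
}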

For a tool like DrMemory, we only need to store stack traces for every heap routine invocation, which is not nearly as often. We also don't want to unnecessarily slow down an app that performs many calls in a loop without calling any heap routines or creating any memory bug reports. You also have to worry about a lot of corner cases, such as an app that clobbers the return address to go somewhere else, or longjmp, or exception handling.

The solution we use is, I think, quite elegant. First, if there are frame pointers, we use them, because it's fast (like iterating a linked list that's in cache :). While the compiler may get a 10% perf boost from an extra register, we get a bigger boost by not doing a "scan", which is what I am about to describe.

If we don't have frame pointers, we start with the stack pointer, and scan each pointer-aligned cell for a value that looks like a pointer into executable memory. This part is obvious, and you can get Windbg to do the same thing and symbolize it to boot (anyone know how to do this in gdb?).

The problem is that code rarely initializes its entire stack frame. In any given frame, there is a lot of spill space for temporaries in unexecuted code, or arrays that are not filled. This means that stale return addresses from prior frames are still on the stack. The clever bit is realizing that we're already shadowing every memory write to track definedness information to find uninitialized reads, so we actually *know* which bytes are initialized and which are not. That means we can tell what is a probable return address and what is stale.
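
In rough pseudo-C, the scan (before the extra call-site filter described below) looks something like this; sp, stack_base, trace, shadow_is_defined, and is_in_executable_memory are illustrative names, not DrMemory's real interfaces:

/* Sketch: walk pointer-aligned slots from the SP toward the base of the stack,
 * keeping only values that are both initialized and point into code. */
size_t num_frames = 0;
for (void **slot = (void **)sp;
     (char *)slot < stack_base && num_frames < max_frames;
     slot++) {
    void *val = *slot;
    if (!shadow_is_defined(slot, sizeof(val)))  /* stale: app never wrote it */
        continue;
    if (!is_in_executable_memory(val))          /* not a code pointer */
        continue;
    trace[num_frames++] = val;
}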

Still, what if the app stores a function pointer to the stack? That will look like a code pointer and show up as an extra frame. The next bit of cleverness is where we take the candidate return address and look backwards in the code for likely encodings of a call instruction. This is pretty straightforward, as there are only a handful of encodings for direct and indirect calls, and we can just enumerate the opcodes along with their offsets. If it's a function pointer, this test should fail, unless we're really unlucky.
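
Here's a sketch of that filter, which plugs into the scan above as one more check. Only the most common call encodings are shown; the real DrMemory logic handles more cases (prefixes, far calls, SIB-with-disp8 forms, and so on) and has to verify that the bytes before the candidate are actually safe to read:

#include <stdbool.h>
#include <stddef.h>

/* Sketch: does the code just before "retaddr" decode as a call instruction? */
bool preceded_by_call(const unsigned char *retaddr)
{
    /* Direct near call: E8 rel32, 5 bytes. */
    if (retaddr[-5] == 0xE8)
        return true;
    /* Indirect call: FF /2, with common lengths of 2, 3, 6, and 7 bytes. */
    static const int lens[] = {2, 3, 6, 7};
    for (size_t i = 0; i < sizeof(lens) / sizeof(lens[0]); i++) {
        const unsigned char *pc = retaddr - lens[i];
        if (pc[0] == 0xFF && ((pc[1] >> 3) & 0x7) == 2)
            return true;
    }
    return false;
}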

And with that, we have our (mostly) reliable, (reasonably) performant, and (mostly) portable stack crawler! :) To my knowledge, Memcheck doesn't attempt any of this craziness, and encourages you to recompile with FPs if it gets a weird looking stack.

Edit: Comments on G+ are preferred, since I'll actually respond.

Thursday, August 11, 2011

Early injection of DBI frameworks on Linux

I just had an interesting discussion with Qin Zhao about how the various instrumentation frameworks out there achieve early injection. I haven't actually cross-checked any of this, so it may not be 100% accurate, but I thought it was interesting enough to write down.

Early injection is the ability of a binary instrumentation framework to interpret an application from the very first instruction. On Linux, applications do not start life in userspace at main. They start life at _start, and that usually hops into the dynamic loader to bring in the shared libraries such as libc. Usually this code is very similar for every application, and you don't need to instrument it to find uses of uninitialized memory in your application. However, it's possible you have an app that doesn't use the dynamic loader, or there really are bugs there you want to see. A related problem is how to load your own dynamic libraries without conflicting with the loader being used by the application. Typically the DBI framework will load two copies of libc: one for the application, and one for the framework and its libraries. DynamoRIO, Valgrind, and Pin all have different ways of loading and injecting with different tradeoffs.

DynamoRIO is the easiest, because we don't support early injection on Linux. (Windows is a different story which I can't speak to.) We simply set LD_LIBRARY_PATH and LD_PRELOAD in a shell script that wraps the application invocation. This way, the loader will pull in one of our shared libraries, and call _init, at which point we take control. Instead of returning back to the loader, we start interpreting from where we would have returned. For DynamoRIO's dynamic libraries, the most important of which is the actual tool that you want to run, i.e., your race detector or memory debugger, we have our own private loader which keeps its own libraries on the side and does all the memory mapping and relocations itself. This is a pain, since it practically reimplements all of the system loader, but we do have the ability to place restrictions on the kinds of libraries we load to make things easier.
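
The preload half of that is the standard constructor trick; here's a minimal sketch (the entry-point and environment-variable names are made up for illustration, not DynamoRIO's real interface):

#include <stdlib.h>

void framework_take_over(void);   /* hypothetical: never returns normally */

/* Sketch: a preloaded library whose init routine hands control to the DBI
 * framework, which then starts interpreting from where _init would return. */
__attribute__((constructor))
static void preload_init(void)
{
    if (getenv("FRAMEWORK_OPTIONS") != NULL)
        framework_take_over();
}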

Valgrind is the next easiest, since it's basically the inverse of the above strategy. The command "valgrind" is a proper binary, so the system loader kicks in and loads all of Valgrind's libraries for it. It can use dlopen to bring in the tools it wants. To load libraries for the application, Valgrind has a private loader similar to our own, except that it loads the application's libraries instead of Valgrind's.

Finally, Pin has the most interesting strategy. Pin does not have its own loader. Pin is also a proper binary. They start by loading all the libraries they ever want, including the tool you want to run, using the system loader (dlopen, etc). Then they fork a child process. The *parent*, not the child, sets up a call to exec to match the invocation of the application. Somehow, probably using ptrace or some other API, the child is able to pause the parent application before the first instruction and attach. The child then copies its memory map into the parent. If there are conflicts, it errors out. The child then redirects execution to Pin's entry point, which eventually starts interpreting the application from _start.

Pin's approach allows early injection without the maintenance burden of a private loader. However, it can cause conflicts with the application binary and requires heuristics to set preferred addresses which won't conflict. It also slows down startup, due to the extra forking and memory mapping and copying. Personally, I think this is pretty cool, even if it has some transparency issues.

Thursday, May 5, 2011

FLAGS dependencies on x86

One big annoyance I have with x86 in my work on DynamoRIO is the special FLAGS register. You can't read or write to it like a normal register, and 2/3 of the bits are reserved or always 0 or 1. The only easy way to get flags in or out is to use pushf and popf, but in DynamoRIO we can't use those without switching off the application stack, since it could be doing things like using the x86_64 Linux red zone. So, we try to avoid clobbering FLAGS in inserted code whenever possible. For example:

You wanted to subtract a constant? Instead of "sub $1, %rax" try "lea -1(%rax), %rax". Works for add, too.

With the more complex base, index, scale, and displacement form of the x86 memory operand, you can get even fancier, accomplishing small shifts and multiplies and register-to-register adds using the LEA instruction. You can even get a 3-operand add like so:
lea 0(%rax, %rbx, 1), %rcx

Which adds %rax and %rbx and stores the result in %rcx. Another old trick is multiplication by 5, which works like so:
lea 0(%rax, %rax, 4), %rax

This is powerful enough to perform register to register adds and simple arithmetic with constants, but we can't yet subtract a register from another. Unfortunately, SUB and NEG both touch FLAGS. However, the NOT instruction does not. This means, assuming %rax is dead, we can rewrite "sub %rax, %rbx" like so:
not %rax
lea 1(%rax, %rbx, 1), %rbx

Remembering how two's complement works, negation is equivalent to taking the complement and adding one. Addition is commutative, so we can fold the increment into the displacement field of the LEA.
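
If you want to convince yourself of the identity, here's a quick check in C (unsigned arithmetic wraps around, matching what the hardware does):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint64_t a = 0x1234, b = 0xdeadbeef;    /* arbitrary values */
    assert(b - a == ~a + b + 1);            /* what "not; lea 1(...)" computes */
    return 0;
}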

Finally, you can also use JRCXZ (JECXZ in 32-bit land) to perform a jump if RCX is zero, without looking at flags. Putting it all together, you can implement equality checks without changing flags like so (assuming %rcx and one of %rdi or %rsi are dead or have been saved):

Flags code:
cmp %rdi, %rsi
je .Llabel

No flags code:
not %rsi
lea 1(%rdi, %rsi, 1), %rcx
jrcxz .Llabel

I haven't checked, but I imagine the compare is faster. However, in DynamoRIO, we have to save the flags for the cmp, which requires three instructions to save and to restore. With that factored in, I have a feeling the no flags version may be preferable. I will have to implement the no-flags transformation and see if it performs better.

Unfortunately, the toy client I am developing with has code like this:
lea -1(%rdx), %rax
test %rax, %rdi
jz .Llabel

Which performs an alignment check (assuming %rdx is a power of two). I'm not sure how to implement a bitwise AND (which is what TEST does) without touching flags.

Wednesday, April 6, 2011

HipHop talk

I just went to a talk by Facebook HipHop engineer Haiping Zhao about various systems issues FB has had. There was a discussion of language issues with data movement in distributed systems and how to get that right, but I'm going to focus on HipHop, Facebook's open source PHP to C++ translator, since that's more up my alley.

Thinking carefully about HipHop, it seems to make better tradeoffs for large web deployments than the (at this point) traditional JIT approach. HipHop essentially trades compilation time, code size, and maximum performance for performance predictability and immediate startup performance. Furthermore, to someone in charge of deploying the system, static compilation is probably perceived as less likely to have bugs than JIT compilation, although I would argue against that perception.

The main argument against static compilation of dynamic languages is that you don't have enough information to generate efficient code, and that you need to be able to run it before compiling it. To address that, HipHop uses feedback-directed (aka profile-guided) optimization, even though PGO has never caught on in the C/C++ world. Essentially, they translate all their PHP to instrumented C++ to get the types of variables, and then run that on an instance with a model workload. The compile takes about 15 minutes to make a 1 GB binary (the gold linker helps), and the profiling takes about another 15 minutes. Then they re-do the compilation with the profile information. This allows optimizations like speculating the target of a call, and speculating the type of a variable as being int or string, the easy-to-optimize types.

I would argue that this approach won't have as good peak performance as a JIT, which only has to generate the specialized version of the code, and can do some more clever optimizations. Also, a JIT can always fall back to the interpreter for the general case. If your profiling ends up being wrong (unlikely), a JIT can respond appropriately.

The drawback of the JIT is that you don't know when or if your code is going to get compiled, and you have to wait for that to happen. Debugging performance problems with JITed code is tricky. On the other hand, there are many solid tools for profiling C++, and mapping that back to PHP doesn't sound too hard.

It's probably possible to do a little better in the static model by avoiding generating or loading code for slow paths that you believe will never actually be executed. I imagine right now the generated code looks like:

if (is_int(v)) { /* large int code */ }
else { /* massive general code */ }

The else branch is probably never taken. It seems like there should be techniques to either move the else branch far away from the hot code so that the cold code is never even faulted in from disk, or you could try to rig up a fallback to the interpreter, which I think would be much harder and not help your code size. On the other hand, branch predictors these days are good and moving code around is really only going to show up as instruction cache effects, which are pretty hard to predict. HipHop probably already uses __builtin_expect to tell g++ which branch it expects to take as well.
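
For reference, that hint would look something like this (a sketch reusing the placeholders from above):

if (__builtin_expect(is_int(v), 1)) { /* large int code: expected hot path */ }
else { /* massive general code: cold path */ }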

One other major thing I remembered during the talk is that optimizing PHP turns out to be much easier than optimizing Python. In Python, functions, classes, and globals in a module all share the same namespace. In PHP, variables start with a $ and functions do not. Short of eval, which HipHop forbids, there is no way to shadow a function without the compiler noticing that it might be shadowed. For example, this works:

if ($a) {
function foo() { return true; }
} else {
function foo() { return false; }
}

But this is not possible:

function foo() { return true; }
function bar() { return false; }
foo = bar; // No can do.

You can also invoke functions through variables, like so:

function foo() { return true; }
$func = "foo";
$bar = $func();

That will set $bar to true. This is generally rare and easy to implement by falling back to the standard dynamic lookup.

OTOH, local variable lookup seems like it sucks because PHP supports something like the following:

$which = $a ? "foo" : "bar";
$$which = "string";

Which will assign "string" to the variable $foo or $bar depending on whether $a is true or false. Hopefully the scoping rules restrict this to the scope of the local function, so you only need to worry about this feature in functions that use it.

Another feature of HipHop is that, once you forbid certain features like eval or using include to pull in generated or uploaded source from the filesystem, you get to do whole-program compilation. One of the major optimization blockers even for C and C++ is that the compiler only gets to see a translation unit (TU) at a time. HipHop gets to see the whole program, which makes it possible to do the checks I mentioned earlier, like making sure that there is only a single definition of a function 'foo' at top level which is always present, so you can statically bind it at the call site even if it's in a different source file.

Finally, Haiping mentioned that the massive binary produced is deployed to servers using BitTorrent, which surprisingly seems like exactly the right tool for the job. You could try to be more clever if you knew something about the network topology, or you could just use a general, well-tested tool and call it a day. =D

P.S. Hopefully none of this is considered FB confidential and Haiping doesn't get in any trouble. It was a very interesting talk!

Wednesday, March 30, 2011

Building spec benchmarks with newer libstdc++

This post is mostly for the sake of others searching with the same problem.

I was having issues building the 483.xalancbmk benchmark with a not-so-new libstdc++ (4.3.4). There's a particular source file (FormatterToHTML.cpp) that uses memset without including string.h or cstring. This was the error message:

FormatterToHTML.cpp:139: error: 'memset' was not declared in this scope

The file depended on its include of 'vector' to pull in 'cstring' as a transitive dependency, which seems to have changed. Since the SPEC benchmarks go so far as to checksum the benchmark sources before letting you run them, it was easier to add '#include <cstring>' to /usr/include/c++/4.3.4/vector than to fix the application. =(

Saturday, March 26, 2011

Unladen Swallow Retrospective

I wrote this while I was at PyCon, but I kept revising it. Anyway, here goes. :)
As is apparent by now, no one has really been doing any work directly on Unladen Swallow or in porting it to py3k. Why not?

Lack of Sponsor Interest

The primary reason is that we weren't able to generate enough internal customers at Google. There are a few reasons for that:
  1. Most Python code at Google isn't performance critical. It's used mainly for tools and prototyping, and most user-facing applications are written in Java and C++.
  2. For those internal customers who were interested, deployment was too difficult: being a replacement was not enough. Any new Python implementation had to be turn-key. We thought building on CPython instead of starting fresh would sidestep this problem because C extensions and SWIG code would just work. However, simply changing from the previous version of CPython to a 2.6-based Python was too difficult.
  3. Our potential customers eventually found other ways of solving their performance problems that they felt more comfortable deploying.
After my internship was over, I tried to make Unladen the focus of my Master's thesis at MIT, but my advisor felt that the gains so far were insufficient to have a big impact and that the techniques I wanted to implement were no longer considered novel. Most feedback techniques were implemented for Smalltalk by Urs Hölzle and tracing for Java by Andreas Gal. That's not to say there aren't novel techniques to be discovered, but I didn't have any ideas at the time.

Lack of Personal Interest

Most of this was decided around Q1 of 2010. We still could have chosen to pursue it in our spare time, but at that point things looked a little different.
First of all, it's a lot less fun to work on a project by yourself than with other people, especially if it's unclear if you'll even have users.
Secondly, a large part of the motivation for the project was that we felt like PyPy would never try to support CPython C extension modules or SWIG-wrapped code. We were very surprised to see PyPy take steps in that direction. That somewhat obviated the need to build a bolt-on JIT for CPython. Also, when the project was launched, PyPy didn't have x86_64 support, but in the meantime they have added it.
Finally, the signals we were getting from python-dev were not good. There was an assumption that if Unladen Swallow were landed in py3k, Google would be there to maintain it, which was no longer the case. Had the merge gone through, it is likely that it would have been disabled by default and ripped out a year later after bitrot. Only a few developers seemed excited about the new JIT. We never finished the merge, but our hope was that if we had, we could entice CPython developers to hack on the JIT.
So, with all that said for why none of us are working on Unladen anymore, what have we learned?

Lessons About LLVM

First of all, we explored a lot of the pros and cons of using LLVM for the JIT code generator. The initial choice to use LLVM was made because at the time none of us had significant experience with x86 assembly, and we really wanted to support x86 and x86_64 and potentially ARM down the road. There were also some investigations of beefing up psyco, which I believe were frustrated by the need for a good understanding of x86.
Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that.
An example of needing to apply high-level knowledge to code generation is optimizing the Python stack access. LLVM will not fold loads from the Python stack across calls to external functions (i.e., the CPython runtime, so all the time). We eventually wrote an alias analysis to solve this problem, but it's an example of what you have to do if you don't roll your own code generator.
LLVM also comes with other constraints. For example, LLVM doesn't really support back-patching, which PyPy uses for fixing up their guard side exits. It's a fairly large dependency with high memory usage, but I would argue, based on the work Steven Noonan did for his GSoC, that it could be reduced, especially considering that PyPy's memory usage had been higher.
I also spent the summer adding an interface between LLVM's JIT and gdb. This wasn't necessary, but it was a nice tool. I'm not sure what the state of debugging PyPy is, but we may be able to take some of the lessons from that experience and apply them to PyPy.

Take Aways

Personally, before working on this project, I had taken a compiler class and OS class, but this experience really brought together a lot of systems programming skills for me. I'm now quite experienced using gdb, having hacked on it and run it under itself. I also know a lot more about x86, compiler optimization techniques, and JIT tricks, which I'm using extensively in my Master's thesis work.
I'm also proud of our macro-benchmark suite of real-world Python applications, which lives on; PyPy uses it for speed.pypy.org. In all the performance work I've done before and after Unladen, I have to say that our macro benchmark suite was the most useful. Every performance change was easy to check with a before and after text snippet.
We also did a fair amount of good work contributing to LLVM, which other LLVM JIT projects, such as Parrot and Rubinius, can benefit from. For example, there used to be a 16 MB limitation on JITed code size, which I helped to fix. LLVM's JIT also has a gdb interface. Jeff also did a lot of work towards being able to inline C runtime functions into the JITed code, as well as fixing memory leaks and adding the TypeBuilder template for building C types in LLVM.
So, while I wish there were more resources and the project could live on, it was a great experience for me, and we did make some material contributions to LLVM and the benchmark suite that live on.

Working remotely over a high-latency connection

Does anyone have any tips for working remotely with high network latency? My project is Linux/Windows only, so I work remotely on a beastly machine at CSAIL. It works great when I'm at CSAIL, but terribly from Palo Alto. My two solutions thus far have been:

Use vi over ssh. This sucks because vi then occupies my terminal, and all my keystrokes are buffered over ssh so typing is slow.

Use MacVim and OpenAFS (or any other network filesystem here, including sshfs). This works better for me, but when switching files it takes *forever* (> 15 seconds) to tab-complete filenames or open or save a file. I have to believe this is due to inefficiencies in AFS, because I can scp files faster than MacVim can write to AFS.

I think the ideal solution might look something like Dropbox, where I have a full copy of my source code on both machines, and I only wait for the network when I want to go run the tests. However, I'm doing the initial sync now, and it's taking forever, so I'm not sure this is a real solution.

I haven't tried using bcvi or vim's scp:// file protocol. I'm not enthusiastic about them because I tend to use vim like people use emacs, meaning I keep lots of buffers open and open new files from the same vim instance. I want to be able to use tab completion to open new files, and I don't think vim's scp protocol will let me do that, and if it does, it would suffer from the same latency issues.

Ideas?

Thursday, March 17, 2011

x86 register save performance

Most people who deal with machine code already know that out-of-order CISC chips like most x86 variants are extremely complicated. Optimizing for them has always been a bit of a black art. In my work on DynamoRIO, I'm trying to optimize instrumentation calls to straight C functions. If it's a leaf function, we can try to inline it. For now, I'm just switching stacks and saving used registers inline. There are two ways you can do this. Most C compilers, when they have to spill locals to the stack, will subtract the size of the frame from the stack pointer on entry and address the locals relative to the SP. So, for register saves, I could do the same. However, I could also use the builtin push and pop opcodes. To decide, I did a microbenchmark, and the push and pop opcodes won. Let's look at the assembly:

/* Compile with -DPUSH to get pushes/pops
* and without for movs. */
.globl main
main:
push %rbp
mov %rsp, %rbp
mov $500000000, %rax
.Lloop:
#ifdef PUSH
push %rax
push %rbx
/* etc... */
push %r15
/* Mid */
pop %r15
/* etc... */
pop %rbx
pop %rax
#else
lea -120(%rsp), %rsp /* Update rsp. */
mov %rax, 0x70(%rsp)
mov %rbx, 0x68(%rsp)
/* etc... */
mov %r15, 0x00(%rsp)
/* Mid */
mov 0x00(%rsp), %r15
/* etc... */
mov 0x68(%rsp), %rbx
mov 0x70(%rsp), %rax
lea 120(%rsp), %rsp /* Update rsp. */
#endif
dec %rax
test %rax, %rax
jnz .Lloop
leave
ret

So, we have a loop here that saves all registers (I only need to save some for my purposes; this is just a micro-benchmark) in the two ways described. Here are the results for running both on a Core i7 I have access to:

Command:
gcc microbench_push_mov.S -DPUSH -o microbench_push_mov && time ./microbench_push_mov

User time:
trial 1 0m3.400s
trial 2 0m3.404s
trial 3 0m3.404s

Command:
gcc microbench_push_mov.S -o microbench_push_mov && time ./microbench_push_mov

User time:
trial 1 0m3.444s
trial 2 0m3.444s
trial 3 0m3.444s

So, the push/pop version is 1% faster, which, depending on how you look at it, is unintuitive. At first, one would think that push and pop have to do more work by updating the rsp register for each instruction. Looking at the encoding, I can say that the code size of the moves is much larger, because an offset needs to be encoded for every instruction. "push %rax" is just 50, while "mov %rax, 0x70(%rsp)" is 48 89 44 24 70. I suppose doing the indirect addressing every cycle is also work, but it doesn't need to be managed as a write dependency like the stack pointer updates.

With that in mind, it now makes better sense why most compilers generate pushes for saving callee-saved registers on entry, but use stack offset addressing for locals. The encoding for pushes and pops is much smaller, and the execution is marginally faster. Locals are accessed unpredictably, so push and pop aren't really an option.