Wednesday, March 30, 2011

Building spec benchmarks with newer libstdc++

This post is mostly for the sake of others searching for the same problem.

I was having issues building the 483.xalancbmk benchmark with a not-so-new libstdc++ (4.3.4). There's a particular source file (FormatterToHTML.cpp) that uses memset without including string.h or cstring. This was the error message:

FormatterToHTML.cpp:139: error: 'memset' was not declared in this scope

The file depended on an include of 'vector' to pull in 'cstring' as a transitive dependency, which seems to have changed with the header cleanups in GCC 4.3's libstdc++. Since the SPEC benchmarks go so far as to checksum the benchmark sources before letting you run them, it was easier to add '#include <cstring>' to /usr/include/c++/4.3.4/vector than to fix the application. =(
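
For reference, the failure boils down to something like this minimal sketch (my reconstruction, not the actual SPEC source):

/* Under GCC 4.3's streamlined libstdc++ headers, <vector> no longer
 * pulls in <cstring>, so memset is undeclared here. */
#include <vector>

int main() {
    char buf[16];
    memset(buf, 0, sizeof(buf)); /* error: 'memset' was not declared in this scope */
    return 0;
}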

Saturday, March 26, 2011

Unladen Swallow Retrospective

I wrote this while I was at PyCon, but I kept revising it. Anyway, here goes. :)
As is apparent by now, no one has really been doing any work directly on Unladen Swallow or in porting it to py3k. Why not?

Lack of Sponsor Interest

The primary reason is that we weren't able to generate enough internal customers at Google. There are a few reasons for that:
  1. Most Python code at Google isn't performance critical. It's used mainly for tools and prototyping, and most user-facing applications are written in Java and C++.
  2. For those internal customers who were interested, deployment was too difficult: being a drop-in replacement was not enough. Any new Python implementation had to be turn-key. We thought building on CPython instead of starting fresh would sidestep this problem, because C extensions and SWIG code would just work. However, simply moving from the previous version of CPython to a 2.6-based Python turned out to be too difficult.
  3. Our potential customers eventually found other ways of solving their performance problems that they felt more comfortable deploying.
After my internship was over, I tried to make Unladen the focus of my Master's thesis at MIT, but my advisor felt that the gains so far were insufficient to have a big impact and that the techniques I wanted to implement were no longer considered novel: most of the feedback-directed techniques were implemented for Smalltalk by Urs Hölzle, and tracing for Java by Andreas Gal. That's not to say there are no novel techniques left to discover, but I didn't have any ideas at the time.

Lack of Personal Interest

Most of this was decided around Q1 of 2010. We still could have chosen to pursue it in our spare time, but at that point things looked a little different.
First of all, it's a lot less fun to work on a project by yourself than with other people, especially when it's unclear whether you'll even have users.
Secondly, a large part of the motivation for the project was that we felt PyPy would never try to support CPython C extension modules or SWIG-wrapped code. We were very surprised to see PyPy take steps in that direction, which somewhat obviated the need for a bolt-on JIT for CPython. Also, when the project was launched, PyPy didn't have x86_64 support, but they have since added it.
Finally, the signals we were getting from python-dev were not good. There was an assumption that if Unladen Swallow were landed in py3k, Google would be there to maintain it, which was no longer the case. If the merge had gone through, it likely would have been disabled by default and ripped out a year later after bitrotting. Only a few developers seemed excited about the new JIT. We never finished the merge, but our hope had been that, if we did, we could entice CPython developers to hack on the JIT.
So, with all that said about why none of us is working on Unladen anymore, what have we learned?

Lessons About LLVM

First of all, we explored a lot of the pros and cons of using LLVM for the JIT code generator. The initial choice to use LLVM was made because at the time none of us had significant experience with x86 assembly, and we really wanted to support x86 and x86_64, and potentially ARM down the road. There were also some investigations of beefing up psyco, which I believe were frustrated by the need for a good understanding of x86.
Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive, and the optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that.
An example of needing to apply high-level knowledge to code generation is optimizing Python stack access. LLVM will not fold loads from the Python stack across calls to external functions (i.e. the CPython runtime, so essentially all the time). We eventually wrote an alias analysis to solve this problem, but it's an example of what you have to do when you don't roll your own code generator.
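To make that concrete, the problem looks roughly like this (hypothetical names, not the actual Unladen Swallow code):

/* The JITed frame keeps Python values in an in-memory stack, and every
 * bytecode helper is an external call into the CPython runtime. */
extern "C" void runtime_helper(void); /* opaque to LLVM */

long redundant_load(long *py_stack) {
    long a = py_stack[0]; /* load from the Python stack */
    runtime_helper();     /* LLVM must assume this call may write py_stack... */
    long b = py_stack[0]; /* ...so this load cannot be folded into 'a' */
    return a + b;
}

The custom alias analysis essentially told LLVM that the runtime calls couldn't clobber those stack slots, which lets redundant loads like the second one fold away.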
LLVM also comes with other constraints. For example, LLVM doesn't really support back-patching, which PyPy uses for fixing up their guard side exits. It's also a fairly large dependency with high memory usage, though I would argue, based on the work Steven Noonan did for his GSoC project, that it could be reduced, especially considering that PyPy's memory usage had been higher.
I also spent the summer adding an interface between LLVM's JIT and gdb. This wasn't strictly necessary, but it was a nice tool. I'm not sure what the state of debugging PyPy is, but we may be able to take some of the lessons from that experience and apply them to PyPy.

Takeaways

Personally, before working on this project, I had taken a compiler class and an OS class, but this experience really brought together a lot of systems programming skills for me. I'm now quite experienced at using gdb, having hacked on it and run it under itself. I also know a lot more about x86, compiler optimization techniques, and JIT tricks, which I'm using extensively in my Master's thesis work.
I'm also proud of our macro-benchmark suite of real-world Python applications, which lives on: PyPy uses it for speed.pypy.org. In all the performance work I've done before and after Unladen, I have to say that our macro-benchmark suite was the most useful. Every performance change was easy to check with a before-and-after text snippet.
We also did a fair amount of good work contributing to LLVM, which other LLVM JIT projects, such as Parrot and Rubinius, can benefit from. For example, there used to be a 16 MB limitation on JITed code size, which I helped to fix. LLVM's JIT also has a gdb interface. Jeff also did a lot of work towards being able to inline C runtime functions into the JITed code, as well as fixing memory leaks and adding the TypeBuilder template for building C types in LLVM.
So, while I wish there had been more resources and the project could have continued, it was a great experience for me, and we made some material contributions to LLVM and a benchmark suite that live on.

Working remotely over a high-latency connection

Does anyone have any tips for working remotely with high network latency? My project is Linux/Windows only, so I work remotely on a beastly machine at CSAIL. It works great when I'm at CSAIL, but terribly from Palo Alto. My two solutions thus far have been:

Use vi over ssh. This sucks because vi then occupies my terminal, and every keystroke has to make a round trip over ssh, so typing is slow.

Use MacVim and OpenAFS (or any other network filesystem here, including sshfs). This works better for me, but when switching files it takes *forever* (> 15 seconds) to tab-complete filenames or open or save a file. I have to believe this is due to inefficiencies in AFS, because I can scp files faster than MacVim can write to AFS.

I think the ideal solution might look something like Dropbox, where I have a full copy of my source code on both machines, and I only wait for the network when I want to go run the tests. However, I'm doing the initial sync now, and it's taking forever, so I'm not sure this is a real solution.
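
For example, I imagine a crude rsync-based version of that workflow would look something like the following (paths made up):

rsync -az --delete ~/src/myproject/ csail:src/myproject/ && ssh csail 'cd src/myproject && make test'

That would keep editing local and only pay the latency cost when actually running the tests.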

I haven't tried using bcvi or vim's scp:// file protocol. I'm not enthusiastic about them because I tend to use vim like people use emacs, meaning I keep lots of buffers open and open new files from the same vim instance. I want to be able to use tab completion to open new files, and I don't think vim's scp protocol will let me do that, and if it does, it would suffer from the same latency issues.

Ideas?

Thursday, March 17, 2011

x86 register save performance

Most people who deal with machine code already know that out-of-order CISC chips like most x86 variants are extremely complicated, and optimizing for them has always been a bit of a black art. In my work on DynamoRIO, I'm trying to optimize instrumentation calls to straight C functions. If the function is a leaf, we can try to inline it. For now, I'm just switching stacks and saving the used registers inline.

There are two ways to do the register saves. Most C compilers, when they have to spill locals to the stack, will subtract the size of the frame from the stack pointer on entry and address the locals relative to the SP, so for register saves I could do the same. Alternatively, I could use the builtin push and pop opcodes. To decide, I ran a microbenchmark, and the push and pop opcodes won. Let's look at the assembly:

/* Compile with -DPUSH to get pushes/pops
* and without for movs. */
.globl main
main:
push %rbp
mov %rsp, %rbp
mov $500000000, %rax
.Lloop:
#ifdef PUSH
push %rax
push %rbx
/* etc... */
push %r15
/* Mid */
pop %r15
/* etc... */
pop %rbx
pop %rax
#else
lea -120(%rsp), %rsp /* Update rsp. */
mov %rax, 0x70(%rsp)
mov %rbx, 0x68(%rsp)
/* etc... */
mov %r15, 0x00(%rsp)
/* Mid */
mov 0x00(%rsp), %r15
/* etc... */
mov 0x68(%rsp), %rbx
mov 0x70(%rsp), %rax
lea 120(%rsp), %rsp /* Update rsp. */
#endif
dec %rax
test %rax, %rax
jnz .Lloop
leave
ret

So, we have a loop here that saves all the registers (I only need to save some for my purposes; this is just a micro-benchmark) in the two ways described. Here are the results of running both on a Core i7 I have access to:

Command:
gcc microbench_push_mov.S -DPUSH -o microbench_push_mov && time ./microbench_push_mov

User time:
trial 1 0m3.400s
trial 2 0m3.404s
trial 3 0m3.404s

Command:
gcc microbench_push_mov.S -o microbench_push_mov && time ./microbench_push_mov

User time:
trial 1 0m3.444s
trial 2 0m3.444s
trial 3 0m3.444s

So, the push/pop version is about 1% faster, which, depending on how you look at it, is unintuitive. At first, one would think that push and pop have to do more work, because they also update the rsp register in every instruction. Looking at the encoding, though, the code size of the movs is much larger, because an offset needs to be encoded in every instruction: "push %rax" is just 50, while "mov %rax, 0x70(%rsp)" is 48 89 44 24 70. I suppose doing the indirect addressing every cycle is also work, but it doesn't need to be managed as a write dependency the way the explicit stack pointer updates do. (My understanding is that modern x86 front ends also have a dedicated stack engine that tracks the implicit rsp adjustments from push and pop, which would hide most of that dependency.)

With that in mind, it makes better sense why most compilers generate pushes for saving callee-saved registers on entry, but use stack-offset addressing for locals. The encoding for pushes and pops is much smaller, and the execution is marginally faster. Locals, however, are accessed in an unpredictable order, so push and pop aren't really an option for them.
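
To make that concrete, a typical prologue and epilogue combine both techniques; a hand-written sketch (not actual compiler output) might look like:

push %rbx /* Callee-saved registers: compact push/pop. */
push %r12
sub $40, %rsp /* One SP adjustment reserves space for all locals. */
mov %edi, 12(%rsp) /* Locals are then addressed at fixed SP offsets. */
/* ... function body ... */
add $40, %rsp
pop %r12
pop %rbx
ret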