
Sunday, June 21, 2009

Notes from ISCA - ACLD 2009 Workshop

Jim Larus from Microsoft Research gave a nice overview of their research in the large datacenter space. He pointed out the problem of limited bisection bandwidth and how it prevents applications from running across clusters. Microsoft is working on a mesh network called Monsoon connecting adjacent datacenters. Besides the mesh topology, two of the key ideas in Monsoon are Valiant Load Balancing and source routing. Source routing keeps the switches simple and largely stateless and avoids hot spots. Mike Foss from Rice, who presented a poster later, was working on something similar in principle that also used source routing to its advantage.

Another problem Jim mentioned, one I heard about in several other people's research as well, was putting idle machines into a low-energy sleep state in order to conserve power. There are interesting challenges here. For instance, a network packet or a timer interrupt is enough to "disturb" the sleep of a conventional operating system. Tickless Linux solves part of the problem. For the networking part, someone is working on a microcontroller that can handle most packets without involving the processor. Others were working on disabling Ethernet ports.

A direction MSR is exploring is "self-aware" and "self-managing" grid applications which automatically reconfigure themselves based on input and SLA requirements. It was great to hear that Jim's team is thinking about this problem, since I've been thinking about it for a while myself, and I should read up more on MSR's publications in this area. Ben Sigelman from Google pointed out later in the session that this is a hard problem in general and that there are less powerful but more practical alternatives (like visualization).

Also, Jim was very optimistic about PCM (Phase Change Memory) as a more scalable and efficient alternative to flash. PCM is quite expensive today, but Jim felt it was only a matter of time before it became cost effective. Chuck Thacker from MSR later also showed some confidence in PCM but did not quite share Jim's enthusiasm.

This also came up in the next talk, by Sarah Bird, a grad student at UCB working on hardware performance counters. Sarah's research deals with providing a single consistent view across the many performance counters in a multicore machine, adding multicore-specific counters to the hardware, and doing all this in a standard way across platforms. Ben Sigelman pointed out that integrating this information with a distributed tracing service might be useful. Sarah also talked about the Autotuning project they are working on at Berkeley. Several people in the audience argued that adding new hardware support is a bad idea, but Sarah pointed to a previous project (PAPI) that aimed to solve the same problem in software and failed.

This was followed by the poster session. The next speaker was Richard Kaufman from HP Labs. Richard's point of view was that of a server supplier, as opposed to MSR's make-your-own-servers approach, so he cited several general trends in power and datacenter design.

The final speaker was Chuck Thacker. He presented the case for modular datacenters built from containers, several of which would be deployed in a "parking lot" style datacenter. He gave a good high-level overview of the technology required, including left-to-right cooling, network switches, and rack-space considerations. In response to an audience question, he seemed pretty determined not to violate the absolute symmetry of the system: all containers are meant to be equal, so you cannot have, for example, special containers with higher-level network switches. I found a nice overview of Chuck's talk here.

Friday, May 29, 2009

Code Coverage and KLEE from OSDI 2008

I once wrote a coverage tool that, at runtime, randomly replaced function calls with one of the many values the call's output could assume (as determined at compile time). I did this for "well tested" systems code and was pleasantly surprised by the number of bugs it discovered, especially on error paths. Of course, the number of false positives was rather high as well, and replacing entire function calls doesn't always work either.
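A minimal sketch of that idea (illustrative only, not the actual tool; the function names and candidate return values below are made up): substitute a call's result with one of the values its type could plausibly assume, and watch which caller paths get exercised.

```python
import random

def substitute_call(possible_returns):
    """Stand-in for a call rewritten by the coverage tool: return one
    of the values the original call's output could have assumed."""
    return random.choice(possible_returns)

def caller():
    # Imagine this call site was rewritten by the tool; the candidate
    # values (including a rarely seen error code) are hypothetical.
    n = substitute_call([0, 1, -1, 2**31 - 1])
    if n < 0:
        return "error-path"       # the path normal testing rarely hits
    return "normal-path"

random.seed(7)
seen = {caller() for _ in range(200)}
# After enough runs, both the normal path and the error path show up.
```

The false positives the post mentions come from exactly this trick: a substituted value may be one the real call could never produce in that context, so a "bug" found this way still needs manual triage.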

A more "correct" approach, one with zero false positives, is to have the programmer inject faults into the code, perhaps randomly, thereby forcing error paths to be taken. This still requires executing the code in development environments in order to trigger a whole lot of code paths. Also, there is a trade-off between coverage and code readability.
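Programmer-driven fault injection can look roughly like this (a hypothetical sketch, not from any particular codebase): wrap a dependency so that it randomly fails, and count how often the caller's error-handling path actually runs.

```python
import random

def make_flaky(fn, fail_rate=0.25, exc=OSError):
    """Hypothetical fault injector: wrap fn so it randomly raises,
    forcing the caller's error-handling path to be exercised."""
    def wrapper(*args, **kwargs):
        if random.random() < fail_rate:
            raise exc("injected fault")
        return fn(*args, **kwargs)
    return wrapper

def fetch(key, backend):
    """Code under test: must degrade gracefully when the backend fails."""
    try:
        return backend(key)
    except OSError:
        return None               # the error path we want coverage on

random.seed(42)
flaky_backend = make_flaky(lambda k: k.upper())
results = [fetch("x", flaky_backend) for _ in range(1000)]
hits = results.count("X")
misses = results.count(None)
# With fail_rate=0.25, both the success and failure paths run many times.
```

Because the programmer chose exactly where the fault goes, any crash found this way is a real bug, which is the zero-false-positive property; the readability cost is the wrapper scaffolding itself.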

An intermediate approach is symbolic execution of the code, ensuring that all code paths are taken for all possible program inputs. That is an intractable problem in general: the path space can be infinite, and the input is unknown and hard to model. The paper by Cristian Cadar et al. on KLEE at OSDI 2008 describes a "new symbolic execution tool capable of automatically generating tests that achieve high coverage on a diverse set of complex and environmentally intensive programs". KLEE manages to get a handle on both the path explosion problem and the "environment problem".

Symbolic execution isn't new. What KLEE does differently is optimize the constraint set in order to speed up constraint-satisfiability tests at each "unsafe" operation (viz., pointer dereferences, asserts, etc.). But first, let's take a step back and look at KLEE's architecture, which I find cooler than ice cream.

KLEE is modeled as an operating system for symbolic processes. Each symbolic process is in charge of a code path currently being explored by KLEE and contains a register file, stack, heap, program counter, and path condition: state typically associated with a "real" process on a "real" operating system. Just as a real operating system picks one among several competing processes to schedule, KLEE's scheduler picks one symbolic process among many to execute symbolically. Of course, KLEE's scheduler has a different goal than a real OS scheduler: its goal is to run those symbolic processes that are likely to offer the biggest improvements in coverage. Each symbolic process (called a "state" in KLEE lingo) carries the set of constraints on the environment that must hold for the current path to be reached. If KLEE detects a violation of the constraint set, it triggers an error and generates a test case.
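The OS analogy can be sketched in miniature (illustrative only, nothing like KLEE's actual implementation): each "state" records a program counter plus the path condition needed to reach it, and a worklist loop plays the role of the scheduler, forking a state in two at every branch.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    """Analogue of KLEE's per-path state: a program counter plus the
    constraints that must hold for execution to reach this path."""
    pc: int
    constraints: list = field(default_factory=list)

def step(state):
    """Symbolically execute one instruction. The toy 'program' here is
    just a chain of three branches on symbolic inputs b0..b2; a real
    branch forks into two states, one per outcome."""
    if state.pc >= 3:
        return []                               # path terminated
    cond = f"b{state.pc}"
    return [State(state.pc + 1, state.constraints + [cond]),
            State(state.pc + 1, state.constraints + [f"!{cond}"])]

# The "OS" loop: pick a state, run it, collect any forked children.
worklist = [State(pc=0)]
completed = []
while worklist:
    s = worklist.pop()      # stand-in for KLEE's much smarter scheduler
    children = step(s)
    if not children:
        completed.append(s)
    else:
        worklist.extend(children)

# Three branches already yield 2**3 = 8 feasible paths: the
# path-explosion problem in miniature.
```

Each completed state's constraint list is exactly what a solver would turn into a concrete test case for that path.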

Apparently, KLEE is a total rewrite of another tool by the same research group (EXE) which I'm not familiar with. But the interesting part is that in EXE's case, the individual processes were real operating system processes. Hmmm...

Symbolic execution of even small programs can result in path explosion, thereby limiting the chances of finding true bugs. Moreover, most constraint solvers (KLEE uses STP) take quite a bit of time per query as well. KLEE's approach is to optimize away large parts of the state of each symbolic process before the constraints are passed on to the solver. This alone reduced the running time of an example in the paper by more than an order of magnitude. More importantly, they show that the running time of KLEE with their optimizations is nearly linear in the lines of code, as compared to super-linear without them. Also, even with all the optimizations, KLEE does not sacrifice accuracy, so it never reports false positives, which IMHO is one of the key benefits of using a symbolic executor over other program analysis methods.
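One of the solver-side optimizations the paper describes, constraint independence, is easy to sketch (a simplified illustration, not KLEE's code): constraints that share no variables, even transitively, with the query cannot affect its satisfiability, so they are dropped before the solver is called.

```python
def relevant_constraints(constraints, query_vars):
    """Keep only constraints transitively connected to the query's
    variables; independent constraints cannot change the answer.
    Each constraint is a (variable_set, expression_string) pair."""
    relevant = []
    frontier = set(query_vars)
    changed = True
    while changed:
        changed = False
        for c_vars, expr in constraints:
            if expr not in relevant and c_vars & frontier:
                relevant.append(expr)
                frontier |= c_vars       # pull in transitively linked vars
                changed = True
    return relevant

constraints = [({"x"}, "x > 0"),
               ({"x", "y"}, "y == x + 1"),
               ({"z"}, "z < 10")]        # shares nothing with x or y
pruned = relevant_constraints(constraints, {"x"})
# A query about x needs the x and y constraints, but "z < 10" is dropped,
# shrinking the formula handed to the solver.
```

Pruning like this is safe precisely because it never changes satisfiability, which is why the speedups come without introducing false positives.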

Additionally, they use some cool tricks in picking which symbolic processes to execute in order to maximize coverage. For example, KLEE favors scheduling paths that are high up in the code tree, since they are likely to expose more unique paths.
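A toy version of such a heuristic (hypothetical; KLEE's real searchers are more sophisticated and interleave several strategies) is simply a priority queue keyed on each state's depth in the execution tree, so shallow states, which fork into more unexplored subtrees, get picked first.

```python
import heapq

def pick_next(worklist):
    """Pop the shallowest state. Entries are (depth, tiebreak_id, state);
    the id keeps comparisons well-defined when depths are equal."""
    return heapq.heappop(worklist)

worklist = []
for i, depth in enumerate([5, 2, 7, 1]):   # made-up depths
    heapq.heappush(worklist, (depth, i, f"state-{i}"))

order = [pick_next(worklist)[2] for _ in range(4)]
# Shallowest first: state-3 (depth 1), then state-1 (depth 2), and so on.
```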

The other trick used by KLEE is environment modelling. Instead of requiring the environment to be concretely specified, users can write models of the environment in proper C code. During (symbolic) execution, the model can generate inputs and error conditions. I did not get a good handle on the usefulness of environment modelling, though, so I don't know how powerful it is in practice.
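The flavour of environment modelling can be sketched like this (illustrative Python; KLEE's real models are C code interpreted by the symbolic executor): instead of one concrete file or socket, the model enumerates the behaviours a system call could exhibit, so every caller reaction gets explored, including the error handling.

```python
def model_read_outcomes(nbytes):
    """Hypothetical environment model for a read()-like call: yield
    every behaviour the executor should explore, not one concrete one."""
    yield ("error", -1)             # I/O error path
    yield ("eof", b"")              # end-of-file path
    yield ("data", b"A" * nbytes)   # successful full-read path

def handle(outcome):
    """Code under test: must cope with every outcome the model produces."""
    kind, val = outcome
    if kind == "error":
        return "handled-error"
    if kind == "eof":
        return "handled-eof"
    return f"got-{len(val)}-bytes"

# The executor drives the caller once per modelled outcome.
results = [handle(o) for o in model_read_outcomes(4)]
```

The payoff is that error conditions which are nearly impossible to trigger with a real filesystem (a failing read mid-file, say) become just another enumerated branch.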

Overall, it seems like a pretty good value proposition from a programmer's point of view. Zero false positives, benefits proportional to analysis time (KLEE lets you specify how long it should run and uses that to optimize its scheduling behavior), and improved code coverage all seem like useful features. I wonder when these features will be available in some open source package, or how much one has to pay Coverity for this.