Friday, May 29, 2009

C++ super simplified

Caution: This blog post is the result of a long and boring flight. Also, my "C++ age" is a little over a year, and the same is true for Python.

In the beginning there was C. C has a static type system, which the compiler uses to verify type safety at compile time. It is possible to evade type checking by making raw memory accesses, but it is rarely a good idea to do so. Compare this with Python. In Python, type checking happens at runtime. It is type safe, but you find out about violations only at runtime.

A language's type system allows you to write code to an interface. Programming to an interface allows different components of a system to evolve separately: one can be changed without affecting the other. However, languages (specifically C, C++, and Python) differ in how the interface is enforced.

In traditional C, all code is checked against a predefined interface. It is generally tough to alter implementations without affecting multiple components or breaking the interface. Function pointers offer one way of doing this, but the language lacks features to expose their full potential to the programmer.

C++ formalizes and streamlines function pointers using polymorphism. Polymorphism allows multiple implementations of the same interface. The different implementations can be attached to an object at runtime, and they are all type-safe, since the compiler verified them at compile time. Thus, C++ increases the life of an interface and improves isolation between components.
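For instance, here is a bare-bones sketch (the Codec interface and its implementations are mine, made up for illustration) of programming to an interface via runtime polymorphism; run() only knows about Codec, so implementations can be swapped without touching the calling code:

#include <iostream>
#include <string>

struct Codec {                                   // the interface
    virtual ~Codec() {}
    virtual std::string encode(const std::string& s) const = 0;
};

struct Identity : public Codec {                 // one implementation
    std::string encode(const std::string& s) const { return s; }
};

struct Reverse : public Codec {                  // another implementation
    std::string encode(const std::string& s) const {
        return std::string(s.rbegin(), s.rend());
    }
};

void run(const Codec& c) {                       // written against the interface only
    std::cout << c.encode("hello") << std::endl;
}

int main() {
    run(Identity());                             // same call site, different behavior
    run(Reverse());
    return 0;
}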

This in some ways is similar to Python's dynamic typing. Python programs are written to an implicit interface which provides maximum isolation between components and expands the possible implementations of an interface.

For example, consider the following piece of Python code:

def fn(arg1, arg2):
    if arg1.valid:
        return arg1.x
    else:
        return arg2.y

This function is programmed to an implicit interface of arg1 and arg2, viz., that arg1 has fields 'valid' and 'x', and arg2 has a field 'y'. Beyond that it doesn't care about what else is inside arg1 and arg2. Furthermore, it can accept far more implementations of arg1 and arg2 than, say, similar C code could.

This fashion of defining interfaces between components is definitely more generic than polymorphism in C++. In particular, it allows for a richer set of objects to be passed between components (and hence improves component isolation?).

Fortunately, C++ provides a similar framework, viz., templates. A template is essentially programmed to the implicit interface of its template parameters. Just as in Python, the compiler does not care about properties of objects beyond those used in the template code. Still, the compiler verifies that the implicit interface used in the template is valid for each instantiation of the template (strictly speaking, it is possible to influence how the compiler checks names in template code using various tricks, e.g., the this-> prefix or using declarations; see Scott Meyers' Item #43 for details).
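As a sketch, the Python function above carries over to a template almost verbatim (Foo and Bar are made-up types): the template compiles for any pair of types whose members satisfy its implicit interface, and fails to compile otherwise.

struct Foo { bool valid; int x; };
struct Bar { int y; };

template <typename T1, typename T2>
int fn(const T1& arg1, const T2& arg2) {
    if (arg1.valid)                  // T1 must have 'valid' and 'x'
        return arg1.x;
    else
        return arg2.y;               // T2 must have 'y'
}

int main() {
    Foo f = { true, 42 };
    Bar b = { 7 };
    return fn(f, b);                 // instantiates fn<Foo, Bar>; both implicit interfaces check out
}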

So, C++ does come close to Python's implicit interfaces and dynamic typing with features such as templates and polymorphism, albeit with their syntactic weirdness. Well, that explains most of C++, except "overloading", of course. I assert that the sole purpose of overloading is to facilitate template programming. All other uses of overloading can be attributed to programmer laziness (aka convenience). For example, consider the following code snippet.

struct A {
    int x;
};

struct B {
    int y[10];
};

template <typename A_or_B>
void fn(A_or_B arg) {
    if (/* arg is an A */) {
        arg.x = 10;       // legal only when A_or_B is A
    } else if (/* arg is a B */) {
        arg.y[0] = 10;    // legal only when A_or_B is B
    }
}


Will this compile? No. Is this something you might want to do? Maybe. The solution (to make it compile) is to use overloading.

template <typename A_or_B>
void fn(A_or_B arg) { do_fn(arg); }   // forward to the overload that matches the argument's type

void do_fn(A arg) { arg.x = 10; }     // A-specific code

void do_fn(B arg) { arg.y[0] = 10; }  // B-specific code

Since all code inside a template must adhere to the *same* implicit interface, an object whose implicit interface is a proper subset of the template's implicit interface may not compile even though the usage is logically correct. The solution is to make the template's implicit interface the intersection of the implicit interfaces of all possible type arguments, and move type-specific code into overloaded methods/classes.
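Putting it together, a complete, compilable version of the sketch might look like this (I pass by reference here so that the assignments actually stick; the dispatch idea is unchanged):

struct A { int x; };        // same A and B as above
struct B { int y[10]; };

void do_fn(A& arg) { arg.x = 10; }      // A-specific code
void do_fn(B& arg) { arg.y[0] = 10; }   // B-specific code

template <typename A_or_B>
void fn(A_or_B& arg) { do_fn(arg); }    // uses only what A and B have in common

int main() {
    A a; B b;
    fn(a);                              // picks do_fn(A&)
    fn(b);                              // picks do_fn(B&)
    return 0;
}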

Code Coverage and KLEE from OSDI 2008

I once wrote a coverage tool that randomly replaced function calls at runtime with one of the values the call's result could assume (as determined at compile time). I did this for "well tested" systems code and was pleasantly surprised by the number of bugs it discovered, especially on error paths. Of course, the number of false positives was rather high as well, and replacing entire function calls doesn't always work either.

A more "correct" approach, one with zero false positives, is to have the programmer inject faults in the code, perhaps randomly thereby forcing error paths to be taken. This still requires executing the code in development environments in order to trigger a whole lot of code paths. Also, there is a trade-off between coverage and code readability.

An intermediate approach is to do symbolic execution of the code and ensure that all code paths are taken for all possible program inputs. That right there is an intractable problem, since the path space can be infinite and the input is unknown and hard to model. The paper by Cristian Cadar et al. on "KLEE" at OSDI 2008 describes a "new symbolic execution tool capable of automatically generating tests that achieve high coverage on a diverse set of complex and environmentally intensive programs". KLEE manages to get a handle on both the path explosion problem and the "environment problem".

Symbolic execution isn't new. What KLEE does differently is optimizing the constraint set in order to speed up the constraint satisfiability checks performed at each "unsafe" access (viz., pointer dereferences, asserts, etc.). But first, let's take a step back and look at KLEE's architecture, which I find cooler than ice cream.

KLEE is modeled as an operating system for symbolic processes. Each symbolic process is in charge of a code path currently being explored by KLEE, and contains a register file, stack, heap, program counter, and path condition: state typically associated with a "real" process on a "real" operating system. Just as a "real" operating system picks one among several competing processes to schedule, KLEE's scheduler picks one symbolic process among many to symbolically execute. Of course, KLEE's "scheduler" has a different goal than a real OS's scheduler: it runs those symbolic processes that are likely to offer the biggest improvements in coverage. Each symbolic process (called a "state" in KLEE lingo) carries a set of constraints on the environment, which must hold for the current path to be reached. If KLEE detects a violation of the constraint set, it triggers an error and generates a test case.
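To make this concrete, here is roughly what feeding a program to KLEE looks like from the user's side (this follows the style of KLEE's own examples; klee_make_symbolic marks a piece of memory as symbolic input): KLEE explores every feasible path through the code and emits a concrete test case for each one.

#include <klee/klee.h>

int get_sign(int x) {
    if (x == 0)
        return 0;
    if (x < 0)
        return -1;
    return 1;
}

int main() {
    int a;
    klee_make_symbolic(&a, sizeof(a), "a");   // 'a' can be anything
    return get_sign(a);                       // KLEE explores all three paths
}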

Apparently, KLEE is a total rewrite of another tool by the same research group (EXE), which I'm not familiar with. But the interesting part is that in EXE's case, the individual processes were real operating system processes. Hmmm...

Symbolic execution of even small programs can result in path explosion, thereby limiting the chances of finding true bugs. Moreover, most constraint solvers (KLEE uses STP) take quite a bit of time as well. KLEE's approach is to optimize away large parts of the state of each symbolic process before the constraints are passed on to the solver. This alone reduced the running time of an example in the paper by more than an order of magnitude. More importantly, they show that the running time of KLEE with their optimizations is nearly linear in the lines of code, compared to super-linear without them. Also, even with all the optimizations, KLEE does not sacrifice accuracy, so it never reports false positives, which IMHO is one of the key benefits of using a symbolic executor over other program analysis methods.

Additionally, they use some cool tricks in picking which symbolic processes to execute in order to maximize coverage. For example, KLEE favors scheduling paths that are high up in the code tree, since they are likely to expose more unique paths.

The other trick used by KLEE is environment modelling. Instead of requiring the environment to be concretely specified, users can create models of the environment, written in proper C code. During (symbolic) execution, the model can generate inputs and error conditions. I did not get a good handle on the usefulness of environment modelling though, so I don't know how powerful it is in practice.

Overall, it seems like a pretty good value proposition from a programmer's point of view. Zero false positives, benefits proportional to analysis time (KLEE allows you to specify how long it should run, and it uses that to optimize its scheduling behavior), and improved code coverage all seem like useful features. I wonder when these features will be available in some open source package, or how much one has to pay Coverity for this.