Monday, September 28, 2009

Paxos

Paxos is arguably the most practical distributed consensus algorithm for asynchronous systems with non-Byzantine faults. No deterministic algorithm can guarantee that consensus is always reached in an asynchronous distributed system (the FLP impossibility result), but Paxos comes pretty close to doing so in practice for non-Byzantine faults.

The best reference on Paxos is a paper by Butler Lampson titled "How to build a highly available system using consensus".

Paxos' creator Leslie Lamport has written two papers: "The Part-Time Parliament", which is the main detailed paper, and "Paxos Made Simple", which presents the same algorithm in simplified form. However, Lampson's paper provides very good context for Paxos: Why is consensus needed when designing an HA service? What makes Paxos a really good consensus algorithm for asynchronous, unreliable systems? How do you use Paxos in a _real_ distributed, replicated (HA) service? Google's Chubby lock service is a real-life application of Paxos to a real problem and gives interesting perspectives on where consensus fits in an HA service implementation.

For instance, Lampson shows how an HA service need not run the consensus protocol for each and every decision. Instead, it can use consensus once to elect a lease-based primary or grant a lock, which can then be used. Chubby also uses Paxos to provide a lock service rather than let clients run Paxos for every decision. This helps performance, since the lock owner or primary can pretty much do what it wants without having to consult other hosts. Another real-world benefit shows up when modifying existing non-HA code to be HA: it's easier to convert existing code to a distributed system using locks than to make it run consensus for every decision.

Lampson's paper and the paper on the Chubby lock service claim that the Paxos consensus algorithm and the consensus algorithm used in viewstamped replication, which was published around the same time, are equivalent. Viewstamped replication was used by Oki and Liskov to build a replicated file system where primary election happens through consensus. "Viewstamping" is used to make sure that even after a new primary is elected (and the view changes), events pending with the older primary are known to predate those seen by the new primary. The concept is similar to the "round number" in Paxos: each view is like a round, and viewstamps are monotonically increasing just like Paxos round numbers.

There was an MIT project led by Barbara Liskov around 1990 called Argus, which tried to create a programming model for writing applications that would then automatically be made highly available (HA) using a viewstamped-replication-based system. Argus staff (Sanjay Ghemawat) later joined Google, so MapReduce is probably inspired to some degree by Argus. MapReduce does not have Argus-style fault tolerance; it has basic fault tolerance implemented by the master.

Safety property of Consensus:

All non-failed processes agree on a single value v, and v must have been proposed by at least one process (without this requirement, the system could trivially "agree" on, say, the value 0 all the time).

The basic idea behind Paxos:

Consensus is easy if there are no faults. For example, two-phase commit: one process decides the outcome and asks all processes to prepare. If all processes agree, it issues a commit and the job is done. If the leader fails, though, the outcome is unclear. Another example is majority-based consensus, where there is no leader: all processes choose a value and broadcast it. If a majority agrees on the same value, there is consensus. But if there is no majority, or if some processes crash so that nobody knows whether consensus was reached, this algorithm fails.
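To make the majority-based approach concrete, here is a minimal sketch of a single leaderless voting round (the function and types are mine, purely illustrative). Note that it simply has no answer when there is no majority or when votes are missing:

#include <cstddef>
#include <map>
#include <optional>
#include <vector>

// One round of leaderless majority voting: every process broadcasts its
// proposed value, and a value proposed by a strict majority wins. Crashes
// and lost messages are not handled, which is exactly the weakness Paxos fixes.
std::optional<int> majority_vote(const std::vector<int>& proposals) {
    std::map<int, std::size_t> counts;
    for (int v : proposals) ++counts[v];
    for (const auto& [value, count] : counts) {
        if (2 * count > proposals.size()) return value;  // strict majority
    }
    return std::nullopt;  // no majority: this simple scheme is stuck
}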

Paxos's basic idea comes from this non-fault-tolerant majority consensus algorithm (sketched above). To make it fault-tolerant, Paxos repeats majority voting in numbered rounds until consensus is reached. The challenge comes from the asynchronous nature of the problem: it is difficult to know when the current voting round has failed to reach consensus and a new round needs to be started. In practice, some process starts a new round after waiting a certain time without seeing consensus (Paxos is protected against multiple processes starting rounds concurrently because round numbers are totally ordered, so one round takes precedence over the other and effectively kills it). But the decision that the previous round has failed can be wrong, since it is made by one process based on the messages it happens to receive from others (after consensus is reached, the result is typically broadcast for everyone to learn, though this is not needed for correctness). If some of those messages are lost, that decision can be incorrect. So there could be a process A that has witnessed consensus on some value while process B does not know about it; B then starts another round, possibly with a different value. If B's new round reaches consensus without taking A's knowledge into account, A would believe one value was chosen while the new round chooses a different one. So for Paxos to be safe, each new round must use a value that could possibly have been chosen in a previous round. The tricky part of Paxos is ensuring this property.

Paxos ensures this by having the leader of a round ask the other processes which values they have already accepted, and then deciding which value could possibly have been chosen in an earlier round (a minimal sketch follows the list below). Two key details make the algorithm work:
  1. Each process can only be participating in one round at a time.
  2. Round numbers are monotonically increasing so knowing which round is "previous" is trivial.
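Here is a minimal sketch of the acceptor side of that rule (single-decree, in-memory, with made-up names; real Paxos also persists this state to survive crashes and needs the leader's side as well). The acceptor refuses rounds older than the highest one it has joined, and it reports its last accepted value so that a new round's leader is forced to re-propose any value that may already have been chosen. Detail (1) shows up as promised_round_, and detail (2) as the round-number comparisons:

#include <cstdint>
#include <optional>
#include <utility>

// What an acceptor returns from phase 1: whether it joined the round, plus
// the (round, value) pair it last accepted, if any.
struct Promise {
    bool joined;
    std::optional<std::pair<std::uint64_t, int>> accepted;
};

class Acceptor {
    std::uint64_t promised_round_ = 0;                        // highest round joined
    std::optional<std::pair<std::uint64_t, int>> accepted_;   // last accepted (round, value)

public:
    // Phase 1 ("prepare"): the leader of round r asks the acceptor to join it.
    Promise prepare(std::uint64_t r) {
        if (r <= promised_round_) return {false, accepted_};  // stale round, refuse
        promised_round_ = r;
        return {true, accepted_};  // report any previously accepted value
    }

    // Phase 2 ("accept"): the leader of round r proposes value v. The leader
    // must have picked v as the value attached to the highest round number in
    // the promises it collected, falling back to its own value only if none.
    bool accept(std::uint64_t r, int v) {
        if (r < promised_round_) return false;  // superseded by a newer round
        promised_round_ = r;
        accepted_ = std::make_pair(r, v);
        return true;  // the value is chosen once a majority of acceptors do this
    }
};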

References:
  • Butler W. Lampson, How to Build a Highly Available System Using Consensus. 10th International Workshop on Distributed Algorithms (WDAG 1996)
  • Leslie Lamport, The Part-time Parliament, ACM Transactions on Computer Systems, May 1998
  • Leslie Lamport, Paxos Made Simple, http://research.microsoft.com/users/lamport/pubs/paxos-simple.pdf

Future Reading:
  • Jean-Philippe Martin, Lorenzo Alvisi, Fast Byzantine Paxos, http://www.cs.utexas.edu/users/lorenzo/papers/fab.pdf

Sunday, June 21, 2009

Notes from ISCA - ACLD 2009 Workshop

Jim Larus from Microsoft Research gave a nice overview of their research in the large-datacenter space. He pointed out the problem of limited bisection bandwidth and how it prevents applications from running across clusters. Microsoft is working on a mesh network called Monsoon that connects adjacent datacenters. Besides the mesh network, two of the key ideas in Monsoon are Valiant load balancing and source routing. Source routing keeps the switches simple and largely stateless and avoids hot spots. Mike Foss from Rice, who presented a poster later, was also working on something similar in principle that used source routing to its advantage.

Another problem Jim mentioned, which I also heard about in several other people's research, was putting idle machines into a low-energy sleep state in order to conserve power. There are interesting challenges here. For instance, a network packet or a timer interrupt is enough to "disturb" the sleep of a conventional operating system. Tickless Linux solves part of the problem. For the networking part, someone is working on a microcontroller that can handle most packets without involving the processor. Others were working on disabling Ethernet ports.

A direction MSR is exploring is "self-aware" and "self-managing" grid applications that automatically reconfigure themselves based on input and SLA requirements. It was great to hear that Jim's team is thinking about this problem, since I've been thinking about it for a while myself; I should read up on MSR's publications in this area. Ben Sigelman from Google pointed out later in the session that this is a hard problem in general and that there are less powerful but practical alternatives (like visualization).

Also, Jim was very optimistic about PCM (phase-change memory) as a more scalable and efficient alternative to flash. PCM is quite expensive today, but Jim felt it was only a matter of time before it becomes cost-effective. Chuck Thacker from MSR later showed some confidence in PCM as well but did not quite share Jim's enthusiasm.

This also came up in the next talk, by Sarah Bird, a grad student at UC Berkeley, about hardware performance counters. Sarah's research deals with providing a single consistent view across the many performance counters in a multicore machine, adding multicore-specific counters to the hardware, and doing all of this in a standard way across platforms. Ben Sigelman pointed out that integrating this information with a distributed tracing service might be useful. Sarah also talked about the Autotuning project they are working on at Berkeley. Several people in the audience argued that adding new hardware support is a bad idea, but Sarah countered that a previous project (PAPI), which aimed to solve the same problem in software, was a failure.

This was followed by the poster session. The next speaker was Richard Kaufman from HP Labs. Richard's point of view was that of a server supplier, as opposed to MSR's make-your-own-servers approach, so he cited several general trends in power and datacenter design.

The final speaker was Chuck Thacker. He presented the case for modular datacenters based on containers, several of which would be deployed in a "parking lot" style datacenter. He gave a good high-level overview of the technology required, including left-to-right cooling, network switches, and rack-space considerations. In response to an audience question, he seemed pretty determined not to violate the absolute symmetry of the system: all containers are meant to be equal, so you cannot have special containers with higher-level network switches, for example. I found a nice overview of Chuck's talk here.

Friday, May 29, 2009

C++ super simplified

Caution: This blog post is the result of a long and boring flight. Also, my "C++ age" is a little over a year, the same being true for Python.

In the beginning there was C. C has a static type system, which the compiler uses to verify type safety at compile time. It is possible to evade type checking by making raw memory accesses, but it's rarely a good idea to do so. Compare this with Python: in Python, type checking happens at runtime. It is type safe, but you find out about violations only at runtime.

A language's type system allows you to write code to an interface. Programming to an interface allows different components of a system to evolve separately: one can be changed without affecting the other. However, different languages, or specifically C, C++, and Python, differ in how the interface is enforced.

In traditional C, all code is checked against a predefined interface. It is generally tough to alter implementations without affecting multiple components or breaking the interface. Function pointers offer one way of decoupling interface from implementation, but the language lacks features to expose their full potential to the programmer.
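For illustration, a hand-rolled C-style interface built out of function pointers might look like the sketch below (the names are made up). It works, but the language gives you little help wiring it up or keeping it type-safe across components:

// A C-style "interface": a table of function pointers plus opaque state.
struct logger_ops {
    void (*log)(void* state, const char* msg);
    void (*flush)(void* state);
};

struct logger {
    const struct logger_ops* ops;  // which implementation to use
    void* state;                   // that implementation's private data
};

// Client code is written against the interface only; it never needs to know
// which implementation is behind the pointers.
static void log_message(struct logger* lg, const char* msg) {
    lg->ops->log(lg->state, msg);
}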

C++ formalizes and streamlines function pointers through polymorphism. Polymorphism allows multiple implementations of the same interface. The different implementations can be attached to an object at runtime, and they are all type-safe since the compiler verified them at compile time. Thus, C++ increases the life of an interface and improves isolation between components.
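A minimal sketch of the same interface expressed with C++ polymorphism (class names are mine): the compiler generates and type-checks the function-pointer table (the vtable) for you, and a different implementation can be attached to the client at runtime.

#include <iostream>

// The interface: any implementation must provide log().
class Logger {
public:
    virtual ~Logger() = default;
    virtual void log(const char* msg) = 0;
};

// One implementation; others can be swapped in without touching client code.
class ConsoleLogger : public Logger {
public:
    void log(const char* msg) override { std::cout << msg << '\n'; }
};

// Client code programmed against the interface, verified at compile time.
void run(Logger& logger) { logger.log("starting up"); }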

This in some ways is similar to Python's dynamic typing. Python programs are written to an implicit interface which provides maximum isolation between components and expands the possible implementations of an interface.

For example, consider the following piece of Python code:

def fn(arg1, arg2):
    if arg1.valid:
        return arg1.x
    else:
        return arg2.y

This method is programmed to an implicit interface of arg1 and arg2, viz., that arg1 has fields 'valid' and 'x', and arg2 has field 'y'. Beyond that it doesn't care what else is inside arg1 and arg2. Furthermore, it can accept far more implementations of arg1 and arg2 than, say, similar C code.

This fashion of defining interfaces between components is definitely more generic than polymorphism in C++. In particular, it allows for a richer set of objects to be passed between components (and hence improves component isolation?).

Fortunately, C++ provides a similar framework, viz., templates. A template is essentially programmed to the implicit interface of its template parameters. Just like Python, the C++ compiler does not care about properties of the objects beyond those used in the template code. Still, the compiler verifies that the implicit interface used in the template is valid for each instantiation of the template. (Strictly speaking, it is possible to influence how the compiler checks names in template code, e.g., by deferring name lookup to instantiation time with this-> or a using declaration; see Scott Meyers' Effective C++ Item 43 for details.)
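To make the parallel concrete, here is a sketch of the earlier Python function written as a C++ function template. The template body *is* the implicit interface: the first argument must have members valid and x, the second must have a member y (and here both members must be convertible to int), and the compiler checks exactly that at every instantiation.

// Implicit interface: arg1 has members 'valid' and 'x', arg2 has member 'y'.
// Nothing else about the argument types matters, just as in the Python version.
template <typename T1, typename T2>
int fn(const T1& arg1, const T2& arg2) {
    if (arg1.valid) {
        return arg1.x;
    } else {
        return arg2.y;
    }
}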

So, C++ does come close to Python's implicit interfaces and dynamic typing with features such as templates and polymorphism, albeit with their syntactic weirdness. Well, that explains most of C++, except "overloading" of course. I assert that the sole purpose of overloading is to facilitate template programming. All other uses of overloading can be attributed to programmer laziness (aka convenience). For example, consider the following code snippet.

struct A {
    int x;
};

struct B {
    int y[10];
};

template <typename T>
void fn(T arg) {
    if (/* found A */) {
        arg.x = 10;
    } else if (/* found B */) {
        arg.y[0] = 10;
    }
}

Will this compile? No. Is this something you might want to do? Maybe. The solution (to make it compile) is to use overloading.

void do_fn(A& arg) { arg.x = 10; }

void do_fn(B& arg) { arg.y[0] = 10; }

// The template's implicit interface is now just "do_fn(arg) must be callable";
// overload resolution picks the type-specific code at instantiation time.
template <typename T>
void fn(T& arg) { do_fn(arg); }

Since all code inside a template must adhere to the *same* implicit interface, an object whose implicit interface is a proper subset of the template's implicit interface may not compile, even though the usage is logically correct. The solution is to make the template's implicit interface the intersection of the implicit interfaces of all possible type arguments and move type-specific code into overloaded methods/classes.

Code Coverage and KLEE from OSDI 2008

I once wrote a coverage tool that, at runtime, randomly replaced a function call with one of the values its output could assume (as determined at compile time). I did this for "well tested" systems code and was pleasantly surprised by the number of bugs it discovered, especially on error paths. Of course, the number of false positives was rather high as well, and replacing entire function calls doesn't always work either.

A more "correct" approach, one with zero false positives, is to have the programmer inject faults in the code, perhaps randomly thereby forcing error paths to be taken. This still requires executing the code in development environments in order to trigger a whole lot of code paths. Also, there is a trade-off between coverage and code readability.

An intermediate approach is to do symbolic execution of the code and ensure that all code paths are exercised for all possible program inputs. That right there is an intractable problem, since the path space can be infinite and the input is unknown and hard to model. The paper by Cristian Cadar et al. on "KLEE" at OSDI 2008 describes a "new symbolic execution tool capable of automatically generating tests that achieve high coverage on a diverse set of complex and environmentally intensive programs". KLEE manages to get a handle on both the path explosion problem and the "environment problem".

Symbolic execution isn't new. What KLEE does differently is optimize the constraint set in order to speed up the constraint satisfiability checks performed at each "unsafe" access (viz., pointer dereferences, asserts, etc.). But first, let's take a step back and look at KLEE's architecture, which I find cooler than ice cream.

KLEE is modeled as an operating system for symbolic processes. Each symbolic process is in charge of a code path currently being explored by KLEE and contains a register file, stack, heap, program counter, and path condition - the state typically associated with a "real" process on a "real" operating system. Just like a "real" operating system picks one among several competing processes to schedule, KLEE's scheduler picks one symbolic process among many to symbolically execute. Of course, KLEE's "scheduler" has a different goal than a real OS scheduler: it aims to run those symbolic processes that are likely to offer the biggest improvements in coverage. Each symbolic process (called a "state" in KLEE lingo) has a current set of constraints on its symbolic inputs, which must be met for the current path to be reached. If KLEE detects a possible violation (e.g., a failing assert or an invalid memory access) under those constraints, it triggers an error and generates a concrete test case.
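For a feel of what a symbolic process is, here is a toy example in the style of the KLEE tutorials (klee_make_symbolic is KLEE's intrinsic for marking memory as symbolic; the rest of the program is made up). KLEE forks a symbolic process at each branch, accumulates the branch conditions as the path condition, and emits a concrete test case for any path that can trip the assert:

#include <assert.h>
#include <klee/klee.h>

int classify(int x) {
    if (x > 100) {             // path condition: x > 100
        if (x % 7 == 0) {      // path condition: x > 100 && x % 7 == 0
            assert(x != 105);  // KLEE solves the path condition, finds x = 105,
                               // and generates a failing test case for it
            return 2;
        }
        return 1;
    }
    return 0;                  // path condition: x <= 100
}

int main() {
    int x;
    klee_make_symbolic(&x, sizeof(x), "x");  // x becomes an unconstrained input
    return classify(x);
}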

Apparently, KLEE is a total rewrite of another tool by the same research group (EXE) which I'm not familiar with. But the interesting part is that in EXE's case, the individual processes were real operating system processes. Hmmm...

Symbolic execution of even small programs can result in path explosion, limiting the chances of finding true bugs. Moreover, most constraint solvers (KLEE uses STP) take quite a bit of time as well. KLEE's approach is to optimize away large parts of the state of each symbolic process before the constraints are passed on to the solver. This alone reduced the running time of an example in the paper by more than an order of magnitude. More importantly, they show that the running time of KLEE with their optimizations is nearly linear in the lines of code, compared to super-linear without them. Also, even with all the optimizations, KLEE does not sacrifice accuracy, so it never reports false positives - which IMHO is one of the key benefits of using a symbolic executor over other program analysis methods.

Additionally, they use some cool tricks in picking which symbolic-processes to execute in order to maximize coverage. For example, KLEE favors scheduling paths which are high up in the code tree, since they are likely to expose more unique paths.

The other trick used by KLEE is environment modelling. Instead of requiring the environment to be concretely specified, users can write models for the environment in plain C code. During (symbolic) execution, the model can generate inputs and error conditions. I did not get a good handle on the usefulness of environment modelling though, so I don't know how powerful it is in practice.

Overall, it seems like a pretty good value proposition from a programmer's point of view. Zero false positives, benefits proportional to analysis time (KLEE lets you specify how long it should run and uses that to optimize its scheduling behavior), and improved code coverage all seem like useful features. I wonder when these features will be available in some open source package, or how much one has to pay Coverity for this.

Friday, March 06, 2009

Emails Are Forever

I don't know about you but the way I access my emails is almost always through the Gmail search box, whether I'm looking for that email about someone's flashy new job (since I forgot where they are working) or my todo list from last week.  And that's primarily because of the amount and nature of data in my Gmail.  All the communication of any consequence is reflected in emails (including facebook, orkut scraps), and additionally I've moved a lot of note taking to Gmail as well - implemented by sending an email to myself.

With so much of my life depending on my online existence, I'm getting a bit paranoid about several what-if scenarios, like what if the online service maintaining my data goes bankrupt, etc.  Of course my case is not unique.  Millions of people worldwide have a lot of precious data sitting on computers spread out across the world, owned by a foreign corporation, and supported free of charge (e.g., free Gmail is partially supported by Google's online ad revenue).  But assuming some of this data will remain as valuable to me a decade or two later, I'd like to ensure that I don't leave all of it to the vagaries of a ticker symbol.

Maybe you would argue that Google and Yahoo and Microsoft are not going away anytime soon, and I believe you.  But fewer than 20% of the firms that existed two decades ago are still around, which weighs against that hypothesis.  Another possible argument is that a product getting phased out doesn't necessarily imply that user data will be lost (e.g., Yahoo Photos to Flickr); maybe user data will become all the more precious over time.  Still, I'd argue that the expected lifetime of your present online data will only go down over time.  So the big question is whether you care enough about _all_ that data or not.  In my case, I care about most of it.

A poor man's solution to immortalizing your data is to periodically "download" and archive it at your home, or better still at another online service.  Of course, switching services may affect the usability of the data, which affects its value to you.  E.g., if I zipped all my Gmail into one single archive, I couldn't search it freely like before.  It may still be useful for litigation purposes but not for looking up my roommate's phone number.  This brings us to the first rule:

1.  When archiving your data, the functionality supported by the archived copy should be comparable to the primary copy's functionality.  If not, it will probably be forgotten.

A more difficult problem for me is figuring out _where_ to archive or move my data, making the decision, verifying that it worked, and so on.  This is potentially very time consuming.  So what I'd want to see is a computer "program" that keeps track of the service-level trends of your current data provider, monitors/evaluates upcoming alternative services, and replicates your data across a "diverse portfolio" of service providers, so that over time your data is likely to survive somewhere - and can almost always be found on the new and upcoming service.

It may seem that I just borrowed a dialog from Star Trek and I can understand your sentiment.  Some key components that will be needed to create such a "program" include:
  • A service for evaluating and recommending data service providers.
  • A data API supported by different data service providers that enables easy migration between them.  For example, if you must migrate from a photo-sharing service to an email provider, you'd probably store a single photo album as a single email.  Someone will have to write such interfaces for all new services.
Before I conclude, I'd like to observe that this problem is similar to money management.  When you give your money to a money management firm, you expect them to keep reinvesting it in whatever is currently the most appropriate business.  Maybe that is too complex to delegate fully to a program, but I think keeping your data up and running forever should be simpler.