Tuesday, January 03, 2017

Trip to Maui

This is a travelog of a trip that Manisha, Amarya, and I took to Maui in December 2016.

First, some geological trivia to get your bearings.  The island of Maui is founded on two volcanoes of the Hawaiian volcanic chain.  The older volcano, now extinct, lies on the western side of the island and is considerably eroded, lush, and green.  The newer volcano, Haleakala, lies to the east and is an imposing presence, rising more than 10,000 ft above sea level.  It is the largest dormant volcano in the world, about 1 million years old, with its last eruption in 1790, and it stretches a whopping 25,000 ft from the surrounding seabed!  Its slopes are home to rich flora and fauna and many of Maui's key attractions.  The island itself consists of these two volcanoes and the isthmus between them, formed through their erosion.

The Hawaiian volcanic chain is the result of a rare geological phenomenon called a "hotspot" combined with the gradual movement of the Pacific tectonic plate.  The hotspot continuously spews out lava, and as the Pacific plate drifts over it, new volcanoes are created periodically, roughly evenly spaced from one another.  Interestingly, as a volcano moves away from the hotspot, it goes through a cycle of aging that ultimately ends with it collapsing completely below sea level, leaving only the surrounding reef as a sign that it ever existed!  The Hawaiian hotspot has been active for the past 85 million years, during which it has generated 129 volcanoes, most of which are below sea level today.

That's enough on the geology!  Our trip started on Dec 20, 2016, when we reached Maui late in the evening from SF.  The first two days of our stay were in Kahului, which is located on the plains of the West Maui mountains.  Dec 21 started early with breakfast at Vanta Cafe, after which we headed over to Maui Tropical Plantation in Wailuku, also located on the plains of the West Maui mountains.  Crisscrossing the plantation, we saw a variety of flora including pineapples, jackfruit, bananas, coconuts, and mangoes.  Macadamia nut trees were the real novelty here.  After getting a bite of local papayas at the organic plantation store and tasting real Maui coffee, we headed over to Kula.

Kula is a large agrarian community located on the northwestern slopes of Haleakala.  This side of Haleakala slopes gradually from the summit down to the plains, offering a vast area of temperate climate, ideal for farming.  Our first stop in Kula was Kumu Farms.  This was our first encounter with the breathtaking views of the plains below and the ocean that you can enjoy from Kula, views that became a regular feature of all our farm visits over the next few days.  Kumu Farms had strawberries, bananas, lots of goats, some sheep too if I recall, ponies, cows, and rabbits.  It was a kid-friendly place with a play area.

We left Kumu Farms at about 3 PM and drove straight up to the Haleakala summit.  The summit area has good tourist infrastructure with a visitor center, a large parking lot, restrooms, etc.  Parking at the highest point is tight though, and several folks just left their cars on the narrow one-lane street as they tried to catch the sunset.  From the summit, you can see Haleakala's many craters, each of which rises like a mini volcano around the main crater.  At this altitude there is very little vegetation, and the surrounding terrain, dotted with craters and black soil, looks very extraterrestrial.  Watching the sunset from the top of Haleakala was an unforgettable experience, and much easier to pull off than the sunrise.  We told ourselves that we'd be back some day to bike down from the crater to the plains, which is a popular activity.  We drove down to the plains after sunset, had dinner at Tante's, and that was it for the day.

The next morning we started our day with the Ali'i Kula Lavender farm in Kula, one of the main attractions in Maui.  This was a serene place with beautifully planned lavender beds and other flora on a gradual slope.  The store had some interesting novelty items (we tried the lavender scone, lavender honey, and lavender tea), and the grounds offered the same breathtaking views of the ocean that we had seen the day before.

After hanging out at the fragrant farm, we headed further down to the Surfing Goat Dairy.  This is an excellent place for kids.  Amarya couldn't stop playing with the baby goats, and I finished a full goat cheese platter made for six people while Manisha enjoyed her Maui coffee.  We learned about the goats, toured the farm, and loved the concept of the "Surfing Goat"!  Afterwards, we headed towards Lahaina, not realizing how busy the highway would be around sunset, and eventually turned back and retired early for the day.

The next day, December 23rd, I spent the first half of the day snorkeling with Ed Robinson Diving Adventures / Maui Snorkeling at Ulua beach in Wailea.  The reef here is richer with life than anywhere else I can recall.  The surprise element was running into two full-grown turtles.  I spent some time trying to understand one of them (and gave up), and swam with the second one.  Thanks to our guide, Tobey, for making this a memorable experience.

When I got back, we started on the road to Hana.  Our first stop was the surfing town of Paia, where we had lunch at the Paia Fish Market and enjoyed some local kombucha nearby.  We also saw some sugarcane plantations along the way, which have a rich political and economic history that Manisha got very interested in.  After lunch, we started towards Hana, the goal being to make it to the "Halfway to Hana" stop, which is 17 miles from there.

Hana is on the eastern tip of the island.  The road to Hana from Paia goes along the northern and eastern shore.  These sides of Haleakala see a lot of rain and are steeper than the Kula slopes.  The result is thick tropical rain forest in all its majesty.  It really is a sight to see - stop at any of the designated hike spots and wander through the woods.  Since Hawaii doesn't have snakes, walking through these tropical rain forests with no fear feels quite unreal!

After a few short hikes, our first major stop was the Garden of Eden Arboretum.  This was a well-planned paradise set in thick tropical rain forest, with peacocks and ducks as added attractions.  It is a highly recommended stopover on the way.  The garden also offers great views of Keopuka rock, which was featured in the opening scene of Jurassic Park.  After some great shots of the three of us in this tropical paradise, we continued on the Hana highway.  The highway is challenging to drive, but it's possible to be safe and enjoy the trip at the same time.

Finally, when we got to the "Halfway to Hana" stop, it had just closed; the time was 4:05 PM, I think.  Manisha used her negotiation skills and got us their famous banana bread anyway, which started a long obsession with banana breads, but more on that later.  We headed back from here and retired early for the day, spending the rest of the evening at Nalu's cafe in Kihei, near our resort, the Maui Sunset.

We started the 24th of December again at Ulua beach, but this time with Manisha and Amarya too.  Go early (before 10 AM) to find parking - a lesson we learned the hard way.  Fortunately, parking at the nearby Wailea shopping center is an option.  Manisha and Amarya had some nice quiet time at the beach while I went looking for the same fishes in the same spots, meeting Tobey again!  When we got back to our resort, I was able to get some "Maui Gold" pineapple, which I had been craving all along, and made a quick pineapple smoothie.

Next, we left for Mama's Fish House to try our luck without a reservation.  Fortunately, Manisha got us a table again in quick time, and what followed was one of our most enjoyable meals out.  The fish here is caught fresh daily, the menu changes every day, and it even carries the day's date along with details of the fisherman and the boat for each fish!

This being our second-to-last day, we decided to take it slow afterwards, spent some time at a mall in Kahului, and retired to our resort.  The next morning, we discovered that our resort had a great paddle boarding location; alas, it was too late for us as we had just checked out (but something for next time).  We started the day at the Maui Ocean Center, sipping Maui coffee during breaks.

We learned that Hawaii has never been connected to any other landmass, even during ice ages, due to its distance (~2300 miles from the nearest landmass) and the depth of the ocean in between.  The result is that all reef fish and wildlife have evolved in isolation.  The fish and coral larvae (plankton) most likely arrived with ocean currents.  Since the odds of surviving a 2300-mile journey and regrowing are slim, Hawaii's reefs have a much smaller variety of fish (500 species compared to the 2500 found in Southeast Asia).  The aquarium also had a very interesting presentation on the complex life and food chain of plankton.

Our next stop was the Maui Dragonfruit Farm in Lahaina, one of the few places open during Christmas.  The farm was great, and our very knowledgeable guide showed us a wide variety of flora, described life on the farm amid hurricanes and strong winds and the challenges of keeping it economically viable, and pointed out the many breathtaking views from the property.  We also learned from her about WWOOF (World Wide Opportunities on Organic Farms), an exchange program for working on organic farms across the world.

From here, we headed deeper into downtown Lahaina.  During our lunch, we were honored by a guest appearance from a turtle, which completely thrilled all of us!  We walked down the waterfront, enjoying some breathtaking views of the sunset, with Manisha indulging in the art galleries, and ended up at the Lahaina Banyan Tree - a huge banyan stretching over two acres.  Exploring its various branches with Amarya was a lot of fun!

That brought our trip to an end.  It was one of the most memorable trips that Manisha, Amarya, and I have ever undertaken.  Maui truly is a unique tropical paradise, and exploring its diverse natural riches is an adventure of a lifetime.

Monday, June 14, 2010

Copy cat web services

Recently I've been thinking along a tangent from a previous post, Emails Are Forever. The scenario is one we are all too familiar with -- as users of various Internet services, we are continuously generating a lot of digital content (e.g., emails, pictures, conversations on your Facebook wall, our tweets and the tweets of people we follow, etc.). Sometimes we don't care whether some content will still be around a few days or weeks after it is "consumed", e.g., random comments you might have left on someone's picture on Facebook. But on other occasions, you really hope that it is, e.g., pictures you uploaded to Facebook of which you have no other copies in the "cloud". However, what's the guarantee that your data will be around a few weeks, months, or years from now?

It depends. Some web services are more likely to be around a few years from now than others. If you really care about something, you can maintain an offline backup at home or elsewhere in the cloud, e.g., using S3. But there are several issues with backups. Firstly, not all kinds of digital content you might have created using a web service can be easily tar'd up. Secondly, for highly dynamic content, such as emails, it's non-trivial to keep the backup in sync with the primary. Thirdly, although the backup is always "around", it's rarely called into service; when it finally is, you may discover that the data was lost to bad disks (e.g., if using home storage), that you misplaced your credentials in the case of an online service, or that someone compromised your online account. I know some people who maintain 4+ copies of all their pictures and actively manage the storage; I cannot imagine doing that. Wouldn't it be nice if all "replicas" of the primary data had the same or similar properties as the primary, so you could simply use one or the other?

I'll jump to the problem statement now. For each web service, build an alternative service that stores the same user-generated content as the primary service, has roughly the same functionality, and offers a high level of durability for the content duplicated from the primary. Reliability of the alternate service, measured in terms of downtime, is perhaps not that important. Durability of the alternate service is not necessarily tied to that of the primary; simply stated, it should be as durable as possible.

How can this be done? Honestly, I don't know. However, what I do know is that it should be done in a way that does not stifle innovation in creating new web apps and services. Also, it will probably require an open source framework that makes it easy to develop such alternate services. Finally, it's not clear to what extent the functionality of an existing web service can be duplicated without key support from the provider itself. E.g., think of a funky new picture editing and sharing service that also lets you store your pictures online. Does it make sense to duplicate the entire picture editing feature set in an alternate service simply so that we can store and access all the same data as the primary?

Monday, September 28, 2009

Paxos

Paxos is arguably the most efficient distributed consensus algorithm for asynchronous systems with non-Byzantine faults. There is no deterministic algorithm that always achieves consensus successfully in an asynchronous distributed system (the FLP impossibility result), but Paxos comes pretty close to doing so in practice for non-Byzantine faults.

The best reference on Paxos is a paper by Butler Lampson titled "How to Build a Highly Available System Using Consensus".

Paxos' creator Leslie Lamport has written two papers: "The Part-Time Parliament", which is the main detailed paper, and "Paxos Made Simple", which presents the same thing in simplified form. However, Lampson's paper provides very good context for Paxos: Why is consensus needed when designing an HA service? What makes Paxos a really good consensus algorithm for asynchronous, unreliable systems? How do you use Paxos in a _real_ distributed, replicated (HA) service? Google's Chubby lock service is a real-life application of Paxos to a real problem and gives interesting perspectives on where consensus fits in an HA service implementation.

For instance, Lampson shows how an HA service need not use consensus for each and every decision. Instead, it can use consensus once to elect a lease-based primary or to grant a lock, which can then be used. Chubby likewise uses Paxos to provide a lock service rather than let clients use Paxos for every consensus decision. This helps performance, since the lock owner or primary can pretty much do what it wants without having to consult other hosts. Another real-world benefit shows up when modifying existing non-HA code to be HA: it's easier to convert existing code into a distributed system using locks than to make it use consensus for every decision.

Lampson's paper and the paper on the Chubby lock service claim that the Paxos consensus algorithm and the consensus algorithm used in Viewstamped Replication, which was published around the same time, are equivalent. Viewstamped Replication was used by Oki and Liskov to build a replicated file system in which primary election happens through consensus. "Viewstamping" is used to make sure that even after a new primary is elected (and the view changes), events pending with the older primary are known to predate those seen by the new primary. The concept is similar to the "round number" in Paxos: each view is like a round number, and viewstamps are monotonically increasing just like Paxos round numbers.

There was an MIT project led by Barbara Liskov around 1990 called Argus, which tried to create a programming model for writing applications that would then automatically be made highly available (HA) using a system based on Viewstamped Replication. Argus staff (Sanjay Ghemawat) later joined Google, so MapReduce is probably inspired to some degree by Argus. MapReduce does not have Argus-style fault tolerance; it has basic fault tolerance implemented by the master.

Safety property of Consensus:

All non-failed processes agree on a value v that is proposed by at least one process (v must be proposed by some process; otherwise the system could, e.g., trivially agree on the value 0 all the time).

The basic idea behind Paxos:

Consensus is easy if there are no faults. For example, two-phase commit: one process decides the outcome and asks all processes to prepare. If all processes agree, it issues a commit and the job is done. But if the leader fails, the outcome is not clear. Another example is majority-based consensus: there is no leader; all processes choose a value and broadcast it. If a majority agrees on the same value, there is consensus. But if there is no majority, or if some processes crash so that nobody knows whether consensus was reached, this algorithm fails.

Paxos's basic idea comes from this non-fault-tolerant majority consensus algorithm. To make it fault-tolerant, Paxos repeats majority consensus until there is consensus. However, the challenge comes from the asynchronous nature of the problem, which makes it difficult to know when the current voting round has failed to reach consensus and hence when a new round needs to be started. In practice, some process will start a new round after waiting for a certain time if consensus has not been reached by then. (Paxos is protected against multiple processes starting rounds because all round numbers are ordered, so one round takes precedence over the other, effectively killing it.) But the decision about the perceived failure of the previous round can be wrong, since it is made by some process based on the messages it receives from other processes (after consensus is reached, the result is typically broadcast for everyone to know, though this is not needed for correctness). So if some messages are lost, that decision can be incorrect. Thus there could be a process A which has witnessed that there has been a consensus on some value while process B doesn't know it. So B starts another round, possibly using a different value. If B's new request to A is lost, A will stick to its older consensus value while the new consensus could be on a different value. So for Paxos to work, each new round must use a value for which there could possibly have been a consensus in a previous round. The tricky part of Paxos is ensuring this property.

Paxos ensures this by having the leader process of a round ask other processes for their past values and decide for which value there could possibly have been a consensus. Two key details that make the algorithm work are:
  1. Each process can only be participating in one round at a time.
  2. Round numbers are monotonically increasing so knowing which round is "previous" is trivial.
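
To make the round structure concrete, here is a minimal single-decree sketch in C++. It is only an in-memory illustration, not a real implementation: there is no networking, no persistence of acceptor state, and no retry loop, and the Accepted/Acceptor/RunRound names are mine, chosen for readability.

#include <optional>
#include <string>
#include <vector>

// Illustrative only: a real implementation would exchange these messages over
// the network and persist acceptor state to stable storage before replying.
struct Accepted { int round; std::string value; };

struct PrepareReply {
    bool promised = false;                // did the acceptor promise round n?
    std::optional<Accepted> accepted;     // highest (round, value) accepted so far
};

class Acceptor {
public:
    // Phase 1: promise to ignore rounds lower than n, and report any value
    // accepted in an earlier round.
    PrepareReply Prepare(int n) {
        if (n <= promised_round_) return {};   // reject: promised a higher round
        promised_round_ = n;
        return {true, accepted_};
    }
    // Phase 2: accept (n, v) unless a higher round has been promised meanwhile.
    bool Accept(int n, const std::string& v) {
        if (n < promised_round_) return false;
        promised_round_ = n;
        accepted_ = Accepted{n, v};
        return true;
    }
private:
    int promised_round_ = -1;
    std::optional<Accepted> accepted_;
};

// One round of single-decree Paxos, driven by one proposer over in-memory
// acceptors.  Returns the chosen value, or nothing if the round failed.
std::optional<std::string> RunRound(std::vector<Acceptor>& acceptors,
                                    int round, const std::string& my_value) {
    const size_t majority = acceptors.size() / 2 + 1;

    // Phase 1: collect promises and remember the value from the highest
    // previously accepted round, if any.
    size_t promises = 0;
    std::optional<Accepted> highest;
    for (Acceptor& a : acceptors) {
        PrepareReply reply = a.Prepare(round);
        if (!reply.promised) continue;
        ++promises;
        if (reply.accepted &&
            (!highest || reply.accepted->round > highest->round)) {
            highest = reply.accepted;
        }
    }
    if (promises < majority) return std::nullopt;   // retry with a higher round

    // Key safety rule: if some value may already have been chosen, propose it
    // again instead of our own value.
    const std::string value = highest ? highest->value : my_value;

    // Phase 2: a majority of accepts means the value is chosen.
    size_t accepts = 0;
    for (Acceptor& a : acceptors) {
        if (a.Accept(round, value)) ++accepts;
    }
    if (accepts >= majority) return value;
    return std::nullopt;
}

A proposer whose round fails to reach a majority in either phase simply retries with a higher round number; the Prepare phase guarantees that any value which might already have been chosen is carried into the new round, which is exactly the property discussed above.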

References:
  • Butler W. Lampson, How to Build a Highly Available System Using Consensus. 10th International Workshop on Distributed Algorithms (WDAG 1996)
  • Leslie Lamport, The Part-Time Parliament, ACM Transactions on Computer Systems, May 1998
  • Leslie Lamport, Paxos Made Simple, http://research.microsoft.com/users/lamport/pubs/paxos-simple.pdf

Future Reading:
  • Jean-Philippe Martin, Lorenzo Alvisi, Fast Byzantine Paxos, http://www.cs.utexas.edu/users/lorenzo/papers/fab.pdf

Sunday, June 21, 2009

Notes from ISCA - ACLD 2009 Workshop

Jim Larus from Microsoft Research gave a nice overview of their research in the large datacenter space. He pointed out the problem of limited bisection bandwidth and how it prevents applications from running across clusters. Microsoft is working on a mesh network called Monsoon connecting adjacent datacenters. Besides the mesh network, two of the key ideas in Monsoon are Valiant Load Balancing and source routing. Source routing keeps the switches simple and largely stateless and avoids hot spots. Mike Foss from Rice, who presented a poster later, was also working on something similar in principle that employed source routing to its advantage.

Another problem that Jim mentioned, and which I heard about in several other people's research, was putting idle machines into a low-energy sleep state in order to conserve power. There are interesting challenges here. For instance, a network packet or a timer interrupt is sufficient to "disturb" the sleep of conventional operating systems. Tickless Linux solves part of the problem. For the networking part of the problem, someone is working on a microcontroller that can handle most of this without involving the processor. Others were working on disabling Ethernet ports.

A direction MSR is exploring is "self-aware" and "self-managing" grid applications that automatically reconfigure themselves based on input and SLA requirements. It was great to hear that Jim's team is thinking about this problem, since I've been thinking about it for a while myself, and I should read up more on MSR's publications in this area. Ben Sigelman from Google pointed out later in the session that this is a hard problem in general and that there are less powerful but practical alternatives (like visualization).

Also, Jim was very optimistic about PCM (Phase Change Memory) as a more scalable and efficient alternative to flash. PCM is quite expensive today, but Jim felt it was only a matter of time before it became cost effective. Chuck Thacker from MSR also later showed some confidence in PCM but did not quite share Jim's enthusiasm.

This also came up in the next talk, by Sarah Bird, a grad student at UC Berkeley, about hardware performance counters. Sarah's research deals with providing a single consistent view across the many performance counters in a multicore machine, plus adding many multicore-specific counters to the hardware -- and doing all this in a standard way across platforms. Ben Sigelman pointed out that integrating this information with a distributed tracing service might be useful. Sarah also talked about the Autotuning project they are working on at Berkeley. Several people in the audience argued that adding new hardware support is a bad idea, but Sarah pointed out that a previous project (PAPI), which aimed to solve the same problem in software, was a failure.

This was followed by the poster session. The next speaker was Richard Kaufman from HP Labs. Richard's point of view was that of a server supplier, as compared to MSR's make-your-own-servers approach, so he cited several general-purpose trends in power and datacenter design.

The final speaker was Chuck Thacker. He presented the case for modular datacenters based on containers, several of which would be deployed in a "parking lot" style datacenter. He gave a good high-level overview of the technology required, including left-to-right cooling, network switches, and rack space considerations. In response to an audience question, he seemed pretty determined not to violate the absolute symmetry of the system -- all containers are meant to be equal, so you cannot have special containers with higher-level network switches, for example. I found a nice overview of Chuck's talk here.

Friday, May 29, 2009

C++ super simplified

Caution: This blog post is the result of a long and boring flight. Also, my "C++ age" is a little over a year, and the same is true for Python.

In the beginning there was C. C has a static type system which the compiler uses to verify type safety at compile time. It is possible to evade type checking by making raw memory accesses, but it's rarely a good idea to do so. Compare this with Python. In Python, type checking happens at runtime. It is type safe, but you find out about violations only at runtime.

A language's type system allows you to write code to an interface. Programming to an interface allows different components of a system to evolve separately: one can be changed without affecting the other. However, different languages, or specifically C, C++, and Python, differ in how the interface is enforced.

In traditional C, all code is checked against a predefined interface. It is generally tough to alter implementations without affecting multiple components or breaking the interface. Function pointers offer one way of doing this, but the language lacks the features to expose their full potential to the programmer.

C++ formalizes and streamlines function pointers using polymorphism. Polymorphism allows multiple implementations of the same interface. The different implementations can be attached to an object at runtime, and they are all type-safe since the compiler has verified them at compile time. Thus, C++ increases the life of an interface and improves isolation between components.
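
As a minimal sketch of what this buys you (the Shape/Circle/Square names are just made up for illustration):

#include <memory>
#include <vector>

// The interface: callers program against Shape, never a concrete type.
class Shape {
public:
    virtual ~Shape() = default;
    virtual double Area() const = 0;
};

// Two independent implementations of the same interface.
class Circle : public Shape {
public:
    explicit Circle(double r) : r_(r) {}
    double Area() const override { return 3.14159265 * r_ * r_; }
private:
    double r_;
};

class Square : public Shape {
public:
    explicit Square(double s) : s_(s) {}
    double Area() const override { return s_ * s_; }
private:
    double s_;
};

// Written once against the interface; which implementation runs is decided at
// runtime, but the compiler has already checked every type involved.
double TotalArea(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double total = 0;
    for (const std::unique_ptr<Shape>& s : shapes) {
        total += s->Area();
    }
    return total;
}

TotalArea is written once against the Shape interface; new implementations can be added later without touching it, and the compiler still checks every type involved.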

This in some ways is similar to Python's dynamic typing. Python programs are written to an implicit interface which provides maximum isolation between components and expands the possible implementations of an interface.

For example, consider the following piece of Python code:

def fn(arg1, arg2):
    if arg1.valid:
        return arg1.x
    else:
        return arg2.y

This method is programmed to an implicit interface of arg1 and arg2, viz., that arg1 has fields 'valid' and 'x', and arg2 has a field 'y'. Beyond that, it doesn't care what else is inside arg1 and arg2. Furthermore, it can accept far more implementations of arg1 and arg2 than, say, similar C code.

This fashion of defining interfaces between components is definitely more generic than polymorphism in C++. In particular, it allows for a richer set of objects to be passed between components (and hence improves component isolation?).

Fortunately, C++ provides a similar framework, viz., templates. A template is essentially programmed to the implicit interface of its template parameters. Just like Python, the C++ compiler does not care about properties of the objects outside of those used in the template code. Still, the compiler verifies that the implicit interface used in the template is valid for each instantiation of the template (strictly speaking, it is possible to "control" C++ type checking of template code using various tricks, e.g., by deferring name lookup to instantiation time with this-> or with a using declaration; see Scott Meyers' Item #43 for details).
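
For instance, here is a rough C++ analogue of the Python snippet above (the Primary/Fallback types are made up for illustration):

// The implicit interface here is: arg1 has members 'valid' and 'x',
// arg2 has a member 'y', and both x and y are convertible to int.
// Any pair of types satisfying that interface will compile.
template <typename T1, typename T2>
int fn(const T1& arg1, const T2& arg2) {
    if (arg1.valid) {
        return arg1.x;
    } else {
        return arg2.y;
    }
}

struct Primary  { bool valid; int x; };
struct Fallback { int y; };

int main() {
    Primary p = { true, 42 };
    Fallback f = { 7 };
    return fn(p, f);   // OK: both types satisfy the implicit interface
}

Swap in any other pair of types with the right members and the same template instantiates just as happily.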

So, C++ does come close to Python's implicit interfaces and dynamic typing with features such as templates and polymorphism, albeit with their syntactic weirdness. Well, that explains most of C++, except "overloading" of course. I assert that the sole purpose of overloading is to facilitate template programming. All other uses of overloading can be attributed to programmer laziness (aka convenience). For example, consider the following code snippet.

struct A {
    int x;
};

struct B {
    int y[10];
};

template <typename A_or_B>
void fn(A_or_B& arg /*, other args */) {
    if (/* arg is an A */) {
        arg.x = 10;
    } else if (/* arg is a B */) {
        arg.y[0] = 10;
    }
}


Will this compile? No. Is this something you might want to do? Maybe. The solution (to make it compile) is to use overloading.

// The overloads are declared before the template so that do_fn(arg)
// resolves to the right one at instantiation time.
void do_fn(A& arg) { arg.x = 10; }

void do_fn(B& arg) { arg.y[0] = 10; }

template <typename A_or_B>
void fn(A_or_B& arg /*, other args */) { do_fn(arg); }

Since all code inside a template must adhere to the *same* implicit interface, an object whose implicit interface is a proper subset of the template's implicit interface may not compile even though the usage is logically correct. The solution is to make the template's implicit interface the intersection of the implicit interfaces of all possible type arguments and move type-specific code into overloaded methods/classes.

Code Coverage and KLEE from OSDI 2008

I once wrote a coverage tool that randomly replaced function calls at runtime with one of the many values assumed by the output of the call (as determined at compile time). I did this for "well tested" systems code and was pleasantly surprised by the number of bugs it discovered, especially on error paths. Of course, the number of false positives was rather high as well, and replacing entire function calls doesn't always work either.

A more "correct" approach, one with zero false positives, is to have the programmer inject faults in the code, perhaps randomly thereby forcing error paths to be taken. This still requires executing the code in development environments in order to trigger a whole lot of code paths. Also, there is a trade-off between coverage and code readability.

An intermediate approach is to do symbolic execution of the code and ensure that all code paths are taken for all possible program inputs. That right there is an intractable problem, for the path space can be infinite and the input unknown and hard to model. The paper by Cristian Cadar et al. on "KLEE" at OSDI 2008 describes a "new symbolic execution tool capable of automatically generating tests that achieve high coverage on a diverse set of complex and environmentally intensive programs". KLEE manages to get a handle on both the path explosion problem and the "environment problem".

Symbolic execution isn't new. What KLEE does differently is optimizing the constraint set in order to speed up constraint satisfiability tests upon each "unsafe" access (viz., pointer dereferences, asserts, etc.). But first, let's take a step back and look at KLEE's architecture, which I find cooler than ice cream.

KLEE is modeled as an operating system for symbolic processes. Each symbolic process is in charge of a code path currently being explored by KLEE and contains a register file, stack, heap, program counter, and path condition - the state typically associated with a "real" process on a "real" operating system. Just as a "real" operating system picks one among several competing processes for scheduling, KLEE's scheduler picks a symbolic process among many for symbolic execution. Of course, KLEE's "scheduler" has a different goal than a real OS's scheduler: its goal is to run those symbolic processes that are likely to offer the biggest improvements in coverage. Each symbolic process (called a "state" in KLEE lingo) has a current set of constraints on the environment, which must be met for the current path to be reached. If KLEE detects a violation in the constraint set, it triggers an error and generates a test case.
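
Here is a purely conceptual sketch of that analogy in C++ - these are not KLEE's actual classes or fields, just my shorthand for the per-path state and the coverage-driven scheduler described above:

#include <cstddef>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <vector>

// One constraint on the symbolic inputs, e.g. "x > 0 && buf[i] != 0".
struct Constraint { std::string expr; };

// The per-path state, analogous to what an OS keeps per real process.
struct SymbolicState {
    std::size_t program_counter = 0;
    std::vector<int> stack;
    std::map<std::string, int> heap;
    std::vector<Constraint> path_condition;   // must hold to reach this path
    double expected_coverage_gain = 0.0;      // drives the scheduling heuristic
};

// Order states so the one expected to improve coverage most is picked first.
struct ByCoverageGain {
    bool operator()(const std::shared_ptr<SymbolicState>& a,
                    const std::shared_ptr<SymbolicState>& b) const {
        return a->expected_coverage_gain < b->expected_coverage_gain;
    }
};

using StateQueue = std::priority_queue<std::shared_ptr<SymbolicState>,
                                       std::vector<std::shared_ptr<SymbolicState>>,
                                       ByCoverageGain>;

// The shape of the main loop: pick a state, execute one instruction
// symbolically, fork at satisfiable branches, and emit a test case whenever
// an unsafe operation's failure condition is satisfiable. (Details elided.)
void Explore(StateQueue& states) {
    while (!states.empty()) {
        std::shared_ptr<SymbolicState> state = states.top();
        states.pop();
        // ... symbolic interpretation of one instruction would go here,
        // pushing forked/advanced states back onto the queue ...
    }
}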

Apparently, KLEE is a total rewrite of another tool by the same research group (EXE) which I'm not familiar with. But the interesting part is that in EXE's case, the individual processes were real operating system processes. Hmmm...

Symbolic execution of even small programs can result in path explosion, limiting the chances of finding true bugs. Moreover, most constraint solvers (KLEE uses STP) take quite a bit of time as well. KLEE's approach is to optimize away large parts of the state of each symbolic process before the constraints are passed on to the solver. This alone reduced the running time of an example in the paper by more than an order of magnitude. More importantly, they show how the running time of KLEE with their optimizations is nearly linear in the lines of code, as compared to super-linear without them. Also, even with all the optimizations, KLEE does not sacrifice accuracy, so it never reports false positives - which, IMHO, is one of the key benefits of using a symbolic executor over other program analysis methods.

Additionally, they use some cool tricks in picking which symbolic-processes to execute in order to maximize coverage. For example, KLEE favors scheduling paths which are high up in the code tree, since they are likely to expose more unique paths.

The other trick used by KLEE is environment modelling. Instead of requiring the environment to be concretely specified, users can write models for the environment in plain C code. During (symbolic) execution, the model can generate inputs and error conditions. I did not get a good handle on the usefulness of environment modelling though, so I don't know how powerful it is in practice.

Overall, it seems like a pretty good value proposition from a programmer's point of view. Zero false positives, benefits proportional to analysis time (KLEE allows you to specify how long it should run and uses that to optimize its scheduling behavior), and improved code coverage all seem like useful features. I wonder when these features will be available in some open source package, or how much one has to pay Coverity for this.

Friday, March 06, 2009

Emails Are Forever

I don't know about you, but the way I access my emails is almost always through the Gmail search box, whether I'm looking for that email about someone's flashy new job (since I forgot where they're working) or my todo list from last week.  And that's primarily because of the amount and nature of the data in my Gmail.  All communication of any consequence is reflected in emails (including Facebook and Orkut scraps), and additionally I've moved a lot of note taking to Gmail as well - implemented by sending an email to myself.

With so much of my life depending on my online existence, I'm getting a bit paranoid about several what-if scenarios, like what if the online service maintaining my data goes bankrupt, etc.  Of course, my case is not unique.  Millions of people worldwide have a lot of precious data sitting on computers spread out across the world, owned by a foreign corporation, and supported free of charge (e.g., free Gmail is partially supported by Google's online ads revenue).  But assuming some of this data will remain as valuable to me a decade or two later, I'd like to ensure that I don't leave all of it to the vagaries of a ticker symbol.

Maybe you would argue that Google and Yahoo and Microsoft are not going away anytime soon, and I believe you.  But less than 20% of the firms that existed two decades ago are still around, which works against that hypothesis.  Another possible argument is that a product getting phased out doesn't necessarily imply that user data will be lost (e.g., Yahoo Photos to Flickr); maybe user data will become all the more precious over time.  Still, I'd argue that the expected lifetime of your present online data will only go down over time.  So the big question is whether you care enough about _all_ that data or not.  In my case, I care about most of it.

A poor man's solution to immortalize your data is to periodically "download" and archive it at your home, or better still at another online service.  Of course, switching services may affect the usability of the data, which affects its value to you.  E.g., if I zipped all my Gmail into one single archive, I couldn't search it freely like before.  It might still be useful for litigation purposes but not for looking up my roommate's phone number.  This brings us to the first rule:

1.  When archiving your data, the functionality supported by the archived data should be comparable to the primary copy's functionality.  If not, it will probably be forgotten.

A more difficult problem for me is figuring out _where_ to archive or move my data, making the decision, verifying that it worked, etc.  This is potentially very time consuming.  So what I'd want to see is a computer "program" that keeps track of the service-level trends of your current data provider, monitors and evaluates upcoming alternative services, and replicates your data across a "diverse portfolio" of service providers, so that over time your data is likely to survive somewhere - and can almost always be found on the new and upcoming service.

It may seem that I just borrowed a dialog from Star Trek and I can understand your sentiment.  Some key components that will be needed to create such a "program" include:
  • A service for evaluating and recommending data service providers.
  • A data API supported by different data service providers that enables easy migration between them.  For example, if you must migrate a photo sharing service to an email provider, you'd probably store a single photo album as a single email.  Someone will have to write such interfaces for all new services (a rough sketch of what such an interface might look like follows this list).
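
To make that bullet a bit more concrete, here is a rough sketch of what a provider-neutral data API could look like, in C++.  It is entirely hypothetical - no such standard exists, and the DataItem/DataProvider names and methods are mine:

#include <string>
#include <vector>

// Entirely hypothetical -- only the shape of a provider-neutral data API
// that would make migration between services possible.
struct DataItem {
    std::string id;                       // e.g., a message id or photo id
    std::string media_type;               // e.g., "message/rfc822", "image/jpeg"
    std::vector<unsigned char> payload;   // the content itself
    std::vector<DataItem> children;       // e.g., photos inside an album
};

class DataProvider {
public:
    virtual ~DataProvider() {}
    // Enumerate everything the user owns on this service.
    virtual std::vector<std::string> ListItemIds() = 0;
    virtual DataItem Export(const std::string& id) = 0;
    // Import an item, mapping it onto whatever this service natively supports
    // (a photo album might land as a single email with attachments).
    virtual std::string Import(const DataItem& item) = 0;
};

// Naive migration: copy everything from one provider to another.
inline void Migrate(DataProvider& from, DataProvider& to) {
    for (const std::string& id : from.ListItemIds()) {
        to.Import(from.Export(id));
    }
}

The hard part is of course Import: each service has to decide how foreign item types map onto its own native types, like the album-as-email example above.
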
Before I conclude, I'd like to observe that this problem is similar to money management.  When you give your money to a money management firm, you expect them to keep reinvesting it in whatever businesses are most appropriate at the time.  Maybe that is too complex to delegate fully to a program, but I think keeping your data up and running forever should be simpler.