Sunday, February 15, 2009

Lessons learned and magic numbers.

One of the things you do when you start developing software, especially in a new or cutting-edge area, is believe you know what's best for your users and try to hide any complexities from them. That's one of the reasons we pushed RPC as the best way to develop distributed applications in the 1980s. Well, back when we started developing Arjuna we took hiding complexity to heart, since simplifying the development of transactional applications was core to all of our PhDs. And I think we did a very good job with the initial releases.

However, once we gave the system to people (the source was made freely available by FTP in 1991, but various industry sponsors were using it before then) the feedback we got was very interesting: users will always do things you didn't expect and demand flexibility in what you produce.

In Arjuna this didn't impact the interfaces users saw, but it did affect many of the internal development parameters that we had used, thinking that they would never need to be configurable (statically or dynamically). For example, when making a remote invocation on an object, failures can occur (the network could partition, the machine hosting the object could crash, etc.). In the absence of a perfect failure detector you use failure suspicion and timeouts. Basically, if a response does not come back from the object within time T then you assume the object has failed and act accordingly. But if you get the timeout value wrong then you can make the wrong decision, with associated consequences. The timeout value really needs to take into account how long work might take to execute, how overloaded a machine may be, network congestion, etc. So one value is rarely right for every application. But we didn't take that into account initially and hard-coded a magic number into the system.
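To make the timeout point concrete, here is a minimal Python sketch of failure suspicion where T is a parameter rather than a buried constant. It is not Arjuna code: the names `invoke`, `SuspectedFailure` and `DEFAULT_TIMEOUT_S`, and the 2 second default, are all made up for illustration.

```python
import concurrent.futures
import time

# Hypothetical default; the point is that callers can override it.
DEFAULT_TIMEOUT_S = 2.0


class SuspectedFailure(Exception):
    """No reply within the timeout. The remote object may still be alive."""


def invoke(call, timeout_s=DEFAULT_TIMEOUT_S):
    """Run `call` (standing in for a remote invocation) and wait up to timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # We only *suspect* failure: the work may simply be slow.
        raise SuspectedFailure(f"no reply within {timeout_s}s") from None
    finally:
        # Do not block on the worker; a late reply is simply discarded.
        pool.shutdown(wait=False)


def slow_operation():
    """Stands in for a remote object that takes 0.2 seconds to reply."""
    time.sleep(0.2)
    return "done"
```

With a timeout of 0.05 seconds `invoke(slow_operation, timeout_s=0.05)` raises `SuspectedFailure` even though nothing has crashed, while `invoke(slow_operation, timeout_s=1.0)` returns `"done"`. Same object, same network, different T, different decision.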

There were other examples of magic numbers, including: the number of RPC retransmissions to use before assuming that a request (or response) cannot get through to the endpoint (why should 5 be any better than 2 or worse than 10?); the number of clustered name server instances; the object store location; the orphan detection period; the lock-conflict detection timeout etc. These were all things we believed had the best (sensible) values, but with hindsight it was clear that it was all based on limited deployment knowledge.

What this all led to quite quickly was a methodology of expecting the unexpected, and developing accordingly, as far as your users are concerned. We made the system extremely flexible and configurable, so that many of the old magic numbers could be overridden either as the system ran or during deployment, with sensible defaults (trying to hit Pareto's 80/20 principle). Those magic numbers we didn't make configurable were clearly documented (both so that we could remember them and so that users could determine why something was behaving the way it was).
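As a rough illustration of that approach (again a sketch, not how Arjuna does it), the Python below keeps documented defaults in one place, lets a deployment override them through environment variables, and lets a running system change them. The `Settings` class, the `APP_` prefix, the setting names and the default values are all assumptions made for the example.

```python
import os

# Documented defaults in one place. The values are illustrative only.
DEFAULTS = {
    "rpc_timeout_s": 2.0,                 # seconds to wait before suspecting failure
    "rpc_retries": 5,                     # retransmissions before giving up
    "object_store_dir": "/tmp/objstore",  # hypothetical object store location
}


class Settings:
    """Defaults, overridable at deployment (environment) or while running (set)."""

    def __init__(self, environ=None):
        env = os.environ if environ is None else environ
        self._values = dict(DEFAULTS)
        for name, default in DEFAULTS.items():
            raw = env.get("APP_" + name.upper())
            if raw is not None:
                # Convert the string to the same type as the default.
                self._values[name] = type(default)(raw)

    def get(self, name):
        return self._values[name]

    def set(self, name, value):
        """Runtime override; unknown names are rejected rather than silently added."""
        if name not in DEFAULTS:
            raise KeyError(name)
        self._values[name] = type(DEFAULTS[name])(value)


# Deployment-time override of one value; the rest keep their defaults.
settings = Settings({"APP_RPC_RETRIES": "3"})
retries_at_start = settings.get("rpc_retries")
# Runtime override, no rebuild needed.
settings.set("rpc_timeout_s", 10)
```

Most users never touch any of this and simply get the defaults. The few who need to can change a value without rebuilding anything.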

From the feedback we received over the past 20 years I think we managed to come close to the right set of compromises. It's true that most users are happy with the default values we set (which were/are revised based on feedback). But the smaller number of users who really do need to change things can (and have been able to for 20 years) modify them without rebuilding the system. This has been important in the system's adoption and it's a lesson that we continue to apply. So if you're developing software, beware of using too many magic numbers, and if you don't make them configurable you need to understand (and believe) why that's the case and, importantly, document them just in case!
