Conscientious Software, part 3
(Sigh, keeping on a schedule is pretty darn tough. Sorry for the week-long hiatus. Onwards! Finally finishing off this particular sub-series of posts.)
Conscientious Software, Part 3: Sick Software
Final part in a 3-part series. Part 1, Part 2.
An interesting thing happened when garbage collection came into wider use. Back when I was doing mostly C++ programming, a memory error often killed the program dead with a seg fault. You’d deallocate something and then try to use a dangling pointer, and whammo, your program core dumps. Fast hard death. Even once you fixed the dangling pointer bugs, if your program had a memory leak, you’d use more and more memory and then run out — and whammo, your program would crash, game over, man!
Once garbage collection came along, I remember being delighted. No more running out of heap! No more memory leaks! Whoa there, not so fast. You could still leak memory if your program had, say, a cache that kept accumulating stale data. But now, instead of killing your program immediately, the garbage collector would just start running more and more often, and your program would run slower and slower. It was still working… kind of. But it wasn’t healthy.
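Here’s a minimal Java sketch of the kind of leak I mean; the class and cache here are invented purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical service with a cache that only ever grows.
// The garbage collector can't reclaim any of it, because the
// static map keeps every entry strongly reachable forever.
public class SickLookupService {
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public byte[] lookup(String key) {
        // Entries are added but never evicted, so the heap slowly fills
        // with stale data. The program doesn't crash right away; GC just
        // runs more and more often, and everything gets slower and slower.
        return CACHE.computeIfAbsent(key, k -> expensiveFetch(k));
    }

    private byte[] expensiveFetch(String key) {
        return new byte[64 * 1024]; // stand-in for real work
    }
}
```

The usual fix is an eviction policy of some kind (a bounded cache, weak references, a time-to-live), but the point is that nothing fails fast: the sickness only shows up as gradually degrading throughput and ever-longer GC pauses.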
It was sick.
Sick software is software that is functioning poorly because of what are essentially autopoietic defects. It’s fail-slow software instead of fail-fast software. A computer with a faulty virus checker, infected with trojans and spyware that consume its resources, is sick. A memory-leaking Java program is sick. These systems still function, but poorly. And the fix can be harder to identify, because the problems, whatever they are, are more difficult (or impossible) to trace back to your own (allopoietic) code. Fail-fast environments make it clear when your code is at fault. Fail-slow, self-sustaining environments keep trying to work around problems, and keep failing to do so.
Now, there is every likelihood that as systems scale up further, we will have to deal more and more with sick software. Perhaps it’s an inevitability. I hear that Google, for instance, is having to cope with unbelievable amounts of link spam — bogus sites that link to each other, often running on botnets or other synthesized / stolen resources. In some sense this is analogous to a memory leak at Internet scale — huge amounts of useless information that somehow has to be detected as such. Except this is worse, because it’s a viral attack on Google’s resources, not a fault in Google’s code itself. (I realize I’m conflating issues here, but like I said, this posting series is about provocative analogies, not conceptual rigor.)
I think there is a real and fundamental tension here. We want our systems to be more adaptive, more reactive, more self-sustaining. But we also want guarantees that our systems will reliably perform. Those two desires are, at the extremes, fundamentally in opposition. Speaking for myself as an engineer, and even just as a human being, I find that certainty is comforting… humans fundamentally want reassurance that Things Won’t Break. Because when things stop working, especially really large systems, it causes Big Problems.
So we want our systems to be self-sustaining, but not at the cost of predictability and reliability. And as we scale up, it becomes more and more difficult to have both. Again, consider Google. Google has GFS and BigTable, which is great. But once it has those systems, it then needs more systems on top of them to track all the projects and the storage that exist in those systems. It needs garbage collection within its storage pool. The system needs more self-knowledge in order to effectively manage (maybe even self-manage) its autopoietic resources. And in the face of link spam, the system needs more ability to detect and manage maliciously injected garbage.
Returning to the original paper, the authors spend a fair amount of time discussing the desire for software to be aware of, and compliant with, its environment. They give many potential examples of applications reconfiguring their interfaces, their plugins, and their overall functioning to work more compatibly with the user’s pre-existing preferences and other software. While I find this extremely thought-provoking, I’m also somewhat boggled by the level of coupling they seem to propose. Exactly how does their proposed computing environment describe and define the application customizations that they want to share? How are the boundaries set between the environment and the applications within it?
In fact, that’s a fundamental question in an autopoietic system. Where are the boundaries? Here’s another autopoietic/allopoietic tension. In an allopoietic system, we immediately think in terms of boundaries, interfaces, modules, layers. We handle problems by decomposition, breaking them down into smaller sub-problems. But in an autopoietic system, the system itself requires self-management knowledge, knowledge of its own structure. Effective intervention in its own operation may require a higher layer to operate on a lower layer. This severely threatens modularity, and introduces feedback loops which can either stabilize the system (if used well) or destabilize the system (if used poorly). This is one of the main reactions I have when reading the original paper — how do you effectively control a system composed of cross-module feedback loops? How do you define proper functioning for such a system, and how do you manage its state space to promote (and if possible, even guarantee) hysteresis? This circles back to my last post, in that what we may ultimately want is a provable logic (or at least a provable mathematics) of engineered autopoietic systems.
There are some means of achieving these goals. You can characterize the valid operating envelope of a large system, and provide mechanisms for rate throttling, load shedding, resource repurposing, and other autopoietic functions that let the system preserve its valid operating regime. It’s likely that autopoietic systems will initially be defined in terms of the allopoietic service they provide to the outside world, and that their safe operating limits will be characterized in terms of service-level agreements that can be defined clearly and monitored in a tractable way. This is like a thermometer that, when the system starts getting overworked, tells the system that it’s feverish and needs to take some time off.
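As a rough sketch of what I mean by a thermometer (the names and thresholds here are invented, not anything from the paper), here’s a monitor that watches a single health metric, say p99 latency, and trips into a load-shedding state with a bit of hysteresis so it doesn’t flap right at the edge of the envelope:

```java
// Hypothetical "thermometer" for one health metric (say, p99 latency in ms).
// It trips into a feverish state above one threshold and recovers only when
// the metric falls below a lower one; the hysteresis gap keeps it from
// flapping right at the boundary of the operating envelope.
public class HealthThermometer {
    private final double feverThreshold;    // enter load-shedding above this
    private final double recoveryThreshold; // leave load-shedding below this
    private boolean feverish = false;

    public HealthThermometer(double feverThreshold, double recoveryThreshold) {
        if (recoveryThreshold >= feverThreshold) {
            throw new IllegalArgumentException("need recovery < fever for hysteresis");
        }
        this.feverThreshold = feverThreshold;
        this.recoveryThreshold = recoveryThreshold;
    }

    // Called periodically with the latest measurement.
    public synchronized void observe(double metric) {
        if (!feverish && metric > feverThreshold) {
            feverish = true;   // start throttling / shedding load
        } else if (feverish && metric < recoveryThreshold) {
            feverish = false;  // back inside the valid operating regime
        }
    }

    // Request handlers consult this to decide whether to shed work.
    public synchronized boolean shouldShedLoad() {
        return feverish;
    }
}
```

A request path would consult shouldShedLoad() and reject or defer low-priority work while the system is feverish, which is one small way a system can act on knowledge of its own operating envelope.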
So the destiny of large systems is clear: we need to be more deliberate about thinking of large systems as self-monitoring and, to some extent, self-aware. We need to characterize large systems in terms of their expected operating parameters, and we need to clearly define the means by which these systems cope with situations outside those parameters. We need to use strong engineering techniques to ensure as much reliability as possible wherever it is possible, and we need to use robust distributed programming practices to build the whole system in an innately decomposed and internally robust way. Over time, we will need to apply multi-level control theory to these systems, to enhance their self-control capabilities. And ultimately, we will need to further enhance our ability to describe the layered structure of these systems, to allow self-monitoring at multiple levels — the basic issues of hardware and failure management, the intermediate issues of application upgrade and service-level monitoring, and the higher-level application issues of spam control, customer provisioning, and the creation of new applications along with their self-monitoring capabilities.
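To picture that layering concretely (again, a purely hypothetical sketch, not anyone’s actual architecture), imagine a common health-check contract that every level implements, so the same machinery can interrogate hardware, service-level, and application-level health:

```java
import java.util.List;

// Hypothetical common contract for self-monitoring at every layer.
interface HealthCheck {
    String layer();     // e.g. "hardware", "service-level", "application"
    boolean healthy();  // is this layer inside its expected operating parameters?
}

// Rolls individual checks up into a single view of the whole system.
class LayeredMonitor {
    private final List<HealthCheck> checks;

    LayeredMonitor(List<HealthCheck> checks) {
        this.checks = checks;
    }

    void report() {
        for (HealthCheck check : checks) {
            System.out.println(check.layer() + ": " +
                (check.healthy() ? "within parameters" : "outside parameters"));
        }
    }
}
```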
We’ve made much progress, but there’s a lot more to do, and the benefits will be immense. I hope I’ve shown how the concept of conscientious software is both valid and very grounded in many areas of current practice and current research. I greatly look forward to our future progress!