A blog about software – researching it, developing it, and contemplating its future.

Archive for the ‘research’ Category

OOPSLA 2007 – Languages Gone Wild

leave a comment »

It has come to my attention that my writing here has been a bit boring. Dry. Stuffy. Well, Kevin Bourrillion has shown me the way. Time to liven it up a bit. Just a bit, mind you. Juuuuuuuuust a bit.

And please don’t give me trouble over my inability to post more than once a week. Really all I can say is ARRRRGH! Work, upcoming conferences, and raising two very young kids = JEEZ WHEN DO PEOPLE EVER GET TIME TO BLOG? And while we’re on that topic, I don’t see you blogging enough either, do I? Don’t be throwing any stones, Mr. Glass House.

OK, where were we? Ah yes, OOPSLA 2007. What a beautiful conference. Not that I attended or anything, but who needs to attend a conference when you have the Internet?

The dynamic vs. static languages flame war has bogged down the language community over the last decade. It’s great to see that, judging by some recent papers from the latest OOPSLA, the logjam has broken with a vengeance. There are all kinds of brain-bending new languages in the works, and it’s frankly exhilarating. (Sorry that some of these are only available through the ACM’s Digital Library… but at $99 per year, how can you afford NOT to be a member???)

First we have a great paper on RPython, a project which creates a bimodal version of Python. You can run your Python program in a fully interpreted way in a startup metaprogramming phase, and then the RPython compiler kicks in, infers static types for your entire metaprogrammed Python code base, and compiles to the CLR or to the JVM with a modest performance increase of EIGHT HUNDRED TIMES. Yes, folks, they’ve run benchmarks of Python apps that run 800x faster with their system than with IronPython (which is no slouch of a Python implementation AFAIK). If that isn’t a great example of how you can have your dynamic cake and eat it statically, I don’t know what is.

Another lovely system is OMeta, a pattern language for describing grammars. You can write everything from a tokenizer up to a full language in a really nice executable grammar structure, with productions that map directly to some underlying base language. They also have a good modularity story worked out, and support for stateful parsing. They have a 177-line Javascript parser, and that’s not much!

Then there’s an equally great paper on JastAdd, an extensible compiler for Java. The JastAdd compiler is built around an extensible abstract syntax tree. The abstract syntax tree is the ONLY data structure in the compiler — there are no separate symbol tables or binding tables; everything is implemented as extensions to the abstract syntax tree. The extensions are expressed with a declarative language that lets you define dataflow equations relating different elements in the tree — inherited elements (for referring to names bound in a parent construct, for example), or reference elements (for referring to remote type declarations, for example).

The compiler has an equation analysis engine that can process all these equations until it reaches a fixpoint, which completely avoids all the usual multi-phase scheduling hassles in compilers around interleaving type analysis with type inference, etc. It seems like The Right Thing on a number of levels, and it makes me want to hack around with building a compiler along similar declarative lines. They give examples of extending Java with non-null types and of implementing Java 5 generics purely as a declarative compiler extension. That, to me, pretty much proves their point. Bodacious! I had been thinking that executable grammars were a nice way to go, but seeing their declarative framework’s power is seriously making me reconsider that idea. What would you get if you combined OMeta and JastAdd? Something beautiful. I’m not sure how you’d combine the statefulness of OMeta with the declarativeness of JastAdd, but we must ponder it deeply, because the One True AST is a goal worth seeking.

A truly mind-bending paper discusses breaking the two-level barrier. What’s the two-level barrier? Simple: it’s the class/object distinction. They point out that many kinds of modeling can’t be done with a class hierarchy. What you really want is a programmer-accessible metaclass hierarchy. (And not a weenie one like Smalltalk’s, either.) For example, consider an online store. You thought you knew everything about online stores? THINK AGAIN, JACKSON. Let’s say you have a DVD for sale, such as Titanic. That Titanic DVD is an instance of the DVD product class. The DVD product class is conceptually a further instance of the DigitalMedia product class. I meant exactly what I said there — in their framework, one class can be an instance of another class.

You can then state that the DigitalMedia metaclass defines a “categoryName” and a “net price”, requiring that “category name” be defined by instances of DigitalMedia, and that “net price” be defined by instances of instances of DigitalMedia.. The DVD class then defines “categoryName” to be “DVD”, so ALL DVDs have the same category name. And then particular instances of DVD define their actual net prices individually. In this way, you can take the same kinds of “must provide a value for this field” constraints that exist in the class-object hierarchy, and extend it to multiple levels, where grandparent metaclasses can express requirements of their meta-grandchild instances.

(They use the abysmal word “clabject” — ClAss obJECT — to refer to entities that can be used to instantiate objects (like classes), but that ARE themselves instantiated (like objects). I think “clobject” would have been better, or maybe “obclass” or something. “Clabject” just sounds… I don’t know… disturbing. Like some kind of hapless crab that’s filled with techno-malice. But the concept itself is very interesting. I think that having two orthogonal hierarchies — the metaclass hierarchy and the subclass hierarchy — is potentially too confusing for most programmers, including me, but it’s nonetheless really thought-provoking.)

Those are just four of the highlights — I’m only about a third of the way through reading the OOPSLA papers this year — but I think those are the top three when it comes to language design. It’s going to be a great next decade, as the whole static vs. dynamic war gives way to a myriad of bizarre hybrids and mutants, greatly enhancing (and confounding) the lives of hackers everywhere!


Written by robjellinghaus

2007/10/30 at 03:39

Conscientious Software, part 3

leave a comment »

(Sigh, keeping on a schedule is pretty darn tough. Sorry for week hiatus. Onwards! Finally finishing off this particular sub-series of posts.)

Conscientious Software, Part 3: Sick Software

Final part in a 3-part series. Part 1, Part 2.

An interesting thing happened when garbage collection came into wider use. Back when I was doing mostly C++ programming, a memory error often killed the program dead with a seg fault. You’d deallocate something and then try to use a dangling pointer, and whammo, your program core dumps. Fast hard death. Even once you fixed the dangling pointer bugs, if your program had a memory leak, you’d use more and more memory and then run out — and whammo, your program would crash, game over, man!

Once garbage collection came along, I remember being delighted. No more running out of heap! No more memory leaks! Whoa there, not so, is it? You could still leak memory, if your program had a cache that was accumulating stale data. But now instead of killing your program immediately, the garbage collector would just start running more and more often. Your program started running slower and slower. It was still working… kind of. But it wasn’t healthy.

It was sick.

Sick software is software that is functioning poorly because of essentially autopoietic defects. It’s fail-slow software, instead of fail-fast software. A computer with a faulty virus checker, infected with trojans and spyware that are consuming its resources, is sick. A memory-leaking Java program is sick. These systems still function, but poorly. And the fix can be harder to identify, because the problems, whatever they are, are more difficult (or impossible) to trace back to your own (allopoietic) code. Fail-fast environments make it clear when your code is at fault. Fail-slow environments, self-sustaining environments, are trying to work around problems but failing to do so.

Now, there is every likelihood that as systems scale up further, we will have to deal more and more with sick software. Perhaps it’s an inevitability. I hear that Google, for instance, is having to cope with unbelievable amounts of link spam — bogus sites that link to each other, often running on botnets or other synthesized / stolen resources. In some sense this is analogous to a memory leak at Internet scale — huge amounts of useless information that somehow has to be detected as such. Except this is worse, because it’s a viral attack on Google’s resources, not a fault in Google’s code itself. (I realize I’m conflating issues here, but like I said, this posting series is about provocative analogies, not conceptual rigor.)

I think there is a real and fundamental tension here. We want our systems to be more adaptive, more reactive, more self-sustaining. But we also want guarantees that our systems will reliably perform. Those two desires are, at the extremes, fundamentally in opposition. Speaking for myself as an engineer, and even just as a human being, I find that certainty is comforting… humans fundamentally want reassurance that Things Won’t Break. Because when things stop working, especially really large systems, it causes Big Problems.

So we want our systems to be self-sustaining, but not at the cost of predictability and reliability. And as we scale up, it becomes more and more difficult to have both. Again, consider Google. Google has GFS and BigTable, which is great. But once it has those systems, it then needs more systems on top of them to track all the projects and the storage that exist in those systems. It needs garbage collection within its storage pool. The system needs more self-knowledge in order to effectively manage (maybe even self-manage) its autopoietic resources. And in the face of link spam, the system needs more ability to detect and manage maliciously injected garbage.

Returning to the original paper, they spend a fair amount of time discussing the desire for software to be aware of and compliant with its environment. They give many potential examples of applications reconfiguring their interfaces, their plugins, their overall functioning, to work more compatibly with the user’s pre-existing preferences and other software. While extremely thought-provoking, I also find myself somewhat boggled by the level of coupling they seem to propose. Exactly how does their proposed computing environment describe and define the application customizations that they want to share? How are the boundaries set between the environment and the applications within the environment?

In fact, that’s a fundamental question in an autopoietic system. Where are the boundaries? Here’s another autopoietic/allopoietic tension. In an allopoietic system, we immediately think in terms of boundaries, interfaces, modules, layers. We handle problems by decomposition, breaking them down into smaller sub-problems. But in an autopoietic system, the system itself requires self-management knowledge, knowledge of its own structure. Effective intervention in its own operation may require a higher layer to operate on a lower layer. This severely threatens modularity, and introduces feedback loops which can either stabilize the system (if used well) or destabilize the system (if used poorly). This is one of the main reactions I have when reading the original paper — how do you effectively control a system composed of cross-module feedback loops? How do you define proper functioning for such a system, and how do you manage its state space to promote (and if possible, even guarantee) hysteresis? This circles back to my last post, in that what we may ultimately want is a provable logic (or at least a provable mathematics) of engineered autopoietic systems.

There are some means to achieving these goals. You can characterize the valid operating envelope of a large system, and to provide mechanisms for rate throttling, load shedding, resource repurposing, and other autopoietic functions to enable the system to preserve its valid operating regime. It’s likely that autopoietic systems will be initially defined in terms of the allopoietic service they provide to the outside world, and that their safe operating limits will be characterized in terms of service-level agreements that can be defined clearly and monitored in a tractable way. This is like a thermometer that, when the system starts getting overworked, tells the system that it’s feverish and it needs to take some time off.

So the destiny of large systems is clear: we need to be more deliberate about thinking of large systems as self-monitoring and to some extent self-aware. We need to characterize large systems in terms of their expected operating parameters, and we need to clearly define the means by which these systems cope with situations outside those parameters. We need to use strong engineering techniques to ensure as much reliability as possible wherever it is possible, and we need to use robust distributed programming practices to build the whole system in an innately decomposed and internally robust way. Over time, we will need to apply multi-level control theory to these systems, to enhance their self-control capabilities. And ultimately, we will need to further enhance our ability to describe the layered structure of these systems, to allow self-monitoring at multiple levels — both the basic issues of hardware and failure management, the intermediate issues of application upgrade and service-level monitoring, and the higher-level application issues of spam control, customer provisioning, and creation of new applications along with their self-monitoring capabilities.

We’ve made much progress, but there’s a lot more to do, and the benefits will be immense. I hope I’ve shown how the concept of conscientious software is both valid and very grounded in many areas of current practice and current research. I greatly look forward to our future progress!

Written by robjellinghaus

2007/10/23 at 04:14

The Best of Both Worlds

leave a comment »

Along the lines of my post about avoiding religious wars over computer languages, I recently ran across two papers that got me seriously thinking about this whole “how to have the best of all worlds” problem.

Specifically, there was an excellent follow-on to the last OOPSLA conference, called the Library-Centric Software Design ’06 conference. Josh Bloch (of Effective Java fame) was one of the co-chairs. The papers were all focused on the general problem space of “active libraries” or “metaprogramming” or “extensible languages” — generally, the concept that your programming language can be extended by libraries that your programs use. Generally, this means that the typechecking of the language, or perhaps its very grammar, can be enhanced in an extensible way.

One of the more thought-provoking examples is a paper discussing Extending Type Systems in a Library, specifically about adding new “field types” to C++. One use for this is defining new type modifiers such as “notNull”, or “positive”, or “tainted”. These are all in some sense type template modifiers, in that for any pointer type T you could define a modified type notNull with type assertions preventing you from having null instances of that type. Or, for “tainted”, you could have just about any type T and create a variant type tainted indicating that the value of the base type originated from a suspect source.

The “tainted” use case tweaked my memory. The other recent work I’ve seen on taint analysis was the paper discussing how to use the BDDBDDB system to track tainted values through a Java program (previously discussed here). In that case, the system used sophisticated alias analysis to track the movement of values from tainted sources to vulnerable sinks (e.g. from dangerous methods receiving input from the Internet, to vulnerable methods issuing output to a database or filesystem or even another server connection).

Consider how similar these are. In the first case, you’re extending the language’s type system to distinguish tainted from untainted values. In the latter case, you’re analyzing a source base which has no such type distinctions, and statically proving whether or not the value flow does in fact permit tainted values to be used in a context that requires untainted values.

Yet also notice how different these are. In the extended-C++ case, actually using the tainted/untainted types would require you — in my reading of the paper — to essentially modify almost every method in your source code. All methods would need to declare their values as being either tainted or untainted, even methods which don’t actually ever use the possibly-tainted values in a vulnerable way. In other words, the property of taintedness is only relevant at one place in the program — the methods whose inputs are required to be untainted. But declaring the property as an explicit type annotation throughout the program forces your code to expose this distinction everywhere. Suddenly your language extension forces you into a truly massive refactoring!

In contrast, the BDDBDDB-based analysis essentially overlays the tainted property on the program’s alias graph. The analysis may well be no less sound… but how much more convenient for the programmer! Taintedness becomes a property that is derivable only when the programmer wishes to be aware of it. The rest of the time, the programmer can avoid worrying about it. In some sense, the security analysis becomes a “pay-as-you-go” property of the system, as opposed to a “pay-up-front” property in the fully statically typed technique.

Now, consider this a bit more broadly. Isn’t this exactly what dynamic language advocates hate about statically typed languages? Instead of worrying about tainted vs. untainted, let’s consider basic data typing: string vs. number. If you read in a string, and then you want to treat it as a number, isn’t that essentially exactly like a taint analysis? The only time you care whether the string is in fact not a number is if you actually try to use it as one. If you don’t ever try to use it as one, then why worry? Conversely, if you do try to use it as one, wouldn’t it be nice to have the system be able to trace its provenance back to where you originally read it, so you could decide if you wanted to insert a validity check there… without forcing you to declare its type at every intermediate program point where it flows?

It seems to me that extensible languages will have to shift a lot of their safety checking and property checking away from the static type system that’s exposed directly to the programmer. Consider all the possible properties you might want to know about the values in your program: are they string? Numeric? Complex? Are they null? Never-null? Are they tainted? Untainted? Are they structurally compliant XML? Are they valid values for your object-relational database? Are their dimensions correct for the physics math you want to do with them? There is a literally infinite variety of possible types that you might want to track through your program. And if you have to declare all of these types on every variable and method you declare, you pay a “static declaration tax” equal to the number of extended type properties times the number of declarations in your program. Whereas if the system is able to derive, or trace, or deduce the flow of these properties, and if the programming environment gives you enough insight into how that derivation works and what annotations you can add (very selectively) at only the points where the distinctions are relevant, you offload that tax onto the programming environment.

Imagine an IDE for some Ruby++ language. You can write your Ruby code in the normal way. But you can also annotate various methods with fully optional types. Methods that read from an HTTP connection mark their values as tainted. Methods that write to the database require untainted values. Arithmetic methods mark their inputs as numeric. And even better, there are ways (handwaving furiously) to define new kinds of traits and types; so you can implement complex numbers in the language, and use them without needing to explicitly declare their type throughout the program, and the system can track their value flow and create optimized code for their use. The end result: an extensible language with far more ease of use than extended-Java or extended-C++, and far more static safety than basic Ruby.

Maybe it is possible to have the best of all worlds. And in fact, if we want to truly achieve the dream of extensible statically-analyzed languages with usable syntax, maybe it’s not only possible but necessary.

Up with pay-as-you-go! Down with the static declaration tax! Up with extensible languages! Down with pure dynamism!

Maybe I’m a bit more religious than I thought.

What would such a language look like? Stay tuned!

…After writing the above, I surfed around for pluggable types — which was Gilad’s terma for the concept — and found several people who have suggested basically this: that static typing could be considered optional, and that extensible type systems might well require that viewpoint. There has been some real progress in the direction of viable partial type systems, and Javascript 4 is looking at adding strict mode.

Anyway, I’m leaving the body of the post above as I originally wrote it, because the slightly uninformed enthusiasm gave it some good energy 🙂

Written by robjellinghaus

2007/10/16 at 03:41

BDDBDDB – Pointer alias analysis comes of age

with one comment

One of the most exciting program analysis techniques I’ve ever seen (I love having my own geek blog, where I can say things like that with a straight face) is the BDDBDDB system, from Whaley and Lam at Stanford. (I will capitalize BDDBDDB here, though they don’t.)

Pointer alias analysis is tracking the movement of object references through a large piece of code, across method calls and through heap references. I’ve been reading programming research papers for almost two decades now, and for much of that time, pointer alias analysis was a chimera. It just wasn’t possible to make it scale. The number of potential paths in even a moderately large piece of code is dizzying — numbers like 10 to the 14th power aren’t uncommon. And for most of the last two decades, this has been an uncracked nut.

Well, Whaley and Lam seem to have finally cracked it, by building on other research done using binary decision diagrams (BDDs) to compactly encode the movement of values through variables in a program, in a way that maximizes the amount of shared structure in the variable flow graphs. BDDs were originally created for hardware logic analysis, to track the movement of signals through a huge circuit. So it’s elegant that they can also apply to software analysis, tracking the movement of value through a large code base.

The further elegance in Whaley and Lam’s technique is the query language laid on top of the BDD-based analysis. Binary decision diagrams can be used to encode relations; they can be joined and queried much like relational tables. And there is a query language, Datalog, that is well suited for expressing queries over a BDD-based relational model. Datalog is a relational query language, similar to a more abstract version of SQL, except it supports recursive queries (which SQL does not). Whaley and Lam are able to encode a Datalog-based query and relationship-generation engine on top of their BDD aliasing structures. Hence, BDDBDDB — Binary Decision Diagram-Based Deductive DataBase.

One of the more amazing results in their paper is their discussion of how their analyses improved once they had the Datalog layer. originally they wrote a context-specific alias analysis on top of a BDD engine without any intervening abstract query language. This took them a few months, the code was thousands of lines long, the performance was not good, and there were a number of subtle bugs. Once they wrote the Datalog engine and reformulated their analysis as Datalog formulas, the total size of the analysis dropped to just around a dozen lines (!) of Datalog, and the performance was one to two orders of magnitude faster.

Once you have a query engine that can efficiently query an entire software system to track value and type flow throughout, in a reasonably accurate way, there are many systems you can build on top of it. For instance, you can write an Eclipse plugin to track the exact lines of code that expose a web application to SQL injection attacks. This can be done by notating particular methods as sources of input data from the Internet, and other methods as destinations (“sinks”) of query text for databases, and then evaluating whether any values flow from source methods to sink methods. Any paths through the code that allow unfiltered input to travel from the Internet to your database are security holes. The BDDBDDB analysis can tell you exactly what code paths allow that value flow. And it can do it purely as a static analysis!

They use Java as the basis of their system, for unstated but obvious reasons. (Namely, strong static typing makes their value flow analysis much more efficient, since it can eliminate many value flows that are not dynamically permitted; and lack of pointer arithmetic means their analysis can be trusted, with no need to limit to the “safe” subset of the language without pointer math.) Many people have been wooed away from Java by the flexibility of dynamic language programming, where you needn’t encumber yourself with types; but those of us who remain in the static typing camp are pretty much hooked on the tools. And BDDBDDB is the harbinger of a whole new level of Java-based analysis tools that are going to take Java programming and safety checking to the next level.

One thing not mentioned in any of the BDDBDDB papers I’ve yet seen is reflection. I can’t see how reflection can be handled in the BDDBDDB system, since tracking reflective method invocations requires not just tracking object references through variable sites, but actually tracking method values (and even specific method name strings!) through reflective invocations. Certainly their system as described doesn’t address reflection in any way. This also seems to me to limit the applicability of BDDBDDB to languages like Ruby, which make heavy use of code generation (in Rails, for instance). Though I’m sure that some enterprising Rubyists will undertake to prove me wrong.

In general it’s very challenging to consider the interactions of static analysis with metaprogramming frameworks — runtime bytecode generation is becoming more common in Java systems, but BDDBDDB currently runs on a source analysis, without access to the runtime-modified bytecode. Metaprogramming can blur the line between compile-time and runtime, and BDDBDDB is solidly on the compile-time side… at least for the moment. I say this not to denigrate the importance of this technology, but rather to indicate an even further level to which it can go.

I look forward to what’s next for these analyses. Good times in the static typing world!

Written by robjellinghaus

2007/10/11 at 04:17

Conscientious Software, Part 2: Provable Robustness

leave a comment »

Second post in a series. Part 1.

Onwards! Now, what’s really interesting to me is the further implications of this autopoietic / allopoietic dichotomy throughout the software stack. The Conscientious Software paper talks about potential means of implementing autopoietic software in terms of more vaguely defined membranes and other sorts of perception-oriented interfaces between software components, along the lines of Jaron Lanier’s work.

To me, that is a lot harder to think about than the Google example, in which a lot of allopoietic engineering (very clear definition of subsystems, logical contracts, operational proofs, and other extremely “rigid” software engineering techniques) was used to create a system that, to a large extent, is reliably self-sustaining.

In other words, it seems to me that we want our autopoietic systems to be as reliable as we can make them. And the best techniques we have for designing something to be reliable are essentially allopoietic techniques.

Let’s look at this another way. A lot of programming language research right now is devoted towards integrating proof systems into various languages. In some sense, every strongly typed language has a proof system built into it. A type system is a proof system (see Wadler’s paper on Proofs As Programs). The power of these proof systems and type systems is steadily increasing — a lot of work lately is going into logics for proving file system safety, data structure heap correctness, distributed messaging consistency, and on and on.

We are continually developing conceptual techniques to increase the accuracy and correctness with which we can describe the desired properties of a system, and prove that the system in fact has those properties as implemented.

It is a fascinating question, to me, whether all those tools can be turned to the purpose of developing autopoietic systems. Let’s take one example: the above-linked work on logically proving that a file system is reliable.

The intention of this work is to model the state of the computer’s memory relative to the known state of the computer’s disk, in terms of what the memory believes to be true versus what the disk’s contents declare as true. Logical statements relate changes in the memory beliefs to changes in the disk’s beliefs. Overall, there is an error if at any point one can prove that the system’s disk beliefs are inconsistent with memory.

In a sense, this proof work is intended to make the file system robust in the face of any and all environmental failures. The robustness comes from construction, not necessarily from recovery. This is building reliability into the system at the very bottom.

It is not too difficult to postulate that such logics could be further developed, for example, to characterize the consistency properties of the Google File System, or of other large-scale distributed software infrastructures. (Boy, what a handwaving sentence! Yes, it’s not too difficult to postulate that. Actually doing it, on the other hand, is likely to be extremely difficult indeed. But over time even hard problems can get solved….)

Another direction in which to deliver increased reliability is internal self-protection. Almost all major organisms on Earth are composed of cellular structures. A cell is defined by its boundary — without a boundary, a cell does not exist. Some programming languages have similarly oriented themselves around the concept of bounded, encapsulated units which are composed in large numbers to create a greater whole. Possibly the most industrially tested and publicly available example of such a language is Erlang, which composes large and immensely reliable systems out of many small software components, which communicate only by asynchronous message passing. There is no global state in an Erlang program, and no shared state among independent objects. This makes Erlang innately fault-tolerant, insofar as the failure of any one component cannot immediately cause other components to fail. This also allows the system to be upgraded while continuing to run. A really successful autopoietic system eventually reaches a point where it needs to provide continuous availability, with no externally visible downtime at all, ever. Systems decomposed into pure message-passing subcomponents inherently support this, since individual components can be replaced cleanly while the rest of the system continues normal operation.

Of course, autopoietic layers need to be built as other Erlang objects (a common Erlang idiom is to have a number of worker objects overseen by a smaller pool of supervisor objects — this is exactly an autopoietic software layer), but the innate properties of the language contribute to the enforced internal modularity which leads not only to improved stability but also to ease of distribution. (Autopoietic systems will necessarily also be distributed and parallel systems, since any individual hardware unit may fail, and if the overall system is to survive it must have redundant parallel hardware.)

Large-scale autopoietic systems currently, and in the near future, will combine these properties. They will be systems that provide a general-purpose computing environment, with well-defined and consistent contracts, which are engineered with as much built-in and provable correctness as we can give them, which are modularized into independent and well-protected components, and which provide sufficiently reliable guarantees to the allopoietic programs running in those environments.

What is especially interesting is that this view flies in the face of the original paper’s claim that what’s needed is more vagueness, more inconsistency if you will, more tolerance between components for partial failures or partial communication. I’m not convinced by this part of their argument, at all. I think that we currently lack the ability to engineer in that style. Perhaps ultimately we’ll make greater use of artificial evolution for engineering design, and I would expect such design tools to be far better than we are at leveraging interfaces that are ambiguous or otherwise not rigidly defined.

Next: what happens if software can get sick?

Written by robjellinghaus

2007/09/18 at 04:47

Conscientious Software, Part 1: Allopoiesis and Autopoiesis

leave a comment »

Recently some Sun researchers wrote a very interesting paper about the future evolution of software. The paper seeks to characterize a new type of software, which they term “conscientious”. I want to dig into this concept a bit.

This will be a multi-post series; this first post lays the groundwork, then the subsequent posts will delve into some further implications of the basic concepts.

This post will make much more sense if you read their paper first; otherwise, hang on 🙂 Their paper is very much a manifesto, in the sense that it aggressively commingles concepts from biology, engineering, and computer science to propose a very ambitious and not-fully-understood thesis. So, this post is going to do that too, and conceptual rigor be damned. It’s brainstorming time.

Conscientious software might be described (paraphrasing their paper heavily) as software which has a currently unknown degree of self-awareness, in the sense that it has the ability to test itself, to analyze its own functioning, to address external or internal malfunctions appropriately, and to maintain its operation under a variety of adverse conditions. And, of course, to actually do some job that we consider valuable.

There are two kinds of systems they talk about extensively. These are allopoietic systems and autopoietic systems. The technical definitions here come from biology, but I’ll paraphrase heavily and just say that an autopoietic system creates itself, sustains itself, and produces itself, whereas an allopoietic system is externally created and produces something other than itself.

An example they give is the distinction between a living amoeba and a watch. An amoeba consists of a large number of cellular structures and chemical processes. A living amoeba is a dynamic system — its continued existence depends on its active metabolism, by which it takes in nutrients, processes them, excretes material toxic to it, seeks out hospitable environments, and avoids dangerous ones. All the metabolic cycles and other processes that constitute the amoeba are primarily oriented towards self-sustainment.

A watch, on the other hand, has almost no ability to sustain itself. Its sole purpose is to tell time, and that purpose is only useful to the users of the watch, not to the watch itself. If the watch malfunctions, it must be repaired externally.

What does this have to do with software? They claim — and I think I agree — that software needs to develop in a more autopoietic direction. Fragile software, with frequent malfunctions, tricky configuration, and inadequate self-monitoring and self-repair capabilities, is increasingly hard to scale to larger and more complex functions. Yet we clearly want software to be capable of more than it currently is. Autopoiesis is a very thought-provoking concept when considering how to design such self-sustaining systems.

They make another point, which is that a purely autopoietic system — one which is only concerned with self-sustainment — is of little value to us. We want a system which can both sustain itself and produce something valuable. Their paper draws a picture of an autopoietic system containing an allopoietic core — the self-sustaining system devotes a large part of its resources to solving specific, externally defined problems.

This is all terribly abstract. What would such a system look like? Are we talking about some kind of muzzy fuzzy genetically engineered software? What are some examples of autopoietic software today?

They start with garbage collection as an example of an autopoietic software function. In some sense, memory is the environment in which software exists. As software runs, it populates that memory. Without lots of careful management, the memory can easily become cluttered with no-longer-used but not-yet-deallocated objects. It’s no big stretch to consider garbage objects as waste products of the software’s metabolism. (Well, OK, it is a stretch, but like I said, we’re brainstorming here.)

Garbage collection was originally developed to enable LISP programs to be developed in the way that was conceptually cleanest — in other words, in a largely functional, allocation-intensive style — without complicating them with memory reclamation logic. Externalizing the garbage collection code meant that the LISP program itself could more or less just do its job, with the underlying metabolism of its environment responsible for cleaning up its waste.

Another example (mine, not theirs), and probably one of the most well-known (or at least well-documented) large-scale examples, is Google’s infrastructure. Google has developed numerous systems — the Google File System, the BigTable large-scale database and the Chubby lock system — that are inherently widely distributed and fault-tolerant. These systems actively monitor themselves, replicate their data around failed components, and generally self-manage on top of a widely distributed pool of faulty servers. It is even less of a stretch to think of these systems as implementing a metabolic environment for user programs that is composed from cells (servers) that routinely die and are replaced, while the entire system has a dynamic, self-sustaining continuity. (I’m sure many Google admins would laugh very hard at the phrase “self-sustaining” here, but nonetheless these systems do internally monitor and repair themselves.)

The allopoietic component of the Google environment is the applications that actually run on the underlying autopoietic Google services. These applications can (mostly) ignore the details of the hardware they run on and the storage they use; they can simply consume the resources they need (in both persistent storage and CPU) to perform their job.

OK. So that’s autopoiesis and allopoiesis as applied to software as we currently know it. Next post: allopoiesis as the foundation of autopoiesis.

Written by robjellinghaus

2007/09/07 at 03:22

Research Paper Addict

leave a comment »

I first encountered the programming language literature as an undergraduate at Yale. Up until that time I’d only ever read programming books or manuals (I still recall hanging out in the backyard of my parent’s house one warm summer devouring the original K&R C book, raving to my bemused parents about how cool enumerations were). But once I got into my junior and senior years at college, I got exposed to the research community. And I started digging through the SIGPLAN Notices stacked up in the computer science building, and reading other students’ theses on Linda (David Gelernter was my thesis advisor – this was a couple years before he became one of the Unabomber’s victims).

This started a habit that has only gotten deeper over the years. Once the Internet hit it big, I realized that now there was a limitless fount of interesting research at my fingertips. And indeed that’s proven to be true — as you’ve already noticed, a lot of my posts here are driven by cool papers I’ve read recently, and you can confidently expect that to continue. Reading research papers is pretty much the best way to stay on top of the cutting edge of the software world.

So this post is about my personal mainlines for cool research; if you’ve got any to add, please comment, because I’m always on the prowl 🙂 My main interests are in (among others) programming languages, type systems, static analysis, metaprogramming, security, distributed systems, scalability, and reliability. So these are obviously skewed in that direction.

First off, the expensive ones. USENIX is probably the number one organization for research about systems in practice, tackling OS issues that have real-world implementations. It’s no wonder Google’s a major USENIX sponsor. And, of course, there’s the Big Daddy, the ACM Digital Library — it’s moderately expensive to be a member, but as a one-stop shop for a huge variety of conferences and research papers, it’s pretty much peerless. Most of the for-pay papers I want to read can be found on either the USENIX or ACM sites. It’s often easy to Google various papers that are also in ACM/USENIX publications, but sometimes it’s nice not to have to bother, and other times the papers really are embargoed.

After that, the most interesting research site bar none is Lambda the Ultimate. For programming language research, complete with an insightful and literate community, this blog wins the internets.

Then we get down into the major corporate research sites. First, the big disappointment of the year — Microsoft Research, once a very worthwhile site, recently crippled themselves with a redesign that removed all access to the chronological list of all their research publications! The only exposure they now give is to the last dozen or so papers published, and a search dialog that doesn’t seem to work for anything. What were they thinking?! And not only that, it’s impossible to find a direct contact email address for them. Sigh. I should talk to some of my Microsoft friends about that….

Memo to all research departments: THE NUMBER ONE FEATURE YOU IMPLEMENT MUST BE A FULL CHRONOLOGICAL ARCHIVE OF ALL YOUR RESEARCH PUBLICATIONS. Without that, you are nigh worthless to people wanting to learn what your researchers are doing in depth.

Anyway, leaving them aside, there’s Sun Research (low volume but often interesting), IBM (frustrating to browse their archive), and Google’s papers page (which gets updated pleasingly often, and which is a model of simplicity and accessibility). Aaaaand… those are my usual suspects.

Then there are the blogs of various programmers, such as Gilad, Crazy Bob Lee, and the Hibernate guys… but those often tend to be pretty low volume for my tastes. When I crave research, I crave it by the gallon!

I eagerly solicit any and all sources of more cool new software research and/or development. (Or even programmer blogs that are updated consistently.) Please please lay ’em on me in the comments, and I’ll cycle it into future addendums to this post. Enjoy!

[Edit, a month later: Turns out Microsoft Research’s RSS feed is the place to go for the stream of recent results. That’ll have to do for now, though one of my MS friends says he’ll put a bug in their ear.]

Written by robjellinghaus

2007/08/31 at 03:39

Posted in research

The Openness Dilemma

leave a comment »

A while ago (circa 2000), Rob Pike wrote a rant about how systems software research is becoming irrelevant, due to the difficulty in getting new operating systems adopted.

Since then, he’s gone on to work at Google, where systems software research is the lifeblood of one of the (if not the) most important Internet sites in history. And he’s happily doing plenty of systems software research that’s become fundamental to the company’s operations.

So his original concern about the irrelevance of operating system research got effectively sidelined, because the action moved from single-machine operating systems to wider distributed systems, especially as used at Google. And Google is as good as anyone at turning research ideas into production practice.

Meanwhile, Jim Waldo at Sun last year wrote another paper — a little less of a rant, but not much — about how systems software design is suffering greatly, largely from the lack of opportunity to learn from experience. Waldo makes good points about the difficulty of teaching system design except through example and experience.

His main concern is that opportunity to learn by doing is very hard to come by. In academia, systems tend to be small and rapidly discarded, due to the need to publish frequently and produce results quickly. In industry, systems tend to be proprietary, encrusted by patents, and impossible to discuss or talk about publicly. This leaves only limited latitude for public construction or discussion of systems large enough and interesting enough to really learn from.

Waldo suggests that open source projects are one of the few ways out of this dilemma. They are in many cases fairly large in scope, they are fully visible to anyone wishing to critique, extend, or adapt them, and they provide not only a code base but (in the best cases) a community of experienced designers from whom new contributors can learn. They therefore are in some ways the best hope for spreading effective education about system design, being unencumbered by either the short-term problems of academia or the proprietary problems of industry.

Recently, coincidentally enough, some Googlers working on the Google lock service — a key part of Google’s distributed infrastructure — wrote a paper describing their experiences building a production implementation of the Paxos protocol for distributed consistency. What’s especially interesting about this paper is how neatly it both decries and embodies the very dilemma Waldo is talking about.

The Google Paxos paper has a lot of extremely interesting technical content in its own right. It’s one of my favorite types of papers — a discussion of problems encountered when trying to take compelling theory and make it into something that really works in a live system. Without that kind of effort, excellent ideas never actually get their chance to make a difference in the world, because until they’re embodied in a real system, they can’t deliver tangible value. So this paper is very useful to anyone working on implementations of the Paxos protocol — it’s exactly the kind of experience that Waldo wishes more people could learn from.

The writers themselves have the following gripes:

Despite the large body of literature in the field, algorithms dating back more then 15 years, and experience of our team (one of us has designed a similar system before and the others have built other types of complex systems in the past), it was significantly harder to build this system then originally anticipated. We attribute this to several shortcomings in the field:

  • There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.
  • The fault-tolerance computing community has not developed the tools to make it easy to implement their algorithms.
  • The fault-tolerance computing community has not paid enough attention to testing, a key ingredient for building fault-tolerant systems.

As a result, the core algorithms work remains relatively theoretical and is not as accessible to a larger computing community as it could be. We believe that in order to make a greater impact, researchers in the field should focus on addressing these shortcomings.

The ironies here are so deep it’s hard to know where to start. Their implementation itself is not only proprietary to Google (and not open sourced), but it also relies on many other proprietary Google systems, including the Google file system. Hence their work itself is not directly available to the wider community for development and further discussion! Their paper has a number of interesting allusions (such as exactly why they needed to make their local log writing multi-threaded) that are not followed up. Unless they write many more papers, we will never know all the details of how their system works.

They criticize the fault-tolerant systems community for not having provided a more solid experience base from which to build. Waldo’s paper makes it crystal clear exactly why this base has been lacking: where is it to come from? Not from academia; research projects in academia tend to be too short-term and too small in scope to encounter the kinds of issues the Googlers did. And not from industry; Google is not known for open sourcing its core distributed software components, yet Google is arguably ahead of anyone else in this area!

The only alternative would be a true open source project. But large-scale distributed systems are probably among the least likely to achieve real momentum as an open source project, because actually using and testing them requires substantial dedicated hardware resources (many of the failure cases the Google team encountered arise only after running on dozens or hundreds of machines), and those resources are not available to any open source projects I’m aware of.

The Googlers are part of the problem, even while their paper seeks to be part of the solution. To some extent it’s a chicken-and-egg dynamic; without access to a truly large pool of machines, and a truly demanding set of developers and applications, it’s hard to get real-world experience with creating robust distributed infrastructure — but you almost have to be inside a large business, such as Google, in order to have such access at all.

So, unfortunately, it would appear that in the near term the Googlers are doomed to disappointment in their expectations of the research community. Google itself is likely to remain the preeminent distributed systems research center in the world, and the fewer of its systems it open sources, the less assistance the rest of the world will be able to provide it.

One can only hope that several years from now, Google’s applications will have evolved so greatly on top of its base infrastructure that it will no longer consider the fundamental systems it uses — MapReduce, BigTable, GFS, Chubby — to be key competitive differentiators, and will choose to open source them all. Of course, by then Google’s real difficulties will still be with problems the rest of the world wishes they had the resources to encounter….

A coda to this: John Carmack, of id software and Armadillo Aerospace fame, is known for open sourcing his game engines after five years or so. Recently he’s been doing work in mobile games, cellphone programming. Here’s a quote from a liveblog about his keynote at Quakecon last week:

Met with mobile developers at the Apple thing, all talking about how they make mistakes all the time. Carmack: “Can’t the guys who made the mistakes the first time just make the chips right this time?” Other devs: “Yeah, but most of those guys are too rich to care anymore.”

So that’s the other reason the field doesn’t make good progress… proprietary stuff gets built, developers get rich, technology gets sold and eventually back-burnered, and then it all has to get reinvented all over again. Open source: the only way to not reinvent the wheel every five years!

Written by robjellinghaus

2007/08/09 at 03:35

Posted in open source, research