The Best of Both Worlds

Along the lines of my post about avoiding religious wars over computer languages, I recently ran across two papers that got me seriously thinking about this whole “how to have the best of all worlds” problem.

Specifically, there was an excellent follow-on to the last OOPSLA conference, called the Library-Centric Software Design ’06 conference. Josh Bloch (of Effective Java fame) was one of the co-chairs. The papers were all focused on the general problem space of “active libraries” or “metaprogramming” or “extensible languages” — generally, the concept that your programming language can be extended by libraries that your programs use. Generally, this means that the typechecking of the language, or perhaps its very grammar, can be enhanced in an extensible way.

One of the more thought-provoking examples is a paper discussing Extending Type Systems in a Library, specifically about adding new “field types” to C++. One use for this is defining new type modifiers such as “notNull”, or “positive”, or “tainted”. These are all in some sense type template modifiers, in that for any pointer type T you could define a modified type notNull with type assertions preventing you from having null instances of that type. Or, for “tainted”, you could have just about any type T and create a variant type tainted indicating that the value of the base type originated from a suspect source.

The “tainted” use case tweaked my memory. The other recent work I’ve seen on taint analysis was the paper discussing how to use the BDDBDDB system to track tainted values through a Java program (previously discussed here). In that case, the system used sophisticated alias analysis to track the movement of values from tainted sources to vulnerable sinks (e.g. from dangerous methods receiving input from the Internet, to vulnerable methods issuing output to a database or filesystem or even another server connection).

Consider how similar these are. In the first case, you’re extending the language’s type system to distinguish tainted from untainted values. In the latter case, you’re analyzing a source base which has no such type distinctions, and statically proving whether or not the value flow does in fact permit tainted values to be used in a context that requires untainted values.

Yet also notice how different these are. In the extended-C++ case, actually using the tainted/untainted types would require you — in my reading of the paper — to essentially modify almost every method in your source code. All methods would need to declare their values as being either tainted or untainted, even methods which don’t actually ever use the possibly-tainted values in a vulnerable way. In other words, the property of taintedness is only relevant at one place in the program — the methods whose inputs are required to be untainted. But declaring the property as an explicit type annotation throughout the program forces your code to expose this distinction everywhere. Suddenly your language extension forces you into a truly massive refactoring!

In contrast, the BDDBDDB-based analysis essentially overlays the tainted property on the program’s alias graph. The analysis may well be no less sound… but how much more convenient for the programmer! Taintedness becomes a property that is derivable only when the programmer wishes to be aware of it. The rest of the time, the programmer can avoid worrying about it. In some sense, the security analysis becomes a “pay-as-you-go” property of the system, as opposed to a “pay-up-front” property in the fully statically typed technique.

Now, consider this a bit more broadly. Isn’t this exactly what dynamic language advocates hate about statically typed languages? Instead of worrying about tainted vs. untainted, let’s consider basic data typing: string vs. number. If you read in a string, and then you want to treat it as a number, isn’t that essentially exactly like a taint analysis? The only time you care whether the string is in fact not a number is if you actually try to use it as one. If you don’t ever try to use it as one, then why worry? Conversely, if you do try to use it as one, wouldn’t it be nice to have the system be able to trace its provenance back to where you originally read it, so you could decide if you wanted to insert a validity check there… without forcing you to declare its type at every intermediate program point where it flows?

It seems to me that extensible languages will have to shift a lot of their safety checking and property checking away from the static type system that’s exposed directly to the programmer. Consider all the possible properties you might want to know about the values in your program: are they string? Numeric? Complex? Are they null? Never-null? Are they tainted? Untainted? Are they structurally compliant XML? Are they valid values for your object-relational database? Are their dimensions correct for the physics math you want to do with them? There is a literally infinite variety of possible types that you might want to track through your program. And if you have to declare all of these types on every variable and method you declare, you pay a “static declaration tax” equal to the number of extended type properties times the number of declarations in your program. Whereas if the system is able to derive, or trace, or deduce the flow of these properties, and if the programming environment gives you enough insight into how that derivation works and what annotations you can add (very selectively) at only the points where the distinctions are relevant, you offload that tax onto the programming environment.

Imagine an IDE for some Ruby++ language. You can write your Ruby code in the normal way. But you can also annotate various methods with fully optional types. Methods that read from an HTTP connection mark their values as tainted. Methods that write to the database require untainted values. Arithmetic methods mark their inputs as numeric. And even better, there are ways (handwaving furiously) to define new kinds of traits and types; so you can implement complex numbers in the language, and use them without needing to explicitly declare their type throughout the program, and the system can track their value flow and create optimized code for their use. The end result: an extensible language with far more ease of use than extended-Java or extended-C++, and far more static safety than basic Ruby.

Maybe it is possible to have the best of all worlds. And in fact, if we want to truly achieve the dream of extensible statically-analyzed languages with usable syntax, maybe it’s not only possible but necessary.

Up with pay-as-you-go! Down with the static declaration tax! Up with extensible languages! Down with pure dynamism!

Maybe I’m a bit more religious than I thought.

What would such a language look like? Stay tuned!

…After writing the above, I surfed around for pluggable types — which was Gilad’s terma for the concept — and found several people who have suggested basically this: that static typing could be considered optional, and that extensible type systems might well require that viewpoint. There has been some real progress in the direction of viable partial type systems, and Javascript 4 is looking at adding strict mode.

Anyway, I’m leaving the body of the post above as I originally wrote it, because the slightly uninformed enthusiasm gave it some good energy 🙂

Written by robjellinghaus

2007/10/16 at 03:41

Posted in programming languages, research

robjsoftware.info