robjsoftware.info

A blog about software – researching it, developing it, and contemplating its future.

Archive for the ‘programming languages’ Category

Intentional shipped!!!


Intentional Software was founded in 2002 by Charles Simonyi, one of Microsoft’s first billionaires. They’ve been pretty vague ever since then, which led to the usual vaporware skepticism.

I’ve been following them for some time, because I had extremely positive results using domain-driven design at Nimblefish (my first serious enterprise software job). I’ve also blogged extensively here before about the appeal of extensible languages.

Well, last week they finally gave a real demo, and it turns out they’ve been very busy for these last seven years. The Intentional team has made some real leaps forward in the whole concept of language construction, multi-language projection, and bidirectional editing. Basically, their system lets you use any number of different languages to describe a problem domain; you can create your own languages, project them as Ruby/Java/C# code, tables, or diagrams, edit them in any format and have it translated directly to the others (to the extent possible), run the model directly while editing it, and just generally take metaprogramming to a whole new level.

Yes, I’m excited; it’s not every year (or even every decade) that you see a demo like this.

Here’s a breakdown of the contents, if you need to optimize your optical cycles:

  • 0:00 – 9:00: basic Powerpoint conceptual overview
  • 9:00 – 13:00: more technical Powerpoint about the structure of their system
  • 13:00 – 21:00: illustrating the “metamodel” for a state machine domain-specific language
  • 21:00 – 31:00: bidirectionally projecting and editing a state machine via Ruby, UML, etc.
  • 31:00 – 36:00: demonstrating an electronics DSL, including a live evaluation of electrical flow while editing the circuit graphically
  • 36:00 – 43:00: the Cap Gemini beta test product, implementing a pension system with live Intellisense on the actuarial math equations, and a temporal rule language for pension rules
  • 43:00 – 46:00: the multi-user versioning support that operates at the level of their fundamental tree-based data structure
  • 46:00 – end: various Q&A

If you only have 15 minutes, watch 31:00 – 46:00.

There are various systems that have done various subsets of all this before, but I’ve never seen it packaged in a unified way. Ever. It’s time to start watching Intentional very closely. It may also be time to check out the open-source JetBrains Meta Programming System. Software is all about raising levels of abstraction, and we might just have some new cranes coming online.

Edit: I just checked out the Meta Programming System’s tutorial, and yay! Looks like it’s a free version of many of these concepts that we can play with now! Time to tinker….


Written by robjellinghaus

2009/04/26 at 02:59

Not Dark, Just Busy


It’s amazing what a difference not commuting can make. Up until our move to Seattle in April, I’d been commuting an hour each way every day into San Francisco. That time on BART — without net access — was where all my blogging got done. Well, now I’m driving 15 minutes each way to work (modulo time to take my daughter to her school), and still coming down off the stress of moving, so I’ve been putting my energy elsewhere in the evenings. In any case, consider this a summer vacation for your RSS feed 🙂

Not that I’ve personally been vacationing — not at all! I’m finally getting up to some kind of reasonable cruising speed at work. It’s been a colossal re-education in .NET, C#, low-level Windows debugging, SQL data stack design, and about 50 different interesting research projects which I can’t really discuss — at least not yet. It’s very educational being on the list for all the different talks that happen inside Microsoft Research; there are a lot of different pies that those folks have their fingers in. The internal Windows developer talks are also very intriguing.

Now, this isn’t to say that everything is on the hush-hush. Some of the better publicly visible tidbits I’ve run across lately involve LINQ, Microsoft’s technique for data-oriented metaprogramming in the .NET Framework. I mentioned LINQ in my last post, but suffice it to say that it’s a combination of a set of data-oriented query operators (that form a composable, monadic algebra), some syntactic extensions to allow more idiomatic SQL-like query expressions, and an internal syntax-tree representation that can be transformed into native query languages (including but not limited to SQL) and partially evaluated into high-performance serializers. Overall it’s a very well-thought-out structure with a lot of room for growth. To wit:

  • DryadLINQ extends the LINQ operators to support MapReduce-like distributed cluster computation at scale.
  • PLINQ and the ParallelFX framework are building up the .NET infrastructure for fine-grained parallelism and efficient use of multi-core for data-parallel problems.
  • Matt Warren’s series on implementing a LINQ query provider is a great example of metaprogramming in practice — taking a C# abstract syntax tree describing a query, and transforming it into efficient SQL. This is the technique that is at the heart of LINQ. I’ve heard tell of F# projects that are using LINQ-like expressions to describe linear programming constraints declaratively — same metaprogramming paradigm, completely different domain.
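Since the reified expression trees are what make all of this provider magic possible, here’s a minimal sketch of what the C# compiler actually hands a query provider — the predicate is just an illustrative toy, not anything from the LINQ to SQL internals:

```csharp
// A minimal sketch of the reified-syntax-tree side of LINQ. The predicate
// here is an illustrative toy; a real provider would receive whole queries.
using System;
using System.Linq.Expressions;

class ExpressionTreeSketch
{
    static void Main()
    {
        // Assigning a lambda to Expression<...> captures it as data,
        // not as compiled IL.
        Expression<Func<int, bool>> pred = x => x > 42;

        // A query provider walks nodes like this one and translates them
        // (for example, into a SQL WHERE clause)...
        var body = (BinaryExpression)pred.Body;
        Console.WriteLine(body.NodeType);   // GreaterThan
        Console.WriteLine(body.Left);       // x
        Console.WriteLine(body.Right);      // 42

        // ...or it can compile the tree back into executable code.
        Func<int, bool> compiled = pred.Compile();
        Console.WriteLine(compiled(100));   // True
    }
}
```

A LINQ to SQL provider walks exactly these kinds of nodes and emits query text instead of compiling them back to IL.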

All of this is exciting because you noble long-term readers (e.g. since December 2007) will know how interesting metaprogramming is to me. Microsoft really gets it on multiple levels, and has put it into production with LINQ. There’s a lot more on the horizon, too, and I’m eagerly waiting for each bit of it to reach RTM in some form so I can happily blog away about it!

Not only that, but I have a personal hacking project again. Matt Warren’s blogs (mentioned above) are a great example of how to implement metaprogramming with an imperative C# visitor-style API. But I find the code hard to read — it’s taking this lovely declarative representation and then transforming it with all of this intensely imperative, explicitly scheduled transformation code. It reminds me of the JastAdd project, which has evidently now reached public release. JastAdd creates a declarative framework for implementing language analyses and transformations. I want to hack on an F# implementation of the JastAdd paradigm, applied to LINQ to SQL transformations. It would be very interesting to see if it could lead to something significantly easier to maintain and reason about.
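To make the contrast concrete, here’s a toy C# sketch of the declarative flavor I’m after — invented node types, nothing to do with the real LINQ to SQL internals, and my actual experiment would be in F#. The point is that transformations become small local rewrite rules applied to a fixpoint, rather than one big explicitly scheduled visitor:

```csharp
// A toy sketch of transformations as declarative rewrite rules applied to a
// fixpoint. The node types are invented for illustration; real LINQ to SQL
// rewriting is far richer than constant folding on arithmetic.
using System;
using System.Collections.Generic;

abstract record Node;
record Const(int Value) : Node;
record Add(Node Left, Node Right) : Node;
record Mul(Node Left, Node Right) : Node;

static class RuleRewriter
{
    // Each rule either returns a replacement node or null if it doesn't apply.
    static readonly List<Func<Node, Node?>> Rules = new()
    {
        n => n is Add(var l, Const(0)) ? l : null,          // x + 0  =>  x
        n => n is Mul(var l, Const(1)) ? l : null,          // x * 1  =>  x
        n => n is Mul(_, Const(0)) ? new Const(0) : null,   // x * 0  =>  0
    };

    // Rewrite children first, then keep applying rules until none fire.
    public static Node Rewrite(Node n)
    {
        n = n switch
        {
            Add(var l, var r) => new Add(Rewrite(l), Rewrite(r)),
            Mul(var l, var r) => new Mul(Rewrite(l), Rewrite(r)),
            _ => n,
        };
        foreach (var rule in Rules)
            if (rule(n) is Node replaced) return Rewrite(replaced);
        return n;
    }

    static void Main()
    {
        // (7 * 1) + 0  simplifies to  7
        Node expr = new Add(new Mul(new Const(7), new Const(1)), new Const(0));
        Console.WriteLine(Rewrite(expr));   // Const { Value = 7 }
    }
}
```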

This is something that is arguably relevant to my work. So I am going to blog about it internally first, and then repost publicly a week or so later (on a per-post basis). If it gets to where it is interesting enough for Microsoft to want to privatize it, I’ll just have to find something else to blog publicly about! In any case, it’ll be fun to post about here for as long as it lasts 🙂

Written by robjellinghaus

2008/08/03 at 21:04

Only 20,000 Lines


A while back I posted a big ol’ post titled A Growable Language Manifesto which argued strongly for extensible languages.

Well, I just ran across the one-year progress report from Alan Kay‘s current research group, and it’s some extremely exciting work that is all about extensible languages!

The group is the Viewpoints Research Institute, and the progress report lays out their plan to implement a complete software stack — everything from the graphics driver, to the language interpreter / compiler, to the TCP/IP stack, to the windowing and rendering system, to the IDE and programming environment — in 20,000 lines of code. Total.

As they point out, a TCP/IP stack alone in many conventional languages is more than 20,000 lines. So how can they possibly pull it off?

The answer, it turns out, is extensible languages. Specifically, they have an extensible parsing system — OMeta, cited heavily in my manifesto — which allows them to easily and quickly extend their languages. They also have a “meta-meta-language runtime and parametric compiler” named IS, which is how they actually get their languages and metalanguages into executable form.

One especially cool example is their TCP/IP stack. The TCP/IP specification has text diagrams of packet formats. So they wrote a grammar to parse those specifications directly as ASCII. And lo and behold, they could use the TCP/IP RFCs themselves to generate their source code. They also can use their parsing framework to analyze the structure of TCP/IP messages — they basically define a grammar for parsing TCP/IP messages, and action rules for handling the various cases. (OMeta lets executable code be attached to matching productions in a grammar.)
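To give a flavor of the idea — this is my own toy sketch, not Viewpoints’ OMeta code, and the layout format is invented — imagine parsing a drastically simplified field-layout description and immediately using the result to decode packet bytes:

```csharp
// A toy sketch of "the spec is the grammar is the parser": parse a drastically
// simplified field-layout description, then use it to decode packet bytes.
// The layout format and field names are invented for illustration.
using System;
using System.Linq;

class PacketSpecSketch
{
    static void Main()
    {
        // Hypothetical layout text: one "name:bits" entry per line.
        string spec = "SourcePort:16\nDestPort:16\nSeqNumber:32";

        // "Grammar" step: turn the spec into (name, bit width) pairs.
        var fields = spec.Split('\n')
            .Select(line => line.Split(':'))
            .Select(parts => (Name: parts[0], Bits: int.Parse(parts[1])));

        // "Action" step: the parsed spec drives the decoding of real bytes.
        byte[] packet = { 0x00, 0x50, 0x1F, 0x90, 0x00, 0x00, 0x00, 0x2A };
        int bitOffset = 0;
        foreach (var f in fields)
        {
            ulong value = 0;
            for (int i = 0; i < f.Bits; i++, bitOffset++)
            {
                int bit = (packet[bitOffset / 8] >> (7 - bitOffset % 8)) & 1;
                value = (value << 1) | (ulong)bit;
            }
            Console.WriteLine($"{f.Name} = {value}");   // 80, 8080, 42
        }
    }
}
```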

They also wrote domain-specific languages for just about every area. One example is low-level pixel compositing, basically giving them the functionality, and most of the efficiency, of a generative 2D pixel processing library such as Adobe’s Generic Image Library, cited in Bourdev and Jaaki’s LCSD 2006 paper. Another example is polygon rendering (450 lines of code that implements anti-aliased rasterization, alpha compositing, line and Bezier curve rendering, coordinate transformations, culling, and clipping). Though evidently they have yet to fully define a custom language for polygon rendering, and they hope to cut those 450 lines by “an order of magnitude”.

Basically, they take almost all parsing and optimization problems and express them directly in their extensible language, which gives them almost optimal flexibility for building every part of the system in the most “language-centric” way possible.

They have no static typing at all, which doesn’t work for me (though their line counts make a compelling argument), but there’s no reason in principle that these techniques couldn’t also apply to static type system construction.

In fact, there is work right now on intuitive language environments for creating written formal definitions of programming languages. A system like Ott lets you write semantics declarations that look like they came straight out of a POPL paper, and convert them into TeX (for printing) or Coq/Isabelle/HOL (for automated verification). I don’t know how far the implementation of Ott or Coq/Isabelle/HOL would change if Viewpoints’ techniques were aggressively applied, but I look forward to finding out!

I think this kind of programming is the wave of the future. Reducing lines of code has always been an excellent way to improve your software, and the expressiveness of a language has always shaped how succinctly you can write your code. From that perspective, it seems obvious that an extensible language would provide maximum potential increase in expressiveness, since you can tune your language to the details of your specific problem. It’s a force multiplier, or possibly a force exponentiator.

Object-oriented class hierarchies, query languages, XML schemas, document structures, network protocols, display lists, parse trees… they all share a common meta-structure that the “extensible languages” concept subsumes, and the Viewpoints project is the clearest evidence I’ve seen of that yet. It’s going to be a dizzyingly exciting next few years in the programming language world, and programming might just get a lot more interesting soon — for everybody.

There’s more to say on that topic, but I’ll save it for another post!

Written by robjellinghaus

2008/04/15 at 04:37

A Growable Language Manifesto


Warning: this is by far the largest and most action-packed post I’ve ever made. Grab a cup of [insert favored beverage here], sit back, and enjoy. If you get headcramp from reading this in narrow-column blog format, there’s another full-screen version here — but please return to this post to leave your comments!

I’ve posted recently about the dynamic versus static flamewar and about recent research in extensible languages. The more I think about these ideas, the more compelling they get. It seems clear to me that these ideas are synergistic and point in the direction of a new kind of programming environment — one which could potentially offer the ease of dynamic languages, the safety of static languages, and an altogether new level of extensibility (both for enhancing the type system and for allowing safe metaprogramming.)

So I want to lay out a manifesto here for a growable programming language. Or perhaps it’s more like a toolbox for language construction. Or perhaps it’s a framework for experimenting with extensible syntax and extensible analysis. In any case, here is the vision. At the end, I include extensive references to the research that’s inspired these ideas.

  • A growable language should be opt-in. It must support a graceful spectrum of usage styles. A growable language will necessarily be multi-paradigmatic, since it will be extensible with features from various fields of programming. The language should implement a dynamically-typed core on which all static analysis is built, to allow programmers a straightforward and lightweight syntax for prototyping.
  • A growable language should be increasingly powerful. It should support a rich and expanding variety of static analyses and dynamic checking, to give maximal benefit to the programmer who wishes to leverage these features. Over time, as more program analyses become possible, a growable language should gracefully integrate them. A growable language must continually increase the expressive power of the programmer, both in terms of what the programmer can say and in terms of what the environment can help the programmer prove about their program. A growable language should have a powerful analysis framework as a foundational component, such that other analyses can share that basic capability; wherever possible, additional analyses should require only incremental effort by leveraging infrastructure already created.
  • A growable language needs a declarative metalanguage. Current languages do not define inherent mechanisms for syntax extension and for type extension. The syntax and the type system are both baked into the language’s compiler. A growable language needs to lift the specification level of the language itself to be both declarative and modularly defined, so that syntax and type analysis can both be expressed and implemented as layered extensions.
  • A growable language needs partial typing. Certain analyses are necessarily incomplete or even undecidable. A growable language should permit such type systems, and should be able to fall back on dynamic typing whenever analysis is inconclusive. This lets programmers work freely without having to continually meet the demands of the type system, yet supports graceful type annotation to enhance static safety at the programmer’s discretion. (Without partial typing, the opt-in goal cannot be met.) The language should ideally provide clear feedback as to when its analyses are conclusive or inconclusive, and ideally should identify the sources of inconclusiveness so the programmer can either annotate them appropriately or deliberately ignore them. (A rough code sketch of this idea follows this list.)
  • A growable language needs layered types. A growable language should be able to extend itself with primitive types, dependent types (e.g. data-dependent subrange types), traits or other parametric or abstraction-based genericity mechanisms, and multiple flavors of type qualifiers. Integers should be as much of a language extension as non-null types, tainted types, or range types. A growable language requires an extensible subtype lattice.
  • A growable language needs inferrable types. To avoid drowning in explicit type declarations, the language should be able to infer types wherever possible, and the programming environment should support controllable visibility for the inferred type information. Without inferred types and environmental support for controlling analysis visibility, a growable language cannot scale for users; being able to selectively ignore some or all of the (many) analyses is critical.
  • A growable language needs explicit, optional type annotations. An extensible analysis framework will be able to infer or deduce a great deal about the actual behavior of the program. But the actual behavior of the program may or may not correspond to the programmer’s intent. The programmer’s explicit annotations express the programmer’s expectations. A programmer might look at the analyzed annotations and make some of them explicit — perhaps they reflect properties of the code that the programmer considers important after the fact, and that the programmer now wants to enforce. Or a programmer might add explicit annotations during development, such that the system can confirm they are valid and warn when they are violated. Explicit annotations at module boundaries — whether programmer-generated or system-generated — are likely to aid in separate compilation and in module documentation.
  • A growable language must be efficiently implementable. As more language features are added — more qualifiers, more types, more syntax — the language must still be efficient and usable. This applies both to programs written in the language (which should have performance competitive at least with Java or the CLR) and to programming environments for the language. The latter requires aggressive attention to analytical optimizations and to multi-threaded analysis frameworks. As the language’s analysis structure grows, the language’s programming environments must be able to leverage multicore architectures to ensure consistent responsiveness for users. Moreover, the language should be continuously analyzing in the background; incremental feedback as the user edits should be available for all language features and extensions.
  • A growable language must have a unified internal representation. Concrete syntax, abstract syntax, type and name bindings, type qualifier propagation, type parameterization, dependent range type constraints, and logical queries over the program’s alias structure should all leverage a single internal representation of the program. This maximizes the reusability of internal language implementation code and ensures consistency in analytical derivations. Where multiple representations are necessary, the derivation rules must be clear and consistent.
  • A growable language must promote static metaprogramming. Fully dynamic metaprogramming — runtime extension of classes and objects with arbitrary code, or even unrestricted dynamic reflective access to arbitrary methods — is almost impossible to analyze effectively. To give a growable language’s extended type systems maximum exposure to the actual behavior of the code, static metaprogramming techniques must be definable, extensible, and compatible with the rest of the language’s structure. One would hope that the very extensibility techniques that implement the language itself would be usable for static metaprogramming.
  • A growable language must support diverse analyses. Some analyses are most naturally expressed as systems of recursive equations over an abstract syntax tree. Others can best be expressed as logical queries over a program’s alias graph. Ideally these could both be expressed naturally in the metalanguage.
  • A growable language must be analytically composable. This is likely the single most technically ambitious goal. Traditional compiler development involves subtle and explicit scheduling tradeoffs, layering multiple analyses through manual interleavings that can be fragile or non-optimal. A growable language with a declarative metalanguage needs an analysis engine that can automatically schedule multiple, potentially interleaved analyses, leveraging parallelism where possible to optimize non-dependent analyses. Achieving this goal will be immensely difficult, but proportionately valuable. Interestingly, here is where the metalanguage itself will require many of the same extensibility properties as the language it describes; meta-level closure — implementing the metalanguage using the language itself — will be a holy grail for this language design.
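As promised under the partial typing goal above, here’s a rough sketch of the idea in ordinary C# — untyped code passes values around freely, and the check happens only at the boundary where a value flows into statically typed code. This is an illustration of the gradual-typing flavor, not a design for the language itself:

```csharp
// A rough sketch (mine, not any of the cited systems) of the partial-typing
// goal: untyped code passes values around freely, and the check happens only
// at the boundary where a value flows into statically typed code.
using System;

class PartialTypingSketch
{
    // Statically typed "island": calls to this are fully checked at compile time.
    static int Double(int x) => x * 2;

    // Untyped "ocean": this could hand back anything, as in a dynamic language.
    static object ReadConfigValue() => "41";

    static void Main()
    {
        object untyped = ReadConfigValue();

        // The boundary: a runtime check stands in where no static proof exists.
        if (untyped is int n)
        {
            Console.WriteLine(Double(n));
        }
        else
        {
            // In a gradually typed language this is where the cast failure
            // would be reported -- at the typed/untyped seam, not elsewhere.
            Console.WriteLine($"boundary check failed: got {untyped.GetType().Name}");
        }
    }
}
```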

Is the above language even possible? Is it too much of a stretch — particularly the “unified internal representation” and “analytically composable” goals? Maybe so. I’m definitely not an expert at programming language implementation; the only compiler I ever wrote was back in high school. So this may be ridiculously unrealistic in whole or in part. I welcome feedback on which areas of this manifesto are more or less plausible.

Overall, consider this my stab at answering Paul Graham’s challenge to ponder the hundred-year language. Given current trends in programming language development, it seems that languages of the future will transcend being “languages” as we know them and will become more like unified environments for language creation and extension. Arguably, this vision has a lot in common with intentional programming, which doesn’t bode well, since the intentional guys have been in stealth mode for almost fifteen years and nothing has publicly come of it. But that doesn’t mean the general direction isn’t interesting, any more than the slow progress of Chandler means that a unified and flexible personal information manager isn’t worth pursuing.

I promised references. Here they are:

  • opt-in — Gilad Bracha’s pluggable type systems paper is the origin of this goal. Bracha forcefully posits that static type systems are all necessarily incomplete and that dynamically typed languages have necessary flexibility. Meijer makes a similar point in his static where possible, dynamic where necessary paper. I’m not clear that Bracha’s extreme view — that all static analysis must be kept out of the language kernel — is the correct one, given the potential performance cost, but I suppose RPython provides an encouraging counterpoint.
  • increasingly powerful — There are increasingly many varieties of static analysis being developed primarily for Java. One recent paper (http://www.cs.umd.edu/~jfoster/papers/oopsla07-uno.pdf) points out that its framework could straightforwardly be implemented on top of other analyses with similar intraprocedural resolution. The BDDBDDB framework has already been used for implementing taint analysis, and the JastAdd system for non-null Java type inference. In general it seems there is a lot of opportunity for shared infrastructure here. Also note that support for excellent error messages and great visibility into analysis results (and analysis failures) will be critical for usability. See Grimm’s paper titled Systems Need Languages Need Systems for some forceful advocacy here.
  • a declarative metalanguage — Some good examples of metalanguages for syntactic language description are the OMeta pattern-matching language for executable grammars, Gilad’s executable grammars in Newspeak, and the Rats! extensible parsing system. A good example of an extensible language for static analysis is the JastAdd extensible Java compiler with its support for defining rewritable circular reference-attributed grammars… they implemented Java 1.5 generics as a modular declarative compiler extension, which proves their point to me, anyway!
  • partial typing — The two best examples of this that I know of are the gradual typing work of Siek and Taha, and the hybrid typing for undecidable type systems work of Flanagan, Freund, et al. In both cases, a static type system is enhanced with a generalized type (named Dynamic or “?”), which is inferred to be a more specific static type where possible, and otherwise cast at runtime to preserve dynamic type safety.
  • layered types — The already-mentioned JastAdd system is perhaps the best example of a structure which permits layering additional analyses. The extensible Java compiler Polyglot is another.
  • inferrable types — My original thinking about this entire line of research originated in a blog post from a couple of months ago where I realized that some implementations of type qualifiers — for example, a “tainted” type qualifier in C++ — would ripple through the whole program due to mandatory explicit static typing everywhere. It’s noteworthy that many type qualifier analyses for Java are not based on explicit syntax. For example, the taint analysis based on the BDDBDDB framework does not require explicit propagation of tainted or untainted declarations, yet it derives such information throughout the program’s structure. An environment which made the results of that analysis visible — and traceable — at every program point would let the programmer see the flow of tainted values without having to explicitly declare them.
  • explicit, optional type annotations — Programmers must also be able to add explicit qualifiers, since the programmer’s intent may or may not match the analysis; the analysis may be incomplete and the programmer needs to provide more information, or the analysis may be consistent and the programmer wants to declare that they confirm it and want it to be enforced at that location (e.g. if the program changes such that the property is no longer true there, the language would signal an error). The programmer’s explicit intent and the analyser’s implicit understanding should be able to be flexibly cross-checked. I’m not aware of any inference systems that support this fully; they seem to be either purely inference-based (e.g. JQual) or purely annotation-based (e.g. a C++-based type qualifier system discussed in the “Extending Type Systems in a Library” paper from LCSD ’06).
  • efficiently implementable — This is obviously enormously difficult, insofar as analyses can in general be interdependent. There is a great tension between layering analyses (for separability) and weaving them (for mutual dependence and synergy). See the “analytically composable” goal below. In general, I wouldn’t be surprised if aggressive parallelization of a language analysis / compilation framework required software transactions to support optimistic and incremental analysis by multiple threads.
  • a unified internal representation — I mean something along the lines of Grimm’s declarative extensible syntax trees, a concept proven by his weaving of Java and C typechecking into a single system. The JastAdd framework is another example; there is no separate symbol table in JastAdd, since name bindings become reference links in the abstract syntax tree (which is really more of an abstract syntax graph). Note that JastAdd’s own declarative language for extending the syntax tree is fundamentally similar to open classes, in that subsequent extensions can directly modify the structure of already-defined syntax node classes. This seems to inhibit modular development of language extensions, but modular language extension development is really hard anyway.
  • promote static metaprogramming — This goal is about ensuring the entire program text remains analyzable, and about permitting domain-specific languages to be implemented in the same structure used for extending the base language. See OMeta’s SQL extensions or C# 3.0’s language-integrated queries which reify some expressions as syntax trees visible to the runtime system. The Google Web Toolkit’s support for extensible code generation is another example, as is the RapidMind system for creating parallel compute kernels from C++ code. Finally, there’s the RPython system which creates a “metaprogramming startup phase” for Python programs, followed by a static compilation phase yielding 100x speedups. Interestingly, this whole goal contradicts Paul Graham’s 2001-era view that “true macro systems aren’t compatible with static type systems.”
  • support diverse analyses — The two best already-cited examples here are the JastAdd grammars formalism and the BDDBDDB Datalog-based analysis specification. These are radically different but potentially very synergistic. I’d be fascinated to see whether there’s some deeper commonality between them….
  • analytically composable — The JastAdd framework seems the best example of a declarative structure that supports automatic weaving of multiple analyses. For evidence to this effect, consider that the JastAdd folks claim that the Polyglot team is reimplementing their framework in JastAdd to avoid the difficulties of scheduling dozens of analysis passes.

I plan to start experimenting with some prototypes of an Eclipse plugin for an extensible language framework along these lines, likely starting with something much like OMeta and extending it with a JastAdd-like rewritable grammar formalism. This will be open source all the way. I would enthusiastically welcome all pointers to similar projects, all interest in helping with such a framework, and all critical comments on all or part of this manifesto!

(Disclaimer: our family will also be moving out of our house in the next three months, so progress may well be slow 🙂)

Written by robjellinghaus

2007/12/04 at 19:11

OOPSLA 2007 – Languages Gone Wild


It has come to my attention that my writing here has been a bit boring. Dry. Stuffy. Well, Kevin Bourrillion has shown me the way. Time to liven it up a bit. Just a bit, mind you. Juuuuuuuuust a bit.

And please don’t give me trouble over my inability to post more than once a week. Really all I can say is ARRRRGH! Work, upcoming conferences, and raising two very young kids = JEEZ WHEN DO PEOPLE EVER GET TIME TO BLOG? And while we’re on that topic, I don’t see you blogging enough either, do I? Don’t be throwing any stones, Mr. Glass House.

OK, where were we? Ah yes, OOPSLA 2007. What a beautiful conference. Not that I attended or anything, but who needs to attend a conference when you have the Internet?

The dynamic vs. static languages flame war has bogged down the language community over the last decade. It’s great to see that, judging by some recent papers from the latest OOPSLA, the logjam has broken with a vengeance. There are all kinds of brain-bending new languages in the works, and it’s frankly exhilarating. (Sorry that some of these are only available through the ACM’s Digital Library… but at $99 per year, how can you afford NOT to be a member???)

First we have a great paper on RPython, a project which creates a bimodal version of Python. You can run your Python program in a fully interpreted way in a startup metaprogramming phase, and then the RPython compiler kicks in, infers static types for your entire metaprogrammed Python code base, and compiles to the CLR or to the JVM with a modest performance increase of EIGHT HUNDRED TIMES. Yes, folks, they’ve run benchmarks of Python apps that run 800x faster with their system than with IronPython (which is no slouch of a Python implementation AFAIK). If that isn’t a great example of how you can have your dynamic cake and eat it statically, I don’t know what is.

Another lovely system is OMeta, a pattern language for describing grammars. You can write everything from a tokenizer up to a full language in a really nice executable grammar structure, with productions that map directly to some underlying base language. They also have a good modularity story worked out, and support for stateful parsing. They have a 177-line Javascript parser, and that’s not much!

Then there’s an equally great paper on JastAdd, an extensible compiler for Java. The JastAdd compiler is built around an extensible abstract syntax tree. The abstract syntax tree is the ONLY data structure in the compiler — there are no separate symbol tables or binding tables; everything is implemented as extensions to the abstract syntax tree. The extensions are expressed with a declarative language that lets you define dataflow equations relating different elements in the tree — inherited elements (for referring to names bound in a parent construct, for example), or reference elements (for referring to remote type declarations, for example).

The compiler has an equation analysis engine that can process all these equations until it reaches a fixpoint, which completely avoids all the usual multi-phase scheduling hassles in compilers around interleaving type analysis with type inference, etc. It seems like The Right Thing on a number of levels, and it makes me want to hack around with building a compiler along similar declarative lines. They give examples of extending Java with non-null types and of implementing Java 5 generics purely as a declarative compiler extension. That, to me, pretty much proves their point. Bodacious! I had been thinking that executable grammars were a nice way to go, but seeing their declarative framework’s power is seriously making me reconsider that idea. What would you get if you combined OMeta and JastAdd? Something beautiful. I’m not sure how you’d combine the statefulness of OMeta with the declarativeness of JastAdd, but we must ponder it deeply, because the One True AST is a goal worth seeking.
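To see why the fixpoint approach appeals to me, here’s a toy sketch — mine, not JastAdd’s actual machinery — of a single attribute (“nullable”, per grammar symbol) defined by an equation and computed by simply re-evaluating until nothing changes:

```csharp
// Not JastAdd itself -- just a toy sketch of the fixpoint idea: an attribute
// ("nullable", per grammar symbol) defined by equations over other attributes,
// evaluated by re-running the equations until nothing changes. The grammar is
// invented for illustration.
using System;
using System.Collections.Generic;
using System.Linq;

class FixpointSketch
{
    static void Main()
    {
        // Productions per nonterminal; lowercase strings are terminals,
        // and an empty alternative means epsilon.
        var productions = new Dictionary<string, string[][]>
        {
            ["A"] = new[] { new[] { "B", "C" } },                 // A -> B C
            ["B"] = new[] { new string[] { }, new[] { "b" } },    // B -> ε | b
            ["C"] = new[] { new[] { "c" }, new string[] { } },    // C -> c | ε
        };

        // Equation: nullable(X) = OR over alternatives of AND over symbols.
        var nullable = productions.Keys.ToDictionary(k => k, _ => false);
        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var (nt, alts) in productions)
            {
                bool value = alts.Any(alt =>
                    alt.All(sym => nullable.TryGetValue(sym, out var v) && v));
                if (value != nullable[nt]) { nullable[nt] = value; changed = true; }
            }
        }

        foreach (var (nt, isNullable) in nullable)
            Console.WriteLine($"nullable({nt}) = {isNullable}");   // all True here
    }
}
```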

A truly mind-bending paper discusses breaking the two-level barrier. What’s the two-level barrier? Simple: it’s the class/object distinction. They point out that many kinds of modeling can’t be done with a class hierarchy. What you really want is a programmer-accessible metaclass hierarchy. (And not a weenie one like Smalltalk’s, either.) For example, consider an online store. You thought you knew everything about online stores? THINK AGAIN, JACKSON. Let’s say you have a DVD for sale, such as Titanic. That Titanic DVD is an instance of the DVD product class. The DVD product class is conceptually a further instance of the DigitalMedia product class. I meant exactly what I said there — in their framework, one class can be an instance of another class.

You can then state that the DigitalMedia metaclass defines a “categoryName” and a “net price”, requiring that “categoryName” be defined by instances of DigitalMedia, and that “net price” be defined by instances of instances of DigitalMedia. The DVD class then defines “categoryName” to be “DVD”, so ALL DVDs have the same category name. And then particular instances of DVD define their actual net prices individually. In this way, you can take the same kinds of “must provide a value for this field” constraints that exist in the class-object hierarchy, and extend it to multiple levels, where grandparent metaclasses can express requirements of their meta-grandchild instances.

(They use the abysmal word “clabject” — ClAss obJECT — to refer to entities that can be used to instantiate objects (like classes), but that ARE themselves instantiated (like objects). I think “clobject” would have been better, or maybe “obclass” or something. “Clabject” just sounds… I don’t know… disturbing. Like some kind of hapless crab that’s filled with techno-malice. But the concept itself is very interesting. I think that having two orthogonal hierarchies — the metaclass hierarchy and the subclass hierarchy — is potentially too confusing for most programmers, including me, but it’s nonetheless really thought-provoking.)
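Here’s a toy sketch of the multi-level idea in plain C# — my own illustration, not the paper’s framework — where the metaclass level is just another object, categoryName gets fixed one level down, and net price only shows up two instantiation levels down:

```csharp
// A toy sketch in plain C# of the multi-level idea -- my own illustration, not
// the paper's framework. The "DVD class" is itself an object (an instance of
// the metaclass), categoryName is fixed at that level, and net price only
// appears two instantiation levels down.
using System;

// The metaclass: its instances (like `dvd` below) play the role of classes.
class DigitalMediaKind
{
    public string CategoryName;     // defined once per product class
    public DigitalMediaKind(string categoryName) => CategoryName = categoryName;

    // Instantiating the class level demands the field that only
    // instances-of-instances must define: the net price.
    public Item NewItem(string title, decimal netPrice) => new Item(this, title, netPrice);
}

// The bottom level: a concrete item for sale.
class Item
{
    public DigitalMediaKind Kind;
    public string Title;
    public decimal NetPrice;
    public Item(DigitalMediaKind kind, string title, decimal netPrice) =>
        (Kind, Title, NetPrice) = (kind, title, netPrice);
}

class MetaclassSketch
{
    static void Main()
    {
        var dvd = new DigitalMediaKind("DVD");          // the DVD "class", as an object
        var titanic = dvd.NewItem("Titanic", 9.99m);    // an instance of an instance
        Console.WriteLine($"{titanic.Title}: {titanic.Kind.CategoryName} at {titanic.NetPrice}");
    }
}
```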

Those are just four of the highlights — I’m only about a third of the way through reading the OOPSLA papers this year — but I think they’re the standouts when it comes to language design. It’s going to be a great next decade, as the whole static vs. dynamic war gives way to a myriad of bizarre hybrids and mutants, greatly enhancing (and confounding) the lives of hackers everywhere!

Written by robjellinghaus

2007/10/30 at 03:39

The Best of Both Worlds


Along the lines of my post about avoiding religious wars over computer languages, I recently ran across two papers that got me seriously thinking about this whole “how to have the best of all worlds” problem.

Specifically, there was an excellent follow-on to the last OOPSLA conference, called the Library-Centric Software Design ’06 conference. Josh Bloch (of Effective Java fame) was one of the co-chairs. The papers were all focused on the general problem space of “active libraries” or “metaprogramming” or “extensible languages” — generally, the concept that your programming language can be extended by libraries that your programs use. Generally, this means that the typechecking of the language, or perhaps its very grammar, can be enhanced in an extensible way.

One of the more thought-provoking examples is a paper discussing Extending Type Systems in a Library, specifically about adding new “field types” to C++. One use for this is defining new type modifiers such as “notNull”, or “positive”, or “tainted”. These are all in some sense type template modifiers, in that for any pointer type T you could define a modified type notNull<T> with type assertions preventing you from having null instances of that type. Or, for “tainted”, you could have just about any type T and create a variant type tainted<T> indicating that the value of the base type originated from a suspect source.
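The paper’s examples are C++, but the library-extension idea translates roughly like this (a sketch with invented names, not the paper’s actual code): the qualifier becomes a wrapper type, and the only way to turn a tainted value back into a plain one is an explicit sanitizing step:

```csharp
// A rough C# analogue (invented names, not the paper's C++ code) of a type
// qualifier implemented as a library type: a Tainted<T> wrapper whose raw
// value can only be recovered through an explicit sanitizing step.
using System;

readonly struct Tainted<T>
{
    private readonly T value;
    public Tainted(T value) => this.value = value;

    // The only exit: you must say how the value gets cleaned up.
    public T Sanitize(Func<T, T> sanitizer) => sanitizer(value);
}

class TaintSketch
{
    // Values from the network come back already marked as tainted.
    static Tainted<string> ReadFromRequest() => new Tainted<string>("42; DROP TABLE Users");

    // The sink demands a plain string, so a Tainted<string> won't compile here.
    static void RunQuery(string sql) => Console.WriteLine($"querying: {sql}");

    static void Main()
    {
        Tainted<string> input = ReadFromRequest();

        // RunQuery(input);   // rejected by the type checker: wrong type

        RunQuery(input.Sanitize(s =>
            new string(Array.FindAll(s.ToCharArray(), char.IsDigit))));   // "42"
    }
}
```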

The “tainted” use case tweaked my memory. The other recent work I’ve seen on taint analysis was the paper discussing how to use the BDDBDDB system to track tainted values through a Java program (previously discussed here). In that case, the system used sophisticated alias analysis to track the movement of values from tainted sources to vulnerable sinks (e.g. from dangerous methods receiving input from the Internet, to vulnerable methods issuing output to a database or filesystem or even another server connection).

Consider how similar these are. In the first case, you’re extending the language’s type system to distinguish tainted from untainted values. In the latter case, you’re analyzing a source base which has no such type distinctions, and statically proving whether or not the value flow does in fact permit tainted values to be used in a context that requires untainted values.

Yet also notice how different these are. In the extended-C++ case, actually using the tainted/untainted types would require you — in my reading of the paper — to essentially modify almost every method in your source code. All methods would need to declare their values as being either tainted or untainted, even methods which don’t actually ever use the possibly-tainted values in a vulnerable way. In other words, the property of taintedness is only relevant at one place in the program — the methods whose inputs are required to be untainted. But declaring the property as an explicit type annotation throughout the program forces your code to expose this distinction everywhere. Suddenly your language extension forces you into a truly massive refactoring!

In contrast, the BDDBDDB-based analysis essentially overlays the tainted property on the program’s alias graph. The analysis may well be no less sound… but how much more convenient for the programmer! Taintedness becomes a property that is derivable only when the programmer wishes to be aware of it. The rest of the time, the programmer can avoid worrying about it. In some sense, the security analysis becomes a “pay-as-you-go” property of the system, as opposed to a “pay-up-front” property in the fully statically typed technique.

Now, consider this a bit more broadly. Isn’t this exactly what dynamic language advocates hate about statically typed languages? Instead of worrying about tainted vs. untainted, let’s consider basic data typing: string vs. number. If you read in a string, and then you want to treat it as a number, isn’t that essentially a taint analysis? The only time you care whether the string is in fact not a number is if you actually try to use it as one. If you don’t ever try to use it as one, then why worry? Conversely, if you do try to use it as one, wouldn’t it be nice to have the system be able to trace its provenance back to where you originally read it, so you could decide if you wanted to insert a validity check there… without forcing you to declare its type at every intermediate program point where it flows?

It seems to me that extensible languages will have to shift a lot of their safety checking and property checking away from the static type system that’s exposed directly to the programmer. Consider all the possible properties you might want to know about the values in your program: are they strings? Numeric? Complex? Are they null? Never-null? Are they tainted? Untainted? Are they structurally compliant XML? Are they valid values for your object-relational database? Are their dimensions correct for the physics math you want to do with them? There is a literally infinite variety of possible types that you might want to track through your program. And if you have to declare all of these types on every variable and method you declare, you pay a “static declaration tax” equal to the number of extended type properties times the number of declarations in your program. Whereas if the system is able to derive, or trace, or deduce the flow of these properties, and if the programming environment gives you enough insight into how that derivation works and what annotations you can add (very selectively) at only the points where the distinctions are relevant, you offload that tax onto the programming environment.

Imagine an IDE for some Ruby++ language. You can write your Ruby code in the normal way. But you can also annotate various methods with fully optional types. Methods that read from an HTTP connection mark their values as tainted. Methods that write to the database require untainted values. Arithmetic methods mark their inputs as numeric. And even better, there are ways (handwaving furiously) to define new kinds of traits and types; so you can implement complex numbers in the language, and use them without needing to explicitly declare their type throughout the program, and the system can track their value flow and create optimized code for their use. The end result: an extensible language with far more ease of use than extended-Java or extended-C++, and far more static safety than basic Ruby.

Maybe it is possible to have the best of all worlds. And in fact, if we want to truly achieve the dream of extensible statically-analyzed languages with usable syntax, maybe it’s not only possible but necessary.

Up with pay-as-you-go! Down with the static declaration tax! Up with extensible languages! Down with pure dynamism!

Maybe I’m a bit more religious than I thought.

What would such a language look like? Stay tuned!

…After writing the above, I surfed around for pluggable types — which was Gilad’s term for the concept — and found several people who have suggested basically this: that static typing could be considered optional, and that extensible type systems might well require that viewpoint. There has been some real progress in the direction of viable partial type systems, and Javascript 4 is looking at adding strict mode.

Anyway, I’m leaving the body of the post above as I originally wrote it, because the slightly uninformed enthusiasm gave it some good energy 🙂

Written by robjellinghaus

2007/10/16 at 03:41

BDDBDDB – Pointer alias analysis comes of age


One of the most exciting program analysis techniques I’ve ever seen (I love having my own geek blog, where I can say things like that with a straight face) is the BDDBDDB system, from Whaley and Lam at Stanford. (I will capitalize BDDBDDB here, though they don’t.)

Pointer alias analysis means tracking the movement of object references through a large piece of code, across method calls and through heap references. I’ve been reading programming research papers for almost two decades now, and for much of that time, pointer alias analysis was a chimera. It just wasn’t possible to make it scale. The number of potential paths in even a moderately large piece of code is dizzying — numbers like 10 to the 14th power aren’t uncommon. And for most of the last two decades, this has been an uncracked nut.

Well, Whaley and Lam seem to have finally cracked it, by building on other research done using binary decision diagrams (BDDs) to compactly encode the movement of values through variables in a program, in a way that maximizes the amount of shared structure in the variable flow graphs. BDDs were originally created for hardware logic analysis, to track the movement of signals through a huge circuit. So it’s elegant that they can also apply to software analysis, tracking the movement of values through a large code base.

The further elegance in Whaley and Lam’s technique is the query language laid on top of the BDD-based analysis. Binary decision diagrams can be used to encode relations; they can be joined and queried much like relational tables. And there is a query language, Datalog, that is well suited for expressing queries over a BDD-based relational model. Datalog is a relational query language, similar to a more abstract version of SQL, except it supports recursive queries (which SQL does not). Whaley and Lam are able to encode a Datalog-based query and relationship-generation engine on top of their BDD aliasing structures. Hence, BDDBDDB — Binary Decision Diagram-Based Deductive DataBase.

One of the more amazing results in their paper is their discussion of how their analyses improved once they had the Datalog layer. Originally they wrote a context-specific alias analysis on top of a BDD engine without any intervening abstract query language. This took them a few months, the code was thousands of lines long, the performance was not good, and there were a number of subtle bugs. Once they wrote the Datalog engine and reformulated their analysis as Datalog formulas, the total size of the analysis dropped to just around a dozen lines (!) of Datalog, and the performance was one to two orders of magnitude faster.

Once you have a query engine that can efficiently query an entire software system to track value and type flow throughout, in a reasonably accurate way, there are many systems you can build on top of it. For instance, you can write an Eclipse plugin to track the exact lines of code that expose a web application to SQL injection attacks. This can be done by notating particular methods as sources of input data from the Internet, and other methods as destinations (“sinks”) of query text for databases, and then evaluating whether any values flow from source methods to sink methods. Any paths through the code that allow unfiltered input to travel from the Internet to your database are security holes. The BDDBDDB analysis can tell you exactly what code paths allow that value flow. And it can do it purely as a static analysis!
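Stripped of the BDD machinery, the heart of that query is just reachability over the value-flow relation. Here’s a toy C# sketch with invented flow facts — the real analysis derives these facts from the program, with far more precision:

```csharp
// Stripped to its core, the source-to-sink query is reachability over the
// value-flow relation. The flow facts below are invented for illustration;
// the real system derives them from the program via its alias analysis.
using System;
using System.Collections.Generic;
using System.Linq;

class TaintFlowSketch
{
    static void Main()
    {
        // flow(a, b): a value can move from program point a to program point b.
        var flow = new (string Src, string Dst)[]
        {
            ("request.getParameter", "userName"),
            ("userName", "query"),
            ("query", "statement.executeQuery"),
        };
        var sources = new[] { "request.getParameter" };
        var sinks = new HashSet<string> { "statement.executeQuery" };

        // In Datalog terms:  reach(x) :- source(x).
        //                    reach(y) :- reach(x), flow(x, y).
        var reach = new HashSet<string>(sources);
        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var (src, dst) in flow)
                if (reach.Contains(src) && reach.Add(dst)) changed = true;
        }

        foreach (var hit in reach.Where(sinks.Contains))
            Console.WriteLine($"tainted value reaches sink: {hit}");
    }
}
```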

They use Java as the basis of their system, for unstated but obvious reasons. (Namely, strong static typing makes their value flow analysis much more efficient, since it can eliminate many value flows that are not dynamically permitted; and lack of pointer arithmetic means their analysis can be trusted, with no need to limit to the “safe” subset of the language without pointer math.) Many people have been wooed away from Java by the flexibility of dynamic language programming, where you needn’t encumber yourself with types; but those of us who remain in the static typing camp are pretty much hooked on the tools. And BDDBDDB is the harbinger of a whole new level of Java-based analysis tools that are going to take Java programming and safety checking to the next level.

One thing not mentioned in any of the BDDBDDB papers I’ve yet seen is reflection. I can’t see how reflection can be handled in the BDDBDDB system, since tracking reflective method invocations requires not just tracking object references through variable sites, but actually tracking method values (and even specific method name strings!) through reflective invocations. Certainly their system as described doesn’t address reflection in any way. This also seems to me to limit the applicability of BDDBDDB to languages like Ruby, which make heavy use of code generation (in Rails, for instance). Though I’m sure that some enterprising Rubyists will undertake to prove me wrong.

In general it’s very challenging to consider the interactions of static analysis with metaprogramming frameworks — runtime bytecode generation is becoming more common in Java systems, but BDDBDDB currently runs on a source analysis, without access to the runtime-modified bytecode. Metaprogramming can blur the line between compile-time and runtime, and BDDBDDB is solidly on the compile-time side… at least for the moment. I say this not to denigrate the importance of this technology, but rather to indicate an even further level to which it can go.

I look forward to what’s next for these analyses. Good times in the static typing world!

Written by robjellinghaus

2007/10/11 at 04:17