In
Part 1 of this two part series, I made an argument that the standard way of structuring program authoring tools involves a peculiar and unnecessary model/view coupling that makes a number of problems in the programming tools domain more difficult than they need to be. Here I'll be describing a simple, general way of constructing generic 'program models,' which have much in common with ASTs, but improve on them in a couple of critical ways. I'll be referring to this method as the 'path formulation' of program models.
Why this structure?
Short answer: because it's simple and general, and bears a very close relation to the essential activity in high-level program construction.
The 'path formulation,' a particular way of creating program models, comes from asking the question,
"what are we really doing when writing source code?" and finding the answer,
"selecting and configuring abstractions provided by a programming language." In that case, writing source code is just one way of doing a more general and essential activity: selecting and configuring abstract 'language constructs.' Here's a
short dialogue illustrating the point a bit more.
So, let's say your programming language provides an abstract language construct called 'function_declaration'; it is comprised of a few parts: a 'name,' a 'return_type,' and an 'argument_list.' A configuration for this construct would be a particular 'name,' 'return_type,' and 'argument_list.' What makes it 'abstract' is that none of these entities are tied to a representation. In other words, while it has become reflex to think of these sorts of constructs in terms of character sequences, here we intentionally leave open the question of how they should look.
Making Program Models using the Path Formulation
The idea behind the path formulation is: if you have a model of a programming language in the form of a graph, then individual program models are just paths through the graph.
Here's a high-level, three step recipe for making these language/program models:
(1) Represent your programming language 'abstractly,' using a formal grammar that has no lexical section (now called an 'abstract grammar'): its fundamental units are the fundamental units from your language in the abstract, rather than character sequences. Only consider which abstractions your language should include and the rules for composing them; worry about how to represent them visually elsewhere. For example, we can say that a 'class' is made up of a 'name,' a set of 'variables,' and a set of 'methods,' without any assumptions about how these things are going to look. (I talk about this in the
first part, too.)
(2) Convert your abstract grammar into a graph, which will serve as the 'model' for your language. (I have some Java code that will do this for ANTLR grammars, btw, which I can clean up and share on Github if there's interest.)
(3) Represent individual programs as specific paths through the language graph.
So, if your grammar contains a subsection like this ('class_reference' is for inheritance, indicating a parent class):
program
: class+
interface*
;
class
: name
class_reference?
function_declaration*
;
function_declaration
: etc. etc. etc.
Your graph-based language model will have a subsection like this (black nodes are 'language constructs,' orange nodes are 'language atoms'):
And a particular program in your language, consisting of one simple class might look like this:
These paths could be represented by just listing the edges taken—though of course you have to number the edges:
Taking that approach, the model of our simple program looks like this:
(2 2 0 1 3 0 0 3)
If we would like a little structure in our model representation, we can distinguish between 'language constructs' and 'language atoms' (black nodes and orange nodes), by using opening and closing parentheses to mark the start and end of language constructs.
(2 (2 0 1 (0... 1... 2... 3) 0 0 3))
Note: the ellipses are where we followed some hypothetical edges in 'function declaration.'
Language constructs are just composite abstractions, made up of more than one part; language atoms cannot be broken into smaller units. Actually, 'language constructs' and 'atoms' have a relationship that mirrors S-expressions in Lisp, so it's no surprise that the notation for program models resembles Lisp (here, however, we aren't tied to using this as the visual interface for the programmer).
Using Insight from the Language Model
A big advantage of the path formulation of program models is that it maintains a strong connection between elements of a user's program, and the programming language itself (since programs are paths in the language graph). This connection can be taken advantage of by programming tools to guide programmers in using the language.
As an example, let's say a programmer has just selected a 'class' construct (maybe by typing out the keyword 'class'—or in some new way); the program model now contains a node for that 'class' construct, and the editor, just by examining the 'class' node in the language model, knows all the legal options for proceeding, because each option corresponds to an edge going out of the 'class' node:
To make this as concrete as possible, I'll show one possible UI that takes advantage of this connection to the language graph. Keep in mind, though: you could still render it like a traditional text editor, using the UI contemporary IDEs use for 'auto complete' to display alternatives. It's like automatic auto complete for all aspects of the language—at the least, this would be tremendously useful to new users of a language.
In this hypothetical editor, the UI is split into two main sections: the top is our document, which is just a rendering of the program model; and the bottom contains controls for selecting and configuring language constructs.
Let's say we've just instantiated a class definition by supplying all the necessary parameters; it has the name 'InputHandler,' and our editor has 'collapsed' it, so we just see the name and type:
Since our editor is following along in the language graph, it knows we're back in the 'program' node, from which point we can begin specifying either a class or an interface. Let's say we select 'class.' Our editor now looks like this:
Notice the whole 'class' construct and the 'name' section have red borders; this is to indicate that 'class' hasn't been fully instantiated: it still has 'free' parameters that must be bound to something (in this case 'name' must be bound to something). Also notice that in the bottom section, the options appearing in the grid are just the neighbors of the 'class' node in the language graph.
I imagine that in using a system like this, the cells in the bottom area would map to keys on your keyboard: this way you could accomplish a task like creating the skeleton of a new method declaration with a single keystroke. Something along these lines would also be much better than text editors for programming with virtual/augmented reality systems and mobile devices. Anyway, this UI is a just a quick sketch of one possible approach. The document region could also be something like
this (in that video, I'm rendering the AST in manner identical to how I'm suggesting we render program models).
Conclusion
It's been a long time since we laid down the character sequence and parsing-based architecture of program authoring tools, and contemporary work on programming languages is deeply invested in that established approach. It seems like the program model approach could be an improvement—but who knows what lethal oversights might still be lurking. What's especially needed at this point is a concrete implementation. I'm working on it, slowly, in my free time—but my hope is that others will read the ideas here, and if the they prove to be generally interesting after all, expand and solidify them into serious tools that will improve the experience of programming for the upcoming years. If you'd like to hire me to work on something related, I can be contacted at 'westoncb[at google's mail service]' (or even if you just want to talk about it—though the comment section is probably best for that).
----------------------------------------
Appendix:
Identifiers etc.
Let's take a look at the abstract program graph one last time:
Notice that 'class reference' and 'name' are language atoms, but these things need structure of their own, so where does that come from? First I'll point out the reason we don't include identifiers in the language definition is that they are mnemonics for humans, not part of the abstract structure of a language; all the language needs is a unique identifier, so we just generate a random one. As for the mnemonic, maybe it should be a string, maybe something else—accordingly we leave it outside of the language spec. In the program model, we just attach the random ID (discussed
earlier, too) which can be associated to some representation specified by the programmer (probably totally without their notice, by just typing in their editor as always). A program model with IDs might look like:
(2 (3 (#49843345) 0 0 (#95728745) 0 1 (... ... ... 3) 4))
There's an external file that maps these IDs to representations (often it's just a string) for editors to use.
The following section discusses how to handle the 'class reference' node and others like it.
Referencing the Program Model
There is an aspect of programming languages that isn't captured by the 'abstract grammar' that I've described so far. The abstract grammar only allows us to describe 'free' language constructs which, when supplied with specific parameters, are 'instantiated'; program models contain only instantiated language constructs. However, the 'abstract grammar' should describe the full capabilities of the language, and programming languages always contain mechanisms for referencing already instantiated language constructs: e.g., I have instantiated a 'function_declaration,' which had its 'name' parameter bound to the value 'testFunction'; other parts of my program should be able to reference this specific, instantiated 'function_declaration' by using it's 'name,' 'testFunction,' as a reference.
To be honest, I'm very curious to hear other people's ideas on how to go about doing this, though I do have an approach that seems like it would work well: extend the notation of our 'abstract grammar' (which is just some variation of BNF at the moment) to express 'queries' on program models: i.e. "select all the nodes from the program model of type 'class'." More concretely, let's say our 'class' construct is defined as follows:
class
: name
class_reference?
function_declaration*
visibility?
;
(The 'class_reference?' component is used to reference a parent class.)
'class_reference' would be defined in our grammar as follows (except using some appropriate notation, not English):
class_reference
: "select nodes on program model of type 'class' in same 'package'"
;
So, in order to instantiate a 'class_reference' the programmer would have to select a node from the program model that meets the criteria in the query. Ideally, the programmer's IDE would parse the query, run it on the program model, and offer up a selection of valid nodes. Present IDEs do this sort of thing of course, but including the necessary information in a unified, abstract language specification would be beneficial.