Update

For my two blog followers (welcome to the new guy who was foolish enough to sign up), I thought I should write a bit of an update. I have had too many distractions to be able to get back on the home stretch as soon as I wished, but am champing at the bit to get to work again. The time off has been useful in letting me talk some sense into myself about following the simplest implementation route instead of trying to get clever--I am biting off enough with this project as it is. I am also getting a better picture of how I might use LLVM for more practical projects, but don't want to start anything new until I bring this one across the finish line.

Dustin

Environment and scoping

As mentioned, we've reached another point at which I haven't yet reduced the interpreter to "a simple matter of coding," and I'm going to discuss how to get the environment code ready for user-defined symbols. I don't yet know "the right answer" in this area, as will be clear, so this is essentially the student thinking through the assignment. And of course I'm focused on simple implementation, not theoretically perfect semantics, but we would like the result not to be intolerably icky.

Note that throughout I assume that Nil is a lisp-1 with a single namespace, like Scheme, and not a lisp-2 with separate function and variable namespaces like, I gather, Common Lisp. (Though as Nil is case-sensitive one can easily get the effect of two namespace just by following my C/C++ rule that functions begin with a capital and variables begin with a minuscule--I haven't done this for the system functions yet, but am tempted.) Single-namespace is, at the very least, much cleaner (for example, we should only need one 'define' or 'def' function as in Scheme, not separate 'setq' and 'defun' style as I see in Common Lisp code (worded slightly evasively because it's quite likely I don't really understand what CL is doing). Also note that I'm assuming dynamic scoping for now, as it should be simpler for an interpreter. Dynamic scoping works just fine for toy languages and more, even though there is no question that lexical scoping is superior from an engineering point of view.

The simplest thing that could possibly work is to simply have a global environment variable. The downside to this is that all scoping would have to be done manually, by pushing and popping contexts onto the head of a list (the environment is already a list of contexts). While we could have a language with every variable being otherwise global (that is, nothing is ever popped off and symbols are simply re-defined rather than shadowed), I think it would be intolerable for function parameters to interact that way (I think this would take us back to the state of the art in maybe the '40s or '50s) even in a toy language. At the least Apply should push a context which maps the formal parameters to the actual parameters, call the function, and then pop the context again. This really isn't too bad, as Apply is a special magic routine inside the interpreter and basically would do this in any scenario.

More interesting is what happens with user definitions. Suppose Apply works as described above. Interestingly, if def simply pushes a symbol onto the current context, we get local variables for free. A little too local, in fact--the obvious code would have def create a new symbol in its context, which would get popped when it returns (because of course calling def involves a recursive call to Apply). def would effectively be a no-op as far as external code is concerned! So def, at least, needs access to the enclosing context to do anything. But that isn't really a problem--def can simply walk back one context on the list and push it's definition there. Unless I've missed something, that appears to be a workable system.

Can we make it better? One thing that won't work is to instead pass the environment as a parameter on the system stack (which is actually what I'm doing now). That would pop the context automatically on function exit, but that's the problem--def can't walk back on the stack because it can't alter the pointer in its caller. However, what might work is passing in a pointer to the list pointer, so that def can modify its caller's context but they still eventually get popped automatically. There are likely pitfalls there I have not considered, so the matter bears more thought. For example, if I were to actually implement garbage collection (right now Nil leaks just as fast as it can allocate cons cells), how would having context pointers on the C stack work?

As noted, keeping everything in one global environment seems to be the simplest, and I'm not sure why I'm reluctant to just settle on that.

Dustin

Not an Implementation of Lisp....

A random comment on a little thing that would probably matter to programmers a great deal. Nil follows the C family convention that symbols are fully case-sensitive (and I'm sure other families, but there is no doubt where I got it). This, I suppose, would bother lispers, but in fact I made it case-sensitive automatically a long time before it occurred to me that I was definitely not following lisp tradition. The fact that strings are ultimately uniqued with strcmp() helps, but I'd have done it anyway.

Which brings up the interesting issue of just how deeply attached programmers get to the visual conventions they are used to, and how annoying it is when they are not followed.

On another unremarked subject, I think the README, my blog post, and even one of the commit messages does not note the significant (from a practical standpoint) development that Nil's parser now recognizes single-quote as "quote the next expression" in the lisp manner. Non-lispers like me underestimate just how important this is for lisp, at least until we try testing our own code with the long form ( (quote x) vice 'x) only. It's so annoying that I went off and hacked in support just to keep my head from exploding. Which also sort of leads into a very small peeve of mine--that lisp parsing is more complex than lispers tend to admit when extolling the regularity of s-expressions. Not much, granted--adding support amounted to making the lexer and parser recognize ' as a magic parsing character akin to ( and ) and then having the parser drop single quotes, make a recursive call to get the next expression, and then cons'ing it into a list with 'quote'. But in fact the absolutely regular syntax is just intolerable.

In the interest of full disclosure I also admit it is a lot more intolerable in an interpreter like Nil's that doesn't give you full command-line editing. :-)

Dustin

Nil 0.0.8 released, the world breathes a sigh of relief

I just pushed v0.0.8 to github. From the README:


In Nil 0.0.8 we have a minimal list-oriented language and even a small luxury or two. In particular we have the seven primitive operators from Graham's (and therefore McCarthy's) paper: quote, atom, eq, car, cdr, cons, and cond, as well as the predefined values t and () needed by atom and cond. We also have commands to quit, turn on and off a mode where undefined symbols are self-quoting, display known symbols and strings, and even a minimal help facility. Of course there is a great deal of work under the hood to support that functionality.

The most important features still missing are user-defined symbols and
especially functions.

This is obviously a big milestone release--Nil can now justly be termed a language, and I debated bumping the version number to 0.1.0. The above hints at what will probably be the direction of development--ordinary variable definitions, which will require tweaking how the environment is stored and passed around, and then lambda (named functions should happen for free if we have both of those, as one should be able to simply define a symbol to have a value that happens to be a lambda expression). The larger conceptual hurdle is working out the details of lambda evaluation, probably the last major one of the project. A minor one is proper environment handling (what should happen if a symbol is defined deep in a nested expression?--this affects the implementation quite a bit). So that means that I am back to the point where I have as much thinking to do as I do hacking.

Having lambda expressions and named expressions is the earliest point at which, at least in my own mind, I've done what I set out to do and could declare victory, and the point at which the version number will move up to 0.1.0 (signifying, to me, that it is a usable, if minimal language and that it meets the project goals). However, it is likely that I should, at a minimum, provide macros as well. The primary goal of the project was to understand the lisp computational model, and one of the attractions of that model is the ability to define procedures that extend the syntax of the language. I think this is not a difficult extension beyond exposing the evaluator to user code (i.e. providing an eval primitive), but of course I may be wrong and the point of implementing it is to find out.

Another thing missing is letting LLVM do link-time optimization. I attempted to tweak my makefile to use llvm-ld, but it broke things for reasons I don't understand. It would be nice to figure out what I did wrong, since learning LLVM was also a goal.

Dustin

Small update

Since the blog has been quiet for more than a week I wanted to post a small update.

We had a houseguest from out of town and other things that made me lay Nil aside for a bit, which is always bad because I'm very much the sort that requires momentum. I don't have it back, but today I extended the binding lookup so that Nil can have nested contexts. This is actually important for finishing apply(), because the way I expect to implement user-defined functions is for apply() to use the function arguments to construct a new context, push it onto the environment stack, and then call eval on the operator expression.

From earlier I have a partial implementation of an integer type which I don't think I mentioned. I did that because my current simplistic exception implementation is basically an integer type and I wanted to make sure I preserved the ideas in case I make it more sophisticated. But finishing it requires some things that are a distraction now, so it is #ifdef'd out in my development branch and unless I miss something won't be in the release branch until it works.

One problem with the exception code is I can see a need for a more sophisticated implementation (actually it's probably grotesque to call something implemented so simply exceptions, but they function as exceptions at the language level even though in actual implementation they are just a special kind of return value), but am not sure just what would be needed to do a satisfying job. So for now it's just going to stay as it is, even with some output code crudding up my evaluator instead of just being an implicit catch in the interaction loop.

Dustin

Nil v0.0.7: Evaluation

I just pushed Nil v0.0.7 to github:

From the commit message:


This milestone release introduces several major additions:

* First-class primitive functions.
* First-class exception objects.
* Symbol bindings.
* Evaluation.

These supporting additions are also worth noting:

* The associative binding table implementation can be re-used at the
   language level.
* The binding code is ready for adding nested contexts.
* Improved evaluator internals.
* Some factoring of the messy interaction loop.

From the README:


In Nil 0.0.7 we have the beginning of actual evaluation. The evaluator
knows how to evaluate () to itself (it doesn't yet know that it is also
named nil, however) and 'quit' to a primitive function object. It also
knows how to evaluate primitives, which means (quit) quits the program.
It reports errors on symbols it can't evaluate, and also knows that ()
can't be applied to anything even though it can evaluate it.

This simple functionality hides a great deal of internal work, of course. The big additions are an exception type that allows reporting of and recovery from errors, and of course the primitive function type necessary to implement (quit). They are implemented as full types rather than as hard-coded hacks in the evaluator so that they are first-class objects from the beginning.

All of that means that this is a big-deal release that gets us on the home stretch, much more than is apparent in the trivial command-line interaction. I had to have 90% of the functionality there just to get the first primitive (quit) working. I don't think there is any obstacle to adding primitives until Nil is Turing-complete and declaring victory. However, to be well-done I should address some shortcomings that were revealed in getting this far.

The interpreter's reporting of exceptions is add-hoc, and should be moved up to the interaction loop's top level. To retain even moderately friendly error reporting, however, would require promoting the exception implementation from simple numeric codes to more involved objects that can remember, essentially, who threw the exception. On the other hand, the current exception code is essentially an implementation of an integer type and so we can get that most important type basically for free.

The exception objects should also be back-ported to the lexer so there is a uniform (and better) error-handling system.

It should also be noted that I'm fairly likely to find at least one more place where I didn't know enough to provide the needed capabilities and have to do some re-writing, just as I really should with the exceptions and their handling.

A final note about the semantics of Nil. The direction I'm going may make primitives and the interpreter code even less "special" than in regular lisp (granted, I'm not sure precisely what is special magic in standard lisp so I may be wrong). For example, the natural implementation would allow all primitives to be overridden by user definitions, even core stuff like cond, quote, and so on. With trivial extra work, this could even be true of nil. In a sense the parentheses are the only things that really need special token magic outside the symbol table (and even there I'm not sure I couldn't change that, if I really wanted to, by propagating the symbol table back into the lexer).

That would allow some pretty hair-raising capabilities, which is why I thought it was interesting enough to mention. It may or may not be normal in lisp, but it is in Forth and would more or less have the same semantics as in Forth and for the same reason (pre-existing definitions continue to use the old bindings, not the new, because they continue to maintain pointers to the old values). But whether that's an unusually odd and problematic language property or not, in general I am quite pleased that the primitives and exceptions just naturally became first-class objects that could be exposed and manipulated at the language level (albeit as magic atomic objects only--allowing a peek inside their definition would be a far deeper layer of magic, though I suppose I can dimly see how LLVM's JIT capabilities could be coerced into doing something insane like that).

Dustin

CFG style

As I was writing the associative lookup code for Nil's binding table it occurred to me I was writing a long chain of tests, yet the function was quite manageable. Why? Precisely because it was a chain of tests for exceptional cases leading to the normal case in the last block. What makes LLVM IR different than any other language I know of is having to work directly in terms of basic blocks, and what makes a function's basic block list hard to read is a complex control flow graph. So you can freely break the rules I gave, I think, when the graph is simple (the case above was linear with exception handling blocks sprinkled in) and the code is doing a single well-defined job (so it makes no sense to try to factor it). I think the true underlying principle is trying to keep the CFG simple and preferentially stereotypical in structure, and the rest of what I said is valid in so far as it serves that larger purpose.


Dustin

Some reasons to code in IR

Here is a snippet I posted on LLVM-dev which I think is worth putting here, since I'm commenting on what its like coding in IR. If it isn't convincing as motivation to code in IR, at least it qualifies as "lessons learned from coding in IR."


One advantage this backwards approach has is exposing more of the real
machine nature than even C. I'd like to think that makes one a better
compiler user in the end. It's nice to know what all those nice
high-level semantics are really costing you. I think part of my
motivation, besides just doing the unexpected, is that long ago someone
told me they took a class in "assembly and lisp"; basically, they taught
programming by teaching you how to implement a higher-level language.
Being young and stupid, I didn't see the point, but eventually I figured
it out. It's never too late to re-do your childhood right, is it?

I also think that the effort to write good code at such a low level is
very good discipline. At least, I find it so, because the consequences
of good and bad design become magnified. The absence of scope nesting
and the difficulty of doing many simple operations really makes
factoring out a vocabulary of small toolkit functions useful, for
example, and that's not a bad discipline to reinforce. I just created a
couple of functions whose body is a single shift simply because it
enforced some abstraction and the names are documentation. I can always
move the body into the header and let LLVM inline them if I want to
optimize away the function call overhead (not that there is any great
need to do that in a learning tool). (It's easy to tell which parts of
the code I care about. The expression representation is pretty cleanly
divided into a toolkit. The user interaction loop is a big fat function
I didn't take the time to decompose.)

Dustin

More IR style

I had a bit of a setback on the evaluator. I have to handle exceptions to write any kind of non-trivial evaluation, and I'd been planning to use invoke/unwind to handle exceptions in Eval(). It turns out those are not fully supported in the LLVM code generator! Which explains the seg faults when I tried it in any case but unwinding a single level in a call within the same translation unit....

What that means is that I have to bloat and uglify the evaluator with code to unwind the stack manually and pass exceptions back up to top level, and that will take a bit of boring, grungy coding. On the other hand, the system I'm writing has an explicit exception type for expressions, which means that I could clean up the parser and lexer by using the same system there. Faster too, not that parsing speed is likely to be a problem for anyone, but right now I'm passing around two-element structs and this system will make the tokens word-sized. In fact, carried to its logical conclusion, there would be a pleasing consistency because even characters would be represented in language-level structures. Essentially, it means Nil would get a native character type for free.

The other consequence is that it will add yet more things to test for in the main interaction loop, which is already too large and bloated (because it doesn't really interest me, for one thing). So I think it's going to break under the weight and I'll have to factor it to stay sane.

Which brings me to the point of the post. How far should handcoded IR functions be factored? I suspect a good rule is to count the number of tests, because every test complexifies the control flow graph. Ideally one would have a single test per function, I think! At least, every additional test should make one think harder about whether there is a better way to do something. I certainly don't achieve this, and I admit I don't always try, but I think for long-term maintainable code its a good idea.

At the other end of the scale I just created a couple of functions that do nothing but a single bitwise shift, simply because the abstraction and the self-documenting function names made it worthwhile. And I have a at least one other function that does nothing but call one of them--so I am willing to pay *two* unnecessary function calls to do one bitwise operation! In a sense it is no real penalty at all, though, since it could be optimized away completely simply by moving the definition to the header and letting the optimizer inline the functions.

Of course the actual CFG created matters too. A test that simply executes or jumps over a single block, the equivalent of "if-then" adds less complexity than one that branches two ways "for real." The more stereotyped the usage, the less mental overhead, which is the underlying idea of structured programming that *is* valid in IR (even if the specific prescriptions seem to add too much complexity to a language with only basic blocks). Also, a switch probably counts the same as a branch. Besides the fact that you can't have half a switch in a statement, I find that the uniformity of the multiple switch cases makes it much easier on the brain than even two chained two-way branches.

So there is more wisdom on a kind of programming I think no one else in the world actually wants to do. :-p

Dustin

Nil 0.0.6 released--skeleton evaluator

The evaluator is going more smoothly than the parser did, as evidence by me having some clean code to push even though I had less time to work on it.


From the v0.0.6 README:



In Nil 0.0.6 a skeleton evaluator recurses over the parsed expressions
and inserts 'apply' into every combination where apply would act. The
list of known strings is printed before each prompt (this is mostly for
debugging, but also shows that it is doing something besides printing
what you typed).


What that means is that Nil does just about everything it can do without actually evaluating anything. The evaluator structure is there and recurses over the trees correctly without doing anything except indicating where function application would happen (by inserting "apply" as the car of each combination). I have more goodies actually half-done, such as an implementation of primitive objects so that even the primitives implemented in LLVM can be first-class (as of course they should be). They just aren't ready to be plugged into the evaluator yet.


Part of the reason the evaluator is going faster even though I know the least about it (the last parser release brought us as far as my earlier C project had gone) is because the earlier hard work trying to write good code in something never meant for it is paying off. So overall I'm pretty happy with the code right now.


Dustin