Language design: homoiconicity

Thoughts on homoiconicity, an interesting language quality seen in Lisps.

written 2020-Oct-16

TLDR: Homoiconicity simplifies what was already trivial, while leading to poor design choices.

Disclaimer

While this post is highly critical, it comes from a fan. I used Clojure for years, dabbled in other Lisps, wrote a few parsers and compilers. Even if this concept is not good language design, it's still pretty cool.

Definition

Homoiconicity is when the entirety of a language's syntax matches the literal syntax of some of its data structures.

This does not just mean that we can convert this text:

(10 "20")

Into some library-defined type:

ast.LinkedList{ast.Number{"10"}, ast.String{"20"}}

This means (10 "20") is the literal syntax for that AST type. In other words, the expression (10 "20") gives your program a copy of the AST node that the parser generated for this expression when parsing that program. Sometimes with caveats:

; The quote tells the compiler: this list is not a function call.
'(10 "20")

; The quote tells the compiler: this symbol should not be evaluated.
'ident

This quality can simplify the language and macros (not by much). It requires the language to be dynamically typed, or have a dynamically typed subset.

Defects

Requires silly data types to be built in.
Loses useful information.

Symbols

Our language probably has identifiers: names for variables, functions, operators, and so on. To distinguish them from strings, we must introduce a new data type: "symbol".

; This unquoted symbol is evaluated as a variable.
blah
; This quoted symbol is evaluated as data.
'blah
; Strings are considered distinct from symbols.
"blah"

Setting macros aside, from the perspective of data modeling, having symbols is bad. They're just strings by another name, but everyone has to choose between symbols and strings. Library APIs will make different choices and conventions. Using external data formats gets more difficult, because they usually support only strings (see JSON).

It gets crazier. Common Lisp and Clojure have keywords, which are symbols with minute differences and their own syntax. Everyone using those languages must spend time and effort choosing between strings, symbols, and keywords, dealing with idiosyncratic APIs, and dealing with conversions. I know I have.

Side note: some languages with symbol-like data types support interning, which allows to compare them as integers. In dynamic languages, this can be a minor performance hack. Can also be a memory leak. Static languages don't need it. It's not worth it.

Impossible Literals

We can probably agree that code auto-formatting is great. We can also probably agree that generating documentation from comments is simpler and more universal than special-case support for doc strings. But in any given homoiconic language, comments and whitespace are missing from the AST.

We probably don't want to define a different AST and write a different parser. Which means our "main" AST generated by the parser must preserve comments and whitespace. Since all the other AST types are built-in, this requires built-in types for comments and whitespace. Internally, they would just be strings. Just like with symbols, we've added more string-like types that should be limited to the AST, yet are built-in, easily available, and will be used where they shouldn't be. Or would be, unless...

Homoiconicity seems to require that every data type in the AST is instantiated using the exact same syntax from which it was parsed. So, how do I assign literal whitespace to a variable? How do I assign a comment?

(define whitespace

(define comment   ; This doesn't get evaluated!

Information Loss

Comments and whitespace isn't the only information lost. Some data types might have N inputs for 1 output. One example is numbers:

0b110011
0x33
51

All of these would be parsed into just 51, losing the information about the original formatting. Even if we had preserved comments and whitespace in the AST, we can't print the original code back!

Triviality

One decent upshot is that it simplifies macros. In Lisps, you can just quote a bit of code, return it from a macro, and it counts as valid AST:

(defun sum (vals) (reduce '+ vals))

(defmacro trivial () '(sum '(10 20 30)))

All that's needed of macros is to return AST nodes. Programmatically manipulating an AST doesn't require special syntactic support. Calling map or head/rest on an AST doesn't care about its text representation. AST types could be defined somewhere in the standard library. Macros would import that module to use its types and functions. Non-trivial macros are already inscrutable, so we're not losing much readability.

(import std:ast)

(defmacro trivial ()
  (ast:list
    (ast:sym "sum")
    (ast:quote (ast:list (ast:num "10") (ast:num "20") (ast:num "30")))))

But instead, the language could convert quoted code into types from the AST module. So we're back to:

(defmacro trivial () '(sum '(10 20 30)))

What got simplified wasn't your code. It was the implementation of macro support in the language. Meanwhile, you got saddled with unnecessary data types and an inferior AST!