PL/LOLCODE documentation, such as it is
$Id: DOC,v 1.26 2008/05/02 03:32:29 eggyknap Exp $

JUSTIFICATION (or "WT*?")
=========================

My intent in writing PL/LOLCODE is as follows:

- Learn how to write a PL in an enjoyable way
- By so doing, familiarize myself somewhat with PostgreSQL at a code level
- Produce documentation or learning materials suitable for an introduction into
  how to write a PL
- Dramatize the modularity and extensibility of PostgreSQL ("If you can make it
  do something as silly as this, you can make it do anything")
- Through what is clearly at some level publicity stunt, achieve world
  domination

Conspicuously absent from the above list is any mention of a language of any
real utility at an application level. Although it is certainly possible (or
will be when it's finished) to use PL/LOLCODE to write useful, meaningful
procedures for real applications, and some might even argue that the language
and implementation are well suited to some tasks, I don't intent PL/LOLCODE as
a production-use application, for at least two reasons: 1) it offers no great
advantage over existing PLs, and 2) the "joke" nature of the LOLCODE language
itself will in my view prevent the widespread adoption of PL/LOLCODE for
production use. This doesn't particularly bother me, because given the
simplicity of the language and the complexity of implementing a PL (and despite
PostgreSQL's remarkable modularity, there is still plenty of complexity
involved), LOLCODE is a language well-suited to the purposes I've given above,
and PL/LOLCODE will have some use even if no one ever writes a serious procedure
in it.

DOCUMENTATION
=============

This section is intended as a sort of running journal, detailing the things
I've learned while working on this project as I learn them. Unfortunately I've
been working on PL/LOLCODE for quite some time before starting this
documentation, so this initial portion will have to come from memory, with the
code as a useful reference from time to time. 

THE BEGINNING

A PostgreSQL procedural language consists of a language handler procedure, and,
optionally, a validator procedure (which I'll worry about later). The handler is
responsible for taking a user-defined procedure and making some sense of it in
order to perform some function. The PostgreSQL documentation claims this
handler procedure needs to be written in a compiled language, but a very brief
test suggests this is incorrect. In any case, PL/LOLCODE's language handler
procedure is called pl_lolcode_call_handler() and is written in C. Whereas many
procedural languages use external interpreters (pl/perl, for instance, uses the
system's perl interpreter as a shared object rather than using its own
PostgreSQL-specific Perl interpreter), there are very few interpreters for
LOLCODE. Although there are bits of code that implement LOLCODE in part or
sometimes completely, the language is simple enough, the existing interpreters
scarce enough, and their licenses indeterminate enough that I decided to write
my own interpreter. This, after all, is a learning project, so why not learn to
write an interpreter as well. Note that having an interpreter included in
itself means PL/LOLCODE might be able to integrate closely with PostgreSQL in
ways something like Pl/Perl or other languages with external interpreters
can't. But I might just be making all that up. 

I expect when completed PL/LOLCODE will consist of the procedure handler and
the interpreter, which in turn will consist of the lexer, the parser, and the
executor, all of which should eventually be described below.

THE INTERPRETER

Few people would relish the task of writing an interpreter from scratch, and
there exist several useful tools to make the job easier. I've chosen to use lex
and yacc, which are pretty much a standard in the field. In fact, they're so
standard that although they've been largely replaced by flex and bison, it's
not uncommon to hear them referred to under their old names. Besides,
PostgreSQL uses the lex/yacc family of tools for its own parsers for SQL and
pl/pgsql.

The interpreter consists of three parts, mentioned above: the lexer, parser,
and executor. The lexer runs first, and splits the procedure's source code up
into tokens based on regular expressions supplied by the programmer. The parser
finds patters in the resultant sequence of tokens based on a grammar defined in
something like Bakus-Naur form, and executes code based on the result. In the
case of PL/LOLCODE, the code written in the parser should generate an in-memory
representation of the PL/LOLCODE procedure suitable for passing to the executor,
whose job is to actually run the thing. More on each of these parts later.

THE CALL HANDLER

The call handler's declaration looks like this:

Datum pl_lolcode_call_handler(PG_FUNCTION_ARGS);

PostgreSQL provides this language handler procedure with one argument to allow
it to get its job done, a pointer to a FunctionCallInfoData structure called
fcinfo (PG_FUNCTION_ARGS is a macro that expands to FunctionCallInfo *fcinfo),
and expects it to return a Datum. The Datum is PostgreSQL's internal
representation of any data object, and various functions exist to convert a
Datum to any of various types and back, find out what's inside one, create
Datums of your own, etc. More on Datums later. The FunctionCallInfoData
structure received as an argument provides accses to the OID of the procedure
being run, the arguments to that procedure and their types, the procedure's own
return type, whether this is a trigger or not, the source code of the procedure
itself, etc.  For PL/LOLCODE, I've lifted most of this code directly from
pl/perl, at least to get myself started.

Although I plan later to rearrange the code in the call handler itself to make
it flow more logically from one step to the next, right now everything is
pretty jumbled up, representing at least in part the jumbled nature of my
investigation into how to get something like this to work. The first step in
the process is to figure out a few important things about the procedure, like
the arguments, return type, and source code. This can all be obtained from the
FunctionCallInfoData structure received, with a little bit of work.

The fcinfo structure contains the OID of the procedure the user has called, at
fcinfo->flinfo->fn_oid. We use this OID to get to everything we need to know
about the procedure. First we translate the OID into a HeapTuple, which is a
structure that describes an arbitrary tuple. That HeapTuple can be converted
into a Form_pg_proc structure designed to describe procedures. The code is as
follows:

HeapTuple procTup;
Form_pg_proc procStruct;

procTup = SearchSysCache(PROCOID, ObjectIdGetDatum(fcinfo->flinfo->fn_oid),
			0, 0, 0);
if (!HeapTupleIsValid(procTup))
	elog(ERROR, "Cache lookup failed for procedure %u",
		fcinfo->flinfo->fn_oid);
procStruct = (Form_pg_proc) GETSTRUCT(procTup);

This Form_pg_proc structure is what we're really after to get the procedure
source and the return type. First (although this isn't necessary first in the
order of execution, is the function source code, which we can get as follows:

procsrcdatum = SysCacheGetAttr(PROCOID, procTup, Anum_pg_proc_prosrc,
	&isnull);
if (isnull)
	elog(ERROR, "Function source is null");
proc_source = DatumGetCString(DirectFunctionCall1(textout, procsrcdatum));

This code is pretty much copied from Pl/Perl, and I can't say I particularly
understand it in any great detail, except that DatumGetCString is one of a
family of functions that converts Datums to C types.

The next interesting procedure attribute we need to handle is the return type,
which is supplied as an OID in procStruct->prorettype.  The code to get the
specifics of the return type bears striking resemblance to that used to get the
procedure details:

Form_pg_type typeStruct; HeapTuple typeTup;

typeTup = SearchSysCache(TYPEOID, ObjectIdGetDatum(procStruct->prorettype),
	0, 0, 0);
if (!HeapTupleIsValid(typeTup))
	elog(ERROR, "Cache lookup failed for type %u",
		procStruct->prorettype);
typeStruct = (Form_pg_type) GETSTRUCT(typeTup);

Form_pg_type is a structure much like Form_pg_proc, but it describes a type.
Here's where we find out what data type we're supposed to return, and
importantly, what function we use to convert whatever we've got into that type.
PostgreSQL types have input and output functions, and converting data to and
from a Datum holding a particular type is as simple as passing the value you're
interested in to the corresponding input or output function. 

The call handler not only needs to deal with the procedure's return type; it
needs to treat the procedure's input arguments appropriately as well. At the time of this writing, PL/LOLCODE stores its variables in a fixed-length array
of structures, and maps each input argument into one of these structures.
LOLCODE supports four types (YARN, or string; NUMBR, or integer; NUMBAR, or
float; and TROOF, or boolean). The call handler needs to examine each input
argument to discover its type and convert it to the appropriate PL/LOLCODE
type. Since PL/LOLCODE doesn't currently support complex types or several other
classes of type PostgreSQL might try to feed it, it has to filter through the
possible types and complain if it hits something it doesn't know how to deal
with. In order to do this, the we need to examine the procedure's arguments in
turn.

The procStruct->pronargs will tell us how many arguments this procedure takes,
and we use it to loop through each one in turn. For each we need to get the
type, and make sure it's a type we can handle. Each PostgreSQL type has its own
value called typtype, which categorizes the variable into one of four classes,
described in detail in catalog/pg_type.h. Right now PL/LOLCODE only supports
"basic" types, designated by a 'b' in the typeStruct->typtype field. We get the
OID for the input argument type from the procStruct->prorgtypes.values[] array,
which contains an ordered list of type OIDs corresponding to each input
argument. Using that OID, we can get a Form_pg_type structure for the type in
question, and examine its typtype field. Provided the type is something
PL/LOLCODE supports, we then have to determine whether the type is a boolean,
because boolean (or TROOF) values need to be handled specially, described
later. We simply compare the OID we got from proargtypes.values[] with BOOLOID,
also defined in catalog/pg_type.h. If that comparison returns true, the value
is a boolean type.

Finally, the value might be null. Since there's no way to represent a NULL
non-pointer value in C, PostgreSQL uses another element in the
FunctionCallInfoData structure, the argnull[] boolean array. argnull[i] will be
true if input argument i is null. If after all that the value is of a type
PL/LOLCODE supports and the value isn't NULL, the call handler converts the
input argument Datum into a useful value by passing the actual Datum, stored in
the FunctionCallInfoData's arg[] array, into the output function for the input
argument's type. This returns a character string, and since PL/LOLCODE
automatically converts variables from one type to another as required, it
simply stores this value as a string, knowing that if it's not supposed to be a
string, it will get converted when the time comes. 

Calling type input and output functions

This deserves its only little section, since it's an important topic. Since I
don't yet know my way around it much, I'll leave it pretty empty for now. I've
shamelessly copied everything PL/LOLCODE does in this regard from pl/perl. To
call a type's input or output function, PL/LOLCODE first gets an OID for that
function from the typeStruct's typinput or typoutput field. With that OID, it
can then obtain a FmgrInfo structure, which is some magic device it needs to
identify the function it wants PostgreSQL to use on the data it's converting
from or to a given type. Given that FmgrInfo structure, it can then pass the
data in question to the type's input or output function usiing
InputFunctionCall() or OutputFunctionCall(), which themselves are mostly magic.
The result is either a Datum, in the case of InputFunctionCall(), or a
character string, in the case of OutputFunctionCall().

Note that the function info structures used in this process are stored in the
current transaction's memory context. That means that each time I run a
PL/LOLCODE function, I have to go through all the effort to find the function's
input and return types and their corresponding input and output functions.
Ideally, one day, PL/LOLCODE will "compile" functions into some other
long-lived memory context so that it doesn't have to do all this all the time.
That will probably happen around the same time the parse tree moves into a
different memory context.

THE PARSER

The PL/LOLCODE parser should be fairly simple to understand for someone
familiar with lex and yacc, and such people can probably skip this section
without missing much, down to where the description of the parse tree begins.

The parser's job is to take the function's source code and turn it into
something PL/LOLCODE can actually execute. It could, alternatively, simply
execute the code directly, but I've chosen to make the parser generate a parse
tree, which is then executed by the executor. There are several reasons for
this, primarily that it's just easier that way, but also because the parse tree
can be saved for later invocations of the function, speeding up execution of
the function.

The lexer 

The first part of the parser is the lexer, and it's the lexer's job to split
the function source up into tokens categorized meaningfully. The programmer
creates a lex file, and passes it to the lex executable (or flex, in our case,
which is an updated version of the same thing), and (f)lex generates C code
based on the input file. We then build that C code into our program and use it
to parse LOLCODE source. That input file often has a .l extension, and in
PL/LOLCODE's case, it's called scan.l.

As has been said above, lex's job is to split text into tokens. Typically a
token is a single word, but things like quoted strings, comments, and multiple
word commands (such as FOUND YR, the command to return a value) are lexed into
a single token. For most things, it's a simple as that -- just create a regular
expression for every token that will show up in the language you're parsing
(that is, every LOLCODE command, all identifiers, etc.) and come up with
something for the lexer to do each time it hits something that matches that
expression. In most cases this will just return some value, which the parser
will use to figure out what the source really means. Some things are more
difficult, though. For instance, within double quotes, we have a string value,
and we have to treat text differently if it's a string. If an LOLCODE command
shows up within double quotes, we don't want the parser to interpret that as a
command, but rather as a string constant. For this purpose, lex supports
different states, which are defined like this in scan.l:

%x LOL_STRING
%x LOL_LINECOMMENT
%x LOL_BLOCKCMT

This defines three states we might put lex into, aside from the default state.
Later on in scan.l, there is code to put lex into the LOL_STRING state when it
sees double quotes:

\"	{ BEGIN LOL_STRING; memset(stringbuf, 0, 255); stringptr = stringbuf; }

The key here is the BEGIN LOL_STRING, which tells lex to go into LOL_STRING
state. From then on, until we switch to some other state, the only regular
expressions lex will pay attention to and attempt to use to make tokens are the
ones that start with <LOL_STRING>. Those regex's are fairly simple, and
designed simply to handle escaped characters, until we get to the last one,
which puts us back in the default state (state 0) and does something useful
with the resulting string value lex has just found:


<LOL_STRING>\"		{ 
				*stringptr = 0;
				BEGIN 0;
				lol_lex_debug("yylex found string");
				lol_lex_debug(stringbuf);
				pllolcode_yylval.yarnVal = stringbuf;
				return YARN;
			}

Key for our purposes is the BEGIN 0, which tells lex to go back to its default
state. The other code takes the value of the string we've found, puts it into
pllolcode_yylval, which is a special variable lex uses when we need to keep
special data related to a token, and returns YARN, which is a constant that
says this token is a YARN token.

Also of interest in scan.l, and the reason we need flex v2.5.33 (which most
*NIX distributions don't ship yet) are the #define directives at the top having
to do with how lex allocates memory. By default, flex will use typical C
malloc() and free() calls to allocate and free memory in the code it generates.
However, PostgreSQL has its own memory management system where different
contexts are created and freed at different points in the life of a query or a
session, and tends to complain if backend code uses malloc() and free().
Instead, we need to use palloc() and pfree(), which are simply drop-in
replacements for malloc() and free(). These defines tell flex to use
PostgreSQL's memory management routines instead of its defaults, which prevents
the server from crashing every time you run a PL/LOLCODE function.

Finally, there are declarations for lots of different functions. Mostly these
are to prevent compiler warnings.

The parser

Once we have a stream of tokens, we need to translate it into something useful.
That's where yacc (or in our case, Bison, which is an updated version of yacc)
comes in. The method is very similar -- create a file that defines the grammar
of whatever you're parsing (typically that file has a .y extension, and in our
case is called gram.y), and pass it to yacc or Bison, which will create C code
out of it.

Yacc accepts grammar rules and converts them to C code, much like lex accepts
regular expressions and converts them to C code. Yacc is somewhat more complex,
however, because grammar rules are generally more complex than regular
expressions. The basic form should be fairly recognizable to those who have
studied formal languages; an example from PL/LOLCODE's gram.y follows:

lol_program:
	HAI lol_cmdblock KTHXBYE		{ executeProgram($2); }
	;

lol_cmdblock:
	lol_cmdblock lol_cmd	{ $$ = lolListAppend($1, $2); }
	| lol_cmd 		{ $$ = lolMakeList($1); }
	;

This example has two grammar rules. The labels lol_program and lol_cmdblock are
names of the rules, HAI and KTHXBYE are tokens returned by the lexer, and
anything in curly braces {} is template C code. The lol_program grammar rule
says that a set of tokens is considered an lol_program when it starts with a
HAI token, ends with a KTHXBYE token, and has stuff in the middle that can be
interpreted as a lol_cmdblock.

The next rule says a lol_cmdblock can be formed two different ways. First, it
might consist of one lol_cmdblock followed by a lol_cmd, or alternatively (the
pipe character | indicates alternative grammars that can make up a given rule)
a list of tokens that resolve to a single lol_cmd can be considered a
lol_cmdblock.  Later rules define lol_cmd.

Note the self-reference -- a grammar rule can define itself in terms of itself,
provided that at least one of the alternatives does not contain a
self-reference. That becomes useful when we have something like this:

HAI
	VISIBLE "IM IN YR DATABUKKIT"
	I HAS A YAK ITZ 3
	YAK R 15
	FOUND YR YAK
KTHXBYE

Suffice it to say that grammar rules not shown in the sample above will reduce
each line in the body of the program shown above to a lol_cmd. In this case the
parser will see the first lol_cmd (VISIBLE "IM IN YR DATABUKKIT") and decide
that it qualifies to be called a lol_cmdblock, thanks to the second alternative
in the lol_cmdblock rule. The next line is another lol_cmd, and the parser will
decide that the combination of the two lines is also a lol_cmdblock, because of
the first alternative definition of lol_cmdblock.

Perhaps a more detailed example is in order. Given the LOLCODE above, the lexer
would return a series of tokens something like this:

HAI VISIBLE YARN IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR FOUNDYR
IDENTIFIER KTHXBYE

The parser would go over the list, converting tokens to grammar rules something
like this, where the * indicates the token the parser is looking at. Note that
in yacc parlance, whenever the * moves to the right, that's called a "shift",
and when a token, group of tokens, or rule name is replaced with a more general
rule name (like when the parser turns "I HAS A YAK ITZ 3" into "lol_cmd",
that's called a reduction.

1) *HAI VISIBLE YARN IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR
FOUNDYR IDENTIFIER KTHXBYE

HAI can't be reduced by itself, so the parser shifts to the next token.

2) HAI *VISIBLE YARN IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR
FOUNDYR IDENTIFIER KTHXBYE

VISIBLE also can't be reduced by itself, nor can HAI VISIBLE, so the parser
shifts.

3) HAI VISIBLE *YARN IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR
FOUNDYR IDENTIFIER KTHXBYE

A lone YARN is an lol_identifier. The parser reduces.

4) HAI VISIBLE *lol_identifier IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR
FOUNDYR IDENTIFIER KTHXBYE

A single lol_identifier becomes a single lol_cmd. Another reduction.

5) HAI VISIBLE *lol_cmd IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR FOUNDYR
IDENTIFIER KTHXBYE

"VISIBLE lol_cmd" reduces to "lol_cmd". Here it's helpful to ignore the *
somewhat. Yacc keeps a stack and uses the whole stack to figure out when it can
do a reduction, so even though my indicator says it's looking at the lol_cmd
item, the parser is smart enough to look backward and see that VISIBLE
lol_cmd can be reduced.

6) HAI *lol_cmd IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR FOUNDYR
IDENTIFIER KTHXBYE

"lol_cmd" can be further reduced to "lol_cmdblock".

7) HAI *lol_cmdblock IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR FOUNDYR
IDENTIFIER KTHXBYE

"HAI lol_cmdblock" doesn't mean anything, nor does "lol_cmdblock IHASA" or any
other possible reduction, so another shift.  

7) HAI lol_cmdblock *IHASA IDENTIFIER ITZ NUMBR IDENTIFIER R NUMBR FOUNDYR
IDENTIFIER KTHXBYE

Here we'll skip a few steps for brevity. Suffice it to say that by the same
types of steps as above, "IHASA IDENTIFIER ITZ NUMBER" reduces to "lol_cmd".

8) HAI lol_cmdblock *lol_cmd IDENTIFIER R NUMBR FOUNDYR IDENTIFIER KTHXBYE

Here again, ignore the * a bit, because yacc is going to notice that it has
lol_cmdblock followed by lol_cmd, which reduces to lol_cmdblock.

9) HAI *lol_cmdblock IDENTIFIER R NUMBR FOUNDYR IDENTIFIER KTHXBYE

Again for brevity, suffice it to say that "IDENTIFIER R NUMBR" and "FOUNDYR
IDENTIFIER" both reduce to "lol_cmd", and the two lol_cmd elements become part
of the lol_cmdblock we've already created.

10) HAI lol_cmdblock *KTHXBYE

KTHXBYE doesn't reduce in itself, but the whole set of elements fits the
definition of lol_program, so another reduction.

11) lol_program

...and we've used up all the tokens, so we're done. 

At each reduction, the parser will run the code associated with the part of the
rule that triggered the reduction. So given the rule above for lol_cmdblock,
copied here...

lol_cmdblock:
	lol_cmdblock lol_cmd	{ $$ = lolListAppend($1, $2); }
	| lol_cmd 		{ $$ = lolMakeList($1); }
	;

...each time "lol_cmdblock lol_cmd" reduces to just "lol_cmdblock", the parser
calls lolListAppend().

So what of all the symbols in the grammar prefixed by $ characters? These make
reference to yacc's internal stack. When yacc reduces "lol_cmdblock lol_cmd",
some element in the stack represents a lol_cmdblock, and another represents a
lol_cmd. We can refer to those stack elements by $1 and $2 respectively.  Yacc
numbers each token involved in a reduction rule, 1 through N, moving left to
right through the rule, and each element can be referenced by that number. The
$$ symbol refers to the top of the stack, so the line in the sample above that
reduces "lol_cmdblock lol_cmd" takes the top two items off of yacc's stack and
replaces them with the return value of lolListAppend().

The parse tree

At this point it is probably a good idea to pause to describe what exactly
PL/LOLCODE is trying to accomplish here. It's trying to build a parse tree, an
in-memory representation of the program source it has been given. Eventually it
will pass this tree to the executor, which will execute it, and hopefully one
day it will even save the parse tree in some session-scoped memory context so
that it only has to parse functions once per PostgreSQL session rather than
once per function call.

The term "parse tree" is misleading, though. The parse tree is a representation
of the program, and a program is a series of sequential steps, so the "parse
tree" isn't really a tree, it's a linked list, because it's easier to represent
a sequence with a linked list than with a tree. However, there are some places
in this list that branch, like a tree. Each node in the list can point to a
sub-list of its own. This is useful for things like loops, where there will be
one node in the main list that represents the loop itself, and a sub-list
attached to that node to describe the commands within the loop. In fact, this
branching is used all over the place -- not just in loops. The VISIBLE command,
which takes an argument and spits it into the log system, is represented in the
parse tree by one node which points to a sublist, which in turn contains a
single node representing the value to be stuck in the log.  

When I started this project, I naturally assumed that because it was always
referred to as a parse "tree", the best data structure to use would be a binary
tree. This proved quite difficult, though whether that was because a binary
tree is an inelegant mechanism to represent a program's structure or because I
just didn't know my way around yacc very well is certainly debatable. Most
likely it was a little of the former and a bunch of the latter. In any case,
one day in frustration I looked at how PostgreSQL does it in PL/PgSQL, and saw
that it was using a linked list. So I started over on the yacc grammar once
again, and found that a linked list worked nicely.

Parser data types

It will be helpful at this point to dive into data types and how the parser
represents all this stuff internally. pllolcode.h defines lolNode as a
structure that (right now anyway) has elemets node_type, nodeVal, and list, to
store a type, a value, and a pointer to a lolNodeList, respectively. These
names will probably change throughout development, since I just noticed they're
inconsistent, and because I'll probably need to make this structure more
complex as I get the parser more complete.

The node_type is an enum that indicates what command this node represents (and
the executor is basically just something that loops through the nodes and
processes each one in a big switch/case command, with one case for each node
type). Some node types might require an argument of some sort, which is stored
in the nodeVal member, and if the node has a set of related subcommands, the
most intuitive example of which is a loop, where the body of the loop is just
such a list, the list member points to the list of subcommands, which are
stuffed into a lolNodeList.

Working with yacc means working with unions, and the first one we encounter is
here, in the form of nodeVal. The nodeVal is a union that can contain any of
LOLCODE's data types (YARN, NUMBR, NUMBAR, and TROOF -- and internally TROOF is
just a NUMBR). A more complex union is the one yacc uses to represent its
stack. In the discussion of the stack above, talking about representing items
on the stack as $1, $2, etc., no mention was made of what data type each of
these stack elements is. There's a whole set of yacc comamnds to deal with
that, reproduced in its current state below:

%union  {
		int numbrVal;
		float numbarVal;
		char *yarnVal;
		struct lolNodeList *nodeList;
		struct lolNode *node;
}

%type <nodeList> lol_program
%type <nodeList> lol_cmdblock
%type <node> lol_identifier
%type <node> lol_cmd
%type <yarnVal> YARN
%type <yarnVal> IDENTIFIER
%type <numbrVal> NUMBR
%type <numbarVal> NUMBAR
%type <numbarVal> VISDEBUG VISLOG VISNOTICE VISINFO \
	VISWARNING VISEXCEPTION lol_visibleLevel

(Note that the last line is split in this instance for formatting reasons, but
is one long line in the actual source)

The %union declaration defines all the different types that might show up in
the parser's internal stack, and the type declarations map C types from the
%union declaration to the types seen in grammar rules. So when the following
shows up in a grammar rule, because of the type declarations yacc knows that $1
means nodeList, which is a struct lolNodeList pointer, and $2 means a struct
lolNode pointer.

	lol_cmdblock lol_cmd    { $$ = lolListAppend($1, $2); }

Back to the parse tree

To support the parse tree, PL/LOLCODE contains a simple linked list
implementation built off of two functions declared in pllolcode.h and (for now
anyway) defined in gram.y. The functions will take a lolNode structure and
either make a lolNodeList out of it, or add it to an existing lolNodeList.  The
case of the lol_cmdblock grammar rule is fairly easy to understand, reproduced
(once again) below:

lol_cmdblock:
        lol_cmdblock lol_cmd    { $$ = lolListAppend($1, $2); }
	| lol_cmd               { $$ = lolMakeList($1); }
	;

What exactly this does should be fairly obvious now. If the parser encounters a
single lol_cmd (which because of other grammar rules will already be
represented as an lolNode structure), it passes that lolNode structure from the
top of the stack to the lolMakeList() function, which makes it into a
one-element lolNodeList, and returns a pointer to it, which replaces the
original lolNode on the top of the stack. If the parser encounters an lol_cmd
preceded by an lol_cmdblock (meaning that the stack contains an lolNode and an
lolNodeList in its top two elements) it pops them from the stack and passes
them to lolListAppend(), which appends the lolNode to the lolNodeList and
returns a pointer to the list, which goes back on the stack.

Notes on writing a yacc grammar

At this point, I'd better include this snippet, so I don't forget to include it
later. I ended up rewriting this grammar several times; as I learned more and
more about how yacc worked and got more and more comfortable with it, I judged
that it was easier to just start over than to rework my previous understanding
of the system to fit my new and improved understanding.  I also found it
easiest to work from the highest levels of detail down to the lowest, so the
first grammar rule I wrote was the one for lol_program, and as I expanded the
grammar I would similarly expand the test scripts I used to contain more and
more complex LOLCODE functions.  Finally, I found it easiest to start by making
the parser call debug functions that would simple output log messages, rather
than trying to get the parser and the parse tree right all at once. I replaced
the debug commands with parse tree generation commands after I got used to
making meaningful grammar rules.

More grammar rules

Here we need to investigate in closer detail how lolNodes are made to represent
an actual command, because the next set of grammar rules are designed to
translate LOLCODE commands to lolNode structures. Each command is different,
but can be well represented by VISIBLE, reproduced here:

lol_cmd:
        VISIBLE lol_cmd 
	       {
			tmpnval.numbrVal = NOTICE;
			$$ = lolMakeNode(node_VISIBLE,
				tmpnval, lolMakeList($2));
		}
	| VISIBLE lol_visibleLevel lol_cmd
		{ 
			tmpnval.numbrVal = $2;
			$$ = lolMakeNode(node_VISIBLE,
				tmpnval, lolMakeList($3));
		}

The VISIBLE command in PL/LOLCODE sends a message to the log. By default it's
at the log level NOTICE, but the programmer can optionally specify a log level
to use. VISIBLE commands end up in an lolNode in an interesting way. The type
of the node is node_VISIBLE, and the nodeVal is a NUMBR representing the log
level. The actual value to be sent to the log is contained in a one-element
sub-list, whose one node is an identifier. When the executor runs a VISIBLE
node, it first runs its sublist, which is expected to put a value into
LOLCODE's special IT variable, and then VISIBLE prints the contets of IT to the
log.

IT is like Perl's $_ variable -- it pretty much always references whatever the
last thing to happen was, and if no variable is specified, IT is assumed.
PL/LOLCODE supports IT by using it to pass values around for basically
everything, including VISIBLE.

So the code above accepts either version of the VISIBLE command (with or
without a log level specification), and creates an lolNode out of it with the
lolMakeNode() function. The value to be sent to the log is, at this point,
represented as an lol_cmd; earlier parser rules first identified it as a YARN,
NUMBR, NUMBAR, TROOF value, or IDENTIFIER, and labeled it an lol_identifier,
and another grammar rule maps all lol_identifiers to lol_cmds. The tmpnval
variable referenced is just a NodeVal union declared globally in gram.y to make
it easy to populate the second argument to lolMakeNode(). 

Executor

Once the parse tree is built, it gets sent to the executor. Specifically, that
happens in the most general grammar rule, when the parser decides it has seen
an lol_program, and sends it to the executeProgram() function, which is an
entry point into a fairly simple set of functions: executeProgram(),
executeList(), executeNode(), and one function for each type of node in the
parse tree (again, those are all part of the NodeType enum defined in
pllolcode.h).  executeProgram() simply checks to make sure there's something in
the parse tree, and sends it to executeList(). There, a simple loop calls
executeNode() on each node in the list. The executeNode() function is one big
case statement, which calls the appropriate execution function for each type of
node.

These execution functions themselves (or at least, the ones I've written so
far) are themselves very simple. The simplest are the ones that correspond to
constants (node_NUMBR, node_NUMBAR, node_YARN, and node_TROOF types). Their job
is simply to set IT to the constant value they contain. Something like
visibleNode() is a bit more complex: the node_VISIBLE node type should always
contain a sub-list consisting of one node, containing the value to be sent to
the logs. So visibleNode() first calls executeList() on its sub-list, which
initializes the IT variable, and then calls PostgreSQL's elog() function on the
value in IT, converting the value to a string on the way. Since each node
performs a relatively simple function, the executor function for each node type
can, at least in most cases, be correspondingly simple.

Variables

Variable handling is probably the sole remaining part of PL/LOLCODE not yet
handled in this document. Each PL/LOLCODE variable is represented in an
lolIdent structure, shown here with other related structures:

typedef union {
		int numbrVal;
		float numbarVal;
		char *yarnVal;
	} NodeVal;

typedef enum { ident_YARN, ident_NUMBR, ident_NUMBAR, ident_TROOF } \
	IdentType;

typedef struct lolIdent {
	char *name;
	IdentType type;
	NodeVal value;
} lolIdent;

Each variable has a name, data type, and value. Note that there are four data
types, but only three members in the union. TROOF values don't use a separate
union member; a TROOF variable representing WIN (TRUE) contains a non-zero
value in numbrVal.

These variables are stored, right now, in an array of LOL_MAX_VARS elements,
where the first element represents the IT variable. There is also a special
global lolIdent pointer called IT. Right now this could easily be replaced by a
reference to the first element in the variable array, but eventually I plan to
replace the array structure with something a bit more efficient, and when I do
it will be nice to have a pointer to IT immediately available, without having
to search through the variables to find the right one.

As it is, when a PL/LOLCODE function references a variable, lolGetVar() scans
through the array of variables one at a time until it finds one matching the
variable name used. Similarly, when a variable is set, lolSetVar() also scans
through the list of variables, setting an existing variable to the given value
when it finds a matching variable name, and creating a new value if it doesn't
find a match. lolSetVar() will also complain if the array of identifiers gets
too long.

Type casting

All type casting happens automatically. Since I haven't yet implemented
arithmentic, that has been pretty easy so far. Functions know what data type
they want, and when retrieving a variable they get it using a function designed
for the appropriate data type. For instance, lolVarGetString() will return the
value of the given variable name, converted to a string where necessary. Here I
could probably use more error checking, for instance when converting strings to
numbers. As it is, a string value that can't be converted to a number will just
return 0 (identical behavior to the underlying strtol() and strtof() calls).
lolVarGetTroof() returns lolWIN if the variable in question contains a non-zero
number or a non-empty string; otherwise it returns lolFAIL.

********************************************************************

Up until this point, I've been writing documentation for stuff I wrote weeks
ago. Now that the documentation has finally caught up with the code, I hope to
continue to document in journal format. This obviously means that this document
won't become the "How to Write A PL" document I hope one day to generate,
without having to go through some amount of work. Since that was already the
case anyway, I don't feel that turning this into a journal format will detract
greatly from the end product. The potential for increased detail, and if
nothing else, the added time dimension, may in fact greatly improve the value
of this document.

3 Dec 2007 -- I hope to start implementing math operations. To my surprise and
pleasure, I find tonight that lolcode.com does *not* yet have a 1.3 spec out,
which means I'm still coding for the 1.2 spec. I'm glad for that. The biggest
problem I anticipate for math operations is making sure the type casting works
out decently. We'll see how that goes.

First, I've decided to change the NodeType enum to an int, and to just use the
tokens defined by bison and flex to determine the node type. It prevents me
from having to create all kinds of node types in that enum that are already
defined elsewhere in slightly different language. And it appears, at first
blush, to actually work. I've had to define a few fake %tokens to keep
everything compiling happily, but that's no matter.

My goodness, it actually works. It assumes everything is a NUMBAR, true, but it
really works :) It was even pretty easy.

4 Dec 2007 -- Rather than have separate functions for math operators and
boolean ones (which I'm doing today), I've decided to have major groupings
based on the arity of the operator in question. So there are grammar rules for
binary operators, unary operators, and N-ary operators, since the handling is
very similar for each one. Each operator type has each of its operands in the
linked list attached to the operator's node. I've had some problems getting
N-ary operators to work, but it turned out it was principally because I forgot
to save the node on the stack when I first create the N-ary operator's node.
Also, I've named functions and grammar rules dealing with N-ary operators as
though the common term were "X-ary" operators, because "N-ary" is too easy to
confuse with unary or binary in function names. A bad idea? Time might bother
to tell.

5 Dec 2007 -- Implemented BOTH SAEM and DIFFRINT. This was really easy, given
that I have all the logic for a generic boolean operator set up already, and
that I've taken serious liberties in the binary operator handling function by
assuming that pretty much everything casts to a NUMBAR. Eventually I'll have to
pay for this by rewriting much of it to support more data types than just
NUMBAR, but for now it's nice to make it feel like I'm progressing.

6 Dec 2007 -- I'm trying the development of the last few days out on my OpenBSD
box, and finding that strtof() doesn't exist. I'm sure there's a good reason
for this (or at least, a zealously puritanical one, in keeping with OpenBSD
style) but I don't know what it is. I really should store NUMBARs as
doubles anyway, and not floats, for precision's sake (then again, why
would someone worried about precision use PL/LOLCODE?). So I've gone
halfway, for now. I've converted strtof() to strtod(), and typecast the
return value as a float in all cases. Wimpy, I know, but I need to help
make dinner, and don't have time to hack on this right now.

7 Dec 2007 -- I think I've adequately switched all NUMBARs to doubles. Time to
test.

8 Dec 2007 -- I realized I've not properly supported NULLs and NOOBs yet. This
should work now, with the exception of NULL return values... but I've not yet
tested it. I want to spend the next little while cleaning up TODO items,
because it seems like a good thing to pay attention to every once in a while.

Later -- the tests worked, and now it supports NULL return values.

14 Dec 2007 -- I've been trying to make the math functions work properly with
different types. LOLCODE has some casting rules that seem like they'll take
some work to implement properly. For instance, for mathematical and boolean
functions, values are automatically cast to meaningful types. For comparision,
there is no automatic casting beyond promoting NUMBRs to NUMBAR if there's one
of each in the comparison. But YARNs aren't typecast to numeric values, or vice
versa, implicitly. Anyway, I think the arithmetic functions work now, but I
have to add some stuff back in to make it possible to be really sure what data
type a variable is, before I can really test. I've had to rewrite most of the
binary operators' code, some of which is still disabled while I mull over how
it should work. Booleans work again, though.

29 Dec 2007 -- For a few weeks I've been working off and on (mostly off) to get
the explicit typecasting working so I could test comparison functions and such
things more fully. I think casting finally works. As a bonus, I found what's
been causing all the incompatible pointer type warnings I've been getting
during the build process, and fixed that, so the build is a lot cleaner. It's
pretty important to keep track of what types you're dealing with when doing
this.

31 Dec 2007 -- I think I'm insane. I'm considering finally making the necessary
changes so that line ends and commas are properly supported.
--(shortly later)--
Surprisingly enough, it compiles, and after I make the scanner accept initial
newlines, it doesn't yell about syntax errors. I've just added the handling for
the R command, and found there are some bugs somewhere. Specifically, declaring
a variable and assigning a value (I HAS A var ITZ value) followed by VISIBLE
var will cause a crash. However, I HAS A var ITZ value, var R newvalue, VISIBLE
var works. The changes I've done today are momentous enough, though, that I'm
going to check this in despite the bug.

1 Jan 2008 -- I've just added a debug function, GIMMEH YR VARZ, which is
supposed to print out the symbol table. I'll probably leave it in forever. For
now it should be nice to help debug the I HAS A <var> ITZ <value> problems I'm
having. I also think I've left a bug in the <var> R <expression> syntax, where
the expression is assumed to be constant. Finally, I've figured out how to get
rid of the shift/reduce conflict warning bison is giving me -- the %expect
option tells it how many shift-reduce conflicts to expect, and if it finds that
many (which it should handle correctly by default), it doesn't warn about them.
I now build without warnings :)

Incidentally, I'm starting to think I might want to restructure the
names of executor node structures and identifier structures. They're
very similar, and sometimes I confuse the two, leading to lots of
warnings.

4 Feb 2008 -- Not much going on lately; I've not had loads of time. Of late
I've been trying to make sure I've got the type handling done right in the code
that gets function arguments. Pl/python, for instance, seems to have a little
switch statement that converts various types of PostgreSQL internal variables
to different python types, and I figure I should do the same. Once I've got
that working, I'll probably mess with SPI...

(Later) types appear to work. I've moved the bulk of the type conversion logic
into pllolcode_symbol.c because it seems more consistent to put it there,
despite the result of having to mess with a bunch of #includes. SPI is up next,
but probably not today -- I'm updating to 8.3, and hope to see that everything
still works...

5 Feb 2008 -- W00T. I'm finally starting work on SPI. There will, in the
current rendition, be one command: GIMMEH <var> OUTTA DATABUKKIT <query>, e.g.
GIMMEH LOL2 OUTTA DATABUKKIT "SELECT a FROM table LIMIT 1". There's no way to
return multiple rows, or to loop through multiple rows, yet, nor is there a way
to deal with record types. There will be, though, a provision that if you don't
give it a variable name, it will stick the value in IT... kinda Perl-esque, no?
I'm thinking of providing a system to translate LOLSQL into real SQL, and
allowing queries in both.

For now, if you run a query that returns multiple columns or rows, I'll just
stick the first column of the first row in the variable you're interested in,
and drop the rest.

5 Feb 2008, later -- I committed what I'd done earlier today even though my
test script crashed the backend, because I wanted to have a backup somewhere.
It turns out that contrary to my expectations, it wasn't the new SPI stuff
that's crashing the backend; right now I think it's something wrong with FOUND
YR. If I declare a variable and return it with FOUND YR, the backend crashes,
but if I return a constant, it doesn't. Time to debug...

24 Apr 2008 -- It's been ages, but I'm working on this again. On this
particular computer, I can't make the string returning problem show up. So I'm
ignoring it for now and making conditionals work. It looks like they've got
some problems, but O RLY, YA RLY, MEBBE, and NO WAI work at least some of the
time :)

25 Apr 2008 -- With some helpful examples from yesterday's work, I've just
finished making WTF blocks (the equivalent of a switch statement) work. And I
don't think the O RLY? code I did yesterday has problems, after all.

29 Apr 2008 -- I just made it legal to return NULL when the function has a
return type. This is probably a useful feature :)

30 Apr 2008 -- Basic IM IN YR loops were pretty easy to add just now. Note that
the loop is infinite unless you include a GTFO somewhere in it

1 May 2008 -- It has been suggested that I include in this documentation a link
to this: http://epaperpress.com/lexandyacc/index.html. It's a description of
how to write a calculator with lex and yacc, and I found it really useful
getting started. Thanks to Tom Niemann for writing it.
