------------------------------------------------------------------------------ Inform 6.21 Technical Manual Graham Nelson 27th April 1996 revised on the following dates: 5th May, 10th May, 6th September, 23rd September, 16th December 1996, 27th January, 26th March, 5th April, 8th September 1997, 22nd March 1998, 10th December, 30th January 1999, 27th April ------------------------------------------------------------------------------ 1 The source code 1.1 Inventory 1.2 Map 1.3 Naming conventions 1.4 Typedef-named types 2 Porting Inform to a new environment 2.1 Dependence on the OS 2.2 Portability issues: the types int32 and uchar 2.3 The character set and the format of text files 2.4 The OS definitions block in "header.h" 2.5 Running Inform in a multi-tasking OS 3 Front end 3.1 The ICL interpreter 3.2 Fundamental method 4 Lexical analysis 4.1 Aim and structure of the lexer 4.2 Level 1: lexical blocks and buffered input 4.3 Level 2: breaking text into lexemes 4.4 Representation of tokens 4.5 Level 3: identifying identifiers 4.6 Hash-coding and comparing names 4.7 The symbols table 4.8 Token lookahead: the circle 4.9 Summary of the token output 5 Syntax analysis 1: the top-down structural parser 5.1 Aim and structure of the syntax analyser 5.2 Predictive parsing 5.3 The context-free grammar 5.4 Assigning values to symbols 5.5 Outputs other than assembly language 5.6 Assembly operands 5.7 Translation to assembly language 5.8 Summary of assembly language instructions output 6 Syntax analysis 2: the bottom-up expression parser 6.1 Map and structure 6.2 The operator precedence grammar 6.3 Lexical level 4: tokens to etokens 6.4 Resolution of ambiguous tokens 6.5 Constant folding 6.6 Checking lvalues and simplifying the parse tree 6.7 Summary of parse tree output 7 Code generation from parse trees 7.1 Aim and structure of the code generator 7.2 Annotation for conditions 7.3 Main traversal 7.4 Void context 7.5 Results of subexpressions 7.6 Conditional and logical operators 7.7 System functions 7.8 Strict mode assertions 8 Assembly of code, text and data structures 8.1 Assembly of initial code 8.2 Branch offset backpatching and optimisation 8.3 Text translation: ISO and Unicode resolution 8.4 Dictionary management 8.5 Format of tables unspecified by the Z-machine 8.6 Grammar version numbers GV1 and GV2 8.7 Value backpatching 8.8 Packed address decoding 9 Run-time veneer 9.1 Services provided by the veneer 9.2 Specification of the veneer routines 9.3 Properties 2 and 3, "ofclass" and "metaclass" 9.4 Class numbering and class objects 9.5 Individual property identifiers 9.6 Individual property tables 9.7 Availability of symbol names at run-time 10 Module compilation and linking 10.1 Model 10.2 Linking a module 10.3 Imports and exports 10.4 Backpatch backpatching 10.5 How modules differ from story files 10.6 Restrictions on what modules may contain 11 Service levels 11.1 Variables and arrays 11.2 Memory allocation and deallocation 11.3 Error messages 11.4 File input/output 12 Low-level language features 12.1 Using the "Trace" directive 12.2 System constants and other secret syntaxes 12.3 The "Zcharacter" directive 12.4 Sequence points 12.5 Format of debugging information files 12.6 How to syntax-colour Inform code 13 What little I remember 13.1 Publication history 13.2 Design history 13.3 Implementation history 13.4 Modification history ------------------------------------------------------------------------------ 1 The source code --------------- 1.1 Inventory --------- Inform is written in portable ANSI C and the source code is divided into 21 files of code, called "sections", plus 1 #include file of linkage, constant and type definitions. These files are: arrays.c asm.c bpatch.c chars.c directs.c errors.c expressc.c expressp.c files.c inform.c lexer.c linker.c memory.c objects.c states.c symbols.c syntax.c tables.c text.c veneer.c verbs.c header.h Note that all their names fit into the 8.3 filenaming convention. The subdivision into 21 sections is intended to ensure that each .c file can be compiled into one linkable object file: under some C compilers object code files cannot exceed 64K in length. On my machine, expressc.o (the object code derived from expressc.c) is the largest, at 40K (about a third of which is the static table of operator data). A concise tree giving the structure of the source code follows. A section name is given brackets when it has already been given on a previous line: so "(inform.c)" is intended to be read as "now back to inform.c again". A name written as a function, like "compile()", is indeed a function, whose arguments are omitted here. Text in square brackets indicates the presence of interesting tables of static data. Finally, note that this structure is not absolutely rigorous: the error-reporting routines in "errors.c", for example, are called from all over Inform, not just from the lexical analyser. inform.c ICL parser: switches, filename translation, path variables memory.c ICL memory command parser: sets memory settings. (inform.c) compile(): polling all sections to manage variables, allocate and free arrays syntax.c syntax analyser, top level lexer.c lexical analyser: converts source text to a stream of tokens [Tables of all Inform keywords] chars.c performs character set translations files.c reads source code files into buffers for the lexer; performs miscellaneous file I/O symbols.c keeps table of symbol names found in the source; recognises from and adds to this table errors.c issues error, fatal error and warning messages (syntax.c) parse_program(): top level routine in syntax analyser parse_directive() directs.c parse_given_directive(): parses and obeys the easier directives; manages conditional compilation; delegates directive parsing down to other sections for harder cases: text.c make_abbreviation(): Abbreviate arrays.c make_global(): Array, Global objects.c make_attribute(): Attribute make_property(): Property make_class(): Class make_object(): Object verbs.c make_fake_action(): Fake_Action make_verb(): Verb extend_verb(): Extend linker.c link_module(): Link (symbols.c) assign_symbol(): giving value and type to symbols (syntax.c) parse_routine() parse_code_block() states.c parse_statement(): assigns ".Label" labels, parses and generates code for statements parse_action(): handles statements parse_print(): handles print and print_ret asm.c parse_assembly(): handles assembly language source code (preceded by "@") expressp.c parse_expression(): parses all expressions (including constants), conditions and assignments. expressc.c [Table of operators] code_generate(): generates code from parse trees previously found by parse_expression() (asm.c) assemble_2_to() (and many other similarly named routines) [Database of all Z-machine opcodes] assemble_instruction(): assembles a single Z-code instruction assemble_label_no(): puts label N here assemble_routine_end(): finishes routine, backpatches and optimises branches (text.c) compile_string(): translates ASCII text to Z-encoded text the dictionary manager the abbreviations optimiser (chars.c) performs character set translations veneer.c compile_veneer(): compiles in any whole routines needed by the rest of the compiled code [Table of Inform source code for veneer routines] tables.c construct_storyfile(): glues together all the code, dictionary, object tree, etc. into a story file bpatch.c backpatch the code in the light of recent knowledge 1.2 Map --- Here follows a map of the Inform archipelago, marking the inhabited islands, their shipping lanes and chief imports and exports: command line and/or ICL files in | | ICL commands \|/ +----------+ FRONT | inform.c | END | memory.c | +----------+ | | filenames | \|/ +------------+ LEXICAL ------> files.c -----> | lexer.c | ANALYSER source chars | symbols.c | code in +------------+ /|\ | symbol | | values | | tokens | \|/ +------------+ SYNTAX | syntax.c | -----+------> asm.c --->---\ ANALYSER: | states.c | / assembly /|\ initial| STATEMENTS | . | / language | code| | ---------- | / | | @ ASSEMBLY | asm.c | -/ | | | ---------- | | | | . | parse trees | \|/ EXPRESSIONS | expressp.c | --+-------> expressc.c asm.c | (map 6.1) | \ | | . | \ | | . | \ TEXT | | ---------- | strings +-------+Z-text | DIRECTIVES | directs.c | \---->|text.c |-->---)|(--\ | . | |chars.c| | | | . | +-------+ | | | . | dictionary| | | | . | alphabets \|/ \|/ | | . | | | | | arrays.c | ------->------\ | | | | . | array area | | raw| | | . | | | Z-code| | | objects.c | ----->-----\ | | | | | verbs.c | objects | | | | | +------------+ | | | | | | \|/\|/\|/ \|/ | | grammar +----------+----------+ | \---------------> | tables.c bpatch.c | | +----------+----------+ | | OUTPUT | | | | | Z-machine \|/ Z-code| | up to | | | code area | \|/ \|/ \-----------> files.c | | \|/ story file out (For clarity, the linker and a few smaller tables are missed out; and the "service" sections of Inform, such as "errors.c" and the allocation code in "memory.c", are missed out since they are, so to speak, pipes and sewers which lie beneath the surface of the ocean.) 1.3 Naming conventions ------------------ The "header.h" makes over 700 constants using #define. These are mainly in capital letters and are followed by _ and then some short code indicating what kind of constant is being defined: for instance, NUMBER_TT means "the token type ". We write *_TT for the set of constants ending in _TT. Similarly, though to a lesser extent, groups of related variables and routines have grouped names. ------------------------------------------------------------------------- Set of constants Used for ------------------------------------------------------------------------- *_Extension File extensions, such as ".z5", used if the host OS supports them *_Directory Initial values for the ICL path variables (e.g., default pathname where story files are written to) *_TT Token types *_CODE Token values for statement and directive names *_COND Token values for textual condition names *_SEGMENT Token values for object definition segment names *_MK Token values for misc statement keywords *_TK Token values for "trace" directive keywords *_DK Token values for misc directive keywords *_SC Token values for system constant names (* is written in lower case) *_SYSF Token values for system function names [In all of the above eight cases, * is the name of the statement, keyword, etc. referred to, written in upper case except as specified above] *_SEP Token values for separators (the name sometimes reflects the text, e.g., DARROW_SEP for the double-length arrow "-->"; sometimes its use, e.g. NOTEQUAL_SEP for "~=") *_OP Token values for operators *_A Associativity values for operators *_U "Usage" (infix, prefix or postfix) values for operators *_T Symbol types *_SFLAG Symbol flags (bitmasks containing one bit set, so that (sflags[i] & *_SFLAG) is true if flag is set) *_CONTEXT Contexts in which the expression evaluator can run (e.g., "void", "condition") *_zc Internal numbers referring to Z-machine opcodes (* is the Standard 0.2 name for the opcode, written in lower case) *_OT Assembly operand types *_STYLE Two constants used to set whether the Z-machine has "time" or "score/moves" on its status line *_ZA Z-machine areas: e.g. PROP_ZA refers to the property values table area *_MV Marker values *_RTE Run-time error numbers *_VR Veneer routines (* is the name of the routine, usually in mixed case) *_DBR Record types in debugging information files ------------------------------------------------------------------------- Set of variables Used for ------------------------------------------------------------------------- *_switch Flag indicating whether a command-line switch such as -s is on or off *_setting Numerical value set by a command-line switch such as -t3 MAX_* A limit on something: note that a few of these are #define'd but most are memory setting variables no_* Number of things of this type made so far max_* Maximum number of things of this type made token_* Three variables used to hold the value, type and lexeme text for the last token read *_trace_level 0 if tracing information is not being printed out about *; otherwise, the larger this is, the more output is produced *_offset Byte offset in the Z-machine, either from the start of this Z-machine area or from the start of Z-machine memory *_top Pointer marking the current end in some uchar array (usually holding a Z-machine area being put together) ------------------------------------------------------------------------- Set of routines Used for ------------------------------------------------------------------------- init_*_vars() Routine in which section * of Inform initialises its variables to begin compilation *_begin_pass() ...and in which it initialises its variables at the start of the source code pass *_allocate_arrays() ...and in which is allocates any memory or arrays it needs to begin compilation *_free_arrays() ...and in which it deallocates any memory or arrays it has allocated, after compilation parse_*() Routine in the syntax analyser to parse the source code construct * assemble_*() Instructing the assembler to generate an instruction: assemble_#() (where # is a number from 0 to 4) an instruction with # operands assemble_#_to() an instruction with #operands which stores a result assemble_#_branch() an instruction with #operands which makes a conditional branch *_linenum() Keeping and writing line references to the debugging information file 1.4 Typedef-named types ------------------- ------------------------------------------------------------------------- Typedef name Where defined Used for ------------------------------------------------------------------------- int32 H signed 32-bit integer uint32 H unsigned 32-bit integer uchar H unsigned char assembly_operand H holding Z-machine numbers in the form used in Z-code, together with linkage information about how they were calculated assembly_instruction H a convenient representation of an instruction of Z-code to assemble opcode asm.c everything about a Z-machine opcode: how many operands it has, whether branch or store, etc. verbl H grammar line of 8 token values verbt H grammar table of grammar lines prop H list of values for a property propt H property values table fpropt H the same, but with attributes too objectt H object tree-position and attributes dict_word H Z-text of a dictionary word dbgl H source code reference used for the debugging file (to file, line, char) keyword_group H the plain text of a group of keywords (such as: all the statement names) token_data H lexical tokens expression_tree_node H node in a parse tree produced by the section "expressp.c" operator H everything about an operator: its name, how to recognise it, its usage and associativity, etc. memory_block H an extensible area of memory (allocated in 8K chunks as required) tlb_s text.c used in abbreviations optimiser optab_s text.c used in abbreviations optimiser FileId H filename and handle for a source file ErrorPosition H filename and line reference for error message printing purposes LexicalBlock lexer.c name, line count, etc. within a block of text being lexed Sourcefile lexer.c buffer, pipeline and lexical block for a source code file being lexed ImportExport linker.c holds import/export records Marker linker.c holds marker records VeneerRoutine veneer.c holds low-level Inform source code for a veneer routine ------------------------------------------------------------------------- "H" is an abbreviation here for "header.h" 2 Porting Inform to a new environment ----------------------------------- 2.1 Dependence on the OS -------------------- Strenuous efforts have been made over the last three years to make the source as operating-system independent as possible. As a general principle, mostly adhered to, all operating-system differences should be in the "header.h" file, and not in the 20 sections. As a general rule, for each target OS which Inform is being ported to, a new #define name is invented. For example, the name LINUX is used for the Linux port. When Inform is being compiled, only that one symbol will be defined, and none of the others: thus #ifdef LINUX ...code... #endif compiles the given code only for the Linux port. There are some very poor "ANSI C" compilers out there, and many more mediocre ones (which almost obey the standard, but don't quite): in any case the ANSI standard is very broadly defined. For example, the code int x; x = 45*1007; printf("%d\n", x); is entirely ANSI compliant, but results in different numbers being printed on different machines, due to the fact that ANSI does not specify the range of numbers which a variable of type int can safely hold. Since C is so highly unportable a language, and since some of the compilers used to produce Inform are poor, the whole Inform code has to be written with the worst possible compiler in mind. An illustration of this is that all preprocessor commands, such as #define, must begin on column 1 of the source code: even when they occur in code which is #ifdef'd out. VAX C (a particularly bad compiler) will reject #ifndef VAX #define FROG 2 #endif for example, even when VAX is defined. This makes the declarations in "header.h" annoyingly illegible for everybody. 2.2 Portability issues: the types int32 and uchar --------------------------------------------- The main issues when porting Inform have been found to be: (a) the size of "int", (b) whether "char" is unsigned or signed by default, (c) what conventions apply to filenames, (d) assumptions about sizeof() when casting pointers, (e) how parameters (switches, filenames, etc.) are passed to Inform. (a) ANSI requires that "int" be at least 16 bit, though advances in CPU technology mean that under most of today's environments it will in fact be 32 bit. ANSI further requires that "long int" be at least as big as "int", but not that it has to be any bigger. Inform needs at least one integer type to be able to hold 32 bit signed values, and "header.h" contains code which attempts to typedef the name "int32" to such a type. This should happen automatically. Under ANSI rules, as the above says, a compiler need not have any integer type larger than 16 bits: if so, that compiler will not be able to compile Inform. An annoying issue here is that compilers vary widely in the extent to which they give errors or warnings when they detect silent "promotions" from one integer type to another. This makes it very hard to sort out the types of every object in the program between int and int32 so that everybody is happy: in practice, every time a new release of the source code has been made, a few dozen types have had to be fiddled with until everybody can compile it. (b) Compilers seem to divide about fifty-fifty on this. Again, the original standard is vague: the issue is really about how the value (char) 253, say, should be interpreted when it is cast to (int). Should the answer be 253, or -3? ANSI did not specify this because compiler writers wanted to be able to choose whichever could be done instantly on the CPUs they were working with. (If you store a char in the bottom 8 bits of a 32 bit register, then casting the value (char) 253 to (int) -3 means setting all 24 of the upper bits, which requires code to be compiled.) Inform uses a typedef'd type called "uchar" when it needs an unsigned char type: it uses plain "char" when it doesn't mind. It never needs a signed char type. In theory ANSI C compilers must recognise the keywords "signed" and "unsigned", but some don't: typedef unsigned char uchar; actually produces an error on some compilers. So the typedef can only be made with your help. (On other compilers, "unsigned" is legal but "signed" is illegal.) (c) Many people think that the minimal 8.3 convention will work on any operating system, but this is not true (it won't work under Acorn's RISC OS). Much of each OS specification in "header.h" is therefore to do with filenaming. (d) For instance, sizeof(char *) sizeof(int *) sizeof(int32 *) sizeof(int) may all be different numbers on machines with segmented memory maps. This being so, casting between pointer types may lose information, and a few arrays in the source have surprising types to ensure safety. One thing Inform does need to be able to do is to subtract one pointer (of the same type) from another: it defines the macro subtract_pointers(X, Y) to do this. X and Y are normally of type uchar; there seems to have been no problem with this in practice. (e) The ANSI standard is quite good on the command line, and Inform expects to read parameters by the standard argc, argv mechanism. Unfortunately the Macintosh, for instance, has no orthodox command line. Such a port probably wants to have an "outer shell" which displays a window, allows options to be set and then calls the Inform 6 core as needed. The section "inform.c" normally compiles a "main" routine which makes a few machine-dependent changes and then passes its arguments straight on to "sub_main". For instance, here's the v6.10 source: int main(int argc, char **argv) { int rcode; #ifdef MAC_MPW InitCursorCtl((acurHandle)NULL); Show_Cursor(WATCH_CURSOR); #endif rcode = sub_main(argc, argv); #ifdef ARC_THROWBACK throwback_end(); #endif return rcode; } The Macintosh Programmer's Workshop port is making multi-tasking work before embarking on compilation; the Acorn Desktop Debugging Environment port is tidying up after any error throwbacks, at the end of compilation. The point is that here is the place for such minor machine quirks. However, if you want an entirely new front end (such as Robert Pelak's Macintosh port of Inform has), then you need to define #define EXTERNAL_SHELL in your machine definition block (see later). This will mean that no "main" routine is compiled at all from "inform.c" (so you can simply link the Inform source into your own code, which will contain its own "main.c"): Inform should be run by calling extern int sub_main(int argc, char **argv); having set up argc and argv suitably. For instance, the outer shell might take names typed into dialogue boxes, and various ticked options on a window, and make these into a series of ICL commands, which are then handed over textually to sub_main. I suggest that the most efficient way to do this is to write them as an ICL file somewhere and to pass sub_main a single parameter telling it to run this ICL file. 2.3 The character set and the format of text files ---------------------------------------------- The Inform source code assumes that the compiler is running on a machine whose character set agrees with ASCII in the range $20 to $7e. (This allows both plain ASCII and any of the ISO 8859 extensions to ASCII.) ASCII is now universal, but there is no common format for plain text files, and in particular how lines of text are ended. For example: MS-DOS, Windows, etc.: $0d $0a Mac OS: $0d RISC OS: $0a Inform 6 can read source code files in all these formats, and which further use any of the character sets above: plain ASCII or ISO 8859-1 to -9. (This is configurable using the -C switch.) 2.4 The OS definitions block in "header.h" -------------------------------------- Each Inform port makes a block of definitions in the header file. These blocks take a standard format. Firstly, the block is put in #ifdef's so that it will only be processed in this one port. The block is divided into 6 sections, as follows. /* 1 */ MACHINE_STRING should be set to the name of the machine or OS. /* 2 */ Section 2 contains some miscellanous options, all of which are on/off: they are by default off unless defined. The possibilities are: USE_TEMPORARY_FILES - use scratch files for workspace, not memory, by default EXTERNAL_SHELL - this port is providing an entire external front end, with its own "main" routine: see above PROMPT_INPUT - prompt input: ignore argc and argv, instead asking for parameters at the keyboard. (I hope people will write front-ends rather than resort to this, but it may be a useful staging post.) TIME_UNAVAILABLE - if the ANSI library routines for working out today's date are not available CHAR_IS_SIGNED - if on your compiler the type "char" is signed by default Note that defining USE_TEMPORARY_FILES does not make a mandatory choice (as it did under Inform 5): whether to use allocated memory or temporary files is selectable with -F0 (files off) or -F1 (files on) in ICL. All that this option does is to define the default setting for this -F switch. Running -F0 is faster (possibly, depending on whether your C library provides buffering or not, much faster) but consumes 100 to 300K more memory (it does so flexibly, allocating only what it needs, unlike the Inform 5 option). Most users will not want to understand the issues involved here, so please make a sensible default choice for them. Once again, note that CHAR_IS_SIGNED must be defined if "char" is signed: otherwise "uchar" will be typedef'd wrongly. /* 3 */ An estimate of the typical amount of memory likely to be free should be given in DEFAULT_MEMORY_SIZE. (This is only a default setting.) There are three settings: HUGE_SIZE, LARGE_SIZE and SMALL_SIZE. (I think it was Andrew Plotkin, though, who remarked that HUGE_SIZE might sensibly be renamed "not-bad-by-1980s-standards-size": these all allocate quite small amounts of memory compared to, say, the 8M of workspace that Windows appears to need just to keep breathing.) For most modern machines, LARGE_SIZE is the appropriate setting, but some older micros may benefit from SMALL_SIZE. /* 4 */ This section specifies the filenaming conventions used by the host OS. It's assumed that the host OS has the concept of subdirectories and has "pathnames", that is, filenames giving a chain of subdirectories divided by the FN_SEP (filename separator) character: e.g. for Unix FN_SEP is defined below as '/' and a typical name is users/graham/jigsaw.z5 Normally the comma ',' character is used to separate pathnames in a list of pathnames, but this can be overridden by defining FN_ALT as some other character. Obviously it should be a character which never occurs in normal pathnames. If FILE_EXTENSIONS is defined then the OS allows "file extensions" of 1 to 3 alphanumeric characters like ".txt" (for text files), ".z5" (for game files), etc., to indicate the file's type (and, crucially, regards the same filename but with different extensions -- e.g., "frog.amp" and "frog.lil" -- as being different names). If FILE_EXTENSIONS is defined, then Inform uses the following standard set of extensions unless they are overridden by other definitions at this point. (Please don't override these definitions without reason.) Source_Extension ".inf" Source code file Include_Extension ".h" Include file (e.g. library file) Code_Extension ".z3" Version 3 story file V4Code_Extension ".z4" 4 V5Code_Extension ".z5" 5 V6Code_Extension ".z6" 6 V7Code_Extension ".z7" 7 V8Code_Extension ".z8" 8 Module_Extension ".m5" Linkable module file (version 5, which is all that Inform 6 supports yet) ICL_Extension ".icl" ICL file The debugging information file and the transcript file also have defined default-names which can be over-ridden in this section if desired: Transcript_File "gametext.txt" or "gametext" Debugging_File "gameinfo.dbg" or "gamedebug" If you do not define FILE_EXTENSIONS, then it is essential to define STANDARD_DIRECTORIES instead. (You can also define both, if you so choose.) The STANDARD_DIRECTORIES option causes Inform to put all files of a particular kind into a standard directory for them: e.g., a "games" directory might hold the story files compiled, etc. All that happens when a standard directory is defined is that Inform sets the default value of the relevant pathname variable to that standard directory: otherwise, its pathname variable starts out as "". The standard directories are, once again, defined by default as follows: once again you can define these settings yourself, but please don't do so without a good reason. Source_Directory "source" Include_Directory "library" Code_Directory "games" Module_Directory "modules" Temporary_Directory "" ICL_Directory "" Note that the actual user of Inform can still override anything you choose by setting the pathname with an ICL command. A good way to test all this is to run inform -h1, which does some experimental filename translations and prints the outcome. /* 5 */ Section 5 contains information on how to choose the filenames for the three temporary files. (Note that this needs to be done even if USE_TEMPORARY_FILES is not defined.) On many machines, you only need to give a suitable name. (As usual, if you don't bother, something fairly sensible happens.) Temporary_Name is the body of a filename to use (if you don't set this, it becomes "Inftemp") and Temporary_Directory is the directory path for the files to go in (which can be altered with an ICL command). However, under some multi-tasking OSs it is desirable for multiple Inform tasks to work simultaneously without clashes, and this means giving the temporary files filenames which include some number uniquely identifying the task which is running. If you want to provide this, define INCLUDE_TASK_ID and provide some code... #define INCLUDE_TASK_ID #ifdef INFORM_FILE static int32 unique_task_id(void) { ...some code returning your task ID... } #endif /* 6 */ Finally, section 6 is "anything else". In particular this is where DEFAULT_ERROR_FORMAT should be set. This switches between different styles of error message. (This is not a matter of aesthetics: some error-throwback debugging tools are very fussy about what format error messages are printed out in.) For example, here is a typical OS definition block: #ifdef UNIX /* 1 */ #define MACHINE_STRING "Unix" /* 2 */ #define CHAR_IS_SIGNED /* 3 */ #define DEFAULT_MEMORY_SIZE LARGE_SIZE /* 4 */ #define FN_SEP '/' #define FILE_EXTENSIONS /* 5 */ #define Temporary_Directory "/tmp" #define INCLUDE_TASK_ID #ifdef INFORM_FILE static int32 unique_task_id(void) { return (int32)getpid(); } #endif #endif 2.5 Running Inform in a multi-tasking OS ------------------------------------ As mentioned above, if Inform is being used in a multi-tasking environment then temporary file-naming will need a little attention. Another issue is that under some systems the other tasks may all freeze up while Inform is working, because tasks only voluntarily hand control back to the OS (allowing it to poll the other tasks and share out the processor time). This means that some call to an OS primitive routine may have to be inserted into Inform somewhere: a good place to do this is in the routine reached_new_line() of the section "lexer.c". 3 The front end ------------- 3.1 The ICL interpreter ------------------- Inform's front end is quite simple and there is little to say about it. Its task is to translate the user's compilation command into five kinds of ICL variable: switches, which are on/off flags controlling compilation options; switch settings, which are numerical values (between about 1 and 8) controlling more finely-controllable compilation options; path variables, which hold the text of some directory pathname; memory settings, which hold positive numbers (the extent of certain memory allocations within Inform); the filenames of the source code to read, and the object code to write. See "inform.c" and the memory-setting-parser in "memory.c" for the details. 3.2 Fundamental method ------------------ It is not possible, in almost any compiled language, to make a direct translation from source to object code "a line at a time": the first time a line is reached, it often cannot be finally translated until information is known which cannot yet be known. (For example, Inform translates an object's name into a number indicating its position in the final game's tree of objects. If the name comes up before the definition of the object has been seen, Inform cannot know what number to translate the name into.) Inform makes only one pass through the entire source code. The translation it makes is (roughly speaking) left blank in places, and these blanks are filled in at the very end, once the necessary information is available. This process is called "backpatching": Inform is patching things up behind itself. 4 Lexical analysis ---------------- 4.1 Aim and structure of the lexer ------------------------------ The task of the lexical analyser, or "lexer" for short, is to translate the text it reads into a sequence of "tokens". The aim is that the rest of the compiler should work with the tokens and need never look at the original text again; Inform mainly achieves this aim. Each token represents a sequence of characters, called a "lexeme", which is indivisible from a syntactic point of view. For instance, the text $12+duck.&feathers contains five atomic pieces of syntax: Token Lexeme $12 + duck .& feathers The Inform lexer has three, or perhaps four levels: Level 1: reading in text, filtering out control characters, etc. Level 2: a context-free tokeniser Level 3: a context-dependent process of deciding whether any identifier names found are keywords or names made up by the programmer To make a fairly revolting analogy: when parsing dinner, lexical analysis is done with the teeth, and syntax analysis with the stomach. But with programming languages it is difficult to give a very precise point at which one ends and the other begins: and in Inform, the "perhaps level 4" of the lexer occurs when tokens are sorted out more finely in the expression parser (is this an operator? is it a constant value? is it a variable name?), which will be documented as part of the syntax analyser in this manual. At any rate, "lexer.c" and "symbols.c" contain levels 1 to 3 only. For details of some standard algorithms in compiler theory, the account below and in section 5 gives page references to "ASU": this is Aho, Sethi and Ullman (1986), "Compilers: Principles, Techniques and Tools", the so-called "Dragon book". It's called this because the front cover features a knight armoured in various algorithms confronting a dragon labelled "Complexity of compiler design": in the case of Inform, though, the dragon might reasonably be twice as big and labelled "Ignorance of designer who originally chose the syntax for Inform". Compiler-making tools like "lex", "yacc" and "bison" are difficult to apply to this language, since Inform doesn't have a convenient LALR(1) grammar (more like LALR(4), in fact). In any case the lexical and syntax analysers are hand-coded for speed, which is generally considered to produce code twice as fast as that generated by mechanical tools like "yacc". 4.2 Level 1: lexical blocks and buffered input ------------------------------------------ The lexer can take input from two sources: from the source code files which are being compiled, or from a null-terminated string somewhere in the compiler. (The latter is used when compiling the "veneer" of run-time routines.) A "lexical block" is a piece of text which is individually named and line-counted (so that error messages can indicate where an error has occurred with reference to the original source). Each single file of source code (the original file and each Include file) is a lexical block in its own right. Likewise, if the lexer is reading from an internal string, then that string is a lexical block. The LexicalBlock structure holds data about the position inside such a block. Note that several may be needed at once: if one source file includes another, which includes another, then three different LexicalBlocks will contain useful information. However, information about the files currently being worked from is stored in a different structure way, and not in LexicalBlock structures. For efficiency reasons, we can't read characters out of the files one at a time: instead we read them in 4096-byte chunks into a buffer. Note that this means that the read-position of a file will be well ahead of the read-position of the lexer within it. It may have been closed before the lexer is halfway through its buffer. A further complication is that it's natural to use a stack structure to hold the files being read from: include file 2 <-- reading from this include file 1 <-- still open but not being read from main source code file <-- still open but not being read from and it is also natural to use a stack structure to hold the LexicalBlocks being worked through: LB for include 2 LB for include 1 LB for main source code file Despite the apparent similarity here, we can't combine this into one stack, since "include file 2" will have been pushed off the files stack before its LB is pushed off the LB stack. Since further Include directives may occur at any point in the current LB, the stacks can be very different. Otherwise level 1 is simple and fast. Note that characters are fed up into level 2 through a "pipeline" (see the next section for what this means and why), and are also filtered. Inform can read source files in any plain ASCII or any of the ISO 8859 standard ASCII extensions (five variants with accented Latin characters, plus Cyrillic, Arabic, Hebrew, Greek). The filtering process removes any character codes undefined in the current ISO setting, and normalises new-line and spacing characters: 00 remains 0 (meaning "end of file") TAB becomes SPACE 0d becomes 0a, i.e., '\n' other control characters become '?' 7f becomes '?' 80 to 9f become '?' a0 (ISO "non-breaking space") becomes SPACE ad (ISO "soft hyphen") becomes '-' any character undefined in ISO between a0 and ff is mapped to '?' (Here "ISO" means "the current ISO setting", which can be 0, meaning plain ASCII only: in this mode any character value of 80 to ff will be filtered to '?'.) The filtering ensures that (i) no error message can contain an unprintable character and (ii) the user cannot type a character outside a standard ISO set and which might work on one platform but not on another. Note that the 0 character is only fed upwards into level 2 when the entire lexical source has run out: that is, all the source files are exhausted, or else the string being analysed has run out. 4.3 Level 2: breaking text into lexemes ----------------------------------- The input to level 2, then, is a stream of characters: and its output is a stream of tokens, each of which represents a sequence of characters called a "lexeme". Some definitions of terms used in the Inform source code. There is a fixed set of "separators", all of them sequences of 1 to 3 mainly non-alphabetic characters: -> --> -- - ++ + * / % || | && & ~~ ~= ~ == = >= > <= < ( ) , .& .# ..& ..# .. . :: : @ ; [ ] { } $ ?~ ? #a$ #n$ #r$ #w$ ## # An "identifier" is a sequence of one or more characters from the set: A B C D ... Z a b c ... z 0 1 2 3 4 5 6 7 8 9 _ which does not begin with 0 to 9. The lexer makes the longest lexeme it possibly can for any particular token, so the next character after any identifier will not be one of the set above. The lexemes representing numbers are: (a) one or more chars from the set 0 1 2 3 4 5 6 7 8 9 (b) $ followed by one or more chars from 0 1 2 3 4 5 6 7 8 9 A B C D E F (hexadecimal numbers) (b) $$ followed by one or more chars from 0 1 (binary numbers) Note that "-67" is not a lexeme for a single token: it is broken into two lexemes, "-" and "67". A ", followed by any number of characters and then another ", is a single lexeme. Similarly for '. Finally, as a special case, the six separators #a$ #n$ #r$ #w$ ## # are expected to be followed by an identifier. For example, "##Take" is a single lexeme, and is not divided into "##" and "Take". (Except that #n$ can be followed by any character, so that e.g. #n$1 is a single token, representing "the new dictionary word '1'" in the language.) To sum up, the lexical analyser expects legal source code to contain: "comments": sequences beginning with a ! character and ending with a new-line or the end of the file or string containing it any of the lexemes above "white space", that is, characters in the set space tab new-line When breaking down text into lexemes (and throwing away the white space and comments), the lexer needs 3 characters of look-ahead. That is, it can decide definitely what lexeme character N belongs to provided that it knows what characters N+1, N+2 and N+3 are. (The figure 3 occurs because that's the worst case arising: deciding whether character N is the start of a separator in the form "#a$" followed by an identifier character.) Because of this, level 1 maintains a "pipeline" of four variables to hold the current character and the next three on the way. By looking ahead at the pipeline as needed, level 2 decides what to do with the current character and then advances one character: the three waiting in the pipeline shuffle up and a new character is read into the last place in the pipeline. The advantage of maintaining a pipeline is that the lexer never needs to undo any decisions and go backward through the text. One token is output for each lexeme found, and when the lexer runs out of input altogether (when all the source files are finished, or when the string being analysed has run out) a special token, called "end of file", is output. Thus the lexer's output is a sequence of tokens from the list: numbers strings in " quotes strings in ' quotes separators identifiers end-of-file As will be seen in the next section, identifiers are analysed further before being passed out of the lexer. Except for the problem of deciding what an identifier means, the lexer is "context-free": it needs to know nothing about the syntax of the Inform language, what kind of code it is currently analysing, etc. There is a standard state-machine algorithm for writing such a lexer: see ASU, p. 98. The tricky part is to efficiently store a transition table for the states involved, which will be neither so small that slow code is required to read it, nor so large that it takes up an unacceptable amount of memory. Here the "tokeniser_grid" stores most of a transition table: this is an algorithm originally suggested to me by Dilip Sequeira, similar to that used by S. C. Johnson's "yacc" lexical analyser. 4.4 Representation of tokens ------------------------ Tokens are internally stored as quadruples: typedef struct token_data_s { char *text; int value; int type; int marker; } token_data; though the lexer is not responsible for writing "marker" values, which are the concern of the syntax analyser. The "type" identifies which kind of token it is (this value must be one of the *_TT set #define'd in "header.h"): for example, DQ_TT means "a double-quoted string". The "value" is a number whose meaning depends on the type. For example, in the case of type DQ_TT, the number has no meaning and is not used; in the case of type SEP_TT (for "separator"), the number gives the index in the above table of possible separators, thus identifying which separator it is. "text" is a pointer to a null-terminated string containing the lexeme. There are two reasons this is needed: firstly, so that comprehensible error messages can be printed out from higher up in the compiler, and secondly because in the case of DQ_TT and SQ_TT the text may need to be compiled into Z-text format and added to the static strings area, to the Z-code area or to the dictionary. The decision on whether the string should be compiled and if so, what to, cannot be taken until code generation time. Unfortunately code generation time may not be for a while, and this raises a problem: we clearly need to keep lexeme text in memory for a while, but it would be very costly on memory to keep the text of every token in memory permanently. When is it safe for the lexer to throw a lexeme away? One solution would be for a formal system by which the code generator explicitly deallocated the space, but this is too bureaucratic and too risky (suppose we miss 0.1% of these? Memory will slowly clog up and begin to cause odd errors on large compilation runs). Instead, the lexer simply writes its text cyclically into a large buffer. Code generation almost always takes place before the lexer has got any more than 100 or so bytes of text ahead; since the buffer is a good 10K long, there is a generous safety margin. 4.5 Level 3: identifying identifiers -------------------------------- A "keyword" is an identifier which has a meaning understood by the Inform compiler: for example the lexeme while is understood (in some contexts) to refer to the "while" statement, rather than being a name coined by the programmer for some variable or object. In most small languages such a lexeme would always be a keyword and never a name. (This is the concept of "reserved word": a word reserved for the use of the compiler, forbidden as a name.) For example, the fairly-large language C++ has 48 such keywords. Inform is unable to reserve keywords (that is, forbid their use as names) for two reasons: firstly, because there are 281 of the damn things (and 116 of them are only significant in lines of assembly language: why forbid the use of these to non-assembly-language programmers?), and secondly because it is important to the language that some keywords definitely should be the same as some object names. "Class" is both a keyword meaning "the class-making directive" and a name meaning "the object representing the class of all classes". So the decision "is this identifier intended to be a keyword or intended to be the name of something?" relies on the context. What the lexer does is to divide the 281 keywords into twelve groups. The syntax analyser switches these groups on or off as it finds its way through the program. For example, the group "opcode_names" is only enabled (i.e., switched on) after an @ character has been read at the beginning of a statement. A complication is that case is sensitive for some of these groups and not others. For example, "IFDEF and "Ifdef" both match the directive name "ifdef", but "WHILE" and "While" do not match the statement name "while". The rules in fact are: When matching... Example Sensitive? Language keywords: directive names Class No keywords used within directives with Yes statement names while Yes keywords used within statements else Yes condition names hasnt Yes built-in function names random Yes built-in constant names #code_offset Yes Assembly language keywords: opcode names put_prop Yes customised opcode names @"1OP:4S" Yes the stack pointer name sp Yes Names for variables, objects, etc. lantern No Names for actions Take No Moreover, in contexts where local variables are allowed, they take precedence over the rest. (For example, you could define a local variable called "random", and for the duration of that routine the word "random" would never be recognised as the keyword for the system function random().) Each keyword group has its own token type. The token value for a keyword token is the index within the group: see "header.h" for #define's for all of these indices. Finally, there are times when the syntax analyser just wants the raw text of an identifier, and doesn't want it identified (for efficiency reasons: this prevents it from being entered into the symbols table). If the dont_enter_into_symbols_table flag is set, the lexer simply returns a token of type DQ_TT as though the name had been found in double-quotes. 4.6 Hash-coding and comparing names ------------------------------- It would be computationally very expensive indeed to decide whether string X is a keyword or not by carrying out direct string comparisons with all 281 keywords. Instead, hash-coding is used. The idea is to perform a fast operation on X whose result is some number from, say, 1 to 100: this is the hash code of X. If X is the same string as K, then K must have the same hash code. So we organise the keywords into 100 different groups, the group with hash code 1, with hash code 2, etc.: then if h is the hash code of X, we only need to compare X with the keywords in group h. For this to be any use, the operation performed has to be chosen so that the keywords are sorted out into groups with about equal numbers in each. Many sensible "hash functions" are known (see ASU, p. 437 for experimental performance data). Inform's choice is simple but effective: it uses multiplication rather than bit-shifting, but then on modern CPUs multiplication is not very expensive. The algorithm is what ASU would call "X 30011": extern int hash_code_from_string(char *p) { unsigned int32 hashcode=0; for (; *p; p++) hashcode=hashcode*30011 + case_conversion_grid[*p]; return (int) (hashcode % HASH_TAB_SIZE); } where "hashcode" is meant to overflow repeatedly. Note that 30011 is a prime number. (case_conversion_grid[*p] is just a fast way to convert all lower case letters to upper case ones, since we want case to be insignificant when calculating hash codes.) It doesn't matter if this algorithm behaves differently on different machines (depending on how the CPU copes with overflows): all that matters is it delivers a reasonable spread of hash values. The hash code range is 0 to HASH_TAB_SIZE-1: this is a memory setting which by default is 512. Reducing this by any substantial amount causes a significant slowing down of Inform. 4.7 The symbols table ----------------- If an identifier, in context, isn't a keyword then it is called a "symbol": this basically means "a name for something created by the programmer" (except that Inform creates a few symbols automatically as well). The section "symbols.c" keeps a table of all the symbols which have turned up so far. (It often picks up a few extraneous names of keywords, too, when the lexer has been looking ahead in the wrong context: but this doesn't matter and it's faster to tolerate the slight waste of memory.) The symbols table is organised into HASH_TAB_SIZE-1 linked lists, each of which is kept in alphabetical order. A single operation is provided for searching it: symbol_index(), which works out the hash code of the string S supplied (unless it's already known, in which case it need not work it out again) and looks through the linked list for that hash code until the first letter matches. (Thus, the only full string comparisons made are between S and those strings in the table which have the same hash code and the same initial letter.) If S is not present, it is inserted into the table. In addition to its text, each symbol in the table has four pieces of data attached: its type, its value, the line on which it was assigned and its flags. The symbol with index i has: symbs[i] text stypes[i] type char svals[i] value int32 sflags[i] flags word int slines[i] line number int32 (It would be more elegant to store this as an array of structures, but the symbols table is a major consumer of memory, so we can't afford to let lazy compilers waste it by aligning the fields of such a structure at 4 byte intervals. Hence, five separate arrays.) Types and values are assigned by the syntax analyser: the type indicates what kind of thing the symbol is a name of (a routine, a label, an object, etc.), and must always be one of the #define names *_T (see "header.h"). The value holds some number indicating which particular thing the symbol is a name of (a routine address, an object number, etc.). The line number is set when the symbol is first used, and then reset when it is first assigned a value (if ever). The information is stored as file-number*0x10000 + line-number where file-number is an index into the InputFiles array maintained by "files.c". The flags word holds (presently) 15 different flags, for different properties that the symbol name can have. These flags have #define names in the form *_SFLAG. The token for a symbol has type SYMBOL_TT and value equal to the index i of the symbol within the symbols table. 4.8 Token lookahead: the circle --------------------------- The output of the lexer is sent, one token at a time, when the syntax analyser calls get_next_token(). The syntax analyser is a predictive parser which makes mistakes and has to backtrack (that is, it reads in token X, guesses that it might be looking at XY and reads the next token to see... but reads a Z instead, and has to go back to X and try a different theory). So the syntax analyser needs to be able to go backwards, as well as forwards, in the stream of tokens. This means that the lexer needs to remember its last N tokens, where N is the maximum number of backward steps the syntax analyser ever takes in one go. The number N is the amount of "lookahead", since one could also think of it as the maximum tokens one needs to look ahead of the current position to be sure of what is happening. These tokens are therefore stored in a "circle" of N+1 tokens: when get_next_token() is called, the lexer works out the next token, writes it into the circle and moves on one place clockwise; when put_token_back() is called, the lexer moves back one place anticlockwise, so that the next get_next_token() will send back the token in that position rather than work out a new one. However, the interpretation of identifiers as keyword or symbol depends on the current context, and this may have changed since the token was first calculated. Therefore a numerical representation of the context is stored in the circle too, so that if this has changed and if the token is an identifier then it is re-identified under the context which now prevails. 4.9 Summary of the lexer's output ----------------------------- To summarise, the lexer converts source text to a stream of tokens of types as follows: -------------------------------------------------------------------------- Type Lexeme Value -------------------------------------------------------------------------- NUMBER_TT literal number in the number decimal, hex or binary DQ_TT string extracted from (none) double-quotes (or identifier name: see above for when) SQ_TT string extracted from (none) single-quotes SEP_TT separator a #define'd *_SEP value EOF_TT (end of source) (none) SYMBOL_TT identifier assumed index in the symbols to be a name table LOCAL_VARIABLE_TT local variable name 1 to 15, its Z-machine variable number STATEMENT_TT statement name a #defined *_CODE value SEGMENT_MARKER_TT with/has/class etc. a #defined *_SEGMENT value DIRECTIVE_TT directive name a #defined *_CODE value CND_TT named conditional a #defined *_COND value SYSFUN_TT built-in function a #defined *_SYSFUN value OPCODE_NAME_TT opcode name a #defined *_zc value (i.e., an internal opcode number) MISC_KEYWORD_TT keyword like "char" used a #defined *_MK value in statement syntax DIR_KEYWORD_TT keyword like "meta" used a #defined *_DK value in directive syntax TRACE_KEYWORD_TT keyword used in syntax a #defined *_TK value of "trace" directive SYSTEM_CONSTANT_TT system #constant name a #defined *_SC value -------------------------------------------------------------------------- Note that other token types do exist, but are set in "level 4": i.e., within the expression parser. 5 Syntax analysis --------------- 5.1 Aim and structure of the syntax analyser ---------------------------------------- The syntax analyser is the heart of Inform, and contains within it the specification of the language. Inform code fundamentally consists of directives, which are instructions to the compiler to do something, and routines containing statements, which are lines of code to be executed at run-time. The syntax analyser takes as input the stream of tokens from the lexer. Some of these are interpreted as directives and acted on immediately. Others are realised to be statements and compiled; we can regard the output of the compiling part of the analyser as being a stream of Z-machine assembly language, passed down into "asm.c". In most modern compilers, the syntax analyser would convert the token stream into a "parse tree" expressing the structure of the input: something along the lines of routine / \ statement statement | | "if" ...etc., etc. / \ condition statement | | "==" "print" / \ | "x" "0" "Good heavens!" This is aesthetic, and makes possible many optimisations (such as elimination of common sub-expressions), since the compiler is able to see the whole structure before beginning to compile it. But it uses up a fair amount of memory (admittedly, most compilers only keep the tree for the current routine, not the tree for the entire program); and a lot of speed, as the tree is trawled through again and again. Characteristically, Inform instead uses an algorithm which is about 40 years old (see ASU, p.181 et seq). It doesn't generate a parse tree for whole routines, though it does do so for expressions, assignments and conditions: for example, it would have built the little tree: "==" / \ "x" "0" For higher level-parsing, it works "top-down", making recursive function calls which mimic a depth-first traversal of the tree drawn above. That is, routine() calls statement() which reads the token "if" and calls condition() which works out the expression "x == 0" and compiles some code to perform this test and then to skip the next instruction if it's false and calls statement() which reads the token "print" and then the token "Good heavens!" and compiles some code to print this and routine() next calls statement() which reads... etc., etc. Although we are only able to go through the tree once, it's a quick and efficient trip, effectively using the C run-time stack to hold the parse tree structure. The result is also a rather legible piece of source code (as compared, for example, with the output from "yacc"). (The main reason we can escape from having to make parse trees is that our target machine, the Z-machine, has an architecture making register allocation unnecessary: the variables are the registers, and there's no space or time penalty for use of the stack.) The syntax analyser begins with parse_program() in the section "syntax.c". Because its structure is embodied in the source code, there is relatively little to say about it except to refer the reader to that: and to the the table in section 1.1 above. 5.2 Predictive parsing ------------------ Note that tokens can be read from many different places, and code can be passed out from many different places: whereas the lexer and the assembler each have a single channel of input and output. As a typical example of the syntax analyser working as a "predictive parser" and having to "backtrack", consider how parse_action() begins to parse through action statements like ; tokenised as < symbol "Take" symbol "fishknife" > When it's called, the first token, "<" has already been read (this is how parse_statement() knew to call parse_action() in the first place). What it does is: "Predict" that it's going to be a << ... >> statement Ask the lexer for a token, expecting it to be another "<" Discover that, unfortunately, it's a symbol, so the prediction was wrong Backtrack to the point where it made the wrong prediction (this actually only means giving the token "Take" back into the lexer again) Now predict the next possibility, that it's a < ... > statement Ask the lexer for a token, expecting it to be a symbol name Discover that this time it is ("Take"), so the prediction was correct ... and so on. The code to do this is more like: get_next_token(); if (token == "<") return parse_double_angle_action(); put_token_back(); return parse_single_angle_action(); (The actual code doesn't make these inefficient function calls, and token comparison is a bit fiddlier, but this is the idea.) Clearly, the longer such a prediction goes on before it is found to be wrong, the more tokens which the lexer has to take back again. The syntax of the language determines this number, N: for many languages N = 1 (because they've been designed that way) but for Inform N = 5 (because it was defined by an ignorant oaf). (Though, as will appear in the next section, it would be possible to reduce this by implementing the grammar differently: and there are compensations, such as the relative conciseness of Inform code.) 5.3 The context-free grammar ------------------------ Here are the productions which the top-down syntax analyser implements. (Most of the major ones are handled by routines.) "void_expression", "constant", "condition" and "quantity" are left undefined: these are handled by the expression parser according to an operator-precedence grammar. "vcondition" is a condition whose first token is a VARIABLENAME. Symbols are represented in capitals; NEWNAME means one not so far assigned a value, while CLASSNAME, etc., mean an existing symbol of the given type. Obsolete features which are still supported are omitted. Details of conditional compilation are abbreviated heavily: the notation PROGRAM means "anything at all until the right directive keyword comes up at the same #if... level". ========================================================================= program -> directive ; program directive -> [ NEWNAME routine "Abbreviate" strings "Array" NEWNAME arraytype array "Attribute" NEWNAME "Attribute" NEWNAME "alias" ATTRIBUTENAME "Class" NEWNAME objbody "Class" NEWNAME ( constant ) objbody "Constant" NEWNAME "Constant" NEWNAME constant "Constant" NEWNAME = constant "Default" NEWNAME constant "End" "Extend" extension "Fake_action" NEWNAME "Global" NEWNAME "Global" NEWNAME = constant "Ifdef" NAME condcomp "Ifndef" NAME condcomp "Iftrue" constant condcomp "Iffalse" constant condcomp "Ifv3" constant condcomp "Ifv5" constant condcomp "Import" manifest "Include" STRING "Link" STRING "Lowstring" NEWNAME STRING "Message" diagnostic "Nearby" objhead objbody "Object" arrows objhead objbody "Property" NEWNAME propdef "Release" quantity "Replace" ROUTINENAME "Serial" STRING "Statusline" "score" "Statusline" "time" "Stub" NAME quantity "Switches" TEXT <---- An unquoted string "System_file" "Trace" trace "Verb" verb "Zcharacter" zchar CLASSNAME arrows objhead objbody condcomp -> ; PROGRAM ; "Endif" ; PROGRAM ; "Ifnot" ; PROGRAM ; "Endif" manifest -> import "," manifest import import -> "global" NEWNAME diagnostic -> STRING "error" STRING "fatalerror" STRING "warning" STRING propdef -> propdefault "additive" propdefault propdefault -> quantity ========================================================================= trace -> t_section t_level Tracing t_section -> "objects" "symbols" "verbs" "assembly" "expressions" "lines" t_level -> "on" "off" quantity ========================================================================= zchar -> char_spec Character set STRING STRING STRING "table" char_specs "table" "+" char_specs char_specs -> char_spec char_specs char_spec char_spec -> LITERAL_CHARACTER ========================================================================= arrows -> "->" arrows Object definition objhead -> NEWNAME STRING NEWNAME STRING NEWNAME OBJECTNAME STRING OBJECTNAME NEWNAME STRING OBJECTNAME objbody -> segment objbody segment "," objbody segment -> "class" class_s "with" with_s "private" with_s "has" has_s class_s -> CLASSNAME class_s with_s -> PROPERTYNAME property PROPERTYNAME property "," with_s has_s -> attributes property -> [ routine arrayvals ========================================================================= arraytype -> -> Arrays --> "string" "table" array -> constant STRING arrayvals [ manyvals ] manyvals -> constant manyvals constant ; manyvals arrayvals -> constant arrayvals ========================================================================= extension -> "only" strings priority grammar Grammar STRING priority grammar priority -> "replace" "first" "last" verb -> strings v_setting "meta" strings v_setting v_setting -> = STRING grammar grammar -> g_line grammar g_line -> * tokens -> g_action g_action -> ACTIONNAME ACTIONNAME "reverse" tokens -> g_token tokens g_token -> preposition "noun" "held" "multi" "multiheld" "multiexcept" "multiinside" "creature" "special" "number" "topic" "noun" = ROUTINENAME "scope" = ROUTINENAME ATTRIBUTENAME ROUTINENAME preposition -> DICTIONARYWORD DICTIONARYWORD / preposition strings -> STRING strings attributes -> attribute attributes attribute -> ATTRIBUTENAME ~ ATTRIBUTENAME ========================================================================= routine -> * locals ; body ] Everything below locals ; body ] here is code to locals -> NAME locals compile body -> action_case : body "default" : body statement body action_case -> ACTIONNAME ACTIONNAME , action_case block -> ; statement ; { states } { states . NEWNAME ; } states -> statement states; s_block -> { s_states } { s_states . NEWNAME ; } s_states -> case : s_states "default" : s_states statement s_states case -> range : range , case range -> constant constant "to" constant ========================================================================= statement -> . NEWNAME ; statement "@" assembly ; "<" action ">" ; "<<" action ">>" ; indiv_state STRING void_expression action -> ACTIONNAME ACTIONNAME quantity ACTIONNAME quantity quantity assembly -> opcode operands options opcode -> OPCODENAME STRING <--- customised opcode operands -> operand operands descriptions are operand -> constant not parsed by the VARIABLENAME syntax analyser sp options -> ? branch -> store -> store ? branch store -> VARIABLENAME sp branch -> LABELNAME ~ LABELNAME ========================================================================= indiv_state -> "box" strings ; "break" ; "continue" ; "do" eblock "until" ( condition ) ; "font" "on" ; "font" "off" ; "for" ( for1 : for2 : for3 ) block ; "give" quantity attributes ; "if" ( condition ) block "if" ( condition ) block "else" block "inversion" ; "jump" LABELNAME ; "move" quantity "to" quantity ; "new_line" ; "objectloop" ( ospec ) block ; "print" printlist ; "print_ret" printlist ; "quit" ; "read" quantity quantity ; "read" quantity quantity ROUTINENAME ; "remove" quantity ; "restore" LABELNAME ; "return" ; "return" quantity ; "rtrue" ; "rfalse" ; "save" LABELNAME ; "spaces" quantity ; "string" quantity STRING ; "style" textstyle ; "switch" ( quantity ) s_block "while" ( condition ) block for1 -> void_expression for2 -> condition for3 -> void_expression ospec -> VARIABLENAME VARIABLENAME "in" quantity VARIABLENAME "near" quantity VARIABLENAME "from" quantity vcondition printlist -> printitem , printlist printitem printitem -> STRING quantity ( form ) quantity form -> ROUTINENAME "number" "char" "address" "string" "The" "the" "a" "an" "name" "object" "identifier" textstyle -> "roman" "reverse" "bold" "underline" "fixed" ========================================================================= A few critical comments on the above. From this grammar it's possible to work out N, the maximum look-ahead needed to distinguish which production is being used in the source code. The worst cases are: (a) distinguishing productions (1) and (2) in ospec -> VARIABLENAME VARIABLENAME "in" quantity (1) VARIABLENAME "near" quantity VARIABLENAME "from" quantity vcondition (2) which requires N = 5; these two productions need to be distinguished between because objectloop ( a in b ) objectloop ( a in b ... ) (the second containing a compound condition) compile quite different code: one loops through the siblings in the object tree, the other through the tree in numerical order. (b) distinguishing productions (1) and (2) in printitem -> STRING quantity (1) ( form ) quantity (2) i.e., between print (routine-name) expression print (expression which happens to begin with a bracket) which requires N = 3. The grammar contains one serious ambiguity: the innocent-looking production arrayvals -> constant arrayvals means that array initialisations list constants in sequence without commas or other separating characters. This makes it impossible to distinguish between unary and binary minus in a line like: Array X 2 - 1 ; The answer is "binary", since the grammar makes the longest match possible; but a "this is ambiguous" warning is issued. A further inconvenience in the grammar, though not much of an ambiguity, occurs in the initialisation part of "for" statements: there is a danger of for (p=Class::) being mis-parsed due to "::" being recognised as a binary operator (without a right operand, which would cause an error) and not as two consecutive ":" delimiters. If I were free to redesign the Inform grammar in the light of the last three years' experience (which I am loath to do, since so much Inform source code now exists), here are the changes I think I would make: introduce commas as separators between array values and <> parameters; remove the statements quit, restore, save, style, font, spaces, box, inversion and read: the first three ought to be used via the assembler anyway, and the last six ought to be system-provided functions; use single quotes to refer to dictionary words used as values of "name", thus removing an anomalous rule going back to Inform 1, and to refer to dictionary words in grammar; find some notation for literal characters which does not look like the notation for dictionary words: e.g., ''z'' rather than 'z'; abolish the distinction between actions and fake actions; rename the Global directive "Variable"; require an = sign to be used in "Constant X 24" directives. 5.4 Assigning values to symbols --------------------------- Assigning values to symbols is the main way by which the syntax analyser changes the way it will behave further on in the program. From a strict theoretical point of view one could insist the symbols table contains the only information remembered by the syntax analyser about the program so far, which otherwise churns out code which it forgets as soon as it has written: however, Inform's analyser also keeps a number of other data structures, such as the game dictionary and the object tree. When the lexer creates a new symbol (that is, when the table is searched for a string which isn't there, so that it is added to the table as a new entry), it has: value 256 type CONSTANT_T flags only UNKNOWN_SFLAG line the current source line "Unknown" means that the syntax analyser doesn't recognise it as meaning anything yet. The flag will only be cleared when the syntax analyser assigns some value to the symbol (using the assign_symbol() routine), if ever. The line is reset to the source line then applying if the symbol is assigned a value, but is otherwise never changed. The type and value of a symbol are only altered via assign_symbol(). With one exception (see CHANGE_SFLAG below), the value once assigned is never subsequently altered by reassignment. Each symbol always has one of the following 11 types: Type Meaning Value ------------------------------------------------------------------------ CONSTANT_T Defined constant or Value of constant value not known yet LABEL_T Name of label in source Label number GLOBAL_VARIABLE_T Name of global variable Z-machine variable number ARRAY_T Name of array Z-machine byte offset in dynamic variables area ROUTINE_T Name of routine Scaled offset in Z-code ATTRIBUTE_T Name of attribute Z-machine attribute number PROPERTY_T Name of common property Z-machine property number between 1 and 63 INDIVIDUAL_PROPERTY_T Name of indiv property Property identifier >= 64 OBJECT_T Name of object Z-machine object number - 1 CLASS_T Name of class Z-machine object number - 1 of its class-object FAKE_ACTION_T Name of fake action Action number >= 256 ------------------------------------------------------------------------ The full list of symbol flags, and their meanings, is as follows: ------------------------------------------------------------------------ UNKNOWN_SFLAG no value has been assigned to this symbol USED_SFLAG the value of this has been used in an expression REPLACE_SFLAG the programmer has asked to Replace a routine with this name DEFCON_SFLAG this constant was defined by the "Default" directive STUB_SFLAG this routine was defined by the "Stub" directive (similarly) INSF_SFLAG this symbol was originally assigned a value in a system_file SYSTEM_SFLAG this symbol was assigned a value by Inform itself, as one of the standard stock (such as "true" or "recreate") provided to all programs UERROR_SFLAG a "No such constant as this" error has already been issued for this name ALIASED_SFLAG this is an attribute or property name whose value is shared with another attribute or property name CHANGE_SFLAG this is a defined Constant whose value needs later backpatching, or is a Label whose label number has been allocated before its declaration in the code ACTION_SFLAG this is a ## name (of an action or a fake action) REDEFINABLE_SFLAG the symbol can be defined more than once using the "Constant" directive, without errors being produced. (Used for the special constants DEBUG, USE_MODULES, MODULE_MODE and Grammar__Version.) ------------------------------------------------------------------------ IMPORT_SFLAG this name is "imported" (used in module compilation) EXPORT_SFLAG this name was "exported" from some module (used in story file compilation) ------------------------------------------------------------------------ USED_SFLAG is used only to decide whether or not to issue "declared but not used" warnings. A symbol has been "used" if one of the following is true: (i) its value occurred in an expression; (ii) it has been assigned, and tested with IFDEF or IFNDEF; (iii) it's an action routine name used in grammar or via ## (e.g. use of ##Take causes TakeSub to be "used"); (iv) it's a fake action name used via ##; (v) it's a property, attribute or class name used in an object or class definition; (vi) it's a label name branched to in assembly language or the destination of a "jump" statement; (vii) it's the routine name "Main"; (viii) it's a routine name referred to by the obsolete #r$ construct; (ix) it's a routine name of a veneer routine whose definition is being over-ridden in the source code, and to which a call has been compiled; (x) it's a routine name used in a grammar token; (xi) it's referred to in a module which has been linked into the story file being compiled. Note that such warnings are not issued for object names, since in much Inform 5 code objects are given irrelevant object symbol-names (the syntax required it); nor for symbols defined by Inform, or in the veneer, or in a system file. Warnings are never issued (except for labels and local variables, which have local scope) when compiling modules, since there is no way of knowing at compile time which exported symbols will be used and which will not. The warnings are issued at the end of the source code pass, but before the veneer is compiled. The slines[] array is used to work out which source lines to report from. CHANGE_SFLAG is used when definitions such as: Constant frog_word = 'frog'; are reached, where the value (the dictionary address for 'frog') cannot immediately be known. Such symbol values are backpatched later as needed. All symbols in the table have "global scope" (that is, their definitions are valid everywhere in the source program) except for label names, whose scope is restricted to the current routine. Thus, the same label name can be used in many different routines, referring to a different label in each. To achieve this, the routine-end routine in "asm.c" cancels the assignment of label names, returning them to type CONSTANT_T and flag UNKNOWN_SFLAG. Local variables also have local scope, but for efficiency reasons these are not stored in the symbols table but in a special hash table in level 3 of the lexer. Note that action names have a different "name-space" from other symbols: this is why the library's action name "Score" is never confused with its variable "score". Rather than use a different symbols table altogether, actions are stored as integer constants in the form Score__A (thus the value of this symbol is the value ##Score). Similarly, fake actions are stored this way (but with type FAKE_ACTION_T rather than INTEGER_CONSTANT_T); both forms of symbol are flagged ACTION_SFLAG. 5.5 Outputs other than assembly language ------------------------------------ Below the parse_routine() part of the syntax analyser, the only output is assembly language (together with dictionary words and encoded text as needed within it). However, parse_directive() has several other outputs, as shown on the source code map (section 1.2 above). Directives create objects to be remembered and written into the final program: arrays, verb definitions, actions, objects and so on. These will be called "program elements". Directives also "create" constants, attributes, properties and fake actions, but in these cases creation only consists of assigning a suitable value to a symbol. So these do not count as program elements. The data structures to hold "program elements" are all created and maintained within "directs.c" and its subsidiaries (such as "objects.c", the object-maker); they are then translated into a usable Z-machine form in "tables.c". (With only trivial exceptions, the data structures are not accessed anywhere else in Inform.) --------------------------------------------------------------- Program element Section Name of (main) data structure --------------------------------------------------------------- Global variable arrays.c global_initial_value[] Array arrays.c dynamic_array_area[] Release number directs.c release_number Serial code directs.c serial_code_buffer[] Statusline flag directs.c statusline_flag Object tree objects.c objects[] Common property objects.c prop_default_value[] default value Class-to-object- objects.c class_object_numbers[] numbers table Common property objects.c *properties_table values for objects Individual prop objects.c *individuals_table values for objects Table of static symbols.c individual_name_strings[] string values for property names And attribute names symbols.c attribute_name_strings[] And action names symbols.c action_name_strings[] Abbreviation text.c abbreviations_at[] table entry Grammar table verbs.c Inform_verbs[] Token-routine verbs.c grammar_token_routine[] addresses List of dict verbs.c adjectives[] addresses for "adjective" words Action routine verbs.c action_byte_offset[] addresses --------------------------------------------------------------- A rather unequal gathering: the storage required to hold the common property values table may be as much as 32K, whereas the statusline flag can be held in 1 bit. There are three other program elements: Z-code, kept by "asm.c"; the static strings of Z-encoded text, kept by "text.c"; and the dictionary, also kept by "text.c". 5.6 Assembly operands ----------------- The type "assembly_operand" is used for all numerical values (and local or global variable references) which are destined one day to be written into the Z-machine. Many of these will indeed be operands in Z-code instructions, hence the name. Others will be property values, array entries and so on. The definition is: { int type; int32 value; int marker; } which is a pretty generous memory allocation for holding a signed 16-bit number. (However, there are no large arrays of this type; type and marker could easily each have type char, but I suspect this would be slower because of alignment problems on some compilers; and speed does matter here.) The possible types are: SHORT_CONSTANT_OT number with value between 0 and 255 LONG_CONSTANT_OT number with any 16 bit value VARIABLE_OT reference to the stack pointer if value is 0 local variable if value 1 to 15 global variable if value 16 to 255 In addition, two special types are in use by the expression parser: OMITTED_OT (the operand holds no information) EXPRESSION_OT reference to a parse tree The "marker" value is used to record the origin of the data, which is essential to make backpatching work. For example, in the line v = "Jackdaws love my big sphinx of quartz"; the right operand is marked with STRING_MV, because the value which needs to be stored in the Z-machine is a packed string address and this cannot be known until after the compilation pass (when the size of the tables and the code area are known). Wherever the operand is written, in code or in a table, "bpatch.c" will later find and correct it. 5.7 Translation to assembly language -------------------------------- The Z-machine has a very easy architecture to generate code for. Most compilers' syntax analysers generate a "three-address code" as an intermediate stage towards final object code; this is a sequence of instructions in the form x = y z together with conditional branches and labelled points to branch to. (See ASU, p.466.) Translating this code into assembly language for some CPU is a further compilation phase: the tricky part is not translating the operators into instructions, but deciding where to locate the values x, y and z. On most CPUs, a limited number of registers can hold values, and arithmetic operations can only be performed on these registers; moreover, holding data in a register is much more efficient than holding it elsewhere. The problem is therefore to allocate registers to quantities, and the performance of the compiled code depends very heavily on how well this is done. (Register allocation is far from being a solved problem in the way that lexical analysis, say, is considered to be.) What makes the Z-machine particularly easy is that its instruction set is more or less three-address code already. Arithmetic can be performed with constants as operands as well as with "registers"; and not only is every local or global variable automatically allocated to a register of its own, but a stack is available to hold intermediate values (and with no performance penalty for its use, since it is accessible as though it were a register). The key point to remember when looking at Z-code is that writing a value to the "stack-pointer variable" pushes it onto the Z-machine stack; using the "stack-pointer variable" as the operand for an instruction pulls the value being read off the stack. Despite the availability of the stack, it's still convenient to have a few spare variables to use as "registers" holding temporary values. Inform reserves the variables 249, 250, 251, ..., 255 for its own use: though this slightly exaggerates the position, since two of these are used for the variables "self" and "sender". One more is used to hold the switch-value of the current switch statement (if there is one); the remaining four in order to implement various system functions (such as "children") with inline-code rather than routine calls to the veneer. (The one inconvenience with the Z-machine's instruction set is that there's no good way to read the top of the stack non-destructively, or to duplicate the top value, or to reorder the top few values.) The syntax analyser produces code by making function calls to the assembler. There are four types of function call: assemble_routine_header(...) assemble_routine_end(...) at the start and end of every routine; assemble_label_no(N) to indicate that the Nth label belongs here (i.e., at the point where the next instruction will be put); and then a fair variety of function calls to generate actual instructions. For example, assemble_jump(N) assembles an unconditional jump to label N. A typical "three-address code" is assembled by a routine like assemble_2_to(mul_zc, AO1, AO2, stack_pointer) meaning "the instruction mul_zc, which has 2 operands and has a result which it writes to the variable indicated by the third operand". AO1 and AO2 are assembly_operands (the abbreviation is often used in the source), and so is stack_pointer (it has type VARIABLE_OT and value 0). A typical example of how the top-down syntax analyser generates code is given by the code for the "while" statement: case WHILE_CODE: assemble_label_no(ln = next_label++); match_open_bracket(); code_generate(parse_expression(CONDITION_CONTEXT), CONDITION_CONTEXT, ln2 = next_label++); match_close_bracket(); parse_code_block(ln2, ln, 0); assemble_jump(ln); assemble_label_no(ln2); return; Note that this expects to match while ( condition ) statement or { statements } The expression parser is called to turn the condition into a parse tree, and its output is fed straight into the code generator for parse trees. The label numbers ln2 and ln are supplied to the routine for parsing code blocks because they indicate which labels the statements "break" and "continue" should generate jumps to. For example, while (i <= 10) print i++; generates the assembly language .L0; @jg i 10 ?L1; @inc i; @print_num i; @jump L0; .L1; In terms of function calls to "asm.c": assemble_label_no(0); assemble_2_branch(jg_zc, AO_for_i, AO_for_10, 1, TRUE); assemble_inc(AO_for_i); assemble_1(print_num_zc, AO_for_i); assemble_jump(0); assemble_label_no(1); (Note that incrementing is needed so often that assemble_inc() is provided as a shorthand: actually also because of a form of indirect addressing used by the Z-machine for store_zc and inc_zc, dec_zc. assemble_store() and assemble_dec() are also provided.) 5.8 Summary of assembly language instructions output ------------------------------------------------ The list of Z-machine instructions which Inform actually generates code for by itself is quite short (about half of the 115 possible instructions are ever used). The assembly-language "@ statement" parser can, of course, generate every Z-machine instruction. The instructions used for three-address-style code (that is, for program control, arithmetic, etc.) are as follows. For function calls and returns, unconditional jumps and for moving values between variables, memory locations and the stack: call_* rfalse rtrue ret_popped ret inc dec store push pull storeb storew loadb loadw jump set_attr (There are various forms of call instruction to do with the number of operands, whether a return value is wanted or not, etc.). Six conditional branch instructions are used: je jz jl jg jin test_attr check_no_args (the first four are numerical branches, the next two related to the object tree, and the last one is used only in V5 or later games, and then only for printing out tracing code): note that each can also be used in a "negate this condition" form. Finally, signed 16-bit arithmetic and unsigned 16-bit bitwise operations: add sub mul div mod and or not A further batch of instructions is used, so to speak, in lieu of calls to a standard library of input/output and utility routines (like C's "stdio"): print print_ret print_char print_num print_addr print_paddr print_obj new_line insert_obj remove_obj get_parent get_child get_sibling get_prop get_prop_addr get_prop_len put_prop random aread/sread quit save restore output_stream with five more exotic screen control instructions used only if a "box" statement is ever compiled: split_window set_window style set_cursor buffer_mode 6 Syntax analysis 2: the bottom-up expression parser -------------------------------------------------- 6.1 Map and structure ----------------- It would be possible to continue the top-down parser down into the level of expressions: for instance, to have a routine parse_plus() which would parse_expression(), then match a plus sign, then parse_expression() again, and so on. This would however be highly inefficient in terms of function calls, difficult to recover from when an error occurs, and require either much greater lookahead or rather complicated mechanisms for storing parse trees. Expressions (including assignments, conditions and constant values) are therefore parsed by a different syntax analyser, which is "bottom up" (it begins by reading in the tokens, and only works out what they mean afterward) rather than "top down" (starting from an assumption about the meaning and then trying to find the evidence to justify it). The algorithm used is an operator precedence parser, a simple form of shift-reduce parser. There is alas no space to describe this beautiful algorithm here: see ASU, pp.195-215. In this account we treat it as a "black box". The expression parser "expressp.c" has the structure: --------> "Level 4" of --------> Operator tokens lexical analysis etokens precedence in parser | | etokens re-ordered into RPN \|/ Emitter | | plain parse tree \|/ Lvalue and condition checking | | more detailed parse tree \|/ out (This is effectively a magnification of the part of the source map in section 1.2 above which reads just "expressp.c".) Note that there is a single input and a single output channel. RPN is "reverse Polish notation": for example (2 + 4)*45 + 34 becomes 2 4 + 45 * 34 + the idea being that one reads it left to right: each number is added to a stack of numbers, and each operation takes some numbers off the stack and puts an answer back. (E.g., + takes 2 and 4 off and replaces them with 6. * takes 6 and 45 off, replacing them with 270. + takes 34 and 270 off, replacing them with 304, which is the answer.) The advantage of this is that is unambiguous which operator acts on which operands, and the brackets have gone. The emitter builds a (plain) parse tree out of this sequence, as follows. It stacks 2 and then 4: Stack: 2 4 It reads +, which it knows has two operands, and takes the top two items off the stack, joining them into a tree segment like so: + / \ 2 4 It then makes a special value, say E1, meaning "the sub-expression in this tree", and stacks that: Stack: E1 It then stacks up 45, and reaches *: it takes off the top two values, which are E1 and 45, and makes * / \ E1 45 which it calls E2, and so on. When the process is finished, we have the tree: + <--- this is E3 / \ this is E2 ---> * 34 / \ this is E1 ---> + 45 / \ 2 4 and the stack contains just E3, which is the "answer". 6.2 The operator precedence grammar ------------------------------- A list of productions for this grammar would be tiresomely long. It can be roughly summarised as: expression -> ( expression ) a single operator acting on some operands in some way operand -> token representing a constant ( expression ) expression ( arguments ) arguments -> expression , arguments The tokens representing a constant are given in the next section. The operators have tokens as follows: -------------------------------------------------------------------------- Level Lexeme Usage Associativity Purpose -------------------------------------------------------------------------- 0 , binary left separating values to work out 1 = binary right set equal to 2 && binary left logical AND 2 || binary left logical OR 2 ~~ unary (prefix) logical NOT 3 == binary none equal to? 3 ~= binary none not equal to? 3 > binary none greater than? 3 >= binary none greater than or equal to? 3 < binary none less than? 3 <= binary none less than or equal to? 3 has binary none object has this attribute? 3 hasnt binary none object hasn't this attribute? 3 in binary none first obj a child of second? 3 notin binary none first obj not a child of second? 3 ofclass binary none obj inherits from class? 3 provides binary none obj provides this property? 4 or binary left separating alternative values 5 + binary left 16-bit signed addition 5 - binary left 16-bit signed subtraction 6 * binary left 16-bit signed multiplication 6 / binary left 16-bit signed integer division 6 % binary left 16-bit signed remainder 6 & binary left bitwise AND 6 | binary left bitwise OR 6 ~ unary (prefix) bitwise NOT 7 -> binary left byte array entry 7 --> binary left word array entry 8 - unary (prefix) 16-bit (signed!) negation 9 ++ unary (pre/postfix) read/increment or increment/read 9 -- unary (pre/postfix) read/decrement or decrement/read 10 .& binary left property address 10 .# binary left property length 10 ..& (**) binary left individual property address 10 ..# (**) binary left individual property length 11/14 ( ) binary left function call/message send 12 . binary left property value 12 .. (**) binary left individual property value 13 :: binary left "superclass" operator -------------------------------------------------------------------------- (**): Illegal except internally, in Inform's own source for the run-time veneer -------------------------------------------------------------------------- Note that, unlike C, Inform does not provide unary plus. A binary infix operator is used in the form "X op Y"; a unary prefix operator in the form "op X"; a unary postfix operator in the form "X op". The "level" is the precedence level: the higher this number, the more tightly an operator is glued to the operands adjacent to it. Cases where two binary infix operators have equal precedence are resolved according to whether that level is left or right associative. For instance, a / b / c means (a/b)/c since /, on level 6, is left associative. But a = b = c means a = (b = c) since =, on level 1, is right associative. Cases in the above table where the associativity is "none" mean that an attempt to silently associate the operator will cause an error. For example, a == b == c generates an error as being ambiguous, but (a == b) == c and a == (b == c) are both permitted. The function-call-brackets have an asymmetric precedence level according to whether they're being compared with something on the left or on the right: the point is that object.message(...) function_returning_an_object(...).property must be recognised as (object.message)(...) . 12 > (...) 11 (function_returning_an_object(...)).property (...) 14 > . 12 respectively. 6.3 Lexical level 4: tokens to etokens ---------------------------------- The shift-reduce parser expects tokens sorted out into operators, such as "+" or "ofclass", operands such as "34" or "score" (assuming that's a variable) and three tokens of special significance: ( ) end-of-expression "Level 4 of the lexer" does just this. It converts the stream output from "lexer.c" (i.e., from level 3) into a stream of tokens from the following table (and it uses a good deal of context information to do so). Token type Meaning ------------------------------------------------------------------------- OP_TT Operator: value is a *_OP constant LARGE_NUMBER_TT Operand (number): value is the number SMALL_NUMBER_TT Operand (number guaranteed in the range 0 to 255): value is the number VARIABLE_TT Operand: value is Z-machine variable number DQ_TT Operand (text used as static string): text in text DICTWORD_TT Operand (text used as dictionary word): text in text ACTION_TT Operand (##action number): name of action in text SYSFUN_TT Operand (system fn name): value is a *_SYSF constant SYSTEM_CONSTANT_TT Operand (#system_constant): value is a *_SC constant SUBOPEN_TT Open bracket used to open a sub-expression SUBCLOSE_TT Close bracket used to close a sub-expression ENDEXP_TT End of expression ------------------------------------------------------------------------- The tokens in this output stream are called "etokens", short for "expression tokens": they are the raw material from which parse trees are built up. (Although formally different from ordinary tokens, they have the same storage type, "token_data", in the Inform source code.) At first sight this seems an unnecessarily large stock: why not, for example, compile static strings and dictionary words and convert them to their addresses here and now? Why not evaluate system constants and action numbers now? Because: (i) these tokens may not be accepted by the parser, so we can't compile text on the strength of them (and indeed, the parse tree which is being parsed may not ever be code-generated from); and (ii) we can't allow ourselves to lose sight of where numbers are coming from so soon, or backpatching will be impossible. The presence of two different tokens for numbers - one for numbers guaranteed to be unsigned 8 bit, one for numbers with no such guarantee and whose value may be any signed 16 bit quantity - is due to the architecture of the target Z-machine. In Z-code considerable space savings are made by storing small numbers in 1, rather than 2, bytes. We are anxious to take advantage of this. Unfortunately, what seems to be a small number now may not be a small number later (after backpatching). Dictionary word values (such as 'word') are internally stored as the accession number of that word into the dictionary (say 23, if it is the 23rd different word to be entered) during the compilation pass, and then backpatched later to the byte address of this word's dictionary entry: but this will be a large number. Most operands with non-NULL backpatch markers are forced into long form, to be on the safe side. (This includes all operands which are forward references to constants not yet defined.) The exceptions are that variable and action numbers (though not fake action numbers) are guaranteed always to fit into short form. There are two tasks involved in translating tokens to etokens: evaluating tokens representing quantities into one of the "operand" etokens, and sorting other tokens into the right operator and special etokens. The second process is not as easy as it looks because of ambiguities, and is discussed in the next section. This section discusses the evaluation of quantity tokens. Tokens in expressions are read in a context where the keywords are local variables, textual operators, system function names and system constant names. Other identifiers are symbol names. So the token types which must represent quantities are: DQ_TT, SQ_TT, LOCAL_VARIABLE_TT, NUMBER_TT, SYMBOL_TT, SYSFUN_TT and SYSTEM_CONSTANT_TT. (Actually, this omits a detail: the largely obsolete constant forms #a$..., #r$..., #n$... are SEP_TT tokens converted into the above; and the very much still with us constant form ##... is a SEP_TT token converted straightforwardly into the ACTION_TT etoken.) These all translate directly into etokens except for SYMBOL_TT and SQ_TT. SQ_TT requires a little work because of the Inform syntax for single-quoted strings: 'x' means the character code (i.e. ASCII value) for "x" '@ss' means the character code for the German letter "sz" (i.e. the code specified in the accented set in the Z-Machine standard document) 'xyzzy' means the byte address of the dictionary word "xyzzy" The first two become SMALL_NUMBER_TT etokens. The third becomes a DICTWORD_TT etoken. This leaves SYMBOL_TT tokens: names for constants. There are two possibilities: the symbol is so far unknown, or else it has been assigned a value in earlier syntax analysis. In most languages it would be an error to use an unknown symbol name as a value: in C, for example, a function name cannot be used in an expression unless it has been declared beforehand. Inform is more tolerant: as far as expressions are concerned, any symbol can be used earlier than the point in the source code where its value is assigned, excepting only a local or global variable name. (This exception is mainly for Z-machine architecture reasons, but it also makes Inform code more legible.) When an unknown symbol is reached, it is converted to a LARGE_NUMBER_TT and marked as SYMBOL_MV, with the value being its index in the symbol table. (This allows the backpatcher to know where to find the true value later.) When a known symbol is reached which is flagged as CHANGE_SFLAG, this is treated in the same way. (CHANGE_SFLAG is given to symbols defined by Constant definitions whose values themselves require backpatching: for instance, symbols defined equal to dictionary words or routine addresses.) Otherwise, the symbol is not only known, but its current value is known to be correct, and so: if it's a variable, it's translated to etoken VARIABLE_TT; otherwise the LARGE_NUMBER_TT/SMALL_NUMBER_TT decision is now made: if it is marked with a marker other than VARIABLE_MV, it becomes "long"; otherwise the value is compared with the range 0 to 255. Any symbol name fed into the shift-reduce parser is flagged with USED_SFLAG to indicate that its value has at some point been used in an expression. (This information enables "This ... was declared but not used" warnings to be made at the end of compilation.) 6.4 Resolution of ambiguous tokens ------------------------------ It remains to translate to the etokens OP_TT, SUBOPEN_TT, SUBCLOSE_TT and ENDEXP_TT. The token types which represent operators are all from SEP_TT and COND_TT; open and close bracket are also SEP_TT tokens. There are two main difficulties here. Firstly, shift-reduce parsers are unsuitable for resolving ambiguities, and so it is not possible to map tokens directly into operators. The ambiguities in the expression grammar for Inform are: (a) ++ and -- each refer to either a prefix or a postfix operator (b) - refers to either a unary prefix operator or a binary infix one (c) ( and ) are used both to indicate function calls and subexpressions (d) , is used both to indicate a sequence of values, each to be evaluated and thrown away with only the last one kept, and to separate function arguments (e) -> does not represent an operator when parsing constants as operand values in a line of "@" assembly language (because it needs to mean "the store value goes in") These are resolved as follows: (a), (b) The "level 4 lexer" watches for context: for instance, MINUS_SEP translates to UNARY_MINUS_OP if there was no previous token, or it was an open bracket, or an infix or prefix operator, or a comma; otherwise it translates to MINUS_OP. (c) Similarly. If an open bracket token is read, and the previous token was an operand of some kind, then it must represent a function call (unless we are in constant context); an extra etoken is generated for an imaginary infix operator called FCALL_OP. The open bracket token is then translated to SUBOPEN_TT as usual; a close bracket is always SUBCLOSE_TT. Thus, the stream Eroica ( Beethoven , 3 ) is translated into the etokens Eroica ( Beethoven 3 ) so that a tree will come out of the emitter looking like / \ Eroica / \ Beethoven 3 although in fact the emitter recognises what is going on and automatically simplifies this to: / | \ / | \ Eroica Beethoven 3 (d) A comma at the top level is either disallowed (in constant context) or treated as the operator . (Just as in C.) Otherwise, comma is a separator of arguments. "Top level" means "top bracket level", which the "level 4 lexer" keeps track of by counting the bracket tokens going through it. (e) There is a special expression context called "assembly language", in which "->" is translated to ENDEXP_TT. The second main difficulty is entirely of the language designer's own making. How is the end of an expression to be detected? In C, it always possible when parsing an expression to say "the end will have been reached when such-and-such a token is reached": for example, ";" or "}". This is not possible in Inform. Even if one knows in advance what the terminating token ought to be, it might be impossible to recognise for context reasons. For example, the statement move to ; appears to be no problem: the first expression is terminated by the keyword "to", the second by ";". But suppose "to" has another meaning, as the name of a constant or variable? In this case, move 2 + to to to; is a legal statement. For all these reasons, Inform actually detects the end of an expression at lexical level 4 by watching the context carefully: for instance, in move 4 * banana to ... when "4 * banana" has been passed, the next token is expected to be an open bracket, an infix or postfix operator, or (perhaps) a comma. Since the next token is none of these, the expression has ended. Note that this decision can be taken without knowing whether "to" is a keyword or a symbol. From a grammatical point of view the worst design flaw in the Inform language is that array initialisations (either in Array directives or object property values) consist of sequences of expressions given without any separator in between. For example, Array X --> 2 4 6 * MAX_FROGS 45 ; This is arguably a convenient shorthand. Unfortunately, it falls prey to that predator of all computing grammars, unary minus. The directive Array X --> 10 -5; is ambiguous. I