------------------------------------------------------------------------------
                               Inform 6.21

                            Technical  Manual

                              Graham Nelson
                             27th April 1996
                      revised on the following dates:
     5th May, 10th May, 6th September, 23rd September, 16th December 1996,
     27th January, 26th March, 5th April, 8th September 1997,
     22nd March 1998, 10th December, 30th January 1999, 27th April
------------------------------------------------------------------------------
    1   The source code
            1.1   Inventory
            1.2   Map
            1.3   Naming conventions
            1.4   Typedef-named types
    2   Porting Inform to a new environment
            2.1   Dependence on the OS
            2.2   Portability issues: the types int32 and uchar
            2.3   The character set and the format of text files
            2.4   The OS definitions block in "header.h"
            2.5   Running Inform in a multi-tasking OS
    3   Front end
            3.1   The ICL interpreter
            3.2   Fundamental method
    4   Lexical analysis
            4.1   Aim and structure of the lexer
            4.2   Level 1: lexical blocks and buffered input
            4.3   Level 2: breaking text into lexemes
            4.4   Representation of tokens
            4.5   Level 3: identifying identifiers
            4.6   Hash-coding and comparing names
            4.7   The symbols table
            4.8   Token lookahead: the circle
            4.9   Summary of the token output
    5   Syntax analysis 1: the top-down structural parser
            5.1   Aim and structure of the syntax analyser
            5.2   Predictive parsing
            5.3   The context-free grammar
            5.4   Assigning values to symbols
            5.5   Outputs other than assembly language
            5.6   Assembly operands
            5.7   Translation to assembly language
            5.8   Summary of assembly language instructions output
    6   Syntax analysis 2: the bottom-up expression parser
            6.1   Map and structure
            6.2   The operator precedence grammar
            6.3   Lexical level 4: tokens to etokens
            6.4   Resolution of ambiguous tokens
            6.5   Constant folding
            6.6   Checking lvalues and simplifying the parse tree
            6.7   Summary of parse tree output
    7   Code generation from parse trees
            7.1   Aim and structure of the code generator
            7.2   Annotation for conditions
            7.3   Main traversal
            7.4   Void context
            7.5   Results of subexpressions
            7.6   Conditional and logical operators
            7.7   System functions
            7.8   Strict mode assertions
    8   Assembly of code, text and data structures
            8.1   Assembly of initial code
            8.2   Branch offset backpatching and optimisation
            8.3   Text translation: ISO and Unicode resolution
            8.4   Dictionary management
            8.5   Format of tables unspecified by the Z-machine
            8.6   Grammar version numbers GV1 and GV2
            8.7   Value backpatching
            8.8   Packed address decoding
    9   Run-time veneer
            9.1   Services provided by the veneer
            9.2   Specification of the veneer routines
            9.3   Properties 2 and 3, "ofclass" and "metaclass"
            9.4   Class numbering and class objects
            9.5   Individual property identifiers
            9.6   Individual property tables
            9.7   Availability of symbol names at run-time
    10  Module compilation and linking
            10.1  Model
            10.2  Linking a module
            10.3  Imports and exports
            10.4  Backpatch backpatching
            10.5  How modules differ from story files
            10.6  Restrictions on what modules may contain
    11  Service levels
            11.1  Variables and arrays
            11.2  Memory allocation and deallocation
            11.3  Error messages
            11.4  File input/output
    12  Low-level language features
            12.1  Using the "Trace" directive
            12.2  System constants and other secret syntaxes
            12.3  The "Zcharacter" directive
            12.4  Sequence points
            12.5  Format of debugging information files
            12.6  How to syntax-colour Inform code
    13  What little I remember
            13.1  Publication history
            13.2  Design history
            13.3  Implementation history
            13.4  Modification history
------------------------------------------------------------------------------

    1   The source code
        ---------------

    1.1   Inventory
          ---------

Inform is written in portable ANSI C and the source code is divided into 21
files of code, called "sections", plus 1 #include file of linkage, constant
and type definitions.  These files are:

  arrays.c     asm.c        bpatch.c     chars.c      directs.c    errors.c
  expressc.c   expressp.c   files.c      inform.c     lexer.c      linker.c
  memory.c     objects.c    states.c     symbols.c    syntax.c     tables.c
  text.c       veneer.c     verbs.c      header.h

Note that all their names fit into the 8.3 filenaming convention.  The
subdivision into 21 sections is intended to ensure that each .c file can be
compiled into one linkable object file: under some C compilers object
code files cannot exceed 64K in length.  On my machine, expressc.o (the
object code derived from expressc.c) is the largest, at 40K (about a third of
which is the static table of operator data).

A concise tree giving the structure of the source code follows.  A section
name is given brackets when it has already been given on a previous line: so
"(inform.c)" is intended to be read as "now back to inform.c again". A name
written as a function, like "compile()", is indeed a function, whose
arguments are omitted here.  Text in square brackets indicates the presence
of interesting tables of static data.  Finally, note that this structure
is not absolutely rigorous: the error-reporting routines in "errors.c", for
example, are called from all over Inform, not just from the lexical analyser.


    inform.c  ICL parser: switches, filename translation, path variables

    memory.c  ICL memory command parser: sets memory settings.

    (inform.c)   compile(): polling all sections to manage variables,
                 allocate and free arrays

    syntax.c       syntax analyser, top level
    
    lexer.c          lexical analyser: converts source text to a stream of
                     tokens
                     [Tables of all Inform keywords]

    chars.c            performs character set translations
    
    files.c            reads source code files into buffers for the lexer;
                       performs miscellaneous file I/O

    symbols.c          keeps table of symbol names found in the source;
                       recognises from and adds to this table
    
    errors.c           issues error, fatal error and warning messages
    
    (syntax.c)     parse_program(): top level routine in syntax analyser

                     parse_directive()

    directs.c          parse_given_directive(): parses and obeys the easier
                       directives; manages conditional compilation; delegates
                       directive parsing down to other sections for harder
                       cases:

    text.c               make_abbreviation(): Abbreviate

    arrays.c             make_global(): Array, Global

    objects.c            make_attribute(): Attribute
                         make_property(): Property
                         make_class(): Class
                         make_object(): Object

    verbs.c              make_fake_action(): Fake_Action
                         make_verb(): Verb
                         extend_verb(): Extend

    linker.c             link_module(): Link

    (symbols.c)            assign_symbol(): giving value and type to symbols

    (syntax.c)         parse_routine()

                         parse_code_block()

    states.c               parse_statement(): assigns ".Label" labels, parses
                           and generates code for statements

                             parse_action(): handles <Action ...> statements

                             parse_print(): handles print and print_ret

    asm.c                    parse_assembly(): handles assembly language
                             source code (preceded by "@")

    expressp.c                 parse_expression(): parses all expressions
                               (including constants), conditions and
                               assignments.

    expressc.c                 [Table of operators]
                               code_generate(): generates code from parse
                               trees previously found by parse_expression()

    (asm.c)                      assemble_2_to() (and many other similarly
                                 named routines)

                                   [Database of all Z-machine opcodes]
                                   assemble_instruction(): assembles a single
                                   Z-code instruction

                                   assemble_label_no(): puts label N here

                                   assemble_routine_end(): finishes routine,
                                   backpatches and optimises branches

    (text.c)                     compile_string(): translates ASCII text
                                 to Z-encoded text

                                 the dictionary manager

                                 the abbreviations optimiser

    (chars.c)                      performs character set translations

    veneer.c       compile_veneer(): compiles in any whole routines
                   needed by the rest of the compiled code
                   [Table of Inform source code for veneer routines]

    tables.c       construct_storyfile(): glues together all the
                   code, dictionary, object tree, etc. into a story file

    bpatch.c       backpatch the code in the light of recent knowledge


    1.2   Map
          ---

Here follows a map of the Inform archipelago, marking the inhabited islands,
their shipping lanes and chief imports and exports:


         command line
       and/or ICL files in
              |
              | ICL commands
             \|/
         +----------+ FRONT
         | inform.c | END
         | memory.c |
         +----------+
              |
              | filenames
              |
             \|/          +------------+ LEXICAL
  ------>  files.c -----> | lexer.c    | ANALYSER
  source            chars | symbols.c  |
  code in                 +------------+
                              /|\  |
                        symbol |   |
                        values |   | tokens
                               |  \|/
                          +------------+
               SYNTAX     | syntax.c   | -----+------>  asm.c  --->---\
               ANALYSER:  | states.c   |     /  assembly /|\   initial|
               STATEMENTS |     .      |    /    language |       code|
                          | ---------- |   /              |           |
               @ ASSEMBLY | asm.c      | -/               |           |
                          | ---------- |                  |           |
                          |     .      | parse trees      |          \|/
              EXPRESSIONS | expressp.c | --+-------> expressc.c     asm.c
                          |  (map 6.1) |    \                         |
                          |     .      |     \                        |
                          |     .      |      \         TEXT          |
                          | ---------- |    strings   +-------+Z-text |
               DIRECTIVES | directs.c  |        \---->|text.c |-->---)|(--\
                          |     .      |              |chars.c|       |   |
                          |     .      |              +-------+       |   |
                          |     .      |        dictionary|           |   |
                          |     .      |       alphabets \|/         \|/  |
                          |     .      |                  |           |   |
                          | arrays.c   | ------->------\  |           |   |
                          |     .      |  array area   |  |        raw|   |
                          |     .      |               |  |     Z-code|   |
                          | objects.c  | ----->-----\  |  |           |   |
                          | verbs.c    |  objects   |  |  |           |   |
                          +------------+            |  |  |           |   |
                                |                  \|/\|/\|/         \|/  |
                                |    grammar      +----------+----------+ |
                                \---------------> | tables.c   bpatch.c | |
                                                  +----------+----------+ |
                                                       |    OUTPUT    |   |
                                                       |              |   |
                                            Z-machine \|/       Z-code|   |
                                                 up to |              |   |
                                             code area |             \|/ \|/
                                                       \-----------> files.c
                                                                        |
                                                                        |
                                                                       \|/
                                                                      story
                                                                    file out

(For clarity, the linker and a few smaller tables are missed out; and the
"service" sections of Inform, such as "errors.c" and the allocation code in
"memory.c", are missed out since they are, so to speak, pipes and sewers
which lie beneath the surface of the ocean.)


    1.3   Naming conventions
          ------------------

The "header.h" makes over 700 constants using #define.

These are mainly in capital letters and are followed by _ and then some short
code indicating what kind of constant is being defined: for instance,
NUMBER_TT means "the token type <number>".  We write *_TT for the set of
constants ending in _TT.  Similarly, though to a lesser extent, groups of
related variables and routines have grouped names.

    -------------------------------------------------------------------------
    Set of constants          Used for
    -------------------------------------------------------------------------
    *_Extension               File extensions, such as ".z5", used if the
                              host OS supports them
    *_Directory               Initial values for the ICL path variables
                              (e.g., default pathname where story files are
                              written to)

    *_TT                      Token types

    *_CODE                    Token values for statement and directive names
    *_COND                    Token values for textual condition names
    *_SEGMENT                 Token values for object definition segment names
    *_MK                      Token values for misc statement keywords
    *_TK                      Token values for "trace" directive keywords
    *_DK                      Token values for misc directive keywords
    *_SC                      Token values for system constant names
                              (* is written in lower case)
    *_SYSF                    Token values for system function names
                              [In all of the above eight cases, * is the name
                              of the statement, keyword, etc. referred to,
                              written in upper case except as specified above]

    *_SEP                     Token values for separators
                              (the name sometimes reflects the text, e.g.,
                              DARROW_SEP for the double-length arrow "-->";
                              sometimes its use, e.g. NOTEQUAL_SEP for "~=")

    *_OP                      Token values for operators
    *_A                       Associativity values for operators
    *_U                       "Usage" (infix, prefix or postfix) values for
                              operators

    *_T                       Symbol types
    *_SFLAG                   Symbol flags
                              (bitmasks containing one bit set, so that
                              (sflags[i] & *_SFLAG) is true if flag is set)

    *_CONTEXT                 Contexts in which the expression evaluator
                              can run (e.g., "void", "condition")

    *_zc                      Internal numbers referring to Z-machine opcodes
                              (* is the Standard 0.2 name for the opcode,
                              written in lower case)
    *_OT                      Assembly operand types
    *_STYLE                   Two constants used to set whether the Z-machine
                              has "time" or "score/moves" on its status line
    *_ZA                      Z-machine areas: e.g. PROP_ZA refers to the
                              property values table area
    *_MV                      Marker values
    *_RTE                     Run-time error numbers
    *_VR                      Veneer routines (* is the name of the routine,
                              usually in mixed case)

    *_DBR                     Record types in debugging information files

    -------------------------------------------------------------------------
    Set of variables          Used for
    -------------------------------------------------------------------------
    *_switch                  Flag indicating whether a command-line switch
                              such as -s is on or off
    *_setting                 Numerical value set by a command-line switch
                              such as -t3
    MAX_*                     A limit on something: note that a few of these
                              are #define'd but most are memory setting
                              variables
    no_*                      Number of things of this type made so far
    max_*                     Maximum number of things of this type made
    token_*                   Three variables used to hold the value, type
                              and lexeme text for the last token read
    *_trace_level             0 if tracing information is not being printed
                              out about *; otherwise, the larger this is,
                              the more output is produced
    *_offset                  Byte offset in the Z-machine, either from the
                              start of this Z-machine area or from the start
                              of Z-machine memory
    *_top                     Pointer marking the current end in some uchar
                              array (usually holding a Z-machine area being
                              put together)

    -------------------------------------------------------------------------
    Set of routines           Used for
    -------------------------------------------------------------------------
    init_*_vars()             Routine in which section * of Inform initialises
                              its variables to begin compilation
    *_begin_pass()            ...and in which it initialises its variables
                                 at the start of the source code pass
    *_allocate_arrays()       ...and in which is allocates any memory or
                                 arrays it needs to begin compilation
    *_free_arrays()           ...and in which it deallocates any memory or
                                 arrays it has allocated, after compilation
    parse_*()                 Routine in the syntax analyser to parse the
                              source code construct *
    assemble_*()              Instructing the assembler to generate an
                              instruction:
    assemble_#()              (where # is a number from 0 to 4) an instruction
                              with # operands
    assemble_#_to()           an instruction with #operands which stores a
                              result
    assemble_#_branch()       an instruction with #operands which makes a
                              conditional branch
    *_linenum()               Keeping and writing line references to the
                              debugging information file


    1.4   Typedef-named types
          -------------------

    -------------------------------------------------------------------------
    Typedef name      Where defined      Used for
    -------------------------------------------------------------------------
    int32                  H             signed 32-bit integer
    uint32                 H             unsigned 32-bit integer
    uchar                  H             unsigned char

    assembly_operand       H             holding Z-machine numbers in the form
                                         used in Z-code, together with linkage
                                         information about how they were
                                         calculated
    assembly_instruction   H             a convenient representation of an
                                         instruction of Z-code to assemble
    opcode                 asm.c         everything about a Z-machine opcode:
                                         how many operands it has, whether
                                         branch or store, etc.
    verbl                  H             grammar line of 8 token values
    verbt                  H             grammar table of grammar lines
    prop                   H             list of values for a property
    propt                  H             property values table
    fpropt                 H             the same, but with attributes too
    objectt                H             object tree-position and attributes
    dict_word              H             Z-text of a dictionary word
    dbgl                   H             source code reference used for the
                                         debugging file (to file, line, char)
    keyword_group          H             the plain text of a group of keywords
                                         (such as: all the statement names)
    token_data             H             lexical tokens
    expression_tree_node   H             node in a parse tree produced by
                                         the section "expressp.c"
    operator               H             everything about an operator: its
                                         name, how to recognise it, its usage
                                         and associativity, etc.

    memory_block           H             an extensible area of memory
                                         (allocated in 8K chunks as required)

    tlb_s                  text.c        used in abbreviations optimiser
    optab_s                text.c        used in abbreviations optimiser

    FileId                 H             filename and handle for a source file
    ErrorPosition          H             filename and line reference for
                                         error message printing purposes
    LexicalBlock           lexer.c       name, line count, etc.
                                         within a block of text being lexed
    Sourcefile             lexer.c       buffer, pipeline and lexical block
                                         for a source code file being lexed
    ImportExport           linker.c      holds import/export records
    Marker                 linker.c      holds marker records
    VeneerRoutine          veneer.c      holds low-level Inform source code
                                         for a veneer routine
    -------------------------------------------------------------------------
                                   "H" is an abbreviation here for "header.h"


    2   Porting Inform to a new environment
        -----------------------------------

    2.1   Dependence on the OS
          --------------------

Strenuous efforts have been made over the last three years to make the source
as operating-system independent as possible.  As a general principle, mostly
adhered to, all operating-system differences should be in the "header.h"
file, and not in the 20 sections.

As a general rule, for each target OS which Inform is being ported to, a new
#define name is invented.  For example, the name LINUX is used for the Linux
port.  When Inform is being compiled, only that one symbol will be defined,
and none of the others: thus

#ifdef LINUX
   ...code...
#endif

compiles the given code only for the Linux port.

There are some very poor "ANSI C" compilers out there, and many more mediocre
ones (which almost obey the standard, but don't quite): in any case the ANSI
standard is very broadly defined.  For example, the code

int x;
x = 45*1007;
printf("%d\n", x);

is entirely ANSI compliant, but results in different numbers being printed
on different machines, due to the fact that ANSI does not specify the range
of numbers which a variable of type int can safely hold.  Since C is so highly
unportable a language, and since some of the compilers used to produce Inform
are poor, the whole Inform code has to be written with the worst possible
compiler in mind.

An illustration of this is that all preprocessor commands, such as #define,
must begin on column 1 of the source code: even when they occur in code which
is #ifdef'd out.  VAX C (a particularly bad compiler) will reject

#ifndef VAX
  #define FROG 2
#endif

for example, even when VAX is defined.  This makes the declarations in
"header.h" annoyingly illegible for everybody.


    2.2   Portability issues: the types int32 and uchar
          ---------------------------------------------

The main issues when porting Inform have been found to be:

    (a) the size of "int",
    (b) whether "char" is unsigned or signed by default,
    (c) what conventions apply to filenames,
    (d) assumptions about sizeof() when casting pointers,
    (e) how parameters (switches, filenames, etc.) are passed to Inform.

(a) ANSI requires that "int" be at least 16 bit, though advances in CPU
    technology mean that under most of today's environments it will in fact be
    32 bit.  ANSI further requires that "long int" be at least as big as "int",
    but not that it has to be any bigger.

    Inform needs at least one integer type to be able to hold 32 bit signed
    values, and "header.h" contains code which attempts to typedef the name
    "int32" to such a type.  This should happen automatically.
    
    Under ANSI rules, as the above says, a compiler need not have any integer
    type larger than 16 bits: if so, that compiler will not be able to
    compile Inform.

    An annoying issue here is that compilers vary widely in the extent to
    which they give errors or warnings when they detect silent "promotions"
    from one integer type to another.  This makes it very hard to sort out
    the types of every object in the program between int and int32 so that
    everybody is happy: in practice, every time a new release of the source
    code has been made, a few dozen types have had to be fiddled with until
    everybody can compile it.

(b) Compilers seem to divide about fifty-fifty on this.  Again, the original
    standard is vague: the issue is really about how the value (char) 253,
    say, should be interpreted when it is cast to (int).  Should the answer
    be 253, or -3?  ANSI did not specify this because compiler writers
    wanted to be able to choose whichever could be done instantly on the
    CPUs they were working with.  (If you store a char in the bottom 8 bits
    of a 32 bit register, then casting the value (char) 253 to (int) -3
    means setting all 24 of the upper bits, which requires code to be
    compiled.)

    Inform uses a typedef'd type called "uchar" when it needs an unsigned
    char type: it uses plain "char" when it doesn't mind.  It never needs
    a signed char type.
    
    In theory ANSI C compilers must recognise the keywords "signed" and
    "unsigned", but some don't:
    
        typedef unsigned char uchar;

    actually produces an error on some compilers.  So the typedef can only
    be made with your help.  (On other compilers, "unsigned" is legal but
    "signed" is illegal.)
    
(c) Many people think that the minimal 8.3 convention will work on any
    operating system, but this is not true (it won't work under Acorn's
    RISC OS).  Much of each OS specification in "header.h" is therefore
    to do with filenaming.

(d) For instance,

        sizeof(char *)    sizeof(int *)    sizeof(int32 *)    sizeof(int)

    may all be different numbers on machines with segmented memory maps.
    This being so, casting between pointer types may lose information, and
    a few arrays in the source have surprising types to ensure safety.

    One thing Inform does need to be able to do is to subtract one pointer
    (of the same type) from another: it defines the macro
    
        subtract_pointers(X, Y)

    to do this.  X and Y are normally of type uchar; there seems to have been
    no problem with this in practice.

(e) The ANSI standard is quite good on the command line, and Inform expects
    to read parameters by the standard argc, argv mechanism.  Unfortunately
    the Macintosh, for instance, has no orthodox command line.  Such a port
    probably wants to have an "outer shell" which displays a window, allows
    options to be set and then calls the Inform 6 core as needed.

    The section "inform.c" normally compiles a "main" routine which makes
    a few machine-dependent changes and then passes its arguments straight
    on to "sub_main".  For instance, here's the v6.10 source:

        int main(int argc, char **argv)
        {   int rcode;
            #ifdef MAC_MPW
            InitCursorCtl((acurHandle)NULL); Show_Cursor(WATCH_CURSOR);
            #endif
            rcode = sub_main(argc, argv);
            #ifdef ARC_THROWBACK
            throwback_end();
            #endif
            return rcode;
        }

    The Macintosh Programmer's Workshop port is making multi-tasking
    work before embarking on compilation; the Acorn Desktop Debugging
    Environment port is tidying up after any error throwbacks, at the
    end of compilation.  The point is that here is the place for such
    minor machine quirks.

    However, if you want an entirely new front end (such as Robert Pelak's
    Macintosh port of Inform has), then you need to define
    
        #define EXTERNAL_SHELL

    in your machine definition block (see later).  This will mean that
    no "main" routine is compiled at all from "inform.c" (so you can
    simply link the Inform source into your own code, which will
    contain its own "main.c"): Inform should be run by calling
    
        extern int sub_main(int argc, char **argv);

    having set up argc and argv suitably.  For instance, the outer shell
    might take names typed into dialogue boxes, and various ticked options
    on a window, and make these into a series of ICL commands, which
    are then handed over textually to sub_main.  I suggest that the most
    efficient way to do this is to write them as an ICL file somewhere
    and to pass sub_main a single parameter telling it to run this
    ICL file.


    2.3   The character set and the format of text files
          ----------------------------------------------

The Inform source code assumes that the compiler is running on a machine
whose character set agrees with ASCII in the range $20 to $7e.  (This
allows both plain ASCII and any of the ISO 8859 extensions to ASCII.)

ASCII is now universal, but there is no common format for plain text files,
and in particular how lines of text are ended.  For example:

    MS-DOS, Windows, etc.:  $0d $0a
    Mac OS:                 $0d
    RISC OS:                $0a

Inform 6 can read source code files in all these formats, and which further
use any of the character sets above: plain ASCII or ISO 8859-1 to -9.
(This is configurable using the -C switch.)


    2.4   The OS definitions block in "header.h"
          --------------------------------------

Each Inform port makes a block of definitions in the header file.  These
blocks take a standard format.

Firstly, the block is put in #ifdef's so that it will only be processed
in this one port.  The block is divided into 6 sections, as follows.

    /* 1 */

    MACHINE_STRING should be set to the name of the machine or OS.

    /* 2 */

    Section 2 contains some miscellanous options, all of which are on/off:
    they are by default off unless defined.  The possibilities are:

      USE_TEMPORARY_FILES - use scratch files for workspace, not memory,
                            by default
      EXTERNAL_SHELL      - this port is providing an entire external
                            front end, with its own "main" routine:
                            see above
      PROMPT_INPUT        - prompt input: ignore argc and argv, instead
                            asking for parameters at the keyboard.
                            (I hope people will write front-ends
                            rather than resort to this, but it may be
                            a useful staging post.)
      TIME_UNAVAILABLE    - if the ANSI library routines for working out
                            today's date are not available
      CHAR_IS_SIGNED      - if on your compiler the type "char" is signed
                            by default

    Note that defining USE_TEMPORARY_FILES does not make a mandatory choice
    (as it did under Inform 5): whether to use allocated memory or temporary
    files is selectable with -F0 (files off) or -F1 (files on) in ICL.
    All that this option does is to define the default setting for this -F
    switch.

    Running -F0 is faster (possibly, depending on whether your C library
    provides buffering or not, much faster) but consumes 100 to 300K more
    memory (it does so flexibly, allocating only what it needs, unlike the
    Inform 5 option).  Most users will not want to understand the issues
    involved here, so please make a sensible default choice for them.

    Once again, note that CHAR_IS_SIGNED must be defined if "char" is signed:
    otherwise "uchar" will be typedef'd wrongly.

    /* 3 */

    An estimate of the typical amount of memory likely to be free should be
    given in DEFAULT_MEMORY_SIZE.  (This is only a default setting.)
    
    There are three settings: HUGE_SIZE, LARGE_SIZE and SMALL_SIZE.
    (I think it was Andrew Plotkin, though, who remarked that HUGE_SIZE
    might sensibly be renamed "not-bad-by-1980s-standards-size":
    these all allocate quite small amounts of memory compared to, say,
    the 8M of workspace that Windows appears to need just to keep breathing.)

    For most modern machines, LARGE_SIZE is the appropriate setting, but
    some older micros may benefit from SMALL_SIZE.

    /* 4 */

    This section specifies the filenaming conventions used by the host OS.

    It's assumed that the host OS has the concept of subdirectories and
    has "pathnames", that is, filenames giving a chain of subdirectories
    divided by the FN_SEP (filename separator) character: e.g. for Unix
    FN_SEP is defined below as '/' and a typical name is

                          users/graham/jigsaw.z5

    Normally the comma ',' character is used to separate pathnames in a
    list of pathnames, but this can be overridden by defining FN_ALT
    as some other character.  Obviously it should be a character which never
    occurs in normal pathnames.
                                                                           
    If FILE_EXTENSIONS is defined then the OS allows "file extensions" of
    1 to 3 alphanumeric characters like ".txt" (for text files), ".z5"
    (for game files), etc., to indicate the file's type (and, crucially,
    regards the same filename but with different extensions -- e.g.,
    "frog.amp" and "frog.lil" -- as being different names).

    If FILE_EXTENSIONS is defined, then Inform uses the following standard
    set of extensions unless they are overridden by other definitions
    at this point.  (Please don't override these definitions without reason.)

        Source_Extension  ".inf"    Source code file
        Include_Extension ".h"      Include file (e.g. library file)

        Code_Extension    ".z3"     Version 3 story file
        V4Code_Extension  ".z4"             4
        V5Code_Extension  ".z5"             5
        V6Code_Extension  ".z6"             6
        V7Code_Extension  ".z7"             7
        V8Code_Extension  ".z8"             8

        Module_Extension  ".m5"     Linkable module file (version 5,
                                    which is all that Inform 6 supports yet)

        ICL_Extension     ".icl"    ICL file

    The debugging information file and the transcript file also have defined
    default-names which can be over-ridden in this section if desired:
    
        Transcript_File   "gametext.txt" or "gametext"
        Debugging_File    "gameinfo.dbg" or "gamedebug"

    If you do not define FILE_EXTENSIONS, then it is essential to define
    STANDARD_DIRECTORIES instead.  (You can also define both, if you so
    choose.)

    The STANDARD_DIRECTORIES option causes Inform to put all files of
    a particular kind into a standard directory for them: e.g., a "games"
    directory might hold the story files compiled, etc.  All that happens
    when a standard directory is defined is that Inform sets the default
    value of the relevant pathname variable to that standard directory:
    otherwise, its pathname variable starts out as "".
                                                                           
    The standard directories are, once again, defined by default as follows:
    once again you can define these settings yourself, but please don't
    do so without a good reason.

        Source_Directory      "source"
        Include_Directory     "library"
        Code_Directory        "games"
        Module_Directory      "modules"
        Temporary_Directory   ""
        ICL_Directory         ""

    Note that the actual user of Inform can still override anything you
    choose by setting the pathname with an ICL command.

    A good way to test all this is to run inform -h1, which does some
    experimental filename translations and prints the outcome.

    /* 5 */

    Section 5 contains information on how to choose the filenames for the
    three temporary files.  (Note that this needs to be done even if
    USE_TEMPORARY_FILES is not defined.)

    On many machines, you only need to give a suitable name.  (As usual,
    if you don't bother, something fairly sensible happens.)

    Temporary_Name is the body of a filename to use (if you don't set this,
    it becomes "Inftemp") and Temporary_Directory is the directory path for
    the files to go in (which can be altered with an ICL command).

    However, under some multi-tasking OSs it is desirable for multiple Inform
    tasks to work simultaneously without clashes, and this means giving
    the temporary files filenames which include some number uniquely
    identifying the task which is running.  If you want to provide this,
    define INCLUDE_TASK_ID and provide some code...
                                                                           
       #define INCLUDE_TASK_ID                                             
       #ifdef INFORM_FILE                                                  
       static int32 unique_task_id(void)                                   
       {   ...some code returning your task ID...                          
       }                                                                   
       #endif                                                              

    /* 6 */

    Finally, section 6 is "anything else".  In particular this is where
    DEFAULT_ERROR_FORMAT should be set.  This switches between different
    styles of error message.  (This is not a matter of aesthetics: some
    error-throwback debugging tools are very fussy about what format
    error messages are printed out in.)

For example, here is a typical OS definition block:

#ifdef UNIX
/* 1 */
#define MACHINE_STRING   "Unix"
/* 2 */
#define CHAR_IS_SIGNED
/* 3 */
#define DEFAULT_MEMORY_SIZE LARGE_SIZE
/* 4 */
#define FN_SEP '/'
#define FILE_EXTENSIONS
/* 5 */
#define Temporary_Directory "/tmp"
#define INCLUDE_TASK_ID
#ifdef INFORM_FILE
static int32 unique_task_id(void)
{   return (int32)getpid();
}
#endif
#endif
                                                                           

    2.5   Running Inform in a multi-tasking OS
          ------------------------------------

As mentioned above, if Inform is being used in a multi-tasking environment
then temporary file-naming will need a little attention.

Another issue is that under some systems the other tasks may all freeze
up while Inform is working, because tasks only voluntarily hand control
back to the OS (allowing it to poll the other tasks and share out the
processor time).  This means that some call to an OS primitive routine
may have to be inserted into Inform somewhere: a good place to do this
is in the routine reached_new_line() of the section "lexer.c".


    3   The front end
        -------------

    3.1   The ICL interpreter
          -------------------

Inform's front end is quite simple and there is little to say about it.
Its task is to translate the user's compilation command into five kinds of
ICL variable:

    switches, which are on/off flags controlling compilation options;
    switch settings, which are numerical values (between about 1 and 8)
        controlling more finely-controllable compilation options;
    path variables, which hold the text of some directory pathname;
    memory settings, which hold positive numbers (the extent of certain
        memory allocations within Inform);
    the filenames of the source code to read, and the object code to write.

See "inform.c" and the memory-setting-parser in "memory.c" for the details.


    3.2   Fundamental method
          ------------------

It is not possible, in almost any compiled language, to make a direct
translation from source to object code "a line at a time": the first time
a line is reached, it often cannot be finally translated until information
is known which cannot yet be known.  (For example, Inform translates an
object's name into a number indicating its position in the final game's
tree of objects.  If the name comes up before the definition of the object
has been seen, Inform cannot know what number to translate the name into.)

Inform makes only one pass through the entire source code.  The translation
it makes is (roughly speaking) left blank in places, and these blanks are
filled in at the very end, once the necessary information is available.
This process is called "backpatching": Inform is patching things up behind
itself.


    4   Lexical analysis
        ----------------

    4.1   Aim and structure of the lexer
          ------------------------------

The task of the lexical analyser, or "lexer" for short, is to translate the
text it reads into a sequence of "tokens".  The aim is that the rest of the
compiler should work with the tokens and need never look at the original text
again; Inform mainly achieves this aim.

Each token represents a sequence of characters, called a "lexeme", which is
indivisible from a syntactic point of view.  For instance, the text

    $12+duck.&feathers

contains five atomic pieces of syntax:

    Token                             Lexeme

    <the number 18>                   $12
    <the separator +>                 +
    <symbol number 1>                 duck
    <the separator .&>                .&
    <symbol number 2>                 feathers


The Inform lexer has three, or perhaps four levels:

    Level 1: reading in text, filtering out control characters, etc.

    Level 2: a context-free tokeniser
    
    Level 3: a context-dependent process of deciding whether any
             identifier names found are keywords or names made up
             by the programmer


To make a fairly revolting analogy: when parsing dinner, lexical analysis
is done with the teeth, and syntax analysis with the stomach.  But with
programming languages it is difficult to give a very precise point at
which one ends and the other begins: and in Inform, the "perhaps level 4"
of the lexer occurs when tokens are sorted out more finely in the expression
parser (is this an operator? is it a constant value? is it a variable name?),
which will be documented as part of the syntax analyser in this manual.
At any rate, "lexer.c" and "symbols.c" contain levels 1 to 3 only.

For details of some standard algorithms in compiler theory, the account below
and in section 5 gives page references to "ASU": this is Aho, Sethi and
Ullman (1986), "Compilers: Principles, Techniques and Tools", the so-called
"Dragon book".

It's called this because the front cover features a knight armoured in
various algorithms confronting a dragon labelled "Complexity of compiler
design": in the case of Inform, though, the dragon might reasonably be
twice as big and labelled "Ignorance of designer who originally chose the
syntax for Inform".  Compiler-making tools like "lex", "yacc" and "bison"
are difficult to apply to this language, since Inform doesn't have a
convenient LALR(1) grammar (more like LALR(4), in fact).  In any case the
lexical and syntax analysers are hand-coded for speed, which is generally
considered to produce code twice as fast as that generated by mechanical
tools like "yacc".


    4.2   Level 1: lexical blocks and buffered input
          ------------------------------------------

The lexer can take input from two sources: from the source code files which
are being compiled, or from a null-terminated string somewhere in the
compiler.  (The latter is used when compiling the "veneer" of run-time
routines.)

A "lexical block" is a piece of text which is individually named and
line-counted (so that error messages can indicate where an error has occurred
with reference to the original source).  Each single file of source code
(the original file and each Include file) is a lexical block in its own
right.  Likewise, if the lexer is reading from an internal string, then
that string is a lexical block.

The LexicalBlock structure holds data about the position inside such a block.
Note that several may be needed at once: if one source file includes another,
which includes another, then three different LexicalBlocks will contain
useful information.

However, information about the files currently being worked from is stored
in a different structure way, and not in LexicalBlock structures.  For
efficiency reasons, we can't read characters out of the files one at a time:
instead we read them in 4096-byte chunks into a buffer.  Note that this
means that the read-position of a file will be well ahead of the
read-position of the lexer within it.  It may have been closed before the
lexer is halfway through its buffer.

A further complication is that it's natural to use a stack structure to hold
the files being read from:

        include file 2          <-- reading from this
        include file 1          <-- still open but not being read from
        main source code file   <-- still open but not being read from

and it is also natural to use a stack structure to hold the LexicalBlocks
being worked through:

        LB for include 2
        LB for include 1
        LB for main source code file

Despite the apparent similarity here, we can't combine this into one stack,
since "include file 2" will have been pushed off the files stack before its
LB is pushed off the LB stack.  Since further Include directives may occur
at any point in the current LB, the stacks can be very different.


Otherwise level 1 is simple and fast.  Note that characters are fed up into
level 2 through a "pipeline" (see the next section for what this means and
why), and are also filtered.  Inform can read source files in any plain
ASCII or any of the ISO 8859 standard ASCII extensions (five variants with
accented Latin characters, plus Cyrillic, Arabic, Hebrew, Greek).  The
filtering process removes any character codes undefined in the current
ISO setting, and normalises new-line and spacing characters:

      00                         remains 0 (meaning "end of file")
      TAB                        becomes SPACE
      0d                         becomes 0a, i.e., '\n'
      other control characters become '?'
      7f                         becomes '?'
      80 to 9f                   become '?'
      a0                         (ISO "non-breaking space") becomes SPACE
      ad                         (ISO "soft hyphen") becomes '-'
      any character undefined in ISO between a0 and ff is mapped to '?'

(Here "ISO" means "the current ISO setting", which can be 0, meaning
plain ASCII only: in this mode any character value of 80 to ff will
be filtered to '?'.)

The filtering ensures that (i) no error message can contain an unprintable
character and (ii) the user cannot type a character outside a standard
ISO set and which might work on one platform but not on another.

Note that the 0 character is only fed upwards into level 2 when the entire
lexical source has run out: that is, all the source files are exhausted,
or else the string being analysed has run out.


    4.3   Level 2: breaking text into lexemes
          -----------------------------------

The input to level 2, then, is a stream of characters: and its output is a
stream of tokens, each of which represents a sequence of characters called a
"lexeme".

Some definitions of terms used in the Inform source code.  There is a fixed
set of "separators", all of them sequences of 1 to 3 mainly non-alphabetic
characters:

    ->     -->    --     -     ++    +    *    /    %
    ||     |      &&     &     ~~
    ~=     ~      ==     =     >=    >
    <=     <      (      )     ,
    .&     .#     ..&    ..#   ..    .
    ::     :      @      ;     [     ]    {    }
    $      ?~     ?
    #a$    #n$    #r$    #w$   ##    #

An "identifier" is a sequence of one or more characters from the set:

    A B C D ... Z a b c ... z 0 1 2 3 4 5 6 7 8 9 _

which does not begin with 0 to 9.  The lexer makes the longest lexeme it
possibly can for any particular token, so the next character after any
identifier will not be one of the set above.

The lexemes representing numbers are:

    (a) one or more chars from the set 0 1 2 3 4 5 6 7 8 9
    (b) $ followed by one or more chars from 0 1 2 3 4 5 6 7 8 9 A B C D E F
        (hexadecimal numbers)
    (b) $$ followed by one or more chars from 0 1
        (binary numbers)

Note that "-67" is not a lexeme for a single token: it is broken into two
lexemes, "-" and "67".

A ", followed by any number of characters and then another ", is a single
lexeme.  Similarly for '.

Finally, as a special case, the six separators

    #a$    #n$    #r$    #w$   ##    #

are expected to be followed by an identifier.  For example, "##Take" is
a single lexeme, and is not divided into "##" and "Take".  (Except that
#n$ can be followed by any character, so that e.g. #n$1 is a single token,
representing "the new dictionary word '1'" in the language.)

To sum up, the lexical analyser expects legal source code to contain:

    "comments": sequences beginning with a ! character and ending with a
    new-line or the end of the file or string containing it

    any of the lexemes above

    "white space", that is, characters in the set
        space  tab  new-line


When breaking down text into lexemes (and throwing away the white space
and comments), the lexer needs 3 characters of look-ahead.  That is, it can
decide definitely what lexeme character N belongs to provided that it knows
what characters N+1, N+2 and N+3 are.  (The figure 3 occurs because that's
the worst case arising: deciding whether character N is the start of a
separator in the form "#a$" followed by an identifier character.)

Because of this, level 1 maintains a "pipeline" of four variables to hold the
current character and the next three on the way.  By looking ahead at the
pipeline as needed, level 2 decides what to do with the current character and
then advances one character: the three waiting in the pipeline shuffle up and
a new character is read into the last place in the pipeline.  The advantage
of maintaining a pipeline is that the lexer never needs to undo any decisions
and go backward through the text.


One token is output for each lexeme found, and when the lexer runs out of
input altogether (when all the source files are finished, or when the string
being analysed has run out) a special token, called "end of file", is output. 
Thus the lexer's output is a sequence of tokens from the list:

    numbers
    strings in " quotes
    strings in ' quotes
    separators
    identifiers
    end-of-file

As will be seen in the next section, identifiers are analysed further before
being passed out of the lexer.

Except for the problem of deciding what an identifier means, the lexer is
"context-free": it needs to know nothing about the syntax of the Inform
language, what kind of code it is currently analysing, etc.  There is a
standard state-machine algorithm for writing such a lexer: see ASU, p. 98.
The tricky part is to efficiently store a transition table for the states
involved, which will be neither so small that slow code is required to
read it, nor so large that it takes up an unacceptable amount of memory.
Here the "tokeniser_grid" stores most of a transition table: this is an
algorithm originally suggested to me by Dilip Sequeira, similar to that used
by S. C. Johnson's "yacc" lexical analyser.


    4.4   Representation of tokens
          ------------------------

Tokens are internally stored as quadruples:

    typedef struct token_data_s
    {   char *text;
        int value;
        int type;
        int marker;
    } token_data;

though the lexer is not responsible for writing "marker" values, which are
the concern of the syntax analyser.  The "type" identifies which kind of
token it is (this value must be one of the *_TT set #define'd in "header.h"):
for example, DQ_TT means "a double-quoted string".  The "value" is a
number whose meaning depends on the type.  For example, in the case of
type DQ_TT, the number has no meaning and is not used; in the case of type
SEP_TT (for "separator"), the number gives the index in the above table
of possible separators, thus identifying which separator it is.

"text" is a pointer to a null-terminated string containing the lexeme. There
are two reasons this is needed: firstly, so that comprehensible error
messages can be printed out from higher up in the compiler, and secondly
because in the case of DQ_TT and SQ_TT the text may need to be compiled
into Z-text format and added to the static strings area, to the Z-code area
or to the dictionary.  The decision on whether the string should be compiled
and if so, what to, cannot be taken until code generation time.

Unfortunately code generation time may not be for a while, and this raises a
problem: we clearly need to keep lexeme text in memory for a while, but it
would be very costly on memory to keep the text of every token in memory
permanently.  When is it safe for the lexer to throw a lexeme away?

One solution would be for a formal system by which the code generator
explicitly deallocated the space, but this is too bureaucratic and too risky
(suppose we miss 0.1% of these? Memory will slowly clog up and begin to cause
odd errors on large compilation runs).  Instead, the lexer simply writes its
text cyclically into a large buffer.  Code generation almost always takes
place before the lexer has got any more than 100 or so bytes of text ahead;
since the buffer is a good 10K long, there is a generous safety margin.


    4.5   Level 3: identifying identifiers
          --------------------------------

A "keyword" is an identifier which has a meaning understood by the Inform
compiler: for example the lexeme

    while

is understood (in some contexts) to refer to the "while" statement, rather
than being a name coined by the programmer for some variable or object.

In most small languages such a lexeme would always be a keyword and never a
name.  (This is the concept of "reserved word": a word reserved for the use
of the compiler, forbidden as a name.) For example, the fairly-large language
C++ has 48 such keywords.

Inform is unable to reserve keywords (that is, forbid their use as names) for
two reasons: firstly, because there are 281 of the damn things (and 116 of
them are only significant in lines of assembly language: why forbid the use
of these to non-assembly-language programmers?), and secondly because it is
important to the language that some keywords definitely should be the same as
some object names.  "Class" is both a keyword meaning "the class-making
directive" and a name meaning "the object representing the class of all
classes".

So the decision "is this identifier intended to be a keyword or intended
to be the name of something?" relies on the context.  What the lexer does
is to divide the 281 keywords into twelve groups.  The syntax analyser
switches these groups on or off as it finds its way through the program.
For example, the group "opcode_names" is only enabled (i.e., switched on)
after an @ character has been read at the beginning of a statement.

A complication is that case is sensitive for some of these groups and not
others.  For example, "IFDEF and "Ifdef" both match the directive name
"ifdef", but "WHILE" and "While" do not match the statement name "while".
The rules in fact are:

         When matching...                        Example          Sensitive?

         Language keywords:
             directive names                     Class            No
             keywords used within directives     with             Yes
             statement names                     while            Yes
             keywords used within statements     else             Yes
             condition names                     hasnt            Yes
             built-in function names             random           Yes
             built-in constant names             #code_offset     Yes

         Assembly language keywords:
             opcode names                        put_prop         Yes
             customised opcode names             @"1OP:4S"        Yes
             the stack pointer name              sp               Yes

         Names for variables, objects, etc.      lantern          No
         Names for actions                       Take             No

Moreover, in contexts where local variables are allowed, they take
precedence over the rest.  (For example, you could define a local variable
called "random", and for the duration of that routine the word "random"
would never be recognised as the keyword for the system function random().)

Each keyword group has its own token type.  The token value for a keyword
token is the index within the group: see "header.h" for #define's for
all of these indices.

Finally, there are times when the syntax analyser just wants the raw text
of an identifier, and doesn't want it identified (for efficiency reasons:
this prevents it from being entered into the symbols table).  If the
dont_enter_into_symbols_table flag is set, the lexer simply returns a
token of type DQ_TT as though the name had been found in double-quotes.


    4.6   Hash-coding and comparing names
          -------------------------------

It would be computationally very expensive indeed to decide whether string X
is a keyword or not by carrying out direct string comparisons with all 281
keywords.  Instead, hash-coding is used.  The idea is to perform a fast
operation on X whose result is some number from, say, 1 to 100: this is the
hash code of X.  If X is the same string as K, then K must have the same hash
code.  So we organise the keywords into 100 different groups, the group with
hash code 1, with hash code 2, etc.: then if h is the hash code of X, we only
need to compare X with the keywords in group h.

For this to be any use, the operation performed has to be chosen so that
the keywords are sorted out into groups with about equal numbers in each.
Many sensible "hash functions" are known (see ASU, p. 437 for experimental
performance data).  Inform's choice is simple but effective: it uses
multiplication rather than bit-shifting, but then on modern CPUs
multiplication is not very expensive.  The algorithm is what ASU would
call "X 30011":

  extern int hash_code_from_string(char *p)
  {   unsigned int32 hashcode=0;
      for (; *p; p++) hashcode=hashcode*30011 + case_conversion_grid[*p];
      return (int) (hashcode % HASH_TAB_SIZE);
  }
 
where "hashcode" is meant to overflow repeatedly.  Note that 30011 is a prime
number.  (case_conversion_grid[*p] is just a fast way to convert all lower
case letters to upper case ones, since we want case to be insignificant when
calculating hash codes.)  It doesn't matter if this algorithm behaves
differently on different machines (depending on how the CPU copes with
overflows): all that matters is it delivers a reasonable spread of hash
values.

The hash code range is 0 to HASH_TAB_SIZE-1: this is a memory setting
which by default is 512.  Reducing this by any substantial amount causes a
significant slowing down of Inform.


    4.7   The symbols table
          -----------------

If an identifier, in context, isn't a keyword then it is called a "symbol":
this basically means "a name for something created by the programmer"
(except that Inform creates a few symbols automatically as well).

The section "symbols.c" keeps a table of all the symbols which have turned
up so far.  (It often picks up a few extraneous names of keywords, too,
when the lexer has been looking ahead in the wrong context: but this doesn't
matter and it's faster to tolerate the slight waste of memory.)

The symbols table is organised into HASH_TAB_SIZE-1 linked lists, each of
which is kept in alphabetical order.  A single operation is provided for
searching it: symbol_index(), which works out the hash code of the string S
supplied (unless it's already known, in which case it need not work it out
again) and looks through the linked list for that hash code until the
first letter matches.  (Thus, the only full string comparisons made are
between S and those strings in the table which have the same hash code
and the same initial letter.)  If S is not present, it is inserted into
the table.

In addition to its text, each symbol in the table has four pieces of data
attached: its type, its value, the line on which it was assigned and its
flags.  The symbol with index i has:

        symbs[i]        text             
        stypes[i]       type             char
        svals[i]        value            int32
        sflags[i]       flags word       int
        slines[i]       line number      int32

(It would be more elegant to store this as an array of structures, but
the symbols table is a major consumer of memory, so we can't afford to let
lazy compilers waste it by aligning the fields of such a structure
at 4 byte intervals.  Hence, five separate arrays.)

Types and values are assigned by the syntax analyser: the type indicates
what kind of thing the symbol is a name of (a routine, a label, an object,
etc.), and must always be one of the #define names *_T (see "header.h").  The
value holds some number indicating which particular thing the symbol is a
name of (a routine address, an object number, etc.).

The line number is set when the symbol is first used, and then reset when it
is first assigned a value (if ever).  The information is stored as

    file-number*0x10000 + line-number

where file-number is an index into the InputFiles array maintained by
"files.c".

The flags word holds (presently) 15 different flags, for different properties
that the symbol name can have.  These flags have #define names in the form
*_SFLAG.

The token for a symbol has type SYMBOL_TT and value equal to the index i of
the symbol within the symbols table.


    4.8   Token lookahead: the circle
          ---------------------------

The output of the lexer is sent, one token at a time, when the syntax
analyser calls get_next_token().

The syntax analyser is a predictive parser which makes mistakes and has to
backtrack (that is, it reads in token X, guesses that it might be looking at
XY and reads the next token to see... but reads a Z instead, and has to go
back to X and try a different theory).  So the syntax analyser needs to be
able to go backwards, as well as forwards, in the stream of tokens.

This means that the lexer needs to remember its last N tokens, where N is
the maximum number of backward steps the syntax analyser ever takes in one
go.  The number N is the amount of "lookahead", since one could also think
of it as the maximum tokens one needs to look ahead of the current position
to be sure of what is happening.

These tokens are therefore stored in a "circle" of N+1 tokens: when
get_next_token() is called, the lexer works out the next token, writes
it into the circle and moves on one place clockwise; when put_token_back() is
called, the lexer moves back one place anticlockwise, so that the next
get_next_token() will send back the token in that position rather than work
out a new one.

However, the interpretation of identifiers as keyword or symbol depends on
the current context, and this may have changed since the token was first
calculated.  Therefore a numerical representation of the context is stored
in the circle too, so that if this has changed and if the token is an
identifier then it is re-identified under the context which now prevails.


    4.9   Summary of the lexer's output
          -----------------------------

To summarise, the lexer converts source text to a stream of tokens of types
as follows:

    --------------------------------------------------------------------------
    Type                  Lexeme                    Value
    --------------------------------------------------------------------------
    NUMBER_TT             literal number in         the number
                          decimal, hex or binary
    DQ_TT                 string extracted from     (none)
                          double-quotes
                          (or identifier name:
                          see above for when)
    SQ_TT                 string extracted from     (none)
                          single-quotes
    SEP_TT                separator                 a #define'd *_SEP value
    EOF_TT                (end of source)           (none)

    SYMBOL_TT             identifier assumed        index in the symbols
                            to be a name              table

    LOCAL_VARIABLE_TT     local variable name       1 to 15, its Z-machine
                                                      variable number
    STATEMENT_TT          statement name            a #defined *_CODE value
    SEGMENT_MARKER_TT     with/has/class etc.       a #defined *_SEGMENT value
    DIRECTIVE_TT          directive name            a #defined *_CODE value
    CND_TT                named conditional         a #defined *_COND value
    SYSFUN_TT             built-in function         a #defined *_SYSFUN value
    OPCODE_NAME_TT        opcode name               a #defined *_zc value
                                                      (i.e., an internal
                                                      opcode number)
    MISC_KEYWORD_TT       keyword like "char" used  a #defined *_MK value
                          in statement syntax       
    DIR_KEYWORD_TT        keyword like "meta" used  a #defined *_DK value
                          in directive syntax       
    TRACE_KEYWORD_TT      keyword used in syntax    a #defined *_TK value
                          of "trace" directive
    SYSTEM_CONSTANT_TT    system #constant name     a #defined *_SC value
    --------------------------------------------------------------------------

Note that other token types do exist, but are set in "level 4": i.e., within
the expression parser.


    5   Syntax analysis
        ---------------

    5.1   Aim and structure of the syntax analyser
          ----------------------------------------

The syntax analyser is the heart of Inform, and contains within it the
specification of the language.

Inform code fundamentally consists of directives, which are instructions to
the compiler to do something, and routines containing statements, which are
lines of code to be executed at run-time.

The syntax analyser takes as input the stream of tokens from the lexer.  Some
of these are interpreted as directives and acted on immediately.  Others
are realised to be statements and compiled; we can regard the output of the
compiling part of the analyser as being a stream of Z-machine assembly
language, passed down into "asm.c".

In most modern compilers, the syntax analyser would convert the token
stream into a "parse tree" expressing the structure of the input: something
along the lines of

                   routine
                  /      \
              statement   statement
                 |           |
                "if"         ...etc., etc.
               /    \
          condition  statement
              |         |
            "=="     "print"
            /   \       |
          "x"   "0"  "Good heavens!"

This is aesthetic, and makes possible many optimisations (such as elimination
of common sub-expressions), since the compiler is able to see the whole
structure before beginning to compile it.  But it uses up a fair amount of
memory (admittedly, most compilers only keep the tree for the current
routine, not the tree for the entire program); and a lot of speed, as the
tree is trawled through again and again.

Characteristically, Inform instead uses an algorithm which is about 40 years
old (see ASU, p.181 et seq).  It doesn't generate a parse tree for whole
routines, though it does do so for expressions, assignments and conditions:
for example, it would have built the little tree:

            "=="
            /   \
          "x"   "0"

For higher level-parsing, it works "top-down", making recursive function
calls which mimic a depth-first traversal of the tree drawn above.  That is,

    routine()
        calls statement()
            which reads the token "if"
            and calls condition()
                which works out the expression "x == 0"
            and compiles some code to perform this test and
            then to skip the next instruction if it's false
            and calls statement()
                which reads the token "print"
                and then the token "Good heavens!"
                and compiles some code to print this
        and routine() next calls statement()
            which reads... etc., etc.

Although we are only able to go through the tree once, it's a quick and
efficient trip, effectively using the C run-time stack to hold the parse
tree structure.  The result is also a rather legible piece of source
code (as compared, for example, with the output from "yacc").  (The main
reason we can escape from having to make parse trees is that our target
machine, the Z-machine, has an architecture making register allocation
unnecessary: the variables are the registers, and there's no space or
time penalty for use of the stack.)

The syntax analyser begins with parse_program() in the section "syntax.c".
Because its structure is embodied in the source code, there is relatively
little to say about it except to refer the reader to that: and to the
the table in section 1.1 above.


    5.2   Predictive parsing
          ------------------

Note that tokens can be read from many different places, and code can be
passed out from many different places: whereas the lexer and the assembler
each have a single channel of input and output.

As a typical example of the syntax analyser working as a "predictive parser"
and having to "backtrack", consider how parse_action() begins to parse through
action statements like

    <Take fishknife>;          tokenised as    <
                                               symbol "Take"
                                               symbol "fishknife"
                                               >

When it's called, the first token, "<" has already been read (this is how
parse_statement() knew to call parse_action() in the first place).

What it does is:

    "Predict" that it's going to be a << ... >> statement
    Ask the lexer for a token, expecting it to be another "<"
    Discover that, unfortunately, it's a symbol, so the prediction was wrong
    Backtrack to the point where it made the wrong prediction
        (this actually only means giving the token "Take" back into
        the lexer again)
    Now predict the next possibility, that it's a < ... > statement
    Ask the lexer for a token, expecting it to be a symbol name
    Discover that this time it is ("Take"), so the prediction was correct
    ... and so on.

The code to do this is more like:

    get_next_token();
    if (token == "<") return parse_double_angle_action();
    put_token_back();
    return parse_single_angle_action();

(The actual code doesn't make these inefficient function calls, and token
comparison is a bit fiddlier, but this is the idea.)


Clearly, the longer such a prediction goes on before it is found to be
wrong, the more tokens which the lexer has to take back again.  The syntax
of the language determines this number, N: for many languages N = 1 (because
they've been designed that way) but for Inform N = 5 (because it was defined
by an ignorant oaf).  (Though, as will appear in the next section, it would
be possible to reduce this by implementing the grammar differently: and
there are compensations, such as the relative conciseness of Inform code.)


    5.3   The context-free grammar
          ------------------------

Here are the productions which the top-down syntax analyser implements.
(Most of the major ones are handled by routines.)

"void_expression", "constant", "condition" and "quantity" are left undefined:
these are handled by the expression parser according to an operator-precedence
grammar.  "vcondition" is a condition whose first token is a VARIABLENAME.
Symbols are represented in capitals; NEWNAME means one not so far assigned
a value, while CLASSNAME, etc., mean an existing symbol of the given type.
Obsolete features which are still supported are omitted.

Details of conditional compilation are abbreviated heavily: the notation
PROGRAM means "anything at all until the right directive keyword comes up
at the same #if... level".

  =========================================================================
  program      ->    directive ; program

  directive    ->    [ NEWNAME routine
                     "Abbreviate" strings
                     "Array" NEWNAME arraytype array
                     "Attribute" NEWNAME
                     "Attribute" NEWNAME "alias" ATTRIBUTENAME
                     "Class" NEWNAME objbody
                     "Class" NEWNAME ( constant ) objbody
                     "Constant" NEWNAME
                     "Constant" NEWNAME constant
                     "Constant" NEWNAME = constant
                     "Default" NEWNAME constant
                     "End"
                     "Extend" extension
                     "Fake_action" NEWNAME
                     "Global" NEWNAME
                     "Global" NEWNAME = constant
                     "Ifdef" NAME condcomp
                     "Ifndef" NAME condcomp
                     "Iftrue" constant condcomp
                     "Iffalse" constant condcomp
                     "Ifv3" constant condcomp
                     "Ifv5" constant condcomp
                     "Import" manifest
                     "Include" STRING
                     "Link" STRING
                     "Lowstring" NEWNAME STRING
                     "Message" diagnostic
                     "Nearby" objhead objbody
                     "Object" arrows objhead objbody
                     "Property" NEWNAME propdef
                     "Release" quantity
                     "Replace" ROUTINENAME
                     "Serial" STRING
                     "Statusline" "score"
                     "Statusline" "time"
                     "Stub" NAME quantity
                     "Switches" TEXT                <---- An unquoted string
                     "System_file"
                     "Trace" trace
                     "Verb" verb
                     "Zcharacter" zchar
                     CLASSNAME arrows objhead objbody

  condcomp     ->    ; PROGRAM ; "Endif"
                     ; PROGRAM ; "Ifnot" ; PROGRAM ; "Endif"

  manifest     ->    import "," manifest
                     import
  import       ->    "global" NEWNAME

  diagnostic   ->    STRING
                     "error" STRING
                     "fatalerror" STRING
                     "warning" STRING

  propdef      ->    propdefault
                     "additive" propdefault
  propdefault  ->    <empty>
                     quantity
  =========================================================================
  trace        ->    t_section t_level                    Tracing
  t_section    ->    <empty>
                     "objects"
                     "symbols"
                     "verbs"
                     "assembly"
                     "expressions"
                     "lines"
  t_level      ->    <empty>
                     "on"
                     "off"
                     quantity
  =========================================================================
  zchar        ->    char_spec                            Character set
                     STRING STRING STRING
                     "table" char_specs
                     "table" "+" char_specs
  char_specs   ->    char_spec char_specs
                     char_spec
  char_spec    ->    LITERAL_CHARACTER
  =========================================================================
  arrows       ->    "->" arrows                          Object definition
                     <empty>
  objhead      ->    <empty>
                     NEWNAME
                     STRING
                     NEWNAME STRING
                     NEWNAME OBJECTNAME
                     STRING OBJECTNAME
                     NEWNAME STRING OBJECTNAME
  objbody      ->    segment objbody
                     segment "," objbody
  segment      ->    "class" class_s
                     "with" with_s
                     "private" with_s
                     "has" has_s
  class_s      ->    CLASSNAME class_s
  with_s       ->    PROPERTYNAME property
                     PROPERTYNAME property "," with_s
  has_s        ->    attributes
  property     ->    <empty>
                     [ routine
                     arrayvals
  =========================================================================
  arraytype    ->    ->                                   Arrays
                     -->
                     "string"
                     "table"
  array        ->    constant
                     STRING
                     arrayvals
                     [ manyvals ]
  manyvals     ->    constant manyvals
                     constant ; manyvals
  arrayvals    ->    constant arrayvals
  =========================================================================
  extension    ->    "only" strings priority grammar      Grammar
                     STRING priority grammar
  priority     ->    <empty>
                     "replace"
                     "first"
                     "last"

  verb         ->    strings v_setting
                     "meta" strings v_setting
  v_setting    ->    = STRING
                     grammar

  grammar      ->    g_line grammar
  g_line       ->    * tokens -> g_action
  g_action     ->    ACTIONNAME
                     ACTIONNAME "reverse"

  tokens       ->    g_token tokens
  g_token      ->    preposition
                     "noun"
                     "held"
                     "multi"
                     "multiheld"
                     "multiexcept"
                     "multiinside"
                     "creature"
                     "special"
                     "number"
                     "topic"
                     "noun" = ROUTINENAME
                     "scope" = ROUTINENAME
                     ATTRIBUTENAME
                     ROUTINENAME

  preposition  ->    DICTIONARYWORD
                     DICTIONARYWORD / preposition

  strings      ->    STRING strings

  attributes   ->    attribute attributes
  attribute    ->    ATTRIBUTENAME
                     ~ ATTRIBUTENAME
  =========================================================================
  routine      ->    * locals ; body ]                    Everything below
                     locals ; body ]                      here is code to
  locals       ->    NAME locals                          compile
                     <empty>

  body         ->    action_case : body
                     "default" : body
                     statement body
  action_case  ->    ACTIONNAME
                     ACTIONNAME , action_case

  block        ->    ;
                     statement ;
                     { states }
                     { states . NEWNAME ; }
  states       ->    statement states;

  s_block      ->    { s_states }
                     { s_states . NEWNAME ; }
  s_states     ->    case : s_states
                     "default" : s_states
                     statement s_states
  case         ->    range :
                     range , case
  range        ->    constant
                     constant "to" constant
  =========================================================================
  statement    ->    . NEWNAME ; statement
                     "@" assembly ;
                     "<" action ">" ;
                     "<<" action ">>" ;
                     indiv_state
                     STRING
                     void_expression

  action       ->    ACTIONNAME
                     ACTIONNAME quantity
                     ACTIONNAME quantity quantity

  assembly     ->    opcode operands options
  opcode       ->    OPCODENAME
                     STRING                          <--- customised opcode
  operands     ->    operand operands                     descriptions are
  operand      ->    constant                             not parsed by the
                     VARIABLENAME                         syntax analyser
                     sp
  options      ->    ? branch
                     -> store
                     -> store ? branch
  store        ->    VARIABLENAME
                     sp
  branch       ->    LABELNAME
                     ~ LABELNAME
  =========================================================================
  indiv_state  ->    "box" strings ;
                     "break" ;
                     "continue" ;
                     "do" eblock "until" ( condition ) ;
                     "font" "on" ;
                     "font" "off" ;
                     "for" ( for1 : for2 : for3 ) block ;
                     "give" quantity attributes ;
                     "if" ( condition ) block
                     "if" ( condition ) block "else" block
                     "inversion" ;
                     "jump" LABELNAME ;
                     "move" quantity "to" quantity ;
                     "new_line" ;
                     "objectloop" ( ospec ) block ;
                     "print" printlist ;
                     "print_ret" printlist ;
                     "quit" ;
                     "read" quantity quantity ;
                     "read" quantity quantity ROUTINENAME ;
                     "remove" quantity ;
                     "restore" LABELNAME ;
                     "return" ;
                     "return" quantity ;
                     "rtrue" ;
                     "rfalse" ;
                     "save" LABELNAME ;
                     "spaces" quantity ;
                     "string" quantity STRING ;
                     "style" textstyle ;
                     "switch" ( quantity ) s_block
                     "while" ( condition ) block
  for1         ->    void_expression
                     <empty>
  for2         ->    condition
                     <empty>
  for3         ->    void_expression
                     <empty>
  ospec        ->    VARIABLENAME
                     VARIABLENAME "in" quantity
                     VARIABLENAME "near" quantity
                     VARIABLENAME "from" quantity
                     vcondition
  printlist    ->    printitem , printlist
                     printitem
  printitem    ->    STRING
                     quantity
                     ( form ) quantity
  form         ->    ROUTINENAME
                     "number"
                     "char"
                     "address"
                     "string"
                     "The"
                     "the"
                     "a"
                     "an"
                     "name"
                     "object"
                     "identifier"               
  textstyle    ->    "roman"
                     "reverse"
                     "bold"
                     "underline"
                     "fixed"
  =========================================================================

A few critical comments on the above.

From this grammar it's possible to work out N, the maximum look-ahead needed
to distinguish which production is being used in the source code.  The worst
cases are:

   (a) distinguishing productions (1) and (2) in

       ospec     -> VARIABLENAME
                    VARIABLENAME "in" quantity     (1)
                    VARIABLENAME "near" quantity
                    VARIABLENAME "from" quantity
                    vcondition                     (2)
     
       which requires N = 5; these two productions need to be distinguished
       between because

       objectloop ( a in b )
       objectloop ( a in b ... )

       (the second containing a compound condition) compile quite different
       code: one loops through the siblings in the object tree, the other
       through the tree in numerical order.

   (b) distinguishing productions (1) and (2) in

       printitem    ->    STRING
                          quantity                 (1)
                          ( form ) quantity        (2)

       i.e., between 

       print (routine-name) expression
       print (expression which happens to begin with a bracket)

       which requires N = 3.

The grammar contains one serious ambiguity: the innocent-looking production

       arrayvals    ->    constant arrayvals

means that array initialisations list constants in sequence without commas
or other separating characters.  This makes it impossible to distinguish
between unary and binary minus in a line like:

       Array X 2 - 1 ;     

The answer is "binary", since the grammar makes the longest match possible;
but a "this is ambiguous" warning is issued.

A further inconvenience in the grammar, though not much of an ambiguity,
occurs in the initialisation part of "for" statements: there is a danger of

       for (p=Class::)

being mis-parsed due to "::" being recognised as a binary operator (without
a right operand, which would cause an error) and not as two consecutive
":" delimiters.

If I were free to redesign the Inform grammar in the light of the last
three years' experience (which I am loath to do, since so much Inform source
code now exists), here are the changes I think I would make:

       introduce commas as separators between array values and <<Action ...>>
           parameters;
       remove the statements quit, restore, save, style, font, spaces, box,
           inversion and read: the first three ought to be used via the
           assembler anyway, and the last six ought to be system-provided
           functions;
       use single quotes to refer to dictionary words used as values of "name",
           thus removing an anomalous rule going back to Inform 1, and to
           refer to dictionary words in grammar;
       find some notation for literal characters which does not look like
           the notation for dictionary words: e.g., ''z'' rather than 'z';
       abolish the distinction between actions and fake actions;
       rename the Global directive "Variable";
       require an = sign to be used in "Constant X 24" directives.


    5.4   Assigning values to symbols
          ---------------------------

Assigning values to symbols is the main way by which the syntax analyser
changes the way it will behave further on in the program.  From a strict
theoretical point of view one could insist the symbols table contains the
only information remembered by the syntax analyser about the program so far,
which otherwise churns out code which it forgets as soon as it has written:
however, Inform's analyser also keeps a number of other data structures,
such as the game dictionary and the object tree.

When the lexer creates a new symbol (that is, when the table is searched
for a string which isn't there, so that it is added to the table as a new
entry), it has:

    value   256
    type    CONSTANT_T
    flags   only UNKNOWN_SFLAG
    line    the current source line

"Unknown" means that the syntax analyser doesn't recognise it as meaning
anything yet.  The flag will only be cleared when the syntax analyser
assigns some value to the symbol (using the assign_symbol() routine), if ever.

The line is reset to the source line then applying if the symbol is assigned
a value, but is otherwise never changed.

The type and value of a symbol are only altered via assign_symbol().  With
one exception (see CHANGE_SFLAG below), the value once assigned is never
subsequently altered by reassignment.  Each symbol always has one of the
following 11 types:

    Type           Meaning                       Value
    ------------------------------------------------------------------------
    CONSTANT_T     Defined constant or           Value of constant
                       value not known yet
    LABEL_T        Name of label in source       Label number
    GLOBAL_VARIABLE_T
                   Name of global variable       Z-machine variable number
    ARRAY_T        Name of array                 Z-machine byte offset
                                                   in dynamic variables area
    ROUTINE_T      Name of routine               Scaled offset in Z-code
    ATTRIBUTE_T    Name of attribute             Z-machine attribute number
    PROPERTY_T     Name of common property       Z-machine property number
                                                   between 1 and 63
    INDIVIDUAL_PROPERTY_T
                   Name of indiv property        Property identifier >= 64
    OBJECT_T       Name of object                Z-machine object number - 1
    CLASS_T        Name of class                 Z-machine object number - 1
                                                   of its class-object
    FAKE_ACTION_T  Name of fake action           Action number >= 256
    ------------------------------------------------------------------------

The full list of symbol flags, and their meanings, is as follows:

    ------------------------------------------------------------------------
    UNKNOWN_SFLAG      no value has been assigned to this symbol
    USED_SFLAG         the value of this has been used in an expression
    REPLACE_SFLAG      the programmer has asked to Replace a routine
                           with this name
    DEFCON_SFLAG       this constant was defined by the "Default"
                           directive
    STUB_SFLAG         this routine was defined by the "Stub" directive
                           (similarly)
    INSF_SFLAG         this symbol was originally assigned a value in a
                           system_file
    SYSTEM_SFLAG       this symbol was assigned a value by Inform itself,
                           as one of the standard stock (such as "true"
                           or "recreate") provided to all programs
    UERROR_SFLAG       a "No such constant as this" error has already been
                           issued for this name
    ALIASED_SFLAG      this is an attribute or property name whose value
                           is shared with another attribute or property name
    CHANGE_SFLAG       this is a defined Constant whose value needs later
                           backpatching, or is a Label whose label number
                           has been allocated before its declaration in the
                           code
    ACTION_SFLAG       this is a ## name (of an action or a fake action)
    REDEFINABLE_SFLAG  the symbol can be defined more than once using the
                           "Constant" directive, without errors being
                           produced.  (Used for the special constants DEBUG,
                           USE_MODULES, MODULE_MODE and Grammar__Version.)
    ------------------------------------------------------------------------
    IMPORT_SFLAG       this name is "imported" (used in module compilation)
    EXPORT_SFLAG       this name was "exported" from some module
                           (used in story file compilation)
    ------------------------------------------------------------------------

USED_SFLAG is used only to decide whether or not to issue "declared but not
used" warnings.  A symbol has been "used" if one of the following is true:

    (i)    its value occurred in an expression;
    (ii)   it has been assigned, and tested with IFDEF or IFNDEF;
    (iii)  it's an action routine name used in grammar or via ##
           (e.g. use of ##Take causes TakeSub to be "used");
    (iv)   it's a fake action name used via ##;
    (v)    it's a property, attribute or class name used in an
           object or class definition;
    (vi)   it's a label name branched to in assembly language or
           the destination of a "jump" statement;
    (vii)  it's the routine name "Main";
    (viii) it's a routine name referred to by the obsolete #r$
           construct;
    (ix)   it's a routine name of a veneer routine whose definition
           is being over-ridden in the source code, and to which a
           call has been compiled;
    (x)    it's a routine name used in a grammar token;
    (xi)   it's referred to in a module which has been linked into
           the story file being compiled.

Note that such warnings are not issued for object names, since in much
Inform 5 code objects are given irrelevant object symbol-names (the syntax
required it); nor for symbols defined by Inform, or in the veneer, or in
a system file.  Warnings are never issued (except for labels and local
variables, which have local scope) when compiling modules, since there is
no way of knowing at compile time which exported symbols will be used and
which will not.

The warnings are issued at the end of the source code pass, but before the
veneer is compiled.  The slines[] array is used to work out which source
lines to report from.

CHANGE_SFLAG is used when definitions such as:

    Constant frog_word = 'frog';

are reached, where the value (the dictionary address for 'frog') cannot
immediately be known.  Such symbol values are backpatched later as needed.


All symbols in the table have "global scope" (that is, their definitions are
valid everywhere in the source program) except for label names, whose scope
is restricted to the current routine.  Thus, the same label name can be used
in many different routines, referring to a different label in each.

To achieve this, the routine-end routine in "asm.c" cancels the assignment
of label names, returning them to type CONSTANT_T and flag UNKNOWN_SFLAG.

Local variables also have local scope, but for efficiency reasons these are
not stored in the symbols table but in a special hash table in level 3 of
the lexer.


Note that action names have a different "name-space" from other symbols:
this is why the library's action name "Score" is never confused with its
variable "score".  Rather than use a different symbols table altogether,
actions are stored as integer constants in the form

    Score__A

(thus the value of this symbol is the value ##Score).  Similarly, fake
actions are stored this way (but with type FAKE_ACTION_T rather than
INTEGER_CONSTANT_T); both forms of symbol are flagged ACTION_SFLAG.


    5.5   Outputs other than assembly language
          ------------------------------------

Below the parse_routine() part of the syntax analyser, the only output is
assembly language (together with dictionary words and encoded text as needed
within it).

However, parse_directive() has several other outputs, as shown on the source
code map (section 1.2 above).  Directives create objects to be remembered and
written into the final program: arrays, verb definitions, actions, objects
and so on.  These will be called "program elements".

Directives also "create" constants, attributes, properties and fake actions,
but in these cases creation only consists of assigning a suitable value to a
symbol.  So these do not count as program elements.

The data structures to hold "program elements" are all created and maintained
within "directs.c" and its subsidiaries (such as "objects.c", the
object-maker); they are then translated into a usable Z-machine form in
"tables.c".  (With only trivial exceptions, the data structures are not
accessed anywhere else in Inform.)

    ---------------------------------------------------------------
    Program element      Section      Name of (main) data structure
    ---------------------------------------------------------------
    Global variable      arrays.c     global_initial_value[]
    Array                arrays.c     dynamic_array_area[]
    Release number       directs.c    release_number
    Serial code          directs.c    serial_code_buffer[]
    Statusline flag      directs.c    statusline_flag
    Object tree          objects.c    objects[]
    Common property      objects.c    prop_default_value[]
      default value
    Class-to-object-     objects.c    class_object_numbers[]
      numbers table              
    Common property      objects.c    *properties_table
      values for
      objects
    Individual prop      objects.c    *individuals_table
      values for
      objects
    Table of static      symbols.c    individual_name_strings[]
      string values
      for property
      names
    And attribute names  symbols.c    attribute_name_strings[]
    And action names     symbols.c    action_name_strings[]
    Abbreviation         text.c       abbreviations_at[]
      table entry
    Grammar table        verbs.c      Inform_verbs[]
    Token-routine        verbs.c      grammar_token_routine[]
      addresses
    List of dict         verbs.c      adjectives[]
      addresses for
      "adjective" words
    Action routine       verbs.c      action_byte_offset[]
      addresses
    ---------------------------------------------------------------

A rather unequal gathering: the storage required to hold the common property
values table may be as much as 32K, whereas the statusline flag can be held
in 1 bit.

There are three other program elements: Z-code, kept by "asm.c"; the static
strings of Z-encoded text, kept by "text.c"; and the dictionary, also kept by
"text.c".


    5.6   Assembly operands
          -----------------

The type "assembly_operand" is used for all numerical values (and local or
global variable references) which are destined one day to be written into the
Z-machine.  Many of these will indeed be operands in Z-code instructions,
hence the name.  Others will be property values, array entries and so on.

The definition is:

    {   int   type;
        int32 value;
        int   marker;
    }

which is a pretty generous memory allocation for holding a signed 16-bit
number.  (However, there are no large arrays of this type; type and marker
could easily each have type char, but I suspect this would be slower because
of alignment problems on some compilers; and speed does matter here.)

The possible types are:

    SHORT_CONSTANT_OT       number with value between 0 and 255
    LONG_CONSTANT_OT        number with any 16 bit value
    VARIABLE_OT             reference to the stack pointer if value is 0
                            local variable if value 1 to 15
                            global variable if value 16 to 255

In addition, two special types are in use by the expression parser:

    OMITTED_OT              (the operand holds no information)
    EXPRESSION_OT           reference to a parse tree

The "marker" value is used to record the origin of the data, which is
essential to make backpatching work.  For example, in the line

    v = "Jackdaws love my big sphinx of quartz";

the right operand is marked with STRING_MV, because the value which needs
to be stored in the Z-machine is a packed string address and this cannot be
known until after the compilation pass (when the size of the tables and
the code area are known).  Wherever the operand is written, in code or in
a table, "bpatch.c" will later find and correct it.


    5.7   Translation to assembly language
          --------------------------------

The Z-machine has a very easy architecture to generate code for.  Most
compilers' syntax analysers generate a "three-address code" as an
intermediate stage towards final object code; this is a sequence of
instructions in the form

    x = y <operator> z

together with conditional branches and labelled points to branch to.  (See
ASU, p.466.)  Translating this code into assembly language for some CPU
is a further compilation phase: the tricky part is not translating the
operators into instructions, but deciding where to locate the values
x, y and z.  On most CPUs, a limited number of registers can hold values, and
arithmetic operations can only be performed on these registers; moreover,
holding data in a register is much more efficient than holding it elsewhere.
The problem is therefore to allocate registers to quantities, and the
performance of the compiled code depends very heavily on how well this is
done.  (Register allocation is far from being a solved problem in the way
that lexical analysis, say, is considered to be.)

What makes the Z-machine particularly easy is that its instruction set is
more or less three-address code already.  Arithmetic can be performed with
constants as operands as well as with "registers"; and not only is every
local or global variable automatically allocated to a register of its own,
but a stack is available to hold intermediate values (and with no performance
penalty for its use, since it is accessible as though it were a register).

The key point to remember when looking at Z-code is that writing a value to
the "stack-pointer variable" pushes it onto the Z-machine stack; using the
"stack-pointer variable" as the operand for an instruction pulls the value
being read off the stack.


Despite the availability of the stack, it's still convenient to have a few
spare variables to use as "registers" holding temporary values.  Inform
reserves the variables 249, 250, 251, ..., 255 for its own use: though this
slightly exaggerates the position, since two of these are used for the
variables "self" and "sender".  One more is used to hold the switch-value
of the current switch statement (if there is one); the remaining four in
order to implement various system functions (such as "children") with
inline-code rather than routine calls to the veneer.  (The one inconvenience
with the Z-machine's instruction set is that there's no good way to read the
top of the stack non-destructively, or to duplicate the top value, or to
reorder the top few values.)


The syntax analyser produces code by making function calls to the assembler.
There are four types of function call:

    assemble_routine_header(...)
    assemble_routine_end(...)

at the start and end of every routine;

    assemble_label_no(N)

to indicate that the Nth label belongs here (i.e., at the point where the
next instruction will be put); and then a fair variety of function calls
to generate actual instructions.  For example,

    assemble_jump(N)

assembles an unconditional jump to label N.  A typical "three-address code"
is assembled by a routine like

    assemble_2_to(mul_zc, AO1, AO2, stack_pointer)
    
meaning "the instruction mul_zc, which has 2 operands and has a result which
it writes to the variable indicated by the third operand".  AO1 and AO2
are assembly_operands (the abbreviation is often used in the source), and
so is stack_pointer (it has type VARIABLE_OT and value 0).


A typical example of how the top-down syntax analyser generates code is
given by the code for the "while" statement:

        case WHILE_CODE:
                 assemble_label_no(ln = next_label++);
                 match_open_bracket();

                 code_generate(parse_expression(CONDITION_CONTEXT),
                     CONDITION_CONTEXT, ln2 = next_label++);
                 match_close_bracket();

                 parse_code_block(ln2, ln, 0);
                 assemble_jump(ln);
                 assemble_label_no(ln2);
                 return;

Note that this expects to match

        while ( condition ) statement
                            or { statements }

The expression parser is called to turn the condition into a parse tree,
and its output is fed straight into the code generator for parse trees.

The label numbers ln2 and ln are supplied to the routine for parsing code
blocks because they indicate which labels the statements "break" and
"continue" should generate jumps to.

For example,

        while (i <= 10) print i++;

generates the assembly language

        .L0;    @jg i 10 ?L1;
                @inc i;
                @print_num i;
                @jump L0;
        .L1;    

In terms of function calls to "asm.c":

        assemble_label_no(0);
        assemble_2_branch(jg_zc, AO_for_i, AO_for_10, 1, TRUE);
        assemble_inc(AO_for_i);
        assemble_1(print_num_zc, AO_for_i);
        assemble_jump(0);
        assemble_label_no(1);

(Note that incrementing is needed so often that assemble_inc() is provided
as a shorthand: actually also because of a form of indirect addressing used
by the Z-machine for store_zc and inc_zc, dec_zc.  assemble_store() and
assemble_dec() are also provided.)


    5.8   Summary of assembly language instructions output
          ------------------------------------------------

The list of Z-machine instructions which Inform actually generates code for
by itself is quite short (about half of the 115 possible instructions are
ever used).  The assembly-language "@ statement" parser can, of course,
generate every Z-machine instruction.

The instructions used for three-address-style code (that is, for program
control, arithmetic, etc.) are as follows.  For function calls and returns,
unconditional jumps and for moving values between variables, memory locations
and the stack:

    call_*          rfalse            rtrue             ret_popped
    ret             inc               dec               store
    push            pull              storeb            storew
    loadb           loadw             jump              set_attr

(There are various forms of call instruction to do with the number of
operands, whether a return value is wanted or not, etc.).  Six conditional
branch instructions are used:

    je              jz                jl                jg
    jin             test_attr         check_no_args

(the first four are numerical branches, the next two related to the object
tree, and the last one is used only in V5 or later games, and then only
for printing out tracing code): note that each can also be used in a
"negate this condition" form.  Finally, signed 16-bit arithmetic and
unsigned 16-bit bitwise operations:

    add             sub               mul               div
    mod             and               or                not


A further batch of instructions is used, so to speak, in lieu of calls to a
standard library of input/output and utility routines (like C's "stdio"):

    print           print_ret         print_char        print_num      
    print_addr      print_paddr       print_obj         new_line
    insert_obj      remove_obj        get_parent        get_child
    get_sibling     get_prop          get_prop_addr     get_prop_len
    put_prop        random            aread/sread       quit
    save            restore           output_stream

with five more exotic screen control instructions used only if a "box"
statement is ever compiled:

    split_window    set_window        style             set_cursor
    buffer_mode

    
    6   Syntax analysis 2: the bottom-up expression parser
        --------------------------------------------------

    6.1   Map and structure
          -----------------

It would be possible to continue the top-down parser down into the level of
expressions: for instance, to have a routine parse_plus() which would
parse_expression(), then match a plus sign, then parse_expression() again,
and so on.  This would however be highly inefficient in terms of function
calls, difficult to recover from when an error occurs, and require either
much greater lookahead or rather complicated mechanisms for storing parse
trees.

Expressions (including assignments, conditions and constant values) are
therefore parsed by a different syntax analyser, which is "bottom up" (it
begins by reading in the tokens, and only works out what they mean afterward)
rather than "top down" (starting from an assumption about the meaning and
then trying to find the evidence to justify it).

The algorithm used is an operator precedence parser, a simple form of
shift-reduce parser.  There is alas no space to describe this beautiful
algorithm here: see ASU, pp.195-215.  In this account we treat it as a
"black box".

The expression parser "expressp.c" has the structure:

    -------->  "Level 4" of      -------->  Operator
     tokens    lexical analysis   etokens   precedence
       in                                   parser
                                              |
                                              | etokens re-ordered into RPN
                                             \|/ 
                                            Emitter
                                              |
                                              | plain parse tree
                                             \|/ 
                                            Lvalue and
                                            condition checking
                                              |
                                              | more detailed parse tree
                                             \|/         out

(This is effectively a magnification of the part of the source map in
section 1.2 above which reads just "expressp.c".)  Note that there is a
single input and a single output channel.

RPN is "reverse Polish notation": for example

    (2 + 4)*45 + 34         becomes  2 4 + 45 * 34 +

the idea being that one reads it left to right: each number is added to
a stack of numbers, and each operation takes some numbers off the stack
and puts an answer back.  (E.g., + takes 2 and 4 off and replaces them
with 6.  * takes 6 and 45 off, replacing them with 270.  + takes 34 and 270
off, replacing them with 304, which is the answer.)  The advantage of
this is that is unambiguous which operator acts on which operands, and the
brackets have gone.

The emitter builds a (plain) parse tree out of this sequence, as follows.
It stacks 2 and then 4:

    Stack: 2 4

It reads +, which it knows has two operands, and takes the top two items
off the stack, joining them into a tree segment like so:

                      +
                     / \
                    2   4

It then makes a special value, say E1, meaning "the sub-expression in this
tree", and stacks that:

    Stack: E1

It then stacks up 45, and reaches *: it takes off the top two values, which
are E1 and 45, and makes

                      *
                     / \
                   E1   45

which it calls E2, and so on.  When the process is finished, we have the
tree:
                            +   <--- this is E3
                           / \
        this is E2 --->   *   34
                         / \
      this is E1 --->   +   45
                       / \
                      2   4

and the stack contains just E3, which is the "answer".


    6.2   The operator precedence grammar
          -------------------------------

A list of productions for this grammar would be tiresomely long.  It can
be roughly summarised as:

    expression   ->   ( expression )
                      a single operator acting on some operands in some way
    operand      ->   token representing a constant
                      ( expression )
                      expression ( arguments )
    arguments    ->   <empty>
                      expression , arguments

The tokens representing a constant are given in the next section.  The
operators have tokens as follows:

    --------------------------------------------------------------------------
    Level    Lexeme    Usage   Associativity  Purpose
    --------------------------------------------------------------------------
      0      ,         binary  left           separating values to work out

      1      =         binary  right          set equal to

      2      &&        binary  left           logical AND
      2      ||        binary  left           logical OR
      2      ~~        unary   (prefix)       logical NOT

      3      ==        binary  none           equal to?
      3      ~=        binary  none           not equal to?
      3      >         binary  none           greater than?
      3      >=        binary  none           greater than or equal to?
      3      <         binary  none           less than?
      3      <=        binary  none           less than or equal to?
      3      has       binary  none           object has this attribute?
      3      hasnt     binary  none           object hasn't this attribute?
      3      in        binary  none           first obj a child of second?
      3      notin     binary  none           first obj not a child of second?
      3      ofclass   binary  none           obj inherits from class?
      3      provides  binary  none           obj provides this property?

      4      or        binary  left           separating alternative values

      5      +         binary  left           16-bit signed addition
      5      -         binary  left           16-bit signed subtraction

      6      *         binary  left           16-bit signed multiplication
      6      /         binary  left           16-bit signed integer division
      6      %         binary  left           16-bit signed remainder
      6      &         binary  left           bitwise AND
      6      |         binary  left           bitwise OR
      6      ~         unary   (prefix)       bitwise NOT

      7      ->        binary  left           byte array entry
      7      -->       binary  left           word array entry

      8      -         unary   (prefix)       16-bit (signed!) negation

      9      ++        unary   (pre/postfix)  read/increment or increment/read
      9      --        unary   (pre/postfix)  read/decrement or decrement/read

     10      .&        binary  left           property address
     10      .#        binary  left           property length
     10      ..&  (**) binary  left           individual property address
     10      ..#  (**) binary  left           individual property length

    11/14    ( )       binary  left           function call/message send

     12      .         binary  left           property value
     12      ..   (**) binary  left           individual property value

     13      ::        binary  left           "superclass" operator
    --------------------------------------------------------------------------
    (**): Illegal except internally, in Inform's own source for the run-time
          veneer
    --------------------------------------------------------------------------

Note that, unlike C, Inform does not provide unary plus.

A binary infix operator is used in the form "X op Y"; a unary prefix operator
in the form "op X"; a unary postfix operator in the form "X op".

The "level" is the precedence level: the higher this number, the more tightly
an operator is glued to the operands adjacent to it.

Cases where two binary infix operators have equal precedence are resolved
according to whether that level is left or right associative.  For instance,

   a / b / c      means    (a/b)/c

since /, on level 6, is left associative.  But

   a = b = c      means    a = (b = c)

since =, on level 1, is right associative.

Cases in the above table where the associativity is "none" mean that an
attempt to silently associate the operator will cause an error.  For example,

   a == b == c

generates an error as being ambiguous, but

   (a == b) == c   and   a == (b == c)

are both permitted.

The function-call-brackets have an asymmetric precedence level according
to whether they're being compared with something on the left or on the right:
the point is that

    object.message(...)
    function_returning_an_object(...).property

must be recognised as

    (object.message)(...)                               . 12      >  (...) 11
    (function_returning_an_object(...)).property        (...) 14  >  . 12

respectively.


    6.3   Lexical level 4: tokens to etokens
          ----------------------------------

The shift-reduce parser expects tokens sorted out into operators, such as
"+" or "ofclass", operands such as "34" or "score" (assuming that's a
variable) and three tokens of special significance:

    (     )     end-of-expression

"Level 4 of the lexer" does just this.  It converts the stream output from
"lexer.c" (i.e., from level 3) into a stream of tokens from the following
table (and it uses a good deal of context information to do so).

    Token type           Meaning
    -------------------------------------------------------------------------
    OP_TT                Operator: value is a *_OP constant

    LARGE_NUMBER_TT      Operand (number): value is the number
    SMALL_NUMBER_TT      Operand (number guaranteed in the range 0 to 255):
                             value is the number
    VARIABLE_TT          Operand: value is Z-machine variable number
    DQ_TT                Operand (text used as static string): text in text
    DICTWORD_TT          Operand (text used as dictionary word): text in text
    ACTION_TT            Operand (##action number): name of action in text
    SYSFUN_TT            Operand (system fn name): value is a *_SYSF constant
    SYSTEM_CONSTANT_TT   Operand (#system_constant): value is a *_SC constant

    SUBOPEN_TT           Open bracket used to open a sub-expression
    SUBCLOSE_TT          Close bracket used to close a sub-expression

    ENDEXP_TT            End of expression
    -------------------------------------------------------------------------

The tokens in this output stream are called "etokens", short for "expression
tokens": they are the raw material from which parse trees are built up.
(Although formally different from ordinary tokens, they have the same storage
type, "token_data", in the Inform source code.)

At first sight this seems an unnecessarily large stock: why not, for example,
compile static strings and dictionary words and convert them to their
addresses here and now?  Why not evaluate system constants and action numbers
now?  Because: (i) these tokens may not be accepted by the parser, so we
can't compile text on the strength of them (and indeed, the parse tree which
is being parsed may not ever be code-generated from); and (ii) we can't
allow ourselves to lose sight of where numbers are coming from so soon,
or backpatching will be impossible.


The presence of two different tokens for numbers - one for numbers guaranteed
to be unsigned 8 bit, one for numbers with no such guarantee and whose value
may be any signed 16 bit quantity - is due to the architecture of the target
Z-machine.  In Z-code considerable space savings are made by storing small
numbers in 1, rather than 2, bytes.  We are anxious to take advantage of
this.

Unfortunately, what seems to be a small number now may not be a small number
later (after backpatching).  Dictionary word values (such as 'word') are
internally stored as the accession number of that word into the dictionary
(say 23, if it is the 23rd different word to be entered) during the
compilation pass, and then backpatched later to the byte address of this
word's dictionary entry: but this will be a large number.

Most operands with non-NULL backpatch markers are forced into long form, to be
on the safe side.  (This includes all operands which are forward references
to constants not yet defined.)  The exceptions are that variable and action
numbers (though not fake action numbers) are guaranteed always to fit into
short form.


There are two tasks involved in translating tokens to etokens: evaluating
tokens representing quantities into one of the "operand" etokens, and sorting
other tokens into the right operator and special etokens.  The second process
is not as easy as it looks because of ambiguities, and is discussed in the
next section.  This section discusses the evaluation of quantity tokens.

Tokens in expressions are read in a context where the keywords are local
variables, textual operators, system function names and system constant
names.  Other identifiers are symbol names.  So the token types which must
represent quantities are:

    DQ_TT, SQ_TT, LOCAL_VARIABLE_TT, NUMBER_TT,
    SYMBOL_TT, SYSFUN_TT and SYSTEM_CONSTANT_TT.

(Actually, this omits a detail: the largely obsolete constant forms #a$...,
#r$..., #n$... are SEP_TT tokens converted into the above; and the very much
still with us constant form ##... is a SEP_TT token converted
straightforwardly into the ACTION_TT etoken.)

These all translate directly into etokens except for SYMBOL_TT and SQ_TT.
SQ_TT requires a little work because of the Inform syntax for single-quoted
strings:

    'x'        means the character code (i.e. ASCII value) for "x"
    '@ss'      means the character code for the German letter "sz"
               (i.e. the code specified in the accented set in the
               Z-Machine standard document)
    'xyzzy'    means the byte address of the dictionary word "xyzzy"

The first two become SMALL_NUMBER_TT etokens.  The third becomes a
DICTWORD_TT etoken.


This leaves SYMBOL_TT tokens: names for constants.  There are two
possibilities: the symbol is so far unknown, or else it has been assigned
a value in earlier syntax analysis.

In most languages it would be an error to use an unknown symbol name as
a value: in C, for example, a function name cannot be used in an expression
unless it has been declared beforehand.

Inform is more tolerant: as far as expressions are concerned, any symbol can
be used earlier than the point in the source code where its value is
assigned, excepting only a local or global variable name.  (This exception
is mainly for Z-machine architecture reasons, but it also makes Inform code
more legible.)

When an unknown symbol is reached, it is converted to a LARGE_NUMBER_TT and
marked as SYMBOL_MV, with the value being its index in the symbol table.
(This allows the backpatcher to know where to find the true value later.)

When a known symbol is reached which is flagged as CHANGE_SFLAG, this is
treated in the same way.  (CHANGE_SFLAG is given to symbols defined by
Constant definitions whose values themselves require backpatching: for
instance, symbols defined equal to dictionary words or routine addresses.)

Otherwise, the symbol is not only known, but its current value is known to
be correct, and so:

    if it's a variable, it's translated to etoken VARIABLE_TT;
    otherwise the LARGE_NUMBER_TT/SMALL_NUMBER_TT decision is now made:
        if it is marked with a marker other than VARIABLE_MV, it becomes
            "long";
        otherwise the value is compared with the range 0 to 255.

Any symbol name fed into the shift-reduce parser is flagged with USED_SFLAG
to indicate that its value has at some point been used in an expression.
(This information enables "This ... was declared but not used" warnings to be
made at the end of compilation.)


    6.4   Resolution of ambiguous tokens
          ------------------------------

It remains to translate to the etokens OP_TT, SUBOPEN_TT, SUBCLOSE_TT and
ENDEXP_TT.  The token types which represent operators are all from SEP_TT and
COND_TT; open and close bracket are also SEP_TT tokens.

There are two main difficulties here.  Firstly, shift-reduce parsers are
unsuitable for resolving ambiguities, and so it is not possible to map tokens
directly into operators.  The ambiguities in the expression grammar for
Inform are:

(a)  ++ and -- each refer to either a prefix or a postfix operator
(b)  - refers to either a unary prefix operator or a binary infix one
(c)  ( and ) are used both to indicate function calls and subexpressions
(d)  , is used both to indicate a sequence of values, each to be evaluated
        and thrown away with only the last one kept, and to separate
        function arguments
(e)  -> does not represent an operator when parsing constants as operand
        values in a line of "@" assembly language (because it needs to
        mean "the store value goes in")

These are resolved as follows:

(a), (b)  The "level 4 lexer" watches for context: for instance, MINUS_SEP
          translates to UNARY_MINUS_OP if there was no previous token, or
          it was an open bracket, or an infix or prefix operator, or a
          comma; otherwise it translates to MINUS_OP.

     (c)  Similarly. If an open bracket token is read, and the previous
          token was an operand of some kind, then it must represent a
          function call (unless we are in constant context); an extra
          etoken is generated for an imaginary infix operator called
          FCALL_OP.  The open bracket token is then translated to
          SUBOPEN_TT as usual; a close bracket is always SUBCLOSE_TT. Thus,
          the stream

              Eroica ( Beethoven , 3 )

          is translated into the etokens

              Eroica <fcall> ( Beethoven <comma> 3 )

          so that a tree will come out of the emitter looking like
          
                     <fcall>
                     /     \
                 Eroica   <comma>
                          /     \
                      Beethoven  3

          although in fact the emitter recognises what is going on and
          automatically simplifies this to:

                       <fcall>
                      /   |   \
                     /    |    \
               Eroica Beethoven 3

     (d)  A comma at the top level is either disallowed (in constant context)
          or treated as the operator <evaluate from left to right but throw
          away the result on the left and keep the one on the right>.  (Just
          as in C.)  Otherwise, comma is a separator of arguments.  "Top
          level" means "top bracket level", which the "level 4 lexer" keeps
          track of by counting the bracket tokens going through it.

     (e)  There is a special expression context called "assembly language",
          in which "->" is translated to ENDEXP_TT.


The second main difficulty is entirely of the language designer's own making. 
How is the end of an expression to be detected?  In C, it always possible
when parsing an expression to say "the end will have been reached when
such-and-such a token is reached": for example, ";" or "}".  This is not
possible in Inform.

Even if one knows in advance what the terminating token ought to be, it might
be impossible to recognise for context reasons.  For example, the statement

    move <first expression> to <second expression>;

appears to be no problem: the first expression is terminated by the keyword
"to", the second by ";".  But suppose "to" has another meaning, as the name
of a constant or variable?  In this case,

    move 2 + to to to;

is a legal statement.

For all these reasons, Inform actually detects the end of an expression at
lexical level 4 by watching the context carefully: for instance, in

    move 4 * banana to ...

when "4 * banana" has been passed, the next token is expected to be an open
bracket, an infix or postfix operator, or (perhaps) a comma.  Since the next
token is none of these, the expression has ended.  Note that this decision
can be taken without knowing whether "to" is a keyword or a symbol.


From a grammatical point of view the worst design flaw in the Inform language
is that array initialisations (either in Array directives or object property
values) consist of sequences of expressions given without any separator in
between.  For example,

    Array X --> 2 4 6 * MAX_FROGS 45 ;

This is arguably a convenient shorthand.  Unfortunately, it falls prey to
that predator of all computing grammars, unary minus.  The directive

    Array X --> 10 -5;

is ambiguous.  Is it an array of 10-5 = 5 entries, or an array of two entries,
10 and -5?

In Inform 5 there was no problem since constants were not allowed to contain
operators anyway (and negative numbers were represented by single tokens,
which is not true in Inform 6): the directive can only mean that there are
two entries, 10 and -5.

In Inform 6 the other answer is true, because the parser always makes the
longest match it can.  Since this may mean that some Inform 5 code needs
changing in a few places, a warning that "this is ambiguous" is issued
for unbracketed minus signs used in array initialisations.

(Illustrating the problem further, if the user tries to clarify things with

    Array A --> 10 (-5);

then this is parsed under the normal rules as the function call 10(-5);
therefore another constant-context rule is used to interpret brackets as
always referring to subexpressions, not function calls.)


Similar difficulties at first seem to occur with < and >, and with << and >>. 
Does

    <Action 3 -4>

refer to the pair (Action, -1) or the triple (Action, 3, -4)?  (It is now
obvious that the correct syntax design would be to insist on a comma in
between 3 and -4.)   Fortunately, though, both Informs 5 and 6 apply a "make
the longest expression possible" principle to this case, so they agree that
the answer is (Action, -1).  On the same grounds, the statement

    <Action x ++ y>

is understood as applying ++ to x, not to y, because the "longest expression
possible" when parsing the first expression is "x ++".  A similar slight
exception is made to the rules parsing open-brackets in action statements,
so that

    <Action firstobj (4*counter)>

is not parsed as having one argument, the result of the function call

    firstobj(4*counter).


    6.5   Constant folding
          ----------------

Recall that the emitter takes as input a stream of etokens from the
shift-reduce parser, consisting of the expression re-ordered into RPN.  When
the whole stream has been received (i.e., when it has received ENDEXP_TT),
the emitter has to produce an operand as "answer".

For instance, if the emitter receives just the operand 56, then the answer
is 56.  More often, the answer may be a special value (written above as E1,
E2, ...) meaning "the numerical value has to be worked out by working through
this tree".  (E1 is actually a new type of etoken, of type TREE_NODE_TT.)

If this were done strictly, then the original text 4 + 7 would result in the
emitter producing the answer E1, where E1 refers to the expression tree

                 +
                / \
               4   7

and this is not a helpful answer, for two reasons: it means the value
couldn't be used as a constant (since it seems to need Z-code to compiled to
work it out), and even if it's used in code, there's not much point in
compiling the Z-code instruction

    @add 4 7 -> sp;

just to put the value 11 on the stack.

Therefore, if the emitter finds an operator that it knows how to calculate
(such as +), whose operands are all constants (and which are not unknown
for linking reasons), it eliminates the sub-expression in favour of the
result.  This is called "constant folding": the constant value 4+7 is
folded up into 11.

Two issues arise: firstly, arithmetic must be done exactly as it would be
done in the Z-machine, that is, as signed 16-bit arithmetic.  (Otherwise
constants may fold differently in one port of Inform or another.)

Secondly, constant folding cannot safely be performed on a marked value.
(For otherwise: consider what might happen to the expression SODS - LAW,
at a time when neither SODS nor LAW have been defined.  Both are evaluated
as constants with value their symbol numbers and marker SYMBOL_MV: the result
is folded to the difference between their symbol numbers, and the marker
is lost: i.e., the result is nonsense.)

The emitter is able to fold the following operators:

     +  - (unary or binary)   *  /  %  &  |  ~
     ==  ~=  >  <  >=  <=  &&  ||  ~~

The inclusion of the conditionals here is quite important: it enables
directives like

     Iftrue #version_number == 3;

to work, since the expression "#version_number == 3" can be parsed in a
constant context.

The inclusion of unary minus is also important, as it is only by folding
this that the text "-45" can be parsed in constant context.


    6.6   Checking lvalues and simplifying the parse tree
          -----------------------------------------------

If the emitter produces a parse tree, then it's now modified according to
several rules.  The aim is to remove any ambiguities in the tree (for
instance, "=" does several different things according to what its left
operand is) and make a final correctness check (for instance, this is where
"45 = 12" results in an error message).


Firstly, the tree is traversed to find instances where a value is used where
a condition was expected.  (If the expression is in a condition context, the
top node is expected to be a condition; the children of a logical operator
&&, || or ~~ are expected to be conditions.)  The configuration

          <value or tree>

is replaced by

      <~= 0 condition operator>
                 |
          <value of tree>

If the <value or tree> has root node the "=" operator, a warning is printed
out, since it seems likely that

    if (x = 4) ...

(which would invariably be found true, since the value of "x=4" is 4 which is
non-zero) is a mistype for

    if (x == 4) ...


A second, depth-first, traversal is then made to remove usages of ~~, using
de Morgan's laws:

            ~~(x && y)   =   ~~x || ~~y
            ~~(x || y)   =   ~~x && ~~y
            ~~(x == y)   =   x ~= y
            ~~(x >= y)   =   x < y

and so on.  (Note that, internally, every operator etoken has a corresponding
negated operator etoken.)


One ambiguity remaining in the tree is that the operators ".", ".#" and ".&"
act on both common and individual properties (that is, properties handled by
the Z-machine's architecture directly, and properties handled by the run-time
veneer routines).  The decision about which is intended is made here, by
looking at the right operands of the operators.  (If it's a common property
number, or an individual property number, the decision is obvious; if it's a
variable or some compound expression, then we decide in favour of individual
properties (the run-time veneer routines can patch things up if this is a
wrong decision).)

So a traversal is made to carry out the following transformation:

               .                             <common .>          <indiv .>
              / \             becomes          /    \      or      /   \
             a   b                            a      b            a     b

(Since the veneer routines are themselves written in Inform code, a slightly
different translation rule applies to them: "." always means <common .>
and so on, while three "secret" operators are provided, "..", "..&" and
"..#", to mean <indiv .> and so on.  Outside of the veneer, use of these
operators is illegal.)

In the same traversal, function calls are distinguished from sent messages:

             <fcall>                             <send message>
              /   \ \ ...                           / | | \ ...
       <common .>  c d        becomes              a  b c  d
      or <indiv .>
          / \
         a   b


The final traversal performs "lvalue checking".  Five operators act directly
on storage objects (such as variables) rather than values: these are

       <lvalue> = <ordinary value>
       <lvalue> ++
       <lvalue> --
       ++ <lvalue>
       -- <lvalue>

where <lvalue> has to be a reference to a storage object.  There are up to
five kinds of storage object in the tree:

       local/global variable                       VARIABLE_TT
       common property value of an object          <common .> subexpression
       individual property value of an object      <indiv .> subexpression
       entry in byte -> array                      -> subexpression
       entry in word --> array                     --> subexpression

This makes 25 combinations.  As well as checking that what ought to be an
lvalue is indeed one of these five possibilities, the traversal also
rewrites the tree explicitly with one of 25 operators.  For instance:

             =                                <set byte array entry =>
            / \                                    /    |    \
          ->   v        becomes                   a     b     v
         /  \
        a    b

and so on.

Thus, the end of this process is a tree representing a syntactically correct
expression (if necessary having mended any errors in the source code's
description of the expression), which has a minimal number of logical
operators and no negation operators, and in which the action to be taken by
any node does not depend on what its children are.


    6.7   Summary of parse tree output
          ----------------------------

It is time to formally specify the input and output of the main routine in
"expressp.c":

    assembly_operand parse_expression(int context)

takes, as argument, one of the seven *_CONTEXT constants:

          QUANTITY_CONTEXT    the default: the result may be any quantity

          VOID_CONTEXT        the expression is used as a statement, so that
                              its value will be thrown away and it only
                              needs to exist for any resulting side-effects
                              (used in function calls and assignments)

          CONDITION_CONTEXT   the result must be a condition

          CONSTANT_CONTEXT    the result must be known now (at compile time)

          ASSEMBLY_CONTEXT    like QUANTITY_CONTEXT, but with the '->'
                              operator forbidden
                              (this is needed for assembly language to
                              indicate store destinations)

          FORINIT_CONTEXT     like VOID_CONTEXT, but with the '::' operator
                              forbidden at the top level
                              (needed to resolve ambiguity in grammar)

          ARRAY_CONTEXT       like CONSTANT_CONTEXT, but with different rules
                              to resolve whether a minus sign represents
                              unary or binary minus
                              (needed to resolve ambiguity in grammar)

The return value is the result of the expression.  This is an assembly
operand, the type of which indicates what has happened:

          LONG_CONSTANT_OT, SHORT_CONSTANT_OT
                              the result is this number

          VARIABLE_OT         the result is in this global/local variable

          OMITTED_OT          there is no resulting value (for instance,
                              this happens if the expression was successfully
                              parsed in void context)

          EXPRESSION_OT       the result can only be determined by running
                              the code generated from this parse tree

If an error has occurred in the expression, from which recovery was not
possible, then the return is (short constant) 0.  This should minimise the
chance of a cascade of further error messages.

The type EXPRESSION_OT is not a valid Z-machine assembly operand type, and
it's only used here and in the code generator "expressc.c".  The value of
such an operand is an index N, indicating that the root node of the tree is
in Nth position in the expression-tree-array ET.

The expression-tree-array ET[0], ET[1], ... is an array of "nodes", each of
which is a structure as follows:

    {   /*  Data used in tree construction  */

        int up, down, right;
        int operator_number;         /* Only meaningful for non-leaves */
        assembly_operand value;      /* Only meaningful for leaves */

        /*  Attributes synthesised during code generation  */

        ...

    }

The numbers ET[n].up, ET[n].down and ET[n].right are node numbers for the node
above, the leftmost node below, and the next child to the right of the same
parent: they hold -1 if there is no node in that position.

A "leaf" is a node for which ET[n].down == -1.  Leaves hold operands, and
other nodes hold operators.  (Note that leaf operands must be of operand type
SHORT_CONSTANT_OT, LONG_CONSTANT_OT or VARIABLE_OT: they must be honest
Z-machine operands.)

Nothing distinguishes the root node except that its number happens to be held
in an EXPRESSION_OT operand value somewhere.


A brief word on deallocation: clearly these parse trees need to survive
intact until code generation occurs for them.  No elaborate system is provided
here, since the syntax analyser always generates code for any expressions it
reads while it's still parsing the same syntax line as the expression was
parsed on.  Each time a new syntax line is reached, "expressp.c" is notified
and it then wipes the ET[n] array empty again.


    7   Code generation from parse trees
        --------------------------------

    7.1   Aim and structure of the code generator
          ---------------------------------------

The section "expressc.c" takes as input a parse tree and generates suitable
Z-code to calculate the result of the expression which the parse tree
embodies.  As a result, its main output is a stream of function calls to
"asm.c" (see chapter 5 for details).

The syntax analyser calls only one routine:

   assembly_operand code_generate(assembly_operand AO, int context, int label)

where AO is expected to be the output from a previous run through "expressp.c".

The contexts are the same as those for expression parsing, except that only
VOID_CONTEXT, CONDITION_CONTEXT and QUANTITY_CONTEXT are allowed.

The "label" is only meaningful in CONDITION_CONTEXT:

    if label >= 0, branch to this label if the expression is false
        as a condition;

    if label == -3, rfalse if the expression is true;

    if label == -4, rtrue if the expression is true.

(Note that the presence of rfalse/rtrue here allows better code generation
for statements in the form

    if ( ... ) return;

owing to a feature of the Z-machine's architecture for branch instructions.)

The return AO is only meaningful in QUANTITY_CONTEXT; it holds the result
of the expression, often the stack pointer variable (meaning, the answer
is on the top of the stack) but not always: e.g., from

    j++, 2

the result would be SHORT_CONSTANT_OT 2.  Similarly, from the expression

    2

the code generator returns 2 without actually generating code.


    7.2   Annotation for conditions
          -------------------------

The first thing the code generator does is to "annotate" the tree for
conditions.  Annotation is the process of assigning useful values to
every node on the tree, usually in such a way that either the value at
one node determines the values for its children, or vice versa.  (Such
values are called "attributes" and this inductive definition of them
is called "synthesis": see ASU, p.280.)

The explanation of this algorithm is about five times longer than its
source code!

We wish to assign a pair of values written (on paper) in the form

    a | b

to every conditional node (that is, every node which is a conditional
operator such as "~=" or "ofclass", or is a logical operator such as "&&").
This is to be read as "branch to label a if true, or label b if false".
Inform has four special label numbers:

    -1 means "carry on rather than branch"
    -2 means "branch to the immediately following instruction" 
    -3 means "return false rather than branch"
    -4 means "return true rather than branch"

(-2 is used in the assembly of instructions like get_child_zc, which is
a conditional branch in Z-machine terms but which Inform wants to ignore
the branch result from.  We can ignore -2 here: it behaves exactly like
-1.)

One of the two numbers a and b must always be other than -1 (otherwise we
would have rendered a branch meaningless, which must be a mistake).  Ideally
exactly one of them is -1, so that the node can generate a single branch
instruction.

There are two points where these a | b attributes start to appear in the
tree.  Firstly, in condition context they'll appear at the root node: for
example, in CONDITION_CONTEXT with label 40 the top node would be labelled

                       -1 | 40

("jump to L40 if false").  Secondly, at condition subexpressions used as
values, e.g. in the tree:

                          =
                         / \
                        v   ==
                           /  \
                          t    1

This is trickier: the result of == is not a branch somewhere, but a numerical
value (which must be 0 or 1).  To cope with such conversions, two more
attributes are provided:

    to_expression,  label_after

A node flagged as "to_expression" is one where a condition is being converted
to a numerical value; the "==" node above would have this flag set.

"label_after" is either -1 or the number of a label which the code generator
is to assemble after code generation for the node is complete.

So here are the rules for generating these attributes for node N.  Each node
is sent a pair of values a | b from its parent (in the case of the root
node, from the code_generate() routine).

    annotate N with
       a | b

    if N is &&, || or a conditional operator, and  a | b is -1 | -1,
    then this must be a condition used as a value:
    so re-annotate N with
       to_expression = TRUE
       label_after = a new label L
       L | -1

    if N is an && operator:
    a label for "go here if false" is going to be needed to provide a
    destination for the branch generated by the lefthand operand.  So,
    if b = -1, re-annotate N with:
       label_after = a new label L
       a | L

    if N is an || operator:
    a label for "go here if true" is going to be needed to provide a
    destination for the branch generated by the lefthand operand.  So,
    if a = -1, re-annotate N with:
       label_after = a new label L
       L | b

    Finally, we need to pass a pair of values down to each child of the
    node.  In the case of all nodes except && and ||, the values passed
    down are

         -1 | -1

    (because the operator expects numerical operands).
    
    In the case of && and ||, let x | y be the annotation which was finally
    made for this node.  We pass values down as follows:
    
                &&                     ||
               /  \                   /  \
         -1 | y    x | y        x | -1    x | y

(Here "condition node" means a conditional or logical operator.  Recall that
the ~~ operator was eliminated earlier.)

For example, consider the condition in

    if (x == 1 || y >= 2) ...

The syntax analyser decides to generate code which will branch to L23 if the
condition is false.  The parse tree

                ||                                       ||  -1 | 23, l_a = 50
               /  \                                       /    \
             ==    >=       is annotated to    == 50 | -1        >=  50 | 23
            /  \  /  \                            /      \        /      \
           x   1  y   2                          x        1      y        2

where the annotation routine created label 50 as a point to branch to if the
first condition turned out to be true.

Note that it's actually a little wasteful to annotate this way: the node >=
is annotated with 50 | 23, but label L50 will appear immediately after the
code generated for this node anyway, so it might as well be annotated -1 | 23.
The annotation routine does indeed make this simplification.

  Lemma: One of the values a, b sent down to any node is always -1.

  Proof: By induction down through the tree.  The values sent down to the
         root node are either -1 | L, -3 | -1 or -4 | -1, so the property
         holds for the root node.
         
         Now suppose such a pair a | b is sent down to node N.  Unless N is
         && or ||, it sends -1 | -1 to its children, so the property holds
         for the children of N.

         If N is &&, then it has two children, left and right.  The left
         child is always sent -1 | y, so it has the property.

         The right child is sent x | y, unless y is a new label appearing
         after N, in which case it is sent x | -1.  Note, though, that
         N itself was sent a pair in which one number was -1 (by inductive
         hypothesis) and so x | y can only have both numbers other than -1
         if a new label was created by N.  Since N is an && operator
         this label must be y; and therefore the child is sent x | -1,
         which means the right child also has the property.

         The argument for || is symmetrical to this.

  Corollary: In the annotation "a | b" of every node other than a && or ||
         operator, either a is -1 and b is not, or vice versa.

The annotation values a | b are only used to generate code for condition
operators, so we now have the desired result.


    7.3   Main traversal
          --------------

The main algorithm is simple.  We make a depth-first (i.e., bottom-up)
traversal of the parse tree, pruning off sub-expressions as code is
generated for them.  In the tree

                    *
                   / \
                  a   +
                     / \
                    b   c

we first descend to the +, and convert that into

    @add b c -> sp

and prune the tree to

                    *
                   / \
                  a   sp

so that the next instruction generated is

    @mul a sp -> sp

and the tree is now just

                    sp

and the result is therefore on the stack.  A complication, however, is what
order in which to remove the subexpressions.  Faced with the tree

                    -
                   / \
                  *   +
                 / \ / \
                a  b c  d

do we go left first from the top -, or right first?  This decision is taken
for us by the Z-machine's architecture.  When the Z-machine executes the
instruction

    @sub sp sp -> x;

it takes the two operands off the stack in the order left to right, i.e.,
it pulls the first operand first and then the second.  (Most stack machines
work the other way, to make RPN expressions easy to evaluate.  The Z-machine
was designed for forward Polish notation, though, because Infocom's own
language for it was a Lisp-variant with syntax like (PLUS 2 3).)

This means that we need to push the second operand first, then the first
operand.  In other words, we have to code generate the + operation first,
and then the * operation.

Therefore we need to traverse the tree depth-first and right-to-left.
But there is a complication: the && and || nodes need to have their children
code-generated left-to-right.  (Although "and" is logically commutative,
so that it shouldn't matter, "&&" is not computationally commutative, because
the language specification requires that the left condition is never evaluated
if the right condition was false.)

There is one other left-to-right node: the comma operator.  For example,
if the original value of x is 10, then

               ,
              / \
            ++   x
            /
           x

must evaluate to 11, not 10, because "x++, x" must evaluate "x++" first,
throw the result away and then evaluate "x".


    7.4   Void context
          ------------

As the above example demonstrates, in some cases sub-expressions need to
produce a result and in other cases they must not do so.  If the "++" operator
had left a value on the stack, then it would still be there long after the
expression had been evaluated.

Therefore, when code generating each node it is essential to know where the
result of the expression is supposed to go.  A flag called void_flag is
passed down to indicate this: if set, then the operator is being evaluated
in "void context", meaning that the value should be thrown away.  The rules
are:

    if the expression is in VOID_CONTEXT, then the root node is;
    the left operand of a comma operator is in void context;
    if a comma operator is in void context, its right operand also is.

Not every subexpression is permitted to be in void context.  For instance,

    3*x, y++;

generates the error

    Result of expression is thrown away

because the * calculation was pointless.  Numerical values or variables are
not allowed in void context, and nor are any operators which have no "side
effect": i.e., whose evaluation causes nothing in the Z-machine to change.
The operators with side effects are:

    =  in its various forms (5 of them)
    ++ in its various forms (10 of them)
    -- in its various forms (10 of them)
    function call (including message sending)

It is only now, at code generation time, that the error message above is
generated.  (In C this would only cause a warning, but I didn't want to
waste effort making the code generator assemble code to explicitly throw
results away.)


    7.5   Results of subexpressions
          -------------------------

By the rules above, then, either an operator is in void context and its
result should be thrown away, or it isn't and the result should go onto the
stack.  However, this would be wasteful, as an expression like "x = 4 + y"
(where x is a local variable, say) would then generate

    @add 4 y -> sp
    @pull x

and give result "x".  In fact we want to generate

    @add 4 y -> x

and give the result "x".  Therefore, when code-generating an operator with
a result (such as @add), we look at the operator above (which is guaranteed
still to exist): if it is <set-variable-equal> then we get its left operand,
write the result of the operator (such as @add) to that, and replace the
<set-variable-equal> operator with the variable.  For instance,

        set-var-equal                               x
           /    \
          x      +           becomes just
                / \
               4   y


One method used to throw away the results of expressions in void context
is to write the result to temporary variable 255 rather than the stack.
This is how unwanted function call return values are disposed of (well,
unless the Z-machine version number is 5 or more, in which case instructions
for call-but-throw-away-result are used directly).

A few other optimisations are used in void context: for example,

         ++ used as postfix on variable
                       |
                      var

generates the code

    @push var
    @inc var

with result "sp" in a quantity context.  In void context, it generates
simply

    @inc var

This is quite an important optimisation since postfix-++ is used very often
in void context (in "for" statements and just as an assignment).


    7.6   Conditional and logical operators
          ---------------------------------

The above account covers code generation for arithmetic and assignment
operators.

In generating && and ||, nothing needs to be done except possibly to
assemble a label (if the attribute label_after isn't -1).

Generating for conditions like == is in principle easy but in practice
difficult.  Recall that we have three attributes to go on:

    a | b              labels to branch to on true | false,
                           one which is -1 meaning "carry on"
    to_expression      flag indicating "convert to numerical value"

One problem is that, because of the "or" operator, a condition like ==
may have an arbitrary number of children:

    if (x == a or b or c or d or e) ...

for instance, must be assembled into the instructions:

    @je x a b c ?L2
    @je x d e ?~L1
    .L2
         ...
    .L1

(and if x is a use of the stack pointer, it must be pulled into some temporary
variable so that it can be non-destructively read more than once).  The code
to achieve this is baroque but uninteresting.

The second question is how to convert a condition to a numerical value:
either 0 (false) or 1 (true).  At present, this is not very optimised, and
the code looks like:

    <if condition is true jump to L1>
    @push 0
    @jump L2
    .L1
    @push 1
    @jump L2

with the result being "sp".


    7.7   System functions
          ----------------

When generating the <function call> operator, Inform looks at the leftmost
operand (the function to call) to see if it's the name of a system function.
(Note that if the system function has been Replaced, then it will have been
filtered out long since and the name will have been treated as the symbol
name for a function in the usual way.)

These are all implemented as short inline functions (sometimes only one
instruction long) except for "metaclass", implemented as a veneer routine.

Note that

             random
             / | |
            a  b c ....

nodes are generated by creating a word array whose contents are a, b, c, ...
and then assembling code to read a random entry from this array.  (An
error is issued if a, b, c, etc. are found not to be constant values.)


    7.8   Strict mode assertions
          ----------------------

When the compiler runs in "Strict" mode, with the -S switch set, it
compiles otherwise wasteful code to protect the programmer from crashing
the Z-machine through programming errors.  For instance, the statement

         x = child(nothing);

would otherwise compile to Z-code which is technically illegal:

         @get_child nothing -> x ?L0;
        .L0;

because the "get_child" opcode should have a legal Z-machine object
number as first opcode.  Most interpreters would overlook this technical
breach of the Z-machine specification.  A more serious example might be

         my_array --> 34000 = 1;

which would compile code trying to use @storew to write a value way
outside the Z-machine's dynamic memory area: this would typically crash
an interpreter.  The generic term for mistakes like these, coined
by Mr David Glasser, is "vile zero errors from hell".

In strict mode, Inform aims to compile code such that no program
(which does not contain @ assembly language) can possibly violate
the Z-machine specification in any respect.  In effect it compiles
what other compilers have called "assertions" before any use of a
dangerous opcode whose operands have been supplied by the user's
program.  For example "storew" will not be executed until its operand
value has been checked to lie in range.  That is, the statement

       z = x-->y;

compiles in non-strict mode to

       @storew        x y -> z

but in strict mode to

       @call_vs       RT__ChSTW x y -> z

where RT__ChSTW is a veneer routine ("run-time check store word")
which verifies only that x+2*y lies between 0 and the top of dynamic
memory: if so, the routine performs the required opcode; if not, it
causes the error message

   [** Programming error: tried to read outside memory using --> **]

to be printed up.

Not all the protected opcodes are replaced simply by function calls,
as this could be very slow: some are replaced by in-line code, which
is quicker.  For example, if the source code contained

       z = my_array-->y;

where "my_array" had been declared as an array, then Inform would
instead compile this:

       jl           y short_0 to L1 if TRUE
       jl           y <n> to L0 if TRUE
      .L1
       call_vn2     RT__Err short_28 y short_6 short_5 short_1 
       jump         L2 
      .L0
       loadb        long_680 y -> z 
      .L2

This has wasted some 23 bytes, but is faster than calling a function,
and directly checks that

       0  <=  y  <  <n>

where <n> is the 1 more than the highest legal entry of my_array, which
Inform knows thanks to having already compiled my_array.  The long
run of numbers fed to the veneer's run-time error system specify the
error number (28), the index tried (y) and so forth, enabling it to
print an error like so:

   [** Programming error: tried to write to -->14 in the array my_array,
   which has entries 0 up to 9 **]

The current list of assertions is as follows:

   Inform syntax           Assertion
   =============           =========
   child(x)                metaclass(x) == Object,
   children(x)             i.e., is the number of a Z-machine object
   elder(x)                but isn't a class-object
   eldest(x)
   parent(x)
   sibling(x)
   younger(x)
   youngest(x)

   x in y                  metaclass(x) == Object or Class
   x notin y               i.e. x is the number of a Z-machine object
                           (y, however, is unrestricted: the second operand
                           of @jin does not have to be a valid object number)

   x has y                 metaclass(x) == Object or Class,
   x hasnt y               y is a valid attribute number

   give x y                metaclass(x) == Object
   remove x

   move x to y             metaclass(x) == metaclass(y) == Object,
                           and the object tree remains a tree if x is
                           moved to y

   x/y                     y ~= 0
   x%y

   x.y = z                 metaclass(x) == Object, y is a valid property,
   ++x.y                   and x provides y
   x.y++
   --x.y
   x.y--

   x.y (used as value)     metaclass(x) == Object, y is a valid property,
                           and x provides y or else y is a common property

   x.&y                    metaclass(x) == Object, y is a valid property
   x.#y                    (not necessarily provided by x)

   x-->y = z               x+2*y,x+2*y+1 lie within "alterable memory"
   x-->y++                 (see below)
   x-->y--                 Moreover, if x is the name of a declared array,
   ++x-->y                 x+2*y,x+2*y+1 lie within the bounds of the array
   --x-->y                 (and not the length entry 0 if a table or string)

   x-->y (used as value)   x+2*y,x+2*y+1 lie within byte-accessible memory
                           Similarly for declared arrays, though entry 0
                           can be read

   x->y = z                x+y lies within "alterable memory" (see below)
   x->y++
   x->y--                  Moreover, if x is the name of a declared array,
   ++x->y                  x+y lies within the bounds of the array
   --x->y                  (and not the length entry 0 if a table or string)

   x->y (used as value)    x+y lies within byte-accessible memory
                           Similarly for declared arrays, though entry 0
                           can be read

   print (char) x          x is a valid ZSCII character for output, as
                           defined in section 3.8 of the Z-Machine Standards
                           Document
   print (address) x       x lies within Z-machine byte-accessible memory
   print (string) x        metaclass(x) == String
   print (object) x        metaclass(x) == Object or Class

The definition of "alterable memory" is:
either (a) within the array space area;
    or (b) within the object common property values table;
    or (c) within the object individual property values table;
    or (d) being the word "Flags 2" in the Z-machine header, at
           address $0010, $0011.
All other areas in byte-accessible memory can be read but are
"write-protected": in particular the global variable values table,
which needs to be defended as corrupting the contents during a loop
can cause the Z-machine to hang.

A further assertion is compiled into loops of the form

   objectloop (x in something)
   {   ...
   }

where "something" is any simple term (but not a calculated expression):
at the end of the loop, the assertion checks that x is still in "something"
and has not been removed or moved elsewhere in the object tree, causing
the iteration to go awry.  (This is a popular Inform coding mistake
and often arises from code like "objectloop(x in player) remove x;".)

Note that the veneer code is always compiled in non-strict-mode, which
is necessary to prevent some circular problems (e.g. can't do X without
calling the veneer to ask permission, but then the veneer has to call
itself to ask permission, but then...); and also enables the veneer to
do horrid things to class objects in ways which the outside program isn't
allowed to.


    8   Assembly of code, text and data structures
        ------------------------------------------

    8.1   Assembly of initial code
          ------------------------

The front end of "asm.c" consists of three routines to assemble routine
start and finish, and to place labels; plus many routines for assembling
individual instructions.  These all fill out data in the
assembly_instruction AI and then call a single assemble-instruction routine.
Thus, the back end of "asm.c" contains four operations to place data into
the Z-machine's code area:

    Routine start           assemble_routine_header()
    Routine end             assemble_routine_end()
    Label                   assemble_label_no()
    Instruction             assemble_instruction()

For the format of Z-code instructions, see the Z-Machine Standard Document,
which goes into the Z-machine's format with exhaustive thoroughness.  Note
that Inform 6 is much more compliant to the standard (so far as I know,
perfectly compliant) than earlier releases: for example, the Flags 2 bits
for "UNDO is needed" and so on are set correctly for the first time.

One of the few pieces of information fed back to Inform's higher levels by
"asm.c" is a flag indicating whether execution can possibly reach the next
instruction or not.  The assembly of certain instructions (returning from
a routine, terminating the Z-machine, unconditionally jumping) sets the
flag

    execution_never_reaches_here

while the placing of a label removes it again (because by branching to that
label it may be possible to reach this point after all).  This is used to
issue dead code warnings (and also to remove some unnecessary branches
when code-generating "if"..."else"...).

Assembling a routine header is straightforward: note that an option exists
here to assemble some code along with it to print out tracing information.
For this reason, and also to print out assembly trace at compile time,
"asm.c" accesses the variable names in the symbols table.

All labels are referred to in Inform's higher layers by number, rather than
by address; only "asm.c" has any access to what their addresses are.  Labels
are numbered from 0 upwards, but the assembler also recognises

    -2    branch only to the next instruction (this converts a branch
          instruction such as @get_child to a non-branch one, since it
          causes the instruction to behave identically whether the branch
          is made or not)
    -3    return false rather than branch
    -4    return true rather than branch

The code -1, used higher up for "no branch intended", should never descend
this far.

Note that the Z-machine has no formal concept of "routine end".  Therefore,
assemble_routine_end() only generates code to cause a return (and only then
if the present instruction position can be reached).  By long-standing
(if unfortunate) Inform convention, the value returned in this case will
be 0 for a routine in an object definition, or 1 for any other routine.

The routine-start/end routines also keep track of the usage of local
variables and labels, and issue warnings about those which were declared
but not used.  Since the scope of these symbols is restricted to the routine
in which they are defined, label names are unassigned (i.e. reset to
UNKNOWN_SFLAG CONSTANT_T's).  Local variable names are not held in the
symbols table and are automatically cleared when the next routine starts.
Finally, we can now detect the use of a label name not defined in the routine,
so we issue an error message if this has happened.


What happens in this first phase of code assembly is that a form of Z-code
called "initial code" is stored in a temporary holding area of memory until
a routine has been completed.  It is then treated by the branch optimiser
(see the next section).

"Initial code" is the same as Z-code, except that:

    (a)  all branches to labels have long form, and instead of an offset
         the branch value is the label number within this routine
         (0, 1, 2, ...);
    (b)  LABEL operands (to "jump" instructions) are, similarly, label
         numbers of the label to branch to;
    (c)  parallel to the holding area is a uchar[] buffer of the same size,
         holding a "marker value" for each byte.

(Note that branches to -2, -3 and -4 are assembled as short-form branches
and not marked as BRANCH_MV.)  The marker values are as follows:

    BRANCH_MV        This byte is the first of two in a branch operand
    LABEL_MV         Ditto but in a LABEL operand
    DELETED_MV       (Inserted by the optimiser, rather than at initial
                     code assembly.)  This byte has been deleted and should
                     be skipped over when the code is transferred to
                     long-term storage.
    a "module value" marker value
                     This byte is the first of two in a LONG_CONSTANT_OT
                     operand which should be marked with this value in
                     module compilation
    ditto + 128      Ditto but a one-byte SHORT_CONSTANT_OT or VARIABLE_OT
                     operand


    8.2   Branch offset backpatching and optimisation
          -------------------------------------------

There are three phases of this process:

    (i)   Deciding whether each branch should or should not be optimised
          to the 1-byte short form, using the label positions in the
          initial code;
    (ii)  Recalculating the label positions to take account of this
          (which will tend to move the labels closer together as second
          bytes of branch operands are deleted);
    (iii) Transferring the initial code to long-term storage, replacing
          branch and LABEL operand values with address offsets and
          issuing module markers as indicated (if in module mode).

A branch can be "short form" if it is a forward branch of between 0 and 29
bytes from the PC position of the next instruction.          

To record a branch as "short form" during (i), the second byte of the branch
operand is marked as DELETED_MV.

The long-term storage alluded to is either temporary file 2 or else the
zcode_area memory block.  Note that the routines moved there are still not
finally correct, but are "raw Z-code": Z-code which has incorrect operands
because value backpatching has not yet occurred.


    8.3   Text translation: ISO and Unicode resolution
          --------------------------------------------

The algorithm encoding ZSCII text to a compressed sequence of 5-bit
"Z-characters" has been documented many times: see the Z-machine standards
document.  

The section of Inform responsible, "text.c", keeps text for three Z-machine
areas: the static strings area, the dictionary and the "low strings pool".
(In addition, as requested by "asm.c", it encodes text into the Z-code area
as part of the assembly instructions @print and @print_ret.)

Static strings are written to a temporary holding area of memory, manipulated
until correct and then transferred to long-term storage: either temporary
file 1 or the static_strings_area memory block.

The low strings pool holds a table of strings in low Z-machine memory used
for abbreviations.  Such strings need to be accessible in an unusual way
(they are addressed by halved byte addresses, not packed addresses) owing
to an anomaly in the Z-machine architecture, and therefore must reside in
the bottom 64K of memory: they cannot be written to the static strings area.


Source text arrives at the text translation routines still encoded in an
ISO extension to the ASCII character set (depending on the current value
of the -C switch).  A character value of $ca might mean Latin E-circumflex,
E-ogonek, Cyrillic capital hard-sign, Arabic Teh or Greek capital Kappa
according to this context.  It would be illegal in plain ASCII or ISO
Hebrew, but since the Inform lexer has already filtered out any illegal
ISO characters, this possibility no longer arises.

Furthermore, characters can be specified by string escapes beginning with
the @ character.  The sequence @^E would specify Latin E-circumflex, and
the sequence @{39A} would specify Greek capital Kappa (by its Unicode
value, $039A) regardless of the current -C setting.

The work of sorting out character codes is handled by "chars.c".  It may
be useful to summarise the different meanings which character codes can
hold within Inform:

  (1)  "Source".  Raw source files can contain any character values
       between $00 and $ff.  The lexer filters these characters to reduce
       them to:

  (2)  "ISO".  An unsigned 8-bit character code within which the values
       $20 to $7e always have their usual ASCII meanings, and $7f to $9f
       are always undefined, and some values in the range $a0 to $ff
       may have meanings as specified in ISO 8859-1 to -9 if the current
       Inform -C setting is 1 to 9.  If -C is set to 0, "ISO" means the
       same as plain ASCII.

  (3)  "Unicode".  Unicode, or ISO 10646-1, is an unsigned 16-bit
       character code which includes essentially every human script.
       Inform uses it as an intermediate state between the diverse
       ISO possibilities and the final ZSCII result.

  (4)  "Text".  A string of ISO characters to specify a sequence of one
       or more Unicode characters.  Usually, one ISO character is
       translated to one Unicode character: the exception is for string
       escapes.  For instance, the text "a@{39a}c" specifies a sequence
       of 3 Unicode characters.

  (5)  "ZSCII".  An unsigned 10-bit character code used within the
       Z-machine.  See the Standards document for its definition: note
       that between $20 and $7e, it agrees with ASCII.

  
There are several translation routines in "chars.c" to move between these
forms: for instance, "zscii_to_text" is used when printing out the
contents of the dictionary into a transcript file.  (Transcript files
use the same ISO setting as the original source code.)  However, the
main sequence of translation is:

       Source -------> Text -------> Unicode -------> ZSCII
              lexer         ISO-to-          Unicode-to-
              filtration    Unicode;         ZSCII
                            string escape
                            parsing

Note that the first two stages (lexer filtration and ISO to Unicode)
depend only on the current -C setting, and are always possible.  The
third stage (Unicode to ZSCII) is more complex because it is programmable
and may fail, depending on what the text is needed for.

Unicode already specifies over 30000 different characters and
compromise is inevitable in trying to squash this into the 1024 possible
ZSCII values.  The Z-machine's stock of "extra characters"
(see Standard Document 1.0) is configured by the story file to
correspond to a block of up to 97 Unicode characters, and these
are the only ones usable in Inform strings, dictionary words or
as quoted values.

Inform sets up this block to a sensible default set given the
current ISO setting.  For example, if Inform reads in source code under
-C6 (ISO Arabic) then it will choose all the Arabic letters from
ISO Arabic.  (This default setting can be overridden using the
"Zcharacter" directive.)  Note that if the designer wants, say,
Unicode $274b ("heavy eight teardrop-spoked propeller asterisk")
as well as plain old $0621 (Arabic letter Hamza), this can simply
be added to the stock accumulated so far.

The stock of extra characters is defined by a list of Unicode values
written out as the "Unicode translation table" in "tables.c".


    8.4   Dictionary management
          ---------------------

As dictionary words are found in the source code, for instance in the value
of "name" properties, or in grammar, or in constants like 'this', they are
added to the dictionary by a routine called dictionary_add(), which numbers
each different word in "accession order": word 0 is the first which was
entered, and so on.

References to dictionary words are actually stored in the Z-machine as
byte addresses within the dictionary table, which cannot be known until after
the compilation pass (when the full alphabetical ordering is known).  Such
references are therefore always marked (with DICT_MV) for backpatching.

During compilation a dictionary table similar to the Z-machine format is
kept: there is a 7-byte header (left blank here to be filled in at the
construct_storyfile() stage in "tables.c") and then a sequence of records,
one per word, in the form

        <Z-coded text>    <dict_par1>  <dict_par2>  <dict_par3>
        4 or 6 bytes       byte         byte         byte

The difference between this and the final Z-machine dictionary table is
that the records occur in accession order, not alphabetical order.  (The
construct_storyfile() routine in "tables.c" rearranges them.)

The Z-machine does not specify what use is made of the three bytes of
"dictionary parameters" (it doesn't even specify that there have to be only
three): for this, see the next section.


The type "dict_word" is used for a structure containing the (4 or) 6 bytes of
Z-coded text of a word.  Usefully, because the PAD character 5 is < all
alphabetic characters, alphabetic order corresponds to numeric order of
Z-encoding (regarding the Z-encoded bytes as a 32 or 48 bit big-end-first
number; sign is unimportant as the initial bit is never set).  For this
reason, the dict_word is called the "sort code" of the original text word.

A special text-translation routine is used to "prepare" such a sort code:
as dictionary words do not use every feature possible in ordinary text
translation (abbreviations, for instance, or two of the string escape
forms) it's worth having a separate routine, optimised for speed.  Note
that this routine makes use of "text_to_unicode" and "unicode_to_zscii"
in "chars.c" when, but only when, it hits a string escape character @.
(For ordinary ISO characters, when no string escape is hit, these routines
would be too slow.)


Since dictionaries typically contain 500 to 2000 words, speed is extremely
important when searching and sorting: a simple O(n^2) algorithm to search the
dictionary, for example, greatly increases Inform's total compilation time on
medium-size games.  (Here "n" is the number of dictionary words.)

Inform 1 to 3 used a crude O(n^2) mass comparison algorithm, which was
unacceptably slow.  Inform 4 to 6.05 used a form of hash table, with the
first character of a word as hash function (there were 27 hash values, for
A to Z plus "other"): this behaved very acceptably most of the time, but was
ultimately still O(n^2) and the uneven distribution of initial letters in
English meant that the hashing function was not ideal.

Inform 6.1 instead stores the dictionary as a "red-black tree".  (See, e.g.,
Robert Sedgewick, "Algorithms" (1983, second ed. 1988).)  The tree is binary
and always has the property that at node X, anything hanging from the left
branch of X is alphabetically before what's at X, and anything hanging from
the right is alphabetically after.  For instance, a legal configuration is:

                             fieldmouse
                               /    \
                          abacus   patrician
                          /   \
                         a    constantinople

This is simple enough to build and to search through.  The problem is that
one usually ends up with a haphazardly arranged or "unbalanced" tree in which
some branches are very much longer than others, so that searching times can be
unpredictable, and in worst case the construction process is still O(n^2).

Thus one wants a method of building the tree which attempts to restore the
balance as it does so.  The algorithm here involves "colouring" each branch
either red or black (that is, storing one bit of information for each branch).
Roughly speaking, suppose X is added as a branch of Y.  Then the branch
leading down from Y to X -- the new branch -- is coloured black.  But the
branch leading down to Y -- which was already there -- is recoloured red.

We do not permit the tree to contain two consecutive red branches: whenever
we notice this happening, we perform a "rotation".  A typical rotation is:

                  a                        a
                   \                        \
                    bag                      cat
                   / \(red)    --->     (red)/ \(red)
                 baa cat                  bag   dog
                     / \(red)            /  \
                   bun   dog           baa  bun

It's like lifting the "bag" subtree and re-hanging it from the word "cat".
(There is another case, when the red branches slope in opposite directions.)

We also try to percolate redness upwards (redness being the possibility for
change in the tree structure).  So, when searching, if we ever find

                 \(black)
                   node
             (red)/    \(red)

then we replace it with

                 \(red)
                   node
           (black)/    \(black)

and check the "rotate if you get two consecutive red branches" rule again.

It is known that:
(i)  if there are N words in the tree, then any search takes less than
     2 log-to-base-2(N) + 2 comparisons (i.e., this is the maximum depth
     of the tree);
(ii) at worst you need to perform only one rotation for every four
     comparisons.
In practice, the observed values are even better: the average seems to
be log-to-base-2(N) comparisons and 1 rotation (though this is unproven).
Trees are almost optimally balanced most of the time.

For an Inform game with a dictionary of 2000 words, then, we expect to perform
about 12 comparisons per search.  This compares with a typical figure of 75
using the Inform 6.05 hashing algorithm.  Comparisons are not terribly
expensive, but searches are quite frequent.


    8.5   Format of tables unspecified by the Z-machine
          ---------------------------------------------

The Z-machine standards document requires many tables to have particular
formats and locations: but Inform makes several other tables, and this
section is to document their formats.  (See the construct_storyfile()
routine in "tables.c" for exactly how the byte-addressable Z-machine memory
is put together.)

Details of differences in modules as compared with story files are given
later.

    (i)  Header

Following standard conventions, Inform: leaves bits 2 and 3 of "Flags 1"
clear; places the release number in the word at bytes 2 and 3; the grammar
table address at bytes 14 and 15; and the serial number in bytes 18 to 23. 
The default release number is 1 and the default serial number is either
970000 (if no access to the current date is available) or the compilation
date in the form YYMMDD.

In addition, Inform writes its version number in ASCII form into bytes
60 to 63 (the last four bytes of the header) in the form "6.01".  This
makes story files produced by Inform 6 distinguishable from all other
story files, something interpreter writers have long wanted.

    (ii)  Low strings pool and abbreviations table
    
By default, the 96 entries are all "   " (this Z-encodes to the $80 $00,
which makes it a convenient default value).  The first two banks of 32
are used for strings declared by Abbreviate; the third is set at run time
by the "string" statement, to values which have previously been declared
by Low_string.

The pool of low strings is always placed at $0040, immediately after the
header and before the abbreviations table.

    (iii)  Header extension table

The Z-machine standard allows games not to provide a header extension
table (formally called the "mouse data table" after the only information
which Infocom ever stored in it) and Inform 6.01 to 6.11 never do.
Inform 6.12 and later always generate this extension, which always
contains at least 3 word entries.

    (iv)  Character set table

This is a table of 78 = 3*26 bytes, giving the Z-character numbers of
the characters to appear in the three translation alphabets.  The table
is only written if the "Zcharacter" directive has been used to alter
the standard settings.  (See section 12.2.)

    (v)  Unicode translation table

This is generated (by Inform 6.12 and later) only when needed, i.e., only
when a game has asked to use a stock of "extended characters" which
differs from the usual ISO Latin1 set.  Games compiled under the ISO
setting -C0, -C1 (the default setting) and -C9 will usually never
have a Unicode translation table, though they can have.

    (vi)  Property default values table

Note that Inform defines properties 1, 2 and 3 automatically: 1 is "name",
while 2 and 3 are used to support the veneer routines' implementation of
object orientation.

    (vii)  Class-number-to-object-number table
    (viii)  Individual property values table

See the notes on object orientation; these tables are used by the veneer
and are unspecified by the Z-machine.

    (ix)  Grammar table
    (x)  Actions table
    (xi)  "Preactions" table
    (xii)   "Adjectives" table

See the next section.  The Z-machine specifies none of these tables.

    (xiii)   Dictionary table

The format of this table is largely specified by the Z-machine.  Although
the Z-machine allows some flexibility in setting up the dictionary, Inform
sticks to the usual Infocom configuration:

    the separating characters are full stop, comma and space;
    the word-entry length is 7 (in version 3) or 9 (otherwise).

Thus the dictionary begins with seven header bytes:

    3  '.'  ','  ' '  7-or-9  <word containing the number of entries>

Each word is encoded with an entry giving 4 (in version 3) or 6 (otherwise)
bytes of textual representation, fully specified by the Z-machine, and
then three bytes of data, which Inform can do with as it pleases.

These are the so-called "dictionary parameters" dict_par1, 2 and 3.  Inform's
pleasure is to write into them as follows:

    dict_par1: flags
    dict_par2: verb number (counting downwards from 255)
    dict_par3: preposition number (counting downwards from 255) in
               grammar version 1; not used in grammar version 2

The flags are given as follows:

    bit:    7      6   5   4   3     2        1      0
            <noun>             <adj> <plural> <meta> <verb>

The bits <verb>, <noun> and <adj> are set if the word can be used in the
context of a verb, noun and/or preposition (all three can be simultaneously
set).  The <meta> bit indicates that the English verb is "meta", that is,
is a command to the program and not a request in the game.

The <plural> bit is set using the '...//p' notation, like so:

    'egg'  'eggs//p'

Library 6/3's parser uses this to automatically detect commands referring
to multiple game objects.

    (xiv)  Module map

This is an extension to the header, only used in module files (see later).

    (xv)  Linking data table

A table placed after the end of static strings, but only in module files
(see later).

To summarise, then, an Inform game has the following memory map:

      -----------------------------------------------------------
      Dynamic      Header
       memory      Low strings pool
                   Abbreviations table
                   Header extension                   
                   Character set table (if differs from normal)
                   Unicode table (if differs from normal)
                   Common property default values                  (*)
                   Object tree
                   Common property value tables                    (*)
                   Class number to object number conversion table
                   Property names table                            (+)
                   Attribute names table                           (+)
                   Array names table                               (+)
                   Individual property value tables                (*)
                   Global variable values, part of...              (*)
                   ...Dynamic array space                          (*)
      -----------------------------------------------------------
      Static       Table of grammar table addresses
      byte         The grammar tables (one for each Inform verb)   (-)
      addressed    Table of action routine addresses               (+)
      memory       Table of parsing routine addresses              (+)
                   Table of "adjectives", i.e., prepositions       (+)
                   Dictionary
                   (Module map)
                   Synchronising gap of up to 7 null bytes
      -----------------------------------------------------------
      Static       Z-code area                                     (*)
      "virtual"    Static strings area
      memory       (Link data table)
      -----------------------------------------------------------
      (*)  This area requires full value backpatching
      (+)  This area requires a simple automatic correction
           (e.g. adding on the static strings or code offsets),
           which is done without the full backpatching machinery
      (-)  In grammar version 1, this area requires no backpatching,
           but in GV2 each grammar token's data is checked to see
           if a dictionary word or routine address, and backpatched
           if so: so in GV2 it comes under category (+)


    8.6   Grammar version numbers GV1 and GV2
          -----------------------------------

The "grammar tables", used by the parser in an adventure game, are not
specified by the Z-machine at all (contrary to popular opinion) and must
therefore be fully specified here.  There are four such tables:

    Grammar table
    Actions table
    "Preactions" table
    "Adjectives" table

Inform can generate two different formats for these tables, known as
grammar version 1 (henceforth GV1) and grammar version 2 (GV2).

Inform makes the GV1 or GV2 decision according to the value of the symbol

    Grammar__Version

which Inform usually defines as 1, but allows programs to redefine.  Library
6/3 and later, in particular, contain the line

    Constant Grammar__Version = 2;

Note that it is essential for the library's parser to understand the
grammar tables being generated by Inform, so attempting to use GV2 with
library 6/2 or earlier would fail.


The designs of GV1 and GV2 have some points in common.  The Grammar table
is a --> array containing addresses of grammars, one for each Inform verb.
The Actions table is a --> array containing addresses of action routines
(such as "TakeSub"), one for each action.  (Fake actions have no action
routines and aren't in the table.)

(The term "Inform verb" means essentially a common parsing grammar, such as
that shared by "take" and "hold", and is used to distinguish this from an
"English verb", such as the word "take".)


GV1 is at heart an imitation of middle-period Infocom table formats, and
was used in order that common debugging tools would still work on Inform
games (an important consideration in the early days of debugging Inform 1).
In GV1, an individual grammar table has the format:

    <number of grammar lines>          1 byte

followed by that many 8-byte lines in the form:

    <parameter count>  <token1> ... <token6>  <action number>
    -- byte ---------  --byte--     --byte--  -- byte -------

The parameter count is the number of tokens which are not prepositions.
This is needed because the run of tokens is terminated by null bytes (00),
which is ambiguous: "noun" tokens are also encoded as 00.

The numerical token values are as follows:

    "noun"            0  
    "held"            1  
    "multi"           2  
    "multiheld"       3  
    "multiexcept"     4  
    "multiinside"     5  
    "creature"        6  
    "special"         7  
    "number"          8
    noun=Routine      16 + parsing-routine-number
    Routine           48 + parsing-routine-number
    scope=Routine     80 + parsing-routine-number
    attribute         128 + attribute number
    'preposition'     adjective number

    illegal values:   9-15, 112-127

This one-byte value has to identify particular prepositions and routines,
which is only possible using a numbering system for each.  GV1 numbers
parsing-routines upwards from 0 to 31, in order of first use.  A separate
table translates these into routine packed addresses: the "preactions"
table.  (As usual, the term is traditional: Inform has no concept of
preaction, but the Infocom games from which it inherits GV1 do have such a
concept.)  The preactions table is a simple --> array.

(Note that in Infocom's games the preactions table always has the same
length as the actions table: this is not true in either GV1 or GV2 Inform
games.)

Prepositions are also identified by their "adjective number".  (An early
misunderstanding in Z-machine decipherment led to the traditional use of
the word "adjective" for dictionary words explicitly written into grammar
lines, which are mainly prepositions like 'into' or 'against'.)  Adjective
numbers count downwards from 255 in order of first use.  They are translated
back into dictionary words using the "adjectives table".

The adjectives table contains 4-byte entries:

    <dictionary address of word>  00  <adjective number>
    ----2 bytes-----------------  ----2 bytes----------- 

To make life more interesting, these entries are stored in reverse order
(i.e., lowest adjective number first).  The address of this table is
rather difficult to deduce from the file header information, so the constant
#adjectives_table is set up by Inform to refer to it.


GV2 is a much more suitable data structure, easier to read and write,
less limiting and marginally faster and more economical of Z-machine and
Inform memory.  In GV2 an individual grammar table has the format:

    <number of grammar lines>          1 byte

followed by that many grammar lines.  Individual lines are no longer always
8 bytes long, as in GV1.  Instead they have the form:

    <action number>  <token 1> ... <token N>  <ENDIT>
    ----2 bytes----  -3 bytes-     -3 bytes-  -byte--

The action number is actually contained in the bottom 10 bits of the word
given first: the top five are reserved for future use, which leaves

    action_number & $400

as a bit meaning "reverse the order of the first and second parameters
if this action is to be chosen".

The ENDIT marker is the number 15.  There can be anything from 0 to 30
tokens, and each occupies three bytes, arranged as:

    <token type>   <token data>
    -- byte ----   --2 bytes---

Token type bytes are divided into the top two bits, the next two and the
bottom four.

The "next two bits" are used to indicate alternatives.  In a sequence of
tokens

    T1 / T2 / T3 / ... / Tn

then T1 will have $$10 in its "next two bits", and each of T2 to Tn will
have $$01.  Tokens not inside lists of alternatives always have $00.  (Note
that at present only prepositions are allowed as alternatives, but the
format is designed to open the possibility of extending this to all tokens.)

The bottom four are the "type" of the token.  The top two indicate what kind
of data is contained in the token data.  Strictly speaking this could be
deduced from the bottom six bits, but it's convenient for making backpatching
GV2 tables a simple matter within the compiler.

    Type  Means                       Data contains              Top bits
    0     illegal (never compiled)
    1     elementary token            0   "noun"                 00
                                      1   "held"
                                      2   "multi"
                                      3   "multiheld"
                                      4   "multiexcept"
                                      5   "multiinside"
                                      6   "creature"
                                      7   "special"
                                      8   "number"
                                      9   "topic"
    2     'preposition'               dictionary address         01
    3     noun = Routine              routine packed address     10
    4     attribute                   attribute number           00
    5     scope = Routine             routine packed address     10
    6     Routine                     routine packed address     10

GV2 makes no use of "adjective numbers" (so that dict_par3 is always zero
in GV2's dictionary words) and leaves both the adjectives table and the
preactions table empty.

There is one further difference between GV1 and GV2: in GV1, fake actions
are numbered from 256 upwards; in GV2, from 4096 upwards.


Note that although GV2 takes three bytes for a token as compared with GV1's
one byte, omission of the redundant null tokens and adjective table means
that when compiling a small "shell" library game, GV2 actually produces
more economical tables: 1920 bytes as opposed to 2337.

The first two entries in the following table are the real reason for GV2:

                                        Limit in GV1        Limit in GV2

    Prepositions per game               76                  unlimited
    Parsing routines (general ones,
       noun= filters, scope= routines
       all put together) per game       32                  unlimited
    Tokens per grammar line             6                   unlimited
    Actions per game                    256                 4096
    Inform verbs per game               256                 256

In practice the Inform compiler restrains the number of verbs (but that's
an adjustable memory setting) and lines per verb: in Inform 6.05 and earlier,
the maximum number of lines per verb is 20.  Inform 6.10 internally stores
grammar roughly as GV2 even when it's going to compile GV1 at the end, and
this allows a more economical use of Inform's memory: as a bonus, then, the
maximum number of lines per verb is raised to 32 in Inform 6.10.


    8.7   Value backpatching
          ------------------

Value backpatching is the final translation phase in Inform.  It is the
process of correcting temporary values which were written into the Z-machine
at a time when final values could not be known.

In addition to the Z-machine regions marked in the memory map above, the
symbols table also undergoes value backpatching: defined Constant symbols
flagged as CHANGE_SFLAG are backpatched as needed, before first use of their
values in backpatching something else.

The positions of such values have been "marked" with a "marker value"
indicating the type of value needed.  The backpatchable marker values are:

    Marker value       Significance of temporary value

    STRING_MV          Scaled address within static strings area
    ARRAY_MV           Byte address within dynamic array area
    IROUTINE_MV        Scaled address within Z-code area
    VROUTINE_MV        Veneer routine number (a *_VR value)
    NO_OBJS_MV         None
    INCON_MV           "Inform constant": the index in the #constants
                       keyword group (a *_SC value)
    DWORD_MV           Accession number of dictionary word
    INHERIT_MV         Byte address within common property values table
                       of the value which is being inherited here
    INHERIT_INDIV_MV   Ditto, but for individual property values table
    INDIVPT_MV         Offset in individual property values table
    MAIN_MV            None
    SYMBOL_MV          Index within symbols table

and these are backpatched as follows:

    Marker value       Final value

    STRING_MV          Packed address of static string
    ARRAY_MV           Byte address
    IROUTINE_MV        Packed address of routine
    VROUTINE_MV        Packed address of veneer routine
    NO_OBJS_MV         Number of objects in object tree
    INCON_MV           Value of constant (typically an address of some
                       Z-machine table)
    DWORD_MV           Byte address of word's entry in dictionary table
    INHERIT_MV         The value to inherit (after it has been backpatched)
    INHERIT_INDIV_MV   The value to inherit (after it has been backpatched)
    INDIVPT_MV         The byte address of this point in the individual
                       property valyes address
    MAIN_MV            Packed address of "Main" routine
    SYMBOL_MV          Value of symbol (after it has been backpatched)

The code NULL_MV is sometimes used to indicate "no marker", i.e., that a
value is correct as written.  Additional marker values also exist for
operands held within modules:

    IDENT_MV           Property identifier number
    ACTION_MV          Action number
    OBJECT_MV          Object number
    VARIABLE_MV        Variable number

(see chapter 10).  Note that modules are not value-backpatched at compilation
time, but at Link time.

Two different memory blocks are used to store backpatch markers: one for the
"Z-machine image", the byte-addressable memory at the bottom of the memory
map; and another for the Z-code area.  In the interests of economy, these use
different formats to hold marker data: see "bpatch.c".


    8.8   Packed address decoding
          -----------------------

Recall that the formula relating a packed address P to its byte address B is:

    B = 2P            version 3
        4P            versions 4 and 5
        4P + 8R       versions 6 and 7: routines
        4P + 8S       versions 6 and 7: static strings
        8P            version 8

In versions 6 and 7, R and S can be chosen fairly freely in the range 0 to
M/8, where M is the minimum address of a routine or static string as
appropriate.  The point is to expand the address space by making higher B
values available (whereas P has to lie within 0 and 65535): so one wants to
make R, S larger rather than smaller.  On the other hand, it's necessary to
the Inform language that the following are all different:

    (a) the number 0;
    (b) valid object numbers;
    (c) packed addresses of routines;
    (d) packed addresses of strings.

To be sure that (c) and (d) differ, Inform sets S = R.  To keep (c) from
(a) and (b), we must ensure that the packed address of every routine is at
least X, where X is slightly more than the largest object number.  Inform
also knows the byte address M of the lowest routine in memory ("Main__").
Note that

    M  =  4X + 8R

Inform therefore nudges up M and X to ensure that M and 4X are both divisible
by 8, and sets

    R  =  (M - 4X)/8
    S  =  R.

To compare this with the code: Write_Code_At is M; extend_offset is X;
scale_factor and length_scale_factor hold the magic constants 4 and 8.
code_offset and strings_offset are the scaled addresses of the first
routine and the first static string respectively, worked out by

    code_offset = 4X
    strings_offset = 4X + size of code area

The reason for having these is that when compilation was actually happening,
Inform did not know enough to calculate packed addresses, and instead
wrote scaled addresses starting from 0 for each area.  In backpatching,
then, Inform adds the above offsets to each packed address, and then they
come right.


    9   The run-time veneer
        -------------------

         "I think you now have a fully-stocked veneer shop, with
          the rest of the compiler being a pencil sharpener bolted
          to the wall in the back room."
          -- Andrew Plotkin, letter to the author, 10 January 1999


    9.1   Services provided by the veneer
          -------------------------------

The "veneer" is a thin section of code generated by Inform as an intermediate
layer between the code compiled from the source, and the Z-machine itself:
like the veneer on a table, it gilds the surface of the Z-machine, or so
the term is supposed to mean.

It consists of 38 routines, 4 class-objects, 2 properties and 2 global
variables, together with several tables in the Z-machine's dynamic memory
and a number of strings, notably the source-code names of arrays and
properties, which help in printing out good error messages.  The veneer
adds a major data structure not present in the Z-machine architecture:
the concept of "individual property", a mechanism to allow games to have
more or less unlimited numbers of properties which don't need to be
declared before use.

The "veneer.c" section of Inform organises the compilation of the routines;
each one is compiled only if actually needed, and if no over-riding definition
has already been given in the source code.  (For instance, "PrintShortName"
is a veneer routine but the Inform library provides a much fuller version
than the one in "veneer.c".)

Note that the full Inform source code for these routines is present in
static strings in "veneer.c" (which are compiled using the lexer's "read
from a string" mode).

The routines come in five groups.  Note that names with double underscores
in are so chosen not to clash with identifiers used by the programmer.

   Main__        This is not really a routine at all, but is the piece of code
                 which the Z-machine's PC is initially set to.  It simply
                 makes a function call to Main(), and then quits.

   Box__Routine  Code to display an array of static strings in a black
                 "quotations" box, as required by the "box" statement.

   Printing routines   PrintShortName, DefArt, CDefArt, etc.;
                 all normally over-ridden within the Inform library.
                 Also RT__Err, "print an error message at run-time".

   Object orientation routines
                 Routines to implement message sending, access to individual
                 properties, "metaclass", "ofclass" and "provides".

   Assertion routines
                 Routines needed by "strict mode" to check that proposed
                 object moves, memory read/writes, etc. will not crash
                 the Z-machine if executed.

Since veneer routines are compiled at the end of the pass (when it's known
which will be needed), the code area looks like this:

     start of code area -->      00             (routine header for Main__)
       initial PC value -->      @call_1n Main
                                 @quit

                                 ... routines as defined in source code...
                                 
                                 veneer routines as required, in order
                                 of appearance in the "veneer.c" table

(Note that Main__ is assembled as a routine.  In the Version 6 Z-machine,
this is called to begin execution.  In other versions, the initial PC value
is set to the call_1n instruction -- which is why Main__ must be first: the
initial PC value has to lie in the bottom 64K of Z-machine memory.  Note
also: for the Versions 3 and 4 Z-machine more primitive call opcodes are
used to get into Main.)


The four objects in the veneer are the class-objects for the four
metaclasses defined in all Inform programs: Class, Object, Routine and
String.  The object tree therefore looks like this:

    1     Class
    2     Object
    3     Routine
    4     String
          ... objects and class-objects as defined in source code ...

The two global variables are "self" and "sender", used in message-passing,
and which are two of the 7 highest-numbered global variables which Inform
reserves to its own use as "registers".


    9.2   Specification of the veneer routines
          ------------------------------------

Please note that some of the specifications given here may change in future
compiler releases without warning or apology -- so it is unwise to write
code which accesses the veneer directly.

Box__Routine(maxw, table)

    Display a "Trinity"-style quotation box, whose text is given as a
    table array of packed addresses of strings giving individual lines,
    and whose maximum text width is "maxw".

R_Process(action, noun, second)

    Implement <action noun second>.  (The Inform library does this properly
    by defining its own "R_Process" routine; the ordinary veneer would just
    print text like "<23 2 3>".)

DefArt(object)
IndefArt(object)
CDefArt(object)

    Print the object's name, prefixed by definite/indefinite/capitalised
    definite article.  (Again, the veneer's plain version is very plain.)

PrintShortName(object)

    Print just the object's short name: but give a sensible response,
    and in particular don't crash the Z-machine, even if "object" is any
    illegal value.

EnglishNumber(n)

    Print out "n" in words.  (The Inform library does this properly:
    the plain veneer just prints n as a number.)

Print__PName(property)

    Print a textual description of the property, e.g. "before" or
    "Coin::before".

WV__Pr(object, property, value)

    Implement object.property = value, printing suitable run-time errors
    if either object or property are invalid or if the object doesn't provide
    property.

RV__Pr(object, property)

    Likewise but simply read object.property.

CA__Pr(object, property, a, b, c, d, e, f)

    Implement object.property(a, b, c, d, e, f) with a to f being optional.
    Note that object might be a class, routine or string and this is where
    the messages provided by these pseudo-objects are implemented.

IB__Pr(object, property)

    Implement ++object.property.

IA__Pr(object, property)

    Implement object.property++.

DB__Pr(object, property)

    Implement --object.property.

DA__Pr(object, property)

    Implement object.property++.

RA__Pr(object, property)

    Implement object.&property (much more difficult than it sounds: this is
    where individual property tables are effectively implemented).

RL__Pr(object, property)

    Implement object.#property.

RA__SC(class, property)

    Implement class::property, returning this property number.

OP__Pr(object, property)

    Implement the condition "object provides property".

OC__Cl(object, class)

    Implement the condition "object ofclass class".

Copy__Primitive(o1, o2)

    Make o1 a copy of o2, in the sense that: the attributes of o1 become
    exactly those of o2; and for every property provided by o2, that
    property of o1 has the o2 value copied over.

RT__Err(error_number, ...parameters...)

    Print the appropriate run-time error message, which always takes the
    form "[** Programming error: ... **]".

    <string>, <class>, <property>
        "class <class> (object number ...) has no property <property> to
        <string> [and nor does any other object"

    <string>, <object>, <property>
        "<object> (object number ...) has no property <property> to
        <string> [and nor does any other object"

    <string>, <class>, -<class2>
        "class <class> (object number ...) is not of class <class>"

    <string>, <object>, -<class>
        "<object> (object number ...) is not of class <class>"

    1, <class>
        "class <class>: 'create' can have 0 to 3 parameters only"

    2, <illegal-object>
        "tried to test "in" or "notin" of object <illegal-object>"

    3, <illegal-object>
        "tried to test "has" or "hasnt" of object <illegal-object>"

    4, <illegal-object>
        "tried to find the "parent" of object <illegal-object>"

    5, <illegal-object>
        "tried to find the "eldest" of object <illegal-object>"

    6, <illegal-object>
        "tried to find the "child" of object <illegal-object>"

    7, <illegal-object>
        "tried to find the "younger" of object <illegal-object>"

    8, <illegal-object>
        "tried to find the "sibling" of object <illegal-object>"

    9, <illegal-object>
        "tried to find the "children" of object <illegal-object>"

   10, <illegal-object>
        "tried to find the "youngest" of object <illegal-object>"

   11, <illegal-object>
        "tried to find the "elder" of object <illegal-object>"

   12, <illegal-object>
        "tried to use "objectloop" of object <illegal-object>"

   13, <illegal-object>
        "tried to use "}" at end of "objectloop" of object <illegal-object>"

   14, <illegal-object>
        "tried to "give" an attribute to object <illegal-object>"

   15, <illegal-object>
        "tried to "remove" object <illegal-object>"

   16, <object1> <object2>
        "tried to "move" <object1> to <object2>"
        where <object1> is illegal

   17, <object1> <object2>
        "tried to "move" <object1> to <object2>"
        where <object2> is illegal

   18, <object1> <object2>
        "tried to "move" <object1> to <object2>, which would make a loop:
        <object2> in ... in <object 1> in <object2>"

   19, <object>
        "tried to "give" or test "has" or "hasnt" for a non-attribute
         of <object>"

   20,
        "tried to divide by zero"

   21, <illegal-object>
        "tried to find the ".&" of object <illegal-object>"

   22, <illegal-object>
        "tried to find the ".#" of object <illegal-object>"

   23, <illegal-object>
        "tried to find the "." of object <illegal-object>"

   24,
        "tried to read outside memory using ->"

   25,
        "tried to read outside memory using -->"

   26,
        "tried to write outside memory using ->"

   27,
        "tried to write outside memory using -->"

   28, <index>, <max-index>, <array-type>, <array-number>
        "tried to read from -><index> in the <array-type> <array-name>,
         which has entries <min-index> to <max-index>"
        The array types are: 0 = byte -> array, 1 = word --> array,
        2 = string array, 3 = table array and the array name is the
        packed string address at #array_names_table--><array-number>.

   29, <index>, <max-index>, <array-type>, <array-number>
        "tried to read from --><index> in the <array-type> <array-name>,
         which has entries <min-index> to <max-index>"

   30, <index>, <max-index>, <array-type>, <array-number>
        "tried to write to -><index> in the <array-type> <array-name>,
         which has entries <min-index> to <max-index>"

   31, <index>, <max-index>, <array-type>, <array-number>
        "tried to write to --><index> in the <array-type> <array-name>,
         which has entries <min-index> to <max-index>"

   32, <object>
        "objectloop was broken because the object <object> was moved
         while the loop passed through it"

   33, <code>
        "tried to print (char) <code> which is not a valid ZSCII character
         code for output"

   34,
        "tried to print (address) on something not the byte address of
         a string"

   35,
        "tried to print (string) on something not a string"

   36,
        "tried to print (object) on something not an object or class"


Z__Region(value)

   Returns 1 if value is a Z-machine object number, 2 if a packed address
   within the code area, 3 if a packed address within the strings area
   and 0 otherwise.

Unsigned__Compare(x, y)

   Returns 1 if x>y, 0 if x=y, -1 if x<y, regarding x and y as positive
   numbers in the range 0 to 65535.

Meta__class(value)

   Implements metaclass(value).

CP__Tab(property_table, property)

   If property is -1, return the address of the first byte after the
   property table.  Otherwise:
   If the property table contains the property, return the address of the
   first byte of the property value.
   If the property table doesn't contain the property, return 0.

Cl__Ms(class_object, property, arity, a, b, c, d)

   Here property must be one of: create, recreate, destroy, remaining
   and copy.  Arity must be the number of parameters supplied, 0 to 4,
   in the optional variables a, b, c and d.
   This function implements class_object.property(a, b, c, d).  It's
   really a part of CA__Pr but the code has been moved here to prevent
   CA__Pr from becoming inconveniently large for the veneer-compilation
   code.

RT__ChT(object1, object2)

   If it is legal to move object1 to object2, do so; otherwise print an
   error message.  If the library is present and the appropriate debugging
   flag is set, trace this by printing something like "[Moving torch
   to Conservatory]".

RT__ChR(object)

   If it is legal to remove object, do so; otherwise print an error message.
   If the library is present and the appropriate debugging flag is set,
   trace this by printing something like "[Removing coin]".

RT__ChG(object, attribute)

   If it is legal to give attribute to object, do so; otherwise print an
   error message.  If the library is present and the appropriate debugging
   flag is set, trace this by printing something like "[Giving torch
   moved]".

RT__ChPS(object, property, value)

   If it is legal to set object.property = value, where property is known
   to a common (i.e. declared with Property) property, then do so.
   Otherwise print an error message.

RT__TrPS(object, property, value)

   Traces an object property setting, printing text such as
   "[Setting Conservatory.time_left to 7]".  This routine is called only
   when the library is present and the appropriate debugging flag is set.

RT__ChLDB(base, offset)

   If it is legal to read base->offset, do so; otherwise print an error
   message.

RT__ChLDW(base, offset)

   If it is legal to read base-->offset, do so; otherwise print an error
   message.

RT__ChLDB(base, offset, value)

   If it is legal to set base->offset = value, do so; otherwise print an
   error message.

RT__ChLDW(base, offset)

   If it is legal to set base-->offset = value, do so; otherwise print an
   error message.

Symb__Tab(code, num)

   Only present in games compiled with -X for Infix.  In effect this
   holds some static data in the code area, rather than using up the
   limited readable memory area with further tables.  

   If code is 1, returns data on the array numbered "num" and stores
   further in temp_var2 and temp_var3; if code is 2, ditto for
   routines; if 3, for constants.  This routine is subject to change
   without notice, to put it mildly.


    9.3   Properties 2 and 3, "ofclass" and "metaclass"
          ---------------------------------------------

Inform reserves common properties 2 and 3 to itself, and uses them as follows:

    property 2 (additive): a word array containing the class-object numbers
        of all classes of which this object is a member; the property is
        absent if the object belongs to no classes.  (Note that metaclasses
        do not appear here: tree objects are all members of either Object
        or Class but neither is listed here.)

    property 3: a byte address pointing to the individual properties table
        for this object; property 3 is absent if it has no individual
        properties.

The veneer tests "X ofclass Y" according to the rules:

    if X is a valid Z-machine object number:
        if X is a class-object (tested by: if X is a child of Class)
            then only if Y is Class
        otherwise only if Y appears in the property 2 list for X

    if X is the packed address of a routine: only if Y is Routine

    if X is the packed address of a string:  only if Y is String

giving a run-time error if Y isn't a class-object.  Note that determining
which of these ranges contains X needs code essentially the same as that
in the Inform 5/12 library routine "ZRegion": indeed, this is how the
veneer routine implementing the "metaclass" function works.


    9.4   Class numbering and class objects
          ---------------------------------

In Inform 6 each class definition results in the creation of a related
object, the class-object for that class.

Class objects have:

    their internal symbol-name as short name;
    no attributes (*);
    no (common) properties except possibly for property 3.

All except for Class are placed in the object tree as children of Class.

Immediately following the property values table for a class (which is bound
to be short, since it can only contain the short name and perhaps property 3)
is a six-byte block of attributes which would be inherited from the class (*),
and then a second property values table, specifying the properties which
members of the class would inherit from it.  (These values are accessed at
run time to implement the superclass access operator :: when applied to a
common property.)

Property 3 is used to hold the byte address of the individual properties 
table of properties which members would inherit.  It is absent if there
are no such properties.  (These are accessed at run-time to implement the
superclass access operator :: when applied to an individual property.)

Note that there are two numbering systems used for classes: the "class
number" counts 0, 1, 2, 3, ... in order of declaration, with Class numbered
0.  (This is used in superclass message number coding.)  Classes are also
referred to by the object numbers of their class objects.  It's necessary
to be able to convert between the two at run time; the "class numbers table"
is a word array whose Nth entry is the class-object number corresponding to
class number N.

[(*) This is a change made in Inform 6.12.  Previously, class objects
had the attributes which would be inherited from the class.  6.12 moves
this information to a harmless block of six bytes elsewhere, essentially
so that loops like "objectloop (x has animate)" do not pick up classes
as well as objects.]


    9.5   Individual property identifiers
          -------------------------------

Just as the common properties are identified by property numbers 1 to 63,
so the individual properties are identified by numbers 64 upward.  The
group 64 to 71 is used to identify the properties provided by inheritance
from the metaclasses:

    Id    Name                Provided by

    64    create              class-objects other than metaclass-objects
    65    recreate            ditto
    66    destroy             ditto
    67    remaining           ditto
    68    copy                ditto
    69    call                objects of metaclass Routine
    70    print               objects of metaclass String
    71    print_to_array      ditto

However, in addition to this a property ID number with its top bit set has
a special meaning:

    Bit    16  15  14  13  12  11  10   9   8   7   6   5   4   3   2   1
           set --------------------------   -----------------------------
                     entry number                    class number
               within class's i.p. table
                   (counting from 0)

which means "the given class's inheritable value of the property with this
ID number", and is used to implement the "superclass" operator as something
which acts on ID numbers:

          Class :: identifier

is implemented by writing the class number into the bottom bits and searching
as above.  Further, a property ID number in the form:

    Bit    16  15  14  13  12  11  10   9   8   7   6   5   4   3   2   1
           00  set  ---------------------   -----------------------------
                    common prop number              class number

refers to "the given class's inheritable value of this common property".

As a special rule, in the case of class Object this accesses the Z-machine
default property value.  Thus, since the library defines

    Property cant_go "You can't go that way.";

then X.Object::cant_go will always evaluate to "You can't go that way."
for any Object X.


Note that the need to be able to pack such descriptions into 16 bits is one
reason classes need to have a 0, 1, 2, ... numbering system in addition
to the class-object system of numbering (which has many lacunae).  Even as
it is, the present veneer implementation constrains us to a maximum of 256
class definitions, each able to provide at most 128 individual properties
(in addition to the potentially 63 declared as common using Property);
more work may be needed to alleviate this at a later date.


    9.6   Individual property tables
          --------------------------

For each object that provides any individual properties at all, property
3 holds the byte address of its individual property table.  The entries
in this table are pairs of identifier number and current value: thus,

   raw identifier number (2 bytes)
       but with top bit set if "private" to the object in question
   property length (1 byte)
   current value (as many bytes as specified in the property length,
       though this will always be an even number >= 2)

terminated by a zero word, 00 00.  (Note that there is no property 0,
common or individual.)

The properties can occur in any order and need not be in numerical order
of their IDs (which is helpful to assist linking, though it may have speed
implications).  The following Inform code prints out an object's table:

    [ IndivTable obj x n;
      x = obj.3;
      if (x == 0) rfalse;
      print "Table for ", (object) obj, " is at ", (hex) x, "^";
      while (x-->0 ~= 0)
      {   print (hex) x-->0, " (", (property) x-->0, ")  length ", x->2, ": ";
          for (n = 0: n< (x->2)/2: n++)
              print (hex) (x+3)-->n, " ";
          new_line;
          x = x + x->2 + 3;
      }
    ];

    [ hex x y;
      y = (x & $ff00) / $100;
      x = x & $ff;
      print (hdigit) y/$10, (hdigit) y, (hdigit) x/$10, (hdigit) x;
    ];

    [ hdigit x;
      x = x % $10;
      if (x<10) print x; else print (char) 'a'+x-10;
    ];


From an inheritance point of view, individual properties are always
non-additive: a specified value overrides and replaces an inherited value.
(The idea is that access to superclass values is a more elegant alternative
to using additive properties to make class values remain available to
inheritors which have altered them.)


The veneer contains routines to implement "provides", ".", ".&" and ".#",
together with routines to apply ++ and -- as pre- or postfix operators,
and a routine to implement = in the context of individual properties.
These all lapse back into ordinary Z-machine instructions for common
properties if they find a property number in the range 1 to 63.

Note that the veneer's implementation of "provides" contains a special
rule meaning that class-objects do not provide any properties except the
five inherited from the metaclass Object, even though they have a table
in the above format (which contains the inheritable values).


The veneer's highest-level routine sends a message to an individual
property.  (This same routine contains implementations of the eight messages
numbered 64 to 71.)  Note that the trickiest part is keeping "self" and
"sender" right:

    push the current "self" value
    push the current "sender" value
    set "sender" to "self"
    set "self" to the destination object
    now make the function call
    pull to "sender"
    pull to "self"


Note that the object creation/deletion system works in a quite simple way:
the pool of uncreated duplicate members of a class are exactly the children
of its class-object.  Objects are re-initialised when they are returned to
this pool, leaving them "fresh" and ready for recreation.


    9.7   Availability of symbol names at run-time
          ----------------------------------------

It is not possible to determine at compile-time if a message is wrongly
addressed, i.e., if a message is sent to property X of object O but object O
does not provide X.  For one thing, availability depends on who is calling
(if the property is private to O); for another, message identifiers do not
need to be constants.

Therefore a good error-reporting mechanism is needed at run-time; errors like
"Object 32 has no property 57" would make debugging hopelessly difficult.
The only way to achieve this is for the property names to be available at
run time.  Similarly, it's helpful for debugging purposes if attribute
names and action names can also be available.


To this end, Inform compiles a table of identifier names into every
story file (though not into modules) and provides a system constant called
#identifiers_table pointing to the start of the table.  The table contains:

    --> 0      = Number of property names plus 1 = P
    --> 1      = Packed address of string containing name of property 1
    to
    --> (P-1)  = Packed address of string containing name of property P-1
                 (properties 1 to 63 will be common, and subsequent ones
                 individual: this table includes both kinds; note that
                 there is no property numbered 0)

    --> P      = Packed address of string containing name of attribute 0
    to
    --> (P+47) = Packed address of string containing name of attribute 47

    --> (P+48) = Packed address of string containing name of action 0
    and so on, followed by

    #array_names_table-->0 = Packed address of string containing name of
                 array 0
    and so on. (Array numbers are allocated from 0 upwards as arrays are
    declared in the source code: they're used by the veneer only for
    printing array-bound-violation error messages, and only in strict mode.)

In the case of (common) property and attribute names, the ability of
programmers to "alias" names together means that these names will sometimes
be given as a list of possibilities with slashes in between.

Note that printing property names so as to properly handle usage of ::
is harder than simply looking up in this table (for instance, there are
times when the right output should look like "Coin::value").  Inform
provides the construct

    print (property) X;

to handle this.  But the corresponding cases for attributes and actions
are easier.  Simply define:

    [ DebugAction a anames;
      if (a>=256) print "<fake action ", a-256, ">";
      anames = #identifiers_table;
      anames = anames + 2*(anames-->0) + 2*48;
      print (string) anames-->a;
    ];
    [ DebugAttribute a anames;
      if (a<0 || a>=48) print "<invalid attribute ", a, ">";
      else
      {   anames = #identifiers_table; anames = anames + 2*(anames-->0);
          print (string) anames-->a;
      }
    ];

(These are in fact both part of Library 6/3 anyway: a different DebugAction
routine is present in Library 6/1 and 6/2 if DEBUG is set.)  You can then:

    print (DebugAction) X;
    print (DebugAttribute) X;


It is also for error reporting that the names of the classes are needed at
run-time, which is why the short name of a class-object is the name of the
class.


   10   Compiling and linking modules
        -----------------------------

   10.1   Model
          -----

The model for linking is that a game being compiled (called the "external"
program) may Link one or more pre-compiled sections of code called "modules". 
Suppose the game Jekyll has a subsection called Hyde.  Then these two
methods of making Jekyll are (almost) equivalent:

(i)  Putting Include "Hyde"; in the source code for "Jekyll", and
     compiling "Jekyll".

(ii) Compiling "Hyde" with the -M ("module") switch set, putting
     Link "Hyde"; into the same point in the source code for "Jekyll",
     and compiling "Jekyll".

Option (ii) is of course much faster if "Hyde" does not change very often,
since its ready-compiled module can be left lying around while "Jekyll"
is being developed.

Option (ii) also slightly increases story file size, as some code
optimisations are impossible this way: e.g., linking rather than including
the library costs about 160 bytes on final story file size.


In the linking process, two incarnations of Inform need to communicate with
each other:

    A.  one in the middle of compiling a story file
        and wanting to link a module into it;

    B.  one compiling a module for future linking into
        some unknowable story file.

To be more precise, B needs to send information to A.  It does so by writing
a modified form of story file with additional tables attached, and this is
what a module is.

We discuss how Inform copes with situation A first, as this indicates the
information which B needs to send it; we then discuss how B gathers and
sends this information.

Note that A and B can never apply in the same run-through of Inform, since
a module cannot link another module into it.

Finally, note that A and B may be different ports of Inform running on
different machines at different times, since the module format is
machine-independent.


   10.2   Linking a module
          ----------------

There are two tasks: (a) merging the module's data structures into those of
the external story file, and (b) mending all the constants used in these data
structures which couldn't be known at module compilation time.

(a) Merging data structures

This is formidably hard, since it means extracting tables in Z-machine format
and putting them back into Inform's own (usually higher-level) data
structures: in effect, an inverse is needed for each translation operation.
The following table briefly lists the data structures which are merged, and
how this is done:

   Structure                        Method

   Initial global variable values   Directly copy in
   Initial array entry states       Append
   Code area                        Append.  Z-code is relocatable except
                                      that function calls are to absolute
                                      addresses: these are handled as though
                                      they were references to constants
                                      unknown at module compilation time
   Static strings area              Append
   Dictionary                       Extract words from the module, one at
                                      a time, and insert into the story
                                      file's dictionary.  (The dictionary
                                      is just too complex to merge the
                                      internal structures.)
   Object tree                      Append, and also fix all the object
                                      numbers used in parent, sibling and
                                      child fields: a particular concern
                                      is with classes defined in the
                                      module, which have to be added as
                                      children of the main program's
                                      Class object
   Property values area             Append
   Individual property values       Append
   "Flags 2" header bits            Bitwise OR
   Class to object numbers table    Effectively append, moving the object
                                      numbers up appropriately.  The
                                      module's class numbers table contains
                                      offsets of the class inheritance
                                      property blocks as well, and these
                                      are appended too

(b) Mending references

The problem here is to absorb numerous references within the module into
the backpatch table for the story file.  It is not possible just to append
the module's backpatch tables onto the story file's backpatch tables:
all the offsets need adjusting, and certain kinds of backpatching need to
take place immediately (to fix the four marker values not allowable in
story file backpatch tables, ACTION_MV, IDENT_MV, VARIABLE_MV and OBJECT_MV).


   10.3   Imports and exports
          -------------------

The main "extra" ingredients in a module (compared with an ordinary story
file) are: two backpatching tables (see the next section), and an
import/export table giving the names of symbols to be shared with the main
story file which will link in with it.

The language of "imports and exports" considers the module to be the home
country, and the external program to be foreign.  (An exported symbol's value
is assigned in the module and used in the external program; an imported
symbol's value is assigned in the external program and used in the module.)

An export is a symbol defined in the module which may be referred to
in the external program.  All variables, routines and general constants
defined in the module except for (a) system-defined symbols always present
(such as "Class", "true", etc.) and (b) attributes, common properties and
labels are exported.

An import is a symbol name used in compiling the module which isn't defined
within it.  During module compilation, just as in story file compilation, if
an unknown name is hit then an operand is constructed which is marked
SYMBOL_MV and has the symbol index as its value.  In the case of a story
file, at value backpatching time the correct value is found out and written
in.  In the case of a module, there is no value backpatching phase.

Instead, all symbols are flagged with IMPORT_SFLAG if ever used when undefined
(i.e., if any SYMBOL_MV marker is ever issued).  Any which remain undefined
at the end of module compilation (in fact, at the same point where exports
are made) are "imported": that is, their names are written down so that their
existence can be checked on at Link time.  It is an error for any imported
symbol not to exist in the external program performing the Link.

Note, therefore, that backpatch markers in the module with marker value
SYMBOL_MV may refer either to symbols which were assigned values later on
during module compilation (and are thus exported) or to symbols which were
never assigned values (and are thus imported).


   10.4   Backpatch backpatching
          ----------------------

Although modules undergo branch backpatching and optimisation just as story
files do, they do not undergo value backpatching.

Instead, the two value-backpatching tables (zcode_backpatch_table and
zmachine_backpatch_table) are appended to the module file.  The basic idea
is that when the external file merges the module's data structures into its
own, it also adds the module's backpatch markers into its own collection. 
Thus, the module's code and data are finally value-backpatched when the
external program is value backpatched.

Unfortunately, these backpatch markers have addresses which are correct for
the module, but wrong for the external program: they require correction
before they can be transferred, and this is called "backpatch backpatching".

In addition to this, a few special types of backpatch markers (only ever
generated inside modules, never in story files) are dealt with immediately.

A further complication is that the module may have exported a symbol which
itself has a value needing backpatching.  For example, if the module contains

    Constant Swine 'hog';

then, in its symbol table, the symbol Swine will have the marker DWORD_MV
attached to its value.  When this is exported to the external program, the
symbol value will need backpatch-backpatching.


The sequence of events is roughly:

     1.  Load the module into an allocated block of memory.
     2.  Merge the dictionary, finding out which dictionary
         accession numbers in the external program correspond to
         which in the module.
     3.  Go through the import/export table, until we know
         which symbol numbers in the module correspond to which
         symbol numbers in the external program; and which
         variable numbers to which; and which action numbers
         to which; and which property identifiers to which.
         (Arrays called "maps" are created to encode all these.)
     4.  Backpatch, or deal with the backpatch markers attached to,
             any exported symbols from the module.
     5.  Go through the Z-code backpatch table:
             deal with IDENT_MV, ACTION_MV, VARIABLE_MV and OBJECT_MV markers
                 immediately, by backpatching the copy of the module
                 held in memory;
             backpatch the other markers and then transfer them
                 into the external program's Z-code backpatch table.
     6.  Do the same for the Z-machine backpatch table.
     7.  Now merge the memory copy of the module into the external
         program.

(There are actually 17 stages, but most of the latter ones are mergers of
different tables which can happen in any order.)

As an example of "backpatch backpatching", the backpatch marker for the
quantity 'hog' will be DWORD_MV, with value the accession number of the word
"hog" in the module's dictionary.  This accession number needs to be changed
to the accession number of "hog" in the story file's dictionary.

The four "module only" backpatch marker values are:

    VARIABLE_MV             global variable number in module's set
    OBJECT_MV               object number in the module's tree
    IDENT_MV                identifier number in the module's set
    ACTION_MV               action number (always between 0 and 255, though
                              always in a "long" constant so that a fake
                              action number can be backpatched over it)

Note that within the module, there is no way to tell an externally defined
action from a fake action.  That is, the reference ##Duckling might be
to an external action or fake action called "Duckling".  Within the module,
it is created as an action name in the usual way; at Link time, the action
map corresponds this module action number to a fake action number or a real
one as required.

Variable numbers in the module are not marked if they are local (or the
stack pointer), or if they are from the system set ("self", "sender" and so
on), whose numbers are guaranteed to be correct already.  In a module, the
available variable numbers are 16 to N-1, where N is the number of the lowest
system-used variable (at present, 249).  Imported variables are numbered
from 16 upwards, and variables defined newly in the module from N-1
downwards.  If these allocations collide in the middle, then the Z-machine
has run out of variables.


   10.5   How modules differ from story files
          -----------------------------------

Module files are identical to story files, except that they have not
undergone value backpatching, and do not contain any part of the veneer; and
except as detailed below.

(i)  Size
---------
A module is at most 64K in size.

(ii) Header entries (where different from story file header entries)
--------------------------------------------------------------------
    Byte   Contents

    0      64 + V  (where V is the version number)
    1      module version number
    6,7    byte address of the module map table
           (note that this is the "initial PC value" slot in the
           story file header format, but no such value is meaningful
           for modules)

V is used to make sure that we aren't trying to link, e.g., a Version 5
module into a Version 8 story file: this would cause heaps of backpatch
errors, as the packed addresses would all be wrong.  Although we could
in principle automatically fix them all up, it isn't worth the trouble:
the user only needs to compile off suitable versions of the modules.

The module version number is a version number for the module format
being used: this document describes module version 1.
    
(iii) Class numbers table
-------------------------
In a module, the class numbers table contains additional information,
and has the form:

    <word containing object number for Class>
    <word containing class 1 inheritance-properties block offset>
    <word containing object number for class 1>
    <word containing class 1 inheritance-properties block offset>
    <word containing object number for class 2>
    <word containing class 2 inheritance-properties block offset>
    ...
    <word containing object number for class N>
    <word containing class N inheritance-properties block offset>
    00 00
  
(In a story file, the inheritance-properties block offsets are absent.)

(iv) Identifier names table
---------------------------
This is missing altogether in a module.

(v) Dictionary
--------------
A module dictionary is in accession order, not alphabetical order.  (This is
necessary so that backpatch markers in the module, which refer to dictionary
words by accession number, can be connected with their original words at
link time.)

(vi) Module map
---------------
The module map is (currently) 15 words long and contains the following:

    Word 0   byte address of object tree
         1   byte address of common property values table
         2   scaled address of static strings table
         3   byte address of class numbers table
         4   byte address of individual property values table
         5   number of bytes in indiv prop values table
         6   number of symbols in module's symbols table
         7   number of property identifier numbers used in module
         8   number of objects present in module
         9   byte address of import/export table
        10   its size in bytes
        11   byte address of Z-code backpatch table
        12   its size in bytes
        13   byte address of Z-machine image backpatch table
        14   its size in bytes

The map serves as an extension to the header, which has more or less run out
of available slots for new pieces of information.

(vii) Parents of class objects
------------------------------
Objects representing classes are given in the object tree as having
parent $7fff (and having next-object 0, as if each were an only child).
These are to be understood as being children of the Class object (which
is not present in a module).

(viii) Import/export table
--------------------------
This is a sequence of symbols (mixing up imports and exports in no particular
order), occupying records as follows:

    Record type (1 byte)
    Symbol number in module symbols table (2 bytes)
    Symbol type (1 byte)
    Symbol backpatch marker (1 byte)
    Symbol value (2 bytes)
    Symbol name (null terminated)

where the possible record types are:

    IMPORT_MV                   import a symbol
                                (in which case the symbol backpatch marker
                                is omitted)

    EXPORT_MV                   export a symbol defined in a non-system file
    EXPORTSF_MV                 export a symbol defined in a system file
    EXPORTAC_MV                 export an action name (in the form
                                    "Name__A")

Note: we need to distinguish between EXPORT_MV and EXPORTSF_MV in order to
make Replacement of routines work properly.  Suppose the user asks to Replace
the routine DrawStatusLine, normally found in the parser, and then links the
parser module and also a module of his own containing his new definition of
DrawStatusLine.  The linker knows which of these to accept because the first
was compiled from a system module, and thus exported with EXPORTSF_MV, while
the second was not, and was exported with EXPORT_MV. Other than for this
purpose, the two values are treated equivalently.

(ix) Z-code backpatch table
---------------------------
This is exactly the module's Z-code backpatch table, no different in format
from that used for backpatching story files (except that four additional
marker values are permitted).

(x) Z-machine image backpatch table
-----------------------------------
This is exactly the module's Z-machine backpatch table, no different in
format from that used for backpatching story files (except that four
additional marker values are permitted).


   10.6   Restrictions on what modules may contain
          ----------------------------------------

[1] It is impractical to allow the module and the external program use of
    each others' attribute and property names, because there is no rapid way
    of repairing the module's object tables.  Instead, both program and
    module must use the same Include file to make the same attribute and
    property definitions as each other.

[2] An object defined in a module cannot:
    (a) have an externally-defined parent object,
    or (b) inherit from an externally-defined class.

[3] A module can only use externally-defined global variables if they have
    been explicitly "imported" using the Import directive.

[4] A module cannot use Verb or Extend: that is, it cannot define grammar.

[5] A module cannot use Stub or Default (for obvious reasons).

[6] A module cannot use "unknown at compile time" constants in every
    context.  (Unknown in that, e.g., 'frog' and MAX_FROGS might be
    unknown: the former because the dictionary address of 'frog' in the
    final program can't be known yet, the latter (say) because MAX_FROGS
    is a constant defined only in the external program.)  In the
    following list of contexts in which constants occur in Inform source
    code, those marked with a (*) are not permitted to be "unknown at
    compile time":

        1. In high-level source code (including as a switch value and
           a static string in a "box" or "string" statement).
        2. In assembly language source code.
        3. As the initial entries in an array.
        4. As the initial value of a variable.
        5. In a CONSTANT definition.
        6. As an object's (common or individual) property value.
    *   7. As a release number.  (Defining the module release number only.)
    *   8. As a version number.  (Modules are always version 5.)
    *   9. In grammar.  (You can't define grammar in a module.)
    *  10. As the size of an array.
    *  11. In a DEFAULT constant definition.  (You can't use Default.)
    *  12. In an IFTRUE or IFFALSE condition.
    *  13. As a property default value (though this is unnecessary since
           the definitions in the outside program take precedence).
    *  14. As the number of duplicate objects provided for a class.

    Only (10) and (14) are restrictive.  Combining a module with a short
    Include file to make such definitions will get around them.


   11   Service levels
        --------------

   11.1   Variables and arrays
          --------------------

Each of the sections is responsible for the initialisation of its own
variables, and for the allocation and deallocation of its own arrays.  To
this end, every section except "inform.c" (which is considered to be in
charge of the others) provides a set of four routines to manage its
data structures:

    init_*_vars()           called when the compiler is beginning a run
    *_begin_pass()          called at the beginning of the compilation
                            pass through the source code: this usually sets
                            some of the variables
    *_allocate_arrays()     called when the compiler is beginning a run
    *_free_arrays()         called when the compiler is ending its run

Note that *_free_arrays() must exactly undo what *_allocate_arrays() has
done.

The 19 routines of each kind are polled in no particular order (actually,
in alphabetical order).

Note that there are also

    lexer_begin_prepass()
    files_begin_prepass()

which need to happen even earlier than the general *_begin_pass() poll,
and also

    lexer_endpass()
    linker_endpass()

but since the other sections generally don't need to do anything at end
of pass, it wasn't worth making a formal poll of it.

Static initialisation of variables is only allowed when they are certain
never to change (for example, the string containing the accent escape
characters never changes).


   11.2   Memory allocation and deallocation
          ----------------------------------

All allocation and deallocation is routed through routines given in
"memory.c": specifically,

    my_malloc()   and   my_free()

Note that these are called with a string as an extra parameter (as compared
with malloc() and free()), which is used to print helpful error messages in
case of collapse.  This is where the -m ("print the memory allocation") switch
is implemented, and it's also a good place to add any OS-specific code which
may be needed.

This may be needed because, under ANSI rules, malloc() is entitled to refuse
to allocate blocks of memory any larger than 32K minus a few bytes.  On a few
implementations malloc() actually does this, even on very large machines,
because it's convenient to avoid memory segmentation problems.  (For instance,
on a few versions of C for the Macintosh, it's possible to have 16M free
and still be refused a request for 33K of it.)  Consequently it may be
necessary to use some compiler-specific routine instead.

Inform allocates many arrays (50 or so) and in a normal compilation run,
the largest of these occupies about 26K.  However, the implementation of
Link requires the whole of a module file to be held in one contiguous block
of memory, potentially 64K long; and the same applies to the array holding
the unpaged memory area of the Z-machine during construct_storyfile().

Failure of my_malloc() causes a fatal error.


Inform also provides a structure (memory_block) for allocating extensible
regions of memory (at a slight access-speed penalty): these allocate
8K chunks as needed.  Three of these are used as alternatives to the
temporary files when -F0 is operating, and two more are used to hold
backpatch markers.


   11.3   Error messages
          --------------

All diagnostic and error messages are routed through the "errors.c" section
(except for errors in parsing ICL, as this takes place before the compiler
itself is active).

There are three kinds of diagnostic: warning, error and fatal error.
Fatal errors cause Inform to halt with an exit(1), which is about as fatal
as anything can be to a C program.  (Fatal errors are mainly issued when the
memory or file-handling environment seems to have gone wrong.)  Errors
allow compilation to carry on, but suppress the production of any output
at the end: in some ways this is a pity, but there are not many errors which
one can be confident of safely recovering from.

Note that a sequence of MAX_ERRORS (normally 100) errors causes a fatal error.

Error messages which are generated during the compilation pass try to quote
the source code line apparently responsible (unless the "concise" switch
is set): but Inform may be unable to do this if the text was read and lost
too long ago, for example in the case of a "declared but not used" warning.

Note that informational messages (such as the banner line, the statistics
printout, etc.) are not routed through "errors.c", or indeed anywhere else.

Error recovery in Inform is mainly a matter of going into "panic mode"
(amusingly, this is a term of art in compiler theory): tearing through the
tokens until a semicolon is reached, then starting afresh with a new
statement or directive.  This pretty often goes wrong (especially if a
routine start/end is not noticed), and more work might be needed here.


   11.4   File input/output
          -----------------

The "files.c" section handles almost all of the file input/output in Inform:
the two exceptions are in cli_interpret() in "inform.c", which loads in ICL
files; and in link_module() in "linker.c", which loads in modules (a simple
job which there was no point abstracting).

The biggest shake-up to this section since Inform 5 was the realisation that,
far from being a grubby low-level activity, working out good filenames was
in fact a task which had to be done much higher up in Inform: so all of
the code which used to do this is now located in the ICL interpreter in
"inform.c".

What remains in "files.c" is:

    opening source code and reading it into buffers;
    opening and closing the temporary files;
    calculating the checksum and length of, and then outputting, the
        story file or module being compiled;
    opening, writing to and closing the text transcript file;
    opening, writing to and closing the debugging information file.

For each different source file that has ever been open, "files.c" maintains
a structure called FileId, containing the full filename and a file handle.
The files are never read from except via the routine file_load_chars(),
which fills up the lexical analyser's buffers.

The routine output_file() writes the story/module file: see its comments
for details.  It takes up where construct_storyfile() leaves off, and
contributes the last two pieces of header data to be worked out, the
checksum and the length fields.  Note that the checksum calculation is
only known after the whole file has been written (mainly because Z-code
backpatching alters the checksum, but can't be done until output time without
consuming a good deal of extra memory); so that an "fseek" is made to skip
the write position back into the header and overwrite the correct checksum
just before closing the file.  This call is made in a strictly ANSI way
but, like everything on the fringes of the C library-to-operating-system
interface, may cause trouble on some compilers.

The other routines are largely self-explanatory.

There are three temporary files (although they won't be used if -F0 applies),
as follows:

    Temporary file         Contents                   

         1                 The static strings area    
         2                 The Z-code area            
         3                 The link data area         (only opened and used
                                                      in -M, module mode)

For the format of the debugging information file, see the utility "infact.c",
which prints from it.  This format is likely to change in the near future
in any case, as it is already inadequate for some of the new features in
Inform 6.


   12   Low-level language features
        ---------------------------

   12.1   Using the "Trace" directive
          ---------------------------

The trace directive is primarily intended for compiler maintenance purposes.
It can cause Inform to print tracing information on nine different aspects
of what it does, in most cases at several levels of detail.

    -------------------------------------------------------------------------
    Tracing option        Level        Output
    -------------------------------------------------------------------------
    "assembly"            1            all assembly language produced, in
                                           an imitation of "@..." syntax 
                          2            and also the actual bytes produced,
                                           in hexadecimal
                                       (Note that because trace printing
                                       occurs before label optimisation,
                                       addresses of instructions cannot be
                                       relied on: nor can the values of
                                       operands which are marked for later
                                       backpatching.)

    "tokens"              1            lexemes of all tokens output from the
                                           lexer to the syntax analyser, and
                                           indication when a token is put
                                           back
                          2            full token descriptions
                          3            and also the lexical context in which
                                           they were identified

    "expressions"         1            annotated parse trees for all
                                           expressions being code-generated
                          2            and the behaviour of the shift-reduce
                                           parser when parsing it, and
                                           the parse tree received by the
                                           code generator (not yet annotated)
                          3            and the token-to-etoken translations
                                           made, and the parse tree produced
                                           by the emitter before lvalue
                                           checking

    "linker" used in      1            list all import/export information sent
    compiling module      2            and all marker information sent

    "linker" used in      1            list all modules linked in
    compiling story file  2            and all import/export information
                                           received
                          3            and all marker information received
                          4            and how each marker was dealt with

    "lines"               ---          (currently inoperable)
    -------------------------------------------------------------------------
    "dictionary"          1            print current dictionary contents
                                           (including verb/preposition nos.)
                                           in alphabetical order

    "objects"             1            print current object tree

    "verbs"               1            print current grammar

    "symbols"             1            print out those entries in the symbols
                                           table which are not: unknown,
                                           generated by the veneer or in a
                                           system file
                          2            print entire symbols table
    -------------------------------------------------------------------------


   12.2   System constants and other secret syntax
          ----------------------------------------

The addresses of many important tables in the Z-machine are not recorded
in the header, or anywhere else: but they are known to the compiler, and
needed by the run-time code.  The system constants are provided mainly
as a way of passing this information into run-time code, usually code
within the veneer.

   Constant            Evaluates to
   --------------------------------------------------------------------------
   #version_number     Version number of Z-machine format being compiled to
   #dict_par1          Byte offset in a dictionary table entry of the
                           the first ("flags") parameter byte
   #dict_par2          And the second ("verb") byte
   #dict_par3          And the third ("adjective") byte
                           (note that these three depend only on the version
                            number)

   #largest_object     The largest object number constructed, plus 256
                           (the "+256" is due to a quirk in the implementation
                           used by Inform 3 or so; since this constant is not
                           a very public feature, I don't mind leaving it in)
   #actual_largest_object    Ditto, but without the "+256"
   #adjectives_table   Byte addresses of adjectives table,
   #actions_table                        actions table,
   #preactions_table                     "preactions table",
   #classes_table                        class number to object number table,
   #identifiers_table                    property ID names table
   #array_names_offset                   array names table

   #readable_memory_offset   The byte address of the first byte which isn't
                       accessible using readb and readw: i.e., the first
                       byte of the first Z-code routine

   #code_offset        Packed address of Z-code area
   #strings_offset     Packed address of static strings area

   #array__start       Start of array space (byte address)
   #array__end         End of array space + 1
   #cpv__start         Start of common property values space (byte address)
   #cpv__end           End + 1
   #ipv__start         Start of individual property values space (byte addr.)
   #ipv__end           End + 1

   --------------------------------------------------------------------------

Two more secret syntaxes were introduced in Inform 6.10.  The first changes
the behaviour of the "Include" directive:

   Include "Language__";

(and no other string) includes the current language definition file, whose
name is an ICL variable.

The second controls which grammar table format is generated: normally GV1,
but this can be set to GV2 by

   Constant Grammar__Version = 2;

The "Grammar__Version" symbol is redefinable; if no such Constant directive
is made, then it will have the value 1.  It needs to be changed to its
final value before the first Verb, Extend or Fake_Action directive is
reached.


   12.3   The "Zcharacter" directive
          --------------------------

Finally, the "Zcharacter" directive is provided mostly for the benefit
of language definition files, for configuring Inform games to use a
non-English alphabet or character set.  (See the Inform Translator's
Manual.)  Different forms of "Zcharacter" allow both the Z-machine
alphabet table and the Unicode translation table to be specified.


(i) The Z-machine alphabet

The Z-machine's text encryption system is optimised to make it especially
cheap on memory to use letters in alphabet 1, then cheapish to use letters
in alphabets 2 and 3 but rather expensive to use letters which aren't in
any of the three.  We aren't much concerned about lack of memory in the
game as a whole, but the economy is very useful in dictionary words,
because dictionary words are only stored to a "resolution" of nine
Z-characters.  Thus, in a dictionary word:

   something from alphabet 1         costs 1 Z-character
                           2 or 3          2 Z-characters
             outside the alphabets   costs 4 Z-characters

The standard arrangement of these alphabets (A1 lower case a to z, A2 upper
case A to Z, A3 numerals and punctuation marks) includes no accented
characters.  In a language with frequent accented or non-English letters,
such as Finnish or French, this means that 4 of the 9 Z-characters in
a dictionary word may be wasted on just one letter.  For instance,

   't@'el@'ecarte'    is stored as   't@'el'
   't@'el@'ephone'    is stored as   't@'el'

(there are not even enough of the 9 Z-characters left to encode the second
e-acute, let alone the "c" or the "p" which would distinguish the two words).
On the other hand if e-acute could be moved into Alphabet 3, say in place of
one of the punctuation marks which is never needed in dictionary words,
the two e-acutes would take just 2 Z-characters each and then

   't@'el@'ecarte'    would be stored as   't@'el@'ecar'
   't@'el@'ephone'    would be stored as   't@'el@'epho'

which is far more acceptable.

The Z-machine has a mechanism (at least in Version 5 or better) for changing
the standard alphabet tables, invented to make the German translation of
Infocom's "Zork I" work.  The "Trace dictionary" will print the current
contents of the alphabet table (as well as the dictionary contents).


(i).1  Moving a single character in

There are two ways to change the standard English alphabet table.
One way, which is probably good enough for a language definition file
for a mostly Latin language (where only up to around 10 accented or
non-English letters are commonly needed) is to move characters into
the least-used positions in Alphabet 2.  For this, use the directive:

   Zcharacter <char>;

It will only be possible if there's a letter in A2 which hasn't yet
been used (otherwise, changing that entry in A2 will make some of the
text already compiled wrong).  The directive is thus only practicable
early in compilation, such as at the start of the library definition file.

For instance the code

   Trace dictionary;
   Zcharacter '@'e';
   Zcharacter '@`a';
   Zcharacter '@^a';
   Trace dictionary;

might produce output including...

Z-machine alphabet entries:
 a  b  c (d) e (f) g (h) i  j  k  l  m  n  o (p)(q) r  s  t  u  v (w)(x)(y)(z)
 A (B) C (D) E (F)(G) H  I (J)(K) L (M)(N) O (P)(Q) R  S (T)(U)(V)(W)(X)(Y)(Z)
( ) ^  0  1 (2) 3 (4)(5) 6 (7)(8) 9 (.) , (!)(?)(_)(#)(')(~) / (\) - (:)(()())

Z-machine alphabet entries:
 a  b  c (d) e (f) g (h) i  j  k  l  m  n  o (p)(q) r  s  t  u  v (w)(x)(y)(z)
 A (B) C (D) E (F)(G) H  I (J)(K) L (M)(N) O (P)(Q) R  S (T)(U)(V)(W)(X)(Y)(Z)
( ) ^  0  1 @'e 3 @`a@^a 6 (7)(8) 9 (.) , (!)(?)(_)(#)(')(~) / (\) - (:)(()())

...in which note that bracketed letters are ones which have not been encoded
yet.  The three Zcharacter directives have inserted e-acute, a-grave and
a-circumflex into the positions previously occupied by the numerals 2, 4 and
5.  It is reasonable to make up to about 10 such insertions, after which any
further attempts will only be successful if the game being compiled doesn't
(let us say) have a title like "123456789: An Interactive Lesson In Counting",
which would have used 9 of the numerals and forced them to stay in the final
alphabet table.


(i).2  Changing the entire alphabet

This has to be done very early in compilation, before any strings are
translated, so that it can't be done by a language definition file.
One might put such directives into a file called "Alphabet.inf" and
then begin the main game with

   Include "Alphabet";

to achieve this.

The form required is to give three strings after "Zcharacter", containing
26, 26 and 23 characters respectively.  For instance:

   Zcharacter "abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "0123456789!$&*():;.,<>@{386}";

(Note that "@{386}" specifies only one character: Unicode $0386.)  Space,
new-line and quotation marks " are automatically included, while
~, @ and ^ have special meanings in Inform and should not be used.
Otherwise, any arrangement of characters is fine, except that every
character used has to be either a normal ASCII character or part of
the "extra characters" (already declared) in the ZSCII set.


(ii)  Defining the "extra characters"

Inform normally makes up a block of "extra characters" based on the
source code it reads: if it reads plain ASCII or ISO Latin1 (-C0 or
-C1) then the block contains the usual European accents, such as
e-acute or i-circumflex, as defined in Z-machine Standard 0.2.  (And
if this table is never changed, Inform doesn't then compile the
table at all, as this is the default Z-machine arrangement.)

More generally if Inform reads ISO 8859-n (-Cn) then the block is set
up to contain all the non-ASCII letter characters in ISO 8859-n.

There's room to spare for others to be added, and

   Zcharacter table + '@{386}';

would add Unicode character $0386 (Greek capital Alpha with tonos
accent, as it happens) to the current stock of "extra characters".

Alternatively, you can simply give a fresh stock altogether:

   Zcharacter table '@{9a}' '@{386}' '@^a';

would specify a stock of just three, for instance.

These directives must be made before the characters in question are
first used in game text.


(iii)  Defining terminating characters

It's also possible to specify which ZSCII character codes are
"terminating characters", meaning that they terminate a line of input.
Normally, the return key is the only terminating character, but this
can be added to.  For instance, the following directive makes
ZSCII 132 and 136 terminating:

   Zcharacter terminating 132 136;

The legal values to include are those for the cursor, function
and keypad keys, plus mouse and menu clicks. The special value 255
makes all of these characters terminating.


   12.4   Sequence points
          ---------------

Inform marks certain positions in the code it compiles as being
"sequence points".  The idea is that the code can be regarded as a
sequence of chunks, and the sequence points mark where these chunks
begin.  Roughly speaking, each different statement in Inform source code
compiles to a different chunk, so that statements correspond closely to
sequence points.  Sequence points are marked in assembly trace output
using the notation "<*>".  For instance, the source code

    [ WorkOutSquares counter;
      counter = 0;
      while (counter < 100)
      {   squares-->counter = counter*counter;
          counter = counter + 1;
      }
    ];

produces the traced output:

    6  +00008  [ WorkOutSquares counter 

    7  +00009 <*> store        counter short_0 
    8  +0000c    .L0
    8  +0000c <*> jl           counter short_100 to L1 if FALSE
    9  +00011 <*> mul          counter counter -> sp 
    9  +00015     storew       long_480 counter sp 
   10  +0001b <*> add          counter short_1 -> counter 
   11  +0001f     jump         L0 
   11  +00022    .L1
   12  +00022 <*> rtrue        

The "<*>" in front of an instruction means "the position where this
instruction begins is a sequence point".  We could mark the five
positions in the original source code as:

    [ WorkOutSquares counter;
      <*> counter = 0;
      <*> while (counter < 100)
      {   <*> squares-->counter = counter*counter;
          <*> counter = counter + 1;
      }
    <*> ];

Note that the open and close braces and square brackets don't normally
cause sequence points.  The exact rule is that every statement,
action < > command, assignment or expression in void context is at
a sequence point, except as shown in the examples below:

    for (<*> i=0: <*> i<45: <*> i++) ...

"for" loops contain 0 to 3 sequence points, depending on whether there's
any code compiled in the three parts of the specification.  For instance

    for (::) <*> print "Madness!";

contains no sequence point corresponding to the "for" specification.

    <*> objectloop (<*> O ofclass Coin) <*> print (name) O;

"objectloop" normally generates two sequence points: at the start,
where the variable is initialised, and then where it's tested.  However,
loops over the contents of particular objects work differently:

    <*> objectloop (O in Mailbox) <*> print (name) O;

(Because the test "O in Mailbox" is not actually being performed at
run-time: instead, O is looping through the tree.)

    do <*> print counter++, " "; <*> until (counter < 17);

Here the sequence point generated by the loop itself is attached to
the "until" clause, not the "do" clause, because that's where the
test is performed.


"switch", "while" and "if" statements are not exceptions to the
usual rule (1 statement = 1 sequence point at the beginning), but it
might be useful to give some examples anyway:

    <*> switch(counter)
    {   1: <*> print "One^";
        2, 3: <*> print "Two or three^";
        default: <*> print "Neither^";
    }

    <*> if (i == 17) <*> print "i is 17"; else <*> print "i isn't 17";

    <*> while (i<100) <*> print i++;


The following is true:
       Except possibly in code explicitly assembled using the "@"
           notation, at each sequence point the Z-machine stack
           is empty and no important information is held in the
           global variables reserved by Inform as "registers":
           thus, it's safe for a debugger to switch execution from
           any sequence point in a routine to any other.
       No two sequence points can be at the same position in either
           the source code or the compiled code.
       Every sequence point corresponds to a definite position in the
           source code (because the veneer, i.e. the code compiled
           from within Inform itself, contains no sequence points).

But the following is _not_ true:
       Sequence points occur in the same order in the source code
           as they do in compiled code
       Every routine contains at least one sequence point
           (a very few "stub" routines are excluded)

Inform uses sequence points only to generate debugging information
files, and to annotate assembly tracing output.  They do not affect the
code compiled.


   12.5   Format of debugging information files
          -------------------------------------

This is a provisional specification of a format which will probably
change slightly in future releases.  Support for the old -k option has
been re-introduced in Inform 6.12 to assist development of Infix, the
projected source-level debugger for Inform.  (See the minor utility
program "infact", updated to 6.12 format, which prints out the contents
of a debugging information file in legible form.)

A debugging information file begins with a six-byte header:

  0,1  the bytes $DE and then $BF (DEBF = "Debugging File")
  2,3  a word giving the version number of the format used (currently 0)
  4,5  a word giving the current Inform version number, in its
           traditional decimal form: e.g. 1612 means "6.12"

The remainder of the file consists of a sequence of records, terminated
by an end-of-file record.  These records may be in _any_ order unless
otherwise noted.  Each record begins with an identifying byte, for which
constants looking like *_DBR are defined in Inform's source code.

A "string" is a null-terminated string of ASCII chars.
A "word" is a 16-bit unsigned number, high byte first.
A "line" is a sequence of four bytes: the first is the file number,
   the next two are a line number (a word), and the last is a character
   number within that line.
   In all three cases -- file numbers, line numbers, character numbers --
   counting begins at 1.  The line reference 0:0:0 is however used to
   mean "no such line": for instance, the metaclass "Routine" is defined
   at line 0:0:0, because it's defined by the compiler, not in any
   source code.
   Character positions greater than 255 in any line are recorded simply
   as 255.
An "address" is a 24-bit unsigned number, a sequence of three bytes
   (high byte, middle byte, low byte).  All addresses are counted in
   bytes (rather than being Z-machine packed addresses).

   EOF_DBR   (byte: 0)
   End of the debugging file.

   FILE_DBR   (byte: 1)
   <file number>         1 byte, counting from 1
   <include name>        string
   <actual file name>    string
   One of these records always appears before any reference to the
   source code file in question.

   CLASS_DBR   (byte: 2)
   <name>                string
   <defn start>          line
   <defn end>            line

   OBJECT_DBR   (byte: 3)
   <number>              word
   <name>                string
   <defn start>          line
   <defn end>            line

   GLOBAL_DBR   (byte: 4)
   <number>              byte
   <name>                string

   ARRAY_DBR   (byte: 12)
   <byte address>        word
   <name>                string
   The byte address is an offset within the "array space" area, which
   always begins with the 480 bytes storing the values of the global
   variables.

   ATTR_DBR   (byte: 5)
   <number>              word
   <name>                string

   PROP_DBR   (byte: 6)
   <number>              word
   <name>                string

   FAKE_ACTION_DBR   (byte: 7)
   <number>              word
   <name>                string
   Note that the numbering of fake actions differs in Grammar Versions
   1 and 2.

   ACTION_DBR   (byte: 8)
   <number>              word
   <name>                string

   HEADER_DBR   (byte: 9)
   <the header>          64 bytes
   This is provided in order to check that a debugging information file
   (probably) does match a given story file.

   ROUTINE_DBR   (byte: 11)
   <routine number>      word
   <defn start>          line
   <PC start>            address
   <name>                string
   then for each local variable:
   <local name>          string
   terminated by a zero byte.
   Note that the PC start address is in bytes, relative to the start of
   the story file's code area.  Routines are numbered upward from 0,
   and in each case the ROUTINE_DBR, LINEREF_DBR and ROUTINE_END_DBR
   records occur in order.

   LINEREF_DBR   (byte: 10)
   <routine number>              word
   <number of sequence points>   word
   and then, for each sequence point:
   <source code position>        line
   <PC offset>                   word
   The PC offset for each sequence point is in bytes, from the start of the
   routine.  (Note that the initial byte of the routine, giving the number
   of local variables for that routine, is at PC offset 0: thus the actual
   code begins at PC offset 1.)  It is possible for a routine to have no
   sequence points (as in the veneer, or in the case of code reading simply
   "[; ];").

   ROUTINE_END_DBR   (byte: 14)
   <routine number>      word
   <defn end>            line
   <next PC value>       address

   MAP_DBR   (byte: 13)
   A sequence of records consisting of:
   <name of structure>   string
   <location>            address
   terminated by a zero byte.
   The current names of structures consist of:
       "abbreviations table"
       "header extension"
       "alphabets table"
       "Unicode table"
       "property defaults"
       "object tree"
       "common properties"
       "class numbers"
       "individual properties"
       "global variables"
       "array space"
       "grammar table"
       "actions table"
       "parsing routines"
       "adjectives table"
       "dictionary"
       "code area"
       "strings area"
   Other names made be added later, and some of the above won't be
   present in all files ("Unicode table", for instance).  Locations are
   byte addresses inside the story file.

LINEREF_DBR records will probably be compressed in future releases.


   12.6   Notes on how to syntax-colour Inform source code
          ------------------------------------------------

"Syntax colouring" is an automatic process which some text editors apply
to the text being edited: the characters are displayed just as they are,
but with artificial colours added according to what the text editor thinks
they mean.  The editor is in the position of someone going through a book
colouring all the verbs in red and all the nouns in green: it can only do
so if it understands how to tell a verb or a noun from other words.
Many good text editors have been programmed to syntax colour for languages
such as C, and a few will allow users to reprogram them to other languages.

One such is the popular Acorn RISC OS text editor "Zap", for which the
author has written an extension mode called "ZapInform".  ZapInform
contributes colouring rules for the Inform language and this section
documents its algorithm, which has since been successfully adapted by
Paul Gilbert's "PIDE" environment and John Wood's C++ code for Inform
syntax styling, both running under Windows 95/NT.  (My thanks to John
for making two corrections to the previously-published algorithm.)

(a)  State values

ZapInform associates a 32-bit number called the "state" with every
character position.

The "state" is as follows.  11 of the upper 16 bits hold flags, the
rest being unused:

   32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17
                                                comment
                                             single-quoted text
                                          double-quoted text
                                       statement
                                    after marker
                                 highlight flag
                              highlight all flag
                           colour backtrack
                        after-restart-flag
                     wait-direct (waiting for a directive)
                  dont-know-flag

These flags make up the "outer state" while the lower 16 bits holds
a number pompously called the "inner state":

             0    after WS (WS = white space or start of line or comma)
             1    after WS then "-"
             2    after WS then "-" and ">" [terminal]
             3    after WS then "*" [terminal]

          0xFF    after junk
   0x100*N + S    after WS then an Inform identifier N+1 characters long
                  itself in state S:
                  101 w    202 wi   303 wit   404 with
                  111 h    212 ha   313 has
                  121 c    222 cl   323 cla   424 clas   525 class
 same + 0x8000    when complete [terminal]
   
In practice it would be madness to try to actually store the state
of every character position in memory (it would occupy four times as
much space as the file itself).  Instead, ZapInform caches just one
state value, the one most recently calculated, and uses a process
called "scanning" to determine new states.  That is, given that we
know the state at character X and want to know the state at character
Y, we can find out by scanning each character between X and Y,
altering the state according to each one.

It might possible save some time to cache more state values than
this (say, the state values at the start of every screen-visible
line of text, or some such) but the complexity of doing this doesn't
seem worthwhile on my implementation.  Scanning is a quick process
because the Zap text editor stores the entire file in almost contiguous
memory, easy to run through, and the state value can be kept in a
single CPU register while this is done.


(b)  Scanning text

Let us number the characters in a file 1, 2, 3, ...

The state before character 1 is always 0x02000000: that is, inner
state zero and outer state with only the waiting-for-directive flag set.
(One can think of this as the state of an imaginary "character 0".)
The state at character N+1 is then a function of the state at
character N and what character is actually there.  Thus,

       State(0) = 0x02000000

and for all N >= 0,

       State(N+1) = Scanning_function(State(N), Character(N+1))

And here is what the scanning function does:

    1.  Is the comment bit set?
           Is the character a new-line?
              If so, clear the comment bit.
              Stop.

    2.  Is the double-quote bit set?
           Is the character a double-quote?
              If so, clear the double-quote bit.
              Stop.

    3.  Is the single-quote bit set?
           Is the character a single-quote?
              If so, clear the single-quote bit.
              Stop.

    4.  Is the character a single quote?
           If so, set the single-quote bit and stop.

    5.  Is the character a double quote?
           If so, set the double-quote bit and stop.

    6.  Is the character an exclamation mark?
           If so, set the comment bit and stop.

    7.  Is the statement bit set?
           If so:
              Is the character "]"?
                 If so:
                    Clear the statement bit.
                    Stop.

              If the after-restart bit is clear, stop.

              Run the inner finite state machine.

              If it results in a keyword terminal (that is, a terminal
              which has inner state 0x100 or above):
                 Set colour-backtrack (and record the backtrack colour
                 as "function" colour).
                 Clear after-restart.

              Stop.

           If not:
              Is the character "["?
                 If so:
                    Set the statement bit.
                    If the after-marker bit is clear, set after-restart.
                    Stop.

              Run the inner finite state machine.

              If it results in a terminal:
                 Is the inner state 2 [after "->"] or 3 [after "*"]?
                    If so:
                       Set after-marker.
                       Set colour-backtrack (and record the backtrack
                       colour as "directive" colour).
                       Zero the inner state.
                 [If not, the terminal must be from a keyword.]
                 Is the inner state 0x404 [after "with"]?
                    If so:
                       Set colour-backtrack (and record the backtrack
                       colour as "directive" colour).
                       Set after-marker.
                       Set highlight.
                       Clear highlight-all.
                 Is the inner state 0x313 ["has"] or 0x525 ["class"]?
                    If so:
                       Set colour-backtrack (and record the backtrack
                       colour as "directive" colour).
                       Set after-marker.
                       Clear highlight.
                       Set highlight-all.
                 If the inner state isn't one of these: [so that recent
                 text has formed some alphanumeric token which might or
                 might not be a reserved word of some kind]
                    If waiting-for-directive is set:
                          Set colour-backtrack (and record the backtrack
                          colour as "directive" colour)
                          Clear waiting-for-directive.
                    If not, but highlight-all is set:
                          Set colour-backtrack (and record the backtrack
                          colour as "property" colour)
                    If not, but highlight is set:
                          Clear highlight.
                          Set colour-backtrack (and record the backtrack
                          colour as "property" colour).

                 Is the character ";"?
                    If so:
                       Set wait-direct.
                       Clear after-marker.
                       Clear after-restart.
                       Clear highlight.
                       Clear highlight-all.
                 Is the character ","?
                    If so:
                       Set after-marker.
                       Set highlight.

              Stop.

The "inner finite state machine" adjusts only the inner state, and
always preserves the outer state.  It not only changes an old inner
state to a new inner state, but sometimes returns a "terminal" flag
to signal that something interesting has been found.

          State      Condition      Go to state     Return terminal-flag?
          0          if "-"         1
                     if "*"         3               yes
                     if space, "#",
                        newline     0
                     if "_"         0x100
                     if "w"         0x101
                     if "h"         0x111
                     if "c"         0x121
                     other letters  0x100
                     otherwise      0xFF
          1          if ">"         2               yes
                     otherwise      0xFF
          2          always         0
          3          always         0
          0xFF       if space,
                        newline     0
                     otherwise      0xFF         

          all 0x100+ states:
                     if not alphanumeric, add
                        0x8000 to the state         yes
          then for the following states:
          0x101      if "i"         0x202
                     otherwise      0x200
          0x202      if "t"         0x303
                     otherwise      0x300
          0x303      if "h"         0x404
                     otherwise      0x400
          0x111      if "a"         0x212
                     otherwise      0x200
          0x212      if "s"         0x313
                     otherwise      0x300
          0x121      if "l"         0x222
                     otherwise      0x200
          0x222      if "a"         0x323
                     otherwise      0x300
          0x323      if "s"         0x424
                     otherwise      0x400
          0x424      if "s"         0x525
                     otherwise      0x500
          but for all other 0x100+ states:
                     if alphanumeric, add
                        0x100 to the state

          0x8000+    always         0

(Note that if your text editor stores tabs as characters in their own
right (usually 0x09) rather than rows of spaces, tab should be included
with space and newline in the above.)

Briefly, the finite state machine can be left running until it returns
a terminal, which means it has found "->", "*" or a completed Inform
identifier: and it detects "with", "has" and "class" as special keywords
amongst these identifiers.


(c)  Initial colouring

ZapInform colours one line of visible text at a time.  For instance, it
might be faced with this:

     Object -> bottle "~Heinz~ bottle"

And it outputs an array of colours for each character position in the
line, which the text editor can then use in actually displaying the text.

It works out the state before the first character of the line (the "O"),
then scans through the line.  For each character, it determines the
initial colour as a function of the state at that character:

  If single-quote or double-quote is set, then quoted text colour.
  If comment is set, then comment colour.
  If statement is set:
     Use code colour
        unless the character is "[" or "]", in which case use
           function colour,
        or is a single or double quote, in which case use quoted text
           colour.
  If not:
     Use foreground colour
        unless the character is "," or ";" or "*" or ">", in which
           case use directive colour,
        or the character is "[" or "]", in which case use
           function colour,
        or is a single or double quote, in which case use quoted text
           colour.
  
However, the scanning algorithm sometimes signals that a block of
text must be "backtracked" through and recoloured.  For instance,
this happens if the white space after the sequence "c", "l", "a",
"s" and "s" is detected when in a context where the keyword "class"
is legal.  The scanning algorithm does this by setting the "colour
backtrack" bit in the outer state.  Note that the number of characters
we need to recolour backwards from the current position has been
recorded in bits 9 to 16 of the inner state (which has been counting
up lengths of identifiers), while the scanning algorithm has also
recorded the colour to be used.  For instance, in

     Object -> bottle "~Heinz~ bottle"
           ^  ^      ^

backtracks of size 6, 2 and 6 are called for at the three marked
spaces.  Note that a backtrack never crosses a new-line.

ZapInform uses the following chart of colours:

    name                   default actual colour

    foreground             navy blue
    quoted text            grey
    comment                light green
    directive              black
    property               red
    function               red
    code                   navy blue
    codealpha              dark green
    assembly               gold
    escape character       red

but note that at this stage, we've only used the following:

    function colour        [ and ] as function brackets, plus function names
    comment colour         comments
    directive colour       initial directive keywords, plus "*",
                           "->", "with", "has" and "class" when used
                           in a directive context
    quoted text colour     singly- or doubly-quoted text
    foreground colour      code in directives
    code colour            code in statements
    property colour        property, attribute and class names when
                           used within "with", "has" and "class"

For instance,

     Object -> bottle "~Heinz~ bottle"

would give us the array

     DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQ

(F being foreground colour; it doesn't really matter what colour
values the spaces have).


(d)  Colour refinement


The next operation is "colour refinement", which includes a number
of things.

Firstly, any characters with colour Q (quoted-text) which have special
meanings are given "escape-character colour" instead.  This applies
to "~", "^", "\" and "@" followed by (possibly) another "@" and a
number of digits.

Next we look for identifiers.  An identifier for these purposes includes
a number, for it is just a sequence of:

     "_" or "$" or "#" or "0" to "9" or "a" to "z" or "A" to "Z".

The initial colouring of an identifier tells us its context.  We're
only interested in those in foreground colour (these must be used
in the body of a directive) or code colour (used in statements).

If an identifier is in code colour, then:

    If it follows an "@", recolour the "@" and the identifier in
       assembly-language colour.
    Otherwise, unless it is one of the following:

      "box"  "break"  "child"  "children"  "continue"  "default"
      "do"  "elder"  "eldest"  "else"  "false"  "font"  "for"  "give"
      "has"  "hasnt"  "if"  "in"  "indirect"  "inversion"  "jump"
      "metaclass"  "move"  "new_line"  "nothing"  "notin"  "objectloop"
      "ofclass"  "or"  "parent"  "print"  "print_ret"  "provides"  "quit"
      "random"  "read"  "remove"  "restore"  "return"  "rfalse"  "rtrue"
      "save"  "sibling"  "spaces"  "string"  "style"  "switch"  "to"
      "true"  "until"  "while"  "younger"  "youngest"

    we recolour the identifier to "codealpha colour".

On the other hand, if an identifier is in foreground colour, then we
check it to see if it's one of the following interesting keywords:

      "first"  "last"  "meta"  "only"  "private"  "replace"  "reverse"
      "string"  "table"

If it is, we recolour it in directive colour.

Thus, after colour refinement we arrive at the final colour scheme:


    function colour        [ and ] as function brackets, plus function names
    comment colour         comments
    quoted text colour     singly- or doubly-quoted text
    directive colour       initial directive keywords, plus "*",
                              "->", "with", "has" and "class" when used
                              in a directive context, plus any of the
                              reserved directive keywords listed above
    property colour        property, attribute and class names when
                              used within "with", "has" and "class"
    foreground colour      everything else in directives
    code colour            operators, numerals, brackets and statement
                              keywords such as "if" or "else" occurring
                              inside routines
    codealpha colour       variable and constant names occurring inside
                              routines
    assembly colour        @ plus assembly language opcodes
    escape char colour     special or escape characters in quoted text


(e)  An example

Consider the following example stretch of code (which is not meant to
be functional or interesting, just colourful):

   ! Here's the bottle:
   
   Object -> bottle "bottle marked ~DRINK ME~"
     with name "bottle" "jar" "flask",
          initial "There is an empty bottle here.",
          before
          [; LetGo:                      ! For dealing with water
                if (noun in bottle)
                    "You're holding that already (in the bottle).";
          ],
     has  container;
   
   [ ReadableSpell i j k;
     if (scope_stage==1)
     {   if (action_to_be==##Examine) rfalse;
         rtrue;
     }
     @set_cursor 1 1;
   ];
   
   Extend "examine" first
                   * scope=ReadableSpell            -> Examine;

Here are the initial colourings:

   ! Here's the bottle:
   CCCCCCCCCCCCCCCCCCCC
   
   Object -> bottle "bottle marked ~DRINK ME~"
   DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQQQQQQQQQQQ
     with name "bottle" "jar" "flask",
   FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
          initial "There is an empty bottle here.",
   FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
          before
   FFFFFFFPPPPPP
          [; LetGo:                      ! For dealing with water
   FFFFFFFfSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
                if (noun in bottle)
   SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
                    "You're holding that already (in the bottle).";
   SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
          ],
   SSSSSSSfD
     has  container;
   FFDDDDDPPPPPPPPPD
   
   [ ReadableSpell i j k;
   fffffffffffffffSSSSSSS
     if (scope_stage==1)
   SSSSSSSSSSSSSSSSSSSSS
     {   if (action_to_be==##Examine) rfalse;
   SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
         rtrue;
   SSSSSSSSSSSS
     }
   SSS
     @set_cursor 1 1;
   SSSSSSSSSSSSSSSSSS
   ];
   fD
   
   Extend "examine" first
   DDDDDDDQQQQQQQQQFFFFFF
                   * scope=ReadableSpell            -> Examine;
   FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD

(Here F=foreground, D=directive, f=function, S=code (S for
"statement"), C=comment, P=property, Q=quoted text.)  And here is
the refinement:

   ! Here's the bottle:
   CCCCCCCCCCCCCCCCCCCC
   
   Object -> bottle "bottle marked ~DRINK ME~"
   DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQEQQQQQQQQEQ
     with name "bottle" "jar" "flask",
   FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
          initial "There is an empty bottle here.",
   FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
          before
   FFFFFFFPPPPPP
          [; LetGo:                      ! For dealing with water
   FFFFFFFfSSIIIIISSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
                if (noun in bottle)
   SSSSSSSSSSSSSSSSSIIIISSSSIIIIIIS
                    "You're holding that already (in the bottle).";
   SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
          ],
   SSSSSSSfD
     has  container;
   FFDDDDDPPPPPPPPPD
   
   [ ReadableSpell i j k;
   fffffffffffffffSSSSSSS
     if (scope_stage==1)
   SSSSSSIIIIIIIIIIISSIS
     {   if (action_to_be==##Examine) rfalse;
   SSSSSSSSSSIIIIIIIIIIIISSIIIIIIIIISSSSSSSSS
         rtrue;
   SSSSSSSSSSSS
     }
   SSS
     @set_cursor 1 1;
   SSAAAAAAAAAAASISIS
   ];
   fD
   
   Extend "examine" first
   DDDDDDDQQQQQQQQQFDDDDD
                   * scope=ReadableSpell            -> Examine;
   FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD

(where E = escape characters, A = assembly and I = "codealpha", that
is, identifiers cited in statement code).


   13   What little I remember
        ----------------------

   13.1   Publication history
          -------------------

The language has existed in the following different forms on the Internet:

      Inform 1      v0.5     April 30th 1993
      Inform 2      v0.6     (I can't find any record of this date)
      Inform 3      v1.0     November 17th 1993
      Inform 3a              December 7th 1993
      Inform 4               January 20th 1994
      Inform 5               June 1st 1994
      Inform 5.1             June 12th 1994
      Inform 5.2             June 19th 1994
      Inform 5.3             (released on an "Acorn User" cover disc)
      Inform 5.4             October 2nd 1994
                   update    later in October 1994
      Inform 5.5    v1501    late May/early June (private circulation)
                    v1502    June 30th 1995
      Inform 6      v6.01    April 30th 1996: declared a "beta" release
                    v6.02    May 5th 1996
                    v6.03    May 10th 1996
                    v6.04    September 12th 1996: declared no longer "beta"
                    v6.05    September 24th 1996
      Inform 6.1    v6.10    December 16th 1996
                    v6.11    January 27th 1997
                    v6.12    March 26th 1997
                    v6.13    April 5th 1997
                    v6.14    September 8th 1997
                    v6.15    March 22nd 1998
      Inform 6.2    v6.20    December 10th 1998: declared a "beta"
                    v6.21    April 30th 1999

There have been a few other very slightly modified states, consisting of
bug fixes and source code portability improvements on the above (e.g.
Inform 4 went through a fairly long process of tidying-up once it reached
the porters), and a number of CD ROM releases.

This base of users has continuously grown, from size 1 (myself) to what at
time of writing is a very large group of at least casual users and a group of
perhaps 50 "serious" ones whom I actually know.  Inform's early versions were
very hard to use, as will become clear, and only attracted attention because:

   (a) it did something undeniably "cool", to aficionados of 1980s adventure
       games - it revived the Infocom run-time format, and the early Inform
       manuals were for a while the fullest documents on the Web describing
       the Infocom format; and
   (b) since "Curses" had been written in it, the compiler must at least work.
       (Even though the source code for "Curses" has never been made public,
       the fact that it was known to exist underwrote the project and made
       it seem worthwhile to the many porters who invested considerable time
       and effort on initially complex and poorly written code.)

I first posted Inform to the Internet in the spirit of "glasnost" rather than
with any pretension that it would be a widely used program.  (It worked for
me, which was all that was then required.)  By mid-1994 my ambitions for it
had grown: in looking back, that first year is the interesting one.


   13.2   Design history
          --------------

Inform was designed specifically as a language to express programs which
would run on the Z-machine, a run-time format which already existed.  In
turn, the Z-machine was designed specifically as a run-time format for
an existing language (ZIL, or Zork Implementation Language, a form of MDL).
Inform is therefore the result of a game of Chinese whispers, since I had no
idea what ZIL looked like until much later.

In fact Inform and ZIL are entirely dissimilar.  One reason for this is that
ZIL is syntactically a very high-level language, not unlike LISP, and its
compiler (ZILCH) made a substantial translation down to Z-machine level:
whereas Inform was designed very close to the Z-machine level.  It is often
remarked that Inform resembles C, and although this is partly because some
syntax conventions were borrowed from C by a designer who felt comfortable
with it, another reason is that both languages were designed for easy
translation to efficient low-level code.

Inform began as an assembler called "zass", whose syntax continues to exert
a (perhaps malign) influence on the language of today.  The zass assembly
mnemonics were different from the modern set (this was at a time when no
standardisation of opcode names had taken place), and were copied from the
set of names output by Mark Howell's disassembler "txd".  (The opcode
names recognised by Inform were tinkered with until Inform 5 when the standard
set was agreed.)  But the format of routines and labels remains.  A typical
"zass" routine looked like this:

    [ Routine v1 v2 ... vn
      an assembly line
      .alabel
      another assembly line
    ]

and semicolons were introduced as line separators when "zass" became Inform
1.  Individual assembly lines then took the form

    opcode operand1 ... operandn store;

(if they were store operands), with no syntactic indication that the last
operand was not really an operand but a store destination.  This persisted
until Inform 5, and is still recognised in Inform 6, though the preferred
syntax is now

    opcode operand1 ... operandn -> store;

Right up until Inform 5, similarly, "sp" was a reserved word meaning "the
assembly-language stack pointer", preventing its use as a local variable
name (for example).  Only in Inform 6 has it disappeared from the set of
reserved words (except in the context of assembly lines).

One of the most unusual features of Inform is its handling of local variables
to functions, in that

    (a) functions can be called with any number of arguments;
    (b) there's no specification by the programmer of any expected number
        of arguments;
    (c) local variables and argument-receiving variables are treated as
        the same;
    (d) there are at most 15 local variables per routine.

All four arise from the fact that this is how Z-machine programs behave:
these were the rules for zass, and they're still the rules for Inform 6.
It's efficient to identify Inform's idea of local variables with the
Z-machine implementation of locals; nevertheless, the decision to do so
was made by default.

This illustrates a general point about restrictions in the Inform syntax.
The Z-machine is rich, well-equipped with features making it a good target
for compilation.  It implements, let us say, feature F subject to a modest
restriction (for example, it implements local variables subject to a limit
of 15 per routine).  Since this is good enough for most purposes, and
efficient, and easy to compile to, an early release of Inform decided to
make use of the Z-machine's implementation directly; and consequently the
design of the Inform language was shaped around the restriction needed
to make it possible to use F.

In other words, if a sparer "RISC" target machine had been chosen, Inform
would have been obliged to make a new implementation of feature F, which
would not have been subject to arbitrary Z-machine restrictions.

Perversely, it is therefore Inform's greatest weakness (as well as its
greatest selling point) that it compiles to the Z-machine.  At least, though,
there is a not inconsiderable reward from using the Z-machine's features
so directly: compact and fast code results.

In any case, the history of Inform is a continuous process of syntax moving
away from direct expression of the Z-machine and towards a higher-level
description of a program.


Something that partially frustrated this was that Inform acquired a serious
user base "too soon", before the language design was mature enough: by Inform
4, a good many people were using Inform, and this began to make me nervous of
changing the established syntax. But if a syntax feature is clearly "wrong"
(that is, anomalous and inconvenient) then clearly one must bite the bullet
and make a change.  In retrospect, I feel I have always been too hesitant
over this.

For example, I realised when working on Inform 5.1 that the notation

    print "You swallow ", (the) food, " with glee.";

to mean "print the name of the object whose number follows, preceding it
by a suitable definite article", would be desirable.  But I had wanted
Inform 5 to be a point of stability, not rapid change, and so I shied away
from adding this feature (besides, I was worried about complicating the
interaction between the library and the language itself).  I did not in fact
add the feature until Inform 5.5, but it would have been easier for all
concerned had I done so earlier.

The moral to draw would be that any really big changes still needed ought
to be made now.  Unfortunately, they are just too big (the syntax for
character and dictionary literals, for example, or for array and property
value initialisation) and so I have failed to learn my lesson, after all.


Another problem with an evolutionary design is that vestigial features tend
to survive, especially if one is trying to serve a user base who may still
be using old syntaxes.  (For instance, using old library files of code.)
A problem P, say, is given a crude solution S1 in 1993, a slightly better
solution S2 in 1994 and a proper solution S3 in 1995.  By this time, does
Inform still have to recognise the syntaxes applying to S2 and S1?  Usually
I have decided to make it do so; the formal concept of an "obsolete usage"
(which would be recognised but cause "please change this" warnings to be
issued) was not added to Inform until 5.5.

For example, how do we set a property value?  In Inform 1, by using a line
of assembly code:

    put_prop lamp oil_left 15;

By Inform 3 this had become painful, and a shift away from assembly language
was taking place, but not a drastic one.  The statement "write" was devised
for writing a sequence of property values (in this example, just one):

    write lamp oil_left 15;

(Even this existed partly because of a horrid form of "object recycling"
present in Autumn 1993 forms of "Curses" and very early drafts of "Jigsaw",
causing objects to be heavily over-written: it became necessary to change
many properties in one go.  The "write" statement was a great help.)

Only in Inform 4 did the C-like form appear:

    lamp.oil_left = 15;

And not until Inform 6 was it possible to apply ++ and -- to a property
specified in this way.  Likewise, only in Inform 6 was the "write" statement
finally withdrawn from recognition, though it was flagged as obsolete in
Inform 5.5.

Such obsolete syntaxes do not merely increase the "clutteredness" of the
language; they also block more desirable syntaxes by causing clashes.  And
the tendency to provide partial solutions to problems has been unfortunate
in putting off the point where significant change is needed.  I did not
realise in "zass" that a notation was needed to refer to dictionary words
as operands of opcodes, automatically creating them if need be; then I did
realise this, in Inform 1, but made a directive called "Dictionary" to set
constant symbol names equal to their dictionary addresses; in Inform 3,
I found this was cumbersome and instead used the syntax #n$... to indicate
"new dictionary word" (it seemed logical since #w$... was being used for
"existing dictionary word"); in Inform 5, the syntax '...' was used, and
this was perhaps a mistake, since it clashes slightly with the syntax for
specifying literal characters.

If I had realised at Inform 1 that a syntax was needed, this would have been
avoided.  As a footnote, #n$... is still present in the language; #w$...
and "Dictionary" were finally deleted in Inform 6.


A happier point of view on this evolution is that, at every stage, the subset
of Z-machine features which Inform could use expanded.  Inform 1 could produce
only version 3 games; Inform 4 was a major breakthrough in making version 5
games; and so on.  (V5 did not become the default version compiled to
until Inform 5, but this reflects a lack of confidence in the interpreters
for high-version-number Z-machines then available.)  And an evolved syntax
is at least adapted to what its users want of it.


    (i)  Inform 1 and 2
         --------------

Inform 1, then, was the assembler "zass" rewritten and with certain shorthands
added.  For example,

    var = number operator number;

was permitted, though full expressions were not.  Most of the "creating"
directives were present -- Object, Global, Property and Attribute -- though
directives were not syntactically distinguished from statements, and could
be mixed freely.  Nor were statements distinguished from assembly lines:
indeed, most of them were assembly lines.

In addition, control structures were provided:

     if ... else ...
     while ... ...
     do ... until ...

These took their modern form, except that conditions did not need to be
bracketed, and braces were compulsory around all code blocks, even those
which contained only one statement.  A form of "for" loop was provided,
but took the BASIC-like form "for var 1 to 10".  (This syntax existed
alongside the modern "for" syntax until Inform 6.)

Grammar and actions existed in their modern form from the first (except
for the <...>, <<...>> notation for causing them).  Object definitions took
their modern form (except that specifying a parent was compulsory, there was
no "class" segment and routines could not be given as property values).

And that was about all.  The language appears from this description to be
very primitive, and it was, but it was not so different in appearance from a
"real" language (except for the lack of compound expressions) and "Curses"
was written in Inform 1 code for about two years before being translated
into modern syntax.

Inform 2 was only a tidying-up release of Inform 1.


    (ii)  Inform 3 and 3a
          ---------------

Inform 3 was a more substantial upgrade, with various worthy features
(command line switches; control over header flag bits) added.  But the
driving force was that "Curses" was being expanded, and running severely into
the limits of the Z-machine Version 3 architecture: in particular, the limit
of 255 objects.  "Curses" apparently needed a minimum of about 270: Richard
Tucker suggested to me that there was no need for the 255 physical Z-machine
objects to correspond directly to the 270 game objects.  So the idea of
"recycling" was born.  Object O might be a carpet in one room and a door in
another: as long as it was impossible for these ever to be simultaneously
present.  When either room was entered, everything about O was over-written
to make it the appropriate definition.

This made it necessary to be able to drastically alter objects.  The "write"
statement (see above) was introduced to fix properties, a syntax #n$...
was introduced to refer to dictionary words (because so many were needed as
property values to over-write with) and the @.. string escape was added
to make use of the Z-machine's abbreviations table, with the "string"
statement added for changing entries in it.  (This last made it possible for
the apparently unalterable object short-name to consist of a string reading
"print abbreviation 1", so that by changing abbreviation 1 it was possible to
effectively alter short names.)  Finally, object declaration syntax was
altered to make it possible to give the same object more than one symbol
name.

In retrospect this substantial effort to break the game-object Z-object
correspondence was misguided; the proper solution to the problem was to make 
Inform Version-5 capable (as happened in Inform 4).  It ought to be
remembered, though, that at this time many people were using Version 3
interpreters not capable of running Version 5 games (indeed, the Version 5
release of "Curses" was one of the nudges towards a wider awareness of
Version 5 and the interpreters capable of running it).  Inform hung back
because interpreters were not ready; yet as events turned out, interpreters
became ready partly because of Inform overtaking them.

Inform 3a was a tidying-up release of Inform 3.


    (iii)  Inform 4
           --------

Inform truly became a high-level language, rather than an assembler with
delusions of grandeur, only with the release of Inform 4.

The important new addition was a full expression evaluator, recognising
operators (and an expanded set of these, including for instance ++ and --),
precedence levels, brackets and so on.  Conditions were also permitted to
be compound, using the new logical operators && and ||.  Properties could
be referred to using the new "." operator (along with ".&" and ".#").
System functions (such as "children" and "random") were added.

This was the decisive break of the correspondence between Inform statements
and Z-machine instructions: expressions no longer looked like three-address
code by any other name, and might compile to arbitrarily long sequences of
instructions.  Opcodes like "random" were no longer accessed directly, but
via system functions.

The most annoying syntax point remaining was that braces were still
compulsory.  A typical Inform 3 line might have been:

    if i == 2 { print "Two"; }

This restriction was now lifted.  Implementation considerations
(specifically, in the source line-reader which fed into the tokeniser) meant
that some indication was needed of where the condition ended, and so brackets
were imposed on conditions.  Thus, the corresponding Inform 4 line:

    if (i == 2) print "Two";

Making the parallel with C even closer, new-style "for" loops were
introduced:

    for (i=0:i<10:i++) print i;

(and the desire to make Inform capable of compiling "for" loops as complex as
those available to C caused the operators ++, -- and , to be added).  More
originally, "objectloop" was added.

Although it had little impact on the syntax, to the outside world the major
development was that Inform was now capable of compiling Version 5 files
(which had no limit on object numbers and could be twice as large as Version
3 ones).  The main impact of this on the language (which I now regret) was
that statements like "box" and "style" were added to control newly available
text features of the Z-machine.

Otherwise I took the decision that in principle all Inform programs should be
cross-compilable to all versions (subject to the limits applying in each
version), and this remains substantially true.  Since certain game operations
such as saving, loading and reading the keyboard were commanded by very
different assembly language in versions 3 and 5, these were abstracted into
statements so that, for instance, what looked like version-3 assembly
language to read the keyboard would in fact be compiled to appropriate but
different version-5 assembly.


    (iv)  Inform 5
          --------

The final stage in Inform's arrival as a programming language took place a
little over a year after Inform 1's creation.  Only then did I look back in
leisure at the language, and only then did I begin to seriously find out about
the rival adventure-game languages to see what they were capable of.  I was
unimpressed except by TADS (which I continue to have a great respect for):
the point was rubbed home when I was adapting the original mainframe
Adventure, "Colossal Cave", to Inform for use as an example.  Dave Baggett's
TADS translation was constantly at my side, and it became obvious that the
corresponding Inform 4 code was far less legible.

In some ways, then, the Inform 5 syntax was worked out by constructing the
language so as to make the implementation of "Advent" look neat.

Although Inform had, from the first, associated code closely with objects
(the concept of before/after routines was part of the original conception),
only now was this brought out in syntax by making explicit routine definitions
property values.  (Previous code had been full of property values like
"FireplacePre", which were names of what would now be called before-routines.)
The syntax now seems obvious and natural, but at the time it felt like a
breakthrough: what had been broken through, I think, was the residual feeling
that Inform was an assembler at heart, whose programs were a list of routines
interspersed with some overgrown variable definitions.

Class definitions (and the inheritance rules) followed in the same, somewhat
bewildering, couple of days, and a brace of aesthetic changes followed:
dictionary words written 'thus', ##Names for actions (rather than the
previous #a$Name syntax), the use of bare strings as implied "print_ret"
statements. Most importantly, the use of <...> and <<...>> to trigger
actions, and the implementation of what amounted to a switch statement on the
current action within embedded routines.

I have come to realise that adventure game programs are unusually dense in
selections and comparison against lists of alternative possibilities.  Inform
now contains many syntaxes to make such comparisons concise:

    switching on actions in embedded routines;
    the "switch" statement (added in Inform 5.5) - note that both of these
        deliberately take a more concise form from, say, C, by not allowing
        case fall-through and therefore not requiring the incessant use
        of "break" statements;
    switch ranges such as "5, 7, 8 to 19: ...";
    the "or" alternative operator in conditions, as in "if (x == 1 or 5) ...".

And this, I feel, is something for which I should make no apology.

Later developments (in 5.5) included the Array directive (previously, it was
customary to use global variables to point to arrays) and the printing
syntaxes for "print (The) ..." and so forth (see above).  An extension was
made to "Versions 7 and 8" (new hybrids of the Z-machine allowing larger
games to be compiled), though this had little impact on the language or the
compiler.

The release of Inform 5.5 notwithstanding, the language remained essentially
stable during the two years between Inform 5 and Inform 6.  For the first
time it became possible for people other than myself to seriously use Inform,
rather than toy with it.  From Inform 5 onwards, the issue has not been
adequacy of the language but adequacy of its documentation, and I think this
is the point where Inform can be counted a "proper" language.


   13.3   Implementation history
          ----------------------

Writing a compiler for a large language from scratch requires glacial
patience, but the result is a mountain.  The time invested in tedious details
during development is repaid ten times over when the compiler comes to be
used.  Temporary measures are infernal temptations, and must be rejected
firmly.

Inform was not written this way.

The compiler has been through three "design iterations": "zass", which was
discarded as soon as I thought I had understood how the Z-machine worked,
Inform 1 to 5.5, and now Inform 6.  This section briefly describes the second
iteration.

Inform 1 was essentially a program for translating assembly language lines
into Z-code statements.  It made the generous and unjustified assumption
that its input would always be well-formed except for the odd spelling
mistake, and did not check syntax particularly carefully: lexical analysis
consisted of grabbing a line from the source code files (accessing these
files a byte at a time, in an unbuffered way) and cutting out white space
to leave a set of tokens.  Syntax analysis consisted of going through
these tokens until a line seemed to have worked out all right, and then
forgetting the line altogether (never checking if there were any more tokens).
This caused an enormous number of little jobs later, testing for error
conditions like "expected a semicolon after ... but found ..." in a quite
unsystematic way.

Not only was there no formal apparatus for syntax analysis, but lexical
analysis went on all over the place, with strings being cut up and compared
against the symbols table erratically (and often repeatedly).  Tokens were
stored as strings (with all the consequent inefficiencies of manipulation).
Direct string comparisons were commonplace, making it difficult to pin down
what the set of reserved words was, and indeed almost impossible to write
a precise definition of what programs would pass Inform syntax checking.

When statements required translation into code, this translation would
actually be made into textual assembly language instructions, fed back into
Inform via a "preprocessor".  (As Inform became more elaborate, this
preprocessing might take three stages: high level code such as "objectloop"
constructs being knocked down to equivalent "while" statements and then to
assembly language lines.)

In order to cope with the problem of forward references to labels, Inform
made two passes through its assembly language source, which is fairly standard
behaviour for assemblers (because it is easy to implement and has low
memory overheads).  Yet it became a more and more nightmarish process to
ensure that high-level constructs would always translate into identical code
on pass 1 and pass 2, and ever more clear that great effort was being made
to do a long and difficult calculation twice, burning the result the first
time.

Given all this, the wonder is that Inform worked at all.  In fact it worked
rather well, after a couple of years of polishing and optimisation: although
slow in theory, profiling runs suggested that the process of translating into
textual assembly language (for instance) was not a significant speed loss
(since opcodes could be textually recognised very quickly, while the work
in parsing operands had been deferred from earlier on, and would have to have
been done anyway).  Inform performed very rapidly in terms of compilation
time per source line when compared to typical commercial compilers (which,
to be fair, were compiling more complex languages) and was unusually low
in memory consumption.

This last was a serious issue when Inform was released, as many users were
running it on machines like 0.5M Amigas.  Without a grown-up syntax analyser,
there was no need to store large parse trees (such as most compilers do for
entire routines at a time: the performance of most compilers seriously
degrades as the length of a single routine increases).

From Inform 1 to 5.5, the main algorithms and issues never changed, despite
much reorganisation, renaming, rewriting and extension.  But after two and a
half years, it had clearly "rusted" badly: it was difficult to maintain or
work on.  Bugs were appearing which seemed ridiculous in a program so
established and widely used: for example, not until 1995 did anybody realise
that division had been accidentally programmed as right, not left associative
(thus "12/6/2" evaluated to 4, not 1).  An endless mass of ad-hoc rules was
being added to cover up the lack of proper lexical analysis.  Moreover, the
major new feature -- the linker -- added another massive complication to an
already convoluted piece of code.  (I have by now designed and painfully got
working three different versions of the linker, and would not go through all
that again for all the tea in China.)

What began as a general tidying-up and bug-fixing exercise sank into the
depressing realisation that a total rewrite was the only way to make
significant progress.  Six months later, the writing of this sentence finally
completes that task.


   13.4   Modification history
          --------------------

   (i) Changes made between v6.01 and v6.02
       ------------------------------------

Features added:
"Declared but not used" and "no such constant" errors now reported on or near
    line of occurrence, not at end of source.
Error message added for local variable being defined twice for the same
    routine

Bugs fixed:
"Segmentation fault" (i.e., writing memory out of range) fixed to do with
    symbol flags being set wrongly
Constant binary numbers not lexed correctly
"for (p = 0::)" misinterpreted due to "::" being recognised as an operator
Backpatch error under RISC OS (and possibly elsewhere) when linker used
Grammar token routines reported as "declared but not used"
Grammar token routines not working at run time
"ofclass" failing to recognised inherited class membership
Inheritance of negated attributes not working
"children" function giving the wrong answer (specifically, always giving the
    object number of the first child); this had the knock-on effect of
    making the "tree" debugging verb produce incorrect output

Source code cleaning:
Header file rearranged to typedef int32 in the right place, and tidied
    up a little; ICL_Directory defined rather than left to cause an error
    when Inform is compiled; INFORM_FILE replaced with MAIN_INFORM_FILE
    in some of the OS definition blocks; SEEK_SET defined rather than left
    to the vagaries of ANSI
Type clashes between int and int32 reconciled for:
    assemble_label_no, routine_starts_line, read_byte_from_memory_block,
    write_byte_to_memory_block, parse_label, symbol_index
Return value for hash_code_from_string cast explicitly to int (rather than
    implicitly from unsigned int)
prec_table given type int
Many "dead assignments" (redundant settings of variables) removed
"=-" replaced by "= -" to prevent pre-ANSI compilers from misunderstanding
    "x=-1" as "x=x-1"
The veneer string for the source of "CA__Pr" has been contracted to make
    it smaller than 2048 chars long, since Microsoft Visual C/C++ won't
    compile strings longer than that
symbs[] explicitly cast to char * in a few points in "linker", "objects" and
    "symbols"
Format string for process IDs corrected from "_proc%08x" to "_proc%08lx"
Format string for serial number copying removed, and a use of strcpy put
    in its place

   (ii) Changes made between v6.02 and v6.03
        ------------------------------------

Feature added:
The on/off command-line switches can now be prefixed with ~ to turn them off.
   (This is useful only in ICL, to undo the effect of previous lines.)

Bugs fixed:
The "my parent is Class" value in modules changed from 0xffff to 0x7fff to
    make it valid in type int (rather than int32)
The "spaces" statement not being parsed correctly (bug in syntax analyser)
Arrays declared using Global rather than Array don't work (this is fixed,
    but also now causes an obsolete usage warning)
Fclose applied to null file handle (effectively, the same file closed twice)
"for (::p++)" misinterpreted due to "::" being recognised as an operator
Some -> -> object initialisations getting the structure wrong
Table of property names wrongly compiled (resulting in garbled run-time error
    messages)
Serious failure of individual property inheritance from a class when both
    inheritor and class define individual property values
Messages sent to a property whose value is NULL now do nothing and reply 0
    (as if the property value were 0)
Action switches not working in properties called via messages (well, not
    really a bug: but it's better that they should work, as this makes it
    possible to call "before"/"after" as messages)
The "children" inlined routine leaving redundant values on the stack (which
    resulted in complex expressions containing uses of children()
    mis-evaluating)
Actions in the form <A x> and <A x y> using a call opcode improperly (with
    possibly regrettable results -- though they worked on my interpreter!)
ICL files not working (due to a bug gratuitously introduced in v6.02), and
    not understanding tab characters as spaces, and generally being poorly
    implemented (I found print "Err..."; in the code, to my alarm): generally
    tidied up now
Negative constants not working as switch() case values
Slightly better reporting of bad statement errors, and slightly better
    recovery

Source code cleaning:
Calls to get_next_char() replaced with (*get_next_char)() which is equivalent
    to a modern ANSI compiler, but inequivalent to a pre-ANSI one
Logically redundant "break" statements added to parse_routine() to prevent
    either "unreachable code" or "no return from non-void function" warnings
    being produced
parse_given_directive() has its unnecessary parameter removed
uchar types used for characters as read in by the lexer before filtering takes
    place to, e.g., strip off any set top bits

   (iii) Changes made between v6.03 and v6.04
         ------------------------------------

Features added:
When double-quoted text is continued from one source line to another, and
    the first ends in ^, then the new-line is not replaced by a space " ".
    Thus:    print "Shall I compare thee to a summer's day?^
                    Thou art more...";
    results in the "T" being printed underneath the "S", not one space to the
    right.
Statements like

        "The wyvern swallows ", (the) noun, " in three gulps!";

    are now correctly understood to be print_ret statements.  (Previously
    this gave errors because the syntax analyser thought it was going to
    be a list of three switch cases separated by commas, until it hit the
    semicolon and realised there should be a colon, by which time it was
    too late to go back.  The language definition has never been clear
    on whether long implied print_rets are legal or not, so I've decided
    that they are.)
The #n$ construct (for single-character dictionary words) can now take
    non-alphabetic characters: e.g., #n$1 and #n$# refer to dictionary
    words '1' and '#'.

Bugs fixed:
The Verb "newverb" = "oldverb"; syntax not working, due to mistake in syntax
    analyser (or rather, to inadequate testing)
If routine R is being Replaced, then its new definition can now appear either
    before or after its Library definition (as long as the Replace declaration
    occurs before both definitions).  In 6.01 to 6.03, errors would result
    if the new definition was earlier than the Library one.
Constants not being allowed as switch cases when they happen to be the same as
    internal statement keywords (such as the single letter "a")
Memory allocations of 0 bytes now return NULL, which protects Inform from
    trying to apply "free" to them, which seems to have caused problems on
    some ports (not on mine, but it was my mistake)
Spurious "Second 'Ifnot' in same 'If...'" errors appearing in certain
    nested 'If...' configurations
A comma in between object headers and bodies is now optional once again (as
    it used to be in Inform 5), rather than forbidden.
Text or commentary appearing after a string folding character \ now causes
    an error, as it should (the rest of such a line should be empty).
"continue" not properly working inside a "for" loop of the "optimised" kind
    (that is, whose update code consists of just variable++ or variable--)
For two different reasons, "or" did not quite work when used in its
    extended way (i.e. with conditions other than ==, ~= or with many
    alternatives): apologies, as the code was in a muddle.  Better now.
Class classname(N) ... was allowing creation of only N-1 new instances,
    not N.
(a) not recognised as a print specification when "a" is also the name of
    a local variable.
"youngest" function failing to work.

Source code cleaning:
Header declarations of the variables "mv_xref" and "no_stubbed_routines"
    removed (neither one actually exists in the released state of Inform 6)
Miscellaneous comments added
Assembly of store opcodes optimised to pull opcodes in a few cases, saving
    1 byte of object code each time
Hooks for Robert Pelak's MAC_FACE port (traditionally the most strenuous)
    inserted
Error messages reworded slightly for better consistency

   (iv)  Changes made between v6.04 and v6.05
         ------------------------------------

Feature added:
Assembly language tracing slightly tidied up (a cosmetic change only).

Bugs fixed:
When a file with very long strings in (such as one generated automatically
    by "infoclues") is read, it can corrupt memory, resulting in the malloc
    heap being damaged.
Spurious backpatching errors (actually, all backpatching errors are spurious -
    backpatching is supposed to work every time) produced when the Z-code
    area of the Z-machine exceeds 64K.  (This seldom happens even in large
    games, unless quite long print and print_ret strings are quoted.)  The
    error message in question was
    
        *** Backpatch symbol out of range ***

    and, just to recap, this should never appear. Please send me the
    offending source code if it persists!
The only time I've seen this bug is that it once hung the Z-machine while
    working on printing an inventory of items which needed grouping using
    "list_together", but it's actually a mistake in the shift-reduce
    expression grammar: "(...)" (function call) has precedence level 11,
    whereas "." has level 12, in order to make messages
            object.message(...)
    parse correctly (i.e., as (object.message)(...)).  This isn't usually
    done in SR operator grammars because, as I now realise, it results in
            function(X).property
    being misinterpreted as function(X.property).  I've corrected this by
    giving (...) the asymmetric precedence level of 11 on the right, but
    14 on the left.
Printing dictionary contents to a text transcript file not working on the
    Amiga port, since an internal buffer was overwritten by one byte
    (5x16+1 = 81, not 80).
Spurious "variable declared but not used" warnings appearing for a variable
    used only as a store destination in assembly language.
The "box" statement producing boxes of the wrong width when the text contains
    the @ escape character (e.g. "@@92@@92@@92" used to be wrongly
    calculated as 12 characters wide, when it really consists of three
    backslash characters).
The v6.04 optimisation using "pull" (see above) didn't work in v6, v7 or v8
    (for two different reasons, but basically because the opcode behaves
    differently in these as compared with lower versions).
Assembly language in v8 recognising the wrong instruction set (same as
    the previous bug).
Version 6 games not properly compiled in some cases (because of a rounding
    error in calculating packed address offsets: a new section 8.8 has been
    added to this manual to document how Inform works these out).

   "I compiled Inform 6.05 under Purify, and am pleased to report that Purify
    reported exactly 0 memory access errors and 0 bytes of memory leaked."
    -- Brad Jones


   (v)  Changes made between v6.05 and v6.10
        ------------------------------------

Features added:
The four ICL path variables which define places to look for input of
    different kinds (source code, include files, ICL files and modules)
    can now hold lists of alternative locations, separated by a character
    called FN_ALT which may be OS-dependant (but is normally ',').
    These alternatives are tried left-to-right until the desired filename
    is found.  It's legal for an alternative to be empty, meaning the
    current position (e.g. "+include_path=,library,oldlib" gives three
    possible paths -- the current directory, then the library and oldlib
    subdirectories).  The on-line examples (in the -h1 help information)
    now include this new feature.
File and pathnames in ICL and on the command line can now contain spaces
    if written inside double-quotes: e.g., "Games Folder/New Games" would
    be a valid ICL pathname.  (Provided for benefit of Macintosh users.)
A new error message format, -E2, uses Macintosh Programmer's Workshop style.
Output game files in the MPW port are given the type and creator for the
    MaxZip interpreter.  (With thanks to Tom Emerson for supplying details.)
The compiler can now generate a new format of grammar table: if used with
    old code, it will behave just as it used to, but if used with Library
    6/3 or later will generate "GV2" (grammar version 2) tables.  Users
    will notice that several restrictions are lifted, most usefully:
    
        the number of tokens per grammar line can be up to 30, not up to 6;
        the total number of prepositions is now unlimited (it used to be
            limited to about 80);
        the total number of routines named in grammar tokens is now
            unlimited (it used to be limited to 32);
        the total number of actions per game can be up to 4096, not 256.

    (Chapter 8 above has been rewritten to document GV1 and GV2 in full.)
    In addition, there are three new grammar features:
    (a) you can mark an action as "reverse", as in the following example:

        Verb 'show' 'present' 'display'
                * creature held                  -> Show reverse
                * held 'to' creature             -> Show;

        "reverse" indicating that the two parameters are to be reversed
        in order before the action is generated.
    (b) you can give two or more prepositions (only) as alternatives:

        Verb 'sit' 'lie'
                * 'on' 'top' 'of' noun           -> Enter
                * 'on'/'in'/'inside' noun        -> Enter;

        the "/" markers indicating alternative prepositions, any one of
        which is equally good.
    (c) there is a new grammar token called "topic", which matches any
        block of text up to the next preposition or the end of the input.
        For instance,

        Verb 'read'
                * noun                           -> Examine
                * 'about' topic 'in' noun        -> Consult
                * topic 'in' noun                -> Consult;

        It's used mostly for reading material and subjects of conversation,
        hence the name "topic".  For how to deal with the results of parsing
        a topic, see the Designer's Manual on "Consult".
    These three features are only available when using Inform in conjunction
    with Library 6/3 or later.
The run-time veneer's implementation has been rewritten, especially in the
    areas of message-sending and superclass access.  Several benefits:
    (a) the :: superclass operator can be used with any property
        (in Inform 6.05 it only worked with individual properties);
    (b) it makes sense to look up or send to

            something.Object::cant_go

        and this refers to the default value of "cant_go".  (So we can
        think of common properties as being exactly the ones inherited
        from class Object.)
    (c) printing out a property name, as in
    
            print (property) Crown::colour

        now correctly prints "Crown::colour".  (Previously it got stuck
        on superclass properties.)
    (d) the limits on numbers of things are made more sensible.  These
        are now as follows:
        (i) up to 256 classes per game;
        (ii) up to 128 different individual properties (and up to 63
             common properties) can be provided by any single object;
        (iii) up to 32703 individual property names throughout the game.
        These limits are now enforced by the compiler, which used to
        lazily not check them.
    (e) in games using the Inform library, when the variable debug_flag
        has bottom bit set (the Inform library does this when the
        "routines" verb has been typed), all messages are traced to
        the screen.  (Except any messages caused as a result of trying
        to perform this printing.)  This is fuller and more helpful
        than the old "routines" output.
    (Several parts of chapter 9 above have been corrected.)
The dictionary data structure is entirely redesigned, as a red-black tree
    rather than a hash table.  The effects of this change are localised
    to "text.c" (i.e., no other source files were modified to achieve it).
    For small games no difference will be noticed; for large games there will
    be a slight speed improvement, becoming more noticeable as the game grows.
    (E.g.: on my machine "Curses" compiles 4% faster.)  See Section 8.4 above
    for details.
As part of a general move toward multilingualism, the notation for dictionary
    words is extended: 'word//letters' means the word 'word' with some flags
    attached.  At present, the only such flag is 'p', meaning "plural".
    Note that 'word//' is legal and equivalent to 'word'.  Thus, 'w//'
    is another way of making the dictionary word #n$w (which we can't write
    as 'w' because that's only a single character).
    (This information is stored in a previously unused flag in #dict_par1:
    see the bitmap in section 8.5 above.)
Dictionary words may now contain accented characters (this is likely to be
    essential for some Scandinavian languages).  It is also now legal to
    write a quotation mark within quotation marks, in two cases:
        '''         (meaning: the literal character ')
        '@'etude'   (to put an acute accent on something)
To avoid dictionary resolution falling unacceptably when accented chars
    are used, it's helpful to move commonly occurring ones into the
    Z-machine alphabets: the new directive "Zcharacter" does this.
    See section 12.2 above.
The dictionary contents are now given in alphabetical order in text transcript
    files, and the "Trace dictionary;" output is much more descriptive.
Tracing output for calls to functions marked with a * is more legible,
    giving exactly the arguments supplied and naming embedded routines
    more sensibly (e.g. as "lantern.time_left" rather than "Embedded__43").
    The -g switch now traces only the user's own functions, not the library
    ones, unless -g2 is set.
Inform now checks for the error of giving the same property twice in one
    definition.  This is not always as obvious as (say) giving the
    "description" of an object twice over, because of the existence of
    aliases.  A typical error message would be:

         line 15: Error: Property given twice in the same declaration,
         because the names 'initial' and 'when_open' actually refer
         to the same property

A warning is produced if the "box" statement is used in a Version 3 game
    (since it cannot have any effect in such a game).
The default memory allocation for arrays has been increased (from 4000 to
    10000 bytes).  The original was too conservative, and anyway more will
    be needed for language definition files in library 6/3.
Action and attribute names are now written into story files, so that it's
    possible to print them (for debugging purposes).  Full details of how
    to do this are in section 9.6 above.  (Users with library 6/3 may notice
    that the "actions" debugging verb now names all actions, not just the
    ones defined by the library.)
Inform now allows the three special constants DEBUG, USE_MODULES and
    MODULE_MODE to be defined more than once without giving an error.
    Thus, if your code contains "Constant DEBUG;" already, but you
    compile it with the "-D" option anyway, no error will be produced.
When an object isn't given a textual name, rather than calling it "?"
    Inform now calls it something like "(red_box)", where "red_box" is
    its internal symbol name.  (This makes debugging output easier to
    follow and wastes very few bytes.)  When the object hasn't got an
    internal name either (if it's defined just as "Object;" for
    example) it has the textual name "(102)", 102 being its object number.

Bugs fixed:
"Extend only ..." not always parsed correctly, sometimes giving a spurious
    syntax error, other times failing to notice "first/last/replace".
In -U mode, failure to report as an error a line consisting of two words, the
    first of which is a constant or variable name defined in a previously
    linked module.  (This arose when the line "Action RevTakeWith" was
    accidentally written instead of "Fake_action RevTakeWith" -- what
    happened was that "Action" was recognised as the variable "action",
    which Inform knew was a symbol exported from a linked module (ParserM);
    it took this value as a class, wrongly, and made a new object called
    RevTakeWith of this "class".  The fault was in the syntax analyser,
    which wrongly assumed that it wouldn't necessarily be known whether an
    exported symbol was a class-name or not, and so allowed all exported
    symbols to be used, just to be on the safe side.)
ICL fatal error messages quoting a garbled filename.
Version 3 games (ah, remember them?) not compiling because of the usage of
    call opcodes not present in the V3 Z-machine
The "Include ">name"" feature not working when the original source file
    lies at the current working directory of the host OS, rather than in
    some subdirectory.
The linker had a rarely-occurring bug which prevented the tasks scoring
    system from working properly in a game which linked rather than included
    the library modules: if an instruction wrote a value to certain variables
    whose internal numbers were in a certain narrow range, then these
    variables would not be correctly reconciled with the story file variables
    at link time.  This same bug possibly also accounts for an even rarer
    occurrence, which I've had no definite report of, in which the less than
    professional error message "Ooops" is produced.  (I've corrected the
    error message to something more helpful.)
Rather seriously (though this error took a long time to be found),
    if (condition) rfalse; else ... [or the same thing with rtrue;]
    producing syntax errors.  (A mistake in some optimisation code.)
The "elder" function wrongly returning 0.
Not always reporting errors when global variables were used wrongly as
    values of object properties or array elements.  (Sometimes a backpatch
    error would be given.)  These are now properly parsed in constant
    context, not quantity context.
Spurious warning about code not being reachable at the close brace of
    a "for" loop, when this is indeed not reachable.  (It can still be
    sensible code, if the idea is always to "continue" or "return"
    from the body of the loop.)
When the error "Error: System function name used as value "child"" is
    produced, also printing a spurious *** Emit token error *** message.
Backpatch errors sometimes produced when compiling a line (in a module only)
    which contains an "if" condition making heavy use of "or".
Code backpatch errors sometimes produced when the condition "ofclass" is
    compiled.
'->' or 'Nearby' mysteriously failing if the parent object referred to
    was (a) defined in a module and (b) one of the first four objects
    defined there.
"Declared but not used" warnings issued for replacements of routines
    only accessed from within a linked module.
Error message if a non-existent ICL variable set, not being printed.

Source code cleaning:
A port definition called PC_WIN32 has been added to "header.h", courtesy of
    Kirk Klobe.
Two variables made int32 instead of int (thanks to Evan Day for spotting
    the desirability of this).
The veneer routines are segmented so that no individual literal string
    is as long as 512 characters.  (An irritating requirement imposed by
    the Macintosh port's compiler.)
Six miscellaneous explicit casts of (char *) to (uchar *), suggested by
    Robert Pelak.  More systematically, the opcode names in asm.c are all
    explicitly cast to (uchar *).
A slight change to inform.c makes the Mac front end's arrangement (of
    calling only sub_main(), while main() itself is not compiled) more
    generally usable by other ports: see section 2.2 (e).


   (vi)  Changes made between v6.10 and v6.11
         ------------------------------------

Bugs fixed:
An important one -- the optimised form of statements like
    if (variable) rfalse;
    was being negated, i.e., was being wrongly compiled as
    if (~~variable) rfalse;
    (In Library 6/3, this showed up at one stage as odd behaviour when
    moving around in darkness.)
The statement "spaces 0" printing infinitely many spaces instead of
    doing nothing.  (Negative arguments have always done nothing.)  It is
    possible to provoke this behaviour in calls to the list-writer.
Bug to do with parsing quoted filenames in ICL.
Spurious "declared but not used" warning for global variables used only
    in a module being linked in.

Source code cleaning:
Atari ST definition block updated on advice of Charles Briscoe-Smith.
Copyright dates nudged on to 1997.
Paranoid test added when closing temporary files.
Macintosh interface code added on advice of Robert Pelak.
Two long printfs subdivided to assist poor compilers.
String subdivision in "CA__Pr" in the veneer moved slightly in a way
    which makes no logical difference, but which seems to fix a mysterious
    problem observed by Robert Pelak (possibly with his compiler: it has
    been observed on no other platform).
Text transcript file is now opened as a text file, not a binary file,
    which will hopefully assist some ports but inconvenience nobody.
    (All the same, it may be worth testing that text transcripts still
    look sensible on your port.)  Code added to the Macintosh port to
    ensure closure of the file if compilation aborts.


   (vii)  Changes made between v6.11 and v6.12
          ------------------------------------

Features added:
A new switch, -C, and the ability to read source code files in any
    ISO 8859 standard extension to the ASCII character set.  The
    default is ISO 8859-1, Latin1, the standard character set on
    most computers in most European countries.  This means that
    many accented characters can simply be typed directly, a
    particular advantage with exotic ISO ranges (Arabic, Hebrew,
    Greek, Cyrillic, etc.).  If your computer is unable to use any
    of 8859-1 to -9, the -C0 switch (for plain ASCII only) can be
    used.  Accent markers such as '@:u' (for u-diaeresis) remain
    valid, just as usual, but a new string escape '@{..hex number..}'
    allows the specification of any Unicode character.  For instance
    '@{a9}' produces a copyright sign (Unicode values between $0000
    and $00ff are equal to ISO Latin1 values), and '@{2657}' produces
    in principle a White bishop chess symbol.  In practice, if you
    wish to use rare Unicode symbols, you'll need an interpreter
    obeying the Z-Machine Standard 1.0, and you'll probably have
    to supply it with an appropriate font as well; also, any characters
    not normally resident in the Z-machine need to be declared in
    advance of use with the "Zcharacter table" directive.  But if
    you're only using (e.g.) ordinary German, Spanish or French
    accents the new facilities will work with any Standard 0.2
    interpreter.
    The "Zcharacter" directive, for configuring the Z-machine to
    use non-English alphabets, is greatly enhanced.
    (In the Z-machine, Inform now always generates a header extension
    table, and sometimes uses a slot within it to point to a Unicode
    translation table newly specified in Standard 1.0.)
The debugging information file has finally been brought up to date with
    Inform 6, and old code inherited from Inform 5 (the only such
    code in the whole compiler) has been cleaned out.  The main change
    is in producing source-code-to-PC-value maps, which is made more
    difficult by the compression of code at the end of each routine.
    The exact format will probably change: this release is an interim
    one, to assist development of a source-level debugger.
The concept of sequence points has been introduced, likewise.

Bugs fixed:
Not a bug as such, but the veneer implementation has been changed so that
    class-objects never have attributes set.  In 6.01 to 6.11, a class
    object like Monster might have the attribute "animate" and would
    therefore be picked up in objectloops over all animate objects, and
    so on.  This is undesirable, and illogical since after all class
    objects don't provide any properties, so why have attributes?
    Anyway, the result means that objectloops of the form
    
        objectloop (x has <attribute>)  ...

    will now not range through classes but only genuine game objects.
When compound Boolean conditions are used as expressions (for instance,
    in an assignment like "x = (y>1) || d;") the wrong answers could
    result if the top operator was ||.  (If it was &&, or if the
    condition had no &&s or ||s in, or if the condition was being used
    as a test rather than a numerical value, everything was normal.)
Fake actions being defined before grammar version is set were causing
    mysterious problems.  This would happen if the Fake_Action directive
    were used before "Parser" is included.  An error message has been
    added advising that Fake_Action directives should be moved.
    (Fake actions are now mostly obsolete anyway.)
Any attributes aliased together being given the unhelpful name
    "<unknown attribute>" in output from the "showobj" debugging verb.
Backpatch *** errors *** sometimes being produced as a knock-on effect
    from already-reported linkage errors.  (These *** errors *** are
    now suppressed if linkage has already broken down.)
The character codes for << and >> (Continental European quotation marks)
    were the wrong way around, following a mistake in version 0.2 of
    the Z-Machine Standards document.  They have therefore been
    swapped over, following the correction made in version 1.0 of that
    document.
Accented characters deep inside dictionary words sometimes getting lost.
    (Because of an ambiguity in the Standards document where Inform
    made the wrong choice, but interpreters made the right one.)
The -w (suppress warnings) switch doing nothing.  (In writing Inform 6
    I simply forgot this.)

Source code cleaning:
A new UNIX64 port, for 64-bit Unix machines, added (courtesy of Magnus
    Olsson) which should subtract pointers correctly even in the unlikely
    event that they lie in different 2^32 byte pages.  Otherwise identical
    to UNIX.
And a new BEOS port added (courtesy of Michael van Biesbrouck) but which is
    identical to the Linux one in all but name.
Link error message handling, and character error message handling,
    slightly more automated.
The "Compiled with %d errors and %d warnings" message reworded so as not
    to mention 0 of anything, and to indicate how many warnings were
    suppressed (by the -q or -w switches).
The table of memory settings is now printed more prettily in response
    to the ICL command $list.
A new memory setting, MAX_LABELS, has been added, along with error checking
    to test for too many label points in any single routine.
Various functions to do with character set handling tidied up and moved
    into the new section "chars.c".


   (viii)  Changes made between v6.12 and v6.13
           ------------------------------------
Bug fixed:
String escapes in the form "You can't go @00 from here." not working
    (slip of the keyboard when re-writing string escapes in v6.12).

Source code cleaning:
The type of alphabet[][] has been altered so as to ensure that the
    contents are not read-only strings:
    from    char *alphabet[3]    to    char alphabet[3][27].


   (ix)  Changes made between v6.13 and v6.14
         ------------------------------------

Bugs fixed:
"Parker's disease": In very large games only, backpatch errors being
    caused because the Z-code area of the Z-machine had overflowed
    past the size where the compiler could backpatch it (this tended
    to happen only if large amounts of text were printed by Inform code
    rather than being property values).  The Inform code generator now
    automatically writes any string of text longer than 32 characters
    into the static strings area: designers shouldn't notice any difference
    in behaviour of the "print" or "print_ret" statements.
    (The new arrangement is fractionally wasteful of bytes -- about
    2 to 4 bytes per string in excess of 32 characters -- but makes
    a big difference to the relative sizes of the code and strings
    areas.  "Advent" compiled the old way was 60.1% code, 17.3% strings;
    the new way, it comes out 40.6% code, 37.1% strings and is about
    1K bigger.  Most of the change occurs when compiling library
    messages; the effects are less dramatic on typical Inform code.)
A different cause of backpatch errors is attempting to link Version 5 modules
    into a Version 8 game.  If you're compiling a V8 game and want to
    link in the library, for instance, you'll need to compile the
    library modules with "-v8" as well.  Inform used not to check this
    error condition, but now does produce an error message if there's
    a version mismatch between game and module.
The linker also now checks that module and game are using the same
    Z-machine character set.  (If either one has made a Zcharacter directive
    which differs from the other, then the sets are inconsistent and
    linkage can't take place.  This will only ever happen if you're
    working in a language other than English, and trying to use the linker,
    and it can be cured by copying the Zcharacter directives out of the
    language definition file into a little include file, and including
    that in the source for your main file as well as the source for any
    modules of your own -- i.e., any modules other than the library ones.)
Finally, the linker checks for the error condition that the module asked
    to import a variable which turns out not to have been defined in the
    main game.  (This may cause problems in linking with library 6/6, as
    Inform is now able to detect an error in 6/6's "linklv.h" which it
    previously wasn't able to detect: delete the line reading
    "Import global gender_and_number;" from "linklv.h" if so.)
"continue" not working (branching to the non-existent label -1, in
    fact) if used inside a "switch" statement.  (It ought to continue
    the innermost loop containing the "switch" statement, if there is
    one, or give an error if there isn't.)
The two string escapes @@94 (producing a ^ character) and @@126
    (producing ~) in fact producing new-line and double-quote.
Conditional branches to rtrue or rfalse in assembly-language being
    reversed in sense.
Source code being garbled after any Include statement which ends at
    a byte position which is a multiple of 4096, minus 3.  (My thanks
    to Andrew Plotkin for both finding and diagnosing this.)
"For" loops in the form "for ( : : )" (with white space between the two
    colons), which occur in a routine after a previous one which happens
    to have a designer-defined label in a particular position, causing
    an incorrect if harmless "Label declared but not used" warning.
Very short programs testing "x provides <property>", but never reading,
    writing or otherwise using <property>, causing stack overflows or
    restarts.  Nobody would ever need such a program anyway.
Function calls not working, or having garbled arguments, in Version 3
    or Version 4 games (the Inform library requires Version 5 or better
    nowadays, so the ability to compile V3 or V4 files is only needed
    for interpreter testing).
The error message disallowing use of an assembly opcode if the Z-machine
    version doesn't support it, now quotes the opcode name.
Suppressed "declared but not used" warnings for global objects not being
    counted towards the total number reported as suppressed when Inform
    prints its final message.

Source code cleaning:
Three character conversions amended to unsigned character conversions.
Backpatch errors, which I devoutly hope not to see again, are now
    reported as a new category of error: "compiler errors" which produce
    an apologetic banner, asking the user to report the fault to me.
Link errors are reported more prettily (differently, anyway).


   (x)  Changes made between v6.14 and v6.15
        ------------------------------------
Features added:
Parametrised object creation.  That is, if you define a class in which
    objects can be created, and also give it a "create" property like so:

         Class Monster
         with ...
              create
              [ start_at new_species;
                  move self to start_at;
                  self.species = new_species;
              ], ...

    then you can now make creation calls like so:
    
         M = Monster.create(Tomb_of_Kings, Ogre);

    The same applies to "recreate":
    
         Monster.recreate(M, Wall_of_Thorns, Hobbit);

    An attempt to supply more than 3 parameters to either kind of creation
    will result in a run-time error message.
It is now legal to declare a Class with zero create-able objects:
    
         Class Moribund(0) ...

    This is useful (i) to permit a definition like
    
         Class Gadgets(NUM_GADGETS)
         
    in some library file, where NUM_GADGETS is supplied by the library
    file's user and might be zero; and/or (ii) to permit "recreate"
    to be used on objects of the given class, reinitialising them
    to the values set in the class definition.
The ASCII form feed character (12 decimal) is now legal as white space.
    More precisely, it is in all contexts treated as a line feed.

Bugs fixed:
Getting property lengths wrong for common property values referred to
    using the superclass operator :: (a bug which sometimes manifests
    itself in a crash when a common property routine belonging to a
    superclass is called).
Crashing when halting on a memory error because an extremely long string
    of text has caused MAX_STATIC_STRINGS to be exceeded.  (The compiler
    now ensures that MAX_STATIC_STRINGS is at least twice the size of
    MAX_QTEXT_SIZE, which should be more than plenty.)
Crashing when printing an extremely long "Expected ... but found ..."
    error (where the second ... stands for an awfully long string of
    text).
An inability to assemble the optional third operand to the "@set_colour"
    opcode (which is available to Version 6 games only).
Array sizes are now formally checked to lie in the range 1 to 32767
    entries, with an error message produced if they don't.
Misinterpretation of array definitions with calculated sizes:

         Array Muggins -> MAX_SERVERS * 50;

    In Inform 6.14, this would be wrongly parsed as an array with one
    entry, initialised to the value given: whereas 6.15 correctly parses
    it as an array with MAX_SERVERS*50 initially zero entries.
When constants are calculated with at compile time (as in the previous
    bug), Inform 6.15 now detects overflows of signed addition,
    subtraction and multiplication.  So if, for instance, MAX_SERVERS
    is 100000, then the following error is produced:

         Error:  Signed arithmetic on compile-time constants overflowed
         the range -32768 to +32767: "100000 * 50 = 500000"

Two bugs in the veneer (both found and fixed by Chris Hall): (i) a program
    using ".create()" but not "provides" will go into a recursive loop
    (this never happens if the Inform library is included);
    (ii) more importantly, some creations of objects within classes
    where deletions had previously occurred were resulting in two
    different created objects sharing the same properties.
Improper use of 'or' not always producing a good error message (sometimes
    producing instead the compiler error "*** emitter overflow ***").
    The more explanatory error message
    
         Error:  'or' not between values to the right of a condition

    should now be produced in all such cases.
Expressions featuring ) followed immediately by ( sometimes crashing
    the expression evaluator.  For instance:

         (ChooseRoutine())(1);
         BigRoom.(BigDoor.dir_to)();

    The grammar resolving whether a function call is intended, or
    only a bracketed expression, has been refined so that, e.g.,
    
         <SetTo dial (4*the_time)>;

    is not misinterpreted as containing one argument, the result of
    the function call "dial(4*the_time)".  (If you want this, you must
    put brackets around it.)
Named action constants (like ##Take) not being correctly numbered after
    the 256th in order of first usage.
The "Default" directive for setting constants not properly handling
    some values which are other than numerical -- e.g.,
    
         Default Story "Untitled Story";

    would not have worked correctly under Inform 6.14.
Backpatching "compiler errors" are now suppressed if errors have already
    been reported (because in some circumstances, error recovery is
    good enough to allow compilation to continue but not good enough
    to leave a self-consistent backpatch table in the affected area).
    Error recovery from missed close-quotation-marks in object
    definitions has been marginally improved in some cases.

Source code cleaning:
Use of the Inform 5 statements 'print_paddr', 'print_addr' and
    'print_char' (all illegal in Inform 6) now results in an obsolete
    usage message which prints up the necessary Inform 6 equivalent.
The statistics output now includes the story file's serial codes, in the
    traditional <release number>.<serial code> format.

Note that this manual now contains an algorithm for syntax-colouring
    Inform source code.


   (xi)  Changes made between v6.15 and v6.20
         ------------------------------------
Features added:
Strict (-S) mode added, under which Inform undertakes to compile a story
    file which cannot crash the Z-machine (unless the user has compiled
    some assembly-language) but will instead print helpful error messages
    when illegal operations are attempted.  -S mode is wasteful of story
    file size (it adds perhaps 10 to 15%) and of execution speed (but on
    most modern computers, Z-machine interpreters run extremely quickly
    anyway) but is intended to be helpful for debugging.  See section
    7.8 of this manual for a list of exactly what strict-mode does.
    In support of strict mode, several new # system constants have been
    added, but these should not be used by designers and are not public
    features.
Inform now checks that readable-memory usage (i.e. the top of the dictionary
    and the bottom of the code area) does not exceed $10000, as this is a
    Z-machine requirement.  The "-s" statistics listing has an added line
    indicating how much of readable memory a game file has needed.
    (To test this, try compiling a game with "Array gross --> 32767;"
    in it.)

Bugs fixed:
"class.copy(a,b)" not working properly when "a" overrides an individual
    property defined by some class.  (Because of a bug not in the code for
    "copy" but in "objects.c": individual property tables were accumulating
    redundant extra values, which never showed up except when "copy" was
    used.  Thanks to Erik Hetzner and Gevan for finding and fixing this.)
"class.copy(a,b)" also transferring the class memberships of b to a (or
    at least whenever a and b both belonged to at least one class).
When a message is sent to a common property of an object which that object
    doesn't provide, it wasn't being properly routed to the routine given
    in the default value of that common property (which we'd normally expect
    to be 0 or NULL, either way causing nothing to happen).  Instead, a
    call would be made to 0-->0, with unpredictable consequences.  (This
    was originally reported as a bug in "Balances".)
The maximum number of entries in a list given as a common property value
    wasn't being correctly restricted to 32 in two different situations:
    when directly giving values, where the limit was wrongly set to 64;
    and when the list grows through inheritance of an additive property,
    where no limit was checked.
The "Switches" directive can't set "-k" or "-r", because it's too late by
    then to reopen the question of opening debugging/transcript files,
    but 6.15 didn't know this.  Torbj|rn thus contributed the shortest
    legal source file yet known to crash Inform, at just 11 characters:
    "Switches r;".
The rennab (the un-banner) printed at the end of compilation claiming
    "(no output)" if there were no errors during the main compilation pass,
    but there were errors late in story file construction instead.
I'm not sure this is a bug as such, but assembly listing (caused by -a, -t
    or use of "#Trace assembly") no longer covers the veneer.  (Because
    it's a nuisance and clutters up the listing.)
Not a bug either.  The return opcode automatically generated by "]", where
    needed (i.e. when this is reachable) is now a sequence point.  This was
    requested by Mark Musante, to assist debuggers.


   (xii)  Changes made between v6.20 and v6.20a
          -------------------------------------
(These are bug fixes made to the early beta-version v6.20.)
The "length" calculation for properties (in "properties_segment()" of
    "objects.c") calculates the length differently for individual and
    common properties.
Some small test programs not properly working because the wrong combinations
    of veneer routines were being compiled.  (I've rewritten the appropriate
    routines in "veneer.c" which decide which routines need the presence of
    which other routines.)  In particular, ".#" applied to individual
    properties might not work in a very small file.
Array bounds checking not working on arrays of >= 255 entries.


   (xiii)  Changes made between v6.20a and v6.21
           -------------------------------------
A new "-X" switch provides support for the Infix debugging library file.
    (Which it does by compiling additional veneer tables of symbol
    names.)
The maximum number of source files read from in compilation has been
    raised from 64 to 256, at the request of a user.  (No, really.  She
    knows who she is.)  The filename storage has been reduced in size,
    which will reduce the physical size of the Inform program by 8K
    on most systems.
The "-g" switch for tracing function calls now has a third setting,
    "-g3", which traces even veneer calls: the "-g2" setting now traces
    game and library calls but not veneer calls.
New syntax: "Zcharacter terminating" specifies the terminating character
    table for games in Z-machine versions >= 5.  See section 12.3 for
    details.

Bugs fixed:
Loops in the form "objectloop (X in Y) ..." causing stack underflow crashes
    if compiled with -S checking off.
-S checking now guards against attempts to use properties of "nothing",
    for instance rejecting something like "nothing.name = 'nobody';",
    and error messages concerning nonexistent objects no longer print
    garbled names for them.
Linking sometimes producing backpatching errors if arrays are already
    created before the link occurs.
The -S (strict checking) and -X (Infix) switches are now disabled for files
    being compiled as modules.  (They wouldn't work, and would anyway make
    the library modules too large.)
-S checking no longer tries to compile "pop" opcodes in Versions 5, 6, 8
    to clear values off the stack in some cases of array-boundary checking.
    (These Z-machine versions lack a "pop" opcode: instead, the same effect
    is inelegantly produced by "@jz sp" with a null branch, taking up
    three bytes rather than one.  C'est la Z.)
A potential crash if the limit on the number of adjectives is exceeded in
    a grammar version 1 game (which probably means a very big one using
    library 6/2 or earlier) has been removed.
The readable-memory ceiling was incorrect for version 6 games.
The error message resulting from asking Inform to read an unopenable ICL
    file is now coherent.
Assembly tracing at level 3 (which you don't want to read) now correctly
    prints the size of a routine after branch optimization.
The checksum is now included in the header block attached to a debugging
    information file.  (Previously this slot contained 0000.)
A minor bug in the abbreviations optimizer, so obscure that I'm not even
    sure what its consequence was, has been removed.

Source code cleaning:
The MAC_68K machine definition has been removed from the header: there
    are no longer any specific lines of code for this system.


   (xiv)  Acknowledgements
          ----------------

Many thanks to the following informers, who have reported bugs in Inform 6:

    Torbj|rn Andersson, Toni Arnold, Jose Luis Diaz de Arriba, Michael Baum,
    Paul E. Bell, Michael van Biesbrouck, Gerald Bostock, Neil Brown,
    Russ Bryan, Jesse Burneko, Evan Day, Gevan Dutton, Stephen van Egmond,
    Tom Emerson, Theresa van Ettinger, Greg Falcon, Roy Fellows, Rob Fisher,
    David Fletcher, C. E. Forman, Richard Gault, Paul Gilbert, David Glasser,
    Michael Graham, Bjorn Gustavsson, Chris Hall, Kory Heath, Erik Hetzner,
    Andreas Hoppler, Paul Horth, Sam Hulick, Francis Irving, Brad Jones,
    Christoffer Karlsson, Niclas Karlsson, John Kean, John Kennedy,
    Matt Kimball, Kirk Klobe, Michael A. Krehan, Mary Kuhner, Josh Larios,
    J�rgen Lerch, Harvey Lodder, Jennifer Maher, Bonni Mierzejewska,
    Paul Mikell, Tim Muddleton, Olav Mueller, Jeff Nassiff, Ross Nicol,
    Magnus Olsson, Marnie Parker, Robert Pelak, Jason Penney,
    Aphoristic Petrofsky, Andrew Plotkin, Richard H. Poser, Fredrik Ramsberg,
    Gareth Rees, Evin Robertson, Matthew Russotto, Gunther Schmidl,
    Miron Schmidt, Rene Schnoor, Nyuchezuu Shampoo, Lucian P. Smith,
    Anson Turner, Lorelle VanFossen, David L. Wagner, John Wood,
    Uncle Bob Newell and all.

------------------------------------------------------------------------------