DM4 §36: Parsing non-English languages

§36 Parsing non-English languages

It, hell. She had Those.
— Dorothy Parker (1893–1967), reviewing the romantic novel ‘It’

Before embarking on a new language definition file, the translator may want to consider what compromises are worth making, omitting tricky but not really necessary features of the language. For instance, in German, adjectives take forms agreeing with whether their noun takes a definite or indefinite article:

ein großer Mann a tall man (German)
der große Mann the tall man

This is an essential. But German also has a “neutral” form for adjectives, used in sentences like

Der Mann ist groß The man is tall

Now it could be argued that if the parser asks the German equivalent of

Whom do you mean, the tall man or the short man?

then the player ought to be able to reply “groß”. But this is probably not worth the effort.

As another example from German, is it essential for the parser to recognise commands put in the polite form when addressed to somebody other than the player? For instance,

freddy, öffne den ofen Freddy, open the oven
herr krüger, öffnen sie den ofen Mr Krueger, open the oven

indicate that Freddy is a friend but Mr Krueger a mere acquaintance. A translator might go to the trouble of implementing this, but equally might not bother, and simply tell players always to use the familiar form. It's harder to avoid the issue of whether the computer is familiar to the player. Does the player address the computer or the main character of the game? In English it makes no difference, but there are languages where an imperative verb agrees with the gender of the person being addressed. Is the computer male? Is it still male if the game's main character is female?

Another choice is whether to require the player to use letters other than ‘a’ to ‘z’ from the keyboard. French readers are used to seeing words written in capital letters without accents, so that there is no need to make the player type accents. In Finnish, though, ‘ä’ and ‘ö’ are significantly different from ‘a’ and ‘o’: “vaara” means “danger”, but “väärä” means “wrong”.

Finally, there are also dialect forms. The number 80 is “quatre-vingt” in metropolitan French, “octante” in Belgian and “huitante” in Swiss French. In such cases, the translator may want to write the language definition file to cope with all possible dialects. For example, something like

#ifdef DIALECT_FRANCOPHONE; print "septante";
#ifnot; print "soixante-dix";
#endif;

would enable the same definition file to be used by Belgian authors and members of the Académie française alike. The standard "English.h" definition already has such a constant: DIALECT_US, which uses American spellings, so that if an Inform game defines

Constant DIALECT_US;

before including Parser, then (for example) the number 106 would be printed in words as “one hundred six” instead of “one hundred and six”.

▲ An alternative is to allow the player to change dialect during play, and to encode all spelling variations inside variable strings. Ralf Herrmann's "German.h" does this to allow the player to choose traditional, reformed or Swiss German conventions on the use of “ß”. The low string variables @30 and @31 each hold either “ss” or “ß” for use in words like "schlie@30t" and "mu@31t".

· · · · ·

Organisation of language definition files

A language definition file is itself written in Inform, and fairly readable Inform at that: you may want to have a copy of "English.h" to refer to while reading the rest of this section. This is divided into four parts:

Preliminaries
Vocabulary
Translating to Informese
Printing

It is helpful for all language definitions to follow the order and layout style of "English.h". The example used throughout the rest of the section is of developing a definition of "French.h".

(I.1) Version number and alphabet

The file should begin as follows:

! ===========================================================
!   Inform Library Definition File: French
!
!   (c) Angela M. Horns 1996
! -----------------------------------------------------------
System_file;
! -----------------------------------------------------------
!   Part I.   Preliminaries
! -----------------------------------------------------------
Constant LanguageVersion
    = "Traduction fran@ccais 961205 par Angela M. Horns";

("English.h" defines a constant called EnglishNaturalLanguage here, but this is just to help the library keep old code working with the new parser: don't define a similar constant yourself.) Note the c-cedilla written using escape characters, @cc not ç, which is a precaution to make absolutely certain that the file will be legible on anyone's system, even one whose character set doesn't have accented characters in the normal way.

The next ingredient of Part I is declaring the accented letters which need to be “cheap” in the following sense. Inside story files, dictionary words are stored to a “resolution” of nine Z-characters: that is, only the first nine Z-characters are looked at, so that

“chrysanthemum” is stored as 'chrysanth'
“chrysanthemums” is stored as 'chrysanth'

(This is one of the reasons why Informese doesn't make linguistic use of word-endings.) Normally no problem, but unfortunately Z-characters are not the same as letters. The letters ‘A’ to ‘Z’ are “cheap” and take only one Z-character each, but accented letters like ‘é’ normally take out four Z-characters. If your translation is going to ask the player to type accented letters at the keyboard (which even a French translation need not do: see above), the resolution may be unacceptably low:

“télécarte” is stored as 't@'el'
“téléphone” is stored as 't@'el'

as there are not even enough of the nine Z-characters left to encode the second ‘é’, let alone the ‘c’ or the ‘p’ which would distinguish the two words. Inform therefore provides a mechanism to make up to about 10 common accents cheaper to use, in that they then take only two Z-characters each, not four. In the case of French, we might write:

Zcharacter '@'e';   ! E-acute
Zcharacter '@`e';   ! E-grave
Zcharacter '@`a';   ! A-grave
Zcharacter '@`u';   ! U-grave
Zcharacter '@^a';   ! A-circumflex
Zcharacter '@^e';   ! E-circumflex

(Note that since the Z-machine automatically reduces anything the player types into lower case, we need only include lower-case accented letters here. Note also that there are plenty of other French accented letters (ï, û and so forth) but these are uncommon enough not to matter here.) With this change made,

“télécarte” is stored as 't@'el@'ecar'
“téléphone” is stored as 't@'el@'epho'

enabling a phone card and a phone to be correctly distinguished by the parser.

▲▲ In any story file, 78 of the characters in the ZSCII set are designated as “cheap” by being placed into what's called the “alphabet table”. One of these is mandatorily new-line, another is mandatorily double-quote and a third cannot be used, leaving 75. Zcharacter moves a ZSCII character into the alphabet table, throwing out a character which hasn't yet been used to make way. Alternatively, and provided no characters have so far been used at all, you can write a Zcharacter directive which sets the entire alphabet table. The form required is to give three strings containing 26, 26 and 23 ZSCII characters respectively. For instance:

Zcharacter "abcdefghijklmnopqrstuvwxyz"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "0123456789!$&*():;.,<>@{386}";

Characters in alphabet 1, the top row, take only one Z-character to print; characters in alphabets 2 and 3 take two Z-characters to print; characters not in the table take four. Note that this assumes that Unicode $0386 (Greek capital Alpha with tonos accent, as it happens) is present in ZSCII. Ordinarily it would not be, but the block of ZSCII character codes between 155 and 251 is configurable and can in principle contain any Unicode characters of your choice. By default, if Inform reads ISO 8859-n (switch setting -Cn) then this block is set up to contain all the non-ASCII letter characters in ISO 8859-n. In the most common case, -C1 for ISO Latin-1, the ligatures ‘œ’ and ‘Œ’ are then added, but this still leaves 28 character codes vacant.

Zcharacter table + '@{9a}';

adds Unicode character $009a, a copyright symbol, to ZSCII. Alternatively, you can instruct Inform to throw away all non-standard ZSCII characters and replace them with a fresh stock. The effect of:

Zcharacter table '@{9a}' '@{386}' '@^a';

is that ZSCII 155 will be a copyright symbol, 156 will be a Greek capital alpha with tonos, 157 will be an a-circumflex and 158 to 251 will be undefined; and all other accented letters will be unavailable. Such Zcharacter directives must be made before the characters in question are first used in game text. You don't need to know the ZSCII values, anyway: you can always write @{9a} when you want a copyright symbol.

(I.2) Compass objects

All that is left in Part I is to declare standard compass directions. The corresponding part of "English.h", given below, should be imitated as closely as possible:

Class CompassDirection
 with article "the", number
  has scenery;
Object Compass "compass" has concealed;
Ifndef WITHOUT_DIRECTIONS;
CompassDirection -> n_obj "north wall"
                    with name 'n//' 'north' 'wall',    door_dir n_to;
CompassDirection -> s_obj "south wall"
                    with name 's//' 'south' 'wall',    door_dir s_to;
CompassDirection -> e_obj "east wall"
                    with name 'e//' 'east' 'wall',     door_dir e_to;
CompassDirection -> w_obj "west wall"
                    with name 'w//' 'west' 'wall',     door_dir w_to;
CompassDirection -> ne_obj "northeast wall"
                    with name 'ne' 'northeast' 'wall', door_dir ne_to;
CompassDirection -> nw_obj "northwest wall"
                    with name 'nw' 'northwest' 'wall', door_dir nw_to;
CompassDirection -> se_obj "southeast wall"
                    with name 'se' 'southeast' 'wall', door_dir se_to;
CompassDirection -> sw_obj "southwest wall"
                    with name 'sw' 'southwest' 'wall', door_dir sw_to;
CompassDirection -> u_obj "ceiling"
                    with name 'u//' 'up' 'ceiling',    door_dir u_to;
CompassDirection -> d_obj "floor"
                    with name 'd//' 'down' 'floor',    door_dir d_to;
Endif;
CompassDirection -> out_obj "outside"
                    with                               door_dir out_to;
CompassDirection -> in_obj "inside"
                    with                               door_dir in_to;

For example, "French.h" would contain:

Class  CompassDirection
  with article "le", number
  has  scenery;
Object Compass "compas" has concealed;
...
CompassDirection -> n_obj "mur nord"
                    with name 'n//' 'nord' 'mur',      door_dir n_to;

(II.1) Informese vocabulary: the small categories

This is where small grammatical categories like ‹again-word› are defined. The following constants must be defined:

`AGAIN*__WD`	words of type ‹again-word›
`UNDO*__WD`	words of type ‹undo-word›
`OOPS*__WD`	words of type ‹oops-word›
`THEN*__WD`	words of type ‹then-word›
`AND*__WD`	words of type ‹and-word›
`BUT*__WD`	words of type ‹but-word›
`ALL*__WD`	words of type ‹all-word›
`OTHER*__WD`	words of type ‹other-word›
`ME*__WD`	words of type ‹me-word›
`OF*__WD`	words of type ‹of-word›
`YES*__WD`	words of type ‹yes-word›
`NO*__WD`	words of type ‹no-word›

In each case * runs from 1 to 3, except for ALL*__WD where it runs 1 to 5 and OF*__WD where it runs 1 to 4. ‹of-words› have not been mentioned before: these are used in the sense of “three of the boxes”, when parsing a reference to a given number of things. They are redundant in English because the player could have typed simply “three boxes”, but Inform provides them anyway.

In French, we might begin with:

Constant AGAIN1__WD   = 'encore';
Constant AGAIN2__WD   = 'c//';
Constant AGAIN3__WD   = 'encore';

Here we can't actually think of a third synonymous word for “again”, but we must define AGAIN3__WD all the same, and must not allow it to be zero. And so on, through to:

Constant YES1__WD     = 'o//';
Constant YES2__WD     = 'oui';
Constant YES3__WD     = 'oui';

‹yes-words› and ‹no-words› are used to parse the answers to yes-or-no questions (oui-ou-non questions in French, of course). It causes no difficulty that the word “o” is also an abbreviation for “ouest” because they are used in different contexts. On the other hand, ‹oops-words›, ‹again-words› and ‹undo-words› should be different from any verb or compass direction name.

After the above, a few further words have to be defined as possible replies to the question asked when a game ends. Here the French example might be:

Constant AMUSING__WD    = 'amusant';
Constant FULLSCORE1__WD = 'grandscore';
Constant FULLSCORE2__WD = 'grand';
Constant QUIT1__WD      = 'a//';
Constant QUIT2__WD      = 'arret';
Constant RESTART__WD    = 'reprand';
Constant RESTORE__WD    = 'restitue';

(II.2) Informese vocabulary: pronouns

Part II continues with a table of pronouns, best explained by example. The following table defines the standard English accusative pronouns:

Array LanguagePronouns table
  !  word       possible GNAs:                     connected to:
  !             a     i
  !             s  p  s  p
  !             mfnmfnmfnmfn
     'it'     $$001000111000                       NULL
     'him'    $$100000100000                       NULL
     'her'    $$010000010000                       NULL
     'them'   $$000111000111                       NULL;

The “connected to” column should always be created with NULL entries. The pattern of 1 and 0 in the middle column indicates which types of ‹noun phrase› might be referred to with the given ‹pronoun›. This is really a concise way of expressing a set of possible GNA values, saying for instance that “them” can match against noun phrases with any GNA in the set {3, 4, 5, 9, 10, 11}.

The accusative and dative pronouns in English are identical: for instance “her” in “give her the flowers” is dative and in “kiss her” is accusative. French is richer in pronoun forms:

donne-le-lui give it to him/her
mange avec lui eat with him

Here “-lui” and “lui” are grammatically quite different, with one implying masculinity where the other doesn't. The table needed is:

Array LanguagePronouns table
  !  word       possible GNAs:                     connected to:
  !             a     i
  !             s  p  s  p
  !             mfnmfnmfnmfn
     '-le'    $$100000100000                       NULL
     '-la'    $$010000010000                       NULL
     '-les'   $$000110000110                       NULL
     '-lui'   $$110000110000                       NULL
     '-leur'  $$000110000110                       NULL
     'lui'    $$100000100000                       NULL
     'elle'   $$010000010000                       NULL
     'eux'    $$000100000100                       NULL
     'elles'  $$000010000010                       NULL;

This table assumes that “-le” can be treated as a free-standing word in its own right, not as part of the word “donne-le-lui”, and section (III.1) below will describe how to bring this about. Note that “-les” and “-leur” are treated as synonymous: Informese doesn't (ordinarily) care that dative and accusative are different.

Using the “pronouns” verb in any game will print out current values, which may be useful when debugging the above table. Here is the same game position, inside the building at the end of the road, in parallel English, German and Spanish text:

English: ‘Advent’
At the moment, “it” means the small bottle, “him” is unset, “her” is unset and “them” is unset.

German: ‘Abenteuer’
Im Augenblick, “er” heisst der Schlüsselbund, “sie” heisst die Flasche, “es” heisst das Essen, “ihn” heisst der Schlüsselbund, “ihm” heisst das Essen und “ihnen” ist nicht gesetzt.

Spanish: ‘Aventura’
En este momento, “-lo” significa el grupo de llaves, “-los” no está definido, “-la” significa la pequeña botella, “-las” significa las par de tuberí as de unos 15 cm de diámetro, “-le” significa la pequeña botella, “-les” significa las par de tuberí as de unos 15 cm de diámetro, “él” significa el grupo de llaves, “ella” significa la pequeña botella, “ellos” no está definido y “ellas” significa las par de tuberí as de unos 15 cm de diámetro.

(II.3) Informese vocabulary: descriptors

Part II continues with a table of descriptors, in a similar format.

Array LanguageDescriptors table
  !  word       possible GNAs   descriptor      connected
  !             to follow:      type:           to:
  !             a     i
  !             s  p  s  p
  !             mfnmfnmfnmfn
     'my'     $$111111111111    POSSESS_PK      0
     'this'   $$111000111000    POSSESS_PK      0
     'these'  $$000111000111    POSSESS_PK      0
     'that'   $$111111111111    POSSESS_PK      1
     'those'  $$000111000111    POSSESS_PK      1
     'his'    $$111111111111    POSSESS_PK      'him'
     'her'    $$111111111111    POSSESS_PK      'her'
     'their'  $$111111111111    POSSESS_PK      'them'
     'its'    $$111111111111    POSSESS_PK      'it'
     'the'    $$111111111111    DEFART_PK       NULL
     'a//'    $$111000111000    INDEFART_PK     NULL
     'an'     $$111000111000    INDEFART_PK     NULL
     'some'   $$000111000111    INDEFART_PK     NULL;

This gives three of the four types of ‹descriptor›. The constant POSSESS_PK signifies a ‹possessive adjective›, connected either to 0, meaning the player-object, or to 1, meaning anything other than the player-object (used for “that” and similar words) or to the object referred to by the given ‹pronoun›, which must be one of those in the pronoun table. DEFART_PK signifies a definite ‹article› and INDEFART_PK an indefinite ‹article›: these should both give the connected-to value of NULL in all cases.

The fourth kind allows extra descriptors to be added which force the objects that follow to have, or not have, a given attribute. For example, the following three lines would implement “lit”, “lighted” and “unlit” as adjectives automatically understood by the English parser:

     'lit'     $$111111111111    light           NULL
     'lighted' $$111111111111    light           NULL
     'unlit'   $$111111111111    (-light)        NULL

An attribute name means “must have this attribute”, while the negation of it means “must not have this attribute”.

To continue the example, "French.h" needs the following descriptors table:

Array LanguageDescriptors table
  !  word       possible GNAs   descriptor      connected
  !             to follow:      type:           to:
  !             a     i
  !             s  p  s  p
  !             mfnmfnmfnmfn
     'le'     $$100000100000    DEFART_PK       NULL
     'la'     $$010000010000    DEFART_PK       NULL
     'l^'     $$110000110000    DEFART_PK       NULL
     'les'    $$000110000110    DEFART_PK       NULL
     'un'     $$100000100000    INDEFART_PK     NULL
     'une'    $$010000010000    INDEFART_PK     NULL
     'des'    $$000110000110    INDEFART_PK     NULL
     'mon'    $$100000100000    POSSESS_PK      0
     'ma'     $$010000010000    POSSESS_PK      0
     'mes'    $$000110000110    POSSESS_PK      0
     'son'    $$100000100000    POSSESS_PK      '-lui'
     'sa'     $$010000010000    POSSESS_PK      '-lui'
     'ses'    $$000110000110    POSSESS_PK      '-lui'
     'leur'   $$110000110000    POSSESS_PK      '-les'
     'leurs'  $$000110000110    POSSESS_PK      '-les';

(recall that in dictionary words, the apostrophe is written ^). Thus, “son oiseau” means “his bird” or “her bird”, according to the current meaning of “-lui”, i.e., according to the gender of the most recent singular noun referred to.

The parser automatically tries both meanings if the same word occurs in both pronoun and descriptor tables. This happens in English, where “her” can mean either a feminine singular possessive adjective (“take her purse”) or a feminine singular object pronoun (“wake her up”).

(II.4) Informese vocabulary: numbers

An array should be given of words having type ‹number›. These should include enough to express the numbers 1 to 20, as in the example:

Array LanguageNumbers table
    'un' 1 'une' 1 'deux' 2 'trois' 3 'quatre' 4 'cinq' 5
    'six' 6 'sept' 7 'huit' 8 'neuf' 9 'dix' 10
    'onze' 11 'douze' 12 'treize' 13 'quatorze' 14 'quinze' 15
    'seize' 16 'dix-sept' 17 'dix-huit' 18 'dix-neuf' 19 'vingt' 20;

In some languages, such as Russian, there are numbers larger than 1 which inflect with gender: please recognise all possibilities here. If the same word appears in both numbers and descriptors tables, its meaning as a descriptor takes priority, which is useful in French as it means that the genders of “un” and “une” are recognised after all.

(III.1) Translating natural language to Informese

Part III holds the routine to convert what the player has typed into Informese. In "English.h" this does nothing at all:

[ LanguageToInformese; ];

This might just do for Dogg, the imaginary language in which Tom Stoppard's play Dogg's Hamlet is written, where the words are more or less English words rearranged. (It begins with someone tapping a microphone and saying “Breakfast, breakfast… sun, dock, trog…”, and “Bicycles!” is an expletive.) Other languages are structurally unlike English and the task of LanguageToInformese is to rearrange or rewrite commands to make them look more like English ones. Here are some typical procedures for LanguageToInformese to follow:

Strip out optional accents. For instance, "German.h" looks through the command replacing ü with ue and so forth, and replacing ß with ss. This saves having to recognise both spelling forms.
Break up words at hyphens and apostrophes, so that:
donne-lui l'oiseau give the bird to him → donne -lui l' oiseau
Remove inflections which don't carry useful information. For instance, most German imperatives can take two forms, one with an ‘e’ on the end: “lege” means “leg” (German: “put”) and “schaue” means “schau” (German: “look”). It would be helpful to remove this “e”, if only to avoid stuffing game dictionaries full of essentially duplicate entries. (Toni Arnold's "German.h" goes further and strips all inflections from nouns, while Ralf Herrmann's preserves inflections for parsing later on.)
Break affixes away from the words they're glued to. For instance:
della of the (Italian) → di la
cogela take it (Spanish) → coge la
so that the affix part “la” becomes a separate word and can be treated as a pronoun.
Replace parts of speech not existing in Informese, such as pronominal adverbs, with a suitable alternative. For instance:
dessus on top of it (French) → sur lui
dedans inside it → dans lui
Alter word order. For instance, break off an enclitic and move it between two nouns; or if the definite article is written as a suffix, cut it free and move it before the noun:
arma virumque arms and the man (Latin) → arma et virum
kakane the cakes (Norwegian) → ne kake

When the call to LanguageToInformese is made, the text that the player typed is held in a -> array called buffer, and some useful information about it is held in another array called parse. The contents of these arrays are described in detail in §2.5.

The translation process has to be done by shifting characters about and altering them in buffer. Of course the moment anything in buffer is changed, the information in parse becomes out of date. You can update parse with the assembly-language statement

@tokenise buffer parse;

(And the parser does just this when the LanguageToInformese routine has finished.) The most commonly made alterations come down to inserting characters, often spaces, in between words, and deleting other characters by overwriting them with spaces. Inserting characters means moving the rest of buffer along by one character, altering the length buffer->1, and making sure that no overflow occurs. The library therefore provides a utility routine

LTI_Insert(position, character)

to insert the given character at the given position in buffer.

• EXERCISE 110
Write a LanguageToInformese routine to insert spaces before each hyphen and after each apostrophe, so that:

donne-lui l'oiseau give the bird to him → donne -lui l' oiseau

• EXERCISE 111
Make further translations for French pronominal adverbs:

dessus on top of it → sur lui
dedans inside it → dans lui

•▲ EXERCISE 112
Write a routine called LTI_Shift(from,chars), which shifts the tail of the buffer array by chars positions to the right (so that it moves leftwards if chars is negative), where the “tail” is the part of the buffer starting at from and continuing to the end.

•▲ EXERCISE 113
Write a LanguageToInformese routine which sorts out German pronominal adverbs, which are made by adding “da” or “dar” to any preposition. Beware of breaking a name like “Darren” which might have a meaning within the game, though, so that:

davon → von es
darauf → auf es
darren –/→ ren es