Ubuntu Manpage: mmorph - MULTEXT morphology tool formalism syntax

NAME

       mmorph - MULTEXT morphology tool formalism syntax

DESCRIPTION

       A mmorph morphology description file is divided into declaration sections.  Each section starts with a
       section header (`@ Alphabets', `@ Attributes', etc.) followed by a sequence of declarations.  Each
       declaration starts with a name, followed by a colon (`:') and the definition associated with the name.
       Here is a brief description of each section:

@ Alphabets

       In this section the lexical and surface alphabets are declared.  All symbols forming each alphabet have to
       be listed.  A symbol may appear in both the lexical and the surface alphabet definitions, in which case it
       is considered a bi-level symbol; otherwise it is a lexical only or surface only symbol.  Symbols are
       usually letters (e.g. a, b, c), but may also have longer names (beta, schwa).  Symbol names consisting of
       one special character (`:' or `(') may be specified by enclosing them in double quotes (`":"' or `"("').
       Example:

              Lexical : a b c d e f g h i j k l m n o p q r s t u v w x y z "-" "." "," "?" "!" "\"" "'" ":" ";"
                     "(" ")" strong_e

              Surface : a b c d e f g h i j k l m n o p q r s t u v w x y z "-" "." "," "?" "!" "\"" "'" ":" ";"
                     "(" ")" " "

       In  this  example,  the symbol strong_e is lexical only, the symbol " " (space) is surface only.  All the
       other symbols are bi-level.
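
       The classification described above amounts to simple set operations on the two alphabets.  The
       following Python sketch is only an illustration, not part of mmorph; the names LEXICAL, SURFACE and
       classify are hypothetical.

```python
# Alphabets copied from the example above (illustrative only).
LETTERS = list("abcdefghijklmnopqrstuvwxyz")
SPECIALS = list("-.,?!\"':;()")
LEXICAL = LETTERS + SPECIALS + ["strong_e"]
SURFACE = LETTERS + SPECIALS + [" "]


def classify(lexical, surface):
    """Split symbols into bi-level, lexical only and surface only sets."""
    lexical, surface = set(lexical), set(surface)
    return {
        "bi-level": lexical & surface,
        "lexical only": lexical - surface,
        "surface only": surface - lexical,
    }


kinds = classify(LEXICAL, SURFACE)
```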

       All the strings appearing in the rest of the grammar will be made exclusively of symbols declared in this
       section.

@ Attributes

       In this section, the names of attributes (sometimes called features) and their associated value sets are
       declared.  At most 32 different values may be declared for an attribute.
       Examples:

              Gender : feminine masculine neuter
              Number : singular plural
              Person : 1st 2nd 3rd
              Transitive : yes no
              Inflection : base intermediate final

       In the current version of the implementation value sets of different attributes are incompatible, even if
       they  are  defined  identically.   To overcome this restriction, in a future version this section will be
       split into two:  declaration of value sets and declaration of attributes.

@ Types

       In this section, the different types of feature structures are declared.  The attributes allowed for each
       type are listed.  Attributes that are only used within the scope of the tool and have no meaning  outside
       can be listed after a bar (`|').  The values of these local attributes are not stored in the database or
       written on the final output of the program.
       Examples:

              Noun : Gender Number
              Verb : Tense Person Gender Number Transitive | Inflection
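
       The effect of the bar can be pictured as a projection step applied before output.  The Python sketch
       below is an assumption about how such a projection could look, not mmorph's implementation; TYPES and
       project are hypothetical names, and the value `past' is invented for the illustration.

```python
# Each type maps to (projected attributes, local attributes after the bar).
TYPES = {
    "Noun": (["Gender", "Number"], []),
    "Verb": (["Tense", "Person", "Gender", "Number", "Transitive"], ["Inflection"]),
}


def project(type_name, attributes):
    """Keep only the attributes that are visible outside the tool."""
    projected, _local = TYPES[type_name]
    return {name: value for name, value in attributes.items() if name in projected}


visible = project("Verb", {"Tense": "past", "Inflection": "final"})
```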

Typed feature structures

       Typed feature structures are used in the grammar and spelling rules.  A typed feature structure is the
       specification of a type and the values of some of its associated attributes.  The list of attribute
       specifications is enclosed in square brackets (`[' and `]').
       Example:

              Noun[ Gender=feminine Number=singular ]

       It is possible to specify a set of values for an attribute by listing the possible values separated by
       a bar (`|'), or the complement of a set (with respect to all possible values of that attribute) indicated
       with `!=' instead of `='.
       Example:  Assuming the declaration of Gender as above, the following two typed feature structures are
       equivalent:

              Noun[ Gender=masculine|neuter ]
              Noun[ Gender!=feminine ]
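
       One way to see why the two forms are equivalent is to encode a value set as a bit mask over the
       declared values.  This Python sketch assumes such an encoding (which would also fit the limit of 32
       values per attribute); whether mmorph actually stores value sets this way is not stated here, and
       ATTRIBUTES and value_mask are hypothetical names.

```python
ATTRIBUTES = {
    "Gender": ["feminine", "masculine", "neuter"],
    "Number": ["singular", "plural"],
}


def value_mask(attribute, values, negate=False):
    """Encode a value set as a bit mask; negate=True models `!='."""
    declared = ATTRIBUTES[attribute]
    if len(declared) > 32:
        raise ValueError("at most 32 values may be declared for an attribute")
    full = (1 << len(declared)) - 1
    mask = 0
    for value in values:
        mask |= 1 << declared.index(value)
    # The complement is taken with respect to all declared values.
    return full & ~mask if negate else mask


same = value_mask("Gender", ["masculine", "neuter"]) == value_mask(
    "Gender", ["feminine"], negate=True
)
```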

@ Grammar

       This section contains the rules that specify the structure of words.  It has the general shape of a
       context free grammar over typed feature structures.  There are three basic types of rules:  binary, goal
       and affix rules.

       Binary rules specify the result of the concatenation of two elements. This is written as:

              Rule_name : Lhs <- Rhs1 Rhs2

       where Lhs is called the left hand side, and Rhs1 and Rhs2 the first and second part  of  the  right  hand
       side.  Lhs, Rhs1 and Rhs2 are specified as typed feature structures.
       Example:

              Rule_1  : Noun[ Gender=feminine Number=singular ]
                      <- Noun[ Gender=feminine Number=singular ]
                         NounSuffix[ Gender=feminine ]

       Variables  can  be  used  to  indicate  that  some  attributes have the same value.  A variable is a name
       starting with a dollar (`$').
       Example:

              Rule_2  : Noun[ Gender=$A Number=$number ]
                      <- Noun[ Gender=$A Number=$number ]
                         NounSuffix[ Gender=$A ]

       If needed, both a variable and a value specification can  be  given  for  an  attribute  (only  once  per
       attribute):
       Example:

              Rule_3  : Noun[ Gender=$A Number=$number ]
                      <- Noun[ Gender=$A Number=$number ]
                         NounSuffix[ Gender=$A=masculine|neuter ]
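
       The role of variables can be illustrated by treating each attribute as a set of allowed values and each
       variable as the intersection of the sets it meets.  The Python sketch below applies Rule_3 under that
       assumption; it is a simplified model, not mmorph's algorithm, and every name in it (ATTRIBUTES,
       RULE_3, match_part, apply_binary) is hypothetical.

```python
ATTRIBUTES = {
    "Gender": {"feminine", "masculine", "neuter"},
    "Number": {"singular", "plural"},
}

# A rule part is (type, {attribute: (variable or None, allowed values or None)}).
RULE_3 = {
    "lhs": ("Noun", {"Gender": ("A", None), "Number": ("number", None)}),
    "rhs1": ("Noun", {"Gender": ("A", None), "Number": ("number", None)}),
    "rhs2": ("NounSuffix", {"Gender": ("A", {"masculine", "neuter"})}),
}


def match_part(part, tfs, bindings):
    """Narrow the variable bindings so that tfs fits part; False on failure."""
    type_name, specs = part
    if tfs[0] != type_name:
        return False
    for attr, (var, allowed) in specs.items():
        # An attribute missing from tfs is treated as unconstrained.
        values = set(tfs[1].get(attr, ATTRIBUTES[attr]))
        if allowed is not None:
            values &= allowed
        if var is not None:
            values &= bindings.get(var, ATTRIBUTES[attr])
            bindings[var] = values
        if not values:
            return False
    return True


def apply_binary(rule, first, second):
    """Return the left hand side built from two elements, or None."""
    bindings = {}
    if not (match_part(rule["rhs1"], first, bindings)
            and match_part(rule["rhs2"], second, bindings)):
        return None
    type_name, specs = rule["lhs"]
    result = {}
    for attr, (var, allowed) in specs.items():
        values = set(ATTRIBUTES[attr]) if allowed is None else set(allowed)
        if var is not None:
            values &= bindings.get(var, values)
        result[attr] = values
    return (type_name, result)


stem = ("Noun", {"Gender": {"masculine"}, "Number": {"singular"}})
suffix = ("NounSuffix", {"Gender": {"masculine", "feminine"}})
combined = apply_binary(RULE_3, stem, suffix)
```

       With a feminine stem the same call returns None, because $A cannot be feminine and also belong to
       masculine|neuter.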

       Affix rules define basic elements of the concatenations specified by binary rules (together with lexical
       entries, see the section @ Lexicon below).  An affix rule consists of a lexical string associated with a
       typed feature structure.
       Examples:

              Plural_s : "s" NounSuffix[ Number=plural ]
              Feminine_e : "e" NounSuffix[ Gender=feminine ]
              ing : "ing" VerbSuffix[ Tense=present_participle ]

       Goal  rules  specify  the valid results constructed by the grammar.  They consist of just a typed feature
       structure.
       Examples:

              Goal_1  : Noun[]
              Goal_2  : Verb[ inflection=final ]

       In addition to these three basic rule types, there are prefix or suffix composite rules and unary rules.
       A unary rule consists of a left hand side and a right hand side.
       Example:

              Rule_4  : Noun[ gender=$G number=plural ]
                      <- Noun[ gender=$G number=singular invariant=yes]

       Prefix  and  suffix composite rules have the same shape as binary rules except that one part of the right
       hand side is an affix (i.e. has an associated string).
       Examples:

              Append_e   : Noun[ Gender=feminine Number=$number ]
                      <- Noun[ Gender=feminine Number=$number ]
                         "e" NounSuffix[ Gender=feminine ]

              anti    : Noun[ Gender=$gender Number=$number ]
                      <- "anti" NounPrefix[]
                         Noun[ Gender=$gender Number=$number ]

@ Classes

       This optional section contains the definition of symbol classes. Each  class  is  defined  as  a  set  of
       symbols, or other classes. If the class contains only bi-level elements it is a bi-level class, otherwise
       it is a lexical or surface class.
       Examples:

              Dental : d t
              Vowel : a e i o u
              Vowel_y : Vowel y
              Consonant: b c d f g h j k l m n p q r s t v w x z

@ Pairs

       This optional section contains the definition of pair disjunctions.  Each disjunction is defined as a set
       of  pairs.   Explicit  pairs  specify a sequence of surface symbols and a sequence of zero or one lexical
       symbol, one of them possibly empty.  A sequence is enclosed between angle  brackets  `<'  and  `>'.   The
       empty sequence is indicated with `<>'.  In the current implementation only the surface part of a pair can
       be  a  sequence  of  more  than one element.  The special symbol `?' stands for the class of all possible
       symbols, including the morpheme and word boundary.
       Examples:

              s_x_z_1 : s/s x/x z/z
              VowelPair1: a/a e/e i/i o/o u/u
              VowelPair2: Vowel/Vowel
              ie.y: <i e>/y
              Delete_e: <>/e
              Insert_d: d/<>
              Surface_Vowel: Vowel/?
              Lexical_s:  ?/s

              DoubleConsonant: <b b>/b <d d>/d <f f>/f <g g>/g <k k>/k <m m>/m <p p>/p <s s>/s  <t t>/t  <v v>/v
                     <z z>/z

       Note  that  VowelPair1  and  VowelPair2  don't  specify  the  same  thing: VowelPair2 would match a/o but
       VowelPair1 would not.
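
       The difference can be checked with a small model in which a class name expands to its member symbols.
       This Python sketch covers only single-symbol pairs (no sequences, no `?') and is an illustration under
       those assumptions; CLASSES, expand and disjunction_matches are hypothetical names.

```python
CLASSES = {"Vowel": {"a", "e", "i", "o", "u"}}

# Pairs are written (surface, lexical), as in surface/lexical.
VOWEL_PAIR_1 = [(v, v) for v in "aeiou"]
VOWEL_PAIR_2 = [("Vowel", "Vowel")]


def expand(part):
    """A class name stands for its members, a symbol for itself."""
    return CLASSES.get(part, {part})


def disjunction_matches(pairs, surface, lexical):
    """True if any pair accepts the given surface/lexical symbols."""
    return any(
        surface in expand(s) and lexical in expand(l) for s, l in pairs
    )
```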

       Implicit pairs are specified by the name of a bi-level symbol or a bi-level class.
       Examples:  the following s_x_z_2 and VowelPair3 are  equivalent  to  the  above  s_x_z_1  and  VowelPair2
       (assuming that s, x, z and Vowel are bi-level symbols and classes).

              s_x_z_2 : s x z
              VowelPair3 : Vowel

       In  a  pair  disjunction all lexical parts should be disjoint. This means you cannot specify for the same
       pair disjunction a/a and o/a or a/a and Vowel/Vowel.
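
       A check for this restriction could look like the following Python sketch, which collects the lexical
       symbols that occur in more than one pair.  It is a simplified illustration that ignores sequences and
       `?'; CLASSES, expand and lexical_conflicts are hypothetical names.

```python
CLASSES = {"Vowel": {"a", "e", "i", "o", "u"}}


def expand(part):
    """A class name stands for its members, a symbol for itself."""
    return CLASSES.get(part, {part})


def lexical_conflicts(pairs):
    """Return lexical symbols shared by two or more (surface, lexical) pairs."""
    seen, conflicts = set(), set()
    for _surface, lexical in pairs:
        symbols = expand(lexical)
        conflicts |= seen & symbols
        seen |= symbols
    return conflicts
```

       For instance, lexical_conflicts([("a", "a"), ("o", "a")]) reports the symbol a, while the pairs of
       s_x_z_1 produce no conflict.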

       In a future version this section will be split in two:  simple pair disjunctions and pair sequences.

@ Spelling

       The two-level spelling rules are declared in this section.  A spelling rule consists of a kind indicator
       followed by a left context, a focus and a right context.  The kind indicator is `=>' if the rule is
       optional, `<=>' if it is obligatory and `<=' if it is a surface coercion rule.  The contexts may be
       empty.  The focus is surrounded by two `-'.  The contexts and the focus consist of a sequence of pairs or
       pair disjunctions declared in the `@ Pairs' section.  A morpheme boundary is indicated by a `+' or a `*';
       a word boundary is indicated by a `~'.
       Examples:

              Sibilant_s: <=> s_x_z_1 * - e/<> - s
              Gemination: <=>
                      Consonant Vowel - DoubleConsonant - * Vowel
              i_y_optionnel: => a - i/y - * ?/e

       Constraints may be specified in the form of a list of typed feature structures.  They  are  affix-driven:
       the  rule  is  licensed  if  at least one of them subsumes the closest corresponding affix.  The morpheme
       boundary indicated by a star (`*') will be used to determine which affix it is.   If  there  is  no  such
       indication,  then  the  affix  adjacent  to the morpheme where the first character of the focus occurs is
       used.  In case there is no affix, the typed feature structure of the lexical stem is used.
       Example:

              Sibilant_s: <=>
                  s_x_z_1 * - e/<> - s NounSuffix[ Number=plural ]
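
       The licensing test can be pictured as a subsumption check between value sets.  The Python sketch below
       assumes that an attribute left unspecified in the affix stands for all its declared values, and that a
       rule without constraints is always licensed; both points are assumptions, as are the names ATTRIBUTES,
       subsumes and licensed.

```python
ATTRIBUTES = {
    "Gender": {"feminine", "masculine", "neuter"},
    "Number": {"singular", "plural"},
}


def subsumes(general, specific):
    """True if every value allowed by specific is also allowed by general."""
    g_type, g_attrs = general
    s_type, s_attrs = specific
    if g_type != s_type:
        return False
    return all(
        s_attrs.get(attr, ATTRIBUTES[attr]) <= allowed
        for attr, allowed in g_attrs.items()
    )


def licensed(constraints, affix):
    """A rule applies if it has no constraint or one of them subsumes the affix."""
    return not constraints or any(subsumes(c, affix) for c in constraints)


constraint = ("NounSuffix", {"Number": {"plural"}})
plural_s = ("NounSuffix", {"Number": {"plural"}})
feminine_e = ("NounSuffix", {"Gender": {"feminine"}})
```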

@ Lexicon

       This section is optional and can also be repeated.  It lists all the lexical entries of the
       morphological description.  Unlike the other sections, definitions do not have a name.  A definition
       consists of a typed feature structure followed by a list of lexical stems that share that feature
       structure.  A lexical stem consists of the string used in the concatenation specified by the grammar
       rules followed by `=' and a reference string.  The reference string can be anything and usually is used
       to indicate the canonical form of the word or an identifier of an external database entry.
       Examples:
              Noun[ Number=singular ] "table" = "table" "chair" = "chair"
              Verb[ Transitive=yes|no Inflection=base ] "bow" = "bow1"
              Noun[ Number=singular ] "bow" = "bow2"

       If the stem string and the reference string are identical, only one needs to be specified.
       Example:

              Noun[ Number=singular ] "table" "chair"
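
       The defaulting of the reference string can be sketched as follows.  This Python fragment only handles
       the list of stems that follows the typed feature structure, leaves escape sequences undecoded, and is
       an illustration rather than mmorph's parser; STEM_RE and parse_stems are hypothetical names.

```python
import re

# A quoted stem, optionally followed by `=' and a quoted reference string.
STEM_RE = re.compile(r'"((?:[^"\\]|\\.)*)"(?:\s*=\s*"((?:[^"\\]|\\.)*)")?')


def parse_stems(text):
    """Return (stem, reference) pairs; a missing reference defaults to the stem."""
    pairs = []
    for match in STEM_RE.finditer(text):
        stem, reference = match.group(1), match.group(2)
        pairs.append((stem, stem if reference is None else reference))
    return pairs
```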

FORMAL SYNTAX

       The formal syntax description below is in Backus Naur Form (BNF).  The following conventions apply:

       <id>      is a non-terminal symbol (within angle brackets).
       ID        is a token (terminal symbol, all uppercase).
       <id>?     means zero or one occurrence of <id> (i.e. <id> is optional).
       <id>*     is zero or more occurrences of <id>.
       <id>+     is one or more occurrences of <id>.
       ::=       separates a non-terminal symbol and its expansion.
       |         indicates an alternative expansion.
       ;         starts a comment (not part of the definition).

       The  start  symbol  corresponding  to a complete description is named <Start>.  Symbols that parse but do
       nothing are marked with `; not operational'.

       <Start>           ::= <AlphabetDecl> <AttDecl> <TypeDecl> <GramDecl>
                             <ClassDecl>? <PairDecl>? <SpellDecl>? <LexDecl>*

       <AlphabetDecl>    ::= ALPHABETS <LexicalDef> <SurfaceDef>

       <LexicalDef>      ::= <LexicalName> COLON <LexicalSymbol>+

       <SurfaceDef>      ::= <SurfaceName> COLON <SurfaceSymbol>+

       <LexicalSymbol>   ::= <LexicalSymbolName>    ; lexical only
                         |   <BiLevelSymbolName>    ; both lexical and surface

       <SurfaceSymbol>   ::= <SurfaceSymbolName>    ; surface only
                         |   <BiLevelSymbolName>    ; both lexical and surface

       <AttDecl>         ::= ATTRIBUTES <AttDef>+

       <AttDef>          ::= <AttName> COLON <ValName>+

       <TypeDecl>        ::= TYPES <TypeDef>+

       <TypeDef>         ::= <TypeName> COLON <AttName>+ <NoProjAtt>?

       <NoProjAtt>       ::= BAR <AttName>+

       <LexDecl>         ::= LEXICON <LexDef>+

       <LexDef>          ::= <Tfs> <Lexical>+

       <Lexical>         ::= LEXICALSTRING <BaseForm>?

       <BaseForm>        ::= EQUAL LEXICALSTRING

       <Tfs>             ::= <TypeName> <AttSpec>?

       <VarTfs>          ::= <TypeName> <VarAttSpec>?

       <AttSpec>         ::= LBRA <AttVal>* RBRA

       <VarAttSpec>      ::= LBRA <VarAttVal>* RBRA

       <AttVal>          ::= <AttName> <ValSpec>

       <VarAttVal>       ::= <AttName> <VarValSpec>

       <ValSpec>         ::= EQUAL <ValSet>
                         |   NOTEQUAL <ValSet>

       <VarValSpec>      ::= <ValSpec>
                         |   EQUAL DOLLAR <VarName>
                         |   EQUAL DOLLAR <VarName> <ValSpec>

       <ValSet>          ::= <ValName> <ValSetRest>*

       <ValSetRest>      ::= BAR <ValName>

       <GramDecl>        ::= GRAMMAR <RuleDef>+

       <RuleDef>         ::= <RuleName> COLON <RuleBody>

       <RuleBody>        ::= <VarTfs> LARROW <Rhs>
                         |   <Tfs>    ; goal rule
                         |   LEXICALSTRING <Tfs>    ; lexical affix

       <Rhs>             ::= <VarTfs>    ; unary rule
                         |   <VarTfs> <VarTfs>    ; binary rule
                         |   LEXICALSTRING <Tfs> <VarTfs>   ; prefix rule
                         |   <VarTfs> LEXICALSTRING <Tfs>    ; suffix rule

       <ClassDecl>       ::= CLASSES <ClassDef>+

       <ClassDef>        ::= <LexicalClassName> COLON <LexicalClass>+
                         |   <SurfaceClassName> COLON <SurfaceClass>+
                         |   <BiLevelClassName> COLON <BiLevelClass>+

       <LexicalClass>    ::= <LexicalSymbol>
                         |   <LexicalClassName>
                         |   <BiLevelClassName>

       <SurfaceClass>    ::= <SurfaceSymbol>
                         |   <SurfaceClassName>
                         |   <BiLevelClassName>

       <BiLevelClass>    ::= <BiLevelSymbolName>
                         |   <BiLevelClassName>

       <PairDecl>        ::= PAIRS <PairDef>+

       <PairDef>         ::= <PairName> COLON <Pair>+

       <Pair>            ::= <SurfaceSequence> SLASH <LexicalSequence>
                         |   <PairName>
                         |   <BiLevelClassName>
                         |   <BiLevelSymbolName>

       <SurfaceSequence> ::= LANGLE <SurfaceSymbol>* RANGLE
                         |   SURFACESTRING
                         |   <SurfaceClass>
                         |   ANY

       <LexicalSequence> ::= LANGLE <LexicalSymbol>* RANGLE
                         |   LEXICALSTRING
                         |   <LexicalClass>
                         |   ANY

       <SpellDecl>       ::= SPELLING <SpellDef>+

       <SpellDef>        ::= <SpellName> COLON <Arrow> <LeftContext> <Focus>
                                 <RightContext> <Constraint>*

       <LeftContext>     ::= <Pattern>*

       <RightContext>    ::= <Pattern>*

       <Focus>           ::= CONTEXTBOUNDARY <Pattern>+ CONTEXTBOUNDARY

       <Pattern>         ::= <Pair>
                         |   MORPHEMEBOUNDARY
                         |   WORDBOUNDARY
                         |   CONCATBOUNDARY

       <Constraint>      ::= <Tfs>

       <Arrow>           ::= RARROW
                         |   BIARROW
                         |   COERCEARROW

       <AttName>           ::= NAME
       <BiLevelClassName>  ::= NAME
       <BiLevelSymbolName> ::= NAME  | SYMBOLSTRING
       <LexicalClassName>  ::= NAME
       <LexicalName>       ::= NAME
       <LexicalSymbolName> ::= NAME  | SYMBOLSTRING
       <PairName>          ::= NAME
       <RuleName>          ::= NAME
       <SpellName>         ::= NAME
       <SurfaceClassName>  ::= NAME
       <SurfaceName>       ::= NAME
       <SurfaceSymbolName> ::= NAME  | SYMBOLSTRING
       <TypeName>          ::= NAME
       <ValName>           ::= NAME
       <VarName>           ::= NAME

   Simple tokens
       Simple tokens of the BNF above are defined as follows: the token name on the left corresponds to the
       literal character or characters on the right:

       ANY                 ?
       BAR                 |
       BIARROW             <=>
       COERCEARROW         <=
       COLON               :
       CONCATBOUNDARY      *
       CONTEXTBOUNDARY     -
       DOLLAR              $
       EQUAL               =
       LANGLE              <
       LARROW              <-
       LBRA                [
       MORPHEMEBOUNDARY    +
       NOTEQUAL            !=
       RARROW              =>
       RANGLE              >
       RBRA                ]
       SLASH               /
       WORDBOUNDARY        ~

       ALPHABETS           @Alphabets
       ATTRIBUTES          @Attributes
       CLASSES             @Classes
       GRAMMAR             @Grammar
       LEXICON             @Lexicon
       PAIRS               @Pairs
       SPELLING            @Spelling
       TYPES               @Types

       In the section header tokens above, spaces may separate the `@' from the reserved word.

   Complex tokens
       NAME
              is any sequence of letters, digits, underscores (`_') and periods (`.').
              Examples:
              category
              33
              Rule_9
              __2__
              Proper.Noun

       LEXICALSTRING
              is a string of lexical symbols

       SURFACESTRING
              is a string of surface symbols

       SYMBOLSTRING
              is a string of just one character (used only in alphabet declaration).

       A string consists of zero or more characters within double quotes (`"').  Characters preceded by a
       backslash (`\') are escaped (the usual C escaping conventions apply).  Symbols that have a name longer
       than one character are represented using an SGML entity-like notation: `&symbolname;'.  The maximum
       number of symbols in a string is 127.
       Examples:

              "table"
              ","
              ""
              "double quote is \" and backslash is \\"
              "&strong_e;"
              "escape like in C : \t is ASCII tab"
              "escape with octal code: \011 is ASCII tab"

       Tokens can be separated by one or many blanks or comments.
       A blank separator is space, tab or newline.
       A comment starts with a semicolon and finishes at the next newline (except when the semicolon occurs in a
       string).
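
       The lexical conventions above (names, strings, blanks and comments) can be approximated with a small
       regular-expression tokenizer.  The Python sketch below is not the yacc/flex parser used by mmorph; it
       assumes ASCII letters in names, handles only a few C escapes, and all names in it (TOKEN_RE, tokenize,
       split_symbols) are hypothetical.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<COMMENT> ;[^\n]* )                  # semicolon up to the next newline
  | (?P<STRING>  "(?:[^"\\\n]|\\.)*" )      # double quotes, backslash escapes
  | (?P<SECTION> @[ \t]*[A-Za-z]+ )         # section header, spaces allowed after @
  | (?P<SIMPLE>  <=>|<=|<-|=>|!=|[?|:*\-$=<>\[\]+/~] )
  | (?P<NAME>    [A-Za-z0-9_.]+ )           # letters, digits, underscore, period
  | (?P<BLANK>   [ \t\n]+ )
""", re.VERBOSE)

SYMBOL_RE = re.compile(r"&([A-Za-z0-9_.]+);|\\([0-7]{1,3})|\\(.)|(.)", re.DOTALL)
SIMPLE_ESCAPES = {"t": "\t", "n": "\n"}


def tokenize(text):
    """Return (kind, text) pairs, dropping blanks and comments."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup not in ("COMMENT", "BLANK"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def split_symbols(body, max_symbols=127):
    """Split the inside of a string into symbols, decoding &name; and escapes."""
    symbols = []
    for match in SYMBOL_RE.finditer(body):
        name, octal, escaped, plain = match.groups()
        if name is not None:
            symbols.append(name)                 # long symbol name
        elif octal is not None:
            symbols.append(chr(int(octal, 8)))   # octal character code
        elif escaped is not None:
            symbols.append(SIMPLE_ESCAPES.get(escaped, escaped))
        else:
            symbols.append(plain)
    if len(symbols) > max_symbols:
        raise ValueError(f"more than {max_symbols} symbols in a string")
    return symbols
```

       Because the string alternative is tried at the opening quote, a semicolon inside a string is kept as
       part of the string rather than starting a comment.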

       Inclusion of files can be specified with the usual `#include' directive:
       Example:
              #include "verb.entries"

       will splice in the content of the file verb.entries at the point where this directive occurs.

       The  `#'  should be the first character on the line.  Tabs or spaces may separate `#' and `include'.  The
       file name must be quoted.  Only tabs or spaces may occur on the rest  of  the  line.   Inclusion  can  be
       nested up to 10 levels.
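
       The inclusion rules can be sketched as a recursive expansion with a depth limit.  In this Python
       illustration the top-level file counts as level 0 and relative file names are resolved against the
       directory of the including file; both choices are assumptions, and INCLUDE_RE, MAX_DEPTH and
       expand_includes are hypothetical names.

```python
import re
from pathlib import Path

# `#' in column one, optional tabs or spaces, a quoted file name, nothing else.
INCLUDE_RE = re.compile(r'^#[ \t]*include[ \t]*"([^"]+)"[ \t]*$')
MAX_DEPTH = 10


def expand_includes(path, depth=0):
    """Return the lines of path with every #include directive spliced in."""
    if depth > MAX_DEPTH:
        raise RecursionError(f"includes nested deeper than {MAX_DEPTH} levels at {path}")
    path = Path(path)
    lines = []
    for line in path.read_text().splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            lines.extend(expand_includes(path.parent / match.group(1), depth + 1))
        else:
            lines.append(line)
    return lines
```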

AUTHOR

       Dominique Petitpierre, ISSCO, <petitp@divsun.unige.ch>

COMMENTS

       The parser for the morphology description formalism above was written using yacc(1) and flex(1).  Flex
       was written by Vern Paxson, <vern@ee.lbl.gov>, and is distributed in the framework of the GNU project
       under the conditions of the GNU General Public License.

                                            Version 2.3, October 1995                                  MMORPH(5)
