Provided by: libmarpa-r2-perl_2.086000~dfsg-10_amd64 bug

NAME

       Marpa::R2::Vocabulary - Standard parsing terms as used within Marpa

Description

       The definitions in this document are of standard parsing terms as they are used in the Marpa context.  I
       put defining uses of terms in boldface, for easy skimming.  Where a narrow or specialized sense of the
       term is the one that applies within Marpa, that is the only definition given.  Marpa also sometimes uses
       a standard term with a definition which is slightly different from the standard one.  ("Ambiguous
       grammar" is one example, and "grammar" itself is another.)  When this is the case, it is explicitly
       pointed out.

       A reader totally new to parsing will find this document too terse to act as a textbook or tutorial.  They
       will need to look elsewhere first.  As an introduction, I recommend Mark Jason Dominus's excellent
       chapter on parsing in the Perl context.  It's available on-line.  Wikipedia is also an excellent place to
       start.

   Basic
       A grammar is a set of rules, associated with a set of symbols, one of which is distinguished as the start
       symbol.  A symbol string, or simply string where the meaning is clear, is an ordered series of symbols.
       The length of a string is the number of symbols in it.

       A language is a set of symbol strings.  A grammar defines a language, as will be described later.

       It is important to note that the term language, as it is used in parsing theory, means something very
       different from what it means in ordinary use.  The meaning of the strings is an essential part of the
       ordinary idea of what a language is.  In ordinary use, the word "language" means far more than a
       unordered list of its sentences.  In parsing terminology, meaning (or semantics as it is called) is a
       separate issue.  For parsing theory a language is exactly a set of strings -- that and nothing more.

       The Marpa definition of a grammar differs slightly from the various standard ones.  Standard definitions
       usually sharply distinguish terminal symbols from non-terminals.  Marpa does not.

   Stages of parsing
       A recognizer is a program that determines whether its input is in the language of a grammar and a start
       symbol.  A parser is a program which finds the structure of that input.

       The term parsing is used in a strict and a loose sense.  Parsing in the loose sense is all phases of
       finding a grammar's structure, including a separate recognition phase if the parser has one.  (Marpa
       does.)  If a parser has phases, parsing in the strict sense refers specifically to the phase that finds
       the structure of the input.  When the Marpa documents use the term parsing in its strict sense, they will
       speak explicitly of "parsing in the strict sense".  Otherwise, parsing will mean parsing in the loose
       sense.

       Parsers often use a lexical analyzer to convert raw input, usually input text, into a token stream, which
       is a series of tokens.  Each token represents a symbol of the grammar and has a value.  A lexical
       analyzer is often called a lexer or a scanner, and lexical analysis is often called lexing or scanning.

       The series of symbols represented by the series of tokens becomes the symbol string input seen by the
       recognizer.  The symbol string input is more often called the input sentence.

       By default, Marpa uses the token stream model of input.  Marpa also allows alternative input models.
       These are new to Marpa, so that their terminology is of necessity non-standard.  The terminology needed
       for alternative input models is explained in the document that introduces them.

   Rules
       A standard way of describing rules is Backus-Naur Form, or BNF.  A rule of a grammar is also often called
       a production.  In one common way of writing BNF, a production looks like this:

           Expression ::= Term Factor

       In the production above, "Expression", "Term" and "Factor" are symbols.  A production consists of a left
       hand side and a right hand side.  In a context-free grammar, like those Marpa parses, the left hand side
       of a production is always a symbol string of length 1.  The right hand side of a production is a symbol
       string of zero or more symbols.  In the example, "Expression" is the left hand side, and "Term" and
       "Factor" are right hand side symbols.

       Left hand side and right hand side are often abbreviated as RHS and LHS.  If the RHS of a production has
       no symbols, the production is called an empty production or an empty rule.

       Any symbol which is allowed to occur in the symbol string input is called a terminal symbol.  If the
       symbols in a symbol string are all terminals, that symbol string is also called a sentence.

   Derivations
       A step of a derivation, or derivation step, is a change made to a symbol string by applying one of the
       productions from the grammar.  The production must be one of those with a LHS that occurs in the symbol
       string.  The result of the derivation step is another symbol string, one in which every occurence of the
       LHS symbol from the production is replaced by the RHS of the production.  For example, if "A", "B", "C",
       "D", and "X" are symbols, and

           X ::= B C

       is a production, then

           A X D -> A B C D

       is a derivation step, with ""A X D"" as its beginning and ""A B C D"" as its end or result.  We say that
       the symbol string ""A X D"" derives the symbol string ""A B C D"".

       A derivation is a sequence of derivation steps.  The length of a derivation is its length in steps.

       •   We  say  that a first symbol string directly derives a second symbol string if and only if there is a
           derivation of length 1 from the first symbol string to the second symbol string.

       •   Every symbol string is said to derive itself in a derivation of length 0.  A zero  length  derivation
           is a trivial derivation.

       •   A  derivation  which  is  not  trivial  (that is, a derivation which has one or more steps) is a non-
           trivial derivation.

       •   A non-trivial derivation of a symbol string from itself is called a cycle.   Grammars  which  contain
           cycles are traditionally considered useless, but Marpa will parse with such grammars.

       •   If  a  derivation  is  not  trivial  or  direct, that is, if it has more than one step, then it is an
           indirect derivation.

       Technically, a symbol "X" and a string that consists of only that symbol are two different  things.   But
       we  often  say  "the symbol "X"" as shorthand for "the string of length 1 whose only symbol is "X"".  For
       example, if the string containing only the symbol "X" derives a string "Y", we will  usually  say  simply
       that ""X" derives "Y"".

       Wherever  symbol  or  string  "X"  derives  "Y", we may also say "X" produces "Y".  Derivations are often
       described as symbol matches.  Wherever symbol or string "X" derives "Y", we may also say that "Y" matches
       "X" or that "X" matches "Y".  It is particularly common to say that "X" matches "Y" when "X" or "Y" is  a
       sentence.

       The  parse  of  an  input  by a grammar is successful if and only if, according to the grammar, the start
       symbol produces the input sentence.  The set of all input sentences  that  a  grammar  will  successfully
       parse is the language of the grammar.

   Nulling
       The  zero  length  symbol  string is called the empty string.  The empty string can be considered to be a
       sentence, in which case it is the empty sentence.  A string of one  or  more  symbols  is  non-empty.   A
       derivation  which  produces  the  empty  string is a null derivation.  A derivation from the start symbol
       which produces the empty string is a null parse.

       If in a particular grammar, a symbol has a null derivation, it is a nullable symbol.  If, in a particular
       grammar, the only sentence produced by a symbol is the empty sentence,  it  is  a  nulling  symbol.   All
       nulling symbols are nullable symbols.

       If  a symbol is not nullable, it is non-nullable.  If a symbol is not nulling, it is non-nulling.  In any
       instance where a symbol produces the empty string, it is said to be nulled, or to be a null symbol.

   Useless rules
       If any derivation from the start symbol uses a rule, that rule is called reachable or accessible.  A rule
       that is not accessible is called unreachable or inaccessible.  If  any  derivation  which  results  in  a
       sentence  uses  a  rule,  that  rule  is  said to be productive.  A rule that is not productive is called
       unproductive.  For example, a rule is unproductive unless every symbol on its RHS either is a terminal or
       is the LHS of some other rule.  A rule which is inaccessible or unproductive is called  a  useless  rule.
       Marpa can handle grammars with useless rules.

       A  symbol  is  reachable  or  accessible  if  it  appears  in a reachable production.  If a symbol is not
       reachable, it is unreachable or inaccessible.  A symbol is productive if it  appears  on  the  LHS  of  a
       productive  rule,  or  if it is a nullable symbol.  If a symbol is not productive, it is unproductive.  A
       symbol which is inaccessible or unproductive is called a useless symbol.  Marpa can handle grammars  with
       useless symbols.

   Recursion and cycles
       If  any  symbol  in  the grammar non-trivially produces a symbol string containing itself, the grammar is
       said to be recursive.  If any symbol non-trivially produces a symbol string with itself on the left,  the
       grammar  is  said to be left-recursive.  If any symbol non-trivially produces a symbol string with itself
       on the right, the grammar is said to be  right-recursive.   Marpa  can  handle  all  recursive  grammars,
       including  grammars  which  are  left-recursive,  grammars  which are right-recursive, and grammars which
       contain both left- and right-recursion.

       A cycle is a non-trivial derivation of a string of symbols from itself.  If it is not  possible  for  any
       derivation  using  a  grammar  to  contain  a  cycle,  then  that  grammar  is  said  to  be  cycle-free.
       Traditionally, a grammar is considered useless if it is not cycle-free.

       The traditional deprecation of cycles is well-founded.  A cycle is the parsing equivalent of an  infinite
       loop.   Once  a  cycle appears, it can be repeated over and over again.  Even a very short input sentence
       can have an infinite number of parses when the grammar is not cycle-free.

       For that reason, a grammar which contains a cycle is also called infinitely ambiguous.  Marpa  can  parse
       with  grammars  which  are not cycle-free, and will even parse inputs that cause cycles.  When a parse is
       infinitely ambiguous, Marpa limits cycles to a single loop, so that only a finite  number  of  parses  is
       returned.

   Parse structure
       The  structure of a parse can be represented as a series of derivation steps from the start symbol to the
       input.  Another way to represent structure is as a parse  tree.   Every  symbol  used  in  the  parse  is
       represented  by  a  node  of  the  parse  tree.   Wherever  a production is used in the parse, its LHS is
       represented by a parent node and the RHS symbols are represented by child nodes.  The start symbol is the
       root of the tree.  The node at the root of the tree is called the start node.

       Traditionally, grammars divide all symbols sharply into terminals and non-terminals.  A  terminal  symbol
       must ALWAYS be used as a terminal.  A non-terminal symbol can NEVER be used as a terminal.

       Marpa's  use  of  terminals  is non-traditional, and its terminology is different accordingly.  As in the
       traditional approach, Marpa's non-terminals can never be used as terminals.  But Marpa terminals  can  be
       used  anywhere,  even  in  places  where  the  traditional approach requires a a non-terminal symbol.  In
       particular, a Marpa terminal can be the LHS of a rule.

       Traditionally, and in Marpa as well, every node is either a inner node or a leaf node.   In  Marpa,  leaf
       nodes are of two kinds:

       •   Nodes for nulled symbols.  A node for a nulled symbol is called a nulled node.

       •   Nodes for symbols being used as terminals.

   Ambiguity
       Marpa  allows ambiguous grammars.  Traditionally we say that a parse is ambiguous if, for a given grammar
       and a given input, more than one derivation tree is possible.   However,  Marpa  allows  ambiguous  input
       tokens,  which  the  traditional  definition  does  not take into account.  If Marpa used the traditional
       definition, all grammars would be ambiguous except those grammars which allowed only the null parse.

       It is easiest if the Marpa definition and the traditional definition were  extensionally  equivalent  ---
       that  is, if Marpa's set of ambiguous grammars was exactly the same as the set of traditionally ambiguous
       grammars.  This can be accomplished by using a slightly altered definition.   In  the  Marpa  context,  a
       grammar  is  ambiguous if and only if, for some UNAMBIGUOUS stream of input tokens, that grammar produces
       more than one parse tree.

   Semantics
       In real life, the structure of a parse is usually a means to an end.  Grammars usually have  a  semantics
       associated  with  them,  and  what  the  user  actually  wants is the value of the parse according to the
       semantics.

       The tree representation is especially useful when evaluating a  parse.   In  the  traditional  method  of
       evaluating  a parse tree, every node which represents a terminal symbol has a value associated with it on
       input.  Non-null inner nodes take their semantics from the production whose LHS they  represent.   Nulled
       nodes are dealt with as special cases.

       The  semantics  for a production describe how to calculate the value of the node which represents the LHS
       (the parent node) from the values of zero or more of the nodes which represent  the  RHS  symbols  (child
       nodes).   Values  are  computed  recursively,  bottom-up.  The value of a parse is the value of its start
       symbol.

Copyright and License

         Copyright 2014 Jeffrey Kegler
         This file is part of Marpa::R2.  Marpa::R2 is free software: you can
         redistribute it and/or modify it under the terms of the GNU Lesser
         General Public License as published by the Free Software Foundation,
         either version 3 of the License, or (at your option) any later version.

         Marpa::R2 is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
         Lesser General Public License for more details.

         You should have received a copy of the GNU Lesser
         General Public License along with Marpa::R2.  If not, see
         http://www.gnu.org/licenses/.

perl v5.40.0                                       2024-12-07                         Marpa::R2::Vocabulary(3pm)