Ubuntu Manpage: Marpa::R2::Advanced::Models

Provided by: libmarpa-r2-perl_2.086000~dfsg-9_amd64

NAME

       Marpa::R2::Advanced::Models - Other input models

About this document

       The alternative input models described in this document are an advanced technique.  If you are starting
       out with Marpa, you probably want to ignore this document.  If you are an experienced Marpa user, it is
       still safe to ignore this document, but you might find the possibilities it discusses interesting.

Marpa has two different ideas of location

       In the other Marpa documentation, we have spoken of "location", and assumed the standard input model.
       The locations actually in use by the methods described for the standard input model were Earley set
       ordinals.  (An Earley set's ordinal is also its ID.)

       Marpa actually has two different ideas of location -- Earley set ordinal and earleme.  This is ignored in
       the other Marpa documents, and it can be, because they assume the standard input model.  Use of the
       standard input model guarantees that earleme and Earley set ordinal will always be exactly the same.

       This document introduces methods which make it possible (and in fact likely) that earleme and Earley set
       ordinal will differ.  From here on, the reader will need to pay careful attention to the distinction.

What is an alternative input model?

       An alternative input model is anything that is not the default, token-stream model.  More helpfully,
       Marpa allows variable-length tokens and ambiguous tokens, and an alternative input model is any input
       model which

       •   Allows a token whose length is not exactly 1, or

       •   Allows locations which have more than one token.

       To  do either of these things, a user must use the recognizer's "alternative" method.  In other words, if
       an application is not directly using the recognizer's "alternative" method call, that application is  not
       using an alternative input method.

       Many  concepts,  such  as  parsing  location, parse exhaustion, and the end of parsing, are somewhat more
       complicated when alternative input models are involved.   These  concepts  were  explained  in  the  main
       document  for  the  recognizer  on the assumption that the default input model was in use.  This document
       revises those explanations as necessary to take into account the alternative input models.

Token streams

       Marpa's default input model is the traditional one -- a token stream.  Token streams are very standard in
       parsing applications -- so much so that most texts do not take the trouble of defining the term.  A token
       stream is input structured as a sequence of tokens, where each token  occupies  one  location  and  every
       location has a token.  In the token stream model, all tokens are of the same length.

       Conventionally,  all  tokens  are of length 1, and the token stream starts at location 0.  Following this
       convention, the Nth token would start at location N-1 and end at location  N.   For  example,  the  first
       token  would  start  at location 0 and end at location 1.  The second token would start at location 1 and
       end at location 2.

Earlemes

       For most parsers, position is location in a token stream.  To deal with variable-length  and  overlapping
       tokens, Marpa needs a more flexible idea of location.

       Marpa's  tracks  position in terms of earlemes.  Earlemes are named after Jay Earley, the inventor of the
       first algorithm in Marpa's lineage.  Every token has a start earleme and an end earleme.

       The token stream model may also be called the token-per-earleme model.  In the token stream model,  token
       location  and  earleme  location are exactly identical.  More formally, in the token stream model, if the
       token location is N, then the earleme location is also N.  If a user's application uses the token  stream
       model, the user can ignore the existence of earlemes, and can think entirely in terms of tokens and their
       position  in  a  token  stream.   Because  of  this,  the  main Marpa documents often speak simply of the
       "location" in the parse.

The furthest earleme

       The furthest earleme is the last earleme at which a token ends.  In the default input model, the furthest
       earleme and the current earleme are always the same.  As a  result,  in  the  default  input  model,  the
       furthest earleme is not an important concept.

       In  alternative  input  models,  tokens  may  be  longer than 1 earleme, and the furthest earleme and the
       current earleme may be far apart.  This becomes an issue when parsing  is  finished.   Alternative  input
       models  use  the  recognizer's  "end_input"  method  to ensure that processing of input catches up to the
       furthest earleme.

The latest Earley set and latest earleme

       The latest earleme is the earleme location of the latest Earley set.  In the  default  input  model,  the
       latest earleme is always the same as the current earleme.

       In  alternative  input  models, there may not be an Earley set at a given earleme location.  When that is
       the case for the current earleme, then the latest Earley set is not  at  the  current  earleme,  and  the
       latest earleme and current earlemes are different.

Methods

alternative()
$recce->alternative( 'a', \42, 1 ) or return 'First alternative failed';

The "alternative" method is the most general method for reading input, and is used in alternative input
models. It takes three arguments, only the first of which is required.

The first two arguments are the token type and a reference to the token value. If the reference to the
token value is omitted, or is "undef", the value of the token will be a Perl "undef".

The third argument to the "alternative" method is a token length. If omitted, token length defaults to
1, which is the correct value for the token stream model. Its value can be any integer greater than
zero. Marpa does not allow zero length tokens in any input model.

Unlike the "read" method, the "alternative" method does not advance the current location (current
earleme) on each call. This allows the application to read several tokens at the same earleme. This is
how ambiguous input is created. To advance the current earleme when input is read using the
"alternative" method, the "complete_earleme" method must be called.

On success, "alternative" returns a Perl true value. If the token is rejected, "alternative" returns a
Perl false. On other failures, "alternative" throws an exception.

current_earleme()
$current_earleme = $recce->current_earleme();

Returns the current parse location, also known as the current earleme. Not often needed.

earleme()
my $origin_earleme = $recce->earleme($origin_earley_set_id);

Given an Earley set ID as its argument, the earleme() recognizer method returns the corresponding
earleme. Every integer in the range from 0 to the ID of the latest Earley is a valid Earley set ID, and
every valid Earley set ID corresponds to an earleme. If the argument of earleme() is greater than the
latest Earley set ID, earleme() returns Perl "undef".

There is currently no method that translates from earleme to Earley set. Earley set to earleme
translation is a well-behaved one-to-one function in all input models -- for every Earley set there is a
earleme, and every earleme is mapped to by at most one Earley set. Earleme to Earley set translation is
far less well-behaved. In many input models, it is a partial function -- there are some earlemes that
are in the valid range of earlemes but do not map to any Earley set.

Earleme to Earley set translation is often not needed. When it is, it can be implemented at the
application level, with the application taking advantage of what it knows about its choice of input
model.

earleme_complete()
$recce->earleme_complete();

Processes all tokens at the current earleme and advances the current earleme by 1. If the earleme cannot
be completed, an exception is thrown. Otherwise, an "event" count is returned. If zero is returned, the
earleme was completed without event. In this context, an "event" is one of a list of occurrences of
special interest during the successful completion of an earleme, as described for the recognizer's "read"
method.

All tokens read using the "alternative" method start at one location -- the current earleme. When
reading input using the "alternative" method, "earleme_complete" is used to complete processing at the
current earleme and move forward in the input stream.

"earleme_complete" may be called even if the "alternative" method has been not called since the last call
to "earleme_complete". This will create an earleme with no tokens. In certain input models, such as the
character-per-earleme model, this can be useful.

end_input()
$recce->end_input();

Indicates that input is finished. Calling "end_input" is not necessary or useful in the default input
model, because in the default input model no token has a length greater than 1.

The "end_input" method takes no arguments. The "end_input" method returns a Perl true value on success.
On failure, it throws an exception.

In alternative input models, calling the "earleme_complete" method once input is finished does not ensure
that all input has been processed. The "earleme_complete" method completes the current earleme, but in
alternative models, tokens may extend well past the current earleme. The "end_input" method ensures that
all input is processed.

Calling "end_input" multiple times on the same recognizer object is harmless, but useless. The second
and subsequent calls will return a Perl true, but will have no effect.

Alternative models and read()

       A  recognizer  can  mix  calls  to  its "read" method with calls to its "alternative" method.  The "read"
       method has the same effect as a single call to the "alternative" method, followed immediately by  a  call
       of the "earleme_complete" method.

Ambiguous lexing

       Marpa  allows  ambiguous tokens.  Several Marpa tokens can start at a single parsing location.  Ambiguous
       tokens can be of various lengths.  Tokens can also overlap.

       Potentially ambiguous lexing occurs  when  more  than  one  token  starts  at  a  single  earleme.   When
       potentially  ambiguous  lexing  occurs,  it  becomes  possible  for there to be more than one sequence of
       tokens.

       An actual lexical ambiguity only occurs if more than one of the potential token sequences  is  consistent
       with  the grammar.  If there is no actual lexical ambiguity, Marpa will use the only token choice that is
       consistent with the grammar.

       When lexing is actually ambiguous, Marpa will use all the alternatives consistent with the grammar.  When
       the lexing in a parse is actually ambiguous, the parse will be ambiguous.  From  the  point  of  view  of
       Marpa's  semantics,  ambiguities  caused  by  lexing  look the same as ambiguities caused by an ambiguous
       grammar.

       In the standard terminology, if a grammar produces more than one parse tree  for  any  input,  then  that
       grammar  must  be  ambiguous.   In Marpa this is not strictly true.  In Marpa, if the input is ambiguous,
       even an unambiguous grammar can produce more than one parse.

Duplicate tokens

       A duplicate token is a token of the same type and the same length as another that was read  at  the  same
       earleme.   Duplicate  tokens  are impossible in the default, token-stream, model.  This is because in the
       token-stream model only one token can be read at each earleme.

       In alternative models, more than one token may be read at an earleme, and duplicates are possible.  Marpa
       detects duplicate tokens and treats them as "hard errors" -- Marpa throws an exception  when  it  sees  a
       duplicate token.  Marpa's assumption is that duplicate tokens indicate an error at the application level.

       An  application  can retry input after a duplicate token, if it catches the exception.  In the future, if
       recovery from duplicate tokens is found to be a useful technique, Marpa may provide an option  to  change
       its behavior, so that a soft failure is returned when there is a duplicate token.

Earlemes: the details

       While  scanning,  Marpa  keeps  track of the current earleme.  Earlemes in a parse start at earleme 0 and
       increase numerically.  The earleme immediately following earleme 0 is earleme 1, the earleme  immediately
       following  earleme  1  is  earleme  2,  and so on.  The earleme immediately following earleme N is always
       earleme N+1.

       Distance in the earleme stream is what you would intuitively expect  it  to  be.   The  distance  between
       earleme  X  and  earleme  Y is the absolute value of the difference between X and Y, |X-Y|.  The distance
       from earleme 3 to earleme 6, for example, is 3 earlemes.

       Whenever a token is given to Marpa to be scanned, it starts at the current earleme.  In addition  to  the
       type  and  value  of the token, Marpa must be told the token's length in earlemes.  The length of a Marpa
       token must be greater than zero.

       This earleme length will become the distance from the start of the token to the end of the token.  If the
       length of the token is L, and the current earleme is C, the end of the token will be at earleme C+L.

The character-per-earleme model

       Many different models of the  relationship  between  tokens  and  earlemes  are  possible,  but  two  are
       particularly  important.   One  is  the  one-token-per-earleme model, which is the default, and which has
       already been described.  The other is the one-character-per-earleme model.

       In the one-character-per-earleme model, every character will be treated as being exactly one  earleme  in
       length.  If a token is more than one character in length, that token will span earlemes.  When the lexing
       is ambiguous, tokens may overlap.

       When  a  one-character-per-earleme  model of input is used, there may be many earlemes at which no tokens
       start.  For example, in a  straightforward  character-per-earleme  implementation  of  a  grammar  for  a
       language  that  allows  comments,  no  tokens  will  start  at any earlemes which correspond to character
       locations inside a comment.

Other input models

       So far only the token-per-earleme and character-per-earleme models  have  seen  any  real  use  in  Marpa
       programs.   But  other  models  are  certainly possible.  Using earlemes, you can structure your input in
       almost any way you like.

       There are only three restrictions:

       1.  Scanning always starts at earleme 0.

       2.  All tokens starting at earleme N must be scanned before any tokens starting at earleme N+1.  In other
           words, the tokens must be scanned in non-decreasing order by start earleme.

       3.  Every token must have a length, in earlemes, which is greater  than  zero.   In  other  words,  token
           length can never be zero or negative.

Copyright and License

         Copyright 2014 Jeffrey Kegler
         This file is part of Marpa::R2.  Marpa::R2 is free software: you can
         redistribute it and/or modify it under the terms of the GNU Lesser
         General Public License as published by the Free Software Foundation,
         either version 3 of the License, or (at your option) any later version.

         Marpa::R2 is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
         Lesser General Public License for more details.

         You should have received a copy of the GNU Lesser
         General Public License along with Marpa::R2.  If not, see
         http://www.gnu.org/licenses/.

perl v5.38.2                                       2024-03-06                   Marpa::R2::Advanced::Models(3pm)