Ubuntu Manpage: "Parser::MGC" - build simple recursive-descent parsers

Provided by: libparser-mgc-perl_0.19-1_all

NAME

       "Parser::MGC" - build simple recursive-descent parsers

SYNOPSIS

          package My::Grammar::Parser;
          use base qw( Parser::MGC );

          sub parse
          {
             my $self = shift;

             $self->sequence_of( sub {
                $self->any_of(
                   sub { $self->token_int },
                   sub { $self->token_string },
                   sub { \$self->token_ident },
                   sub { $self->scope_of( "(", \&parse, ")" ) }
                );
             } );
          }

          my $parser = My::Grammar::Parser->new;

          my $tree = $parser->from_file( $ARGV[0] );

          ...

DESCRIPTION

This base class provides a low-level framework for building recursive-descent parsers that consume a
given input string from left to right, returning a parse structure. It takes its name from the "m//gc"
regexps used to implement the token parsing behaviour.

It provides a number of token-parsing methods, which each extract a grammatical token from the string. It
also provides wrapping methods that can be used to build up a possibly-recursive grammar structure, by
applying a structure around other parts of parsing code.

Backtracking
Each method, both token and structural, atomically either consumes a prefix of the string and returns its
result, or fails and consumes nothing. This makes it simple to implement grammars that require
backtracking.

Several structure-forming methods have some form of "optional" behaviour; they can optionally consume
some amount of input or take some particular choice, but if the code invoked inside that subsequently
fails, the structure can backtrack and take some different behaviour. This is usually what is required
when testing whether the structure of the input string matches some part of the grammar that is optional,
or has multiple choices.

However, once the choice of grammar has been made, it is often useful to be able to fix on that one
choice, thus making subsequent failures propagate up rather than taking that alternative behaviour.
Control of this backtracking is given by the "commit" method; careful use of this method is one of
the key advantages that "Parser::MGC" has over simpler parsing using single regexps alone.

CONSTRUCTOR

   new
          $parser = Parser::MGC->new( %args )

       Returns a new instance of a "Parser::MGC" object. This must be called on a subclass that provides a
       method of the name given as "toplevel", by default called "parse".

       Takes the following named arguments:

       toplevel => STRING
               Name  of  the  toplevel method to use to start the parse from. If not supplied, will try to use a
               method called "parse".

       patterns => HASH
               Keys in this hash should map to quoted  regexp  ("qr//")  references,  to  override  the  default
               patterns used to match tokens. See "PATTERNS" below.

       accept_0o_oct => BOOL
               If true, the "token_int" method will also accept integers with a "0o" prefix as octal.
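
       As a minimal sketch of how these arguments combine, the following defines a small subclass and
       constructs it with all three. The package name "My::ListParser" and the method name "parse_list" are
       hypothetical, chosen only for illustration.

```perl
package My::ListParser;
use strict;
use warnings;
use base qw( Parser::MGC );

# Hypothetical toplevel method; its name is passed as "toplevel" below
sub parse_list
{
   my $self = shift;

   $self->list_of( ",", sub { $self->token_int } );
}

package main;

my $parser = My::ListParser->new(
   toplevel      => "parse_list",              # start from parse_list, not parse
   patterns      => { comment => qr/#.*\n/ },  # skip "#" comments between tokens
   accept_0o_oct => 1,                         # also accept 0o17 as octal
);

# $result is an ARRAY ref holding the decoded integers 1, 16 and 15
my $result = $parser->from_string( "1, 0x10, 0o17 # trailing comment\n" );
```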

PATTERNS

       The following pattern names are recognised. They may be passed to the constructor in the "patterns" hash,
       or provided as a class method named "pattern_" followed by the pattern name (for example "pattern_ws").

       •   ws

           Pattern used to skip whitespace between tokens. Defaults to "/[\s\n\t]+/"

       •   comment

           Pattern used to skip comments between tokens. Undefined by default.

       •   int

           Pattern  used to parse an integer by "token_int". Defaults to "/-?(?:0x[[:xdigit:]]+|[[:digit:]]+)/".
           If "accept_0o_oct" is given, then this will be expanded to match "/0o[0-7]+/" as well.

       •   float

           Pattern   used   to   parse   a    floating-point    number    by    "token_float".    Defaults    to
           "/-?(?:\d*\.\d+|\d+\.)(?:e-?\d+)?|-?\d+e-?\d+/i".

       •   ident

           Pattern used to parse an identifier by "token_ident". Defaults to "/[[:alpha:]_]\w*/"

       •   string_delim

           Pattern used to delimit a string by "token_string". Defaults to "/["']/".
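
       The class-method form might look like the following sketch, which overrides the "ident" and "comment"
       patterns for every instance of a subclass. The package name "My::HyphenParser" and the particular
       regexps are assumptions made for this example.

```perl
package My::HyphenParser;
use strict;
use warnings;
use base qw( Parser::MGC );

# Class methods named pattern_<name> replace the default patterns
sub pattern_ident   { qr/[[:alpha:]_][\w-]*/ }   # allow hyphens inside identifiers
sub pattern_comment { qr/#.*\n/ }                # "#" to end of line is a comment

sub parse
{
   my $self = shift;

   $self->sequence_of( 'token_ident' );
}

package main;

# $idents is an ARRAY ref of the identifiers found, with the comment skipped
my $idents = My::HyphenParser->new->from_string( "foo-bar # note\nbaz_1\n" );
```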

METHODS

   from_string
          $result = $parser->from_string( $str )

       Parse the given literal string and return the result from the toplevel method.

   from_file
          $result = $parser->from_file( $file, %opts )

       Parse  the given file, which may be a pathname in a string, or an opened IO handle, and return the result
       from the toplevel method.

       The following options are recognised:

       binmode => STRING
               If set, applies the given binmode to the filehandle before reading. Typically this can be used to
               set the encoding of the file.

                  $parser->from_file( $file, binmode => ":encoding(UTF-8)" )

   from_reader
          $result = $parser->from_reader( \&reader )

       Since version 0.05.

       Parse the input which is read by the "reader" function. This function will be called in scalar context to
       generate portions of string to parse, being passed the $parser object. The function should return "undef"
       when it has no more string to return.

          $reader->( $parser )

       Note that because it is not generally possible to detect exactly when more input may be required  due  to
       failed  regexp  parsing,  the  reader function is only invoked during searching for skippable whitespace.
       This makes it suitable for reading lines of a file in the common  case  where  lines  are  considered  as
       skippable  whitespace,  or for reading lines of input interactively from a user. It cannot be used in all
       cases (for example, reading fixed-size buffers from a file) because two successive invocations may  split
       a single token across the buffer boundaries, and cause parse failures.
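
       A line-oriented reader, the case this method suits best, could be sketched as below. An in-memory
       filehandle stands in for a real file, and the package name "My::IntStream" is hypothetical.

```perl
package My::IntStream;
use strict;
use warnings;
use base qw( Parser::MGC );

sub parse
{
   my $self = shift;

   $self->sequence_of( 'token_int' );
}

package main;

# An in-memory filehandle stands in for a real file or an interactive source
my $text = "1 2\n3 4\n";
open my $fh, "<", \$text or die "Cannot open in-memory handle: $!";

# Each call yields one whole line, so no token is split across two calls;
# readline returns undef at end of file, which ends the input
my $ints = My::IntStream->new->from_reader( sub { scalar <$fh> } );
```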

   pos
          $pos = $parser->pos

       Since version 0.09.

       Returns the current parse position, as a character offset from the beginning of the file or string.

   take
          $str = $parser->take( $len )

       Since version 0.16.

       Returns  the  next  $len characters directly from the input, prior to any whitespace or comment skipping.
       This does not take account of any end-of-scope marker that may be pending. It  is  intended  for  use  by
       parsers of partially-binary protocols, or other situations in which it would be incorrect for the end-of-
       scope marker to take effect at this time.
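
       For instance, a length-prefixed field can be read by parsing the length as a token and then taking
       exactly that many characters. This is only a sketch; the "LENGTH:DATA" format and the package name
       "My::Chunk" are invented for the example.

```perl
package My::Chunk;
use strict;
use warnings;
use base qw( Parser::MGC );

sub parse
{
   my $self = shift;

   my $len = $self->token_int;   # the length prefix
   $self->expect( ":" );

   # No whitespace skipping happens here: the next $len characters are
   # returned exactly as they appear in the input
   return $self->take( $len );
}

package main;

my $parser = My::Chunk->new;

my $plain  = $parser->from_string( "5:hello" );   # "hello"
my $padded = $parser->from_string( "3: ab" );     # " ab", leading space kept
```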

   where
          ( $lineno, $col, $text ) = $parser->where

       Returns the current parse position, as a line and column number, and the entire current line of text. The
       first line is numbered 1, and the first column is numbered 0.

   fail
   fail_from
          $parser->fail( $message )

          $parser->fail_from( $pos, $message )

       "fail_from" since version 0.09.

       Aborts the current parse attempt with the given message string. The failure message will include the line
       and  column  position,  and  the  line  of input that failed at the current parse position ("fail"), or a
       position earlier obtained using the "pos" method ("fail_from").

       This failure will propagate up to the inner-most structure parsing method that has not been committed; or
       will cause the entire parser to fail if there are no further options to take.
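
       The sketch below shows one way "pos" and "fail_from" might be combined to report a semantic error at
       the start of the offending token, and how a caller can trap the failure with "eval". The package name
       "My::ByteParser" and the range check are assumptions for illustration.

```perl
package My::ByteParser;
use strict;
use warnings;
use base qw( Parser::MGC );

sub parse
{
   my $self = shift;

   my $start = $self->pos;          # remember where the value begins
   my $value = $self->token_int;

   # Report the problem at the remembered position rather than after the token
   $value <= 255 or
      $self->fail_from( $start, "Expected a value no greater than 255" );

   return $value;
}

package main;

my $parser = My::ByteParser->new;

my $good = $parser->from_string( "42" );

# With nothing left to backtrack into, the failure reaches the caller as an exception
my $bad   = eval { $parser->from_string( "300" ) };
my $error = "$@";
```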

   at_eos
          $eos = $parser->at_eos

       Returns true if the current parse position is at the end of the input string.

   scope_level
          $level = $parser->scope_level

       Since version 0.05.

       Returns the number of nested "scope_of" calls that have been made.

STRUCTURE-FORMING METHODS

       The following methods may be used to build a grammatical structure out of the defined basic token-parsing
       methods. Each takes at least one code reference, which will be passed the actual $parser  object  as  its
       first argument.

       Anywhere  that  a  code  reference is expected also permits a plain string giving the name of a method to
       invoke. This is sufficient in many simple cases, such as

          $self->any_of(
             'token_int',
             'token_string',
             ...
          );

   maybe
          $ret = $parser->maybe( $code )

       Attempts to execute the given $code in scalar context, and returns what it returned,  accepting  that  it
       might fail. $code may either be a CODE reference or a method name given as a string.

       If  the  code  fails (either by calling "fail" itself, or by propagating a failure from another method it
       invoked) before it has invoked "commit", then none of the input string  will  be  consumed;  the  current
       parsing position will be restored. "undef" will be returned in this case.

       If  it calls "commit" then any subsequent failure will be propagated to the caller, rather than returning
       "undef".

       This may be considered to be similar to the "?" regexp quantifier.

          sub parse_declaration
          {
             my $self = shift;

             [ $self->parse_type,
               $self->token_ident,
               $self->maybe( sub {
                  $self->expect( "=" );
                  $self->parse_expression
               } ),
             ];
          }

   scope_of
          $ret = $parser->scope_of( $start, $code, $stop )

       Expects to find the $start pattern, then attempts to execute the given $code, then expects to find the
       $stop pattern. Returns whatever the code returned. $code may either be a CODE reference or a method name
       given as a string.

       While the code is being executed, the $stop pattern will be used by the token parsing methods as an  end-
       of-scope marker; causing them to raise a failure if called at the end of a scope.

          sub parse_block
          {
             my $self = shift;

             $self->scope_of( "{", 'parse_statements', "}" );
          }

       If  the  $start  pattern  is  undefined,  it is presumed the caller has already checked for this. This is
       useful when the stop pattern needs to be calculated based on the start pattern.

          sub parse_bracketed
          {
             my $self = shift;

             my $delim = $self->expect( qr/[\(\[\<\{]/ );
             $delim =~ tr/([<{/)]>}/;

             $self->scope_of( undef, 'parse_body', $delim );
          }

       This method does not have any optional parts to it;  any  failures  are  immediately  propagated  to  the
       caller.

   committed_scope_of
          $ret = $parser->committed_scope_of( $start, $code, $stop )

       Since version 0.16.

       A  variant  of  "scope_of"  that  calls  "commit"  after a successful match of the start pattern. This is
       usually what you want if using "scope_of" from  within  an  "any_of"  choice,  if  no  other  alternative
       following this one could possibly match if the start pattern has.
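
       As an illustrative sketch, the parser below accepts either an integer or a bracketed, comma-separated
       list of values. Once "[" has been seen no other choice can apply, so a malformed list is reported as a
       failure instead of falling through to "token_int". The package and method names are hypothetical.

```perl
package My::NestedList;
use strict;
use warnings;
use base qw( Parser::MGC );

sub parse
{
   my $self = shift;

   $self->parse_value;
}

sub parse_value
{
   my $self = shift;

   $self->any_of(
      # After "[" matches, this choice is committed
      sub {
         $self->committed_scope_of( "[",
            sub { $self->list_of( ",", 'parse_value' ) },
         "]" );
      },
      'token_int',
   );
}

package main;

my $parser = My::NestedList->new;

# $tree is a nested ARRAY ref mirroring the bracket structure
my $tree = $parser->from_string( "[1, [2, 3]]" );

# The "?" is not a value, so the closing "]" is not found; the committed
# choice makes this failure propagate to the caller
my $ok = eval { $parser->from_string( "[1, ?]" ); 1 };
```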

   list_of
          $ret = $parser->list_of( $sep, $code )

       Expects  to find a list of instances of something parsed by $code, separated by the $sep pattern. Returns
       an ARRAY ref containing a list of the return values from  the  $code.  A  single  trailing  delimiter  is
       allowed,  and  does  not  affect  the return value. $code may either be a CODE reference or a method name
       given as a string. It is called in list context, and whatever values  it  returns  are  appended  to  the
       eventual result - similar to perl's "map".

       This  method  does  not consider it an error if the returned list is empty; that is, that the scope ended
       before any item instances were parsed from it.

          sub parse_numbers
          {
             my $self = shift;

             $self->list_of( ",", 'token_int' );
          }

       If the code fails (either by invoking "fail" itself, or by propagating a failure from another  method  it
       invoked)  before  it  has invoked "commit" on a particular item, then the item is aborted and the parsing
       position will be restored to the beginning of that  failed  item.  The  list  of  results  from  previous
       successful attempts will be returned.

       If  it  calls  "commit"  within  an  item then any subsequent failure for that item will cause the entire
       "list_of" to fail, propagating that to the caller.

   sequence_of
          $ret = $parser->sequence_of( $code )

       A shortcut for calling "list_of" with an empty string as separator; expects to find at least one instance
       of something parsed by $code, separated only by skipped whitespace.

       This may be considered to be similar to the "+" or "*" regexp quantifiers.

          sub parse_statements
          {
             my $self = shift;

             $self->sequence_of( 'parse_statement' );
          }

       The interaction of failures in the code and the "commit" method is identical to that of "list_of".

   any_of
          $ret = $parser->any_of( @codes )

       Since version 0.06.

       Expects that one of the given code instances can parse  something  from  the  input,  returning  what  it
       returned.  Each  code  instance may indicate a failure to parse by calling the "fail" method or otherwise
       propagating a failure.  Each code instance may either be a CODE reference or a method  name  given  as  a
       string.

       This  may  be  considered  to  be similar to the "|" regexp operator for forming alternations of possible
       parse trees.

          sub parse_statement
          {
             my $self = shift;

             $self->any_of(
                sub { $self->parse_declaration; $self->expect(";") },
                sub { $self->parse_expression; $self->expect(";") },
                sub { $self->parse_block },
             );
          }

       If the code for a given choice fails (either by invoking "fail" itself, or by propagating a failure from
       another method it invoked) before it has invoked "commit" itself, then the parsing position is restored
       and the next choice will be attempted.

       If it calls "commit" then any subsequent failure for that choice will cause the entire "any_of" to  fail,
       propagating that to the caller and no further choices will be attempted.

       If none of the choices match then a simple failure message is printed:

          Found nothing parseable

       As this is unlikely to be helpful to users, a better message can be provided by the final choice instead.
       Don't forget to "commit" before printing the failure message, or it won't count.

          $self->any_of(
             'token_int',
             'token_string',
             ...,

             sub { $self->commit; $self->fail( "Expected an int or string" ) }
          );

   commit
          $parser->commit

       Calling  this  method  will  cancel  the  backtracking  behaviour  of  the  innermost "maybe", "list_of",
       "sequence_of", or "any_of" structure forming method.  That is, if  later  code  then  calls  "fail",  the
       exception  will  be  propagated  out  of "maybe", no further list items will be attempted by "list_of" or
       "sequence_of", and no further code blocks will be attempted by "any_of".

       Typically this will be called once the grammatical structure has been determined, ensuring that any
       further failures are raised as real exceptions, rather than by attempting other alternatives.

        sub parse_statement
        {
           my $self = shift;

           $self->any_of(
              ...
              sub {
                 $self->scope_of( "{",
                    sub { $self->commit; $self->parse_statements; },
                 "}" ),
              },
           );
        }

       Though in this common pattern, "committed_scope_of" may be used instead.

TOKEN PARSING METHODS

       The following methods attempt to consume some part of the input string, to be used as part of the parsing
       process.

   expect
          $str = $parser->expect( $literal )

          $str = $parser->expect( qr/pattern/ )

          @groups = $parser->expect( qr/pattern/ )

       Expects to find a literal string or regexp pattern match, and  consumes  it.   In  scalar  context,  this
       method  returns  the  string that was captured. In list context it returns the matching substring and the
       contents of any subgroups contained in the pattern.

       This method will raise a parse error (by calling "fail") if the regexp fails to match. Note that  if  the
       pattern  could match an empty string (such as for example "qr/\d*/"), the pattern will always match, even
       if it has to match an empty string. This method will not consider a failure if the  regexp  matches  with
       zero-width.
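
       The list-context form can be sketched as follows, using capture groups to pick apart a version
       string. The "vMAJOR.MINOR" format and the package name "My::Version" are assumptions for the example.

```perl
package My::Version;
use strict;
use warnings;
use base qw( Parser::MGC );

sub parse
{
   my $self = shift;

   # In list context: the whole match first, then each capture group
   my ( $whole, $major, $minor ) = $self->expect( qr/v(\d+)\.(\d+)/ );

   return { text => $whole, major => $major, minor => $minor };
}

package main;

my $version = My::Version->new->from_string( "v1.12" );
```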

   maybe_expect
          $str = $parser->maybe_expect( ... )

          @groups = $parser->maybe_expect( ... )

       Since version 0.10.

       A  convenient  shortcut  equivalent to calling "expect" within "maybe", but implemented more efficiently,
       avoiding the exception-handling set up by "maybe". Returns "undef" or an empty list if the match fails.
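
       A typical use is an optional suffix, as in this sketch of a number with an optional scale letter. The
       suffix letters, the %scale table and the package name "My::Scaled" are invented for illustration.

```perl
package My::Scaled;
use strict;
use warnings;
use base qw( Parser::MGC );

my %scale = ( k => 1_000, M => 1_000_000, G => 1_000_000_000 );

sub parse
{
   my $self = shift;

   my $n = $self->token_int;

   # undef when no suffix is present; nothing is consumed in that case
   my $unit = $self->maybe_expect( qr/[kMG]/ );

   return $n * ( defined $unit ? $scale{$unit} : 1 );
}

package main;

my $parser = My::Scaled->new;

my $scaled   = $parser->from_string( "10k" );   # 10000
my $unscaled = $parser->from_string( "10" );    # 10
```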

   substring_before
          $str = $parser->substring_before( $literal )

          $str = $parser->substring_before( qr/pattern/ )

       Since version 0.06.

       Expects to possibly find a literal string or regexp pattern match. If it  finds  such,  consume  all  the
       input  text before but excluding this match, and return it. If it fails to find a match before the end of
       the current scope, consumes all the input text until the end of scope and return it.

       This method does not consume the part of input  that  matches,  only  the  text  before  it.  It  is  not
       considered  a  failure if the substring before this match is empty. If a non-empty match is required, use
       the "fail" method:

          sub token_nonempty_part
          {
             my $self = shift;

             my $str = $self->substring_before( "," );
             length $str or $self->fail( "Expected a string fragment before ," );

             return $str;
          }

       Note that unlike most of the other token parsing methods, this method does not consume either leading or
       trailing whitespace around the substring. It is expected that this method would be used as part of a
       parser to read quoted strings, or similar cases where whitespace should be preserved.

   generic_token
          $val = $parser->generic_token( $name, $re, $convert )

       Since version 0.08.

       Expects to find a token matching the precompiled regexp $re. If provided, the $convert CODE reference can
       be used to convert the string into a more convenient form. $name is used in the failure  message  if  the
       pattern fails to match.

       If  provided,  the  $convert  function will be passed the parser and the matching substring; the value it
       returns is returned from "generic_token".

          $convert->( $parser, $substr )

       If not provided, the substring will be returned as it stands.

       This method is mostly provided for subclasses to define their own token types.  For example:

          sub token_hex
          {
             my $self = shift;
             $self->generic_token( hex => qr/[0-9A-F]{2}h/, sub { hex $_[1] } );
          }

   token_int
          $int = $parser->token_int

       Expects to find an integer in decimal, octal or hexadecimal notation, and consumes it. Negative integers,
       preceded by "-", are also recognised.

   token_float
          $float = $parser->token_float

       Since version 0.04.

       Expects to find a number expressed in floating-point notation; a sequence of digits possibly prefixed  by
       "-",  possibly  containing a decimal point, possibly followed by an exponent specified by "e" followed by
       an integer. The numerical value is then returned.

   token_number
          $number = $parser->token_number

       Since version 0.09.

       Expects to find a number expressed in either of the above forms.

   token_string
          $str = $parser->token_string

       Expects to find a quoted string, and consumes it. The string should be quoted  using  """  or  "'"  quote
       marks.

       The  content  of  the quoted string can contain character escapes similar to those accepted by C or Perl.
       Specifically, the following forms are recognised:

          \a               Bell ("alert")
          \b               Backspace
          \e               Escape
          \f               Form feed
          \n               Newline
          \r               Return
          \t               Horizontal Tab
          \0, \012         Octal character
          \x34, \x{5678}   Hexadecimal character

       C's "\v" for vertical tab is not supported as it is rarely used in practice and it collides  with  Perl's
       "\v" regexp escape. Perl's "\c" for forming other control characters is also not supported.

   token_ident
          $ident = $parser->token_ident

       Expects to find an identifier, and consumes it.

   token_kw
          $keyword = $parser->token_kw( @keywords )

       Expects to find a keyword, and consumes it. A keyword is defined as an identifier which is exactly one of
       the literal values passed in.
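
       Because the keyword must be a whole identifier, a longer identifier that merely begins with a keyword
       is rejected. A minimal sketch, with the hypothetical package name "My::Switch":

```perl
package My::Switch;
use strict;
use warnings;
use base qw( Parser::MGC );

sub parse
{
   my $self = shift;

   my $kw = $self->token_kw( qw( on off ) );

   return $kw eq "on" ? 1 : 0;
}

package main;

my $parser = My::Switch->new;

my $on  = $parser->from_string( "on" );    # 1
my $off = $parser->from_string( "off" );   # 0

# "online" is a single identifier that is not in the keyword list, so this fails
my $accepted = eval { $parser->from_string( "online" ); 1 };
```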

EXAMPLES

   Accumulating Results Using Variables
       Although  the  structure-forming  methods all return a value, obtained from their nested parsing code, it
       can sometimes be more convenient to use a variable to  accumulate  a  result  in  instead.  For  example,
       consider  the  following  parser  method, designed to parse a set of "name: "value"" assignments, such as
       might be found in a configuration file, or YAML/JSON-style mapping value.

          sub parse_dict
          {
             my $self = shift;

             my %ret;
             $self->list_of( ",", sub {
                my $key = $self->token_ident;
                exists $ret{$key} and $self->fail( "Already have a mapping for '$key'" );

                $self->expect( ":" );

                $ret{$key} = $self->parse_value;
             } );

             return \%ret
          }

       Instead of using the return value from "list_of", this  method  accumulates  values  in  the  %ret  hash,
       eventually returning a reference to it as its result. Because of this, it can perform some error checking
       while it parses; namely, rejecting duplicate keys.

TODO

       •   Make   unescaping   of   string   constants   more   customisable.   Possibly   consider   instead  a
           "parse_string_generic" using a loop over "substring_before".

       •   Easy ability for subclasses to define more token types as methods. Perhaps  provide  a  class  method
           such as

              __PACKAGE__->has_token( hex => qr/[0-9A-F]+/i, sub { hex $_[1] } );

       •   Investigate  how  well  "from_reader"  can cope with buffer splitting across other tokens than simply
           skippable whitespace

AUTHOR

       Paul Evans <leonerd@leonerd.org.uk>

perl v5.32.1                                       2021-09-02                                   Parser::MGC(3pm)