Provided by: rulex_3.8.6-1_amd64

NAME

       lexholder-ru - rulex database holding utility

SYNOPSIS

       lexholder-ru [options] <db_path>

DESCRIPTION

       lexholder-ru is a small utility intended for use from the command line or shell scripts. It allows one to
       construct, test, manage and query a lexical database, as well as extract its content in textual form.

       This database is primarily intended for use along with the Russian TTS engine ru_tts to provide stressing
       and pronunciation information for the Russian words.

       When filling and updating the database, new records are read from the standard input. When extracting
       data from the database, the result is printed to the standard output. This behaviour can be changed with
       the -f switch.

OPTIONS

       All options recognized on the command line are described below. For convenience they are arranged
       into several groups according to their function.

       The first group consists of options specifying the action to be performed. These options are mutually
       exclusive: only one action can be performed per invocation. If no action is specified, the program reads
       its standard input (or the file specified by the -f option) and stores its content in the database. The
       other actions are:

       -h
              Print a summary of options and exit. When this option is given, all other command line arguments
              are ignored. It is the only case in which the database path is not required.

       -l
              List database content in textual form. This action requires the dataset to be specified explicitly
              by one of the -X, -M, -G, -L, -P or -C options.

       -s <key>
              Search for the specified key in the lexical database. If the word is found, the program prints
              its pronunciation string and exits successfully; otherwise it prints the lowercased original
              word and exits with a non-zero exit code. This action is affected by the search mode options
              described below. If the -q switch is specified on the command line, nothing is printed to the
              standard output, but the exit code can still be used to find out whether the word was found.

       -b <key>
              Treat the specified word as an implicit form and discover the basic forms (if any) that could be
              used in the Implicit dictionary. Unless quiet mode is in use, all possible basic forms for the
              word are printed to the standard output (or to the file specified by the -f option) along with
              the numbers of the corresponding Classifiers. The program exits successfully if it can suggest
              at least one basic form for the specified word and returns a non-zero exit code otherwise. In
              quiet mode nothing is printed to the standard output, but the exit code can still be used to
              determine the result of the operation.

       -t <dictionary_file>
              Test the database against the specified dictionary. The test dictionary file is read line by
              line. Each line is treated as a record consisting of two fields separated by a space. The first
              field is a key word and the second one gives its pronunciation string. If this pronunciation
              string differs from the one obtained from the database, the record is printed to the standard
              output or written to the file specified by the -f option. Specifying "-" as the test dictionary
              file name causes the test records to be read from the standard input. This action is affected
              by the search mode options described below.

       -d <key>
              Delete the record for the specified key. This action requires the dataset to be specified
              explicitly by one of the -X, -M, -G, -L, -P or -C options. For rules, the number of the rule
              in its ruleset is used as the key.

       -D
              Discard the dataset. The dataset must be chosen by one of the -X, -M, -G, -L, -P or -C options.

       -c
              Clean the database by removing redundant entries from the dictionaries. By default, all records
              that certainly do not affect any search result are removed. These are the entries of the
              Implicit dictionary that do not represent any lexical base and the entries of the Explicit
              dictionary that merely duplicate the result of the usual lookup process. If one of the -X or -M
              options is specified as well, only the chosen dictionary is cleaned. If the Implicit dictionary
              is chosen, an extensive cleanup is performed on it, which can drop some useful records. Be
              careful.
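
       The exit-code convention of the -s action lends itself to scripting. The following is a minimal Python
       sketch, not part of rulex itself: the helper names build_search_command and lookup are hypothetical,
       and it assumes that the program's output is koi8-r encoded and that the key reaches the program in
       the encoding it expects (which may depend on the environment).

```python
import subprocess


def build_search_command(db_path, key, quiet=False):
    """Build the argument list for a -s lookup, following the SYNOPSIS:
    lexholder-ru [options] <db_path>."""
    cmd = ["lexholder-ru"]
    if quiet:
        cmd.append("-q")  # suppress output; only the exit code is used
    cmd += ["-s", key, db_path]
    return cmd


def lookup(db_path, key):
    """Return the pronunciation string, or None if the word was not found.

    Assumption: standard output is koi8-r encoded text.
    """
    result = subprocess.run(build_search_command(db_path, key),
                            capture_output=True)
    if result.returncode != 0:
        # On failure the program prints the lowercased original word,
        # which is not a pronunciation string, so it is discarded here.
        return None
    return result.stdout.decode("koi8-r").strip()
```

       With -q only the exit code is meaningful, so a caller interested merely in whether a word is known
       can run the command built with quiet=True and test the return code.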

       The next group of options is responsible for choosing the dataset. These options are mutually exclusive
       and affect deletion, insertion and listing operations. For listing and deletion the dataset must be
       specified explicitly. If none of these options is given when inserting new data, an appropriate
       dataset is chosen according to the input data. Only lexical data can be inserted in this manner.
       For rules, the target dataset must be specified explicitly.

       -X
              Explicit dictionary.

       -M
              Implicit dictionary.

       -G
              General rules.

       -L
              Lexical classification rules.

       -P
              Prefix detection rules.

       -C
              Correction rules.

       The next group contains options that specify the search mode. These options affect the search and test
       operations. By default (no options) a full search is performed; otherwise only the stages specified
       explicitly are included in the search process.

       -x
              Search in the explicit dictionary.

       -m
              Try to treat the word as an implicit form.

       -g
              Try to apply general rules.
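
       The way the search mode options select stages can be pictured with a short Python sketch. It is only
       an illustration of the behaviour described above, not the program's actual code: the function search
       and its parameters are hypothetical, and the implicit-form and general-rule stages are passed in as
       plain callables.

```python
def search(word, explicit, implicit_lookup, general_guess, stages=None):
    """Run the search stages in order and return the first result.

    stages is a set drawn from {"x", "m", "g"} (mirroring -x, -m, -g);
    None or an empty set means a full search.
    """
    if not stages:
        stages = {"x", "m", "g"}
    if "x" in stages and word in explicit:
        return explicit[word]
    if "m" in stages:
        result = implicit_lookup(word)
        if result is not None:
            return result
    if "g" in stages:
        return general_guess(word)
    return None  # no enabled stage produced a result


# Stub stages, for illustration only.
explicit = {"кофе": "ко+фе"}
no_implicit = lambda word: None
stub_guess = lambda word: "guess:" + word
```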

       The next group contains only one option, which affects the insertion of new data into the lexical database.

       -r
              Replace mode. For a dictionary, this mode causes new records to replace existing ones with
              the same key. By default such records are ignored. For rules, this mode means that the ruleset
              content is fully replaced by the new data. Otherwise new rules are appended to the ruleset.
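
       The difference between the default and replace modes can be summarised in a small Python sketch. The
       functions insert_records and insert_rules are hypothetical stand-ins that model a dictionary as a
       Python dict and a ruleset as a list; they only mirror the semantics described for -r.

```python
def insert_records(dictionary, records, replace=False):
    """Insert (key, pronunciation) pairs; return the keys that were skipped."""
    skipped = []
    for key, pronunciation in records:
        if key in dictionary and not replace:
            skipped.append(key)  # default mode: duplicates are ignored
            continue
        dictionary[key] = pronunciation
    return skipped


def insert_rules(ruleset, new_rules, replace=False):
    """Append new rules, or replace the whole ruleset in replace mode."""
    if replace:
        ruleset[:] = new_rules
    else:
        ruleset.extend(new_rules)
```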

       The last group contains several options affecting program behaviour in general.

       -f <file>
              Use specified file instead of standard input or output.

       -q
              Be quieter than usual: print neither search results nor warnings about duplicate records.

       -v
              Be  more  verbose  than  usual: print messages about work stages and final statistical information
              when finishing.

DATA REPRESENTATION

       Externally all the data are represented textually. For the Russian letters the koi8-r  character  set  is
       used and only lower case is allowed.

       The database itself consists of two dictionaries and four sets of rules. The Explicit dictionary contains
       words that are described individually and do not imply any information about other forms. This
       dictionary is looked up first if the search includes this stage. The Implicit dictionary contains words
       in some basic form. This dictionary is used to construct the pronunciation string for various forms of
       these words. The basic form of a word is guessed according to the rules from the Classifiers and Prefix
       detectors rulesets. This is the second stage of the search process. If these stages do not produce a
       result or are not performed, the rules from the General ruleset are used to guess the stress position in
       the word. If none of these rules can be applied, then no guess is made and the search fails. By default,
       all three stages are performed, but it can be specified explicitly which ones should be taken into account.

       Externally, dictionary data are represented by text lines consisting of two fields separated by a space.
       The first field is a Russian word. It serves as the key when searching. Only lowercase Russian letters
       are allowed here. The second field provides the pronunciation string for this word. The pronunciation
       string is the word itself, but written the way it should be pronounced. Three additional symbols are
       allowed in the pronunciation string along with the lowercase Russian letters. The "+" sign marks the
       stressed letter and should be placed immediately after that letter. The "=" sign is used in some cases
       in the same manner to mark a so-called weak stress. The "-" sign can serve as a separator in some
       complex words. All other symbols are treated as illegal.
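
       As an illustration of this record format, the following Python sketch checks a dictionary line and
       extracts the stressed letters. It is a hedged example, not the validation performed by lexholder-ru:
       the names parse_record and stressed_letters are hypothetical, the text is handled as Unicode rather
       than koi8-r, and "lowercase Russian letters" is assumed to mean the range а-я plus ё.

```python
import re

KEY_RE = re.compile(r"[а-яё]+")            # key: lowercase Russian letters only
PRONUNCIATION_RE = re.compile(r"[а-яё+=-]+")  # letters plus "+", "=" and "-"


def parse_record(line):
    """Split a dictionary line into (key, pronunciation) or raise ValueError."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 2:
        raise ValueError("expected two fields separated by a space")
    key, pronunciation = fields
    if not KEY_RE.fullmatch(key):
        raise ValueError("illegal symbol in key")
    if not PRONUNCIATION_RE.fullmatch(pronunciation):
        raise ValueError("illegal symbol in pronunciation string")
    return key, pronunciation


def stressed_letters(pronunciation):
    """Return the letters that are immediately followed by a "+" sign."""
    return [pronunciation[i - 1]
            for i, ch in enumerate(pronunciation)
            if ch == "+" and i > 0]
```

       Data decoded from a koi8-r file with bytes.decode("koi8-r") can be passed to these functions
       unchanged.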

       There are four rulesets in the database: General rules, Classifiers,  Prefix  detectors  and  Correctors.
       Externally all these rules are represented by strings consisting of one or two fields separated by space.
       The first field always contains a regular expression which is matched against the word to make a decision
       whether this rule can be applied.

       The only task of the General rules is to guess the stress in a word when dictionary lookup fails. The
       rules are tried sequentially until one matches or the list is exhausted. If a match succeeds, the "+"
       sign is inserted into the word right after the first subexpression match to mark the stress position.
       These rules do not contain a second field.
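
       The insertion of the stress mark can be sketched in Python as follows. This is an approximation under
       stated assumptions: Python's re module stands in for whatever regular expression dialect the database
       actually uses, the function apply_general_rules is hypothetical, and the two sample rules are invented
       for illustration and are not taken from the rulex ruleset.

```python
import re


def apply_general_rules(word, rules):
    """Try each rule in order; insert "+" after the first subexpression match.

    Returns the stressed word, or None if no rule can be applied.
    """
    for pattern in rules:
        match = re.search(pattern, word)
        if match is None or match.re.groups < 1 or match.group(1) is None:
            continue
        position = match.end(1)  # end of the first subexpression match
        return word[:position] + "+" + word[position:]
    return None


# Invented sample rules: stress a final "о" before "й",
# otherwise stress the first vowel of the word.
SAMPLE_RULES = [
    r"^(.*о)й$",
    r"^([^аеёиоуыэюя]*[аеёиоуыэюя])",
]
```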

       For the Classifiers ruleset, the rules are checked one by one until a match occurs. Then the part from
       the beginning of the word through to the end of the first subexpression match is extracted, and if a
       second field is present, it is appended to the extracted part as a suffix. The resulting string is
       treated as a basic form of the word and is looked up in the Implicit dictionary. If nothing is found,
       the process continues until the ruleset is exhausted.
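
       A rough Python sketch of this procedure is given below. It is not the program's implementation: the
       function find_basic_form is hypothetical, Python's re module is assumed as the regular expression
       engine, and the sample classifiers and dictionary entry are invented. The sketch stops at finding the
       basic form and does not attempt to construct the pronunciation of the original word.

```python
import re


def find_basic_form(word, classifiers, implicit):
    """Return (basic_form, dictionary_entry) for the first classifier whose
    derived basic form is present in the Implicit dictionary, else None."""
    for pattern, suffix in classifiers:
        match = re.search(pattern, word)
        if match is None or match.re.groups < 1 or match.group(1) is None:
            continue
        # Part of the word from its beginning to the end of subexpression 1,
        # followed by the optional suffix from the rule's second field.
        basic_form = word[:match.end(1)] + (suffix or "")
        if basic_form in implicit:
            return basic_form, implicit[basic_form]
    return None


# Invented sample data.
SAMPLE_CLASSIFIERS = [(r"^(.+)ами$", "а"), (r"^(.+)ы$", None)]
SAMPLE_IMPLICIT = {"мама": "ма+ма"}
```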

       When nothing is found in the database for a word in its original form, the Prefix detection rules are
       applied to it sequentially until a match occurs. The matched prefix is stripped and replaced by the
       replacement string, if any. The resulting word is then looked up in the Implicit dictionary. If the
       lookup succeeds, the original prefix is restored in the pronunciation string.

       The rules from the Correctors ruleset are applied to the pronunciation strings rather than to the
       original words. The second field in these rules specifies a replacement string in which digits serve
       as subexpression numbers.
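
       One possible reading of the replacement string syntax is sketched below in Python. Several points are
       assumptions, since the text above does not settle them: that the replacement substitutes only the
       matched portion of the pronunciation string, that every rule is applied in order, and that Python's re
       module behaves like the actual regular expression engine. The function names and the sample rule are
       hypothetical.

```python
import re


def expand_replacement(template, match):
    """Expand a replacement string in which each ASCII digit stands for
    the text of the subexpression with that number."""
    parts = []
    for ch in template:
        if ch in "0123456789":
            parts.append(match.group(int(ch)) or "")
        else:
            parts.append(ch)
    return "".join(parts)


def apply_correctors(pronunciation, correctors):
    """Apply each (pattern, replacement) rule to the pronunciation string."""
    for pattern, template in correctors:
        pronunciation = re.sub(
            pattern,
            lambda m, t=template: expand_replacement(t, m),
            pronunciation,
        )
    return pronunciation


# Invented sample rule: write "ч" as "ш" before "то", keeping subexpression 2.
SAMPLE_CORRECTORS = [(r"(ч)(то)", "ш2")]
```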

SEE ALSO

       ru_tts(1), /usr/share/doc/rulex/README.

AUTHOR

       Igor B. Poretsky <poretsky@mlbox.ru>.

                                                October 28, 2006                                 LEXHOLDER-RU(1)