Ubuntu Manpage: frog - Dutch Natural Language Toolkit

NAME

       frog - Dutch Natural Language Toolkit

SYNOPSIS

       frog [-t] test-file

       frog [options]

DESCRIPTION

       Frog  is  an  integration of memory‐-based natural language processing (NLP) modules developed for Dutch.
       Frog's current version will (optionally) tokenize,  tag,  lemmatize,  and  morphologically  segment  word
       tokens in Dutch text files, add IOB chunks, add Named Entities and will assign a dependency graph to each
       sentence.

OPTIONS

       -c <file>  or --config=<file>
              set the configuration using 'file'.

              you can use -c lang/config-file to select the 'config-file' for an installed language 'lang'

       --debug=<module><level>,...
              set  debug  level  per module, indicated by a single letter: Tagger (T), Tokenizer (t), Lemmatizer
              (l), Morphological Analyzer (a), Chunker (c), Multi‐Word Units (m), Named Entity Recognition  (n),
              or Parser (p). Different modules must be separated by commas.

              (e.g. --debug=l5,n3 sets the level for the Lemmatizer to 5 and for the NER to 3 )

              Debugging lines are written to a file frog.<number>.debug
       The name of that file is given at the end of the run.

       -d <level>
              set a global debug level for all modules at once.

       --deep‐morph
              Add  a deep morphological analysis to the 'Tabbed' and JSON output.  This analysis is added to XML
              unconditionally.

       --compounds
              include compound information as a column to 'Tabbed' output.  This information is added to XML and
              JSON unconditionally.

       -e <encoding>
              set input encoding. (default UTF8)

       -h or --help
              give some help

       --language=<comma separated list of languages>
              Set the languages to work on. This parameter is passed to the tokenizer.  The strings are  assumed
              to be ISO 639-2 codes.

              The first language in the list will be the default, unspecified languages are asumed to be of that
              default.

              e.g.  --language=nld,eng,por  means:  detect  Dutch,  English and Portuguese, with Dutch being the
              default, using TextCat. Mainly useful for XML processing.

              Specifying a unsupported language is a fatal error. However, you  can  add  the  special  language
              'und'  which assures that sentences in an unknown languages will be labeled as such, and processed
              no further.

              IMPORTANT Frog can at the moment handle only one language at a time, as determined by  the  config
              file.  So  other  languages  mentioned  here will be tokenized correctly, but further they will be
              handled as that language.

       -n
              assume inputfile to have one sentence per line. (newline separators)

              Very useful when running interactive, otherwise an empty line is needed to signal end of input.

       --nostdout
              suppress the 'Tabbed' or JSON output to stdout. (when no  outputfile  was  specified  with  -o  or
              --outputdir)

              Especially useful when XML output is specified with -X or --xmldir.

       -o <file>
              send  'Tabbed'  output  to  'file'  instead  of stdout. Defaults to the name of the inputfile with
              '.out' appended.

       --outputdir <dir>
              send all 'Tabbed' or  JSON  output  to  'dir'  instead  of  stdout.  Creates  filenames  from  the
              inputfilename(s) with '.out' appended.

       --retry
              assume  a  re-run  on  the same input file(s). Frog wil only process those files that haven't been
              processed yet.

       --skip=[tlacnmp]
              skip parts of the process: Tokenizer (t), Lemmatizer (l), Morphological Analyzer (a), Chunker (c),
              Named Entity Recognition (n), Multi-Word Units (m) or Parser (p).

              The Tagger cannot be skipped.

              Skipping the Multiword Unit implies disabling the Parser too.

       --alpino
              Use a locally installed Alpino parser. Disables our build-in Dependency parser

       --alpino=server
              use a remote installed Alpino server, as specified in the frog configuration file.

       -S <port>
              Run Frog as a server on 'port'

       -t <file>
              process 'file'.

              This option can be omitted. Frog will run on any <file> found on the qcommand-line.  Wildcards are
              allowed too. When NO files are specified, Frog will start in interactive mode.

              Files with the extension '.gz' or '.bz2' are handled too. The corresponding output-files  will  be
              compressed using the same compression again. Except when an explicit output filename is specified.

       -x <xmlfile>
              process  'xmlfile',  which  is  supposed  to  be  in  FoLiA  format!  If  'xmlfile'  is empty, and
              --testdir=<dir> is provided, all '.xml' files in 'dir' will be processed as FoLia XML.

              This option can be omitted. Frog will process files with the 'xml' extension as FoLiA files.

              Files with the extension '.xml.gz' or '.xml.bz2' are handled too. The  corresponding  output-files
              will  be  compressed using the same compression again.  Except when an explicit output filename is
              specified.

       -X <xmlfile>
              When 'xmlfile' is specified, create a FoLiA XML output file with that name.

              When 'xmlfile' is empty, generate FoLiA XML output for every inputfile.

       --textclass=<cls>
              When -x is given, use 'cls' to find AND store text in the FoLiA document(s).   Using  --inputclass
              and --ptclass is in general a better choice.

       --inputclass=<cls>
              use 'cls' to find text in the FoLiA input document(s).

       --outputclass=<cls>
              use  'cls'  to  output text in the FoLiA input document(s).  Preferably this is another class then
              the inputclass.

       --testdir=<dir>
              process all files in 'dir'. When the input mode is XML, only '.xml' files, --outputdir

       --uttmarker=<mark>
              assume all utterances are separated by 'mark'. (the default is none).

       --threads=<n>
              use a maximum of 'n' threads. The default is to take whatever is needed.  In servermode we  always
              run on 1 thread per session.

       -V or --version
              show version info

       --xmldir=<dir>
              generate  FoLiA  XML  output  and  send it to 'dir'. Creates filenames from the inputfilename with
              '.xml' appended. (Except when it already ends with '.xml')

       -X <file>
              generate FoLiA XML output and send it to 'file'. Defaults to the name  of  the  inputfile(s)  with
              '.xml' appended. (Except when it already ends with '.xml')

       --id=<id>
              When  -X for FoLia is given, use 'id' to give the doc an ID. The default is an xml:id based on the
              filename.

       --allow-word-corrections
              Allow the ucto tokenizer to apply simple corrections on words while processing FoLiA output.   For
              instance splitting punctuation.

       --max-parser-tokens=<num>
              Limit the size of sentences to be handled by the Parser. (Default 500 words).

              The Parser is very memory consuming. 500 Words will already need 16Gb of RAM.

       --JSONin
              The input is in JSON format. Mainly for Server mode, but works on files too.

              This implies --JSONout too!

       --JSONout
              Output will be in JSON instead of 'Tabbed'.

       --JSONout=<indent>
              Output will be in JSON instead of 'Tabbed'. The JSON will be idented by value
               'indent'. (Default is indent=0. Meaning al the JSON will be on 1 line)

       -T or --textredundancy=[full|medium|none]
              Set  the  text  redundancy level in the tokenizer for text nodes in FoLiA output: full add text to
              all levels: <p> <s> <w> etc.  minimal don't introduce text on higher levels, but  retain  what  is
              already there.  none only introduce text on <w>, AND remove all text from higher levels

       --override=<section>.<parameter>=<value>
              Override a parameter from the configuration file with a different value.

              This option may be repeated several times.

BUGS

       likely

AUTHORS

       Maarten van Gompel

       Ko van der Sloot

       Antal van den Bosch

       e-mail: lamasoftware@science.ru.nl

NAME

SYNOPSIS

DESCRIPTION

OPTIONS

BUGS

AUTHORS

SEE ALSO