Provided by: ucto_0.35-2build1_amd64

NAME

       ucto - Unicode Tokenizer

SYNOPSIS

       ucto [options] [input‐file] [output‐file]

DESCRIPTION

       ucto  tokenizes  text  files:  it  separates  words  from  punctuation,  splits sentences (and optionally
       paragraphs), and finds paired  quotes.   Ucto  is  preconfigured  with  tokenisation  rules  for  several
       languages.

       Those rules are provided by uctodata.

OPTIONS

       -c configfile
              read settings from a 'configfile'

       -B
              run in batch mode. Process all input files to an output directory specified with -O.

       -d value
              set debug mode to 'value'

       -e value
              set input encoding. (default UTF8)

       -I value
              set the input directory to 'value'. (batch mode only)

       -O value
              set the output directory to 'value'. (Required for batch mode)

       -N value
              set UTF8 output normalization. (default NFC)
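
              As an illustration of what NFC normalization means, the following Python sketch uses the
              standard unicodedata module. It is not part of ucto and only shows the effect of the
              default normalization form on a sample string chosen for this example.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent (two code points).
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True
```

              Both strings render identically, but only the NFC form compares equal to the precomposed
              character, which is why a fixed output normalization simplifies downstream matching.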

       --filter=[YES|NO]
              enable or disable filtering of special characters (default YES). These special characters can
              be specified in the [FILTER] block of the configuration file.

       -L language
              Automatically selects a configuration file by language code.  The language  code  is  generally  a
              three-letter  ISO 639-3  code.   For  example,  'fra'  will select the file tokconfig‐fra from the
              installation directory.

       --detectlanguages=<lang1,lang2,..langn>
              try to detect all the specified languages. The default language will be 'lang1' (only useful for
              FoLiA output).

              All values must be iso-639-3 codes.

              You can also use the special language code `und`. This ensures there is NO default  language,  and
              any language that is NOT in the list will remain unanalyzed.

              Warning:  To  be able to handle utterances of mixed language, Ucto uses a simple sentence splitter
              based on the markers '.' '?' and '!'.  This may occasionally lead to surprising results.
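
              To make the warning concrete, here is a minimal Python sketch of a splitter that breaks after
              every '.', '?' or '!' followed by whitespace. The function naive_split and the sample text
              are hypothetical and are NOT ucto's implementation; they only show the kind of surprise such
              a simple rule can produce around abbreviations.

```python
import re

def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

sample = "Dr. Smith arrived at 5 p.m. today. Was he late? No!"
for sentence in naive_split(sample):
    print(sentence)
```

              In this sketch the abbreviations "Dr." and "p.m." each end a "sentence", so the sample is cut
              into five pieces instead of three.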

       -l
              Convert output text to all lowercase

       -u
              Convert all input text to all uppercase

       -n
              Emit one sentence per line on output

       -m
              Assume one sentence per line on input

       --normalize=class1,class2,..,classn
              map all occurrences of tokens with class1,...,classn to their generic names. E.g. --normalize=DATE
              will map all dates to the word {{DATE}}. Very useful to normalize tokens like URLs, dates,
              e-mail addresses and so on.
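
              The mapping can be pictured with a small Python sketch. It assumes tokens have already been
              paired with a class label; the function normalize_tokens, the sample pairs and the class
              names WORD and URL are hypothetical (only DATE is taken from the description above), and
              ucto's own rules decide the real classes.

```python
def normalize_tokens(tagged_tokens, classes):
    """Replace tokens whose class is in `classes` by '{{CLASS}}'."""
    return [
        "{{" + cls + "}}" if cls in classes else text
        for text, cls in tagged_tokens
    ]

# Hypothetical (token, class) pairs; ucto assigns classes from its own rules.
tagged = [("Meet", "WORD"), ("on", "WORD"), ("2024-04-11", "DATE"),
          ("at", "WORD"), ("https://example.org", "URL")]

print(" ".join(normalize_tokens(tagged, {"DATE"})))
# Meet on {{DATE}} at https://example.org
```

              Only the classes that were requested are replaced; the URL token is left untouched because
              URL was not in the requested set.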

       -T value or --textredundancy=value
              set text redundancy level for text nodes in FoLiA output:
               'full'    - add text to all levels: <p> <s> <w> etc.
               'minimal' - don't introduce text on higher levels, but retain what is already there.
               'none'    - only introduce text on <w>, AND remove all text from higher levels

       --allow-word-correction
              Allow ucto to tokenize inside FoLiA Word elements, creating FoLiA Corrections

       --ignore-tag-hints
              Skip all tag=token hints from the FoLiA input. These hints can be used to signal text markup  like
              subscript and superscript

       --add-tokens="file"
              Add  additional tokens to the [TOKENS] block of the default language.  The file should contain one
              TOKEN per line.

       --passthru
              Don't tokenize, but perform input decoding and simple token role detection

       --filterpunct
              remove most of the punctuation from the output. (not from abbreviations and embedded punctuation
              like John's)

       -P
              Disable Paragraph Detection

       -Q
              Enable Quote Detection. (this is experimental and may lead to unexpected results)

       -s <string>
              Set End‐of‐sentence marker. (Default <utt>)

       -V or --version
              Show version information

       -v
              set Verbose mode

       -F
              The input file(s) are assumed to be FoLiA XML. Text in the correct 'inputclass' will be tokenized.
              For files with an '.xml' extension, -F is the default.

              In batch mode, this restricts selection to files with the '.xml' extension from the input
              directory.

       --inputclass="cls"
              When tokenizing a FoLiA XML document, search for text  nodes  of  class  'cls'.   The  default  is
              "current".

       --outputclass="cls"
              When  tokenizing  a  FoLiA  XML  document, output the tokenized text in text nodes with 'cls'. The
              default is "current".  It is recommended to have different classes for input and output.

       --textclass="cls" (obsolete)
              use 'cls' for input and output of text from FoLiA. Equivalent to both --inputclass='cls' and
              --outputclass='cls'.

              This option is obsolete and NOT recommended. Please use the separate --inputclass and
              --outputclass options.

       --copyclass
              When ucto is used on FoLiA with fully tokenized text in the input class, no text in the
              output class is produced (a warning is given). To circumvent this, add the --copyclass
              option, which ensures that text is emitted in that class.

       -X
              All output will be FoLiA XML. Document IDs are autogenerated.

              Works in batch mode too.

       --id <DocId>
              Use the specified Document ID for the FoLiA XML. (not allowed in batch mode) When not provided, a
              document ID is generated based on the name of the input file.
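
       The constraints stated above (-O is required for batch mode, -I is for batch mode only, --id is not
       allowed in batch mode) can be summarised in a small Python sketch that assembles an argument list.
       The helper build_ucto_args, its parameter names and the directory names are hypothetical; only the
       option letters come from this page, and the sketch does not run ucto itself.

```python
def build_ucto_args(language, *, batch=False, input_dir=None, output_dir=None,
                    folia_out=False, doc_id=None, files=()):
    """Assemble an argument list for ucto using only options described above."""
    args = ["ucto", "-L", language]
    if batch:
        # Batch mode needs an output directory and cannot take a document ID.
        if output_dir is None:
            raise ValueError("-O is required in batch mode")
        if doc_id is not None:
            raise ValueError("--id is not allowed in batch mode")
        args += ["-B", "-O", output_dir]
        if input_dir is not None:
            args += ["-I", input_dir]
    elif input_dir is not None:
        raise ValueError("-I applies to batch mode only")
    if folia_out:
        args.append("-X")
    if doc_id is not None:
        args += ["--id", doc_id]
    args += list(files)
    return args

print(build_ucto_args("nld", batch=True, input_dir="in", output_dir="out",
                      folia_out=True))
# ['ucto', '-L', 'nld', '-B', '-O', 'out', '-I', 'in', '-X']
```

       The resulting list could be handed to a process launcher such as subprocess.run; whether a
       configuration for a given language code is installed depends on uctodata.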

BUGS

       likely

AUTHORS

       Maarten van Gompel

       Ko van der Sloot

       e-mail: lamasoftware@science.ru.nl

                                                   2024 apr 11                                           ucto(1)