Ubuntu Manpage: Lingua::EN::Sentence - split text into sentences

Provided by: liblingua-en-sentence-perl_0.31-1.1_all

NAME

       Lingua::EN::Sentence - split text into sentences

SYNOPSIS

               use Lingua::EN::Sentence qw( get_sentences add_acronyms );

               add_acronyms('lt','gen');               ## adding support for 'Lt. Gen.'
               my $sentences=get_sentences($text);     ## Get the sentences.
               foreach my $sentence (@$sentences) {
                       ## do something with $sentence
               }

DESCRIPTION

       The "Lingua::EN::Sentence" module contains the function get_sentences, which splits text into its
       constituent sentences, based on a regular expression and a list of abbreviations (built in and given).

       Certain well-known exceptions, such as abbreviations, may cause incorrect segmentation. Many of them are
       already built into this code and are handled correctly. Still, if you find words that cause the
       get_sentences function to split incorrectly, you can add them with add_acronyms so that the module
       recognizes them.

ALGORITHM

       Basically, I use a 'brute' regular expression to split the text into sentences. (Well, nothing is split
       yet - I just mark the end-of-sentence.) Then I apply a set of rules which decide when an end-of-sentence
       mark is justified and when it is a mistake. In case of a mistake, the end-of-sentence mark is removed.

       What are such mistakes? Cases of abbreviations, for example. I have a list of such abbreviations (please
       see the 'Acronym/Abbreviations list' section below), and more general rules (for example, the
       abbreviations 'i.e.' and 'e.g.' need not be in the list, as a special rule takes care of all
       single-letter abbreviations).
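
       The mark-then-unmark idea described above can be sketched in plain Perl. This is only an illustration,
       not the module's actual code: the marker value, the short abbreviation list and the function name
       split_sentences_sketch are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical marker and abbreviation list, chosen for this sketch only.
my $EOS  = "\001";
my @ABBR = qw( mr mrs dr prof lt gen );

sub split_sentences_sketch {
    my ($text) = @_;
    return [] unless defined $text;

    # Step 1: brute-force marking. Replace the white space after every run
    # of sentence-ending punctuation with the end-of-sentence marker.
    $text =~ s/([.?!]+)\s+/$1$EOS/g;

    # Step 2a: a mark right after a listed abbreviation is a mistake.
    my $abbr = join '|', @ABBR;
    $text =~ s/\b($abbr)\.\Q$EOS\E/$1. /gi;

    # Step 2b: a mark right after single-letter abbreviations such as
    # "i.e." or "U.S.A." is a mistake as well.
    $text =~ s/\b((?:[A-Za-z]\.)+)\Q$EOS\E/$1 /g;

    # Step 3: split on the remaining marks, trim, and drop pieces that
    # contain no alphanumeric characters.
    my @sentences;
    for my $piece ( split /\Q$EOS\E/, $text ) {
        $piece =~ s/^\s+|\s+$//g;
        push @sentences, $piece if $piece =~ /[[:alnum:]]/;
    }
    return \@sentences;
}

my $demo = split_sentences_sketch(
    "Dr. Smith met Lt. Gen. Jones, i.e. the commander. They talked! Was it useful?"
);
print "$_\n" for @$demo;
```

       With this input the sketch yields three sentences: the marks after "Dr.", "Lt.", "Gen." and "i.e." are
       removed, while those after "commander." and "talked!" survive.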

FUNCTIONS

       All functions used should be requested in the 'use' clause. None is exported by default.

       get_sentences( $text )
           The get_sentences function takes a scalar containing ASCII text as an argument and returns a
           reference to an array of the sentences that the text has been split into. Leading and trailing
           white space is trimmed from each returned sentence. Strings that contain no alphanumeric
           characters are not returned as sentences.

       add_acronyms( @acronyms )
           This function is used for adding acronyms not supported by this code.  The input  should  be  regular
           expressions for matching the desired acronyms, but should not include the final period ("."). So, for
           example,  "blv?d" matches "blvd." and "bld.". "a\.mlf" will match "a.mlf.". You do not need to bother
           with  acronyms  consisting  of  single  letters  and  dots  (e.g.  "U.S.A."),  as  these  are   found
           automatically. Note also that acronyms are searched for on a case insensitive basis.

           Please  see  'Acronym/Abbreviations  list'  section  for  the abbreviations already supported by this
           module.
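
           To see what the pattern rules above mean in practice, the following stand-alone sketch tests
           words against acronym patterns written the way add_acronyms expects them (no final period, matched
           case-insensitively). The helper matches_acronym is hypothetical and exists only for this example;
           it is not part of the module and does not show how the module matches internally.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word (including its final period) is
# matched by one of the acronym patterns, ignoring case.
sub matches_acronym {
    my ( $word, @patterns ) = @_;
    for my $pattern (@patterns) {
        return 1 if $word =~ /^(?:$pattern)\.$/i;
    }
    return 0;
}

# Patterns are regular expressions without the final period.
my @patterns = ( 'blv?d', 'a\.mlf' );

for my $word ( 'blvd.', 'Bld.', 'a.mlf.', 'road.' ) {
    my $verdict = matches_acronym( $word, @patterns ) ? 'acronym' : 'no match';
    print "$word: $verdict\n";
}
```

           Here "blvd.", "Bld." and "a.mlf." are reported as acronyms, and "road." is not.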

       get_acronyms( )
           This function will return the defined list of acronyms.

       set_acronyms( @my_acronyms )
           This function replaces the predefined acronym list  with  the  given  list.  See  "add_acronyms"  for
           details on the input specifications.
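
           Because set_acronyms discards the predefined list, it can be useful to save that list first and
           restore it afterwards. A minimal sketch, assuming get_acronyms returns a plain list as described
           above; the sample text is made up for this example.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences get_acronyms set_acronyms );

# Keep a copy of the predefined list before replacing it.
my @saved = get_acronyms();

# From here on, only these two patterns are in the acronym list.
set_acronyms( 'lt', 'gen' );
my $sentences = get_sentences("Lt. Gen. Smith arrived. He sat down.");
print "$_\n" for @$sentences;

# Put the predefined list back.
set_acronyms(@saved);
```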

       get_EOS( )
           This  function  returns  the value of the string used to mark the end of sentence.  You might want to
           see what it is, and to make sure your text doesn't contain it.  You can use set_EOS()  to  alter  the
           end-of-sentence string to whatever you desire.

       set_EOS( $new_EOS_string )
           This function alters the end-of-sentence string used to mark the end of sentences.
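
           The check suggested under get_EOS() might look like the following sketch. The replacement marker
           "\002" is an arbitrary choice for this example, not a value defined by the module, and the sample
           text is made up.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences get_EOS set_EOS );

my $text = "First sentence. Second sentence.";

# If the input already contains the current marker, switch to another
# string before splitting. "\002" is an arbitrary pick for this sketch.
if ( index( $text, get_EOS() ) >= 0 ) {
    set_EOS("\002");
}

my $sentences = get_sentences($text);
print "$_\n" for @$sentences;
```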

       set_locale( $new_locale )
           Receives a language locale in the form language.country.character-set, for example
           "fr_CA.ISO8859-1" for Canadian French using character set ISO8859-1.

           Returns a reference to a hash containing the current locale formatting values. Returns undef if
           given undef.

           The following will set the LC_COLLATE behaviour to Argentinian Spanish. NOTE: The naming and
           availability of locales depend on your operating system. Please consult the perllocale manpage for
           how to find out which locales are available on your system.

           $loc = set_locale( "es_AR.ISO8859-1" );

           This actually does this:

           $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );

Acronym/Abbreviations list

       You can use the get_acronyms() function to retrieve the list of acronyms. The list has become too long
       to reproduce in the documentation.

       If I come across a good general-purpose list, I'll incorporate it into this module. Feel free to
       suggest such lists.

FUTURE WORK

               [1] Object Oriented like usage
               [2] Supporting more than just English/French
               [3] Code optimization. Currently everything is based on regular expressions, and they are not well optimized
               [4] Possibly use more semantic heuristics for detecting a beginning of a sentence

REPOSITORY

       <https://github.com/kimryan/Lingua-EN-Sentence>

AUTHOR

       Shlomo Yona shlomo@cs.haifa.ac.il

       Currently being maintained by Kim Ryan, kimryan at CPAN d o t org

COPYRIGHT AND LICENSE

       Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.   Copyright  (c)  2018  Kim  Ryan.  All  rights
       reserved.

       This  library  is  free  software;  you can redistribute it and/or modify it under the same terms as Perl
       itself.

perl v5.32.0                                       2020-12-29                          Lingua::EN::Sentence(3pm)
