Ubuntu Manpage: Lingua::EN::Sentence - split text into sentences

Provided by: liblingua-en-sentence-perl_0.34-1_all

NAME

       Lingua::EN::Sentence - split text into sentences

SYNOPSIS

       use Lingua::EN::Sentence qw( get_sentences add_acronyms );

       add_acronyms('lt','gen');          ## adding support for 'Lt. Gen.'

       my $text = q{
       A sentence usually ends with a dot, exclamation or question mark optionally followed by a space!
       A string followed by 2 carriage returns denotes a sentence, even though it doesn't end in a dot

       Dots after single letters such as U.S.A. or in numbers like -12.34 will not cause a split
       as well as common abbreviations such as Dr. I. Smith, Ms. A.B. Jones, Apr. Calif. Esq.
       and (some text) ellipsis such as ... or . . are ignored.
       Some valid cases cannot be detected, such as the answer is X. It cannot easily be
       differentiated from the single letter-dot sequence to abbreviate a person's given name.
       Numbered points within a sentence will not cause a split 1. Like this one.
       See the code for all the rules that apply.
       This string has 7 sentences.
       };

       my $sentences = get_sentences($text);      ## get the sentences
       if (defined($sentences)) {
            my $i = 0;
            foreach my $sent (@$sentences) {
                 $i++;
                 print("SENTENCE $i:$sent\n");
            }
       }

DESCRIPTION

       The "Lingua::EN::Sentence" module contains the function get_sentences, which splits text into its
       constituent sentences, based on a regular expression and a list of abbreviations (built in and given).

       Certain well-known exceptions, such as abbreviations, may cause incorrect segmentation. Some of them
       are already built into this code and are handled correctly. If you still find words that cause the
       get_sentences function to split in the wrong place, you can add them with add_acronyms so that the
       module recognises them. Note that abbreviations are case sensitive, so 'Mrs.' is recognised but not
       'mrs.'.

ALGORITHM

       The first step is to mark the dot ending an abbreviation by changing it to a special character, so
       that it cannot cause a sentence split. The original dot is restored after the sentences are split.

       Basically, I use a 'brute' regular expression to split the text into sentences.  (Well, nothing is yet
       split - I just mark the end-of-sentence). Then I look into a set of rules which decide when an end-of-
       sentence is justified and when it's a mistake. In case of a mistake, the end-of-sentence mark is removed.
       What are such mistakes?

       Letter-dot sequences: U.S.A., i.e., e.g.
       Dot sequences: '..' or '...' or 'text . . more text'

       Two carriage returns denote the end of a sentence even if it doesn't end with a dot.
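
       The marking approach described above can be sketched in a few lines of plain Perl. This is a
       simplified illustration and not the module's actual code: the abbreviation list, the two marker
       characters and the helper name split_sketch are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical markers and abbreviation list, chosen only for this sketch.
my $EOS  = "\001";                  # marks a candidate end of sentence
my $DOT  = "\002";                  # stands in for a protected abbreviation dot
my @ABBR = qw( Dr Mr Mrs Ms Prof );

sub split_sketch {
    my ($text) = @_;
    return [] unless defined $text;

    # Step 1: protect the dot that ends a known abbreviation.
    my $abbr = join '|', @ABBR;
    $text =~ s/\b($abbr)\./$1$DOT/g;

    # Step 2: brute-force marking of every . ! or ? followed by
    # whitespace or the end of the text.
    $text =~ s/([.!?])(\s+|\z)/$1$EOS$2/g;

    # Step 3: a blank line ends a sentence even without punctuation.
    $text =~ s/\n\s*\n/$EOS/g;

    # Step 4: remove marks that are mistakes; here only the mark after
    # a single letter and dot, as in "U.S.A." or an initial.
    $text =~ s/(\b[A-Za-z]\.)$EOS/$1/g;

    # Step 5: split on the marker, restore protected dots, trim, and
    # drop pieces without any alphanumeric character.
    my @sentences;
    for my $s ( split /$EOS/, $text ) {
        $s =~ s/$DOT/./g;
        $s =~ s/^\s+|\s+$//g;
        push @sentences, $s if $s =~ /[A-Za-z0-9]/;
    }
    return \@sentences;
}

my $demo = split_sketch("Dr. Smith arrived. He was late!");
print "$_\n" for @$demo;    # two lines: "Dr. Smith arrived." and "He was late!"
```

       The sketch keeps the order described in this section: abbreviation dots are hidden first, every
       remaining candidate is marked, and only then are wrong marks removed.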

LIMITATIONS

       1) John F. Kennedy was a former president
       2) The answer is F. That ends the quiz

       In the first sentence, F. is detected as a person's initial and not as the end of a sentence. But this
       means we cannot detect the true end of the first sentence in example 2, which is after the 'F.'. This
       case is not common, though.
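
       The ambiguity can be seen with a plain regular expression. The sketch below is not taken from the
       module; the pattern $initial is a hypothetical stand-in for any rule that treats a single capital
       letter followed by a dot as a person's initial.

```perl
use strict;
use warnings;

# Hypothetical rule: a lone capital letter plus dot, followed by a
# capitalised word.
my $initial = qr/\b[A-Z]\.\s+[A-Z][a-z]+/;

my @samples = (
    'John F. Kennedy was a former president',
    'The answer is F. That ends the quiz',
);

# Both strings match, so the surface form alone cannot tell an initial
# from the end of a sentence.
my @matched = grep { $_ =~ $initial } @samples;
printf "%d of %d samples look like an initial\n", scalar @matched, scalar @samples;
```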

FUNCTIONS

       All functions used should be requested in the 'use' clause. None is exported by default.

       get_sentences( $text )
           The get_sentences function takes a scalar containing ASCII text as an argument and returns a
           reference to an array of the sentences that the text has been split into. White space is trimmed
           from the beginning and end of each returned sentence. Strings with no alphanumeric characters in
           them are not returned as sentences. If no text is supplied, a reference to an empty array is
           returned.

       add_acronyms( @acronyms )
           This  function  is  used for adding acronyms not supported by this code.  The input should be regular
           expressions for matching the desired acronyms, but should not include the final period ("."). So, for
           example, "blv?d" matches "blvd." and "bld.". "a\.mlf" will match "a.mlf.". You do not need to  bother
           with   acronyms  consisting  of  single  letters  and  dots  (e.g.  "U.S.A."),  as  these  are  found
           automatically. Note also that acronyms are searched for on a case insensitive basis.

           Please see 'Acronym/Abbreviations list' section for  the  abbreviations  already  supported  by  this
           module.
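
           The two example patterns above can be checked without the module. This sketch assumes only that
           each pattern is matched as a regular expression immediately followed by a literal period; the
           helper matches_abbreviation is hypothetical, and how the module anchors the match internally is
           not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: does $word consist of the pattern plus a final period?
sub matches_abbreviation {
    my ( $pattern, $word ) = @_;
    return $word =~ /\A(?:$pattern)\.\z/ ? 1 : 0;
}

my @cases = (
    [ 'blv?d',  'blvd.'  ],
    [ 'blv?d',  'bld.'   ],
    [ 'a\.mlf', 'a.mlf.' ],
    [ 'blv?d',  'blvd'   ],    # no final period, so no match
);

for my $case (@cases) {
    my ( $pattern, $word ) = @$case;
    printf "%-8s %-8s %s\n", $pattern, $word,
        matches_abbreviation( $pattern, $word ) ? 'match' : 'no match';
}
```

           Note that the dot inside "a\.mlf" is escaped because the input is a regular expression, while
           the final period is left out of the pattern.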

       get_acronyms( )
           This function will return the defined list of acronyms.

       set_acronyms( @my_acronyms )
           This  function  replaces  the  predefined  acronym  list  with the given list. See "add_acronyms" for
           details on the input specifications.

       get_EOS( )
           This function returns the value of the string used to mark the end of sentence.  You  might  want  to
           see  what  it  is, and to make sure your text doesn't contain it.  You can use set_EOS() to alter the
           end-of-sentence string to whatever you desire.

       set_EOS( $new_EOS_string )
           This function alters the end-of-sentence string used to mark the end of sentences.

       set_locale( $new_locale )
           Receives a language locale in the form language_country.character-set, for example
           "fr_CA.ISO8859-1" for Canadian French using character set ISO8859-1.
           Returns a reference to a hash containing the current locale formatting values. Returns undef if
           passed undef.

           The following will set the LC_COLLATE behaviour to Argentinian Spanish. NOTE: The naming and
           availability of locales depend on your operating system. Please consult the perllocale manpage
           for how to find out which locales are available on your system.

           $loc = set_locale( "es_AR.ISO8859-1" );

           This actually does this:

           $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );
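
           The setlocale call shown above comes from Perl's core POSIX module. A minimal sketch of calling
           it directly, with a fallback when the requested locale is not installed, might look like this;
           the helper name try_locale is hypothetical and not part of Lingua::EN::Sentence.

```perl
use strict;
use warnings;
use POSIX qw( setlocale LC_ALL );

# Returns the locale string on success, or undef when the name is
# undefined or the locale is not available on this system.
sub try_locale {
    my ($name) = @_;
    return unless defined $name;
    return setlocale( LC_ALL, $name );
}

my $wanted = 'es_AR.ISO8859-1';
my $loc    = try_locale($wanted);

if ( defined $loc ) {
    print "locale set to $loc\n";
}
else {
    # With a single argument, setlocale only queries the current locale.
    print "locale $wanted is not available; keeping ", setlocale(LC_ALL), "\n";
}
```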

Acronym/Abbreviations list

       You can use the get_acronyms() function to get the current list of acronyms. The list has become
       too long to reproduce in the documentation.

       If I come across a good general-purpose list - I'll incorporate  it  into  this  module.   Feel  free  to
       suggest such lists.

FUTURE WORK

               [1] Object Oriented like usage
               [2] Supporting more than just English/French
               [3] Code optimization. Currently everything is based on regular expressions, and they are not well optimized
               [4] Possibly use more semantic heuristics for detecting a beginning of a sentence

REPOSITORY

       <https://github.com/kimryan/Lingua-EN-Sentence>

AUTHOR

       Shlomo Yona shlomo@cs.haifa.ac.il

       Currently being maintained by Kim Ryan, kimryan at CPAN d o t org

COPYRIGHT AND LICENSE

       Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.
       Copyright (c) 2022 Kim Ryan. All rights reserved.

       This library is free software; you can redistribute it and/or modify it under  the  same  terms  as  Perl
       itself.

perl v5.36.0                                       2023-10-21                          Lingua::EN::Sentence(3pm)
