Provided by: unibetacode_1.2-2_amd64

NAME

       unibetacode - Format for polytonic Greek Beta Code files

SYNOPSIS

       source_file.beta

DESCRIPTION

       Unibetacode  is  an  implementation  of  Beta  Code,  as  adopted by the University of California, Irvine
       Thesaurus Linguae Graecae (TLG) Program and the Tufts University Perseus  Project,  among  others.   Beta
       Code provides a way of encoding polytonic Greek characters using plain ASCII characters.  The unibetacode
       package contains three utility programs: unibetaprep(1) converts TLG-unique numeric codes to Unicode code
       points,  beta2uni(1) converts a Beta Code file to UTF-8 Unicode, and uni2beta(1) converts a UTF-8 Unicode
       file to Beta Code.  These programs can also process Coptic and some Hebrew, but historically the focus of
       Beta Code documents has been classical Greek.

       A Unicode code point is the numeric value that The Unicode Standard assigns to a character or other
       entity.  Throughout this document, Unicode code points are given in the form U+xxxx, where "xxxx" is
       a string of four hexadecimal digits identifying a character in the Unicode Basic Multilingual Plane.
       This is how they are usually specified in The Unicode Standard and elsewhere.
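
       As a small illustration of the U+xxxx notation, the following Python sketch prints the label for a
       character.  The function name code_point_label is hypothetical and is not part of the unibetacode
       package.

```python
def code_point_label(ch: str) -> str:
    """Return the U+xxxx label for one character (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    # GREEK SMALL LETTER ALPHA
    print(code_point_label("\u03b1"))
```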

       Note: Thesaurus Linguae Graecae and TLG are registered trademarks of the University of California.

GENERAL PUNCTUATION

   PUNCTUATION COMMON TO ALL MODES
       Regardless of the language mode (Greek, Latin, Coptic, or Hebrew), several punctuation marks are retained
       as is between the input file and the output file.  They are as follows:

               .        Full Stop (Period)

               ,        Comma

               ?        Question Mark (except in Greek mode, where it becomes a Combining Dot Below)

               !        Exclamation Mark

               ;        Semicolon / Greek Question Mark (see the Greek section below)

               [ ]      Square Brackets

   QUOTATION MARK STYLES
       The TLG Beta Code specification supports nine different styles  of  quotation  mark.   While  support  of
       different  styles  is  beneficial,  this  complicates round-trip conversion from Beta Code to Unicode and
       back.  This is further complicated by the same Unicode character being used as an opening quotation  mark
       in one style and as a closing quotation mark in another style.

       In the unibetacode package, double quotes are written as plain ASCII quotation marks in a Beta Code
       source file; the quotation style is determined by the language mode.  This greatly simplifies
       round-trip conversion between Beta Code and a Unicode UTF-8 document.  Double quotation marks must be
       balanced.  The first quotation mark encountered will be interpreted as a left quotation mark for the
       chosen style, the second will be interpreted as a right quotation mark, and so on.  Double quotation
       marks are converted as follows:

               Greek, Coptic
                        The opening double quotation mark is rendered  as  U+00AB,  LEFT-POINTING  DOUBLE  ANGLE
                        QUOTATION MARK.  The closing double quotation mark is rendered as U+00BB, RIGHT-POINTING
                        DOUBLE ANGLE QUOTATION MARK.

               Hebrew   The  opening  double  quotation mark is rendered as U+201E, DOUBLE LOW-9 QUOTATION MARK.
                        The closing double quotation mark is rendered as U+201D, RIGHT DOUBLE QUOTATION MARK.

               Latin    The opening double quotation mark is rendered as U+201C,  LEFT  DOUBLE  QUOTATION  MARK.
                        The closing double quotation mark is rendered as U+201D, RIGHT DOUBLE QUOTATION MARK.
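
       The alternation between opening and closing marks can be sketched in Python as follows.  This is
       only an illustration of the rule described above, not the code used by beta2uni(1); the names
       DOUBLE_QUOTES and render_double_quotes are hypothetical, and the sketch converts nothing except the
       ASCII double quotation mark.

```python
# Opening and closing double quotation marks per language mode,
# taken from the table in this section.
DOUBLE_QUOTES = {
    "greek": ("\u00ab", "\u00bb"),
    "coptic": ("\u00ab", "\u00bb"),
    "hebrew": ("\u201e", "\u201d"),
    "latin": ("\u201c", "\u201d"),
}


def render_double_quotes(text: str, mode: str) -> str:
    """Replace ASCII double quotes, alternating opening and closing marks."""
    opening, closing = DOUBLE_QUOTES[mode]
    out = []
    inside = False  # True while a quotation is open
    for ch in text:
        if ch == '"':
            out.append(closing if inside else opening)
            inside = not inside
        else:
            out.append(ch)
    return "".join(out)
```

       Because the state simply toggles, an unbalanced quotation mark would flip every later mark in this
       sketch, which is one way to see why the marks must be balanced.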

       Single  quotes  are  specified  explicitly.  For Latin and Hebrew, they are specified with a Grave Accent
       (U+0060) for an opening single quote and an apostrophe (U+0027) for a closing single  quote.   For  Greek
       and  Coptic,  they  are  specified  with a "<" for an opening single quote and a ">" for a closing single
       quote.  The use of "<" and ">" for Greek and Coptic was a compromise so that an ASCII apostrophe in Greek
       mode would render as U+02BC, MODIFIER LETTER APOSTROPHE.  As with double quotation marks,  the  rendering
       of these characters is dependent on the language mode.  Single quotation marks are converted as follows:

               Greek, Coptic
                        The  opening  single  quotation  mark  is rendered as U+2039, SINGLE LEFT-POINTING ANGLE
                        QUOTATION MARK.  The closing single quotation mark is rendered as U+203A, SINGLE  RIGHT-
                        POINTING ANGLE QUOTATION MARK.

               Hebrew   The  opening  single  quotation mark is rendered as U+201A, SINGLE LOW-9 QUOTATION MARK.
                        The closing single quotation mark is rendered as U+2018, LEFT SINGLE QUOTATION MARK.

               Latin    The opening single quotation mark is rendered as U+2018,  LEFT  SINGLE  QUOTATION  MARK.
                        The closing single quotation mark is rendered as U+2019, RIGHT SINGLE QUOTATION MARK.
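
       Because single quotes are explicit, they can be modeled as a plain per-mode lookup with no
       open/close state.  The Python sketch below is hypothetical (SINGLE_QUOTES and render_single_quotes
       are illustrative names) and copies the code points from the table above; the Greek apostrophe
       entry follows the U+02BC rendering mentioned earlier in this section.

```python
# Per-mode rendering of the explicit single-quote characters.
SINGLE_QUOTES = {
    "greek": {"<": "\u2039", ">": "\u203a", "'": "\u02bc"},
    "coptic": {"<": "\u2039", ">": "\u203a"},
    "hebrew": {"`": "\u201a", "'": "\u2018"},
    "latin": {"`": "\u2018", "'": "\u2019"},
}


def render_single_quotes(text: str, mode: str) -> str:
    """Map each single-quote character for the given mode; keep the rest."""
    table = SINGLE_QUOTES[mode]
    return "".join(table.get(ch, ch) for ch in text)
```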

EXTENSIONS FOR ASCII AND UNICODE

       The  unibetacode package includes two extensions to TLG Beta Code: one for efficiently inserting an ASCII
       string into text when not in Latin mode, and the other for inserting a Unicode code point in any language
       mode.  These are described in the following two sub-sections.

   ASCII STRING INSERTION
       An ASCII string can be enclosed in curly brackets when in a non-Latin language mode.  The string will
       be output verbatim.  This can be useful if a Greek text uses ASCII symbols; for example, to produce a
       Greek document with an ASCII colon (':') rather than a Unicode "GREEK ANO TELEIA" character, U+0387.
       The format is as follows:

               {ASCII-string}

       This  is  an  extension  to standard Beta Code; the TLG specification assigns a different use to '{'.  By
       itself, the use of '{' not followed by a decimal number is deprecated in the TLG specification,  so  this
       should avoid some conflict.  Curly brackets are not allowed in the string.
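
       A minimal Python sketch of how such strings could be separated from the surrounding Beta Code is
       shown below.  The names BRACED and split_verbatim are hypothetical; the pattern relies only on the
       rule stated above that curly brackets are not allowed inside the string.

```python
import re

# A '{', then any run of characters other than curly brackets, then '}'.
BRACED = re.compile(r"\{([^{}]*)\}")


def split_verbatim(text: str) -> list:
    """Return (is_verbatim, segment) pairs in document order."""
    parts = []
    pos = 0
    for match in BRACED.finditer(text):
        if match.start() > pos:
            parts.append((False, text[pos:match.start()]))
        parts.append((True, match.group(1)))
        pos = match.end()
    if pos < len(text):
        parts.append((False, text[pos:]))
    return parts
```

       Segments flagged True would be copied to the output unchanged; the others would go through the
       normal conversion for the current language mode.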

   SPECIAL UNICODE CHARACTER INSERTION
       The  original  Beta  Code specification lists many numeric codes for producing special symbols that today
       have become part of The Unicode Standard.  In the future, it  will  likely  be  most  beneficial  if  any
       specialized numeric codes for characters use Unicode code points rather than the historical TLG Beta Code
       numeric assignments.

       The  unibetacode  package  allows any Unicode code point to be specified in hexadecimal (which is how The
       Unicode Standard provides them) as a string inside an  ASCII  '{'...'}'  escape  sequence.   The  Unicode
       hexadecimal code point of one to six digits is preceded by "\u", taking the form "\ux...x".  For example,
       strings such as

               {\u3d8} or {\u3D8} or {\u03D8}

       can  be  used  to insert the Unicode character U+03D8, GREEK LETTER ARCHAIC KOPPA, which does not have an
       associated Beta Code letter assignment.  Such  Unicode  code  point  strings  can  be  mixed  with  other
       characters  in  the  same  string,  as long as any character that follows the Unicode code point is not a
       hexadecimal digit of '0'-'9', 'A'-'F', or 'a'-'f'.
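
       The following Python sketch shows one way the "\ux...x" form could be expanded inside the text
       between the curly brackets.  It is an assumption-laden illustration (UNICODE_ESCAPE and
       expand_unicode_escapes are hypothetical names), and it does not guard against values above
       U+10FFFF.

```python
import re

# "\u" followed by one to six hexadecimal digits, matched greedily.
UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{1,6})")


def expand_unicode_escapes(ascii_string: str) -> str:
    """Replace each \\ux...x sequence with the character it names."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), ascii_string)
```

       In this sketch the digits are consumed greedily, so r"\u3d8a" yields U+3D8A rather than U+03D8
       followed by "a".  That is the reason a code point must not be followed directly by a hexadecimal
       digit.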

   NUMERIC DIGITS
       The numeric digits '0' through '9' are simply entered as '0' through '9', respectively, in  any  language
       mode.   The  language  modes  are  Coptic, Greek, Hebrew, and Latin.  They are described in the following
       sections.

GREEK

   LETTERS
       Capital and small letters can take the same set of accent  marks,  but  the  order  in  which  these  are
       specified differs between capital and small.

       Small letters are given in Beta Code in this order: (1) letter, (2) breathing marks, (3) accents, and (4)
       iota  subscript.   This  follows the traditional typed appearance of small polytonic Greek letters, where
       breathing marks and then accent marks appear on top of the small letter, and iota subscripts appear below
       small long vowels.

       Capital letters are given in Beta Code in this order: (1) asterisk (which denotes a Capital letter),  (2)
       breathing  marks,  (3)  accents,  (4) letter, and (5) iota subscript.  This follows the traditional typed
       appearance of capital polytonic Greek letters, where breathing marks and then accent marks appear to  the
       left of the capital letter, and iota subscripts appear to the right of capital long vowels.
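
       The two orderings can be made concrete with a short Python sketch that assembles the Beta Code for
       one letter.  The function beta_letter and its parameters are hypothetical; the mark characters
       themselves are listed under BREATHING MARKS AND ACCENTS and IOTA SUBSCRIPT.

```python
def beta_letter(letter: str, breathing: str = "", accent: str = "",
                iota_subscript: bool = False, capital: bool = False) -> str:
    """Assemble the Beta Code for one Greek letter in the documented order."""
    marks = breathing + accent            # breathing marks precede accents
    tail = "|" if iota_subscript else ""  # iota subscript always comes last
    if capital:
        # asterisk, breathing, accent, letter, iota subscript
        return "*" + marks + letter + tail
    # letter, breathing, accent, iota subscript
    return letter + marks + tail
```

       For example, an alpha with smooth breathing and acute accent is "a)/" when small and "*)/a" when
       capital.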

       The  letter  mapping  is  as  follows,  in  Greek  alphabetical  order.  Letters can be capital or small;
       generally speaking, small is easier to read, so it is the default output from uni2beta(1):

               *a or a    Capital or Small Alpha, respectively

               *b or b    Capital or Small Beta, resp.

               *g or g    Capital or Small Gamma, resp.

               *d or d    Capital or Small Delta, resp.

               *e or e    Capital or Small Epsilon, resp.

               *z or z    Capital or Small Zeta, resp.

               *h or h    Capital or Small Eta, resp.

               *q or q    Capital or Small Theta, resp.

               *i or i    Capital or Small Iota, resp.

               *k or k    Capital or Small Kappa, resp.

               *l or l    Capital or Small Lambda, resp.

               *m or m    Capital or Small Mu, resp.

               *n or n    Capital or Small Nu, resp.

               *c or c    Capital or Small Xi, resp.

               *o or o    Capital or Small Omicron, resp.

               *p or p    Capital or Small Pi, resp.

               *r or r    Capital or Small Rho, resp.

               *s or s    Capital or Small Sigma, resp.  Note: a small "s" is  interpreted  as  middle  (medial)
                          sigma  or  final sigma depending upon the context.  To force one or the other, see the
                          following two entries.

               s1         Small Middle (Medial) Sigma

               s2 or j    Small Final Sigma

               *s3 or s3  Capital or Small Lunate Sigma, resp.

               *t or t    Capital or Small Tau, resp.

               *u or u    Capital or Small Upsilon, resp.

               *f or f    Capital or Small Phi, resp.

               *x or x    Capital or Small Chi, resp.

               *y or y    Capital or Small Psi, resp.

               *w or w    Capital or Small Omega, resp.

               *v or v    Capital or Small Digamma, resp.

       Example: "*to fws", "the light" (without accent marks).  This could also be written as "*TO FWS"; the
       case of the Latin letters used in Greek Beta Code does not affect the conversion into UTF-8.
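
       The letter table and the example can be tried out with the Python sketch below.  It is a
       simplified illustration, not the beta2uni(1) algorithm: it handles only the 24 unaccented letters
       Alpha through Omega and the asterisk, it does not handle "s1", "s2", "s3", "j", or "v", and its
       final-sigma rule (an "s" not followed by another letter) is an assumption.  The names SMALL and
       plain_beta_to_greek are hypothetical.

```python
# Beta Code letters and the small Greek letters they stand for.
SMALL = dict(zip("abgdezhqiklmncoprstufxyw",
                 "\u03b1\u03b2\u03b3\u03b4\u03b5\u03b6\u03b7\u03b8"
                 "\u03b9\u03ba\u03bb\u03bc\u03bd\u03be\u03bf\u03c0"
                 "\u03c1\u03c3\u03c4\u03c5\u03c6\u03c7\u03c8\u03c9"))


def plain_beta_to_greek(beta: str) -> str:
    """Convert unaccented Greek Beta Code letters to Unicode."""
    out = []
    capital = False  # set by '*', applies to the next letter only
    for i, ch in enumerate(beta):
        key = ch.lower()  # the Latin letter's case is ignored
        if ch == "*":
            capital = True
        elif key in SMALL:
            greek = SMALL[key]
            nxt = beta[i + 1].lower() if i + 1 < len(beta) else ""
            if key == "s" and not capital and nxt not in SMALL:
                greek = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
            out.append(greek.upper() if capital else greek)
            capital = False
        else:
            out.append(ch)
    return "".join(out)
```

       With this sketch, "*to fws" and "*TO FWS" produce the same string.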

   BREATHING MARKS AND ACCENTS
       These  are  the  encodings of breathing marks and accents.  In Beta Code (as in written Greek), breathing
       marks appear before accents.

               )        Smooth Breathing

               (        Rough Breathing

               \        Grave accent

               /        Acute accent

               =        Circumflex

               +        Diaeresis

               &        Macron

               '        Breve

               ?        Combining Dot Below

       Example: "*to\ fw=s", "the light", with a grave accent, or varia, over the omicron and a circumflex
       accent, or perispomeni, over the omega.  This could also be written as "*TO\ FW=S".  Note that the
       case of the Latin letter does not matter for accent placement; it is only the case of the Greek
       letter that matters.  Greek capital letters are encoded with a preceding asterisk, so in this
       example, "O\" and "W=" will appear as small letters in the UTF-8 output.

   IOTA SUBSCRIPT
       The iota subscript is the last character written after a long vowel with which it  appears,  whether  the
       letter is capital or small.  It is denoted by a vertical bar:

               |        Iota subscript
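
       One way to see how the mark characters relate to Unicode is to map each one to a combining
       character and let Unicode normalization compose the result.  The Python sketch below does this
       for a small letter; the choice of combining characters and the names COMBINING and compose_small
       are assumptions for illustration, and beta2uni(1) may work differently.

```python
import unicodedata

# Beta Code mark -> Unicode combining character (illustrative choice).
COMBINING = {
    ")": "\u0313",   # COMBINING COMMA ABOVE (smooth breathing)
    "(": "\u0314",   # COMBINING REVERSED COMMA ABOVE (rough breathing)
    "\\": "\u0300",  # COMBINING GRAVE ACCENT
    "/": "\u0301",   # COMBINING ACUTE ACCENT
    "=": "\u0342",   # COMBINING GREEK PERISPOMENI (circumflex)
    "+": "\u0308",   # COMBINING DIAERESIS
    "|": "\u0345",   # COMBINING GREEK YPOGEGRAMMENI (iota subscript)
}


def compose_small(base: str, marks: str) -> str:
    """Attach Beta Code marks to a small Greek letter and compose with NFC."""
    combining = "".join(COMBINING[m] for m in marks)
    return unicodedata.normalize("NFC", base + combining)
```

       For instance, compose_small("\u03c9", "=") gives U+1FF6, the omega with perispomeni from the
       "fw=s" example.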

   GREEK PUNCTUATION
       These are the punctuation symbols that the unibetacode package supports:

               .                  Period (Teleia)

               ,                  Comma

               :                  Middle Dot (Ano Teleia)

               ;                  Question Mark (Erotimatiko)

               '                  Apostrophe (Apostrophos)

               - (hyphen)         Hyphen (Pavla)

               _ (underscore)     Em Dash

               #                  Greek Number Sign

   UNICODE GREEK
       The  Greek  Extended range of The Unicode Standard, U+1F00 - U+1FFF, contains 16 small and capital vowels
       that have identical representation in the Greek and Coptic range, U+0370 - U+03FF.  These are vowels with
       an "oxia" (acute) accent in the Greek Extended range; they have equivalent glyphs with a "tonos"  (acute)
       accent  in  the  Greek and Coptic range.  Because of this duplication, the use of these 16 Greek Extended
       glyphs is deprecated.  uni2beta(1) will convert those 16 characters to Beta Code,  but  beta2uni(1)  will
       convert  the resulting Beta Code into characters in the Greek and Coptic range (U+0370 - U+03FF); it will
       not convert them back into Greek Extended glyphs.
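
       The duplication can be observed with Python's standard unicodedata module, as in the sketch
       below.  The function name deprecated_oxia_letters is hypothetical; the sketch relies on the
       Unicode character database, not on the unibetacode programs.

```python
import unicodedata


def deprecated_oxia_letters() -> list:
    """List Greek Extended letters whose canonical decomposition is one character."""
    found = []
    for cp in range(0x1F00, 0x2000):
        ch = chr(cp)
        target = unicodedata.decomposition(ch)
        is_letter = "LETTER" in unicodedata.name(ch, "")
        # One code point and no "<compat>" tag means a canonical singleton.
        if is_letter and len(target.split()) == 1 and not target.startswith("<"):
            found.append(ch)
    return found


oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA (Greek Extended)
tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS (Greek and Coptic)
same_after_nfc = unicodedata.normalize("NFC", oxia) == tonos
```

       Unicode normalization form NFC replaces each of these Greek Extended letters with its Greek and
       Coptic equivalent, which matches the direction of the round trip described above.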

       Also in the Greek Extended Unicode range, the TLG Project considers U+1FBF to  be  the  equivalent  of  a
       smooth breathing mark, and uni2beta(1) will convert it as such.

LATIN (ASCII)

       To  display ASCII characters, including the Latin letters 'A' through 'Z' and 'a' through 'z', begin with
       an ampersand ('&') character.  Switch back to Greek mode with a dollar sign ('$') character.

       ASCII characters can also be surrounded with curly brackets; for example, "{Here is some ASCII!}".   This
       is  non-standard  though;  the  TLG specification uses '&' and '$' to enter Latin and then switch back to
       Greek.
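
       The mode switches can be pictured as splitting a Beta Code string into runs, each tagged with a
       language mode.  The Python sketch below uses "&" and "$" as described here, plus the "&100" and
       "&300" sequences described in the COPTIC and HEBREW sections.  It is a simplification: it assumes
       the text starts in Greek mode, it ignores curly-bracket strings, and it does not account for the
       use of "&" as a macron in Greek mode.  MODE_SWITCH, MODES, and split_by_mode are hypothetical
       names.

```python
import re

# Longer switches are listed first so "&100" is not read as "&" plus "100".
MODE_SWITCH = re.compile(r"&100|&300|&|\$")
MODES = {"&": "latin", "&100": "coptic", "&300": "hebrew", "$": "greek"}


def split_by_mode(beta: str, start_mode: str = "greek") -> list:
    """Return (mode, text) runs, dropping the switch sequences themselves."""
    runs = []
    mode = start_mode
    pos = 0
    for match in MODE_SWITCH.finditer(beta):
        if match.start() > pos:
            runs.append((mode, beta[pos:match.start()]))
        mode = MODES[match.group()]
        pos = match.end()
    if pos < len(beta):
        runs.append((mode, beta[pos:]))
    return runs
```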

       For efficiency, beta2uni(1) is conditioned to interpret sequences that look like accented Greek as
       accented Greek.  Curly brackets can also be useful for overriding such interpretations.  For example,
       if a document contained the text

               ...(this is an example)

       the "e)" could be interpreted as a small epsilon with a smooth breathing mark above it.  To prevent
       this behavior, type

               ...(this is an example{}) or ...(this is an example{)}

       and the Unicode output from beta2uni(1) will appear as intended.  This technique will appear familiar  to
       TeX users.

COPTIC

       To  display  Coptic  letters, begin with the character sequence "&100".  Switch back to Greek mode with a
       dollar sign ('$') character.  As with Greek Beta Code, capital Coptic letters in Beta Code begin with  an
       asterisk ('*') and small Coptic letters do not.

       Note that unlike in Greek mode, the Coptic Beta Code letters are case-sensitive.  In general, Coptic
       letters derived from Demotic use lowercase Beta Codes and map to the Greek and Coptic Unicode script
       in the range U+03E2 - U+03EF; the rest of the Coptic letters use uppercase Beta Codes and map to the
       separate Coptic Unicode script in the range U+2C80 - U+2CB1.

       The encoding is as follows:

               *A or A    Capital or Small Alfa, respectively

               *B or B    Capital or Small Vida, resp.

               *G or G    Capital or Small Gamma, resp.

               *D or D    Capital or Small Dalda, resp.

               *E or E    Capital or Small Eie, resp.

               *V or V    Capital or Small Sou, resp.

               *Z or Z    Capital or Small Zata, resp.

               *H or H    Capital or Small Hate, resp.

               *Q or Q    Capital or Small Tethe, resp.

               *I or I    Capital or Small Iauda, resp.

               *K or K    Capital or Small Kapa, resp.

               *L or L    Capital or Small Laula, resp.

               *M or M    Capital or Small Mi, resp.

               *N or N    Capital or Small Ni, resp.

               *C or C    Capital or Small Ksi, resp.

               *O or O    Capital or Small O, resp.

               *P or P    Capital or Small Pi, resp.

               *R or R    Capital or Small Ro, resp.

               *S or S    Capital or Small Sima, resp.

               *T or T    Capital or Small Tau, resp.

               *U or U    Capital or Small Ua, resp.

               *F or F    Capital or Small Fi, resp.

               *X or X    Capital or Small Khi, resp.

               *Y or Y    Capital or Small Psi, resp.

               *W or W    Capital or Small Oou, resp.

               *s or s    Capital or Small Shei, resp.

               *f or f    Capital or Small Fei, resp.

               *k or k    Capital or Small Khei, resp.

               *h or h    Capital or Small Hori, resp.

               *j or j    Capital or Small Gangia, resp.

               *g or g    Capital or Small Shima, resp.

               *t or t    Capital or Small Dei, resp.

               \          Jinma (Grave) Accent

       Switch back to Greek mode by ending with a dollar sign ('$') character.
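
       Since Unicode stores each Coptic capital letter immediately before its small form, the code point
       for a Beta Code letter can be computed from its position in the table above.  The Python sketch
       below assumes the Unicode ordering Alfa through Oou starting at U+2C80 and Shei through Dei
       starting at U+03E2; the names used are hypothetical.

```python
COPTIC_UPPERCASE_CODES = "ABGDEVZHQIKLMNCOPRSTUFXYW"  # Alfa .. Oou
COPTIC_LOWERCASE_CODES = "sfkhjgt"                    # Shei .. Dei


def coptic_code_point(code: str, capital: bool = False) -> int:
    """Return the Unicode code point for one Coptic Beta Code letter."""
    if len(code) != 1:
        raise ValueError("expected a single Beta Code letter")
    if code in COPTIC_UPPERCASE_CODES:
        base = 0x2C80 + 2 * COPTIC_UPPERCASE_CODES.index(code)
    elif code in COPTIC_LOWERCASE_CODES:
        base = 0x03E2 + 2 * COPTIC_LOWERCASE_CODES.index(code)
    else:
        raise ValueError(f"not a Coptic Beta Code letter: {code!r}")
    # Capital letters sit at even offsets, small letters one above.
    return base if capital else base + 1
```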

HEBREW

       The TLG specification only covers the basic Hebrew letters aleph through tav.  These letters map  to  the
       Hebrew  Unicode script in the range U+05D0 - U+05EA.  Beta Codes are not defined in the specification for
       cantillation marks, Yiddish digraphs, etc.

       To display Hebrew letters, begin with the character sequence "&300".

       Note that unlike in Greek mode, the Hebrew Beta Codes are case-sensitive and they  never  begin  with  an
       asterisk ('*').

       The encoding is as follows:

              A    Alef

              b    Bet

              g    Gimel

              d    Dalet

              h    He

              v    Vav

              z    Zayin

              H    Het

              Q    Tet

              y    Yod

              k1   Middle Kaf

              k2   Final Kaf

              l    Lamed

              m1   Middle Mem

              m2   Final Mem

              n1   Middle Nun

              n2   Final Nun

              S    Samekh

              a    Ayin

              p1   Middle Pe

              p2   Final Pe

              T1   Middle Tsadi

              T2   Final Tsadi

              q    Qof

              r    Resh

              s    Shin

              t    Tav

       Switch back to Greek mode with a dollar sign ('$') character.
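
       The Hebrew table can be expressed as a lookup built from the Unicode order of the letters, in
       which each final form precedes its middle form.  The Python sketch below is illustrative only
       (HEBREW_ORDER, HEBREW, and hebrew_to_unicode are hypothetical names); characters that are not in
       the table, including a bare "k", "m", "n", "p", or "T", are left unchanged.

```python
import re

# Beta Codes in Unicode order, U+05D0 (Alef) through U+05EA (Tav).
HEBREW_ORDER = ["A", "b", "g", "d", "h", "v", "z", "H", "Q", "y",
                "k2", "k1", "l", "m2", "m1", "n2", "n1", "S", "a",
                "p2", "p1", "T2", "T1", "q", "r", "s", "t"]
HEBREW = {code: chr(0x05D0 + i) for i, code in enumerate(HEBREW_ORDER)}

# Two-character codes are tried before one-character codes.
TOKEN = re.compile("|".join(sorted(map(re.escape, HEBREW), key=len, reverse=True)))


def hebrew_to_unicode(beta: str) -> str:
    """Convert Hebrew Beta Code letters to Unicode, in logical order."""
    return TOKEN.sub(lambda m: HEBREW[m.group()], beta)
```

       For example, "brAsyt" converts to the six letters Bet, Resh, Alef, Shin, Yod, Tav.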

SAMPLES

       The  directory test/reference contains samples with mappings from Beta Code to UTF-8 and vice versa.  The
       "genesis-1-1.beta" and "genesis-1-1.utf8" files show the verse Genesis  1:1  in  Koine  Greek  (from  the
       Septuagint), Hebrew, and Bohairic Coptic in Beta Code and UTF-8, respectively.

SEE ALSO

       unibetaprep(1), beta2uni(1), uni2beta(1)

AUTHOR

       The unibetacode package was created by Paul Hardy.

LICENSE

       unibetacode is Copyright © 2018, 2019 Paul Hardy.

       This  program  is  free  software;  you  can  redistribute it and/or modify it under the terms of the GNU
       General Public License as published by the Free Software Foundation; either version 2 of the License,  or
       (at your option) any later version.

BUGS

       The  format  is  very  straightforward and no known bugs exist.  However, Beta Code has been evolving for
       almost 50 years, especially since the advent of Unicode.  As a result, many Beta  Code-encoded  documents
       exist  in  versions  of  the  standard  much  older than the current version.  This version also does not
       implement many numbered codes that are contained in the TLG Beta Code specification.  There are no  plans
       to support the TLG Beta Code formatting codes, as that is beyond the scope of Unicode.

                                                   2019 Jan 26                                    UNIBETACODE(5)