Ubuntu Manpage: unicode_canonical, unicode_ccc, unicode_decomposition_init, unicode_decomposition

Provided by: libcourier-unicode-dev_2.3.2-1ubuntu1_amd64

NAME

       unicode_canonical, unicode_ccc, unicode_decomposition_init, unicode_decomposition_deinit,
       unicode_decompose, unicode_decompose_reallocate_size, unicode_compose, unicode_composition_init,
       unicode_composition_deinit, unicode_composition_apply - unicode canonical normalization and
       denormalization

SYNOPSIS

       #include <courier-unicode.h>

       unicode_canonical_t unicode_canonical(char32_t c);

       uint8_t unicode_ccc(char32_t c);

       void unicode_decomposition_init(unicode_decomposition_t *info, char32_t *string, size_t *string_size,
                                       void *arg);

       int unicode_decompose(unicode_decomposition_t *info);

       void unicode_decomposition_deinit(unicode_decomposition_t *info);

       size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info, const size_t *sizes, size_t n);

       int unicode_compose(char32_t *string, size_t string_size, int flags, size_t *new_size);

       int unicode_composition_init(const char32_t *string, size_t string_size, int flags,
                                    unicode_composition_t *compositions);

       void unicode_composition_deinit(unicode_composition_t *compositions);

       size_t unicode_composition_apply(char32_t *string, size_t string_size,
                                        unicode_composition_t *compositions);

DESCRIPTION

       These functions compose or decompose a Unicode string into a canonical or a compatibility normalized
       form.

       unicode_canonical() looks up the character's canonical and compatibility mapping[1].  unicode_canonical()
       returns a structure with the following fields:

       canonical_chars
           A pointer to the canonical or equivalent representation of the character.

       n_canonical_chars
           Number of characters in the canonical_chars.

       format
           A value of UNICODE_CANONICAL_FMT_NONE indicates a canonical mapping, other values indicate a
           compatibility equivalent mapping.

       A NULL canonical_chars (with a 0 n_canonical_chars) indicates that the character has no canonical or
       compatibility equivalence.

       unicode_ccc() returns the character's canonical combining class value.

       unicode_decomposition_init(), unicode_decompose() and unicode_decomposition_deinit() implement a complete
       interface for decomposing a Unicode string:

           unicode_decomposition_t info;

           unicode_decomposition_init(&info, before, (size_t)-1, NULL);
           info.decompose_flags=UNICODE_DECOMPOSE_FLAG_QC;
           unicode_decompose(&info);
           unicode_decomposition_deinit(&info);

       unicode_decomposition_init() initializes a new unicode_decomposition_t structure, that gets passed in as
       its first parameter. The second parameter is a pointer to a Unicode string, with the number of characters
       in the string in the third parameter. A string size of (size_t)-1 indicates a \0-terminated string;
       unicode_decomposition_init() then calculates its string_size (which does not include the trailing \0).
       The last parameter is a void *, an opaque pointer that gets stored in the initialized
       unicode_decomposition_t object:

           typedef struct unicode_decomposition {
               char32_t   *string;
               size_t     string_size;
               int        decompose_flags;
               int        (*reallocate)(
                               struct unicode_decomposition   *info,
                               const size_t                   *offsets,
                               const size_t                   *sizes,
                               size_t                         n
                          );
               void       *arg;
           } unicode_decomposition_t;

       unicode_decompose() decomposes the string in place, replacing it with its decomposed version.

       unicode_decomposition_t's string, string_size and arg are copies of unicode_decomposition_init's
       parameters.  unicode_decomposition_init initializes all other fields to their default values.

       The decompose_flags field gets initialized to 0. It is a bitmask of the following flags:

       UNICODE_DECOMPOSE_FLAG_QC
           Check each character's appropriate “quick check” property and skip decomposing Unicode characters
           that would get re-composed by unicode_composition_apply().

       UNICODE_DECOMPOSE_FLAG_COMPAT
           Perform a compatibility decomposition instead of a canonical decomposition.

       reallocate is a pointer to a function that gets called to reallocate a larger string.
       unicode_decompose() determines which characters in the string need decomposing and calls the reallocate
       function pointer zero or more times. Each call to reallocate passes information about where new
       characters will get inserted into the string.

       reallocate only needs to grow the size of the buffer where string points so that it's big enough to hold
       a larger, decomposed string; then update string accordingly.  reallocate should not update string_size or
       make any changes to the existing string, that's unicode_decompose()'s job (after reallocate returns).

       The reallocate callback function receives the following parameters.

       •   A pointer to the unicode_decomposition_t and, notably, its arg.

       •   A pointer to the array of offset indexes in the string where new characters will get inserted in
           order to hold the decomposed string.

       •   A pointer to the array that holds the number of characters that get inserted at each corresponding
           offset.

       •   The size of the two arrays.

       reallocate must update string, if necessary, so that its buffer can hold at least as many characters as
       the initial string_size plus the sum total of all sizes.

       unicode_decomposition_init() initializes the reallocate pointer to a default implementation that uses
       realloc(3) and updates string with its return value. The application can use its own reallocate to handle
       this task on its own, and use unicode_decompose_reallocate_size to compute the minimum string size:

           size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info,
                                                    const size_t *sizes,
                                                    size_t n)
           {
               size_t i;
               size_t new_size=info->string_size;

               for (i=0; i<n; ++i)
                   new_size += sizes[i];

               return new_size;
           }
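
       As an illustration of the callback contract described above, the following is a minimal sketch of a
       replacement reallocate that grows a malloc(3)-ed buffer and keeps room for a trailing \0. The
       decomposition_sketch_t type and the grow_with_terminator name are hypothetical stand-ins introduced only
       so that the sketch compiles on its own; real code would include <courier-unicode.h>, use
       unicode_decomposition_t, and could call unicode_decompose_reallocate_size() instead of summing by hand.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <uchar.h>

/* Hypothetical stand-in with the same fields as unicode_decomposition_t,
 * declared here only to keep the sketch self-contained. */
typedef struct decomposition_sketch {
    char32_t *string;
    size_t    string_size;
    int       decompose_flags;
    int     (*reallocate)(struct decomposition_sketch *info,
                          const size_t *offsets,
                          const size_t *sizes,
                          size_t n);
    void     *arg;
} decomposition_sketch_t;

/* Hypothetical replacement callback: grow the buffer so that it can hold
 * string_size plus the sum of all sizes, plus one extra char32_t for \0.
 * It does not touch string_size or the existing characters. */
int grow_with_terminator(decomposition_sketch_t *info,
                         const size_t *offsets,
                         const size_t *sizes,
                         size_t n)
{
    size_t i, new_size;
    char32_t *p;

    assert(info != NULL);
    (void)offsets; /* only the sizes matter when sizing the buffer */

    /* Same sum that unicode_decompose_reallocate_size() computes. */
    new_size = info->string_size;
    for (i = 0; i < n; ++i)
        new_size += sizes[i];

    /* Guard against overflow in the byte count passed to realloc(). */
    if (new_size >= SIZE_MAX / sizeof(char32_t))
        return -1;

    p = realloc(info->string, (new_size + 1) * sizeof(char32_t));
    if (p == NULL)
        return -1; /* non-0 reports the failure; the old buffer is intact */

    p[new_size] = 0;   /* terminator just past the future decomposed string */
    info->string = p;  /* update string, but leave string_size alone */
    return 0;
}
```

       With the real library, such a callback would presumably be installed by assigning it to the reallocate
       member after unicode_decomposition_init() and before unicode_decompose().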

       The reallocate function returns 0 on success and a non-0 error code to report a failure; and
       unicode_decompose() does the same. The only error condition from unicode_decompose() is a non-0 error
       code from the reallocate function. Otherwise, a successful decomposition results in unicode_decompose()
       returning 0, with the unicode_decomposition_t's string pointing to the decomposed string and its
       string_size giving the number of characters in the decomposed string.

           Note

           string_size does not include the trailing \0 character. The input string also has its string_size
           specified without counting its \0 character. The default implementation of reallocate allocates an
           extra char32_t and sets it to a \0. Therefore:

           •   If the Unicode string before decomposition has a trailing \0 and no decomposition occurs, and no
               calls to reallocate take place: the string in the unicode_decomposition_t is unchanged and it's
               still \0-terminated.

           •   The default reallocate allocates an extra char32_t and sets it to a \0; and it takes care of
               that for the decomposed string.

           •   An application that provides its own replacement reallocate is responsible for doing the same, if
               it wants the decomposed string to be \0 terminated.

           Note

           Multiple calls to the reallocate callback are possible. Each call to reallocate reflects the prior
           calls' decompositions. Example: the original string has five characters and the first call to
           reallocate had two offsets, at positions 1 and 3, each with a size of 1. This has the effect of
           transforming an original Unicode string "AAAAA" into "AXAAXAA" (with “A” representing unspecified
           characters in the original string, and “X” showing the two characters added in the first call to
           reallocate).

           A second call to reallocate with an offset at position 4, and a size of 1, results in the updated
           string of "AXAAYXAA" (with “Y” marking an unspecified character inserted by the second call).
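
           The index arithmetic in this example can be checked with a small stand-alone sketch. The open_gaps
           function below is hypothetical and is not part of the library; it works on plain char strings and
           only mimics where the new characters land, assuming the offsets are sorted in ascending order and
           that each gap opens in front of the character at the given offset, as in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `in` to `out`, inserting sizes[k] copies of
 * `fill` in front of the character at offsets[k]. Offsets must be sorted in
 * ascending order; `out` must have room for strlen(in) plus the sum of all
 * sizes, plus the terminating \0. */
void open_gaps(const char *in,
               const size_t *offsets, const size_t *sizes, size_t n,
               char fill, char *out)
{
    size_t in_len, i, j, k = 0, o = 0;

    assert(in != NULL && out != NULL);
    in_len = strlen(in);

    for (i = 0; i <= in_len; ++i) {
        /* Open every gap that was requested at this position. */
        while (k < n && offsets[k] == i) {
            for (j = 0; j < sizes[k]; ++j)
                out[o++] = fill;
            ++k;
        }
        if (i < in_len)
            out[o++] = in[i];
    }
    out[o] = '\0';
}
```

           Applying offsets 1 and 3 (size 1 each) to "AAAAA" with “X” as the fill character, and then offset 4
           (size 1) to the result with “Y”, reproduces the two strings given in this note.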

           Note

           Unicode string decomposition involves replacing a given Unicode character with one or more other
           characters. The sizes given to reallocate reflect the net addition to the Unicode string. For
           example: decomposing one Unicode character into three decomposed characters results in a call to
           reallocate reporting an insert of two more characters.

           Note

           offsets actually report the indices of each Unicode character that's getting decomposed. A 1:1
           decomposition of a Unicode character gets reported as an additional sizes entry of 0.

       unicode_decomposition_deinit() releases all resources and destroys the unicode_decomposition_t; it is no
       longer valid.

           Note

           unicode_decomposition_deinit() does not free(3) the string. The original string gets passed in to
           unicode_decomposition_init() and the decomposed string is left in the string.

       The default implementation of the reallocate function assumes the string is a malloc(3)-ed string, and
       reallocs it.

           Note

           At this time unicode_decomposition_deinit() does nothing. All code should explicitly call it in order
           to remain forward-compatible (at the source level).

       unicode_compose() performs a canonical composition of a decomposed string. Its parameters are:

       •   A pointer to the decomposed Unicode string.

       •   The number of characters in the Unicode string. The Unicode string does not need to be \0-terminated;
           if it is this number does not include it.

       •   A flags bitmask, which can have the following values:

           UNICODE_COMPOSE_FLAG_REMOVEUNUSED
               Remove all combining marks after doing all canonical compositions. Normally any unused combining
               marks are left in place, in the combined text. This option removes them.

           UNICODE_COMPOSE_FLAG_ONESHOT
               Perform canonical composition once per character, and do not attempt to combine any resulting
               combined characters again.

       •   A non-NULL pointer to a size_t.

           A successful composition sets this size_t to the number of characters in the combined string, and
           returns 0. The string gets combined in place: the combined string replaces the contents of the
           string parameter.

           unicode_compose() returns a non-zero value to indicate an error.

       unicode_composition_init(), unicode_composition_apply() and unicode_composition_deinit() implement a
       detailed interface for canonical composition of a decomposed Unicode string:

           unicode_composition_t compositions;

           if (unicode_composition_init(str, strsize, flags, &compositions) == 0)
           {
               size_t new_size=unicode_composition_apply(str, strsize, &compositions);

               unicode_composition_deinit(&compositions);
           }

       The first two parameters to both unicode_composition_init() and unicode_composition_apply() are the same:
       the Unicode string and the number of characters (not including any trailing \0 character) in the Unicode
       string.

       unicode_composition_init()'s additional parameters are: any optional flags (see unicode_compose() for a
       list of available flags), and the address of a unicode_composition_t object. A non-0 return from
       unicode_composition_init() indicates an error.  unicode_composition_init() indicates success by returning
       0 and initializing the unicode_composition_t object, which contains a pointer to an array of pointers to
       unicode_compose_info objects, and the number of pointers.  unicode_composition_init() does not change
       the string; the only thing it does is initialize the unicode_composition_t object.

       unicode_composition_apply() applies the compositions to the string, in place, and returns the new size of
       the string. The returned size also does not include the \0 character; however, the function does append
       one if the composed string is smaller, so the composed string is \0-terminated if the decomposed string
       was.

       It is necessary to call unicode_composition_deinit() to free all memory that was allocated for the
       unicode_composition_t object:

           struct unicode_compose_info {
               size_t                        index;
               size_t                        n_composed;
               char32_t                      *composition;
               size_t                        n_composition;
           };

           typedef struct {
               struct unicode_compose_info   **compositions;
               size_t                        n_compositions;
           } unicode_composition_t;

       index gives the character index in the string where each composition occurs.  n_composed gives the number
       of characters in the original string that get composed. The composed characters are the composition; and
       n_composition gives the number of composed characters.

       Effectively: at the index position in the original string, n_composed characters get removed and
       n_composition characters replace them (always n_composed or fewer).
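
       The replacement rule just described can be sketched as a single in-place pass over the string. This is
       only an illustration of the rule, not the library's implementation: struct compose_info_sketch and
       apply_sketch are hypothetical stand-ins with the same fields as struct unicode_compose_info, and the
       sketch assumes the entries are sorted by index and do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical stand-in with the same fields as struct unicode_compose_info. */
struct compose_info_sketch {
    size_t    index;
    size_t    n_composed;
    char32_t *composition;
    size_t    n_composition;
};

/* At each index, drop n_composed characters and write the n_composition
 * replacement characters. Because n_composition never exceeds n_composed,
 * the write position never overtakes the read position, so the string can
 * be rewritten in place. Returns the new number of characters. */
size_t apply_sketch(char32_t *string, size_t string_size,
                    struct compose_info_sketch *const *list, size_t n_list)
{
    size_t r = 0, w = 0, j, k;

    for (k = 0; k < n_list; ++k) {
        const struct compose_info_sketch *c = list[k];

        assert(c->index >= r && c->index + c->n_composed <= string_size);
        assert(c->n_composition <= c->n_composed);

        while (r < c->index)               /* copy untouched characters */
            string[w++] = string[r++];
        for (j = 0; j < c->n_composition; ++j)
            string[w++] = c->composition[j];
        r += c->n_composed;                /* skip what got composed */
    }
    while (r < string_size)                /* copy the remaining tail */
        string[w++] = string[r++];

    if (w < string_size)
        string[w] = 0;                     /* shorter result: terminate it */
    return w;
}
```

       For example, the three characters U+0065, U+0301, U+0078 with one entry (index 0, n_composed 2,
       composition U+00E9, n_composition 1) become the two characters U+00E9, U+0078. An empty list leaves the
       string unchanged and returns string_size.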

           Note

           The UNICODE_COMPOSE_FLAG_REMOVEUNUSED flag has the effect of including the combining marks that did
           not get combined in the n_composed count. It's possible that, in this case, n_composition is 0. This
           indicates complete removal of the combining marks, without anything getting combined in their place.

       unicode_composition_init() sets unicode_composition_t's compositions pointer to an array of pointers to
       unicode_compose_infos that are sorted according to their index.  n_compositions gives the number of
       pointers in the array, and is 0 if there are no compositions (the array is empty).
       unicode_composition_apply() and unicode_composition_deinit() handle an empty array accordingly:
       unicode_composition_apply() simply returns the size of the unchanged string, and
       unicode_composition_deinit() does a pro-forma cleanup.

AUTHOR

       Sam Varshavchik
           Author

NOTES

        1. canonical and compatibility mapping
           https://www.unicode.org/reports/tr15/tr15-54.html

        2. TR-15
           https://www.unicode.org/reports/tr15/tr15-54.html

Courier Unicode Library                            05/18/2024                               UNICODE_CANONICAL(3)
