Provided by: libmsoffice-word-surgeon-perl_2.10-1_all

NAME

       MsOffice::Word::Surgeon - tamper with the guts of Microsoft docx documents, with regexes

SYNOPSIS

         my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename);

         # extract plain text
         my $main_text    = $surgeon->document->plain_text;
         my @header_texts = map {$surgeon->part($_)->plain_text} $surgeon->headers;

         # unlink fields
         $surgeon->document->unlink_fields;

         # reveal bookmarks
         $surgeon->document->reveal_bookmarks(color => 'cyan');

         # anonymize
         my %alias = ('Claudio MONTEVERDI' => 'A_____', 'Heinrich SCHÜTZ' => 'B_____');
         my $pattern = join "|", keys %alias;
         my $replacement_callback = sub {
           my %args =  @_;
           my $replacement = $surgeon->new_revision(to_delete  => $args{matched},
                                                    to_insert  => $alias{$args{matched}},
                                                    run        => $args{run},
                                                    xml_before => $args{xml_before},
                                                   );
           return $replacement;
         };
         $surgeon->all_parts_do(replace => qr[$pattern], $replacement_callback);

         # save the result
         $surgeon->overwrite; # or ->save_as($new_filename);

DESCRIPTION

   Purpose
       This module supports a few operations for inspecting or modifying contents in Microsoft Word documents in
       '.docx' format, hence the name 'surgeon'. Since a surgeon does not give life, there is no support
       for creating fresh documents; if you have such needs, use one of the other packages listed in the "SEE
       ALSO" section, or use the companion module MsOffice::Word::Template.

       Some applications for this module are:

       •   content extraction in plain text format;

       •   unlinking fields (equivalent to pressing Ctrl-Shift-F9 on the whole document);

       •   adding markers at bookmark start and end positions;

       •   regex replacements within text, for example for:

           •   anonymization, i.e. replacement of names or addresses by aliases;

           •   templating, i.e. replacement of special markup by contents coming from a data tree (see also
               MsOffice::Word::Template);

       •   insertion of generated images (for example barcodes) -- see "images" in
           MsOffice::Word::Surgeon::PackagePart;

       •   pretty-printing the internal XML structure.

   The ".docx" format
       The format of Microsoft ".docx" documents is described in
       <http://www.ecma-international.org/publications/standards/Ecma-376.htm> and <http://officeopenxml.com/>.
       An excellent introduction can be found at <https://www.toptal.com/xml/an-informal-introduction-to-docx>.
       Another valuable source of documentation is <http://officeopenxml.com/WPcontentOverview.php>.
       Internally, a document is a zipped archive, where the member named "word/document.xml" stores the main
       document contents, in XML format.
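
       To make the archive layout concrete, here is a minimal sketch using Archive::Zip (the library behind
       the "zip" accessor described below). The helper name "read_zip_member" and the toy archive are
       illustrative assumptions: the toy file is not a valid Word document, it only mimics the member name
       mentioned above.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use File::Temp qw(tempfile);

# Return the raw contents of one member of a ZIP archive (e.g. a .docx file).
sub read_zip_member {
  my ($path, $member_name) = @_;
  my $zip = Archive::Zip->new;
  $zip->read($path) == AZ_OK or die "cannot read $path as a ZIP archive";
  my $bytes = $zip->contents($member_name);
  defined $bytes or die "no member $member_name in $path";
  return $bytes;
}

# Toy archive standing in for a real document: NOT a valid .docx,
# it only reproduces the "word/document.xml" member name.
my ($fh, $toy_path) = tempfile(SUFFIX => '.zip', UNLINK => 1);
close $fh;
my $toy = Archive::Zip->new;
$toy->addString('<w:document><w:t>Hello</w:t></w:document>', 'word/document.xml');
$toy->writeToFileNamed($toy_path) == AZ_OK or die "cannot write $toy_path";

my $xml = read_zip_member($toy_path, 'word/document.xml');
print "$xml\n";
```

       Pointing "read_zip_member" at a real ".docx" file should return the raw XML of the main document;
       the surgeon object wraps this kind of access and also takes care of utf8 decoding.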

   Operating mode
       The  present module does not parse all details of the whole XML structure because it only focuses on text
       nodes (those that contain literal text) and run nodes (those that contain  text  formatting  properties).
       All remaining XML information, for example for representing sections, paragraphs, tables, etc., is stored
       as  opaque  XML  fragments;  these fragments are re-inserted at proper places when reassembling the whole
       document after having modified some text nodes.
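
       The idea of touching only text nodes and copying everything else through can be illustrated with a toy
       sketch. This is not the module's actual parser, which is more elaborate; the function name
       "map_text_nodes" and the sample XML are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Apply $callback to the contents of every <w:t> node. All other markup is
# copied through untouched, like the "opaque fragments" described above.
sub map_text_nodes {
  my ($xml, $callback) = @_;
  $xml =~ s{(<w:t(?:\s[^>]*)?>)(.*?)(</w:t>)}
           { my ($open, $text, $close) = ($1, $2, $3);
             $open . $callback->($text) . $close }gse;
  return $xml;
}

my $xml = '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Hello </w:t></w:r>'
        . '<w:r><w:t>world</w:t></w:r></w:p>';

# Uppercase the literal text; run properties and paragraph markup stay as they were.
print map_text_nodes($xml, sub { uc $_[0] }), "\n";
```

       The pattern deliberately requires either ">" or whitespace right after "<w:t", so that other elements
       whose names merely start with the same letters (such as "<w:tab/>") are left alone.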

METHODS

   Constructor
       new

         my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename_or_filehandle);
         # or simply : ->new($filename);

       Builds a new surgeon instance, initialized with the contents of the given filename or filehandle.

   Accessors
       docx

       Path to the ".docx" file

       zip

       Instance of Archive::Zip associated with this file

       parts

       Hashref to MsOffice::Word::Surgeon::PackagePart objects, keyed by their part name in the ZIP file.  There
       is always a 'document' part. Other parts may be headers, footers, footnotes or endnotes.

       document

       Shortcut to "$surgeon->part('document')" -- the MsOffice::Word::Surgeon::PackagePart object corresponding
       to the main document.  See the "PackagePart" documentation for operations on part objects.  Besides,  the
       following  operations  are  supported  directly  as  methods to the $surgeon object and are automatically
       delegated to the "document" part : "contents",  "original_contents",  "indented_contents",  "plain_text",
       "replace".

       headers

         my @header_parts = $surgeon->headers;

       Returns the ordered list of names of header members stored in the ZIP file.

       footers

         my @footer_parts = $surgeon->footers;

       Returns the ordered list of names of footer members stored in the ZIP file.
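
       Since "headers" and "footers" return part names, they combine naturally with "part". The following
       sketch collects the plain text of every header and footer; the helper name "header_footer_texts" is
       hypothetical, and the function only relies on the methods documented in this page.

```perl
use strict;
use warnings;

# Return a list of pairs (part name => plain text) for all header and footer
# parts. $surgeon is expected to behave like a MsOffice::Word::Surgeon
# instance, i.e. to support ->headers, ->footers and ->part($name)->plain_text.
sub header_footer_texts {
  my ($surgeon) = @_;
  return map { ($_ => $surgeon->part($_)->plain_text) }
             $surgeon->headers, $surgeon->footers;
}
```

       With a real instance this would be called as "my %texts = header_footer_texts($surgeon);".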

   Other methods
       part

         my $part = $surgeon->part($part_name);

       Returns the MsOffice::Word::Surgeon::PackagePart object corresponding to the given part name.

       all_parts_do

         my $result = $surgeon->all_parts_do($method_name => %args);

       Calls the given method on all part objects. Results are accumulated in a hash, with part names as keys to
       the results. This is mostly used to invoke the "replace" method of MsOffice::Word::Surgeon::PackagePart,
       i.e.

         $surgeon->all_parts_do(replace => qr[$pattern], $replacement_callback, %replacement_args);
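
       When the pattern is built from literal strings, as in the anonymization example of the SYNOPSIS, it
       may be safer to escape regex metacharacters first. The sketch below is one possible way to do so; the
       helper name "alias_pattern" is hypothetical and not part of this module.

```perl
use strict;
use warnings;
use utf8;

# Build one regex matching any key of %$alias literally.
# quotemeta protects regex metacharacters; sorting longest-first ensures
# that a key which is a prefix of another key does not shadow it.
sub alias_pattern {
  my ($alias) = @_;
  my $alternation = join "|",
                    map  { quotemeta($_) }
                    sort { length($b) <=> length($a) || $a cmp $b }
                    keys %$alias;
  return qr/$alternation/;
}

my %alias   = ('Claudio MONTEVERDI' => 'A_____', 'Heinrich SCHÜTZ' => 'B_____');
my $pattern = alias_pattern(\%alias);
```

       The resulting $pattern can then be passed as the first "replace" argument, as in
       "$surgeon->all_parts_do(replace => $pattern, $replacement_callback)".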

       xml_member

         my $xml = $surgeon->xml_member($member_name); # reading
         # or
         $surgeon->xml_member($member_name, $new_xml); # writing

       Reads or writes the given member name in the ZIP file, with utf8 decoding or encoding.
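
       As an illustration of the reading form, the sketch below counts paragraph elements in the main
       document member. The helper name "count_paragraphs" is hypothetical, and counting with a regex is only
       a rough approximation that ignores where the paragraphs occur (body, tables, text boxes).

```perl
use strict;
use warnings;

# Count <w:p> elements in the main document member.
# $surgeon is expected to support ->xml_member($member_name).
# The lookahead excludes elements such as <w:pPr> whose names merely
# start with the same letters.
sub count_paragraphs {
  my ($surgeon) = @_;
  my $xml   = $surgeon->xml_member('word/document.xml');
  my $count = () = $xml =~ m{<w:p(?=[\s/>])}g;
  return $count;
}
```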

       save_as

         $surgeon->save_as($docx_file_or_filehandle);

       Writes the ZIP archive into the given file or filehandle.

       overwrite

         $surgeon->overwrite;

       Writes the updated ZIP archive into the initial file.  If the initial "docx" was given as  a  filehandle,
       use the "save_as" method instead.
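
       The choice between the two saving methods can be wrapped in a small helper. This is a sketch under one
       assumption that the documentation above does not confirm: that the "docx" accessor returns whatever
       was passed to the constructor, so that a reference indicates a filehandle. The helper name
       "save_result" is hypothetical.

```perl
use strict;
use warnings;

# Save the surgeon's work: overwrite the original file when the document was
# opened from a path, otherwise write to $fallback_path.
# Assumption: ->docx returns the constructor argument, so a reference
# (e.g. a lexical filehandle) means there is no path to overwrite.
sub save_result {
  my ($surgeon, $fallback_path) = @_;
  if (ref $surgeon->docx) {
    defined $fallback_path
      or die "document was opened from a filehandle; a target path is needed";
    $surgeon->save_as($fallback_path);
    return $fallback_path;
  }
  $surgeon->overwrite;
  return $surgeon->docx;
}
```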

       new_revision

         my $xml = $surgeon->new_revision(
           to_delete   => $text_to_delete,
           to_insert   => $text_to_insert,
           author      => $author_string,
           date        => $date_string,
           run         => $run_object,
           xml_before  => $xml_string,
         );

       This  method  is  syntactic  sugar  for  instantiating  the  MsOffice::Word::Surgeon::Revision  class and
       returning XML markup for MsWord revisions (a.k.a. "tracked changes") generated by that class.  Users  can
       then  manually  review  those  revisions  within  MsWord  and accept or reject them. This is best used in
       collaboration with the "replace" method : the replacement callback can call "$surgeon->new_revision(...)"
       to generate revision marks in the document.

       Either "to_delete" or "to_insert"  (or  both)  must  be  present.  Other  parameters  are  optional.  The
       parameters are :

       to_delete
           The  string  of text to delete (usually this will be the "matched" argument passed to the replacement
           callback).

       to_insert
           The string of new text to insert.

       author
           A short string that will be displayed by MsWord as the "author" of this revision.

       date
           A date (and optional time) in ISO format that will be  displayed  by  MsWord  as  the  date  of  this
           revision. The current date and time will be used by default.

       run A  reference  to  the  MsOffice::Word::Surgeon::Run  object surrounding this revision. The formatting
           properties of that run will be copied into the  "<w:r>"  nodes  of  the  deleted  and  inserted  text
           fragments.

       xml_before
           An optional XML fragment to be inserted before the "<w:t>" node of the inserted text.
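
       The parameters above can be combined in a reusable callback factory. The following is a minimal
       sketch, not the module's own code: the name "make_revision_callback" is hypothetical, and the UTC
       timestamp format is an assumption about what counts as an acceptable "ISO format" date.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build a replacement callback suitable for ->replace / ->all_parts_do.
# $surgeon must support ->new_revision(...); %$alias maps matched text to
# its substitute. All revisions produced by one callback share one timestamp.
sub make_revision_callback {
  my ($surgeon, $alias, $author) = @_;
  my $date = strftime("%Y-%m-%dT%H:%M:%SZ", gmtime);   # ISO 8601, UTC (assumed format)
  return sub {
    my %args = @_;
    return $surgeon->new_revision(
      to_delete  => $args{matched},
      to_insert  => $alias->{ $args{matched} },
      author     => $author,
      date       => $date,
      run        => $args{run},
      xml_before => $args{xml_before},
    );
  };
}
```

       The returned code reference takes the same named arguments as the callback in the SYNOPSIS
       ("matched", "run", "xml_before"), so it can be passed directly as the replacement callback.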

   Operations on parts
       See  the  MsOffice::Word::Surgeon::PackagePart  documentation  for  other  operations  on  package parts,
       including operations on fields, bookmarks or images.

SEE ALSO

       The <https://metacpan.org/pod/Document::OOXML> distribution on CPAN also  manipulates  "docx"  documents,
       but  with  another  approach  : internally it uses XML::LibXML and XPath expressions for manipulating XML
       nodes. The API has some intersections with the present module, but there  are  also  some  differences  :
       "Document::OOXML"  has  more  support  for  styling,  while  "MsOffice::Word::Surgeon"  has more flexible
       mechanisms for replacing text fragments.

       Other programming languages also  have  packages  for  dealing  with  "docx"  documents;  here  are  some
       references :

       <https://docs.microsoft.com/en-us/office/open-xml/word-processing>
           The C# Open XML SDK from Microsoft

       <http://www.ericwhite.com/blog/open-xml-powertools-developer-center/>
           Additional functionalities built on top of the Open XML SDK.

       <https://poi.apache.org>
           An open source Java library from the Apache foundation.

       <https://www.docx4java.org/trac/docx4j>
           Another open source Java library, competitor to Apache POI.

       <https://phpword.readthedocs.io/en/latest/>
           A PHP library dealing not only with Microsoft OOXML documents but also with OASIS and RTF formats.

       <https://pypi.org/project/python-docx/>
           A Python library, documented at <https://python-docx.readthedocs.io/en/latest/>.

       As  far  as  I  can  tell,  most  of these libraries provide objects and methods that closely reflect the
       complete XML structure : for example they have classes for paragraphs, styles, fonts, inline shapes, etc.

       The present module is much simpler but also much more limited : it was optimised  for  dealing  with  the
       text  contents  and  offers  no  support  for  presentation  or paging features. However, it has the rare
       advantage of providing an API for regex substitutions within Word documents.

       The MsOffice::Word::Template module relies on  the  present  module,  together  with  the  Perl  Template
       Toolkit, to implement a templating system for Word documents.

AUTHOR

       Laurent Dami, <dami AT cpan DOT org>

COPYRIGHT AND LICENSE

       Copyright 2019-2024 by Laurent Dami.

       This program is free software; you can redistribute it and/or modify it under the terms of the Artistic
       License version 2.0.

perl v5.40.1                                       2025-05-16                       MsOffice::Word::Surgeon(3pm)