Ubuntu Manpage: PDL::BadValues - Discussion of bad value support in PDL

NAME

       PDL::BadValues - Discussion of bad value support in PDL

DESCRIPTION

   What are bad values and why should I bother with them?
       Sometimes it's useful to be able to specify a certain value is 'bad' or 'missing'; for example CCDs used
       in astronomy produce 2D images which are not perfect since certain areas contain invalid data due to
       imperfections in the detector.  Whilst PDL's powerful index routines and all the complicated business
       with dataflow, slices, etc etc mean that these regions can be ignored in processing, it's awkward to do.
       It would be much easier to be able to say "$c = $x + $y" and leave all the hassle to the computer.

       If you're not interested in this, then you may (rightly) be concerned with how this affects the speed of
       PDL, since the overhead of checking for a bad value at each operation can be large.  Because of this, the
       code has been written to be as fast as possible - particularly when operating on ndarrays which do not
       contain bad values.  In fact, you should notice essentially no speed difference when working with
       ndarrays which do not contain bad values.

       You may also ask 'well, my computer supports IEEE NaN, so I already have this'.  Well, yes and no - many
       routines, such as "y=sin(x)", will propagate NaN's without the user having to code differently, but
       routines such as "qsort", or finding the median of an array, need to be re-coded to handle bad values.
       For floating-point datatypes, "NaN" and "Inf" can be used to flag bad values, but by default special
       values are used (Default bad values).  I do not have any benchmarks to see which option is faster.

       As of PDL 2.040, you can have different bad values for separate ndarrays of the same type.

   A quick overview
        pdl> $x = sequence(4,3);
        pdl> p $x
        [
         [ 0  1  2  3]
         [ 4  5  6  7]
         [ 8  9 10 11]
        ]
        pdl> $x = $x->setbadif( $x % 3 == 2 )
        pdl> p $x
        [
         [  0   1 BAD   3]
         [  4 BAD   6   7]
         [BAD   9  10 BAD]
        ]
        pdl> $x *= 3
        pdl> p $x
        [
         [  0   3 BAD   9]
         [ 12 BAD  18  21]
         [BAD  27  30 BAD]
        ]
        pdl> p $x->sum
        120

       "demo bad" and "demo bad2" within perldl or pdl2 gives a demonstration of some of the things possible
       with bad values.  These are also available on PDL's web-site, at http://pdl.perl.org/demos/.  See
       PDL::Bad for useful routines for working with bad values and t/bad.t to see them in action.

       To find out if a routine supports bad values, use the "badinfo" command in perldl or pdl2 or the "-b"
       option to pdldoc.  This facility is currently a 'proof of concept' (or, more realistically, a quick hack)
       so expect it to be rough around the edges.

       Each ndarray contains a flag - accessible via "$pdl->badflag" - to say whether there's any bad data
       present:

       •   If  false/0,  which  means  there's  no  bad  data  here,  the  code supplied by the "Code" option to
           "pp_def()" is executed.

       •   If true/1, then this says there MAY be bad data in the ndarray, so use  the  code  in  the  "BadCode"
           option  (assuming  that the "pp_def()" for this routine has been updated to have a BadCode key).  You
           get all the advantages of threading, as with the "Code" option, but it will run slower since you  are
           going to have to handle the presence of bad values.

       If   you  create  an  ndarray,  it  will  have  its  bad-value  flag  set  to  0.  To  change  this,  use
       "$pdl->badflag($new_bad_status)", where $new_bad_status can be  0  or  1.   When  a  routine  creates  an
       ndarray,   its   bad-value  flag  will  depend  on  the  input  ndarrays:  unless  over-ridden  (see  the
       "CopyBadStatusCode" option to "pp_def"), the bad-value flag will be set true if any of the input ndarrays
       contain bad values.  To check that an ndarray really contains bad data, use the "check_badflag" method.

       NOTE: propagation of the badflag

       If you change the badflag of an ndarray, this change is propagated to all the children of an ndarray, so

          pdl> $x = zeroes(20,30);
          pdl> $y = $x->slice('0:10,0:10');
          pdl> $c = $y->slice(',(2)');
          pdl> print ">>c: ", $c->badflag, "\n";
          >>c: 0
          pdl> $x->badflag(1);
          pdl> print ">>c: ", $c->badflag, "\n";
          >>c: 1

       No change is made to the parents of an ndarray, so

          pdl> print ">>a: ", $x->badflag, "\n";
          >>a: 1
          pdl> $c->badflag(0);
          pdl> print ">>a: ", $x->badflag, "\n";
          >>a: 1

       Thoughts:

       •   the badflag can ONLY be cleared IF an ndarray has NO parents, and that this change will propagate  to
           all the children of that ndarray. I am not so keen on this anymore (too awkward to code, for one).

       •   "$x->badflag(1)" should propagate the badflag to BOTH parents and children.

       This  shouldn't  be  hard to implement (although an initial attempt failed!).  Does it make sense though?
       There's also the issue of what happens if you change the badvalue of an ndarray - should these  propagate
       to  children/parents (yes) or whether you should only be able to change the badvalue at the 'top' level -
       i.e. those ndarrays which do not have parents.

       The "orig_badvalue()" method returns the compile-time value for a given datatype. It works  on  ndarrays,
       PDL::Type objects, and numbers - eg

         $pdl->orig_badvalue(), byte->orig_badvalue(), and orig_badvalue(4).

       It also has a horrible name...

       To get the current bad value, use the "badvalue()" method - it has the same syntax as "orig_badvalue()".

       To change the current bad value, supply the new number to badvalue - eg

         $pdl->badvalue(2.3), byte->badvalue(2), badvalue(5,-3e34).

       Note:  the  value  is silently converted to the correct C type, and returned - i.e. "byte->badvalue(-26)"
       returns 230 on my Linux machine.

       Note that changes to the bad value are NOT propagated to previously-created ndarrays -  they  will  still
       have  the  bad  value set, but suddenly the elements that were bad will become 'good', but containing the
       old bad value.  See discussion below.  It's not a problem for floating-point types which use  NaN,  since
       you can not change their badvalue.

   Bad values and boolean operators
       For  those  boolean  operators in PDL::Ops, evaluation on a bad value returns the bad value.  Whilst this
       means that

        $mask = $img > $thresh;

       correctly propagates bad values, it will cause problems for checks such as

        do_something() if any( $img > $thresh );

       which need to be re-written as something like

        do_something() if any( setbadtoval( ($img > $thresh), 0 ) );

       When using one of the 'projection' functions in PDL::Ufunc - such as orover - bad values are skipped over
       (see the documentation of these functions for the current (poor) handling of the case when  all  elements
       are bad).

   A bad value for each ndarray, and related issues
       There  is one default bad value for each datatype, but you can have a separate bad value for each ndarray
       as of PDL 2.040.

IMPLEMENTATION DETAILS

       PDL code just needs to access the %PDL::Config array (e.g. Basic/Bad/bad.pd) to  find  out  whether  bad-
       value support is required.

       A  new flag has been added to the state of an ndarray - "PDL_BADVAL". If unset, then the ndarray does not
       contain bad values, and so all the support code can be ignored. If set, it does not  guarantee  that  bad
       values  are  present,  just  that  they  should  be checked for. Thanks to Christian, "badflag()" - which
       sets/clears this flag (see Basic/Bad/bad.pd) - will  update  ALL  the  children/grandchildren/etc  of  an
       ndarray   if   its   state   changes  (see  "badflag"  in  Basic/Bad/bad.pd  and  "propagate_badflag"  in
       Basic/Core/Core.xs.PL).  It's not clear what to do with parents: I can see the reason for  propagating  a
       'set  badflag'  request  to  parents,  but  I  think a child should NOT be able to clear the badflag of a
       parent.  There's also the issue of what happens when you change the bad value for an ndarray.

       The "pdl_trans" structure has been extended to include an integer value,  "bvalflag",  which  acts  as  a
       switch  to  tell  the  code  whether  to  handle bad values or not. This value is set if any of the input
       ndarrays  have  their  "PDL_BADVAL"  flag  set  (although  this  code  can   be   replaced   by   setting
       "FindBadStateCode"  in pp_def).  The logic of the check is going to get a tad more complicated if I allow
       routines to fall back to using the "Code" section for floating-point types.

       The default bad values are now stored in a structure within the Core  PDL  structure  -  "PDL.bvals"  (eg
       Basic/Core/pdlcore.h.PL);  see  also  "typedef  badvals"  in  Basic/Core/pdl.h.PL  and  the  BOOT code of
       Basic/Core/Core.xs.PL  where  the  values  are  initialised  to   (hopefully)   sensible   values.    See
       PDL/Bad/bad.pd for read/write routines to the values.

   Why not make a PDL subclass?
       The support for bad values could have been done as a PDL sub-class.  The advantage of this approach would
       be that you only load in the code to handle bad values if you actually want to use them.  The downside is
       that  the  code  then  gets  separated:  any  bug  fixes/improvements  have to be done to the code in two
       different files.  With the present approach the code is in the same "pp_def" function (although there  is
       still the problem that both "Code" and "BadCode" sections need updating).

   Default bad values
       The default/original bad values are set to (taken from the Starlink distribution):

         #include <limits.h>

         PDL_Byte    ==  UCHAR_MAX
         PDL_Short   ==   SHRT_MIN
         PDL_Ushort  ==  USHRT_MAX
         PDL_Long    ==    INT_MIN
         PDL_Float   ==   -FLT_MAX
         PDL_Double  ==   -DBL_MAX

   How do I change a routine to handle bad values?
       Examples  can  be found in most of the *.pd files in Basic/ (and hopefully many more places soon!).  Some
       of the logic might appear a bit unclear - that's probably because it is! Comments appreciated.

       All routines should automatically propagate the bad status flag to output ndarrays,  unless  you  declare
       otherwise.

       If a routine explicitly deals with bad values, you must provide this option to pp_def:

          HandleBad => 1

       This ensures that the correct variables are initialised for the $ISBAD etc macros. It is also used by the
       automatic document-creation routines to provide default information on the bad value support of a routine
       without the user having to type it themselves (this is in its early stages).

       To flag a routine as NOT handling bad values, use

          HandleBad => 0

       This  should  cause  the  routine  to  print  a  warning if it's sent any ndarrays with the bad flag set.
       Primitive's "intover" has had this set - since it would be awkward to convert - but I've not tried it out
       to see if it works.

       If you want to handle bad values but not set the state of all the output ndarrays, or if  it's  only  one
       input  ndarray  that's important, then look at the PP rules "NewXSFindBadStatus" and "NewXSCopyBadStatus"
       and the corresponding "pp_def" options:

       FindBadStatusCode
           By default, "FindBadStatusCode" creates code which sets "$PRIV(bvalflag)" depending on the  state  of
           the bad flag of the input ndarrays: see "findbadstatus" in Basic/Gen/PP.pm.  User-defined code should
           also store the value of "bvalflag" in the "$BADFLAGCACHE()" variable.

       CopyBadStatusCode
           The  default  code  here  is  a  bit simpler than for "FindBadStatusCode": the bad flag of the output
           ndarrays are set if  "$BADFLAGCACHE()"  is  true  after  the  code  has  been  evaluated.   Sometimes
           "CopyBadStatusCode"  is set to an empty string, with the responsibility of setting the badflag of the
           output   ndarray   left   to   the   "BadCode"   section   (e.g.   the    "xxxover"    routines    in
           Basic/Primitive/primitive.pd).

           Prior  to  PDL  2.4.3 we used "$PRIV(bvalflag)" instead of "$BADFLAGCACHE()". This is dangerous since
           the "$PRIV()" structure is not guaranteed to be valid at this point in the code.

       If you have a routine that you want to be able to use as in-place, look at the  routines  in  bad.pd  (or
       ops.pd)  which  use  the  "in-place"  option  to see how the bad flag is propagated to children using the
       "xxxBadStatusCode" options.  I decided not to automate this as rules would be a little complex, since not
       every in-place op will need to propagate the badflag (eg unary functions).

       If the option

          HandleBad => 1

       is given, then many things happen.  For integer types, the readdata code automatically creates a variable
       called "(pdl name)_badval", which contains the bad value for that  ndarray  (see  "get_xsdatapdecl()"  in
       Basic/Gen/PP/PdlParObjs.pm).   However,  do  not  hard code this name into your code!  Instead use macros
       (thanks to Tuomas for the suggestion):

         '$ISBAD(a(n=>1))'  expands to '$a(n=>1) == a_badval'
         '$ISGOOD(a())'                '$a()     != a_badval'
         '$SETBAD(bob())'              '$bob()    = bob_badval'

       well, the "$a(...)" is expanded as well. Also, you can use a "$" before the pdl name, if you so wish, but
       it begins to look like line noise - eg "$ISGOOD($a())".

       If you cache an ndarray value in a variable -- eg "index" in slices.pd  --  the  following  routines  are
       useful:

          '$ISBADVAR(c_var,pdl)'       'c_var == pdl_badval'
          '$ISGOODVAR(c_var,pdl)'      'c_var != pdl_badval'
          '$SETBADVAR(c_var,pdl)'      'c_var  = pdl_badval'

       The following have been introduced, They may need playing around with to improve their use.

         '$PPISBAD(CHILD,[i])          'CHILD_physdatap[i] == CHILD_badval'
         '$PPISGOOD(CHILD,[i])         'CHILD_physdatap[i] != CHILD_badval'
         '$PPSETBAD(CHILD,[i])         'CHILD_physdatap[i]  = CHILD_badval'

       You can use "NaN" as the bad value for any floating-point type, including complex.

       This all means that you can change

          Code => '$a() = $b() + $c();'

       to

          BadCode => 'if ( $ISBAD(b()) || $ISBAD(c()) ) {
                        $SETBAD(a());
                      } else {
                        $a() = $b() + $c();
                      }'

       leaving Code as it is. PP::PDLCode will then create a loop something like

          if ( __trans->bvalflag ) {
               threadloop over BadCode
          } else {
               threadloop over Code
          }

       (it's probably easier to just look at the .xs file to see what goes on).

   Going beyond the Code section
       Similar to "BadCode", there's "BadBackCode", and "BadRedoDimsCode".

       Handling  "EquivCPOffsCode"  is a bit different: under the assumption that the only access to data is via
       the "$EQUIVCPOFFS(i,j)" macro, then we can  automatically  create  the  'bad'  version  of  it;  see  the
       "[EquivCPOffsCode]" and "[Code]" rules in PDL::PP.

   Macro access to the bad flag of an ndarray
       Macros have been provided to provide access to the bad-flag status of a pdl:

         '$PDLSTATEISBAD(a)'    -> '($PDL(a)->state & PDL_BADVAL) > 0'
         '$PDLSTATEISGOOD(a)'      '($PDL(a)->state & PDL_BADVAL) == 0'

         '$PDLSTATESETBAD(a)'      '$PDL(a)->state |= PDL_BADVAL'
         '$PDLSTATESETGOOD(a)'     '$PDL(a)->state &= ~PDL_BADVAL'

       For use in "xxxxBadStatusCode" (+ other stuff that goes into the INIT: section) there are:

         '$SETPDLSTATEBAD(a)'       -> 'a->state |= PDL_BADVAL'
         '$SETPDLSTATEGOOD(a)'      -> 'a->state &= ~PDL_BADVAL'

         '$ISPDLSTATEBAD(a)'        -> '((a->state & PDL_BADVAL) > 0)'
         '$ISPDLSTATEGOOD(a)'       -> '((a->state & PDL_BADVAL) == 0)'

       In   PDL   2.4.3   the  "$BADFLAGCACHE()"  macro  was  introduced  for  use  in  "FindBadStatusCode"  and
       "CopyBadStatusCode".

WHAT ABOUT DOCUMENTATION?

       One of the strengths of PDL is its on-line documentation. The aim  is  to  use  this  system  to  provide
       information  on  how/if  a  routine  supports  bad  values:  in  many  cases  "pp_def()" contains all the
       information anyway, so the function-writer doesn't need to do anything at all! For the cases when this is
       not sufficient, there's the "BadDoc" option. For code written at the Perl level - i.e. in a  .pm  file  -
       use the "=for bad" pod directive.

       This  information  will  be  available  via man/pod2man/html documentation. It's also accessible from the
       "perldl" or "pdl2" shells - using the "badinfo" command - and the "pdldoc" shell command - using the "-b"
       option.

CURRENT ISSUES

       There are a number of areas that need work, user input, or both!  They are mentioned  elsewhere  in  this
       document, but this is just to make sure they don't get lost.

   Trapping invalid mathematical operations
       Should  we  add  exceptions  to  the functions in "PDL::Ops" to set the output bad for out-of-range input
       values?

        pdl> p log10(pdl(10,100,-1))

       I would like the above to produce "[1 2 BAD]", but this would slow down operations on all  ndarrays.   We
       could check for "NaN"/"Inf" values after the operation, but I doubt that would be any faster.

   Dataflow of the badflag
       Currently  changes  to the bad flag are propagated to the children of an ndarray, but perhaps they should
       also be passed on to the parents as well. With the advent of per-ndarray bad values we need  to  consider
       how to handle changes to the value used to represent bad items too.

EVERYTHING ELSE

       The build process has been affected. The following files are now created during the build:

         Basic/Core/pdlcore.h      pdlcore.h.PL
                    pdlcore.c      pdlcore.c.PL
                    pdlapi.c       pdlapi.c.PL
                    Core.xs        Core.xs.PL
                    Core.pm        Core.pm.PL

       Several new files have been added:

         Basic/Pod/BadValues.pod (i.e. this file)

         t/bad.t

         Basic/Bad/
         Basic/Bad/Makefile.PL
                   bad.pd

       etc

TODO/SUGGESTIONS

       •   what to do about "$y = pdl(-2); $x = log10($y)" - $x should be set bad, but it currently isn't.

       •   Allow  the  operations in PDL::Ops to skip the check for bad values when using NaN as a bad value and
           processing a floating-point ndarray.  Needs a fair bit of work to PDL::PP::PDLCode.

       •   "$pdl->badflag()" now updates all the children of this ndarray as well. However, not sure what to  do
           with parents, since:

             $y = $x->slice();
             $y->badflag(0)

           doesn't mean that $x shouldn't have its badvalue cleared.  however, after

             $y->badflag(1)

           it's sensible to assume that the parents now get flagged as containing bad values.

           PERHAPS  you  can only clear the bad value flag if you are NOT a child of another ndarray, whereas if
           you set the flag then all children AND parents should be set as well?

           Similarly, if you change the bad value in an ndarray, should this be propagated to parent & children?
           Or should you only be able to do this on the 'top-level' ndarray? Nasty...

       •   some of the names aren't appealing  -  I'm  thinking  of  "orig_badvalue()"  in  Basic/Bad/bad.pd  in
           particular. Any suggestions appreciated.

AUTHOR

       Copyright (C) Doug Burke (djburke@cpan.org), 2000, 2006.

       The per-ndarray bad value support is by Heiko Klein (2006).

       Commercial reproduction of this documentation in a different format is forbidden.

perl v5.34.0                                       2022-02-08                                      BADVALUES(1p)