Provided by: spamoracle_1.6-2_amd64

NAME

       spamoracle.conf - SpamOracle configuration file format

DESCRIPTION

       The  spamoracle.conf  file  is  a  configuration file governing the operation of the spamoracle(1) e-mail
       classification tool.  By default, spamoracle(1) looks for the configuration file at
       $HOME/.spamoracle.conf, but an alternate location can be specified using the -config flag to
       spamoracle(1).

       Important  note:  most of the configuration parameters should not be modified lightly, as this may result
       in completely wrong e-mail classification.  Familiarity with Graham's filtering algorithm,  as  described
       in  the  paper  referenced  at the end of this page, is recommended to fully understand the effect of the
       parameters.

SYNTAX

       The spamoracle.conf file is composed of lines of the form variable = value.  Lines starting with a # sign
       are treated as comments and ignored.  Blank lines are ignored.

       Depending on the type of the variable (see the list of variables below), the value part takes one of  the
       following forms:

       string A  sequence  of  characters.  Blanks (spaces, tabs) at the beginning and the end of the string are
              ignored.  Alternatively, the string can be enclosed in double quotes ("), in which case spaces are
              not trimmed.  Inside quoted strings, backslashes (\) and double quotes (") must be escaped with a
              backslash, as in \\ or \".

       boolean
              Either on, yes, true, or 1 to activate the boolean option, or off, no, false, or 0  to  deactivate
              it.

       integer
              A decimal integer.

       float  A decimal floating-point number.

       regexp A  regular  expression in emacs(1) syntax.  The repetition operators are *, +, and ?.  Alternation
              is written \| and grouping is written \(...\).  Character classes  are  written  between  brackets
              [...]   as  usual.   A  single  dot denotes any character except newline.  Regular expressions are
              case-insensitive.
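
       The line format and the string and boolean value forms described above can be illustrated with a
       small reader.  This is a minimal sketch in Python, not the parser used by spamoracle(1); the
       function names, the sample path, and the handling of leading blanks before a # sign are
       assumptions made for illustration.

```python
TRUE_WORDS = {"on", "yes", "true", "1"}
FALSE_WORDS = {"off", "no", "false", "0"}


def parse_config(text):
    """Return a dict mapping each variable name to its raw, trimmed value."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and comment lines are ignored.
        # Assumption: blanks before the # sign are tolerated.
        if not stripped or stripped.startswith("#"):
            continue
        name, separator, value = stripped.partition("=")
        if not separator:
            raise ValueError(f"not of the form variable = value: {line!r}")
        settings[name.strip()] = value.strip()
    return settings


def parse_string(raw):
    """Parse a string value: trimmed bare text, or a double-quoted string."""
    text = raw.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        out = []
        chars = iter(text[1:-1])
        for ch in chars:
            if ch == "\\":
                # A backslash escapes the next character (\\ or \").
                ch = next(chars, "\\")
            out.append(ch)
        return "".join(out)
    return text


def parse_boolean(raw):
    """Parse one of the boolean keywords listed above."""
    word = raw.strip()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean value: {raw!r}")


# Hypothetical settings, for illustration only.
SAMPLE = r'''
# Keep the database next to the mail folders
database_file = "/home/alice/my mail/spamoracle.db"
html_retain_tags = yes
num_meaningful_words = 15
'''

raw = parse_config(SAMPLE)
database_file = parse_string(raw["database_file"])
html_retain_tags = parse_boolean(raw["html_retain_tags"])
num_meaningful_words = int(raw["num_meaningful_words"])
```

       Quoting the database_file value keeps the space inside the path, which a bare string would also
       preserve; quoting matters mainly when blanks at either end of the value must be kept.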

CONFIGURABLE PARAMETERS

       database_file
              (type string, default value $HOME/.spamoracle.db)
              The location of the file that contains the database of word frequencies used by spamoracle(1).

       html_retain_tags
              (type boolean, default value false)
              In HTML-formatted e-mails and attachments, the names of HTML tags  are  normally  not  treated  as
              words  and  are  ignored for the word frequency calculations. If the html_retain_tags parameter is
              set to true, HTML tags (such as img or b) are treated as words and included in the  computation
              of word frequencies.

       html_tag_attributes
              (type regexp, default value
              a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color)
              This  regular  expression matches pairs of HTML tags and HTML attributes written as tag/attribute.
              When scanning HTML-formatted e-mails  and  attachments,  attributes  to  HTML  tags  are  normally
              ignored, unless the tag/attribute pair matches the regular expression html_tag_attributes.  If the
              tag/attribute  pair matches this regexp, the value of the attribute (for instance, the URL for the
              a/href attribute) is scanned for words.
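
              To see which tag/attribute pairs the default value selects, the regexp can be translated to
              another regular expression dialect and tried out.  The following Python sketch handles only
              the operators listed in the SYNTAX section; the helper names are hypothetical, and requiring
              the regexp to match the whole tag/attribute pair is an assumption, since this page does not
              say how the match is anchored.

```python
import re

DEFAULT_HTML_TAG_ATTRIBUTES = (
    r"a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color"
)


def emacs_to_python(pattern):
    """Translate \\| and \\(...\\) to Python's | and (...)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \| \( \) are operators in Emacs syntax; other escapes are kept.
            out.append(nxt if nxt in "|()" else ch + nxt)
            i += 2
        elif ch in "|()":
            # Bare | ( ) are literal characters in Emacs syntax.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def attribute_is_scanned(tag, attribute, pattern=DEFAULT_HTML_TAG_ATTRIBUTES):
    """Tell whether the value of tag/attribute would be scanned for words."""
    compiled = re.compile(emacs_to_python(pattern), re.IGNORECASE)
    # Assumption: the regexp must match the entire "tag/attribute" string.
    return compiled.fullmatch(f"{tag}/{attribute}") is not None


scanned = attribute_is_scanned("a", "href")
ignored = attribute_is_scanned("a", "title")
```

              With the default value, a/href is scanned and a/title is not.  Because regular expressions
              are case-insensitive, IMG/SRC is treated the same as img/src.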

       mail_headers
              (type regexp, default value from:\|subject:)
              A regular expression determining which headers of an e-mail message are scanned for words.

       alternative_favor_html
              (type boolean, default value true)
              Determines how multipart/alternative messages are treated.  If this parameter is set, and one part
              of the alternative is of type text/html, this part is scanned and all other parts are ignored.  In
              all other cases, all parts of the alternative are scanned.
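
              The rule can be restated as a small selection function.  This Python sketch is illustrative
              only: the function name is hypothetical, and picking the first text/html part when several
              are present is an assumption.

```python
def parts_to_scan(content_types, alternative_favor_html=True):
    """Return the indices of the multipart/alternative parts to scan."""
    if alternative_favor_html:
        for index, content_type in enumerate(content_types):
            # Assumption: the first text/html part wins if there are several.
            if content_type.lower() == "text/html":
                return [index]
    # Parameter unset, or no text/html part: every part is scanned.
    return list(range(len(content_types)))


html_only = parts_to_scan(["text/plain", "text/html"])
everything = parts_to_scan(["text/plain", "text/html"], alternative_favor_html=False)
```

              Here html_only selects just the second part, while everything selects both parts.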

       spam_header
              (type string, default value X-Spam)
              The  name of the header that spamoracle mark adds to incoming e-mail messages, with the results of
              the spam/non-spam classification.

       attachments_header
              (type string, default value X-Attachments)
              The name of the header that spamoracle mark adds to incoming e-mail messages,  with  the  one-line
              summary  of  attachment  types,  names  and  character sets.  The generation of this header can be
              turned off with the summarize_attachment parameter.

       summarize_attachment
              (type boolean, default value true)
              If this parameter is set, spamoracle mark generates a one-line summary of the attachments  of  the
              incoming  messages,  and  inserts  this summary in the message headers.  Setting this parameter to
              false disables the generation of this extra header.

       num_meaningful_words
              (type integer, default value 15)
              Maximum number of "meaningful" words that are retained for computing the spam probability.  During
              mail analysis, spamoracle extracts all  words  of  the  message,  and  retains  those  whose  spam
              frequency  (frequency  of  occurrence  in  spam  messages)  is  closest  to  1  or  to 0.  At most
              num_meaningful_words such "meaningful" words are retained.

       max_repetitions
              (type integer, default value 2)
              Maximum number of times a given word can occur in the  set  of  "meaningful"  words  retained  for
              computing  the  spam  probability.  The default value of 2 means that at most 2 occurrences of the
              same word will be retained.
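
              The combined effect of num_meaningful_words and max_repetitions can be sketched as follows.
              This is not the code of spamoracle(1): the function name is hypothetical, "closest to 1 or
              to 0" is interpreted as "farthest from 0.5", and words absent from the frequency table are
              simply skipped, which are both assumptions.

```python
def select_meaningful(words, spam_freq, num_meaningful_words=15, max_repetitions=2):
    """Return up to num_meaningful_words (word, frequency) pairs."""
    occurrences = {}
    candidates = []
    for word in words:
        if word not in spam_freq:
            # Assumption: words unknown to the database are ignored.
            continue
        seen = occurrences.get(word, 0)
        if seen >= max_repetitions:
            # The same word is retained at most max_repetitions times.
            continue
        occurrences[word] = seen + 1
        candidates.append((word, spam_freq[word]))
    # Most meaningful first: frequencies farthest from 0.5.
    candidates.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return candidates[:num_meaningful_words]


message = ["free", "free", "free", "hello", "meeting"]
frequencies = {"free": 0.99, "hello": 0.5, "meeting": 0.05}
retained = select_meaningful(message, frequencies, num_meaningful_words=3)
```

              In this example the third occurrence of "free" is dropped by max_repetitions, and "hello"
              is dropped because only the three most meaningful entries are kept.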

       low_freq_limit
              (type float, default value 0.01)

       high_freq_limit
              (type float, default value 0.99)
              The spam frequency of a word is computed as the number of occurrences in spam divided by the number of
              occurrences in all messages.  This ratio  is  then  clipped  to  the  interval  [  low_freq_limit,
              high_freq_limit  ],  so that words that are extremely rare or extremely common in spam do not bias
              the probability computation too much.  The default values of 0.01 and  0.99  are  adequate  for  a
              corpus  of  a few thousand e-mails.  For larger corpora (e.g. 10000 e-mails), the values 0.001 and
              0.999 may give better results.
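
              The clipping step amounts to the following computation, shown as a short Python sketch.  The
              function name and the treatment of a word with no recorded occurrences (an error) are
              assumptions made for illustration.

```python
def spam_frequency(spam_count, total_count,
                   low_freq_limit=0.01, high_freq_limit=0.99):
    """Occurrences in spam over occurrences in all messages, clipped."""
    if total_count <= 0:
        # Assumption: a word never seen has no defined frequency.
        raise ValueError("word has no recorded occurrences")
    ratio = spam_count / total_count
    # Clip the ratio to [low_freq_limit, high_freq_limit].
    return min(max(ratio, low_freq_limit), high_freq_limit)


never_in_spam = spam_frequency(0, 50)
always_in_spam = spam_frequency(50, 50)
mixed = spam_frequency(1, 4)
```

              A word seen only in non-spam gets 0.01 rather than 0, and a word seen only in spam gets
              0.99 rather than 1, so no single word can decide the classification on its own.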

       min_meaningful_words
              (type integer, default value 5)
              Minimum number of "meaningful" words below which spamoracle mark refuses to  classify  the  e-mail
              and  outputs  "unknown"  status.   This  happens  with very short e-mails, or e-mails that consist
              exclusively of links and pictures.

       good_mail_prob
              (type float, default value 0.2)
              Spam probability below which the e-mail is classified as non-spam.

       spam_mail_prob
              (type float, default value 0.8)
              Spam probability above which the e-mail is classified as spam.  Messages whose  probability  falls
              between good_mail_prob and spam_mail_prob are classified as "unknown".
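
              Together, min_meaningful_words, good_mail_prob and spam_mail_prob define a three-way
              decision, sketched below in Python.  The function name and the returned labels are
              hypothetical, and treating a probability exactly equal to a threshold as "unknown" is an
              assumption.

```python
def classify(spam_probability, meaningful_word_count,
             min_meaningful_words=5, good_mail_prob=0.2, spam_mail_prob=0.8):
    """Return "spam", "non-spam" or "unknown" for one message."""
    if meaningful_word_count < min_meaningful_words:
        # Too few meaningful words: refuse to classify.
        return "unknown"
    if spam_probability < good_mail_prob:
        return "non-spam"
    if spam_probability > spam_mail_prob:
        return "spam"
    # Between the two thresholds (assumption: bounds included).
    return "unknown"


verdicts = [
    classify(0.97, 15),  # high probability, enough words
    classify(0.03, 15),  # low probability, enough words
    classify(0.50, 15),  # between the thresholds
    classify(0.97, 3),   # too few meaningful words
]
```

              Narrowing the gap between good_mail_prob and spam_mail_prob reduces the number of "unknown"
              results at the cost of more misclassified messages.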

AUTHOR

       Xavier Leroy <Xavier.Leroy@inria.fr>

SEE ALSO

       spamoracle(1)

       http://www.paulgraham.com/spam.html (Paul Graham's seminal paper)

                                                                                              SPAMORACLE.CONF(5)