Provided by: waymore_3.7-1.1_all

NAME

       waymore - Tool to discover extensive data from online archives

SYNOPSIS

          waymore [-h] [-i INPUT] [-n] [-mode {U,R,B}] [-oU OUTPUT_URLS] [-oR OUTPUT_RESPONSES] [-f] [-fc FC] [-mc MC] [-l <signed integer>] [-from <yyyyMMddhhmmss>] [-to <yyyyMMddhhmmss>]
                  [-ci {h,d,m,none}] [-ra REGEX_AFTER] [-url-filename] [-xwm] [-xcc] [-xav] [-xus] [-xvt] [-lcc LCC] [-lcy LCY] [-t <seconds>] [-p <integer>] [-r RETRIES] [-m <integer>]
                  [-ko [KEYWORDS_ONLY]] [-lr LIMIT_REQUESTS] [-ow] [-nlf] [-c CONFIG] [-wrlr WAYBACK_RATE_LIMIT_RETRY] [-urlr URLSCAN_RATE_LIMIT_RETRY] [-co] [-nd] [-v] [--version]

DESCRIPTION

       waymore is a versatile tool designed to extract comprehensive information from various sources, including
       the Wayback Machine, Common Crawl, AlienVault OTX, URLScan, and VirusTotal. Whether you're searching for
       historical web data or analyzing security threats, waymore provides a seamless experience with its
       intuitive interface and extensive features.

OPTIONS

       -h, --help:
              Display  command  usage  and options. Provides quick access to comprehensive assistance, including
              detailed explanations of available options.

       -i INPUT, --input INPUT:
               The target domain (or file of domains) to find links for. This can be a domain only, or a domain
               with a specific path. If it is a domain only (to get everything for that domain), don't prefix it
               with "www."

       -n, --no-subs:
              Don't  include subdomains of the target domain (only used if input is not a domain with a specific
              path).

       -mode {U,R,B}:
              The mode to run: U (retrieve URLs only), R (download Responses only) or B (Both).
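
               For example, to retrieve URLs and also download the archived responses for them (example.com is a
               placeholder target):

                      $ waymore -i example.com -mode B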

       -oU OUTPUT_URLS, --output-urls OUTPUT_URLS:
               The file to save the Links output to, including path if necessary. If this argument is not
               passed, a "results" directory will be created in the path specified by the DEFAULT_OUTPUT_DIR key
               in the config.yml file (typically defaulting to "~/.config/waymore/"). Within that, a directory
               will be created with the target domain (or domain with path) passed with "-i" (or one for each
               line of a file passed with "-i").

       -oR OUTPUT_RESPONSES, --output-responses OUTPUT_RESPONSES:
               The directory to save the response output files to, including path if necessary. If this argument
               is not passed, a "results" directory will be created in the path specified by the
               DEFAULT_OUTPUT_DIR key in the config.yml file (typically defaulting to "~/.config/waymore/").
               Within that, a directory will be created with the target domain (or domain with path) passed with
               "-i" (or one for each line of a file passed with "-i").
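
               For example, to write the URLs and responses to explicit locations instead of the default
               "results" directory (the target and paths are placeholders):

                      $ waymore -i example.com -mode B -oU /tmp/example/waymore.txt -oR /tmp/example/responses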

       -f, --filter-responses-only:
               The initial links from the Wayback Machine will not be filtered (by MIME type and response code);
               only the responses that are downloaded will be filtered. For example, it may be useful to still
               see all available paths from the links even if you don't want to check the content.

       -fc FC:
              Filter HTTP status codes for retrieved URLs and responses. Comma separated list of codes (default:
              the  FILTER_CODE  values  from  config.yml).   Passing  this argument will override the value from
              config.yml

       -mc MC:
              Only Match HTTP status codes for retrieved URLs and responses.  Comma  separated  list  of  codes.
              Passing this argument overrides the config FILTER_CODE and -fc.
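
               For example, to only keep URLs and responses that returned a 200 or 301 status code, or
               alternatively to filter out common redirect codes (example.com is a placeholder target):

                      $ waymore -i example.com -mc 200,301
                      $ waymore -i example.com -fc 301,302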

       -l <signed integer>, --limit <signed integer>:
               How many responses will be saved (if -mode is R or B). A positive value will get the first N
               results, a negative value will get the last N results. A value of 0 will get ALL responses
               (default: 5000).
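
               For example, to save only the LAST 100 responses found (example.com is a placeholder target):

                      $ waymore -i example.com -mode R -l -100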

       -from <yyyyMMddhhmmss>, --from-date <yyyyMMddhhmmss>:
              What  date to get responses from. If not specified it will get from the earliest possible results.
              A partial value can be passed, e.g. 2016, 201805, etc.

       -to <yyyyMMddhhmmss>, --to-date <yyyyMMddhhmmss>:
              What date to get responses to. If not specified it will get to  the  latest  possible  results.  A
              partial value can be passed, e.g. 2016, 201805, etc.
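
               For example, to only get responses for captures from January 2020 up to the end of 2021
               (example.com is a placeholder target):

                      $ waymore -i example.com -mode R -from 202001 -to 2021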

       -ci {h,d,m,none}, --capture-interval {h,d,m,none}:
               Filters the search on the Wayback Machine (archive.org) to get at most 1 capture per hour (h),
               day (d) or month (m). This filter is used for responses only. The default is 'd', but it can also
               be set to 'none' to not filter anything and get all responses.
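
               For example, to save at most one capture per month (example.com is a placeholder target):

                      $ waymore -i example.com -mode R -ci m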

       -ra REGEX_AFTER, --regex-after REGEX_AFTER:
               RegEx for filtering purposes against links found from all sources of URLs AND responses
               downloaded. Only positive matches will be output.
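
               For example, to only keep links and responses whose URL contains "/api/" (example.com is a
               placeholder target and the RegEx is illustrative):

                      $ waymore -i example.com -ra "/api/"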

       -url-filename:
              Set the file name of downloaded responses to the URL that generated  the  response,  otherwise  it
              will  be  set  to  the  hash value of the response.  Using the hash value means multiple URLs that
              generated the same response will only result in one file being saved for that response.

       -xwm:  Exclude checks for links from Wayback Machine (archive.org)

       -xcc:  Exclude checks for links from commoncrawl.org

       -xav:  Exclude checks for links from alienvault.com

       -xus:  Exclude checks for links from urlscan.io

       -xvt:  Exclude checks for links from virustotal.com

       -lcc LCC:
               Limit the number of Common Crawl index collections searched, e.g. '-lcc 10' will just search the
               latest 10 collections (default: 3). As of July 2023 there are 95 collections. Setting to 0 will
               search ALL collections. If you don't want to search Common Crawl at all, use the -xcc option.

       -lcy LCY:
               Limit the number of Common Crawl index collections searched by the year of the index data. The
               earliest index has data from 2008. Setting to 0 (default) will search collections of any year
               (but in conjunction with -lcc). For example, if you are only interested in data from 2015 and
               after, pass -lcy 2015. If you don't want to search Common Crawl at all, use the -xcc option.
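
               For example, to search only Common Crawl collections with data from 2020 onwards, and no more
               than the latest 5 of those (example.com is a placeholder target):

                      $ waymore -i example.com -lcy 2020 -lcc 5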

       -t <seconds>, --timeout <seconds>:
              This is for archived responses only! How many seconds to wait for the server to send  data  before
              giving up (default: 30 seconds)

       -p <integer>, --processes <integer>:
               Basic multithreading is done when downloading responses for the retrieved URLs. This argument
               determines the number of processes (threads) used (default: 1).

       -r RETRIES, --retries RETRIES:
               The number of retries for requests that get a connection error or are rate limited (default: 1).
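
               For example, to download responses using 5 processes, a 10 second timeout and 2 retries
               (example.com is a placeholder target):

                      $ waymore -i example.com -mode R -p 5 -t 10 -r 2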

       -m <integer>, --memory-threshold <integer>:
               The memory threshold percentage. If the machine's memory goes above the threshold, the program
               will be stopped and ended gracefully before running out of memory (default: 95).

       -ko [KEYWORDS_ONLY], --keywords-only [KEYWORDS_ONLY]:
               Only return links and responses that contain keywords that you are interested in. This can reduce
               the time it takes to get results. If you provide the flag with no value, keywords are taken from
               the comma separated list in the "config.yml" file under the "FILTER_KEYWORDS" key; otherwise you
               can pass a specific RegEx value to use, e.g. -ko "admin" to only get links containing the word
               admin, or -ko "\.js(\?|$)" to only get JS files. The RegEx check is NOT case sensitive.
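
               For example, to only get links and responses for JavaScript files (example.com is a placeholder
               target):

                      $ waymore -i example.com -ko "\.js(\?|$)"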

       -lr LIMIT_REQUESTS, --limit-requests LIMIT_REQUESTS:
               Limit the number of requests that will be made when getting links from a source (this doesn't
               apply to Common Crawl). Some targets can require a huge number of requests that are just not
               feasible to make, so this can be used to manage that situation. This defaults to 0 (zero), which
               means there is no limit.
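
               For example, to cap each source at 100 requests (example.com is a placeholder target):

                      $ waymore -i example.com -lr 100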

       -ow, --output-overwrite:
              If the URL output file (default waymore.txt) already exists, it will  be  overwritten  instead  of
              being appended to.

       -nlf, --new-links-file:
              If  this  argument  is  passed,  a  .new file will also be written that will contain links for the
              latest run.

       -c CONFIG, --config CONFIG:
               Path to the YML config file. If not passed, it looks for the file 'config.yml' in the same
               directory as the runtime file 'waymore.py'.
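
               For example, to run with a config file from a non-default location (the target and path are
               placeholders):

                      $ waymore -i example.com -c ~/recon/waymore-config.yml

               A minimal illustrative config.yml sketch, using only keys mentioned on this page (all values are
               placeholders, not documented defaults):

                      # illustrative values only - adjust to your own needs
                      DEFAULT_OUTPUT_DIR: ~/.config/waymore
                      FILTER_CODE: 404,301,302
                      FILTER_KEYWORDS: admin,login,secret
                      WEBHOOK_DISCORD: https://discord.com/api/webhooks/...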

       -wrlr WAYBACK_RATE_LIMIT_RETRY, --wayback-rate-limit-retry WAYBACK_RATE_LIMIT_RETRY:
               The number of minutes the user wants to wait for a rate limit pause on the Wayback Machine
               (archive.org) instead of stopping with a 429 error (default: 3).

       -urlr URLSCAN_RATE_LIMIT_RETRY, --urlscan-rate-limit-retry URLSCAN_RATE_LIMIT_RETRY:
              The number of minutes the user wants to wait for a rate  limit  pause  on  URLScan.io  instead  of
              stopping with a 429 error (default: 1).
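
               For example, to wait up to 10 minutes when rate limited by the Wayback Machine and up to 2
               minutes when rate limited by URLScan (example.com is a placeholder target):

                      $ waymore -i example.com -wrlr 10 -urlr 2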

       -co, --check-only:
               This will make a few minimal requests to show you how many requests will be needed, and roughly
               how long it could take, to get URLs from the sources and download responses from the Wayback
               Machine.

       -nd, --notify-discord:
              Whether  to  send a notification to Discord when waymore completes. It requires WEBHOOK_DISCORD to
              be provided in the config.yml file.

       -v, --verbose:
              Verbose output

       --version:
              Show version number

EXAMPLES

       Common usage:

       • Example 1:

       Just get the URLs from all sources for redbull.com (-mode U  is  just  for  URLs,  so  no  responses  are
       downloaded):

          $ waymore -i redbull.com -mode U

       The   URLs   are   saved   in   the   same   path   as  config.yml  (typically  ~/.config/waymore)  under
       results/redbull.com/waymore.txt

       • Example 2:

        Get ALL the URLs from the Wayback Machine for redbull.com (with -f, no filters are applied to the URLs;
        and no URLs are retrieved from Common Crawl, AlienVault, URLScan or VirusTotal, because -xcc, -xav, -xus
        and -xvt are passed respectively). Save the FIRST 200 responses that are found, starting from 2022
        (-l 200 -from 2022):

          $ waymore -i redbull.com -f -xcc -xav -xus -xvt -l 200 -from 2022

       • Example 3:

       You can pipe waymore to other tools. Any errors are sent to stderr  and  any  links  found  are  sent  to
       stdout.  The  output  file  is  still  created  in addition to the links being piped to the next program.
       However, archived responses are not piped to the next program, but they are still written to  files.  For
       example:

          $ waymore -i redbull.com -mode U | unfurl keys | sort -u

       You can also pass the input through stdin instead of -i:

          $ cat redbull_subs.txt | waymore

       • Example 4:

        Sometimes you may just want to check how many requests waymore will make, and how long it is likely to
        take, if you run it for a particular domain. You can do a quick check by using the -co/--check-only
        argument. For example:

          $ waymore -i redbull.com --check-only

AUTHOR

       Aquila Macedo <aquilamacedo@riseup.net>

COPYRIGHT

       Expat

                                                   2024-03-22                                         WAYMORE(1)