Provided by: waymore_3.7-1.1_all

NAME

       waymore - Tool to discover extensive data from online archives

SYNOPSIS

          waymore [-h] [-i INPUT] [-n] [-mode {U,R,B}] [-oU OUTPUT_URLS] [-oR OUTPUT_RESPONSES] [-f] [-fc FC] [-mc MC] [-l <signed integer>] [-from <yyyyMMddhhmmss>] [-to <yyyyMMddhhmmss>]
                  [-ci {h,d,m,none}] [-ra REGEX_AFTER] [-url-filename] [-xwm] [-xcc] [-xav] [-xus] [-xvt] [-lcc LCC] [-lcy LCY] [-t <seconds>] [-p <integer>] [-r RETRIES] [-m <integer>]
                  [-ko [KEYWORDS_ONLY]] [-lr LIMIT_REQUESTS] [-ow] [-nlf] [-c CONFIG] [-wrlr WAYBACK_RATE_LIMIT_RETRY] [-urlr URLSCAN_RATE_LIMIT_RETRY] [-co] [-nd] [-v] [--version]

DESCRIPTION

       waymore is a versatile tool designed to extract comprehensive information from various sources, including
       the Wayback Machine, Common Crawl, AlienVault OTX, URLScan, and VirusTotal. Whether you're searching for
       historical web data or analyzing security threats, waymore provides a seamless experience with its
       intuitive interface and extensive features.

OPTIONS

       -h, --help:
              Display  command  usage  and options. Provides quick access to comprehensive assistance, including
              detailed explanations of available options.

       -i INPUT, --input INPUT:
               The target domain (or file of domains) to find links for. This can be a domain only, or a domain
               with a specific path. If it is a domain only (to get everything for that domain), don't prefix it
               with "www."

       -n, --no-subs:
              Don't  include subdomains of the target domain (only used if input is not a domain with a specific
              path).

       -mode {U,R,B}:
              The mode to run: U (retrieve URLs only), R (download Responses only) or B (Both).
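
               For example, to retrieve URLs and also download the archived responses for them (example.com is a
               placeholder target):

                      $ waymore -i example.com -mode B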

       -oU OUTPUT_URLS, --output-urls OUTPUT_URLS:
               The file to save the Links output to, including path if necessary. If this argument is not
               passed, a "results" directory will be created in the path specified by the DEFAULT_OUTPUT_DIR key
               in the config.yml file (typically defaulting to "~/.config/waymore/"). Within that, a directory
               will be created with the target domain (or domain with path) passed with "-i" (or one for each
               line of a file passed with "-i").

       -oR OUTPUT_RESPONSES, --output-responses OUTPUT_RESPONSES:
               The directory to save the response output files to, including path if necessary. If this argument
               is not passed, a "results" directory will be created in the path specified by the
               DEFAULT_OUTPUT_DIR key in the config.yml file (typically defaulting to "~/.config/waymore/").
               Within that, a directory will be created with the target domain (or domain with path) passed with
               "-i" (or one for each line of a file passed with "-i").
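
               For example, to write the URLs and responses to explicit locations instead of the default
               "results" directory (the target and paths are placeholders):

                      $ waymore -i example.com -mode B -oU /tmp/example/waymore.txt -oR /tmp/example/responses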

       -f, --filter-responses-only:
               The initial links from the Wayback Machine will not be filtered (by MIME type and response code);
               only the responses that are downloaded will be filtered. For example, it may be useful to still
               see all available paths from the links even if you don't want to check the content.

       -fc FC:
              Filter HTTP status codes for retrieved URLs and responses. Comma separated list of codes (default:
              the  FILTER_CODE  values  from  config.yml).   Passing  this argument will override the value from
              config.yml

       -mc MC:
              Only Match HTTP status codes for retrieved URLs and responses.  Comma  separated  list  of  codes.
              Passing this argument overrides the config FILTER_CODE and -fc.
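
               For example, to only keep URLs and responses that returned a 200 or 301 status code, or
               alternatively to filter out common redirect codes (example.com is a placeholder target):

                      $ waymore -i example.com -mc 200,301
                      $ waymore -i example.com -fc 301,302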

       -l <signed integer>, --limit <signed integer>:
               How many responses will be saved (if -mode is R or B). A positive value will get the first N
               results, a negative value will get the last N results. A value of 0 will get ALL responses
               (default: 5000).
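
               For example, to save only the LAST 100 responses found (example.com is a placeholder target):

                      $ waymore -i example.com -mode R -l -100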

       -from <yyyyMMddhhmmss>, --from-date <yyyyMMddhhmmss>:
              What  date to get responses from. If not specified it will get from the earliest possible results.
              A partial value can be passed, e.g. 2016, 201805, etc.

       -to <yyyyMMddhhmmss>, --to-date <yyyyMMddhhmmss>:
              What date to get responses to. If not specified it will get to  the  latest  possible  results.  A
              partial value can be passed, e.g. 2016, 201805, etc.
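
               For example, to only get responses for captures from January 2020 up to the end of 2021
               (example.com is a placeholder target):

                      $ waymore -i example.com -mode R -from 202001 -to 2021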

       -ci {h,d,m,none}, --capture-interval {h,d,m,none}:
               Filters the search on the Wayback Machine (archive.org) to get at most 1 capture per hour (h),
               day (d) or month (m). This filter is used for responses only. The default is 'd', but it can also
               be set to 'none' to not filter anything and get all responses.
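
               For example, to save at most one capture per month (example.com is a placeholder target):

                      $ waymore -i example.com -mode R -ci m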

       -ra REGEX_AFTER, --regex-after REGEX_AFTER:
               RegEx for filtering purposes against links found from all sources of URLs AND responses
               downloaded. Only positive matches will be output.
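
               For example, to only keep links and responses whose URL contains "/api/" (example.com is a
               placeholder target and the RegEx is illustrative):

                      $ waymore -i example.com -ra "/api/"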

       -url-filename:
              Set the file name of downloaded responses to the URL that generated  the  response,  otherwise  it
              will  be  set  to  the  hash value of the response.  Using the hash value means multiple URLs that
              generated the same response will only result in one file being saved for that response.

       -xwm:  Exclude checks for links from Wayback Machine (archive.org)

       -xcc:  Exclude checks for links from commoncrawl.org

       -xav:  Exclude checks for links from alienvault.com

       -xus:  Exclude checks for links from urlscan.io

       -xvt:  Exclude checks for links from virustotal.com

       -lcc LCC:
               Limit the number of Common Crawl index collections searched, e.g. '-lcc 10' will just search the
               latest 10 collections (default: 3). As of July 2023 there are 95 collections. Setting to 0 will
               search ALL collections. If you don't want to search Common Crawl at all, use the -xcc option.

       -lcy LCY:
               Limit the number of Common Crawl index collections searched by the year of the index data. The
               earliest index has data from 2008. Setting to 0 (default) will search collections of any year
               (but in conjunction with -lcc). For example, if you are only interested in data from 2015 and
               after, pass -lcy 2015. If you don't want to search Common Crawl at all, use the -xcc option.
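
               For example, to search only Common Crawl collections with data from 2020 onwards, and no more
               than the latest 5 of those (example.com is a placeholder target):

                      $ waymore -i example.com -lcy 2020 -lcc 5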

       -t <seconds>, --timeout <seconds>:
              This is for archived responses only! How many seconds to wait for the server to send  data  before
              giving up (default: 30 seconds)

       -p <integer>, --processes <integer>:
               Basic multithreading is done when downloading responses for the retrieved URLs. This argument
               determines the number of processes (threads) used (default: 1).

       -r RETRIES, --retries RETRIES:
               The number of retries for requests that get a connection error or are rate limited (default: 1).
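
               For example, to download responses using 5 processes, a 10 second timeout and 2 retries
               (example.com is a placeholder target):

                      $ waymore -i example.com -mode R -p 5 -t 10 -r 2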

       -m <integer>, --memory-threshold <integer>:
               The memory threshold percentage. If the machine's memory goes above the threshold, the program
               will be stopped and ended gracefully before running out of memory (default: 95).

       -ko [KEYWORDS_ONLY], --keywords-only [KEYWORDS_ONLY]:
               Only return links and responses that contain keywords that you are interested in. This can reduce
               the time it takes to get results. If you provide the flag with no value, keywords are taken from
               the comma separated list in the "config.yml" file under the "FILTER_KEYWORDS" key; otherwise you
               can pass a specific RegEx value to use, e.g. -ko "admin" to only get links containing the word
               admin, or -ko "\.js(\?|$)" to only get JS files. The RegEx check is NOT case sensitive.
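
               For example, to only get links and responses for JavaScript files (example.com is a placeholder
               target):

                      $ waymore -i example.com -ko "\.js(\?|$)"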

       -lr LIMIT_REQUESTS, --limit-requests LIMIT_REQUESTS:
               Limit the number of requests that will be made when getting links from a source (this doesn't
               apply to Common Crawl). Some targets can require a huge number of requests that are just not
               feasible to make, so this can be used to manage that situation. This defaults to 0 (zero), which
               means there is no limit.
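
               For example, to cap each source at 100 requests (example.com is a placeholder target):

                      $ waymore -i example.com -lr 100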

       -ow, --output-overwrite:
              If the URL output file (default waymore.txt) already exists, it will  be  overwritten  instead  of
              being appended to.

       -nlf, --new-links-file:
              If  this  argument  is  passed,  a  .new file will also be written that will contain links for the
              latest run.

       -c CONFIG, --config CONFIG:
               Path to the YML config file. If not passed, it looks for the file 'config.yml' in the same
               directory as the runtime file 'waymore.py'.
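
               For example, to run with a config file from a non-default location (the target and path are
               placeholders):

                      $ waymore -i example.com -c ~/recon/waymore-config.yml

               A minimal illustrative config.yml sketch, using only keys mentioned on this page (all values are
               placeholders, not documented defaults):

                      # illustrative values only - adjust to your own needs
                      DEFAULT_OUTPUT_DIR: ~/.config/waymore
                      FILTER_CODE: 404,301,302
                      FILTER_KEYWORDS: admin,login,secret
                      WEBHOOK_DISCORD: https://discord.com/api/webhooks/...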

       -wrlr WAYBACK_RATE_LIMIT_RETRY, --wayback-rate-limit-retry WAYBACK_RATE_LIMIT_RETRY:
               The number of minutes the user wants to wait for a rate limit pause on the Wayback Machine
               (archive.org) instead of stopping with a 429 error (default: 3).

       -urlr URLSCAN_RATE_LIMIT_RETRY, --urlscan-rate-limit-retry URLSCAN_RATE_LIMIT_RETRY:
              The number of minutes the user wants to wait for a rate  limit  pause  on  URLScan.io  instead  of
              stopping with a 429 error (default: 1).
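
               For example, to wait up to 10 minutes when rate limited by the Wayback Machine and up to 2
               minutes when rate limited by URLScan (example.com is a placeholder target):

                      $ waymore -i example.com -wrlr 10 -urlr 2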

       -co, --check-only:
               This will make a few minimal requests to show you how many requests will be needed, and roughly
               how long it could take, to get URLs from the sources and download responses from the Wayback
               Machine.

       -nd, --notify-discord:
              Whether  to  send a notification to Discord when waymore completes. It requires WEBHOOK_DISCORD to
              be provided in the config.yml file.

       -v, --verbose:
              Verbose output

       --version:
              Show version number

EXAMPLES

       Common usage:

       • Example 1:

       Just get the URLs from all sources for redbull.com (-mode U  is  just  for  URLs,  so  no  responses  are
       downloaded):

          $ waymore -i redbull.com -mode U

       The   URLs   are   saved   in   the   same   path   as  config.yml  (typically  ~/.config/waymore)  under
       results/redbull.com/waymore.txt

       • Example 2:

        Get ALL the URLs from the Wayback Machine for redbull.com (with -f, no filters are applied to the URLs;
        and no URLs are retrieved from Common Crawl, AlienVault, URLScan or VirusTotal, because -xcc, -xav, -xus
        and -xvt are passed respectively). Save the FIRST 200 responses that are found, starting from 2022
        (-l 200 -from 2022):

          $ waymore -i redbull.com -f -xcc -xav -xus -xvt -l 200 -from 2022

       • Example 3:

       You can pipe waymore to other tools. Any errors are sent to stderr  and  any  links  found  are  sent  to
       stdout.  The  output  file  is  still  created  in addition to the links being piped to the next program.
       However, archived responses are not piped to the next program, but they are still written to  files.  For
       example:

          $ waymore -i redbull.com -mode U | unfurl keys | sort -u

       You can also pass the input through stdin instead of -i:

          $ cat redbull_subs.txt | waymore

       • Example 4:

        Sometimes you may just want to check how many requests waymore will make, and how long it is likely to
        take, if you run it for a particular domain. You can do a quick check by using the -co/--check-only
        argument. For example:

          $ waymore -i redbull.com --check-only

AUTHOR

       Aquila Macedo <aquilamacedo@riseup.net>

COPYRIGHT

       Expat

                                                   2024-03-22                                         WAYMORE(1)