Ubuntu Manpage: mlpack_kmeans - k-means clustering

Provided by: mlpack-bin_3.4.2-5ubuntu1_amd64

NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

        mlpack_kmeans -c int -i string [-a string] [-e bool] [-P bool] [-I string] [-E bool] [-l bool] [-m int] [-p double] [-r bool] [-S int] [-s int] [-V bool] [-C string] [-o string] [-h -v]

DESCRIPTION

       This  program  performs  K-Means  clustering  on  the  given  dataset.  It can return the learned cluster
       assignments, and the centroids of the clusters. Empty clusters are not allowed by default; when a cluster
       becomes empty, the point furthest from the centroid of the cluster with maximum variance is taken to fill
       that cluster.

       Optionally, the Bradley and Fayyad approach ("Refining initial points for k-means clustering", 1998)  can
       be  used to select initial points by specifying the '--refined_start (-r)' parameter. This approach works
       by taking random samplings of the dataset; to specify the number of  samplings,  the  '--samplings  (-S)'
       parameter is used, and to specify the fraction of the dataset to be used in each sample, the
       '--percentage (-p)' parameter is used (despite its name, it takes a value between 0.0 and 1.0).
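
       The sampling parameters can be pictured with a minimal Python sketch. It only illustrates how
       '--samplings' and '--percentage' determine the number and size of the random subsets; it is not
       mlpack's implementation. The function name, the rounding rule, and sampling without replacement are
       assumptions made for illustration.

```python
import random


def draw_samplings(points, samplings=100, percentage=0.02, seed=42):
    """Draw `samplings` random subsets, each holding a `percentage` fraction of the points.

    Hypothetical helper: the rounding rule and sampling without replacement
    are assumptions, not taken from mlpack.
    """
    rng = random.Random(seed)
    size = max(1, round(percentage * len(points)))
    return [rng.sample(points, size) for _ in range(samplings)]


# Toy dataset of 1000 two-dimensional points.
points = [(float(i), float(i % 7)) for i in range(1000)]
subsets = draw_samplings(points)

# With the defaults documented on this page (100 samplings, 0.02 of the data),
# each subset of a 1000-point dataset holds 20 points.
print(len(subsets), len(subsets[0]))
```

       In the refined start approach each such subset is then clustered on its own, and the resulting
       centroids are combined to choose the initial points; that step is omitted here.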

       There are several options available for the algorithm used for each Lloyd iteration, specified  with  the
       '--algorithm  (-a)'  option. The standard O(kN) approach can be used ('naive'). Other options include the
       Pelleg-Moore  tree-based  algorithm  ('pelleg-moore'),  Elkan's   triangle-inequality   based   algorithm
       ('elkan'),  Hamerly's  modification  to  Elkan's  algorithm  ('hamerly'), the dual-tree k-means algorithm
       ('dualtree'), and the dual-tree k-means algorithm using the cover tree ('dualtree-covertree').
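
       To give a feel for why the alternatives to 'naive' can be faster, the Python sketch below compares
       the O(kN) assignment step with one that skips centroids using a single triangle-inequality bound.
       This is only the simplest bound of the kind Elkan's algorithm relies on, written for illustration;
       it is not mlpack's code, and the function names are hypothetical.

```python
import math


def assign_naive(points, centroids):
    """Return (labels, distances computed); always k * N point-centroid distances."""
    labels, computed = [], 0
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        computed += len(centroids)
        labels.append(dists.index(min(dists)))
    return labels, computed


def assign_pruned(points, centroids):
    """Same labels, but skip centroid j when d(best, j) >= 2 * d(p, best)."""
    k = len(centroids)
    # Centroid-to-centroid distances (k * k extra work, not counted below).
    between = [[math.dist(a, b) for b in centroids] for a in centroids]
    labels, computed = [], 0
    for p in points:
        best, best_dist = 0, math.dist(p, centroids[0])
        computed += 1
        for j in range(1, k):
            if between[best][j] >= 2 * best_dist:
                continue  # triangle inequality: centroid j cannot be closer
            d = math.dist(p, centroids[j])
            computed += 1
            if d < best_dist:
                best, best_dist = j, d
        labels.append(best)
    return labels, computed


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (20.0, 0.0), (20.0, 1.0)]
centroids = [(0.0, 0.5), (10.0, 0.5), (20.0, 0.5)]

naive_labels, naive_count = assign_naive(points, centroids)
pruned_labels, pruned_count = assign_pruned(points, centroids)
print(naive_labels == pruned_labels, naive_count, pruned_count)
```

       Both functions produce the same assignments; the pruned version simply evaluates fewer
       point-to-centroid distances when the centroids are well separated.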

       The behavior for when an empty cluster is encountered can be modified with the '--allow_empty_clusters
       (-e)'  option.  When  this  option  is specified and there is a cluster owning no points at the end of an
       iteration, that cluster's centroid will simply remain in its position from the previous iteration. If the
       '--kill_empty_clusters (-E)' option is specified, then when a cluster owns no points at  the  end  of  an
       iteration,  the cluster centroid is simply filled with DBL_MAX, killing it and effectively reducing k for
       the rest of the computation. Note that the default option when neither empty cluster option is  specified
       can  be  time-consuming  to  calculate;  therefore,  specifying  either  of  these  parameters will often
       accelerate runtime.
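
       The three empty-cluster behaviors described above can be sketched in Python as follows. This is an
       illustration under stated assumptions, not mlpack's implementation: the function names are
       hypothetical, "variance" is taken to be the mean squared distance to the centroid, and the donor
       cluster's centroid is left to be recomputed by the next iteration.

```python
import math
import sys

DBL_MAX = sys.float_info.max  # Python stand-in for the C constant DBL_MAX


def variance(j, points, labels, centroids):
    """Mean squared distance of cluster j's points to its centroid (assumed definition)."""
    members = [p for p, a in zip(points, labels) if a == j]
    if not members:
        return 0.0
    return sum(math.dist(p, centroids[j]) ** 2 for p in members) / len(members)


def handle_empty(empty, points, labels, centroids, previous, policy="default"):
    """Return (centroids, labels) after applying one empty-cluster policy."""
    centroids = [list(c) for c in centroids]
    labels = list(labels)
    if policy == "allow":
        # --allow_empty_clusters: keep the centroid from the previous iteration.
        centroids[empty] = list(previous[empty])
    elif policy == "kill":
        # --kill_empty_clusters: fill the centroid with DBL_MAX.
        centroids[empty] = [DBL_MAX] * len(centroids[empty])
    else:
        # Default: take the point furthest from the centroid of the
        # cluster with maximum variance and give it to the empty cluster.
        donor = max((j for j in range(len(centroids)) if j != empty),
                    key=lambda j: variance(j, points, labels, centroids))
        far = max((i for i, a in enumerate(labels) if a == donor),
                  key=lambda i: math.dist(points[i], centroids[donor]))
        labels[far] = empty
        centroids[empty] = list(points[far])
    return centroids, labels


points = [(0.0, 0.0), (1.0, 0.0), (8.0, 0.0), (9.0, 0.0), (22.0, 0.0)]
labels = [0, 0, 1, 1, 1]
previous = [[0.5, 0.0], [13.0, 0.0], [50.0, 50.0]]
centroids = [[0.5, 0.0], [13.0, 0.0], [0.0, 0.0]]  # cluster 2 owns no points

for policy in ("default", "allow", "kill"):
    new_centroids, new_labels = handle_empty(2, points, labels, centroids, previous, policy)
    print(policy, new_centroids[2], new_labels)
```

       The default branch scans every cluster and every point of the donor cluster, which is why the two
       explicit options are cheaper.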

       Initial centroids may be specified using the '--initial_centroids_file (-I)' parameter, and
       the maximum number of iterations may be specified with the '--max_iterations (-m)' parameter.
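
       How initial centroids and an iteration cap interact can be sketched in Python as below. The
       function is a hypothetical illustration, not mlpack's code; in particular, stopping as soon as the
       centroids no longer change is an assumption, and the program's actual convergence test may differ.

```python
import math


def kmeans(points, initial_centroids, max_iterations=1000):
    """Run Lloyd iterations from the given centroids; return (centroids, iterations run)."""
    centroids = [list(c) for c in initial_centroids]  # k is the number of initial centroids
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid in this sketch).
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break  # assumed stopping rule: centroids did not move
        centroids = updated
    return centroids, iterations


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
initial = [(1.0, 1.0), (9.0, 1.0)]

print(kmeans(points, initial))                    # stops once the centroids settle
print(kmeans(points, initial, max_iterations=1))  # stops at the iteration cap
```

       Here the number of clusters is taken from the number of initial centroids, which mirrors what the
       '--clusters' option describes for a value of 0.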

       As  an  example,  to  use  Hamerly's  algorithm  to  perform  k-means clustering with k=10 on the dataset
       'data.csv',  saving  the  centroids  to  'centroids.csv'  and  the  assignments   for   each   point   to
       'assignments.csv', the following command could be used:

       $ mlpack_kmeans --input_file data.csv --clusters 10 --algorithm hamerly --output_file assignments.csv
       --centroid_file centroids.csv

       To run k-means on that same dataset with initial centroids specified in 'initial.csv' with a maximum of
       500 iterations, storing the output centroids in 'final.csv', the following command may be used:

       $ mlpack_kmeans --input_file data.csv --initial_centroids_file initial.csv --clusters 10 --max_iterations
       500 --centroid_file final.csv
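
       When such commands are generated by a script, the argument list can be assembled programmatically.
       The Python sketch below builds the second command above from the long option names documented on
       this page. The helper name is hypothetical, and passing boolean options as bare flags is an
       assumption. The sketch only prints the command; it does not run mlpack_kmeans.

```python
import shlex


def kmeans_command(input_file, clusters, **options):
    """Build an argument list for mlpack_kmeans from documented long option names."""
    argv = ["mlpack_kmeans", "--input_file", input_file, "--clusters", str(clusters)]
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")  # assumption: boolean options are bare flags
        elif value is not False and value is not None:
            argv += [f"--{name}", str(value)]
    return argv


cmd = kmeans_command(
    "data.csv",
    10,
    initial_centroids_file="initial.csv",
    max_iterations=500,
    centroid_file="final.csv",
)
print(shlex.join(cmd))
```

       The resulting list could be handed to subprocess.run on a system where mlpack_kmeans is installed.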

REQUIRED INPUT OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan', 'hamerly', 'dualtree',
              or 'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e) [bool]
              Allow empty clusters to persist.

       --help (-h) [bool]
              Default help info.

       --in_place (-P) [bool]
              If  specified,  a  column  containing  the  learned cluster assignments will be added to the input
              dataset file. In this case, --output_file is overridden. (Do not use in Python.)

       --info [string]
              Print help on a specific option. Default value ''.

       --initial_centroids_file (-I) [string]
              Start with the specified initial centroids.

       --kill_empty_clusters (-E) [bool]
              Remove empty clusters when they occur.

       --labels_only (-l) [bool]
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default value 1000.

       --percentage (-p) [double]
              Percentage of dataset to use  for  each  refined  start  sampling  (use  when  --refined_start  is
              specified). Default value 0.02.

       --refined_start (-r) [bool]
              Use the refined initial point strategy by Bradley and Fayyad to choose initial points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when --refined_start is specified).
              Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be written to the given file.

       --output_file (-o) [string]
              Matrix to store output labels or labeled data to.
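
       The difference between labeled data and labels-only output can be pictured with the Python sketch
       below. It is an illustration only: the helper name is hypothetical, and placing one point per line
       with the label as the last column, as well as the number formatting, are assumptions rather than a
       description of the exact file mlpack writes.

```python
import csv
import io


def format_output(points, labels, labels_only=False):
    """Render the output as CSV text: labels alone, or each point followed by its label."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for point, label in zip(points, labels):
        writer.writerow([label] if labels_only else [*point, label])
    return buf.getvalue()


points = [(1.0, 2.0), (3.0, 4.0)]
labels = [0, 1]

print(format_output(points, labels))                    # labeled data
print(format_output(points, labels, labels_only=True))  # as with --labels_only
```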

ADDITIONAL INFORMATION

       For  further  information,  including  relevant  papers, citations, and theory, consult the documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.

mlpack-3.4.2                                      11 April 2022                                 mlpack_kmeans(1)