Daniel Müllner |
fastcluster: Fast hierarchical clustering routines for R and Python
1 Introduction
A common task in unsupervised machine learning and data analysis is clustering. This means a method to partition a discrete metric space into sensible subsets. The exact setup and procedures may vary, but the general idea is to group data points with similar features together. This might reveal some structure in the data set, or one can simplify the data set by dealing with entire clusters instead of many individual data points.
An important subset of clustering techniques are hierarchical clustering schemes. These do not produce a prescribed number of clusters but a so-called dendrogram, which allows the user to decide on a reasonable number of clusters based on the algorithmic output and then partition the data accordingly. This workflow also complies with the paradigm in topological data analysis that often an overview over the data on all scales is helpful to detect features which might be present at different scales.
From a practical point of view, clustering is used by many people in different jobs, ranging from basic research to commerce. Nevertheless, the hierarchical clustering schemes were implemented in a largely sub-optimal way in the standard software, to say the least. Even R, which is the most widely used statistical software, does not use the most efficient algorithms in the several packages that have been made for hierarchical clustering.
The fastcluster package implements the seven common hierarchical clustering schemes efficiently. The package is made with two interfaces to standard software: R and Python, which should cover a big part of the scientific community.
A full User's Manual is available on CRAN.
For a first overview over what performance improvements to expect, look at the graphs in the next section. Note the log-log scale, which compresses improvements by orders of magnitude into a single diagram.
If you use fastcluster for scientific work, please cite it as follows:
Daniel Müllner, fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python, Journal of Statistical Software 53 (2013), no. 9, 1–18, URL http://www.jstatsoft.org/v53/i09/
2 Technical key facts
- The fastcluster package is a C++ library for hierarchical (agglomerative) clustering on data with a dissimilarity index. It efficiently implements the seven most widely used clustering schemes: single, complete, average, weighted, Ward, centroid and median linkage. (The “weighted” distance update scheme (Matlab, SciPy) is also called “mcquitty” in R.)
- The fastcluster library currently has interfaces to two languages: R and Python/SciPy.
- The interfaces are designed as drop-in replacements for the existing routines. Once the fastcluster library is loaded at the beginning of the code, every program that uses hierarchical clustering can benefit immediately and effortlessly from the performance gain.
- The fastcluster package is licensed under the BSD license and published on CRAN and PyPI.
- The following comparison chart shows the asymptotic worst-case time complexity of various implementations in terms of the number of data points N. The input is a matrix of pairwise dissimilarities of size N⋅(N−1)/2.
Method fastcluster R:
agnesR:
flashClustR:
hclustSciPy Matlab
R2010asingle Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) complete Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) average Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) weighted Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) Ward Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) centroid Θ(N³) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) median Θ(N³) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) - Regarding the best-case complexity, only the “fastcluster“ and “flashClust” packages achieve the theoretically optimal bound of Θ(N²). SciPy, “agnes” and “hclust” have a best-case complexity of only Θ(N³). In Matlab Release 2010a, the situation is more complicated: While the clustering step itself has a best-case complexity of Θ(N²), the “Ward”, “centroid” and “median“ methods ensure beforehand that the pairwise distances can be generated from a configuration of N points in Euclidean space. This sanity check involves finding the eigenvalues of a dense (N×N) matrix, which outweighs the actual clustering step. Also, the theoretical performance of single linkage is seriously hampered in Matlab Release 2010a by an unfortunate caching strategy. See the performance tests below on how this affects a variety of data sets.
- All of the above implementations use Θ(N²) memory, except the single linkage routine in the fastcluster package: Apart from the Θ(N²) input data, which is not overwritten, it uses only Θ(N) temporary memory.
- The figure below shows performance comparisons between the various packages. Note that the fastcluster package is not only faster for big data sets but the overhead for small data sets is also smaller. Hence, the improved asymptotic and worst-case complexity does not come with a trade-off for small input sizes here.
Click on the links below to see the comparisons for all seven methods.
← Previous: Median linkage | Single linkage |
Next: Complete linkage → |
The colored bands show maximum and minimum time over a variety of data sets. The average is plotted as a solid line. The synthetic data sets are samples drawn in an iid. manner from various mixtures of Gaussian distributions in Euclidean space of various dimensions. The results were obtained on a PC with an Intel dual-core CPU T7500 with 2.2 GHz clock speed and 4GB of RAM. The operating system was Ubuntu 11.04 64-bit (Ubuntu 10.04 64-bit for Matlab R2010a). R version: 2.13.0, fastcluster version: 1.1.7, flashClust version: 1.01, Python version: 2.7.1, NumPy version 1.5.1, SciPy version: 0.8.0.
3 Download and installation
Installation procedures were tested by the author under 64-bit Ubuntu and Arch Linux. Thanks to Christoph Gohlke for providing precompiled installation files for the Python package under Windows on his web page. Moreover, CRAN hosts binaries of the R library for Windows and OS X. In principle, it should be possible to install the fastcluster package on any system that has a C++ compiler and R respectively Python with NumPy. There are no unusual libraries needed to compile the package, only the STL library, which every C++ compiler should have by default.
Please send me feedback if you accomplish to install the fastcluster package on a certain platform but needed to tweak the configuration! I will update the installation instructions and modify the package if needed (eg. include the right compiler flags for various operating systems).
Source distribution
Get it from here: Download page at CRAN.
Installation for R
Enter the command install.packages("fastcluster") in R, and R will download the package automatically, then install it. That's it!
If this does not work, please consult R's help function by typing ?INSTALL from within R or read the “R installation and administration” manual.
You may need to start R with administrator rights to be able to install packages. There are ways to install R packages without administrator privileges in your user directories. See this help page for example.
Installation for Python
Microsoft Windows
Christoph Gohlke provides installation packages on his web page. This is the recommended way to obtain fastcluster for Windows.
If you want or need to compile the package yourself, Christoph reports that you might need to add the option
extra_compile_args=['/EHsc']
for the MSVC compilers in the file setup.py. (Try it out and run the test scripts. If nantest.py or vectortest.py crashes, add this option and compile again.)
Every other operating system
First, try the easiest way:
easy_install --user -U fastcluster
This will compile and install fastcluster from PyPI if the Python package setuptools is available on your computer.
If this works, you are done, so ignore the rest of this section. Otherwise, fastcluster can still be easily installed with only a few commands as follows:
Make sure that you have Python and NumPy installed. Download the fastcluster package from the R archive here: Download page at CRAN. Open a terminal, go to the directory with the downloaded file and extract the contents of the archive with:
tar -xvf fastcluster_(version).tar.gz
Alternatively, use your favorite archive manager for unpacking, eg. on Windows. This will generate a new directory fastcluster. Switch to its python subdirectory:
cd fastcluster/python/
Now compile the library and install the Python module by:
python setup.py install
You may need to precede this command with sudo or install the package in your home directory, like this:
python setup.py install --user
See the chapter “Installing Python modules” in the Python documentation for further help.
4 Usage
A full User's Manual is available as part of the source distribution: User's Manual on CRAN. This gives a complete reference and also covers fastcluster's memory-saving algorithms for vector data, whereas only basic usage is explained below for a quick start.
R
In R, load the package with the following command:
library('fastcluster')
The package overwrites the function hclust from the “stats” package (in the same way as the flashClust package does). Please remove any references to the flashClust package in your R files to not accidentally overwrite the hclust function with the flashClust version.
The new hclust function has exactly the same calling conventions as the old one. You may just load the package and immediately and effortlessly enjoy the performance improvements. The function is also an improvement to the flashClust function from the “flashClust” package. Just replace every call to flashClust by hclust and expect your code to work as before, only faster. (If you are using flashClust prior to version 1.01, update it! See the change log for flashClust.)
The agnes function from the “cluster“ package does a bit more than hclust. If you do not need the extra functionality, you may also wish to replace agnes by fastcluster's hclust for higher speed.
If you need to access the old function or make sure that the right function is called, specify the package as follows:
stats::hclust(…) fastcluster::hclust(…) flashClust::hclust(…)
WARNING
R and Matlab/SciPy use different conventions for the “Ward”, “centroid” and “median” methods. R assumes that the dissimilarity matrix consists of squared Euclidean distances, while Matlab and SciPy expect non-squared Euclidean distances. The fastcluster package respects these conventions and uses different formulas in the two interfaces.
If you want the same results in both interfaces, then feed R with the entry-wise square of the distance matrix, d^2, for the “Ward”, “centroid” and “median” methods and later take the square root of the height field in the dendrogram. For the “average” and “weighted” alias “mcquitty” methods, you must still take the same distance matrix d as in the Python interface for the same results. The “single” and “complete” methods only depend on the relative order of the distances, hence it does not make a difference whether one operates on the distances or the squared distances.
The code example in the R documentation (enter ?hclust or example(hclust) in R) contains an instance where the squared distance matrix is generated from Euclidean data.
Python
The fastcluster package is imported as usual by
import fastcluster
It provides the following functions:
linkage(X, method='single', metric='euclidean', preserve_input=True) single(X) complete(X) average(X) weighted(X) ward(X) centroid(X) median(X)
The argument X is either a condensed distance matrix or a collection of N observation vectors in D dimensions as an (N×D) array. Apart from the argument preserve_input, the methods have the same input and output as the functions of the same name in the package scipy.cluster.hierarchy. Therefore, I do not duplicate the documentation and refer to the SciPy documentation here and here for further details.
The additional, optional argument preserve_input specifies whether the fastcluster package first copies the distance matrix or writes into the existing array. If you generate the distance matrix only for the clustering step and do not need it afterwards, you may save half the memory by saying preserve_input=False.
Note that the input array X contains unspecified values after this procedure. You may want to write
linkage(X, method="…", preserve_input=False) del X
to make sure that you do not accidentally use the matrix X after it has been used as scratch memory.
The single linkage algorithm does not write to the distance matrix or its copy anyway, so the preserve_input flag has no effect in this case.