Filter functions

Mathematical definition

A filter function is a function on the data set, \(f\:{:}\;X\to \mathbb{R}^k\). The Mapper algorithm supports general, vector-valued functions, while the GUI is restricted to real-valued functions (the case \(k=1\)) for simplicity.

Data structure

The filter values are stored in a floating-point NumPy array of shape \((n,k)\), where \(n\) is the number of points and \(k\) the dimensionality of the filter values.

Filter functions in Python Mapper

A number of one-dimensional filter functions is provided in the module mapper.filters.

The argument data must be a NumPy array of dimension 1 or 2. If it is one-dimensional, it is interpreted as a compressed matrix of pairwise dissimilarities (i.e. the flattened, upper part of a symmetric, quadratic matrix with zeros on the diagonal). Accordingly, this array must have length \(\tbinom n2\) for \(n\) data points. If the array data is two-dimensional of shape \((n,d)\), the rows are interpreted as \(n\) data points in a \(d\)-dimensional vector space, and the pairwise distances are generated from the vector data and the metricpar parameter. See the function scipy.spatial.distance.pdist for possible entries in the metricpar dictionary. For example,

metricpar={'metric': 'euclidean'}

generates Euclidean distances from vector data.

(In principle, there could also be filter functions which require vector data and do not work on a dissimilarity matrix. No such filter function is currently present in the module.)

The argument callback is used by the GUI to display progress report. Some filter functions report from time to time what percentage of the data has been processed. For example,


would just print the percentages.

The return value of the function below is always a NumPy array of shape \((n,)\) with data type float.

mapper.filters.eccentricity(data, exponent=1.0, metricpar={}, callback=None)

The eccentricity filter is the higher the further away a point is from the ‘center’ of the data. Note however that neither an explicit central point is required nor a barycenter in an embedding of the data. Instead, an intrinsic notion of centrality is derived from the pairwise distances.

If exponent is finite:

\[\mathit{eccentricity}(i)=\left(\frac 1N\sum_{j=0}^{N-1} d(x_i,x_j)^{\mathit{exponent}}\right)^{1/\mathit{exponent}}\]

If exponent is numpy.inf:

\[\mathit{eccentricity}(i)=\max_j d(x_i,x_j)\]

This is an equivalent description: Consider the full \((N\times N)\)-matrix of pairwise distances. The eccentricity of the \(i\)-th data point is the Minkowski norm of the \(i\)-th row with the respective exponent.

mapper.filters.Gauss_density(data, sigma, metricpar={}, callback=None)

Kernel density estimator with a multivariate, radially symmetric Gaussian kernel.

For vector data and \(x\in\mathbb{R}^d\):

\[\mathit{Gauss\_density}(x) = \frac{1}{N(\sqrt{2\pi}\sigma)^d} \sum_{j=0}^{N-1}\exp\left(-\frac{\|x-x_j\|^2}{2\sigma^2}\right)\]

The density estimator is normalized to a probability measure, i.e. the integral of this function over \(\mathbb{R}^d\) is 1. The \(i\)-th filter value is the density estimator evaluated at \(x_i\).

For dissimilarity data:

\[\mathit{Gauss\_density}(i) = \sum_{j=0}^{N-1}\exp\left(-\frac{d(x_i,x_j)^2}{2\sigma^2}\right)\]

In this case, the density estimator is not normalized since there is no domain to integrate over.

mapper.filters.kNN_distance(data, k, metricpar={}, callback=None)

The distance to the \(k\)-th nearest neighbor as an (inverse) measure of density.

Note how the number of nearest neighbors is understood: \(k=1\), the first neighbor, makes no sense for a filter function since the first nearest neighbor of a data point is always the point itself, and hence this filter function is constantly zero. The parameter \(k=2\) measures the distance from \(x_i\) to the nearest data point other than \(x_i\) itself.

mapper.filters.distance_to_measure(data, k, metricpar={}, callback=None)
\[\mathit{distance\_to\_measure}(x) = \sqrt{\frac 1k\sum^k_{j=1}d(x,\nu_j(x))^2},\]

where \(\nu_1(x),\ldots,\nu_k(x)\) are the \(k\) nearest neighbors of \(x\) in the data set. Again, the first nearest neighbor is \(x\) itself with distance 0.

Reference: [R4].

mapper.filters.graph_Laplacian(data, eps, n=1, k=1, weighted_edges=False, sigma_eps=1.0, normalized=True, metricpar={}, verbose=True, callback=None)

Graph Laplacian of the neighborhood graph.

  • First, if k is 1, form the eps-neighborhood graph of the data set: vertices are the data points; two points are connected if their distance is at most eps.

  • Alternatively, if k is greater than 1, form the neighborhood graph from the \(k\)-th nearest neighbors of each point. Each point counts as its first nearest neighbor, so feasible values start with \(k=2\).

  • If weighted_edges is False, each edge gets weight 1. Otherwise, each edge is weighted with


    where \(\sigma=\mathtt{eps}\cdot\mathtt{sigma\_eps}\) and \(d\) is the distance between the two points.

  • Form the graph Laplacian. The graph Laplacian is a self-adjoint operator on the real vector space spanned by the vertices and can thus be described by a symmetric matrix \(L\):

    If normalized is false, \(L\) is closely related to the adjacency matrix of the graph: it has entries \(-w(i,j)\) whenever nodes \(i\) and \(j\) are connected by an edge of weight \(w(i,j)\) and zero if there is no edge. The \(i\)-th diagonal entry holds the degree \(\deg(i)\) of the corresponding vertex, so that row and column sums are zero.

    If normalized is true, each row \(i\) of \(L\) is additionally scaled by \(1/\sqrt{\deg(i)}\), and so is each column. This destroys the zero row and column sums but preserves symmetry.

  • Return the \(n\)-th eigenvector of the graph Laplacian. The index is 0-based: the 0-th eigenvector is constant on all vertices, corresponding to the eigenvalue 0. \(n=1\) returns the Fiedler vector, which is the second smallest eigenvector after 0.

The normalized variant seems to give consistently better results, so this is always chosen in the GUI. However, this experience is based on few examples only, so please do not hesitate to also try the non-normalized version if there is a reason for it.

Reference: [R9]; see especially Section 6.3 for normalization.

mapper.filters.dm_eigenvector(data, k=0, mean_center=True, metricpar={}, verbose=True, callback=None)

Return the \(k\)-th eigenvector of the distance matrix.

The matrix of pairwise distances is symmetric, so it has an orthonormal basis of eigenvectors. The parameter \(k\) can be either an integer or an array of integers (for multi-dimensional filter functions). The index is zero-based, and eigenvalues are sorted by absolute value, so \(k=0\) returns the eigenvector corresponding to the largest eigenvalue in magnitude.

If mean_center is True, the distance matrix is double-mean-centered before the eigenvalue decomposition.

Reference: [R6], subsection “Principal metric SVD filters”.

mapper.filters.zero_filter(data, **kwargs)

Return an array of the correct size filled with zeros.