
Method of non–hierarchical partitioning

Consider a set of m elements ω ∈ Ω. Recall that a partition is a set of mutually disjoint subsets Ω1,…,ΩK whose union is the initial set Ω. Let the elements ωi be represented as vectors xi = (xi1,…,xin) in some n-dimensional space. We call them objects in the n-dimensional space of features (measured characteristics). The objects form the rows of the data matrix X(m,n), and their partitioning is called clustering (classification).

In the cluster analysis of features, the elements are features in the n-dimensional space of objects (acts of measurement). Obviously, features form the columns of the same matrix X(m,n). Here it is convenient to consider the features as rows of another matrix X(m,n), obtained by transposing the initial matrix and re-denoting its dimensions. Partitioning of features is usually called grouping.

The similarity of a pair of objects is usually estimated by the Euclidean distance between them, and the similarity of a pair of features by their mutual correlation. Let D(m,m) be the matrix of pairwise similarities between the elements ω in the corresponding space.
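
As a concrete illustration, here is a minimal sketch of both similarity matrices, assuming the data are held in a NumPy array X of shape (m, n); the function names are illustrative and not taken from the text.

```python
import numpy as np

def object_distances(X):
    """Pairwise Euclidean distances between objects (rows of X)."""
    diff = X[:, None, :] - X[None, :, :]          # shape (m, m, n)
    return np.sqrt((diff ** 2).sum(axis=2))       # distance matrix D

def feature_correlations(X):
    """Pairwise correlations between features (columns of X)."""
    return np.corrcoef(X, rowvar=False)           # correlation matrix D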

Suppose some partitioning of the set Ω has been made. For each subset Ωi it is usually possible to point out some element vi that is, in some sense, its most typical one. We call such an element a representative. In the general case the representative vi need not coincide with any element of Ωi. For example, in the clustering problem the representative is usually the mean vector zi of the sub-matrix X(mi,n), while in some grouping problems (extreme grouping) the representative is the first principal component of the sub-matrix D(mi,mi). In the latter case the components of the representative vector are calculated in a slightly more complicated way.

  • Calculation of the group representative. Let ai = (ais, s = 1,…,mi) be the first principal component and λi the maximal eigenvalue of the correlation matrix D(mi,mi) of the feature group Ωi. Then the vector (λi)^1/2 ai gives the correlations d(fi,xj) of the group factor fi = (fi1,…,fin) with its own features xj = (xj1,…,xjn). The correlations of the factor fi with features from the other groups Ω1,…,Ωi−1, Ωi+1,…,ΩK are defined as d(fi,xj) = (1/n) fi xj^T = (λi)^−1/2 ai dj^T, where dj = (djs, s = 1,…,mi) is the part of the j-th row of the matrix D(m,m); that row corresponds to the feature xj ∉ Ωi, and its elements correspond to the features of the group Ωi. Denoting the correlations of the factor fi with all features by the vector gi = (gi1,…,gim), we obtain the components of the representative as its values on the factor axis, zi = fi = gi X.
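
A minimal sketch of this computation, assuming the features of the group Ωi are stored as rows of a standardized array X_group of shape (m_i, n) with correlation matrix D_group; the function names are illustrative, not the author's.

```python
import numpy as np

def group_representative(X_group, D_group):
    """Group factor (representative) of a feature group via its first principal component.

    Returns the factor values f on the n objects and the loadings sqrt(lam)*a,
    i.e. the correlations d(f, x_j) of the factor with the group's own features.
    """
    eigvals, eigvecs = np.linalg.eigh(D_group)   # eigenvalues in ascending order
    lam = eigvals[-1]                            # maximal eigenvalue
    a = eigvecs[:, -1]                           # first principal component
    f = (a @ X_group) / np.sqrt(lam)             # factor values on the factor axis
    loadings = np.sqrt(lam) * a                  # correlations with own features
    return f, loadings

def factor_correlations(f, X_all):
    """Correlations d(f, x_j) of the group factor with all standardized features."""
    n = X_all.shape[1]
    return (X_all @ f) / n
```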

Let us call the representatives the "centers" of the corresponding sets. Obviously, in the general case the meaning of the center depends on the content of the problem.

We call a partition unbiased if all its representatives coincide with their own centers.

The arrangement of the elements ω in a space is usually estimated by some characteristic of arrangement density (concentration). For example, in cluster analysis the density of a partition is described by the dispersion of the clusters Ωi. A partition in extreme grouping is described, for example, by the sum of squared correlations (or of absolute correlations) of each group's representative with its own features.
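
To make the two density characteristics concrete, here is a small sketch; labels is assumed to be an integer array of cluster assignments, loadings_per_group a list of correlation vectors of each group's representative with its own features, and both function names are illustrative.

```python
import numpy as np

def average_cluster_dispersion(X, labels):
    """Mean within-cluster dispersion: average squared distance to the cluster mean."""
    return float(np.mean([((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                          .sum(axis=1).mean() for k in np.unique(labels)]))

def extreme_grouping_quality(loadings_per_group):
    """Sum over groups of squared correlations of the representative with its own features."""
    return sum(float((np.asarray(g) ** 2).sum()) for g in loadings_per_group)
```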

Now we can describe the base algorithm for splitting a set Ω into an a priori defined number K of subsets Ωi.

Algorithm 1. Let K representatives be defined, for example, as the K most dissimilar elements. At the current step s, the splitting is produced by scanning all elements ω and assigning each of them to the set Ωi whose representative is most similar to it. If the resulting partition turns out to be biased, take the centers of the sets Ωi as their new representatives and go to the next step s = s + 1. The algorithm stops when the partition is unbiased.
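
For the clustering case (Euclidean distance, mean vector as center) Algorithm 1 reduces to the familiar K-means iteration; a minimal sketch under those assumptions is given below, with the initial representatives passed in explicitly and the function name chosen for illustration.

```python
import numpy as np

def algorithm1(X, reps, max_iter=100):
    """Base partitioning: assign each object to its most similar representative,
    then move each representative to its cluster center; stop when unbiased."""
    X = np.asarray(X, dtype=float)
    reps = np.asarray(reps, dtype=float)
    for _ in range(max_iter):
        # distance of every object to every representative
        d = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else reps[k]
                            for k in range(len(reps))])
        if np.allclose(centers, reps):        # partition is unbiased
            break
        reps = centers
    return labels, reps
```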

It is easy to see that in this way we immediately obtain the well-known base algorithms of clustering (K-means) and of extreme grouping. In the same manner we can obtain other well-known forms of these algorithms.

Note that for K = 2 the representatives in the base algorithm are defined in the simplest way, as the pair of elements with the extreme value in the similarity matrix D (the maximum for distances, the minimum for the absolute value of correlations). Let us now build a non-hierarchical divisive partitioning algorithm.

Algorithm 2. At each step, starting from K = 1, let the pair of most dissimilar objects of the least compact subset Ωk be its representatives. Split Ωk by Algorithm 1 into two subsets, denoted Ωk and ΩK+1. For the K+1 subsets of the new partition, define the representatives as the K−1 previous representatives v1,…,vk−1, vk+1,…,vK and the two new ones vk and vK+1. Split the set Ω by Algorithm 1 into K+1 subsets again and go to the next step s = s + 1. The algorithm stops at K = m.
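
A sketch of Algorithm 2 for the clustering case, reusing the algorithm1 sketch above; the use of within-cluster dispersion as the compactness measure and of the two most distant objects as new representatives follows the text, while the function name and the K_max parameter are assumptions.

```python
import numpy as np

def algorithm2(X, K_max=None):
    """Divisive non-hierarchical partitioning: repeatedly split the least compact
    cluster and re-partition the whole set with Algorithm 1."""
    X = np.asarray(X, dtype=float)
    m = len(X)
    K_max = m if K_max is None else K_max
    labels = np.zeros(m, dtype=int)
    reps = [X.mean(axis=0)]
    partitions = [(labels.copy(), np.array(reps))]
    while len(reps) < K_max:
        # least compact cluster = largest within-cluster dispersion
        disps = [((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum(axis=1).mean()
                 if np.any(labels == k) else 0.0 for k in range(len(reps))]
        k = int(np.argmax(disps))
        idx = np.where(labels == k)[0]
        if len(idx) < 2:
            break
        # pair of most dissimilar objects inside cluster k ...
        d = np.linalg.norm(X[idx][:, None, :] - X[idx][None, :, :], axis=2)
        i, j = np.unravel_index(d.argmax(), d.shape)
        # ... split that cluster with Algorithm 1, keep the two new representatives ...
        _, sub_reps = algorithm1(X[idx], X[idx][[i, j]])
        reps[k] = sub_reps[0]
        reps.append(sub_reps[1])
        # ... and re-partition the whole set with K+1 representatives
        labels, new_reps = algorithm1(X, np.array(reps))
        reps = list(new_reps)
        partitions.append((labels.copy(), np.array(reps)))
    return partitions
```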

This algorithm builds a full sequence of partitions, from the single set Ω1 = Ω down to the single-element sets Ω1,…,Ωm. In the general case not all partitions of this sequence are hierarchically nested. Obviously, two consecutive partitions into K and K+1 subsets are hierarchically nested if splitting the least compact subset Ωk into the two subsets Ωk and ΩK+1 immediately gives an unbiased partition Ω1,…,ΩK+1. We call the preceding partition into K subsets stable.

Consider the elements ω as objects. Obviously, a graph of the shortest non-closed path (a minimal spanning tree) is defined on the set Ω. If the elements ω are treated as features, then a graph of the maximal non-closed correlation path is defined on the set Ω. In both cases we say that the non-closed graph of maximal similarity is defined on the set Ω.
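
For reference, a minimal Prim-style sketch of the minimal spanning tree (the non-closed graph of maximal similarity for objects), taking the pairwise distance matrix D as input; the function name is illustrative.

```python
import numpy as np

def minimal_spanning_tree(D):
    """Edges (i, j, d_ij) of the minimal spanning tree of the complete graph
    given by the symmetric distance matrix D."""
    m = len(D)
    in_tree = [0]
    edges = []
    while len(in_tree) < m:
        best = None
        for i in in_tree:                     # scan the cut between tree and non-tree nodes
            for j in range(m):
                if j not in in_tree and (best is None or D[i, j] < best[2]):
                    best = (i, j, D[i, j])
        edges.append(best)
        in_tree.append(best[1])
    return edges
```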

Theorem. Let the tree of partitions be built by the non-hierarchical divisive partitioning algorithm. Then this tree is equivalent to the non-closed graph of maximal similarity.

  • Proof. Since the partitioning is a tree, it is enough at the current step K to split only the set Ωk into two subsets Ω′k and Ω″k and denote them Ωk and ΩK+1. Since such a splitting immediately gives an unbiased partition Ω1,…,ΩK+1, it does not matter which subset is taken as the current one for splitting. Choosing the set Ωk in all possible ways, we obtain a set of sequences of partitions. All of them turn out to be trees. Such trees are commonly called dendrograms, and they trace the process of their own construction. Without this tracing, all of them are equivalent to a single hierarchical tree. Let us define the similarity of the sets Ωp and Ωq as D(Ωp,Ωq) = max dij for ωi ∈ Ωp, ωj ∈ Ωq. At step K let the set Ωk be chosen as the set whose splitting creates the subsets Ω′k and Ω″k with minimal similarity, Ωk = argmin D(Ω′l,Ω″l), l = 1,…,K. Let the divisive tree of partitions be built in this way. In the reverse direction, let the agglomerative tree of partitions be built by joining the subsets with maximal similarity. We obtain the same tree. But this means that the minimal spanning tree is built: the edges of this graph connect the nodes (elements ω) and define the similarity D(Ωp,Ωq) of the joined sets Ωp and Ωq. Hence, the tree of partitions is equivalent to the non-closed graph of maximal similarity.

Corollary. If the tree of partitions is built, then the non-hierarchical divisive partitioning algorithm is equivalent to the "nearest neighbor" algorithm in cluster analysis and to the "correlation pleiades" algorithm in the grouping problem. In the general case this algorithm is equivalent to some set of hierarchical divisive algorithms, each starting from a single cluster (group), forming one of the subsequences of hierarchically nested partitions, and finishing with single-element clusters (groups).

Selection of optimal partitioning

The considered method of non-hierarchical partitioning combines two essentially different approaches to clustering. On the one hand, well-known algorithms are used; on the other hand, the result is as easy to interpret as a dendrogram. Here we try to use directly the intuitive assumption about the "stability" of a good-quality partition.

Obviously, some stable partition from one of the subsequences of hierarchically nested partitions should be taken as the optimal one.

It is also obvious that the general sequence can be limited. We can discard partitions with weakly filled clusters (groups), starting from some K close to the sample size m. We can also discard partitions with loose clusters for small values of K, for example those partitions in which the average dispersion of the clusters exceeds the dispersion of their centers. The remaining partitions will then consist of compact clusters corresponding to the generally accepted heuristic definition of a cluster.
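
A sketch of this filtering rule, under the assumption that a partition is kept only if its clusters are sufficiently filled and the average within-cluster dispersion does not exceed the dispersion of the cluster centers; the function name and the min_size threshold are illustrative.

```python
import numpy as np

def keep_partition(X, labels, min_size=2):
    """Heuristic filter: reject partitions with weakly filled or loose clusters."""
    ks = np.unique(labels)
    if any((labels == k).sum() < min_size for k in ks):
        return False                                  # weakly filled cluster (group)
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])
    within = np.mean([((X[labels == k] - c) ** 2).sum(axis=1).mean()
                      for k, c in zip(ks, centers)])  # average cluster dispersion
    between = ((centers - centers.mean(axis=0)) ** 2).sum(axis=1).mean()
    return within <= between                          # compactness criterion
```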

At the same time, using only a single type of compact cluster or group representation undoubtedly narrows the field of applicability of this method. Nevertheless, the proposed algorithm is quite simple and fast for obtaining at least initial information about the structure of the data under investigation.

It is not hard to build an agglomerative version of the non-hierarchical partitioning algorithm; its behavior is sufficiently evident.

Below we will show the use of the non-hierarchical partitioning method for another important problem, the restoration of missing data.
