Method of non–hierarchical partitioning

Let us see some set of m elements wÎW. It is known a partition is a set of mutually non–intersected subsets W1,...WK, being united give initial set W. Let elements wi be presented as vectors xi = (xi1,…xin) in some n–dimensional space. We call them as objects in the n–dimensional space of features (measured characteristics). Objects form rows of data matrix X(m,n), and their partitioning is called as clustering (classification).

In the cluster–analysis of features such an elements are features in the n–dimensional space of objects (acts of measuring). It is obvious features form columns in the same matrix X(m,n). Here it is convenient for us to consider features form rows in another matrix X(m,n) after initial matrix have been transposed, and their dimensionalities have been re-denoted. Feature partitioning is usually called as grouping.

Similarity of a couple of objects is usually estimated by Euclidean distance between them, and similarity of a couple of features is usually estimated by mutual correlation. Let matrix D(m,m) be the matrix of paired similarity between elements w in corresponding space.

Let some partitioning of the set W have been done. It is usually possible for each subset Wi to point some element vi as its most typical one in some sense. Let’s call such an element as a representative. Representative vi can be non-coinciding with all elements from Wi in a common case. For example, in clustering problem the representative is usually the mean vector zi for sub-matrix X(mi,n), and in some grouping problems (extreme grouping) the representative is the first principal component of sub–matrix D(mi,mi). In this case components of the representative vector are calculated slightly more complicated.

  • Calculating of the group representative. Let ai = (ais, s=1,…mi) be the first principal component, and li be the maximal eigenvalue for correlation matrix D(mi,mi) of the feature group Wi. Then vector (li)1/2ai defines correlations d(fi,xj) of group factor fi = (fi1,…fin) with its features xj = (xj1,…xjn). Factor fi correlations with features from another groups W1,...Wi–1, Wi+1,…WK are defined as d(fi,xj) = (1/n)fixjT = (li)-1/2aidjT, where dj=(djs, s = 1,…mi) is a part of jth row from matrix D(m,m). That row corresponds to feature xjÏWi, and their elements correspond to features from group Wi. After denoting factor fi correlations with all features as vector gi = (gi1,…gim) we get vector components of the representative as values zi = fi = giX on the factor axis.

Let’s say representatives be “centers” of corresponding sets. It is obvious the sense of the center depends on the contents of problem in a common case.

Let’s call some partition as unbiased one, if all its representatives coincide with own centers.

Elements w arrangement in some space is usually estimated by some characteristic of arrangement density (concentration). For example, partition density is described by dispersion of clusters Wi in cluster–analysis. Partition of extreme grouping is described, for example, by sum of correlation squares (or absolute values) of each group Wi representative with its own features.

Now we can describe base algorithm to split some set W to a priori defined number K of subsets Wi.

Algorithm 1. Let K representatives be defined, for example, as K most dissimilar elements. Let for current step s splitting be produced by scanning all elements w to mark each of them as from that set Wi, if corresponding representative is most similar to it. If partition appears to be biased, let centers of sets Wi be their representatives. Let’s go to the next step s=s+1. This algorithm stops for unbiased partition.

It is easy to see we can get right now well–known base clustering (K–means), and extreme grouping algorithms. By using the same way we can get another well-known forms of these algorithms.

Let’s note for K=2 in base algorithm representatives are defined by simplest way as a pair of elements for extreme value from matrix D of similarity (maximum for distances, and minimum for absolute value of correlations). Let’s build non–hierarchical divisive partitioning algorithm.

Algorithm 2. Let for each step from K=1 a pair of most dissimilar objects from the least compact subset Wk be its representatives. Let Wk be split by Algorithm 1 on two subsets denoted as Wk, and WK+1. For K+1 subsets from new partitioning let representatives be defined as K-1 previous representatives v1,…vk-1, vk+1,…vK, and two new ones vk, and vK+1. Let the set W be split by Algorithm 1 on K+1 subsets again. Let’s go to the next step s=s+1. This algorithm stops for K = m.

This algorithm builds full sequence of partitions from the single set W1=W to single-element sets W1,...Wm. Not all partitions from this sequence are hierarchically nested in a common case. It is obvious two subsequent partitions on K, and K+1 subsets are hierarchically nested, if partitioning of the least compact subset Wk on two ones Wk, and WK+1 gives straight away unbiased partition W1,...WK+1. Let’s call previous partition on K subsets as stable one.

Let’s see elements w as objects. It is obvious a graph of shortest non-closed path (minimal spanning tree) is defined on the set W. If elements w are called as features then a graph of maximal non-closed correlation path is defined on the set W. Let’s say the non-closed graph of maximal similarity is defined on the set W.

Theorem. Let tree of partitions be built by non–hierarchical divisive partitioning algorithm. Then this tree is equivalent to non-closed graph of maximal similarity.

  • Proof. As since partitioning is a tree then it is enough for current step K to split only the set Wk on two subsets k, and k, and denote them as Wk, and WK+1. As since such a partitioning gives straight away unbiased partition W1,...WK+1, then it does not matter what a subset needs to be taken for splitting as a current one. Have got the set Wk by all possible ways, we get some set of partitions sequences. All of them appear to be trees. Such a trees are specially called as dendrograms, and trace their building process. Without tracing all of them are equivalent to single hierarchical tree. Let us define similarity of sets Wp, and Wq as D(Wp,Wq) = max dij for wiÎWp, wjÎWq. Let for step K the set Wk be pointed as the set creating subsets k, and k with minimal similarity Wk =argmin D(W¢l,l), l=1,…K after splitting. Let’s build by this way divisive tree of partitions. In reversed way let’s build agglomerative tree of partitions by joining subsets with maximal similarity. We get the same tree. But this signify the minimal spanning tree is built. Edges of such a graph connect nodes (elements w), and define similarity D(Wp,Wq) of joined sets Wp, and Wq. Hence, tree of partitions is equivalent to non-closed graph of maximal similarity.

Corollary. If tree of partitions is built, then non–hierarchical divisive partitioning algorithm is equivalent to the “nearest neighbor” algorithm in cluster–analysis, and is equivalent to the “correlation pleiades” algorithm in grouping problem. In a common case this algorithm is equivalent to some set of hierarchical divisive algorithms, starting from single cluster (group), forming one of subsequences of hierarchically nested partitions, and finishing with single–element clusters (groups).

Selection of optimal partitioning

The considered method of non–hierarchical partitioning combines two essentially different approaches to clustering. From one side, well-known algorithms are used, and from other side, the result is evident like the dendrogram. Here we try to directly use intuitive assumption about “stability” of a good-quality partitioning.

It is obvious, some stable partition needs to be used as the optimal one from one of subsequences of hierarchically nested partitions.

It is obvious too, the general sequence can be limited. We can throw off partitions with weakly filled clusters (groups) starting from some K near of sample size m. We also can throw off partitions with friable clusters for small value of K. For example, for these partitions the average dispersion of clusters exceeds the dispersion of their centers. Then remained partitions will represent compact clusters corresponding to generally accepted heuristic cluster definition.

At the same time using only single type of compact cluster or group representing undoubtedly constricts a field of applicability of this method. But proposed algorithm appears to be quite simple, and fast to get at least initial information about structure of data under investigation.

It isn’t so hard to build agglomerative version of non–hierarchical partitioning algorithm. Its behavior is sufficiently evident.

We will show below using of non–hierarchical partitioning method for another important problem of restoring of missing data.

 part 1 (RUS/ENG)

-- part 2 (RUS/ENG)

 part 3 (RUS/ENG)