About classification and cluster-analysis

About classification and cluster–analysis problems

Let observations be carried out on objects wОW from a general population W. Let objects appear randomly and independently each from other and be from some of m classes W_k, k=1,…m. It needs to define for each currently appeared object w_i, i=1, 2,… its proper class from the set W_k, k=1,…m.

We need to note the number N of objects is not limited in principle N®Ґ. So, as a solution we need to define some method to classify each new object independently from others. This method is in using so called a decision rule.

It is convenient for us to distinguish between terms “classification” and “clustering” in the sense that classification is used for unlimited number of objects, and clustering is used for limited ones to be processed together. We suppose this is not in the great contradiction with the common sense of these terms.

The decision rule is a function g(w) defined on the set of classes {W₁,…W_m}, where g(w_i)=W_k for w_iОW_k. Let object w_i be represented as a vector x_i=(x_i₁,…x_in) in the n–dimensional feature space. Then the decision function g(w) can be available for us in the form of function g(x), and we want the condition g(x_i)=W_k for x_iОW_k be true.

The most common approach to solve the classification problem is using a probabilistic decision rule. It is supposed an m–modal probability density function is given in the n–dimensional space, where local maximums of density function are local condensations (clusters) of objects. Areas W₁,…W_m of local concentrations cover all the space volume, but mutually intersect themselves in a common case. So, they are not “classes” as we usually understand. But classification is created by the decision rule g(x) as non–intersected areas O_kНW_k, k=1,…m, where g(x_i)=O_k for x_iОO_k, covering all the space volume too.

It is often difficult to build the probabilistic decision rule for it needs to know all about suitable probability distributions. If probability measures are supposed to be self–concentrated in the limited areas of the feature space, then the deterministic decision rule is available. In this case it is supposed areas W₁,…W_m don’t intersect and break n–dimensional space to m classes. It needs to know only a mutual arrangement of areas W₁,…W_m in the feature space to build the deterministic decision rule.

Applied methods to build decision rules are developed for example in the pattern recognition theory.

A collection of observations is usually represented as a data matrix X(N, n) with N rows and n columns. The fixed number N of observations give us opportunity to classify them by enumeration of classes instead of using a decision rule.

Classification methods by enumeration of object classes solve cluster–analysis problem. Its peculiarity is in the simultaneous processing all of N observations. Hence, cluster–analysis methods are based on studying of characteristics of a mutual arrangement of classes as a local concentrations of points in the feature space.

But without an explicitly defined decision rule it needs to solve the cluster–analysis problem for all N observations again after new observation have been added to initial data matrix. At that the classification of the previously observed objects can be changed. This fact brings us to the problem of the cluster decision stability. On the contrary, the previous classification by the decision rule can’t be changed. So, it doesn’t need to process the previously observed objects by the decision rule again – we can simply forget them.

By using the decision rule some algorithm of classification simply assigns an object to the class which number is pointed by the decision rule. But it needs to develop a special algorithm to classify objects by enumeration of object classes. It is obvious there can be many such an algorithms by using different characteristics of a mutual arrangement of classes, and different ways to define enumeration. Therefore, variety of clustering algorithms results in different classifications for the same data in a common case.

Clustering algorithms are usually quite simple ones. This is very attractive for practice, but results can be insufficiently statistically proved.

From the other side the essentially non–statistical data interpretation and large variety of cluster definitions can be available in the cluster–analysis. This gives researchers the opportunity to get satisfying results in real cases, where assumption about non–statistical nature of data is often used.

It needs to note we haven’t difficulties to use the previously built decision rule as initial result for classification by enumeration, i.e. for clustering. And on the contrary, have supposed the statistical model of data, we can use clustering result as training data to form the optimal decision rule.

As a whole we can opportunity to create different clustering algorithms based on some set of initial ones. This way appears to be very useful for some cases. Below we will show examples of such an algorithms.

About cluster or group number selecting

It is very important to select number of clusters. Clustering is supposed to uncover natural local concentrations of objects. So, cluster number K is a parameter which makes algorithms essentially more complicated if it is unknown, and strongly influences to the result quality if it is fixed.

The problem of cluster number selecting is highly non–trivial. Suffice it to say to get satisfactory theoretical decision it is often needs to set quite strong suppositions about some given distribution family. But what suppositions we can talk about for unknown data especially at the beginning of the research? Hence, clustering algorithms are usually built as some way of cluster numbers inspection to define an optimal one from them.

Two essentially different ways of inspection are well known today: hierarchical agglomerative and divisive algorithms, and non–hierarchical algorithms like ISODATA based on K–means family of algorithms (Ball G., and D. Hall, MacQueen J.) or FOREL family algorithms (Zagoruiko N.G., Lbov G.S., and V.N. Elkina).

Actually cluster number is not defined by hierarchical algorithms. Instead of the full tree of nested clusters (dendrogram) is built. The cluster number is defined based on suppositions not concerning functioning of algorithms, for example, based on changing of a threshold of splitting (merging) of clusters. Difficulties of such an algorithms have been well studied: a measure of vicinity, index inversions in dendrograms, hierarchical classification inflexibility which sometimes is very undesirable. Nevertheless the clustering representation by the dendrogram can get us the complete idea about cluster structure.

It needs to previously define the way of functioning, and stop condition for non–hierarchical algorithms by rather a large number of parameters. Sometimes it is difficult to do especially at the beginning of data studying. But such an algorithms provide a flexible clustering, and usually define cluster number.

From other side objects can be described by large number of features (parameters). Then the problem of feature grouping becomes important. The initial information is contained in the square matrix of feature interdependence, in particular in correlation matrix. Informal hypothesis about small number of hidden factors determining feature interdependence structure is the basis to successfully solve the grouping problem.

The grouping problem has standalone meaning (aggregating problem, diagonalizing of the interdependence matrix problem) to distinguish feature groups characterizing objects in some sense, and so, can be considered as a specific clustering problem. But it can have an auxiliary meaning, for example, to define prior hypothesis about number of common factors in factor analysis.

The problem of group number selecting is highly non–trivial too. So, grouping algorithms is often built for previously defined group number K (extreme grouping by Braverman E.M., and V.Y. Lumelsky) or as agglomerative and divisive procedures of group sequence inspection for different K.

The experience of using hierarchical algorithms demonstrates sometimes unsuccessful (unsuitable) partitioning, especially for sufficiently complicated correlation matrixes in the sense of task contents (“correlation pleiades” algorithm by Vyhandy L.K., and V.P. Terent’ev, algorithm “Joining” by Dorofeyuk A.A., and V.Y. Lumelsky). It is natural result for hierarchical grouping as a rule essentially limits variety of possible grouping. In their turn, algorithms of feature extreme grouping are free of this restriction, but they are locally optimal. So, their result essentially depends on the group number, and the start grouping for complicated data.

We propose here new non–hierarchical divisive clustering algorithm without tuning of parameters based on the classical K–means algorithm. The result of such an algorithm is a sequence of not generating a hierarchy clusterizations. We propose not to define number of clusters a priori as since behavior of built sequence can point no doubt non–optimal clusterizations. Then the rest clusterizations define possible cluster numbers. We can use them for final selecting in a case of need.

We also propose here new non–hierarchical divisive grouping algorithm based on the well–known feature extreme grouping algorithms. The result of such an algorithm is a sequence of not generating a hierarchy groupings. We also propose not to define number of groups a priori as since behavior of built sequence can point no doubt non–optimal groupings. Then the rest groupings define possible group numbers. We can use them for final selecting in a case of need too.


	part 1 (RUS/ENG) part 2 (RUS/ENG) part 3 (RUS/ENG) Example