
On Restoring Missing Data

One possible way to restore missing data follows directly from the fundamental idea of cluster analysis: if the set of elements can be split into subsets that are compact in some sense, then it is natural to restore the missing value of an attribute of an element as the most typical value of that attribute within the corresponding compact subset. Obviously, different methods arise from different ways of forming such a partitioning (for example, the linguistic algorithms of E.M. Braverman and I.B. Muchnik, or the ZET family of algorithms by N.G. Zagoruiko).

The problem of restoring missing data has been investigated thoroughly in Russia by N.G. Zagoruiko's team, under the heading of so-called "local" restoration methods. In fact, the ZET family of algorithms is a highly detailed and well-tested technology for experimental data verification, based on the hypothesis of data redundancy.

In the ZET algorithm, each separate missing value xij in the data matrix X(m,n), i.e. the jth attribute of the ith object, is estimated from a predictor matrix. This matrix consists of the m0 objects most similar to the ith object and the n0 features most similar to the jth feature, where m0 and n0 are given a priori. The missing value (a blank) is restored by a regression equation for each pair of predictor-matrix rows in which one row of the pair contains the blank; the same blank is likewise restored for each pair of predictor-matrix columns. The final estimate is a weighted average of these partial estimates, with weights depending on some parameters.
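As an illustration only, the following Python sketch selects such a predictor matrix. The correlation-based similarity and all names here are our assumptions; the pairwise row and column regressions and their weighted averaging, which form the core of the full ZET technology, are not shown.

```python
import numpy as np

def predictor_matrix(X, i, j, m0, n0):
    """Select the m0 rows most similar to row i and the n0 columns most
    similar to column j of X (blanks coded as np.nan); the blank X[i, j]
    then sits in the top-left cell of the returned submatrix.  The
    correlation similarity below is an assumption: the real ZET algorithm
    is considerably more elaborate."""
    def sim(a, b):
        ok = ~np.isnan(a) & ~np.isnan(b)          # complete pairs only
        if ok.sum() < 2:
            return 0.0
        c = np.corrcoef(a[ok], b[ok])[0, 1]
        return 0.0 if np.isnan(c) else abs(c)
    rows = sorted((r for r in range(X.shape[0]) if r != i),
                  key=lambda r: sim(X[r], X[i]), reverse=True)[:m0]
    cols = sorted((c for c in range(X.shape[1]) if c != j),
                  key=lambda c: sim(X[:, c], X[:, j]), reverse=True)[:n0]
    return X[np.ix_([i] + rows, [j] + cols)]
```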

The necessity to set the predictor matrix size via m0 and n0, and to set other important parameters, leads to the need to verify the reliability of the restored values. This is exactly why modern versions of the ZET family present a multilevel, multi-step technology for setting all necessary parameters, based on quality verification of the predicted values. Such a technology has merit in its own right and is suitable for a range of prediction problems.

On the other hand, it is convenient to define the predictor matrix size, for example, by an appropriate cluster and group once a clustering of objects and a grouping of features have been built; then the problem reduces again to selecting the numbers of clusters and groups. Let us apply the method of non-hierarchical partitioning to restore blanks in the data matrix.

Let a blank be denoted by the symbol "·". All incomplete pairs (xil, xjl), l = 1, …, n are excluded when the similarity dij of elements wi and wj is computed; as is known, a pair is incomplete if at least one of its values is missing. Let dij = d·ij for correlation, and dij = d·ij · n/(n − n·) for the Euclidean distance, where d·ij is the similarity computed with the incomplete pairs excluded, and n· is the number of incomplete pairs, n· < n.
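A minimal sketch of this corrected Euclidean distance follows, assuming blanks are coded as np.nan and reading the formula literally (the factor n/(n − n·) multiplies the distance computed over the complete pairs). The function name is ours.

```python
import numpy as np

def distance_with_blanks(wi, wj):
    """Euclidean distance between elements wi and wj with blanks (np.nan):
    incomplete pairs are skipped when d·ij is computed, and the result is
    corrected by the factor n / (n - n·) as in the text."""
    wi, wj = np.asarray(wi, float), np.asarray(wj, float)
    n = wi.size
    ok = ~np.isnan(wi) & ~np.isnan(wj)     # complete pairs only
    n_dot = n - int(ok.sum())              # n·: number of incomplete pairs
    if n_dot == n:
        raise ValueError("no complete pairs: distance is undefined")
    d_dot = np.sqrt(np.sum((wi[ok] - wj[ok]) ** 2))   # d·ij
    return d_dot * n / (n - n_dot)
```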

Let a partitioning of the set W be built. For the clustering problem we restore the missing value xij of object xi for the jth feature as the average ckj of this jth feature over the cluster Wk with xi ∈ Wk.
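A sketch of this rule, with illustrative names; labels[i] is assumed to hold the cluster index of object i.

```python
import numpy as np

def restore_by_cluster_mean(X, labels):
    """For the clustering problem: each blank X[i, j] (np.nan) is replaced
    by the average c_kj of feature j over the cluster W_k containing
    object i.  Blanks in clusters where feature j is entirely missing
    are left in place."""
    X = np.array(X, float)
    for k in np.unique(labels):
        rows = np.flatnonzero(labels == k)        # objects of cluster W_k
        for j in range(X.shape[1]):
            col = X[rows, j]
            known = col[~np.isnan(col)]
            if known.size:                        # otherwise the blank stays
                col[np.isnan(col)] = known.mean() # c_kj
                X[rows, j] = col
    return X
```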

For the grouping problem we restore the missing value xij of feature xi for the jth object (observation) as a weighted average ckj of this jth observation over the features in group Wk with xi ∈ Wk. Recall that it is convenient here to use the matrix X(m,n) with features as its rows. The missing value xij can be restored by the linear regression equation

        \frac{x_{ij} - \mu(x_i)}{\sigma(x_i)} = d_{iq}\,\frac{x_{qj} - \mu(x_q)}{\sigma(x_q)}

for features xi and xq, as the normalized value

        x_{ij}^{q} = d_{iq}\,\frac{x_{qj} - \mu(x_q)}{\sigma(x_q)},

where μ(xq) and σ(xq) are the average and the standard deviation of the feature xq (over its row). The missing value is then estimated as

        x_{ij} = \mu(x_i) + c_{kj}\,\sigma(x_i),

        c_{kj} = \sum_{x_q \in W_k} x_{ij}^{q}\,|d_{iq}| \Big/ \sum_{x_p \in W_k} |d_{ip}|,

where both sums run only over features whose jth values x_{qj}, x_{pj} are not blank. If the values xlj are missing for all xl ∈ Wk, then the missing value xij cannot be restored.
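The estimate above can be sketched as follows. D is assumed to hold the pairwise correlations d_iq of features, and `group` the indices of the group W_k containing feature i; all names are ours.

```python
import numpy as np

def restore_blank(X, D, group, i, j):
    """Restore the blank X[i, j] (rows of X are features) by the weighted
    regression estimate x_ij = mu(x_i) + c_kj * sigma(x_i)."""
    X = np.asarray(X, float)
    def mu_sigma(row):
        v = row[~np.isnan(row)]
        return v.mean(), v.std()
    num = den = 0.0
    for q in group:
        if q == i or np.isnan(X[q, j]):
            continue                         # x_qj must not be blank
        m_q, s_q = mu_sigma(X[q])
        if s_q == 0:
            continue                         # constant feature: skip (our choice)
        x_q_ij = D[i, q] * (X[q, j] - m_q) / s_q   # normalized estimate
        num += x_q_ij * abs(D[i, q])
        den += abs(D[i, q])
    if den == 0.0:
        return np.nan                        # all x_lj blank: cannot restore
    m_i, s_i = mu_sigma(X[i])
    return m_i + (num / den) * s_i           # c_kj = num / den
```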

Algorithm 3. Build the complete sequence of stable partitions by Algorithm 2 and form a table of restored blanks: let the columns of this table correspond to the blanked elements xij, and the rows correspond to the values restored under each stable partition in the sequence. Scan the table by columns from bottom to top, and restore each blank xij as the first non-blank value in the corresponding column.
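A sketch of Algorithm 3, assuming the sequence of stable partitions from Algorithm 2 is already available and that `restore_one` implements the per-partition estimate described above; both are hypothetical helpers.

```python
import numpy as np

def algorithm3(X, partitions, restore_one):
    """partitions: the complete sequence of stable partitions (label arrays)
    from Algorithm 2 (not shown); restore_one(X, labels, i, j) returns an
    estimate of the blank X[i, j] under one partition, or np.nan."""
    X = np.array(X, float)
    blanks = list(zip(*np.where(np.isnan(X))))
    # table rows: partitions in sequence order; table columns: blanks
    table = [[restore_one(X, labels, i, j) for (i, j) in blanks]
             for labels in partitions]
    for col, (i, j) in enumerate(blanks):
        for row in reversed(range(len(partitions))):   # scan bottom to top
            if not np.isnan(table[row][col]):
                X[i, j] = table[row][col]              # first non-blank wins
                break
    return X
```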

Obviously, this algorithm tends to restore a blank attribute of element wi simply as the value of the corresponding attribute of the element wq most similar to wi. This is immediately evident for the agglomerative version of Algorithm 3.

This algorithm has one important property: all blanks in the table are naturally ordered by the moments of their restoration. Obviously, the earlier a blank is restored, the more suitable its restored value is to use.

We can additionally demand that, for each restored value xij, the element wi belong to an appropriate subset Wk of size |Wk| = m0, m0 < m. Restored values can simultaneously be ordered by decreasing density of the sets Wk with wi ∈ Wk.

It can be supposed that, when a sufficiently large number of single blanks are restored independently, a general effect of data distortion can arise. So it is natural to use the next algorithm.

Algorithm 4. Let all blanks be ordered by Algorithm 3. Restore the first blank, or the first few blanks. Then repeat these actions for the remaining blanks, until all of them have been restored.
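A sketch of Algorithm 4 under the same assumptions; `order_blanks` and `estimate` stand for the ordering of Algorithm 3 and the per-blank estimate, both hypothetical helpers.

```python
import numpy as np

def algorithm4(X, order_blanks, estimate, batch=1):
    """order_blanks(X) returns the blanks of X in the order produced by
    Algorithm 3; estimate(X, i, j) returns a value for the blank X[i, j]
    or np.nan.  Only the first `batch` blanks are restored per pass;
    then the ordering is rebuilt on the updated matrix."""
    X = np.array(X, float)
    while np.isnan(X).any():
        filled = 0
        for (i, j) in order_blanks(X):
            v = estimate(X, i, j)
            if np.isnan(v):
                continue                   # this blank cannot be restored yet
            X[i, j] = v
            filled += 1
            if filled >= batch:
                break                      # re-order before restoring more
        if filled == 0:
            break                          # nothing restorable remains
    return X
```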

Suppose that, for some matrix X(m,n), the number of objects (rows) considerably exceeds the number of features (columns), i.e. m >> n. Then it is natural to apply Algorithm 4 to this matrix to restore the blanks. If n >> m, it is natural to apply Algorithm 4 to the transposed matrix XT. If m ≈ n, the next algorithm can be used.

Algorithm 5. Let all blanks be ordered by Algorithm 3 both in matrix X and in matrix XT. Let some blank xij be first in one of the sequences and in the kth place in the other. Use the sequence whose first blank has the minimal sum of places 1 + k, and restore this blank as the weighted average of its two values from the two sequences, $x_{ij} = x_{ij}^{1}\,k/(1+k) + x_{ij}^{k}/(1+k)$. Repeat these actions for the remaining blanks, until all of them have been restored.
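One step of Algorithm 5 might look as follows; the two orderings and the two estimators are assumed given, and every blank is assumed to occur in both orderings.

```python
def algorithm5_step(order_X, order_XT, estimate_X, estimate_XT):
    """order_X / order_XT: blanks of X and of X^T ordered by Algorithm 3,
    both given as (i, j) pairs in X's indexing; estimate_X / estimate_XT
    return the two restored values for a blank.  The earlier estimate
    gets the larger weight k/(1+k)."""
    a, b = order_X[0], order_XT[0]
    ka = order_XT.index(a) + 1        # place k of a in the other sequence
    kb = order_X.index(b) + 1         # place k of b in the other sequence
    if ka <= kb:                      # pick the minimal sum of places 1 + k
        blank, k = a, ka
        first, other = estimate_X(*a), estimate_XT(*a)
    else:
        blank, k = b, kb
        first, other = estimate_XT(*b), estimate_X(*b)
    return blank, first * k / (1 + k) + other / (1 + k)
```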

So, by using the method of non-hierarchical partitioning, we define a family of simple algorithms for restoring missing values. Note that no parameters need to be defined in advance: all blanks can be restored automatically, in a natural way, without any preliminary tuning.

On the other hand, a quantitative estimate of the prediction error for missing values may be required for a more detailed analysis of the data matrix. In that case both the non-hierarchical partitioning scheme itself and the algorithms proposed here must be used within the framework of the technology for quality verification of predicted values described above. The ways of using the proposed algorithms within such a technology are sufficiently evident.

References
  • Aivazyan, S.A., Buhshtaber, V.M., and L.D. Meshalkin, Applied Statistics: Classification and Reduction of Dimensionality, Finansy i statistika, Moscow, 1989 (in Russian).
  • Aldenderfer, M.S., and R.K. Blashfield, Cluster Analysis, Sage Publications, Inc., 1984.
  • Braverman, E.M., and I.B. Muchnik, Structure Methods for Empiric Data Processing, Nauka, Moscow, 1983 (in Russian).
  • Braverman, E.M., "Methods of extreme grouping of parameters and the problem of extracting essential factors", Avtomatica i telemehanica, No.1, 1970, pp.123–132 (in Russian; translated into English in Automation and Remote Control).
  • Christofides, N., Graph Theory: An Algorithmic Approach, Academic Press, London, 1975.
  • Dodge, Y., Analysis of Experiments with Missing Data, John Wiley & Sons, N.Y., 1985.
  • Dorofeyuk, A.A., "Algorithms of Automatic Classification (Survey)", Avtomatica i telemehanica, No.12, 1971, pp.78–113 (in Russian; translated into English in Automation and Remote Control).
  • Duda, R.O., and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, N.Y., 1973.
  • Duran, B.S., and P.L. Odell, Cluster Analysis: A Survey, Springer-Verlag, 1974.
  • Dvoenko, S.D., "Non-Hierarchical Divisive Clustering Algorithm", Avtomatica i telemehanica, No.4, 1999, pp.117–124 (in Russian; English translation forthcoming in Automation and Remote Control).
  • Dvoenko, S.D., "Non-Hierarchical Divisive Grouping Algorithm", Avtomatica i telemehanica, No.9, 1999, pp.47–57 (in Russian; English translation forthcoming in Automation and Remote Control).
  • Harman, H.H., Modern Factor Analysis, University of Chicago Press, Chicago, 1976.
  • Kim, J.-O., and C.W. Mueller, Factor Analysis: Statistical Methods and Practical Issues, Sage Publications, Inc., 1978.
  • Lawley, D.N., and A.E. Maxwell, Factor Analysis as a Statistical Method, Butterworths, London, 1963.
  • Little, R.J., and D.B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, N.Y., 1987.
  • Lumelsky, V.Y., "Grouping of Parameters Based on a Square Tie Matrix", Avtomatica i telemehanica, No.1, 1970, pp.133–143 (in Russian; translated into English in Automation and Remote Control).
  • Mirkin, B.G., Analysis of Qualitative Attributes and Structures, Statistika, Moscow, 1980 (in Russian).
  • Tou, J.T., and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, 1974.
  • Zagoruiko, N.G., Applied Methods of Data and Knowledge Analysis, Institute of Mathematics, Novosibirsk, 1999 (in Russian).