
On Restoring Missing Data

One possible way to restore missing data follows directly from the fundamental idea of cluster analysis: if the set of elements can be split into subsets that are compact in some sense, then it is natural to restore the missing value of an attribute of an element as the most typical value of that attribute within the corresponding compact subset. Obviously, different methods arise from different ways of forming such a partitioning (for example, the linguistic algorithms of E.M. Braverman and I.B. Muchnik, or the ZET family of algorithms by N.G. Zagoruiko).

The problem of restoring missing data has been investigated thoroughly in Russia by N.G. Zagoruiko's team, under the heading of so-called "local" restoration methods. In fact, the ZET family of algorithms is a highly detailed and well-tested technology for experimental data verification, based on the hypothesis of data redundancy.

In the ZET algorithm, each separate missing value xij in the data matrix X(m,n), i.e. the jth attribute of the ith object, is estimated from a predictor matrix. This matrix consists of the m0 objects most similar to the ith object and the n0 features most similar to the jth feature, where m0 and n0 are given a priori. The missing value (a blank) is restored by a regression equation for each pair of predictor-matrix rows in which one row of the pair contains the blank; the same blank is likewise restored for each pair of predictor-matrix columns. The final estimate is a weighted average of these partial estimates, with weights depending on some parameters.
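As an illustration only, the following Python sketch selects such a predictor matrix. The correlation-based similarity and all names here are our assumptions; the pairwise row and column regressions and their weighted averaging, which form the core of the full ZET technology, are not shown.

```python
import numpy as np

def predictor_matrix(X, i, j, m0, n0):
    """Select the m0 rows most similar to row i and the n0 columns most
    similar to column j of X (blanks coded as np.nan); the blank X[i, j]
    then sits in the top-left cell of the returned submatrix.  The
    correlation similarity below is an assumption: the real ZET algorithm
    is considerably more elaborate."""
    def sim(a, b):
        ok = ~np.isnan(a) & ~np.isnan(b)          # complete pairs only
        if ok.sum() < 2:
            return 0.0
        c = np.corrcoef(a[ok], b[ok])[0, 1]
        return 0.0 if np.isnan(c) else abs(c)
    rows = sorted((r for r in range(X.shape[0]) if r != i),
                  key=lambda r: sim(X[r], X[i]), reverse=True)[:m0]
    cols = sorted((c for c in range(X.shape[1]) if c != j),
                  key=lambda c: sim(X[:, c], X[:, j]), reverse=True)[:n0]
    return X[np.ix_([i] + rows, [j] + cols)]
```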

The necessity to set the predictor matrix size via m0 and n0, and to set other important parameters, leads to the need to verify the reliability of the restored values. This is exactly why modern versions of the ZET family present a multilevel, multi-step technology for setting all necessary parameters, based on quality verification of the predicted values. Such a technology has merit in its own right and is suitable for a range of prediction problems.

On the other hand, it is convenient to define the predictor matrix size, for example, by an appropriate cluster and group once a clustering of objects and a grouping of features have been built; then the problem reduces again to selecting the numbers of clusters and groups. Let us apply the method of non-hierarchical partitioning to restore blanks in the data matrix.

Let a blank be denoted by the symbol "·". All incomplete pairs (xil, xjl), l = 1, …, n are excluded when the similarity dij of elements wi and wj is computed; as is known, a pair is incomplete if at least one of its values is missing. Let dij = d·ij for correlation, and dij = d·ij · n/(n − n·) for the Euclidean distance, where d·ij is the similarity computed with the incomplete pairs excluded, and n· is the number of incomplete pairs, n· < n.
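A minimal sketch of this corrected Euclidean distance follows, assuming blanks are coded as np.nan and reading the formula literally (the factor n/(n − n·) multiplies the distance computed over the complete pairs). The function name is ours.

```python
import numpy as np

def distance_with_blanks(wi, wj):
    """Euclidean distance between elements wi and wj with blanks (np.nan):
    incomplete pairs are skipped when d·ij is computed, and the result is
    corrected by the factor n / (n - n·) as in the text."""
    wi, wj = np.asarray(wi, float), np.asarray(wj, float)
    n = wi.size
    ok = ~np.isnan(wi) & ~np.isnan(wj)     # complete pairs only
    n_dot = n - int(ok.sum())              # n·: number of incomplete pairs
    if n_dot == n:
        raise ValueError("no complete pairs: distance is undefined")
    d_dot = np.sqrt(np.sum((wi[ok] - wj[ok]) ** 2))   # d·ij
    return d_dot * n / (n - n_dot)
```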

Let a partitioning of the set W be built. For the clustering problem we restore the missing value xij of object xi for the jth feature as the average ckj of this jth feature over the cluster Wk with xi ∈ Wk.
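A sketch of this rule, with illustrative names; labels[i] is assumed to hold the cluster index of object i.

```python
import numpy as np

def restore_by_cluster_mean(X, labels):
    """For the clustering problem: each blank X[i, j] (np.nan) is replaced
    by the average c_kj of feature j over the cluster W_k containing
    object i.  Blanks in clusters where feature j is entirely missing
    are left in place."""
    X = np.array(X, float)
    for k in np.unique(labels):
        rows = np.flatnonzero(labels == k)        # objects of cluster W_k
        for j in range(X.shape[1]):
            col = X[rows, j]
            known = col[~np.isnan(col)]
            if known.size:                        # otherwise the blank stays
                col[np.isnan(col)] = known.mean() # c_kj
                X[rows, j] = col
    return X
```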

For the grouping problem we restore the missing value xij of feature xi for the jth object (observation) as a weighted average ckj of this jth observation over the features in group Wk with xi ∈ Wk. Recall that it is convenient here to use the matrix X(m,n) with features as its rows. The missing value xij can be restored by the linear regression equation

        \frac{x_{ij} - \mu(x_i)}{\sigma(x_i)} = d_{iq}\,\frac{x_{qj} - \mu(x_q)}{\sigma(x_q)}

for features xi and xq, as the normalized value

        x_{ij}^{q} = d_{iq}\,\frac{x_{qj} - \mu(x_q)}{\sigma(x_q)},

where μ(xq) and σ(xq) are the average and the standard deviation of the feature xq (over its row). The missing value is then estimated as

        x_{ij} = \mu(x_i) + c_{kj}\,\sigma(x_i),

        c_{kj} = \sum_{x_q \in W_k} x_{ij}^{q}\,|d_{iq}| \Big/ \sum_{x_p \in W_k} |d_{ip}|,

where both sums run only over features whose jth values x_{qj}, x_{pj} are not blank. If the values xlj are missing for all xl ∈ Wk, then the missing value xij cannot be restored.
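The estimate above can be sketched as follows. D is assumed to hold the pairwise correlations d_iq of features, and `group` the indices of the group W_k containing feature i; all names are ours.

```python
import numpy as np

def restore_blank(X, D, group, i, j):
    """Restore the blank X[i, j] (rows of X are features) by the weighted
    regression estimate x_ij = mu(x_i) + c_kj * sigma(x_i)."""
    X = np.asarray(X, float)
    def mu_sigma(row):
        v = row[~np.isnan(row)]
        return v.mean(), v.std()
    num = den = 0.0
    for q in group:
        if q == i or np.isnan(X[q, j]):
            continue                         # x_qj must not be blank
        m_q, s_q = mu_sigma(X[q])
        if s_q == 0:
            continue                         # constant feature: skip (our choice)
        x_q_ij = D[i, q] * (X[q, j] - m_q) / s_q   # normalized estimate
        num += x_q_ij * abs(D[i, q])
        den += abs(D[i, q])
    if den == 0.0:
        return np.nan                        # all x_lj blank: cannot restore
    m_i, s_i = mu_sigma(X[i])
    return m_i + (num / den) * s_i           # c_kj = num / den
```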

Algorithm 3. Build the complete sequence of stable partitions by Algorithm 2 and form a table of restored blanks: let the columns of this table correspond to the blanked elements xij, and the rows correspond to the values restored under each stable partition in the sequence. Scan the table by columns from bottom to top, and restore each blank xij as the first non-blank value in the corresponding column.
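A sketch of Algorithm 3, assuming the sequence of stable partitions from Algorithm 2 is already available and that `restore_one` implements the per-partition estimate described above; both are hypothetical helpers.

```python
import numpy as np

def algorithm3(X, partitions, restore_one):
    """partitions: the complete sequence of stable partitions (label arrays)
    from Algorithm 2 (not shown); restore_one(X, labels, i, j) returns an
    estimate of the blank X[i, j] under one partition, or np.nan."""
    X = np.array(X, float)
    blanks = list(zip(*np.where(np.isnan(X))))
    # table rows: partitions in sequence order; table columns: blanks
    table = [[restore_one(X, labels, i, j) for (i, j) in blanks]
             for labels in partitions]
    for col, (i, j) in enumerate(blanks):
        for row in reversed(range(len(partitions))):   # scan bottom to top
            if not np.isnan(table[row][col]):
                X[i, j] = table[row][col]              # first non-blank wins
                break
    return X
```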

Obviously, this algorithm tends to restore a blank attribute of element wi simply as the value of the corresponding attribute of the element wq most similar to wi. This is immediately evident for the agglomerative version of Algorithm 3.

This algorithm has one important property: all blanks in the table are naturally ordered by the moments of their restoration. Obviously, the earlier a blank is restored, the more suitable its restored value is to use.

We can additionally demand that, for each restored value xij, the element wi belong to an appropriate subset Wk of size |Wk| = m0, m0 < m. Restored values can simultaneously be ordered by decreasing density of the sets Wk with wi ∈ Wk.

It can be supposed that, when a sufficiently large number of single blanks are restored independently, a general effect of data distortion can arise. So it is natural to use the next algorithm.

Algorithm 4. Let all blanks be ordered by Algorithm 3. Restore the first blank, or the first few blanks. Then repeat these actions for the remaining blanks, until all of them have been restored.
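A sketch of Algorithm 4 under the same assumptions; `order_blanks` and `estimate` stand for the ordering of Algorithm 3 and the per-blank estimate, both hypothetical helpers.

```python
import numpy as np

def algorithm4(X, order_blanks, estimate, batch=1):
    """order_blanks(X) returns the blanks of X in the order produced by
    Algorithm 3; estimate(X, i, j) returns a value for the blank X[i, j]
    or np.nan.  Only the first `batch` blanks are restored per pass;
    then the ordering is rebuilt on the updated matrix."""
    X = np.array(X, float)
    while np.isnan(X).any():
        filled = 0
        for (i, j) in order_blanks(X):
            v = estimate(X, i, j)
            if np.isnan(v):
                continue                   # this blank cannot be restored yet
            X[i, j] = v
            filled += 1
            if filled >= batch:
                break                      # re-order before restoring more
        if filled == 0:
            break                          # nothing restorable remains
    return X
```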

Suppose that, for some matrix X(m,n), the number of objects (rows) considerably exceeds the number of features (columns), i.e. m >> n. Then it is natural to apply Algorithm 4 to this matrix to restore the blanks. If n >> m, it is natural to apply Algorithm 4 to the transposed matrix XT. If m ≈ n, the next algorithm can be used.

Algorithm 5. Let all blanks be ordered by Algorithm 3 both in matrix X and in matrix XT. Let some blank xij be first in one of the sequences and in the kth place in the other. Use the sequence whose first blank has the minimal sum of places 1 + k, and restore this blank as the weighted average of its two values from the two sequences, $x_{ij} = x_{ij}^{1}\,k/(1+k) + x_{ij}^{k}/(1+k)$. Repeat these actions for the remaining blanks, until all of them have been restored.
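One step of Algorithm 5 might look as follows; the two orderings and the two estimators are assumed given, and every blank is assumed to occur in both orderings.

```python
def algorithm5_step(order_X, order_XT, estimate_X, estimate_XT):
    """order_X / order_XT: blanks of X and of X^T ordered by Algorithm 3,
    both given as (i, j) pairs in X's indexing; estimate_X / estimate_XT
    return the two restored values for a blank.  The earlier estimate
    gets the larger weight k/(1+k)."""
    a, b = order_X[0], order_XT[0]
    ka = order_XT.index(a) + 1        # place k of a in the other sequence
    kb = order_X.index(b) + 1         # place k of b in the other sequence
    if ka <= kb:                      # pick the minimal sum of places 1 + k
        blank, k = a, ka
        first, other = estimate_X(*a), estimate_XT(*a)
    else:
        blank, k = b, kb
        first, other = estimate_XT(*b), estimate_X(*b)
    return blank, first * k / (1 + k) + other / (1 + k)
```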

So, by using the method of non-hierarchical partitioning, we define a family of simple algorithms for restoring missing values. Note that no parameters need to be defined in advance: all blanks can be restored automatically, in a natural way, without any preliminary tuning.

On the other hand, a quantitative estimate of the prediction error for missing values may be required for a more detailed analysis of the data matrix. In that case both the non-hierarchical partitioning scheme itself and the algorithms proposed here must be used within the framework of the technology for quality verification of predicted values described above. The ways of using the proposed algorithms within such a technology are sufficiently evident.

References
  • Aivazyan, S.A., Buhshtaber, V.M., and L.D. Meshalkin, Applied Statistics: Classification and Reduction of Dimensionality, Finansy i statistika, Moscow, 1989 (in Russian).
  • Aldenderfer, M.S., and R.K. Blashfield, Cluster Analysis, Sage Publications, Inc., 1984.
  • Braverman, E.M., and I.B. Muchnik, Structure Methods for Empiric Data Processing, Nauka, Moscow, 1983 (in Russian).
  • Braverman, E.M., "Methods of extreme grouping of parameters and the problem of extracting essential factors", Avtomatica i telemehanica, No.1, 1970, pp.123–132 (in Russian; translated into English in Automation and Remote Control).
  • Christofides, N., Graph Theory: An Algorithmic Approach, Academic Press, London, 1975.
  • Dodge, Y., Analysis of Experiments with Missing Data, John Wiley & Sons, N.Y., 1985.
  • Dorofeyuk, A.A., "Algorithms of Automatic Classification (Survey)", Avtomatica i telemehanica, No.12, 1971, pp.78–113 (in Russian; translated into English in Automation and Remote Control).
  • Duda, R.O., and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, N.Y., 1973.
  • Duran, B.S., and P.L. Odell, Cluster Analysis: A Survey, Springer-Verlag, 1974.
  • Dvoenko, S.D., "Non-Hierarchical Divisive Clustering Algorithm", Avtomatica i telemehanica, No.4, 1999, pp.117–124 (in Russian; English translation forthcoming in Automation and Remote Control).
  • Dvoenko, S.D., "Non-Hierarchical Divisive Grouping Algorithm", Avtomatica i telemehanica, No.9, 1999, pp.47–57 (in Russian; English translation forthcoming in Automation and Remote Control).
  • Harman, H.H., Modern Factor Analysis, University of Chicago Press, Chicago, 1976.
  • Kim, J.-O., and C.W. Mueller, Factor Analysis: Statistical Methods and Practical Issues, Sage Publications, Inc., 1978.
  • Lawley, D.N., and A.E. Maxwell, Factor Analysis as a Statistical Method, Butterworths, London, 1963.
  • Little, R.J., and D.B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, N.Y., 1987.
  • Lumelsky, V.Y., "Grouping of Parameters Based on a Square Tie Matrix", Avtomatica i telemehanica, No.1, 1970, pp.133–143 (in Russian; translated into English in Automation and Remote Control).
  • Mirkin, B.G., Analysis of Qualitative Attributes and Structures, Statistika, Moscow, 1980 (in Russian).
  • Tou, J.T., and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, 1974.
  • Zagoruiko, N.G., Applied Methods of Data and Knowledge Analysis, Institute of Mathematics, Novosibirsk, 1999 (in Russian).