
IMPUTATION OF MISSING VALUES IN DATA CONTAINING OUTLIERS

By: HAFTI MARDIAH 140720090012

THESIS
Submitted in partial fulfillment of the requirements for the degree of Master of Applied Statistics, Master of Applied Statistics Program, Social Statistics Concentration

UNIVERSITAS PADJADJARAN GRADUATE PROGRAM BANDUNG 2010

ABSTRACT

Thesis Title        : Imputation of Missing Values in Data Containing Outliers
Keywords            : Missing Data, Outlier, Predictive Mean Matching, Least Trimmed Squares, Robust Estimation
Name                : Hafti Mardiah
NPM (student ID)    : 140720090012
Study Program       : Applied Statistics
Main Field of Study : Social Statistics
Supervisory Team    : 1. Gandhi Pawitan, Ph.D.
                      2. Budhi Handoko, M.Si.
Year of Graduation  : 2010

Abstract

Missing data is one of the problems that frequently arises in surveys. Because data are costly and valuable, imputation is a more prudent way to handle missing data than discarding the observations or variables that contain missing values. Handling missing values in a data set that contains outliers requires particular attention, because most imputation methods under the Missing at Random (MAR) and Missing Completely at Random (MCAR) mechanisms assume that the data follow a multivariate normal distribution. This assumption is no longer valid when outliers are present, so an imputation method based on estimation that is robust to outliers is preferable. Predictive Mean Matching (PMM) is a composite imputation method that combines regression imputation with nearest-neighbour imputation, and it assumes that the data come from a multivariate normal distribution. When the normality assumption is violated, PMM yields implausible imputed values and a lower Efficiency Relative (ER) statistic than Least Trimmed Squares (LTS) regression imputation. LTS regression imputation combines the LTS algorithm with the regression imputation algorithm: the regression parameters are estimated by LTS regression and then used in regression imputation.
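To make the two methods named in the abstract concrete, the following is a minimal Python sketch on a single predictor. It is an illustration under stated assumptions, not the procedure or the data used in the thesis. The function names (`ols_fit`, `lts_fit`, `pmm_impute`), the toy data, the random two-point starts with concentration steps used to approximate the LTS fit, and the choice h = (n + 3) // 2 are all assumptions introduced here. The PMM variant shown is a simplified deterministic one: it fits an ordinary least squares regression and copies the observed value of the donor whose predicted mean is closest.

```python
import random


def ols_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    return mean_y - slope * mean_x, slope


def lts_fit(xs, ys, h=None, n_starts=50, max_csteps=20, seed=0):
    """Approximate LTS: minimise the sum of the h smallest squared residuals."""
    n = len(xs)
    if h is None:
        h = (n + 3) // 2  # assumed choice: (n + p + 1) // 2 with p = 2 coefficients
    rng = random.Random(seed)
    best_obj, best_coef = float("inf"), None
    for _ in range(n_starts):
        i, j = rng.sample(range(n), 2)
        if xs[i] == xs[j]:
            continue  # a two-point start needs distinct x values
        subset = [i, j]
        for _ in range(max_csteps):
            # Concentration step: refit, then keep the h best-fitting cases.
            a, b = ols_fit([xs[k] for k in subset], [ys[k] for k in subset])
            ranked = sorted(range(n), key=lambda k: (ys[k] - a - b * xs[k]) ** 2)
            new_subset = sorted(ranked[:h])
            if new_subset == subset:
                break
            subset = new_subset
        a, b = ols_fit([xs[k] for k in subset], [ys[k] for k in subset])
        obj = sum(sorted((y - a - b * x) ** 2 for x, y in zip(xs, ys))[:h])
        if obj < best_obj:
            best_obj, best_coef = obj, (a, b)
    return best_coef


def pmm_impute(xs_obs, ys_obs, x_missing):
    """Simplified PMM: borrow the observed y of the nearest predicted mean."""
    a, b = ols_fit(xs_obs, ys_obs)
    target = a + b * x_missing
    donor = min(range(len(xs_obs)), key=lambda k: abs(a + b * xs_obs[k] - target))
    return ys_obs[donor]


# Toy data (hypothetical): y = 2 + 3x for x = 1..10, plus two outlying cases.
xs_obs = list(range(1, 13))
ys_obs = [2 + 3 * x for x in range(1, 11)] + [200, 220]
x_missing = 6.4  # a case whose y value is missing

a_ols, b_ols = ols_fit(xs_obs, ys_obs)
a_lts, b_lts = lts_fit(xs_obs, ys_obs)

ols_imputed = a_ols + b_ols * x_missing  # classical regression imputation
lts_imputed = a_lts + b_lts * x_missing  # LTS regression imputation
pmm_imputed = pmm_impute(xs_obs, ys_obs, x_missing)

print("OLS regression imputation:", round(ols_imputed, 2))
print("LTS regression imputation:", round(lts_imputed, 2))
print("PMM (nearest donor):", pmm_imputed)
```

On this toy data the two outlying cases pull the least squares line away from the bulk of the observations, while the trimmed fit follows the ten cases that lie on the line. The sketch uses a single deterministic imputation; a multiple-imputation version of PMM would also draw the regression parameters at random, which is omitted here.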
