Imputation of predictors

Missing Data in Social Sciences

Missing data refers to a situation in which the values of a variable are unknown (missing) for some, but not all, observations. This 'missingness' can arise for many different reasons. In the social sciences it may, for instance, stem from a survey respondent refusing to answer a specific question, a country not reporting a specific statistic, or a human coding error that forces a value to be deleted. Missing data is ubiquitous in the social sciences, since social scientists can rarely obtain the kind of controlled experimental data that may be available to other scientists.

In quantitative studies, missing data is problematic because almost all statistical estimation techniques require complete data in order to run (Graham, 2009). The missing data must therefore be dealt with in one way or another before estimation can happen. The default in most statistical software is simply to drop from the analysis every observation that has a missing value in at least one variable. This practice is known as Listwise Deletion (Allison, 1999). Listwise Deletion has several drawbacks: it discards much of the collected data, and the smaller sample reduces statistical power. In addition, Listwise Deletion produces biased parameter estimates unless certain, often unrealistic, assumptions are fulfilled (see for instance Allison, 1999; Graham, 2009; Schafer and Graham, 2002).
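To illustrate how quickly Listwise Deletion discards data, the following minimal sketch applies it to a small, made-up set of survey records. The variable names and values are hypothetical and are not taken from any ViEWS dataset.

```python
# Hypothetical survey records; None marks a missing answer.
rows = [
    {"age": 34, "income": 52000, "trust": 4},
    {"age": 51, "income": None, "trust": 3},
    {"age": 29, "income": 41000, "trust": None},
    {"age": None, "income": 38000, "trust": 5},
    {"age": 45, "income": 60000, "trust": 2},
]


def listwise_delete(records):
    """Keep only records with no missing value in any variable."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = listwise_delete(rows)
total_cells = sum(len(r) for r in rows)
missing_cells = sum(v is None for r in rows for v in r.values())

print(f"missing cells: {missing_cells}/{total_cells}")
print(f"rows kept: {len(complete)}/{len(rows)}")
```

In this toy example only a fifth of the individual values are missing, yet more than half of the observations are dropped, because each missing value sits in a different row.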

The alternative to removing incomplete observations is to impute a value in place of each missing one. Several possibilities exist here. Among the more naive suggestions are to use either the variable's mean or the predicted values from a linear regression of the variable on all other variables. Using the mean as the imputed value does, however, seriously bias the estimated parameters unless some very strong assumptions are fulfilled. Any single imputation also treats the filled-in values as if they had been observed, so the uncertainty about them is ignored.
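One way to see the problem with mean imputation is to look at the spread of a variable before and after imputing. The sketch below uses simulated data (the distribution, the 30% missingness rate, and the seed are all illustrative assumptions) with values deleted completely at random.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical variable: 1000 draws from a normal distribution.
full = [random.gauss(50, 10) for _ in range(1000)]

# Delete roughly 30% of the values completely at random.
observed = [x if random.random() > 0.3 else None for x in full]

known = [x for x in observed if x is not None]
mean_known = statistics.mean(known)

# Mean imputation: every missing value is replaced by the observed mean.
imputed = [mean_known if x is None else x for x in observed]

sd_known = statistics.stdev(known)
sd_imputed = statistics.stdev(imputed)

print(f"sd of observed values:    {sd_known:.2f}")
print(f"sd after mean imputation: {sd_imputed:.2f}")
```

The mean is unchanged, but the standard deviation shrinks because every imputed value sits exactly at the centre of the distribution. Estimates that depend on the variable's spread, such as standard errors and correlations, are distorted as a result.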

The solution to these problems is to use multiple imputation, i.e. instead of imputing one single value, multiple values are imputed for each missing value, creating multiple complete datasets on which the analysis can be run. The results from these multiple datasets are then merged to produce the estimates for the parameters.
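A common way to merge the results is Rubin's rules: the pooled estimate is the average of the per-dataset estimates, and its variance adds the average within-dataset variance to the between-dataset variance, inflated by a factor of 1 + 1/m for m datasets. The sketch below is a minimal illustration for a single parameter; the function name `pool_rubin` and the input numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine one parameter's results from m imputed datasets.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    return q_bar, math.sqrt(total)


# Hypothetical coefficient and standard error from five imputed datasets.
coefs = [0.42, 0.47, 0.39, 0.45, 0.44]
ses = [0.10, 0.11, 0.10, 0.12, 0.10]

estimate, se = pool_rubin(coefs, [s ** 2 for s in ses])
print(f"pooled estimate: {estimate:.3f}, pooled SE: {se:.3f}")
```

The pooled standard error is larger than the average within-dataset standard error whenever the imputed datasets disagree, which is how the uncertainty about the missing values is carried into the final result.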

Methods used for the ViEWS project

The data used by the ViEWS project contains a large amount of missing data. In order to avoid biased results and for the simulations to run properly, the project uses Multiple Imputation with the Amelia II package in R to replace the missing data. Five different complete datasets are generated and used simultaneously in Dynasim. The results from these five datasets are then combined using the so-called Rubin's Rules (Allison, 1999) to produce the forecasts. For the one-step-ahead forecasts, a single imputed dataset is randomly selected and used for the entire forecast.

The ViEWS project will conduct further tests on how different missing data techniques affect prediction and simulation, in order to create a best practice for missing data in predictive studies.

References

  • Allison, Paul D. (1999). Missing Data. Thousand Oaks, CA: Sage.
  • Graham, John W. (2009). "Missing data analysis: Making it work in the real world". In: Annual Review of Psychology 60, pp. 549-576.
  • Schafer, Joseph L. and John W. Graham (2002). "Missing data: Our view of the state of the art". In: Psychological Methods 7.2, p. 147.