ViEWS is evaluated in three main ways. First, the models and ensembles generate predictions from 36 months in the past up to the present day (t-36 to t). These predictions are compared with the available data for this period. Second, the models and ensembles generate forecasts 36 months into the future (t+1 to t+36). These forecasts are published on this website and continually compared with data on large-scale political violence as it becomes available. Third, ViEWS forecasts are compared with results from other forecasting systems.
Since model performance is multidimensional, ViEWS relies on a suite of metrics to evaluate performance. In addition to providing a more complete picture of model performance, such an approach lowers the risk of favouring models that perform well in one aspect (for example correctly classifying the absence of conflict) over others (for example correctly classifying conflict). In the following, we briefly summarize our metrics of model performance.
Area under the Receiver Operating Characteristic curve (AUROC)
AUROC summarizes performance as a relative measure of the true positive rate and the false positive rate of predictions. The goal is to maximize true positives relative to false positives: the measure rewards models for increasing detection of actual conflict (true positives) relative to "false alarms" (false positives). A model that predicts perfectly has an AUROC of 1, while a model that cannot distinguish positives from negatives has a value of 0.5 (equivalent to a coin toss).
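As a minimal illustration (using hypothetical labels and forecasts, not ViEWS data or code), AUROC can be computed with scikit-learn:

```python
# Hypothetical example: computing AUROC with scikit-learn.
from sklearn.metrics import roc_auc_score

# 1 = conflict observed, 0 = no conflict (made-up labels)
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
# predicted probability of conflict from a hypothetical model
y_prob = [0.10, 0.20, 0.15, 0.80, 0.30, 0.35, 0.05, 0.40, 0.90, 0.20]

# AUROC equals the probability that a randomly drawn conflict case
# receives a higher predicted probability than a randomly drawn
# non-conflict case.
auroc = roc_auc_score(y_true, y_prob)
print(auroc)
```

Note that AUROC depends only on the ranking of the forecasts, not on the probability values themselves.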
Area under the Precision-Recall curve (AUPR)
AUPR summarizes the trade-off between precision and recall (the true positive rate). Precision is measured as the proportion of predicted conflict onsets that are correct, so the AUPR rewards models for being right when they do predict conflict. Since only a small percentage of observations experience conflict, it is more difficult to get predictions of conflict correct than it is to get predictions of the absence of conflict correct. AUPR is therefore a more demanding measure than AUROC. Since we are more interested in predicting instances of political violence than their absence, we give priority to the AUPR over the AUROC, as the former rewards models more for accurately predicting conflict.
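The corresponding computation (again on hypothetical data, using scikit-learn's average-precision approximation of the area under the precision-recall curve) might look like:

```python
# Hypothetical example: area under the precision-recall curve,
# computed as average precision in scikit-learn.
from sklearn.metrics import average_precision_score

y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 1 = conflict observed
y_prob = [0.10, 0.20, 0.15, 0.80, 0.30, 0.35, 0.05, 0.40, 0.90, 0.20]

# A no-skill model scores roughly the positive rate (here 0.3),
# whereas AUROC's no-skill baseline is 0.5 regardless of class balance.
aupr = average_precision_score(y_true, y_prob)
print(aupr)
```

The class-dependent baseline is what makes AUPR the more demanding measure for rare outcomes such as conflict.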
Brier score

The Brier score measures the accuracy of probabilistic predictions as the mean squared difference between predicted probabilities and observed outcomes; lower values are better. It favours sharp, accurate probabilistic predictions (near 0 or 1), which differs from the relative ordering of the forecasts that is needed for the computation of the AUPR and AUROC. The Brier score is particularly useful for distinguishing models that perform similarly or inconsistently on the AUPR and AUROC scores.
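On the same hypothetical data, the Brier score can be computed directly from its definition or with scikit-learn:

```python
# Hypothetical example: the Brier score as the mean squared difference
# between predicted probabilities and observed outcomes.
from sklearn.metrics import brier_score_loss

y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_prob = [0.10, 0.20, 0.15, 0.80, 0.30, 0.35, 0.05, 0.40, 0.90, 0.20]

brier = brier_score_loss(y_true, y_prob)
# Equivalent by definition; lower is better, 0 is a perfect score.
manual = sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)
print(brier, manual)
```

Unlike AUROC and AUPR, rescaling the forecasts (while preserving their order) changes the Brier score, which is why it can separate models the ranking-based measures cannot.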
Confusion matrix

A confusion matrix tabulates the performance of a model by actual class (did we observe conflict or not) and predicted class (did we predict conflict or not). When looking at binary outcomes, this becomes a two-by-two table with true positives, false positives, false negatives, and true negatives.
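For probabilistic forecasts, building the table requires choosing a classification threshold first. A sketch on the same hypothetical data, thresholding at 0.5:

```python
# Hypothetical example: threshold probabilistic forecasts at 0.5,
# then tabulate the two-by-two confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_prob = [0.10, 0.20, 0.15, 0.80, 0.30, 0.35, 0.05, 0.40, 0.90, 0.20]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

# scikit-learn orders the flattened binary matrix as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
```

The resulting counts depend on the chosen threshold, whereas AUROC and AUPR summarize performance across all possible thresholds.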
Diverse model calibration metrics
A system that produces probabilistic forecasts should also be well calibrated to the actual data: when the model suggests that there is an X percent chance of an event, do events happen approximately X percent of the time? Calibration can be effectively gauged visually using calibration plots, in which forecasts are binned on the x-axis and the observed frequency of events within each bin is plotted on the y-axis. A perfectly calibrated model follows the 45-degree line. Calibration can also be gauged over time by plotting the actual versus predicted frequency of events in a given time interval.
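The binning behind a calibration plot can be sketched as follows; this simulates a perfectly calibrated forecaster on synthetic data (an assumption for illustration, not ViEWS output) and compares the mean forecast in each bin with the observed event frequency:

```python
# Illustrative sketch: bin forecasts and compare mean forecast with
# observed event frequency, as on the axes of a calibration plot.
import numpy as np

rng = np.random.default_rng(42)
p = rng.uniform(0.0, 1.0, 10_000)        # hypothetical forecast probabilities
y = rng.uniform(0.0, 1.0, 10_000) < p    # events occur with probability p

# Assign each forecast to one of 10 equal-width probability bins.
bin_ids = np.digitize(p, np.linspace(0.0, 1.0, 11)[1:-1])
for b in range(10):
    mask = bin_ids == b
    print(f"bin {b}: mean forecast {p[mask].mean():.2f}, "
          f"observed frequency {y[mask].mean():.2f}")
```

For a well-calibrated model the two columns track each other closely; plotted against one another, the points would fall near the 45-degree line.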
As we proceed with developing the system, we will follow the guidelines of Colaresi and Mahmood (2017), who suggest an iterative loop whereby model representations are built from domain knowledge, their parameters estimated, their performance critiqued, and the successes and particularly the failures of previous models used to inform a new generation of model representations. Crucial to this machine learning-inspired workflow are visual tools, such as model criticism and biseparation plots, that allow researchers to inspect patterns captured by some models and ensembles but missed by others. We will also expand on these tools by examining mistakes in their geographic context.
ViEWS will also work to develop and adapt a number of other performance metrics, for example a domain-specific evaluation measure based on differential classification rewards and misclassification costs, and classification of predictions depending on their "distance" in time and space.
- Colaresi, M., & Mahmood, Z. (2017). Do the Robot: Lessons From Machine Learning to Improve Conflict Forecasting. Journal of Peace Research, 54(2), 193-214.