17 Appendix
Model Details:
The predictive model underlying the forecasts is preliminary and under collaborative development. Its predictive ability is limited by the variables supplied (i.e., the quantity and quality of the data used) and by the simplifying assumptions applied. As indicated above, the current iteration of the model relies exclusively on count data from the NSF and ACS relating to race/ethnicity, gender, and standard occupational class (SOC). Given the small number of variables, we use an extremely simple linear regression model, formulated as follows:
Where Y is the modeled count of individuals in each year, i; G is the reported gender; SOC is the reported standard occupational class; RE is the reported race/ethnicity; and t is an indexed time parameter that allows for annual fluctuations (serving as a random "year" parameter). The parameter M is a constant magnitude applied to all counts for a year. It serves as an artificial weighting parameter that simulates the efficacy of a particular event intended to modify the composition of the workforce. We assume the counts follow a Poisson distribution, as is common for count data.
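Read together, these definitions are consistent with a Poisson regression of roughly the following form. This is a sketch only: the log link, the coefficient symbols β, and the treatment of G, SOC, and RE as coded categorical predictors are assumptions, and the sketch omits M because the description does not fix how it enters the model.

```latex
% Assumed form: Poisson counts with a log link (the link function is an assumption)
Y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_G\, G + \beta_{SOC}\, SOC + \beta_{RE}\, RE + t_i
```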
We fitted this model 1000 times, each time to a random sample of 80% of the available data, and each time with the magnitude parameter M drawn at random between 0.01 and 0.5 (i.e., simulating 1000 events with between 1% and 50% effectiveness). This allowed us to capture a reasonable breadth of possible outcomes and so to set expectations regarding the potential influence of a given action on the NSF workforce composition.
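The resampling design described here can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the forecasts: the function name `simulate_events`, the toy `records`, the fixed seed, and the uniform draw for the magnitude are all assumptions, and the model-fitting step itself is left as a comment.

```python
import random

def simulate_events(records, n_sims=1000, frac=0.8, lo=0.01, hi=0.5, seed=0):
    """Pair each simulated event magnitude with a random subset of the data."""
    rng = random.Random(seed)           # fixed seed (assumption) for repeatability
    k = round(frac * len(records))      # size of each 80% subset
    runs = []
    for _ in range(n_sims):
        magnitude = rng.uniform(lo, hi)   # assumed uniform draw between 1% and 50%
        subset = rng.sample(records, k)   # sample without replacement
        # The Poisson model would be fitted to `subset` with `magnitude` here.
        runs.append((magnitude, subset))
    return runs

# Hypothetical toy records: (year, count) pairs, not real NSF or ACS data
records = [(2010 + i, 100 + 5 * i) for i in range(10)]
runs = simulate_events(records)
```

Each element of `runs` holds one magnitude and the 80% subset it would be applied to, so the spread of fitted outcomes across the 1000 runs can be summarized afterwards.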
We evaluated the fit of the models by assessing how closely the relationship between the deviance of the predictions (the difference between predictions and observations) and the degrees of freedom of the model follows a χ² (chi-square) distribution. The former portion of this was formulated as
where D is the deviance, n is the number of data points, and df is the degrees of freedom. Degrees of freedom refers to the difference between the number of data points used for prediction and the number of parameters in the model. This metric exceeded 0.95 for all iterations of the model (i.e., for each simulation), indicating a reasonable fit of the model to the data. This is to be expected for such a simple model with few parameters (we are essentially fitting lines of best fit through the data, with some annual variation added).
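As an illustration of the two quantities involved, the sketch below computes the standard Poisson deviance and the residual degrees of freedom for a set of fitted counts. It does not reproduce the fit metric reported above; the function names and the toy numbers are hypothetical.

```python
import math

def poisson_deviance(observed, fitted):
    """Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu))."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        # The y * log(y / mu) term is taken as 0 when the observed count is 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total

def residual_df(n_points, n_params):
    """Degrees of freedom: data points minus fitted parameters."""
    return n_points - n_params

# Hypothetical observed and fitted counts for a two-parameter model
observed = [12, 15, 9, 20]
fitted = [11.5, 14.0, 10.2, 19.8]
D = poisson_deviance(observed, fitted)
df = residual_df(len(observed), 2)
```

For a well-fitting Poisson model, D is approximately χ²-distributed with df degrees of freedom, so one common check (assuming SciPy is available) is the upper-tail probability `scipy.stats.chi2.sf(D, df)`.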