2.2 Model Specification
2.2.7 Fitting
The time to fit an ANN (of fixed specification) scales linearly with the number of training observations. In text analysis, we often wish to train on a large number of documents, e.g., on many thousands to many millions of speeches, sentences, tweets or comments. The loss functions, which sum over all observations in the training data, can become computationally expensive with large sets of data. To mitigate this cost, we often prefer stochastic gradient descent (SGD). The SGD algorithm partitions the data into ‘mini batches’ and evaluates the loss for only one mini batch per iteration. Setting the mini batch size too large reduces the computational efficiency gains over regular gradient descent. Setting the mini batch size too small can render the algorithm unstable, resulting in a non-smooth reduction in training loss across iterations. Typical values for mini batch size vary between 30 and 200, but this parameter too requires tuning during training.
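The mini-batch idea can be sketched for the simplest possible network, a single sigmoid node (that is, a logistic regression). This is a minimal illustration under stated assumptions, not the fitting routine used in this chapter: the function names, the learning rate, the number of epochs, the batch size of 50, and the toy data are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(X, y, w):
    """Mean binary cross-entropy of a one-node sigmoid model."""
    p_hat = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def minibatch_sgd(X, y, batch_size=50, epochs=20, lr=0.1, seed=0):
    """Fit the weights with mini-batch SGD (hypothetical hyperparameters)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(epochs):
        # Shuffle the rows, then partition them into mini batches.
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p_hat = sigmoid(X[idx] @ w)
            # Gradient of the mean cross-entropy on this mini batch only.
            grad = X[idx].T @ (p_hat - y[idx]) / len(idx)
            w -= lr * grad
    return w

# Toy data (assumed for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = rng.binomial(1, sigmoid(X @ true_w))

w_hat = minibatch_sgd(X, y)
```

Each update touches only `batch_size` rows, so its cost does not grow with the size of the training set; the price is a noisier gradient, which is why very small batches can make the loss path erratic.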
2.3 Simulation Example
At this point, we have discussed ANN models, and the parameters that need to be decided by the user to fit such models. To ground the theory, I present a short simulation study mapping a set of input variables onto a binary output. In this section, I compare a traditional logistic regression to a variety of neural network model specifications. For the purpose of comparison, I simulate data from two data generating processes: one simple DGP, and one with high levels of interactivity between inputs and the output of interest. Specifically, for the straightforward DGP, I simulate N = 500 observations, with y ∈ {0, 1} and five input variables X1, X2, ..., X5 from the following process:
$$
X \sim U(-10, 10), \qquad \beta \sim U(-1, 1), \qquad y_i \sim \mathrm{Bernoulli}\big(\sigma(X_i \beta)\big) \tag{2.20}
$$
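A draw from the simple DGP in Equation 2.20 might be sketched as follows. The function name, the seeds, and the choice to reuse the same coefficient vector when drawing the out-of-sample observations are assumptions made for illustration; the original simulation code is not shown in the text.

```python
import numpy as np

def simulate_simple_dgp(n=500, k=5, beta=None, seed=0):
    """Draw (X, y) from the simple logistic DGP; names are hypothetical."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-10, 10, size=(n, k))      # X ~ U(-10, 10)
    if beta is None:
        beta = rng.uniform(-1, 1, size=k)      # beta ~ U(-1, 1)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))   # sigma(X_i beta)
    y = rng.binomial(1, prob)                  # y_i ~ Bernoulli(prob)
    return X, y, beta

# Training data, then new observations that share the same beta
# (ASSUMPTION: the test set comes from the same coefficients).
X_train, y_train, beta = simulate_simple_dgp(seed=1)
X_test, y_test, _ = simulate_simple_dgp(beta=beta, seed=2)
```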
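I also specify an interactive DGP, which allows for a more complex mapping from inputs to the output.
$$
\beta, \gamma \sim N(0, 0.02), \qquad y_i \sim \mathrm{Bernoulli}\big(\sigma(X_i \beta + \gamma_1 X_1^2 + \gamma_2 X_1 X_2 + \gamma_3 X_1 X_2 X_3 + \gamma_4 X_4 X_5 + \gamma_5 X_4 X_5 X_6)\big) \tag{2.21}
$$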
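The interactive DGP can be sketched in the same way, with several loudly flagged assumptions. The equation refers to a sixth input, so the sketch draws six columns; it treats 0.02 as a variance rather than a standard deviation; and it reuses the U(−10, 10) input distribution from the simple DGP. None of these choices is confirmed by the text, and the function names are hypothetical.

```python
import numpy as np

def stable_sigmoid(z):
    # Numerically stable logistic function: exp(-log(1 + exp(-z))).
    return np.exp(-np.logaddexp(0.0, -z))

def simulate_interactive_dgp(n=500, seed=0):
    """Draw (X, y) from the interactive DGP; names are hypothetical."""
    rng = np.random.default_rng(seed)
    k = 6                      # ASSUMPTION: six inputs, so X6 is defined
    sd = np.sqrt(0.02)         # ASSUMPTION: 0.02 is a variance
    X = rng.uniform(-10, 10, size=(n, k))  # ASSUMPTION: same inputs as simple DGP
    beta = rng.normal(0.0, sd, size=k)
    gamma = rng.normal(0.0, sd, size=5)
    x1, x2, x3, x4, x5, x6 = X.T
    # The five nonlinear terms, in the order gamma_1 ... gamma_5.
    terms = np.column_stack([x1**2, x1 * x2, x1 * x2 * x3,
                             x4 * x5, x4 * x5 * x6])
    z = X @ beta + terms @ gamma
    y = rng.binomial(1, stable_sigmoid(z))
    return X, y, beta, gamma

X_int, y_int, beta_int, gamma_int = simulate_interactive_dgp(seed=3)
```

The stable sigmoid matters here because the three-way products can reach magnitudes near 1,000, which would overflow a naive `exp` for unlucky coefficient draws.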
To assess in-sample model fit, I consider the percentage of outputs correctly classified, the cross-entropy error (see Equation 2.4), and the mean squared error. As mentioned previously, ANN models can easily overfit to data, so I evaluate each model both on the data used to train it and on out-of-sample predictions for 500 new observations.
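The three fit measures can be computed from a vector of predicted probabilities. The sketch below assumes a 0.5 classification threshold, a cross-entropy averaged (rather than summed) over observations, and a mean squared error taken between the outcome and the predicted probability; these conventions, like the function name, are assumptions and may differ from Equation 2.4.

```python
import numpy as np

def fit_metrics(y, p_hat, threshold=0.5, eps=1e-12):
    """Percent correct, mean cross-entropy, and MSE (assumed conventions)."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities away from 0 and 1 so the logs stay finite.
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    pct_correct = 100.0 * np.mean((p >= threshold) == (y == 1))
    cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    mse = np.mean((y - p) ** 2)
    return {"pct_correct": pct_correct,
            "cross_entropy": cross_entropy,
            "mse": mse}

# Toy example: three of the four cases fall on the correct side of 0.5.
m = fit_metrics([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1])
```

Calling the same function on training and test predictions gives the in-sample and out-of-sample versions of each measure.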
Results appear in Table 2.2. There are several results to unpack. First, note the equivalence in the results between the logistic regression model and the one-layer, one-node ANN. As I described earlier, the two models are functionally equivalent, and the only difference in the results arises from the different optimization routines. Second, referring to the top panel of the table, note that additional ANN model complexity increases the predictive accuracy on the training data, but actually decreases accuracy in the testing (out-of-sample) data. This is a classic case of overfitting. No ANN model outperforms the logistic regression on the test set when the DGP matches the canonical logistic regression.
Moving to the lower panel of the table, however, we see the advantage of the neural network model. With a more complex DGP, the logistic regression model naturally performs worse, only correctly predicting about 66 percent of training outputs and 56 percent of test outputs. Adding additional nodes to the networks dramatically improves the training predictive accuracy and lowers the mean square error on the training set, just as with the simple DGP. Unlike earlier, though, the ANNs substantially outperform the logistic regression model on the test set. The logistic regression model correctly predicts about 56 percent of test cases, with a mean square error of 0.25. Compare that to a one-layer, ten-node network that correctly predicts 77 percent of test cases, with mean square error 0.18. As before, we also see some evidence of overfitting with additional model complexity. Comparing the training and test performance, a more complicated network specification of three layers with ten nodes each predicts 97 percent of the training cases, but does not outperform the one-layer, ten-node network on the test set.
We might finally want to compare the neural network model to the correctly specified logistic regression model. If we include all appropriate quadratics and interactions in the logistic regression, the share of test cases correctly predicted jumps to 79 percent, with mean squared error 0.21. In other words, a logistic regression model with the correct functional form outperforms even the best neural network model.
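To make "correctly specified" concrete, the design matrix for such a logistic regression would carry the five nonlinear terms of the interactive DGP alongside the raw inputs. The helper below is a hypothetical sketch of that expansion only; it assumes six input columns (since the DGP refers to a sixth input), and the resulting matrix could be passed to any logistic regression routine.

```python
import numpy as np

def expand_interactive_terms(X):
    """Append the DGP's quadratic and interaction terms (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] < 6:
        # ASSUMPTION: six inputs are needed because the DGP uses X6.
        raise ValueError("expected a 2-D array with at least six columns")
    x1, x2, x3, x4, x5, x6 = (X[:, j] for j in range(6))
    extra = np.column_stack([
        x1**2,          # quadratic in X1
        x1 * x2,        # two-way interaction
        x1 * x2 * x3,   # three-way interaction
        x4 * x5,        # two-way interaction
        x4 * x5 * x6,   # three-way interaction
    ])
    return np.hstack([X, extra])

# Small illustrative input: two rows, six columns.
X_demo = np.arange(12.0).reshape(2, 6)
Z_demo = expand_interactive_terms(X_demo)
```

The expansion is only possible because the true functional form is known here; the neural networks reach comparable accuracy without being told which terms matter.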
Seen from that perspective, the neural network results do not appear impressive. But readers should keep in mind that the ANN models are automatically learning patterns in the data. To repeat an earlier point, ANNs are not optimal tools for building explanatory models. The results from the ANN are largely uninterpretable, since inputs are weighted at potentially multiple stages and in multiple combinations. The advantage of ANNs rests largely in their ability to approximate arbitrarily complex functions. In applications like text analysis, complex patterns exist in the data. These patterns are largely unknown a priori, making the automated learning ability of ANNs appealing.