When we test a theory using data, it is common to focus on correctness: do the predictions of the theory match what we see in the data? But we also care about a property we might call completeness: how much of the predictable variation in the data does the theory capture? This question is difficult to answer because, in general, we do not know how much “predictable variation” there is in the problem. This paper proposes using machine learning algorithms to construct a benchmark for the best attainable level of prediction. We illustrate this approach on the problem of predicting human generation of random sequences. Relative to an atheoretical machine learning benchmark, we find that existing behavioral models explain roughly 10% to 30% of the predictable variation in this problem. This fraction is robust across several datasets, suggesting that (1) there is a significant amount of structure in this problem that our models have yet to capture and (2) machine learning may provide a generally viable approach to testing theory completeness.
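
To make the notion of completeness concrete, one possible formalization is sketched here. It is an assumption on our part for illustration, not a definition stated in the text above: we suppose that prediction quality is measured by an error (lower is better), that a naive baseline such as guessing at chance marks zero captured variation, and that the machine learning benchmark marks the best attainable prediction. The function name, its parameters, and the numbers in the example are all hypothetical.

```python
def completeness(naive_error: float, theory_error: float, benchmark_error: float) -> float:
    """Share of the achievable error reduction that a theory attains.

    Assumed formalization (not taken from the source text):
        (naive_error - theory_error) / (naive_error - benchmark_error)
    A value of 0 means the theory does no better than the naive baseline;
    a value of 1 means it matches the machine learning benchmark.
    """
    achievable = naive_error - benchmark_error
    if achievable <= 0:
        # Without a gap between baseline and benchmark the ratio is undefined.
        raise ValueError("benchmark_error must be lower than naive_error")
    return (naive_error - theory_error) / achievable


# Made-up error values, for illustration only (not results from any dataset).
example = completeness(naive_error=0.25, theory_error=0.24, benchmark_error=0.20)
print(f"completeness = {example:.2f}")
```

With these illustrative inputs the theory closes about one fifth of the gap between the naive baseline and the benchmark. Under this sketch, a theory can be correct in the sense of beating the baseline and still be far from complete.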