It is common to hear that certain natural language processing (NLP) tasks have been "solved". These claims are often misconstrued as being about general human capabilities (e.g., to answer questions, to reason with language), but they are actually always about how systems performed on narrowly defined evaluations. Recently, adversarial testing methods have begun to expose just how narrow many of these successes are. This is extremely productive, but we should insist that these evaluations be *fair*: has the model been shown data sufficient to support the kind of generalization we are asking of it? Unless we can say "yes" with complete certainty, we can't be sure whether a failed evaluation traces to a model limitation or to a data limitation that no model could overcome. In this talk, I will present a formally precise, widely applicable notion of fairness in this sense. I will then apply these ideas to natural language inference by constructing challenging but provably fair artificial datasets and showing that standard neural models fail to generalize in the required ways; only task-specific models are able to achieve high performance, and even these models do not solve the task perfectly. I'll close with a discussion of what properties I suspect general-purpose architectures will need in order to truly solve deep semantic tasks.
(joint work with Atticus Geiger, Stanford Linguistics)

Christopher Potts is Professor of Linguistics and, by courtesy, of Computer Science at Stanford, and Director of the Center for the Study of Language and Information (CSLI). In his research, he develops computational models of linguistic reasoning, emotional expression, and dialogue. He is the author of the 2005 book The Logic of Conventional Implicatures as well as numerous scholarly papers in linguistics and natural language processing.