Rotem Dror: Experiment Design and Evaluation of Empirical Models in Natural Language Processing
The research field of Natural Language Processing (NLP) places strong emphasis on empirical results. Models seem to reach state-of-the-art and even "super-human" performance on language understanding tasks almost daily, thanks to large datasets and powerful models with billions of parameters. However, existing evaluation methodologies lack rigor, leaving the field susceptible to erroneous claims. In this talk, I will describe efforts to build a solid framework for evaluation and experiment design for diverse NLP tasks.