Authors: Benito Santos, Alejandro; Ghajari Espinosa, Adrián; Fresno Fernández, Víctor Diego
Dates: 2025-12-03; 2025-12-03; 2025-01-01
Citation: Alejandro Benito-Santos, Adrian Ghajari, and Víctor Fresno. 2025. Robust Estimation of Population-Level Effects in Repeated-Measures NLP Experimental Designs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33076–33089, Vienna, Austria. Association for Computational Linguistics.
ISSN: 0736-587X
DOI: https://doi.org/10.18653/v1/2025.acl-long.1586
Handle: https://hdl.handle.net/20.500.14468/30995
Note: The registered version of this conference paper, first published in "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33076–33089, Vienna, Austria", is available online at the publisher's website: Association for Computational Linguistics, https://doi.org/10.18653/v1/2025.acl-long.1586
Abstract: NLP research frequently grapples with multiple sources of variability (spanning runs, datasets, annotators, and more), yet conventional analysis methods often neglect these hierarchical structures, threatening the reproducibility of findings. To address this gap, we contribute a case study illustrating how linear mixed-effects models (LMMs) can rigorously capture systematic language-dependent differences (i.e., population-level effects) in a population of monolingual and multilingual language models.
In the context of a bilingual hate speech detection task, we demonstrate that LMMs can uncover significant population-level effects, even under low-resource (small-N) experimental designs, while mitigating confounds and random noise. By setting out a transparent blueprint for repeated-measures experimentation, we encourage the NLP community to embrace variability as a feature, rather than a nuisance, in order to advance more robust, reproducible, and ultimately trustworthy results.
Language: en
Rights: info:eu-repo/semantics/openAccess
Subjects: 1203.23 Programming languages; 1203.07 Causal models
Title: Robust Estimation of Population-Level Effects in Repeated-Measures NLP Experimental Designs
Type: conference proceedings