Publication: Transformers BERT para Question-Answering sobre COVID-19 (BERT Transformers for Question Answering on COVID-19)
Date
2021-09-01
Access rights
Attribution-NonCommercial-NoDerivatives 4.0 International
info:eu-repo/semantics/openAccess
Publisher
Universidad Nacional de Educación a Distancia (Spain). Escuela Técnica Superior de Ingeniería Informática. Departamento de Lenguajes y Sistemas Informáticos
Abstract
Information overload caused by the pace at which scientific articles are published calls for question-answering systems that provide efficient access to knowledge and adapt their answers to the type of user. Developing such systems requires training and evaluation datasets annotated by experts. However, existing machine reading comprehension datasets in specialized knowledge areas such as biomedicine are not large enough for supervised learning: BioASQ v9b, a biomedical dataset, contains 3,742 questions; COVID-QA-2019 (Möller et al., 2020) contains 2,019 question-article-answer triples; COVID-QA-147 (Tang et al., 2020) contains 147 triples; and COVID-QA-111 (Lee et al., 2020) contains 111 triples. By contrast, version 2 of the Stanford Question Answering Dataset, SQuAD v2 (Rajpurkar et al., 2018), a general-domain reading comprehension dataset created by crowdworkers from Wikipedia articles, comprises 130,319 training samples, 11,873 validation samples and 8,862 test samples. One way to address the lack of sufficiently large domain-specific question-answering datasets is to pretrain and fine-tune the language representations on large general-domain corpora such as SQuAD and then apply and evaluate them in the specific domain. This work studies the performance of BERT (Devlin et al., 2018) and SBERT (Reimers and Gurevych, 2019) models trained on the general-domain corpora SQuAD v2, QuAC (Choi et al., 2018) and MS MARCO (Nguyen et al., 2016) when they are used to obtain answers in the COVID-19 domain from the CORD-19 corpus (Wang et al., 2020).
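To make the size gap described in the abstract concrete, the following minimal sketch compares the SQuAD v2 training split with each domain-specific dataset, using only the counts quoted in the abstract. The helper name `size_ratios` and the dictionary layout are illustrative assumptions, not part of the original work.

```python
# Dataset sizes as quoted in the abstract (questions or triples per dataset).
domain_sets = {
    "BioASQ v9b": 3742,
    "COVID-QA-2019": 2019,
    "COVID-QA-147": 147,
    "COVID-QA-111": 111,
}
squad_v2_train = 130319  # SQuAD v2 training samples, as quoted in the abstract


def size_ratios(reference, others):
    """Return how many times larger `reference` is than each dataset in `others`.

    Hypothetical helper written for this illustration only.
    """
    return {name: reference / count for name, count in others.items()}


ratios = size_ratios(squad_v2_train, domain_sets)
for name, ratio in ratios.items():
    print(f"SQuAD v2 train is roughly {ratio:.0f}x the size of {name}")
```

Even the largest domain-specific dataset listed is more than an order of magnitude smaller than the SQuAD v2 training split, which is the motivation given for training on general-domain corpora first.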
Center
Faculties and schools::E.T.S. de Ingeniería Informática
Department
Lenguajes y Sistemas Informáticos