Publication: Development of a Transformer-based multi-label classification system for eCIE-O-3.1 codes
Date
2021-10-01
Authors
Access rights
Attribution-NonCommercial-NoDerivatives 4.0 International
info:eu-repo/semantics/openAccess
Publisher
Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Lenguajes y Sistemas Informáticos
Abstract
This work proposes different Transformer-based architectures intended to solve a multi-label classification problem for eCIE-O-3.1 morphology codes, which are the codes dedicated to neoplasms. For this purpose, the dataset provided in the CANTEMIST shared task was used; the goal of that task was to present a system capable of performing multi-label classification of eCIE-O-3.1 morphology codes in medical reports. The CANTEMIST data were used to train the proposed models, together with a series of experiments carried out to try to improve the performance of the base models. The base models consist of different BERT models pre-trained on different corpora and languages: some models are specific to Spanish while others are multilingual, and some were pre-trained on medical texts while others were pre-trained on general-domain texts. The medical texts were preprocessed so that the models could be trained on them, since BERT requires a specific type of input data. Finally, an exhaustive evaluation of the classification system was performed on the CANTEMIST test set to determine its performance, and the results were compared with those of the systems submitted by the CANTEMIST participants in 2020. The results show that the proposed procedure is able to perform multi-label classification with good accuracy, although with limitations and problems due to the approach employed.
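The abstract describes fine-tuning pre-trained BERT models for multi-label classification of eCIE-O-3.1 codes, with medical reports preprocessed into the input format BERT expects. The following is a minimal sketch of that kind of setup, assuming the Hugging Face transformers library; the checkpoint name, number of codes, threshold, and example text are illustrative placeholders and are not taken from the thesis itself.

```python
# Minimal sketch (not the thesis code): multi-label classification of
# eCIE-O-3.1 codes with a pre-trained BERT encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative choice: the thesis compares Spanish-specific, multilingual and
# clinical checkpoints; a public multilingual BERT stands in here.
MODEL_NAME = "bert-base-multilingual-cased"
NUM_CODES = 100  # placeholder for the size of the eCIE-O-3.1 label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_CODES,
    problem_type="multi_label_classification",  # sigmoid + BCE-with-logits loss
)

# Preprocessing: BERT needs token ids and attention masks, so the raw clinical
# report is tokenized and truncated/padded to a fixed maximum length.
report = "Texto del informe clínico..."  # placeholder clinical report
inputs = tokenizer(
    report, truncation=True, padding="max_length", max_length=512, return_tensors="pt"
)

# Prediction: one independent probability per code; every code whose
# probability exceeds a threshold is assigned to the report.
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)
predicted_codes = (probs > 0.5).nonzero(as_tuple=True)[1].tolist()
print(predicted_codes)
```

During fine-tuning, passing a multi-hot `labels` tensor to the model would apply the binary cross-entropy loss over all codes at once, which is what makes the setup multi-label rather than multi-class.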
Center
Faculties and schools::E.T.S. de Ingeniería Informática
Department
Lenguajes y Sistemas Informáticos