Person:
Rodrigo Yuste, Álvaro

ORCID

0000-0002-6331-4117

Last name

Rodrigo Yuste

First name

Álvaro

Search results

Showing 1 - 7 of 7

Evaluación de sistemas de búsqueda y validación de respuestas [Evaluation of Question Answering and Answer Validation Systems]
(Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Lenguajes y Sistemas Informáticos, 2010-06-01) Rodrigo Yuste, Álvaro; Peñas Padilla, Anselmo
This thesis proposes a framework for evaluating Answer Validation (AV) modules whose purpose is to improve the results of Question Answering (QA) systems. The motivation for defining this framework comes from analyzing the results of QA evaluations, which reveal the following situations in which incorporating AV modules could improve results:
- The sets of returned answers contain incorrect answers that worsen the results. Removing as many incorrect answers as possible from a set of candidates would improve the results.
- Different QA systems complement each other so that, although they obtain similar results individually, combining them effectively yields better results than any individual system.
- Pipeline processing, typical of the classic architectures used in QA, creates a strong dependency between modules, so errors propagate from one module to the next. Breaking this pipeline would reduce the dependency between modules and thereby improve results.
The first step in defining the evaluation framework is to propose an AV model based on Recognizing Textual Entailment (RTE). To check the validity of this model, a collection of text-hypothesis pairs focused on the AV task is built, following a format similar to that of the RTE Challenges collections. The analysis of this collection confirms the validity of the proposed model and is the starting point for defining the evaluation framework.
The proposed methodology allows the evaluation of AV systems that act in various scenarios within a QA system, and the comparison of their results with other QA systems, in order to check whether using these modules improves performance. In addition, as part of the methodology, several methods are described for building evaluation collections by reusing the human judgments from QA evaluations. The framework was put into practice in an international evaluation task, the Answer Validation Exercise (AVE), which ran for three editions within the Cross Language Evaluation Forum (CLEF). The experience gained over the three editions of the task served to refine the methodology into its final version, which is available to the scientific community, together with the evaluation resources generated, for evaluating future AV systems. The results obtained by the systems participating in the AVE campaigns show that using AV modules would improve QA results along the three lines observed when analyzing the evaluations of QA systems (removing incorrect candidate answers, combining different QA systems, and breaking the pipeline processing of a QA system). In fact, these observations have led some QA systems to incorporate AV modules, and those systems have improved their results as a consequence. Moreover, most of these systems used the RTE-based model presented in this thesis, which demonstrates its validity and usefulness in real settings.
Finally, this thesis observes that AV modules could also be useful in QA scenarios where it is better not to answer a question than to answer it incorrectly, as could happen, for example, in medical diagnosis. However, QA evaluations have not paid special attention to this kind of scenario. For this reason, this thesis proposes a new measure for evaluating QA systems that rewards systems that maintain the number of correctly answered questions and reduce the number of incorrect answers by leaving questions unanswered. Tests conducted on this measure have shown its effectiveness in detecting the best approaches for this kind of scenario, compared with other typical QA evaluation measures.
A simple measure to assess non-response
(2011-06-19) Peñas Padilla, Anselmo; Rodrigo Yuste, Álvaro
There are several tasks where not responding is preferable to responding incorrectly. This idea is not new, but despite several previous attempts there is no commonly accepted measure to assess non-response. We study here an extension of the accuracy measure with this feature and an easy-to-understand interpretation. The proposed measure (c@1) has a good balance of discrimination power, stability and sensitivity. We also show how this measure is able to reward systems that maintain the same number of correct answers while decreasing the number of incorrect ones by leaving some questions unanswered. This measure is well suited for tasks such as Reading Comprehension tests, where multiple choices per question are given, but only one is correct.
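The abstract does not spell out the formula. As a minimal sketch, c@1 is usually stated as (n_R + n_U * n_R / n) / n, where n_R is the number of correctly answered questions, n_U the number of questions left unanswered, and n the total number of questions. The function and parameter names below are illustrative assumptions, not taken from this record.

```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """Compute c@1 = (n_R + n_U * n_R / n) / n.

    Parameter names are hypothetical; they stand for n_R, n_U and n.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct < 0 or n_unanswered < 0:
        raise ValueError("counts must be non-negative")
    if n_correct + n_unanswered > n_total:
        raise ValueError("correct + unanswered cannot exceed the total")

    # Each unanswered question is credited with the system's accuracy
    # measured over the whole question set.
    accuracy = n_correct / n_total
    return (n_correct + n_unanswered * accuracy) / n_total


# Two hypothetical systems with the same number of correct answers:
# one answers everything, the other leaves 20 questions unanswered
# instead of answering them incorrectly.
answers_all = c_at_1(n_correct=60, n_unanswered=0, n_total=100)
abstains = c_at_1(n_correct=60, n_unanswered=20, n_total=100)
print(answers_all, abstains)
```

With no unanswered questions the value reduces to plain accuracy, while the second system scores higher because it trades incorrect answers for non-responses, which is the behaviour the abstract describes.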
Evaluating Multilingual Question Answering Systems at CLEF
(2010-05-17) Forner, Pamela; Giampiccolo, Danilo; Magnini, Bernardo; Sutcliffe, Richard; Peñas Padilla, Anselmo; Rodrigo Yuste, Álvaro
The paper offers an overview of the key issues raised during the seven years’ activity of the Multilingual Question Answering Track at the Cross Language Evaluation Forum (CLEF). The general aim of the Multilingual Question Answering Track has been to test both monolingual and cross-language Question Answering (QA) systems that process queries and documents in several European languages, also drawing attention to a number of challenging issues for research in multilingual QA. The paper gives a brief description of how the task has evolved over the years and of the way in which the data sets have been created, presenting also a brief summary of the different types of questions developed. The document collections adopted in the competitions are sketched as well, and some data about the participation are provided. Moreover, the main evaluation measures used to evaluate system performances are explained and an overall analysis of the results achieved is presented.
Temporally anchored relation extraction
(2012-12-08) Garrido, Guillermo; Cabaleiro, Bernardo; Peñas Padilla, Anselmo; Rodrigo Yuste, Álvaro
Although much work on relation extraction has aimed at obtaining static facts, many of the target relations are actually fluents, as their validity is naturally anchored to a certain time period. This paper proposes a methodological approach to temporally anchored relation extraction. Our proposal performs distant supervised learning to extract a set of relations from a natural language corpus, and anchors each of them to an interval of temporal validity, aggregating evidence from documents supporting the relation. We use a rich graph-based document-level representation to generate novel features for this task. Results show that our implementation for temporal anchoring is able to achieve 69% of the upper-bound performance imposed by the relation extraction step. Compared to the state of the art, the overall system achieves the highest precision reported.
Together we can do it! A roadmap to effectively tackle propaganda-related tasks
(Emerald, 2024) Rodríguez García, Raquel; Centeno Sánchez, Roberto; Rodrigo Yuste, Álvaro
Purpose: In this paper, we address the need to study automatic propaganda detection to establish a course of action when faced with such a complex task. Although many isolated tasks have been proposed, a roadmap on how to best approach a new task from the perspective of text formality or the leverage of existing resources has not been explored yet.
Design/methodology/approach: We present a comprehensive study using several datasets on textual propaganda and different techniques to tackle it. We explore diverse collections with varied characteristics and analyze methodologies ranging from classic machine learning algorithms to multi-task learning (MTL), which makes use of the available data in such models.
Findings: Our results show that transformer-based approaches are the best option with high-quality collections, and emotionally enriched inputs improve the results for Twitter content. Additionally, MTL achieves the best results in two of the five scenarios we analyzed. Notably, in one of the scenarios, the model achieves an F1 score of 0.78, significantly surpassing the transformer baseline model’s F1 score of 0.68.
Research limitations/implications: After finding a positive impact when leveraging propaganda’s emotional content, we propose further research into exploiting other complex dimensions, such as moral issues or logical reasoning.
Originality/value: Based on our findings, we provide a roadmap for tackling propaganda-related tasks, depending on the types of training data available and the task to solve. This includes the application of MTL, which has yet to be fully exploited in propaganda detection.
None of the above: comparing scenarios for answerability detection in question answering systems
(Springer, 2025-07-04) Reyes Montesinos, Julio; Rodrigo Yuste, Álvaro; Peñas Padilla, Anselmo
Question Answering (QA) is often used to assess the reasoning capabilities of NLP systems. For a QA system, it is crucial to have the capability to determine answerability: whether the question can be answered with the information at hand. Previous works have studied answerability by including a fixed proportion of unanswerable questions in a collection without explaining the reasons for such proportion or the impact on systems’ results. Furthermore, they do not answer the question of whether systems learn to determine answerability. This work aims to answer that question, providing a systematic analysis of how unanswerable question ratios in training data impact QA systems. To that end, we create a series of versions of the well-known Multiple-Choice QA dataset RACE by modifying different amounts of questions to make them unanswerable, and then train and evaluate several Large Language Models on them. We show that LLMs tend to overfit the distribution of unanswerable questions encountered during training, while the ability to decide on answerability always comes at the expense of finding the answer when it exists. Our experiments also show that a proportion of unanswerable questions around 30% (as found in existing datasets) produces the most discriminating systems. We hope these findings offer useful guidelines for future dataset designers looking to address the problem of answerability.
Study of a Lifelong Learning Scenario for Question Answering
(Elsevier, 2022-12-15) Echegoyen, Guillermo; Rodrigo Yuste, Álvaro; Peñas Padilla, Anselmo
Question Answering (QA) systems have advanced significantly in recent years thanks to the development of neural architectures that employ large pre-trained models such as BERT. However, once the QA model is fine-tuned for a task (e.g., a particular type of question over a particular domain), system performance drops when new tasks are added over time (e.g., new types of questions or new domains). The system therefore requires retraining but, since the data distribution has shifted away from the previous learning, performance on previous tasks drops significantly. Hence, we need strategies to make our systems resistant to the passage of time. Lifelong Learning (LL) aims to study how systems can take advantage of previous learning and acquired knowledge to maintain or improve performance over time. In this article, we explore a scenario where the same LL-based QA system undergoes several shifts in the data distribution over time, represented as the addition of new, different QA datasets. In this setup, the following research questions arise: (i) How can LL-based QA systems benefit from previously learned tasks? (ii) Is there any strategy general enough to maintain or improve performance over time when new tasks are added? and finally, (iii) How can we detect a lack of knowledge that impedes the answering of questions and must trigger a new learning process? To answer these questions, we systematically try all possible training sequences over three well-known QA datasets. Our results show that the learning of a new dataset is sensitive to previous training sequences and that we can find a strategy general enough to avoid the combinatorial explosion of testing all possible training sequences. Thus, when a new dataset is added to the system, the best way to retrain the system without dropping performance on the previous datasets is to randomly merge the new training material with the previous one.
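The retraining strategy named at the end of this abstract (randomly merging new training material with the previous one) can be illustrated with a short sketch. This is only an assumption about what such a merge could look like, not the authors' implementation; the function name, the seed parameter and the toy examples are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

Example = TypeVar("Example")


def merge_for_retraining(
    previous: Sequence[Example],
    new: Sequence[Example],
    seed: int = 0,
) -> List[Example]:
    """Return previous and new training examples in one shuffled list.

    Hypothetical helper: it interleaves old and new examples at random
    so that retraining sees both distributions instead of only the new one.
    """
    merged = list(previous) + list(new)
    # A local Random instance keeps the shuffle reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(merged)
    return merged


# Toy usage with placeholder (dataset, question id) pairs.
old_examples = [("dataset_a", i) for i in range(3)]
new_examples = [("dataset_b", i) for i in range(2)]
training_set = merge_for_retraining(old_examples, new_examples, seed=42)
print(training_set)
```

The merged list would then be fed to whatever fine-tuning loop the system already uses; no examples are dropped or duplicated by the merge itself.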
