Persona: Araujo Serna, M. Lourdes
Cargando...
Dirección de correo electrónico
ORCID
0000-0002-7657-4794
Fecha de nacimiento
Proyectos de investigación
Unidades organizativas
Puesto de trabajo
Apellidos
Araujo Serna
Nombre de pila
M. Lourdes
Nombre
19 resultados
Resultados de la búsqueda
Mostrando 1 - 10 de 19
Publicación Semi‑supervised incremental learning with few examples for discovering medical association rules(BioMed Central, 2022) Sánchez‑de‑Madariaga, Ricardo; Cantero Escribano, José Miguel; Martínez Romo, Juan; Araujo Serna, M. LourdesBackground: Association Rules are one of the main ways to represent structural patterns underlying raw data. They represent dependencies between sets of observations contained in the data. The associations established by these rules are very useful in the medical domain, for example in the predictive health field. Classic algorithms for association rule mining give rise to huge amounts of possible rules that should be filtered in order to select those most likely to be true. Most of the proposed techniques for these tasks are unsupervised. However, the accuracy provided by unsupervised systems is limited. Conversely, resorting to annotated data for training supervised systems is expensive and time‑consuming. The purpose of this research is to design a new semi‑supervised algorithm that performs like supervised algorithms but uses an affordable amount of training data. Methods: In this work we propose a new semi‑supervised data mining model that combines unsupervised techniques (Fisher’s exact test) with limited supervision. Starting with a small seed of annotated data, the model improves results (F‑measure) obtained, using a fully supervised system (standard supervised ML algorithms). The idea is based on utilising the agreement between the predictions of the supervised system and those of the unsupervised techniques in a series of iterative steps. Results: The new semi‑supervised ML algorithm improves the results of supervised algorithms computed using the F‑measure in the task of mining medical association rules, but training with an affordable amount of manually annotated data. Conclusions: Using a small amount of annotated data (which is easily achievable) leads to results similar to those of a supervised system. The proposal may be an important step for the practical development of techniques for mining association rules and generating new valuable scientific medical knowledge.Publicación Disentangling categorical relationships through a graph of co-occurrences(American Physical Society, 2011-10-19) Borge Holthoefer, Javier; Arenas, Alex; Capitán, José A.; Cuesta, José A.; Martínez Romo, Juan; Araujo Serna, M. LourdesThe mesoscopic structure of complex networks has proven a powerful level of description to understand the linchpins of the system represented by the network. Nevertheless, themapping of a series of relationships between elements, in terms of a graph, is sometimes not straightforward. Given that all the information we would extract using complex network tools depend on this initial graph, it is mandatory to preprocess the data to build it on in the most accurate manner. Here we propose a procedure to build a network, attending only to statistically significant relations between constituents. We use a paradigmatic example of word associations to show the development of our approach. Analyzing the modular structure of the obtained network we are able to disentangle categorical relations, disambiguating words with success that is comparable to the best algorithms designed to the same end.Publicación Automatic detection of trends in time-stamped sequences : an evolutionary approach(Springer-Verlag, 2009-01-14) Merelo, Juan Julián; Araujo Serna, M. LourdesThis paper presents an evolutionary algorithm for modeling the arrival dates in time-stamped data sequences such as newscasts, e-mails, IRC conversations, scientific journal articles or weblog postings. These models are applied to the detection of buzz (i.e. terms that occur with a higher-than-normal frequency) in them, which has attracted a lot of interest in the online world with the increasing number of periodic content producers. That is why in this paper we have used this kind of online sequences to test our system, though it is also valid for other types of event sequences. The algorithm assigns frequencies (number of events per time unit) to time intervals so that it produces an optimal fit to the data. The optimization procedure is a trade off between accurately fitting the data and avoiding too many frequency changes, thus overcoming the noise inherent in these sequences. This process has been traditionally performed using dynamic programming algorithms, which are limited by memory and efficiency requirements. This limitation can be a problem when dealing with long sequences, and suggests the application of alternative search methods with some degree of uncertainty to achieve tractability, such as the evolutionary algorithm proposed in this paper. This algorithm is able to reach the same solution quality as those classical dynamic programming algorithms, but in a shorter time. We also test different cost functions and propose a new one that yields better fits than the one originally proposed by Kleinberg on real-world data. Finally, several distributions of states for the finite state automata are tested, with the result that an uniform distribution produces much better fits than the geometric distribution also proposed by Kleinberg. We also present a variant of the evolutionary algorithm, which achieves a fast fit of a sequence extended with new data, by taking advantage of the fit obtained for the original subsequence.Publicación Generation of social network user profiles and their relationship with suicidal behaviour(Sociedad Española para el Procesamiento del Lenguaje Natural, 2024) Fernández Hernández, Jorge; Araujo Serna, M. Lourdes; Martínez Romo, JuanActualmente el suicidio es una de las principales causas de muerte en el mundo, por lo que poder caracterizar a personas con esta tendencia puede ayudar a prevenir posibles intentos de suicidio. En este trabajo se ha recopilado un corpus, llamado SuicidAttempt en español compuesto por usuarios con o sin menciones explícitas de intentos de suicidio, usando la aplicación de mensajería Telegram. Para cada uno de los usuarios se han anotado distintos rasgos demográficos de manera semi-automática mediante el empleo de distintos sistemas, en unos casos supervisados y en otros no supervisados. Por último se han analizado estos rasgos recogidos, junto con otros lingüísticos extraídos de los mensajes de los usuarios, para intentar caracterizar distintos grupos en base a su relación con el comportamiento suicida. Los resultados sugieren que la detección de estos rasgos demográficos y psicolingüísticos permiten caracterizar determinados grupos de riesgo y conocer en profundidad los perfiles que realizan dichos actos.Publicación Identifying patterns for unsupervised grammar induction(2010-07-15) Santamaría, Jesús; Araujo Serna, M. LourdesThis paper describes a new method for unsupervised grammar induction based on the automatic extraction of certain patterns in the texts. Our starting hypothesis is that there exist some classes of words that function as separators, marking the beginning or the end of new constituents. Among these separators we distinguish those which trigger new levels in the parse tree. If we are able to detect these separators we can follow a very simple procedure to identify the constituents of a sentence by taking the classes of words between separators. This paper is devoted to describe the process that we have followed to automatically identify the set of separators from a corpus only annotated with Part-of-Speech (POS) tags. The proposed approach has allowed us to improve the results of previous proposals when parsing sentences fromtheWall Street Journal corpus.Publicación Building a framework for fake news detection in the health domain(San Francisco CA: Public Library of Science, 2024-07-08) Martinez Rico, Juan R.; Araujo Serna, M. Lourdes; Martínez Romo, Juan; Bongelli, RamonaDisinformation in the medical field is a growing problem that carries a significant risk. Therefore, it is crucial to detect and combat it effectively. In this article, we provide three elements to aid in this fight: 1) a new framework that collects health-related articles from verification entities and facilitates their check-worthiness and fact-checking annotation at the sentence level; 2) a corpus generated using this framework, composed of 10335 sentences annotated in these two concepts and grouped into 327 articles, which we call KEANE (faKe nEws At seNtence lEvel); and 3) a new model for verifying fake news that combines specific identifiers of the medical domain with triplets subject-predicate-object, using Transformers and feedforward neural networks at the sentence level. This model predicts the fact-checking of sentences and evaluates the veracity of the entire article. After training this model on our corpus, we achieved remarkable results in the binary Classification of sentences (check-worthiness F1: 0.749, fact-checking F1: 0.698) and in the final classification of complete articles (F1: 0.703). We also tested its performance against another public dataset and found that it performed better than most systems evaluated on that dataset. Moreover, the corpus we provide differs from other existing corpora in its duality of sentence-article annotation, which can provide an additional level of justification of the prediction of truth or untruth made by the model.Publicación Understanding and Improving Disability Identification in Medical Documents(IEEE, 2020) Fabregat Marcos, Hermenegildo; Martínez Romo, Juan; Araujo Serna, M. LourdesDisabilities are a problem that affects a large number of people in the world. Gathering information about them is crucial to improve the daily life of the people who suffer from them but, since disabilities are often strongly associated with different types of diseases, the available data are widely dispersed. In this work we review existing proposal for the problem, making an in-depth analysis, and from it we make a proposal that improves the results of previous systems. The analysis focuses on the results of the participants in DIANN shared task was proposed (IberEval 2018), devoted to the detection of named disabilities in electronic documents. In order to evaluate the proposed systems using a common evaluation framework, a corpus of documents, in both English and Spanish, was gathered and annotated. Several teams participated in the task, either using classic methods or proposing specific approaches to deal effectively with the complexities of the task. Our aim is to provide insight for future advances in the field by analyzing the participating systems and identifying the most effective approaches and elements to tackle the problem. We have validated the lessons learned from this analysis through a new proposal that includes the most promising elements used by the participating teams. The proposed system improves, for both languages, the results obtained during the task.Publicación RoBERTime: un nuevo modelo para la detección de expresiones temporales en español(Sociedad Española para el Procesamiento del Lenguaje Natural, 2023-03) Sánchez de Castro Fernández, Alejandro; Araujo Serna, M. Lourdes; Martínez Romo, JuanTemporal expressions are all those words that refer to temporality. Their detection or extraction is a complex task, since it depends on the domain of the text, the language and the way they are written. Their study in Spanish and more specifically in the clinical domain is scarce, mainly due to the lack of annotated corpora. In this paper we propose the use of large language models to address the task, comparing the performance of five models of different characteristics. After a process of experimentation and fine tuning, a new model called RoBERTime is created for the detection of temporal expressions in Spanish, especially focused in the clinical domain. This model is publicly available. RoBERTime achieves state-of-the-art results in the E3C and Timebank corpora, being the first public model for the detection of temporal expressions in Spanish specialized in the clinical domain.Publicación Analyzing information retrieval methods to recover broken web links(2011-06-19) Martínez Romo, Juan; Araujo Serna, M. LourdesIn this work we compare different techniques to automatically find candidate web pages to substitute broken links. We extract information from the anchor text, the content of the page containing the link, and the cache page in some digital library.The selected information is processed and submitted to a search engine. We have compared different information retrievalmethods for both, the selection of terms used to construct the queries submitted to the search engine, and the ranking of the candidate pages that it provides, in order to help the user to find the best replacement. In particular, we have used term frequencies, and a language model approach for the selection of terms; and cooccurrence measures and a language model approach for ranking the final results. To test the different methods, we have also defined a methodology which does not require the user judgments, what increases the objectivity of the results.Publicación Deep-Learning Approach to Educational Text Mining and Application to the Analysis of Topics’ Difficulty(Institute of Electrical and Electronics Engineers, 2020-12-02) Araujo Serna, M. Lourdes; López Ostenero, Fernando; Martínez Romo, Juan; Plaza Morales, LauraLearning analytics has emerged as a promising tool for optimizing the learning experience and results, especially in online educational environments. An important challenge in this area is identifying the most difficult topics for students in a subject, which is of great use to improve the quality of teaching by devoting more effort to those topics of greater difficulty, assigning them more time, resources and materials. We have approached the problem by means of natural language processing techniques. In particular, we propose a solution based on a deep learning model that automatically extracts the main topics that are covered in educational documents. This model is next applied to the problem of identifying the most difficult topics for students in a subject related to the study of algorithms and data structures in a Computer Science degree. Our results show that our topic identification model presents very high accuracy (around 90 percent) and may be efficiently used in learning analytics applications, such as the identification and understanding of what makes the learning of a subject difficult. An exhaustive analysis of the case study has also revealed that there are indeed topics that are consistently more difficult for most students, and also that the perception of difficulty in students and teachers does not always coincide with the actual difficulty indicated by the data, preventing to pay adequate attention to the most challenging topics.