Person: Martínez Romo, Juan
ORCID
0000-0002-6905-7051
Surname
Martínez Romo
Given name
Juan
16 results
Search results
Showing 1 - 10 of 16
Publication: Semi-supervised incremental learning with few examples for discovering medical association rules (BioMed Central, 2022)
Sánchez-de-Madariaga, Ricardo; Cantero Escribano, José Miguel; Martínez Romo, Juan; Araujo Serna, M. Lourdes
Background: Association rules are one of the main ways to represent structural patterns underlying raw data. They represent dependencies between sets of observations contained in the data. The associations established by these rules are very useful in the medical domain, for example in the predictive health field. Classic algorithms for association rule mining give rise to huge numbers of possible rules, which must be filtered to select those most likely to be true. Most of the techniques proposed for these tasks are unsupervised. However, the accuracy provided by unsupervised systems is limited. Conversely, resorting to annotated data for training supervised systems is expensive and time-consuming. The purpose of this research is to design a new semi-supervised algorithm that performs like supervised algorithms but uses an affordable amount of training data. Methods: In this work we propose a new semi-supervised data mining model that combines unsupervised techniques (Fisher’s exact test) with limited supervision. Starting with a small seed of annotated data, the model improves on the results (F-measure) obtained using a fully supervised system (standard supervised ML algorithms). The idea is based on utilising the agreement between the predictions of the supervised system and those of the unsupervised techniques in a series of iterative steps. Results: The new semi-supervised ML algorithm improves on the results of supervised algorithms, measured by F-measure, in the task of mining medical association rules, while training with an affordable amount of manually annotated data. Conclusions: Using a small amount of annotated data (which is easily achievable) leads to results similar to those of a supervised system.
The proposal may be an important step for the practical development of techniques for mining association rules and generating new valuable scientific medical knowledge.

Publication: Disentangling categorical relationships through a graph of co-occurrences (American Physical Society, 2011-10-19)
Borge Holthoefer, Javier; Arenas, Alex; Capitán, José A.; Cuesta, José A.; Martínez Romo, Juan; Araujo Serna, M. Lourdes
The mesoscopic structure of complex networks has proven a powerful level of description to understand the linchpins of the system represented by the network. Nevertheless, the mapping of a series of relationships between elements in terms of a graph is sometimes not straightforward. Given that all the information we would extract using complex network tools depends on this initial graph, it is mandatory to preprocess the data to build it in the most accurate manner. Here we propose a procedure to build a network attending only to statistically significant relations between constituents. We use a paradigmatic example of word associations to show the development of our approach. Analyzing the modular structure of the obtained network, we are able to disentangle categorical relations, disambiguating words with a success comparable to that of the best algorithms designed to the same end.

Publication: Generation of social network user profiles and their relationship with suicidal behaviour (Sociedad Española para el Procesamiento del Lenguaje Natural, 2024)
Fernández Hernández, Jorge; Araujo Serna, M. Lourdes; Martínez Romo, Juan
Suicide is currently one of the leading causes of death in the world, so being able to characterize people with this tendency can help prevent possible suicide attempts. In this work we have compiled a Spanish-language corpus, called SuicidAttempt, composed of users with or without explicit mentions of suicide attempts, using the messaging application Telegram.
For each user, various demographic traits were annotated semi-automatically using different systems, some supervised and others unsupervised. Finally, these traits were analysed together with linguistic traits extracted from the users' messages, in order to characterize different groups according to their relationship with suicidal behaviour. The results suggest that detecting these demographic and psycholinguistic traits makes it possible to characterize certain risk groups and to gain in-depth knowledge of the profiles of those who carry out such acts.

Publication: Building a framework for fake news detection in the health domain (San Francisco CA: Public Library of Science, 2024-07-08)
Martinez Rico, Juan R.; Araujo Serna, M. Lourdes; Martínez Romo, Juan; Bongelli, Ramona
Disinformation in the medical field is a growing problem that carries a significant risk. Therefore, it is crucial to detect and combat it effectively. In this article, we provide three elements to aid in this fight: 1) a new framework that collects health-related articles from verification entities and facilitates their check-worthiness and fact-checking annotation at the sentence level; 2) a corpus generated using this framework, composed of 10335 sentences annotated in these two concepts and grouped into 327 articles, which we call KEANE (faKe nEws At seNtence lEvel); and 3) a new model for verifying fake news that combines specific identifiers of the medical domain with subject-predicate-object triplets, using Transformers and feedforward neural networks at the sentence level. This model predicts the fact-checking of sentences and evaluates the veracity of the entire article. After training this model on our corpus, we achieved remarkable results in the binary classification of sentences (check-worthiness F1: 0.749, fact-checking F1: 0.698) and in the final classification of complete articles (F1: 0.703).
We also tested its performance against another public dataset and found that it performed better than most systems evaluated on that dataset. Moreover, the corpus we provide differs from other existing corpora in its duality of sentence-article annotation, which can provide an additional level of justification of the prediction of truth or untruth made by the model.

Publication: Understanding and Improving Disability Identification in Medical Documents (IEEE, 2020)
Fabregat Marcos, Hermenegildo; Martínez Romo, Juan; Araujo Serna, M. Lourdes
Disabilities are a problem that affects a large number of people in the world. Gathering information about them is crucial to improve the daily life of the people who suffer from them but, since disabilities are often strongly associated with different types of diseases, the available data are widely dispersed. In this work we review existing proposals for the problem, making an in-depth analysis, and from it we make a proposal that improves the results of previous systems. The analysis focuses on the results of the participants in the DIANN shared task (IberEval 2018), devoted to the detection of named disabilities in electronic documents. In order to evaluate the proposed systems using a common evaluation framework, a corpus of documents, in both English and Spanish, was gathered and annotated. Several teams participated in the task, either using classic methods or proposing specific approaches to deal effectively with the complexities of the task. Our aim is to provide insight for future advances in the field by analyzing the participating systems and identifying the most effective approaches and elements to tackle the problem. We have validated the lessons learned from this analysis through a new proposal that includes the most promising elements used by the participating teams.
The proposed system improves, for both languages, the results obtained during the task.

Publication: RoBERTime: un nuevo modelo para la detección de expresiones temporales en español (Sociedad Española para el Procesamiento del Lenguaje Natural, 2023-03)
Sánchez de Castro Fernández, Alejandro; Araujo Serna, M. Lourdes; Martínez Romo, Juan
Temporal expressions are all those words that refer to temporality. Their detection or extraction is a complex task, since it depends on the domain of the text, the language and the way they are written. Their study in Spanish, and more specifically in the clinical domain, is scarce, mainly due to the lack of annotated corpora. In this paper we propose the use of large language models to address the task, comparing the performance of five models of different characteristics. After a process of experimentation and fine-tuning, a new model called RoBERTime is created for the detection of temporal expressions in Spanish, with a particular focus on the clinical domain. This model is publicly available. RoBERTime achieves state-of-the-art results in the E3C and Timebank corpora, being the first public model for the detection of temporal expressions in Spanish specialized in the clinical domain.

Publication: Analyzing information retrieval methods to recover broken web links (2011-06-19)
Martínez Romo, Juan; Araujo Serna, M. Lourdes
In this work we compare different techniques to automatically find candidate web pages to substitute broken links. We extract information from the anchor text, the content of the page containing the link, and the cached page in some digital library. The selected information is processed and submitted to a search engine. We have compared different information retrieval methods for both the selection of terms used to construct the queries submitted to the search engine and the ranking of the candidate pages that it provides, in order to help the user find the best replacement.
In particular, we have used term frequencies and a language model approach for the selection of terms, and co-occurrence measures and a language model approach for ranking the final results. To test the different methods, we have also defined a methodology that does not require user judgments, which increases the objectivity of the results.

Publication: Deep-Learning Approach to Educational Text Mining and Application to the Analysis of Topics’ Difficulty (Institute of Electrical and Electronics Engineers, 2020-12-02)
Araujo Serna, M. Lourdes; López Ostenero, Fernando; Martínez Romo, Juan; Plaza Morales, Laura
Learning analytics has emerged as a promising tool for optimizing the learning experience and results, especially in online educational environments. An important challenge in this area is identifying the most difficult topics for students in a subject, which is of great use to improve the quality of teaching by devoting more effort to those topics of greater difficulty, assigning them more time, resources and materials. We have approached the problem by means of natural language processing techniques. In particular, we propose a solution based on a deep learning model that automatically extracts the main topics that are covered in educational documents. This model is next applied to the problem of identifying the most difficult topics for students in a subject related to the study of algorithms and data structures in a Computer Science degree. Our results show that our topic identification model presents very high accuracy (around 90 percent) and may be efficiently used in learning analytics applications, such as the identification and understanding of what makes the learning of a subject difficult.
An exhaustive analysis of the case study has also revealed that there are indeed topics that are consistently more difficult for most students, and also that the perception of difficulty in students and teachers does not always coincide with the actual difficulty indicated by the data, which prevents adequate attention from being paid to the most challenging topics.

Publication: Web spam detection: new classification features based on qualified link analysis and language models (Institute of Electrical and Electronics Engineers (IEEE), 2010-09-01)
Araujo Serna, M. Lourdes; Martínez Romo, Juan
Web spam is a serious problem for search engines because the quality of their results can be severely degraded by the presence of this kind of page. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. These features are not only related to quantitative data extracted from the Web pages, but also to qualitative properties, mainly of the page links. We consider, for instance, the ability of a search engine to find, using information provided by the page for a given link, the page that the link actually points at. This can be regarded as indicative of the link reliability. We also check the coherence between a page and another one pointed at by any of its links. Two pages linked by a hyperlink should be semantically related, by at least a weak contextual relation. Thus, we apply an LM approach to different sources of information from a Web page that belongs to the context of a link, in order to provide high-quality indicators of Web spam.
We have specifically applied the Kullback–Leibler divergence on different combinations of these sources of information in order to characterize the relationship between two linked pages. The result is a system that significantly improves the detection of Web spam using fewer features, on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.

Publication: Detecting malicious tweets in trending topics using a statistical analysis of language (Elsevier, 2013-06-01)
Martínez Romo, Juan; Araujo Serna, M. Lourdes
Twitter spam detection is a recent area of research in which most previous work has focused on the identification of malicious user accounts and honeypot-based approaches. However, in this paper we present a methodology based on two new aspects: the detection of spam tweets in isolation and without previous information about the user; and the application of a statistical analysis of language to detect spam in trending topics. Trending topics capture the emerging Internet trends and topics of discussion that are on everybody’s lips. This growing microblogging phenomenon therefore allows spammers to disseminate malicious tweets quickly and massively. In this paper we present the first work that tries to detect spam tweets in real time using language as the primary tool. We first collected and labeled a large dataset with 34 K trending topics and 20 million tweets. Then, we proposed a reduced set of features that are hard for spammers to manipulate. In addition, we have developed a machine learning system with some orthogonal features that can be combined with other sets of features with the aim of analyzing emergent characteristics of spam in social networks. We have also conducted an extensive evaluation process that has allowed us to show how our system is able to obtain an F-measure at the same level as the best state-of-the-art systems based on the detection of spam accounts.
Thus, our system can be applied to Twitter spam detection in trending topics in real time due mainly to the analysis of tweets instead of user accounts.