Person:
Martínez Romo, Juan

ORCID

0000-0002-6905-7051

Surname

Martínez Romo

First name

Juan

Search results

Showing 1 - 10 of 20

Analyzing information retrieval methods to recover broken web links
(2011-06-19) Martínez Romo, Juan; Araujo Serna, M. Lourdes
In this work we compare different techniques to automatically find candidate web pages to substitute broken links. We extract information from the anchor text, the content of the page containing the link, and the cached page in some digital library. The selected information is processed and submitted to a search engine. We have compared different information retrieval methods both for the selection of terms used to construct the queries submitted to the search engine and for the ranking of the candidate pages that it provides, in order to help the user find the best replacement. In particular, we have used term frequencies and a language model approach for the selection of terms, and co-occurrence measures and a language model approach for ranking the final results. To test the different methods, we have also defined a methodology that does not require user judgments, which increases the objectivity of the results.
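As a rough illustration of the term-frequency selection step mentioned in this abstract, the sketch below builds a query from the anchor text plus the most frequent terms of the page that contains the link. It is a minimal sketch under stated assumptions, not the authors' implementation: the stopword list, the tokenizer, the `build_query` helper, and the sample strings are all hypothetical, and the language model and ranking steps are not reproduced.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with", "at"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_query(anchor_text, page_text, extra_terms=3):
    """Combine the anchor-text terms with the most frequent terms of the page."""
    anchor_terms = [t for t in tokenize(anchor_text) if t not in STOPWORDS]
    counts = Counter(
        t for t in tokenize(page_text)
        if t not in STOPWORDS and t not in anchor_terms
    )
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    expansion = [term for term, _ in ranked[:extra_terms]]
    return anchor_terms + expansion


# Hypothetical anchor text and page content.
query = build_query(
    "UNED library",
    "The library catalogue lists thesis records. "
    "Thesis records link to the catalogue at the library.",
    extra_terms=2,
)
print(" ".join(query))
```

The resulting term list would then be submitted to a search engine, and the returned candidates ranked in a separate step.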
Técnicas de recuperación de información para la resolución de problemas en la Web
(Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Lenguajes y Sistemas Informáticos, 2010-07-08) Martínez Romo, Juan; Araujo Serna, M. Lourdes
This thesis addresses two of the most important problems currently affecting the Web. The rapid growth of this worldwide network has led this thesis to connect one of its main problems since its origin in 1989, broken links, with a more recent concern of search engines, web spam. The link between the problem of broken links in web pages and search engine spam has been established through the shared use of a set of information retrieval techniques, in the form of a web information retrieval system. The inconvenience caused by the disappearance of a web page has been addressed by designing a Broken Link Recovery System (SRER, from the Spanish "Sistema de Recuperación de Enlaces Rotos"). This system analyzes the information available about a missing page and recommends to the user a set of candidate documents to replace the obsolete link. Unlike other systems with similar goals, the SRER proposed in this thesis does not need any information about the missing page to have been stored beforehand in order to make a recommendation. The design of this system comprises four stages, in which different information retrieval and natural language processing techniques are applied to obtain the best performance. The first stage is an information selection process, which first analyzes the anchor text of the hyperlink that has stopped working. The terms that make up the anchor are fundamental to the proper functioning of the system, so named entity recognition is performed in order to determine the terms with the highest descriptive value. Second, information is extracted from the context of the hyperlink to achieve greater precision.
When a web page disappears, data about that page can still be found in the web infrastructure for a variable period of time. Given the presence of this information, the third step proposes using several resources available on the Web in order to follow the trail left by the missing page. These resources include applications provided by the main search engines, digital libraries, web services, and social networks. The second stage focuses on the information sources obtained from the link context and from the available online resources. In some cases, these sources are too large to discriminate relevant information from irrelevant information. For this reason, a terminology extraction process is carried out to synthesize the information. Different information retrieval techniques have been analyzed in order to optimize the extraction of the most relevant terms in each case. In the third stage, the SRER analyzes the information obtained and builds a set of queries, which are later executed in a search engine. This phase starts from the data obtained from the anchor text and then performs a query expansion process. For each query, the system retrieves the first results returned by the search engine. Once the query expansion stage has finished and the candidate pages to replace the broken link have been retrieved, they are ranked by relevance in order to show the user a set of results in decreasing order. To establish the order, several ranking functions have been analyzed. These functions use the information available in the first stage to assign a relevance value to each document. Finally, the system presents the user with a list of results ordered by relevance.
The four stages into which the SRER is divided are driven by an algorithm that analyzes the information available in each case and makes a decision, with the aim of optimizing both the results shown to the user and the response time of the system. The contributions of this thesis also include the development of an evaluation methodology that avoids human judgment in order to provide more objective results. Finally, the SRER, represented in turn by the broken link recovery algorithm, has been integrated into a web application called Detective Brooklynk. Recovering a link, that is, finding a page on the Internet based on the information about it available in the page that points to it, rests on the hypothesis that this information is coherent. There are cases in which authors of web pages manipulate the information about a given page in order to obtain some benefit. In this thesis, we analyze the cases in which a web page inserts incoherent information about a second, linked page in order to promote it in a search engine. The second part of this thesis, which falls within the area of web spam detection, starts from the concept of link recovery to detect links of a fraudulent nature. Here, the engine of the broken link recovery system is modified to recover active links. The goal of this adaptation is to locate links whose information about the target resource is deliberately incoherent and which are therefore impossible to recover. The resulting system is able to provide a set of indicators for each analyzed page, which are used in a later automatic classification stage.
Web spam is mainly divided into two groups of techniques: those that act on the links of web pages, and those that use content to promote them. Thus, since the link recovery system makes it possible to detect fraudulent links, this thesis complements it with the detection of content spam. To this end, an analysis of the divergence between the content of two linked pages has been carried out. The outcome of this second part of the thesis, devoted to web spam detection, is the proposal to use two new sets of indicators. Moreover, combining both sets of features yields an orthogonal system that improves on the detection results of either set on its own.
RoBERTime: un nuevo modelo para la detección de expresiones temporales en español
(Sociedad Española para el Procesamiento del Lenguaje Natural, 2023-03) Sánchez de Castro Fernández, Alejandro; Araujo Serna, M. Lourdes; Martínez Romo, Juan
Temporal expressions are all those words that refer to temporality. Their detection or extraction is a complex task, since it depends on the domain of the text, the language and the way they are written. Their study in Spanish, and more specifically in the clinical domain, is scarce, mainly due to the lack of annotated corpora. In this paper we propose the use of large language models to address the task, comparing the performance of five models with different characteristics. After a process of experimentation and fine-tuning, a new model called RoBERTime is created for the detection of temporal expressions in Spanish, with a particular focus on the clinical domain. This model is publicly available. RoBERTime achieves state-of-the-art results on the E3C and Timebank corpora, and is the first public model for the detection of temporal expressions in Spanish specialized in the clinical domain.
A keyphrase-based approach for interpretable ICD-10 code classification of Spanish medical reports
(Elsevier, 2021) Fabregat Marcos, Hermenegildo; Duque Fernández, Andrés; Araujo Serna, M. Lourdes; Martínez Romo, Juan
Background and objectives: The 10th version of the International Classification of Diseases (ICD-10) codification system has been widely adopted by the health systems of many countries, including Spain. However, manual code assignment of Electronic Health Records (EHR) is a complex and time-consuming task that requires a great amount of specialised human resources. Therefore, several machine learning approaches are being proposed to assist in the assignment task. In this work we present an alternative system for automatically recommending ICD-10 codes to be assigned to EHRs. Methods: Our proposal is based on characterising ICD-10 codes by a set of keyphrases that represent them. These keyphrases include not only those that have literally appeared in some EHR with the considered ICD-10 codes assigned, but also others that have been obtained by a statistical process able to capture expressions that have led the annotators to assign the code. Results: The result is an information model that makes it possible to efficiently recommend codes for a new EHR based on its textual content. We explore an approach that proves to be competitive with other state-of-the-art approaches and can be combined with them to optimise results. Conclusions: In addition to its effectiveness, the recommendations of this method are easily interpretable, since the phrases in an EHR that lead to the recommendation of an ICD-10 code are known. Moreover, the keyphrases associated with each ICD-10 code can be a valuable additional source of information for other approaches, such as machine learning techniques.
Automatic Recommendation of Forum Threads and Reinforcement Activities in a Data Structure and Programming Course
(MDPI, 2023-09-21) Plaza Morales, Laura; Araujo Serna, M. Lourdes; López Ostenero, Fernando; Martínez Romo, Juan
Online learning is quickly becoming a popular alternative to traditional education. One of its key advantages lies in the flexibility it offers, allowing individuals to tailor their learning experiences to their unique schedules and commitments. Moreover, online learning enhances accessibility to education, breaking down geographical and economic boundaries. In this study, we propose the use of advanced natural language processing techniques to design and implement a recommender that supports e-learning students by tailoring materials and reinforcement activities to students’ needs. When a student posts a query in the course forum, our recommender system provides links to other discussion threads where related questions have been raised and additional activities to reinforce the study of topics that have been challenging. We have developed a content-based recommender that utilizes an algorithm capable of extracting key phrases, terms, and embeddings that describe the concepts in the student query and those present in other conversations and reinforcement activities with high precision. The recommender considers the similarity of the concepts extracted from the query and those covered in the course discussion forum and the exercise database to recommend the most relevant content for the student. Our results indicate that we can recommend both posts and activities with high precision (above 80%) using key phrases to represent the textual content. This research makes three primary contributions. Firstly, it centers on a remarkably specialized and novel domain; secondly, it introduces an effective recommendation approach exclusively guided by the student’s query. Thirdly, the recommendations not only provide answers to immediate questions, but also encourage further learning through the recommendation of supplementary activities.
Understanding and Improving Disability Identification in Medical Documents
(IEEE, 2020) Fabregat Marcos, Hermenegildo; Martínez Romo, Juan; Araujo Serna, M. Lourdes
Disabilities are a problem that affects a large number of people in the world. Gathering information about them is crucial to improve the daily life of the people who suffer from them but, since disabilities are often strongly associated with different types of diseases, the available data are widely dispersed. In this work we review existing proposals for the problem, making an in-depth analysis, and from it we make a proposal that improves the results of previous systems. The analysis focuses on the results of the participants in the DIANN shared task (IberEval 2018), devoted to the detection of named disabilities in electronic documents. In order to evaluate the proposed systems using a common evaluation framework, a corpus of documents, in both English and Spanish, was gathered and annotated. Several teams participated in the task, either using classic methods or proposing specific approaches to deal effectively with the complexities of the task. Our aim is to provide insight for future advances in the field by analyzing the participating systems and identifying the most effective approaches and elements to tackle the problem. We have validated the lessons learned from this analysis through a new proposal that includes the most promising elements used by the participating teams. The proposed system improves, for both languages, the results obtained during the task.
Discovering HIV related information by means of association rules and machine learning
(Nature Research, 2022-10-22) Araujo Serna, M. Lourdes; Martínez Romo, Juan; Bisbal, Otilia; Sanchez de Madariaga, Ricardo; The Cohort of the National AIDS Network (CoRIS); https://orcid.org/0000-0003-3746-3378
Acquired immunodeficiency syndrome (AIDS) is still one of the main health problems worldwide. It is therefore essential to keep making progress in improving the prognosis and quality of life of affected patients. One way to advance along this pathway is to uncover connections between other disorders associated with HIV/AIDS, so that they can be anticipated and possibly mitigated. We propose to achieve this by using Association Rules (ARs). They allow us to represent the dependencies between a number of diseases and other specific diseases. However, classical techniques systematically generate every AR meeting some minimal conditions on data frequency, hence generating a vast number of uninteresting ARs, which need to be filtered out. The lack of manually annotated ARs has favored unsupervised filtering methods, even though they produce limited results. In this paper, we propose a semi-supervised system, able to identify relevant ARs among HIV-related diseases with a minimal amount of annotated training data. Our system has been able to extract a good number of relationships between HIV-related diseases that have been previously detected in the literature but are scattered and often little known. Furthermore, a number of plausible new relationships have shown up which deserve further investigation by qualified medical experts.
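The "minimal conditions on data frequency" mentioned in this abstract are usually expressed as thresholds on the support and confidence of a rule. The sketch below shows how these two standard measures are computed for a rule "antecedent -> consequent". It is only an illustration under stated assumptions: the records and the labels `a`, `b`, `c` are hypothetical placeholders rather than data from the study, and the semi-supervised filtering proposed in the paper is not reproduced.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= record for record in records) / len(records)


def confidence(records, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    antecedent_support = support(records, antecedent)
    if antecedent_support == 0:
        return 0.0  # The rule never fires, so its confidence is set to zero.
    both = set(antecedent) | set(consequent)
    return support(records, both) / antecedent_support


# Hypothetical patient records: each set holds placeholder condition labels.
records = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"b", "c"},
]

rule_support = support(records, {"a", "b"})
rule_confidence = confidence(records, {"a"}, {"b"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A classical rule miner keeps every rule whose support and confidence exceed chosen thresholds, which is why a further filtering step is needed to isolate the relevant ones.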
Deep-Learning Approach to Educational Text Mining and Application to the Analysis of Topics’ Difficulty
(Institute of Electrical and Electronics Engineers, 2020-12-02) Araujo Serna, M. Lourdes; López Ostenero, Fernando; Martínez Romo, Juan; Plaza Morales, Laura
Learning analytics has emerged as a promising tool for optimizing the learning experience and results, especially in online educational environments. An important challenge in this area is identifying the most difficult topics for students in a subject, which is of great use for improving the quality of teaching by devoting more effort to those topics of greater difficulty, assigning them more time, resources and materials. We have approached the problem by means of natural language processing techniques. In particular, we propose a solution based on a deep learning model that automatically extracts the main topics that are covered in educational documents. This model is next applied to the problem of identifying the most difficult topics for students in a subject related to the study of algorithms and data structures in a Computer Science degree. Our results show that our topic identification model presents very high accuracy (around 90 percent) and may be efficiently used in learning analytics applications, such as the identification and understanding of what makes the learning of a subject difficult. An exhaustive analysis of the case study has also revealed that there are indeed topics that are consistently more difficult for most students, and also that the perception of difficulty in students and teachers does not always coincide with the actual difficulty indicated by the data, which prevents adequate attention from being paid to the most challenging topics.
Can deep learning techniques improve classification performance of vandalism detection in Wikipedia?
(Elsevier, 2019) Martinez-Rico, Juan R.; Martínez Romo, Juan; Araujo Serna, M. Lourdes
Wikipedia is a free encyclopedia created as an international collaborative project. One of its peculiarities is that any user can edit its contents almost without restrictions, which has given rise to a phenomenon known as vandalism. Vandalism is any attempt that seeks to damage the integrity of the encyclopedia deliberately. To address this problem, in recent years several automatic detection systems and associated features have been developed. This work implements one of these systems, which uses three sets of new features based on different techniques. Specifically, we study the applicability of a leading technology such as deep learning to the problem of vandalism detection. The first set is obtained by expanding a list of vandal terms taking advantage of the existing semantic-similarity relations in word embeddings and deep neural networks. Deep learning techniques are applied to the second set of features, specifically Stacked Denoising Autoencoders (SDA), in order to reduce the dimensionality of a bag of words model obtained from a set of edits taken from Wikipedia. The last set uses graph-based ranking algorithms to generate a list of vandal terms from a vandalism corpus extracted from Wikipedia. These three sets of new features are evaluated separately as well as together to study their complementarity, improving the results in the state of the art. The system evaluation has been carried out on a corpus extracted from Wikipedia (WP_Vandal) as well as on another called PAN-WVC-2010 that was used in a vandalism detection competition held at the CLEF conference.
Building a framework for fake news detection in the health domain
(San Francisco CA: Public Library of Science, 2024-07-08) Martinez Rico, Juan R.; Araujo Serna, M. Lourdes; Martínez Romo, Juan; Bongelli, Ramona
Disinformation in the medical field is a growing problem that carries a significant risk. Therefore, it is crucial to detect and combat it effectively. In this article, we provide three elements to aid in this fight: 1) a new framework that collects health-related articles from verification entities and facilitates their check-worthiness and fact-checking annotation at the sentence level; 2) a corpus generated using this framework, composed of 10335 sentences annotated in these two concepts and grouped into 327 articles, which we call KEANE (faKe nEws At seNtence lEvel); and 3) a new model for verifying fake news that combines specific identifiers of the medical domain with subject-predicate-object triplets, using Transformers and feedforward neural networks at the sentence level. This model predicts the fact-checking of sentences and evaluates the veracity of the entire article. After training this model on our corpus, we achieved remarkable results in the binary classification of sentences (check-worthiness F1: 0.749, fact-checking F1: 0.698) and in the final classification of complete articles (F1: 0.703). We also tested its performance against another public dataset and found that it performed better than most systems evaluated on that dataset. Moreover, the corpus we provide differs from other existing corpora in its duality of sentence-article annotation, which can provide an additional level of justification of the prediction of truth or untruth made by the model.
