An improved fuzzy system for representing web pages in clustering tasks

Pérez García-Plaza, Alberto. An improved fuzzy system for representing web pages in clustering tasks. 2012. Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Lenguajes y Sistemas Informáticos

Files
Name Description MIME type Size
Documento.pdf PDF of the document application/pdf

Title An improved fuzzy system for representing web pages in clustering tasks
Author(s) Pérez García-Plaza, Alberto
Abstract Keeping information organized makes it easier to access. Although the information we need is sometimes available on the Web, it is only useful if we are able to find it. For this reason, automatic techniques for grouping documents are increasingly common. This thesis focuses on document clustering, that is, grouping documents based on the similarity of their contents. Document representation plays a very important role in web page clustering and is the central research topic of this dissertation. Web pages are commonly written in HTML, a format widely used on the Internet that offers explicit information (tags, in this case) about their visual presentation, the typography of the text, and its structure, among other things. The main goal of this thesis is to carry out an in-depth study aimed at making the most of a fuzzy model for representing HTML documents in clustering tasks. Our study explores whether any part of the system could be exploited differently to improve clustering results. We begin by analyzing the parts of the system where there is room for improvement and then study different alternatives for improving them. Thus, we do not propose a document representation from the outset; instead, we build it step by step, trying to understand its different parts along the way. To evaluate our results and compare the different representation proposals, we use several previously gathered web page collections as gold standards. Clustering is performed with state-of-the-art algorithms, and our proposals are validated in both flat and hierarchical clustering settings. Lastly, we also test the usefulness of our approaches in two languages: English and Spanish.
Subject(s) Computer Engineering
Keywords web pages
Publisher(s) Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Lenguajes y Sistemas Informáticos
Thesis supervisor Fresno Fernández, Víctor (Thesis Supervisor)
Martínez Unanue, Raquel (Thesis Supervisor)
Date 2012-10-23
Identifier tesisuned:IngInf-Aperez
http://e-spacio.uned.es/fez/view/tesisuned:IngInf-Aperez
Language eng
Publication version acceptedVersion
Access level and license http://creativecommons.org/licenses/by-nc-nd/4.0
info:eu-repo/semantics/openAccess
Resource type Thesis
Access type Open access

 
Created: Tue, 29 Jan 2013, 11:21:32 CET