Publication:
Web spam detection : new classification features based on qualified link analysis and language models

dc.contributor.author: Araujo Serna, M. Lourdes
dc.contributor.author: Martínez Romo, Juan
dc.date.accessioned: 2024-05-21T13:03:33Z
dc.date.available: 2024-05-21T13:03:33Z
dc.date.issued: 2010-09-01
dc.description.abstract: Web spam is a serious problem for search engines because the quality of their results can be severely degraded by the presence of this kind of page. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. These features are not only related to quantitative data extracted from the Web pages, but also to qualitative properties, mainly of the page links. We consider, for instance, the ability of a search engine to find, using information provided by the page for a given link, the page that the link actually points at. This can be regarded as indicative of the link reliability. We also check the coherence between a page and another one pointed at by any of its links. Two pages linked by a hyperlink should be semantically related, by at least a weak contextual relation. Thus, we apply an LM approach to different sources of information from a Web page that belongs to the context of a link, in order to provide high-quality indicators of Web spam. We have specifically applied the Kullback–Leibler divergence on different combinations of these sources of information in order to characterize the relationship between two linked pages. The result is a system that significantly improves the detection of Web spam using fewer features, on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007. [es]
dc.description.version: published version
dc.identifier.doi: 10.1109/TIFS.2010.2050767
dc.identifier.issn: 1556-6013
dc.identifier.uri: https://hdl.handle.net/20.500.14468/19988
dc.language.iso: en
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.center: E.T.S. de Ingeniería Informática
dc.relation.department: Lenguajes y Sistemas Informáticos
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0
dc.subject.keywords: content analysis
dc.subject.keywords: information retrieval
dc.subject.keywords: language models (LMs)
dc.subject.keywords: link integrity
dc.subject.keywords: web spam detection
dc.title: Web spam detection : new classification features based on qualified link analysis and language models [es]
dc.type: actas de congreso [es]
dc.type: conference proceedings [en]
dspace.entity.type: Publication
relation.isAuthorOfPublication: 77c4023e-4374-442a-9dfb-b9d4b609c31e
relation.isAuthorOfPublication: 91b7e317-2a30-494f-98e9-3a0e026747b1
relation.isAuthorOfPublication.latestForDiscovery: 77c4023e-4374-442a-9dfb-b9d4b609c31e
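
Illustrative note: the abstract mentions applying the Kullback–Leibler divergence to sources of information from two linked pages. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes unigram language models, whitespace tokenization, and additive smoothing; the function names (unigram_lm, kl_divergence) and the alpha parameter are hypothetical choices made for this example.

```python
import math
from collections import Counter


def unigram_lm(text, vocab, alpha=1.0):
    """Additively smoothed unigram distribution of `text` over `vocab`.

    Smoothing (an assumption of this sketch) keeps every probability
    non-zero, so the divergence below is always finite.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(source_text, target_text, alpha=1.0):
    """KL(P_source || P_target) over the joint vocabulary of both texts."""
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    if not vocab:
        return 0.0
    p = unigram_lm(source_text, vocab, alpha)
    q = unigram_lm(target_text, vocab, alpha)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)


# Hypothetical example: text around a link compared with two candidate targets.
anchor = "cheap flights to madrid"
related = "cheap flights and hotels in madrid"
unrelated = "buy pills online casino poker"

print(kl_divergence(anchor, related))
print(kl_divergence(anchor, unrelated))
```

Under these assumptions, a lower divergence suggests that the link context and the target page share vocabulary, while a higher one suggests weak coherence. In a classifier, such values would be one feature among others rather than a decision rule on their own.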
Files
Original bundle
Name: Documento.pdf
Size: 781.76 KB
Format: Adobe Portable Document Format