Web spam detection : new classification features based on qualified link analysis and language models

Araujo, Lourdes; Martínez-Romo, Juan

doi:10.1109/TIFS.2010.2050767

Araujo, Lourdes and Martínez-Romo, Juan (2010). Web spam detection : new classification features based on qualified link analysis and language models. IEEE Transactions on Information Forensics and Security, vol. 5(3), 2010, pp. 581-590. ISSN: 1556-6013, DOI: 10.1109/TIFS.2010.2050767

Title	Web spam detection : new classification features based on qualified link analysis and language models
Author(s)	Araujo, Lourdes; Martínez-Romo, Juan
Abstract	Web spam is a serious problem for search engines because the quality of their results can be severely degraded by the presence of this kind of page. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. These features are not only related to quantitative data extracted from the Web pages, but also to qualitative properties, mainly of the page links. We consider, for instance, the ability of a search engine to find, using information provided by the page for a given link, the page that the link actually points at. This can be regarded as indicative of the link reliability. We also check the coherence between a page and another one pointed at by any of its links. Two pages linked by a hyperlink should be semantically related, by at least a weak contextual relation. Thus, we apply an LM approach to different sources of information from a Web page that belongs to the context of a link, in order to provide high-quality indicators of Web spam. We have specifically applied the Kullback–Leibler divergence on different combinations of these sources of information in order to characterize the relationship between two linked pages. The result is a system that significantly improves the detection of Web spam using fewer features, on two large and public datasets such as WEBSPAM-UK2006 and WEBSPAM-UK2007.
Keywords	content analysis; information retrieval; language models (LMs); link integrity; web spam detection
Publisher	Institute of Electrical and Electronics Engineers (IEEE)
Date	2010-09-01
Format	application/pdf
Identifier	http://e-spacio.uned.es/fez/view/bibliuned:DptoLSI-ETSI-MA2VICMR-1080 bibliuned:DptoLSI-ETSI-MA2VICMR-1080
DOI - identifier	10.1109/TIFS.2010.2050767
ISSN - identifier	1556-6013
Published in journal	IEEE Transactions on Information Forensics and Security, vol. 5(3), 2010, pp. 581-590. ISSN: 1556-6013, DOI: 10.1109/TIFS.2010.2050767
Language	eng
Publication version	publishedVersion
Related to project:	info:eu-repo/grantAgreement/S2009/TIC-1542
Resource type	Article
Access rights and license	http://creativecommons.org/licenses/by-nc-nd/4.0 info:eu-repo/semantics/openAccess
Access type	Open access

Document type:	Journal article
Collections:	Grupo de Procesamiento del Lenguaje Natural y Recuperación de Información. Proyecto MA2VICMR-CM (Natural Language Processing and Information Retrieval Group, MA2VICMR-CM project); Article set; Funded projects set; OpenAIRE set

Created:	Wed, 26 Nov 2014, 15:19:27 CET
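
The abstract mentions applying the Kullback–Leibler divergence to sources of information taken from two linked pages. As a rough illustration only, the sketch below compares two text snippets with smoothed unigram language models. The function names, the whitespace tokenization, and the additive smoothing with parameter `alpha` are assumptions made for this example; the record does not state which sources, tokenization, or smoothing the paper uses.

```python
import math
from collections import Counter


def unigram_lm(text, vocab, alpha=1.0):
    """Additively smoothed unigram model over a fixed vocabulary.

    Hypothetical helper: whitespace tokenization and add-alpha smoothing
    are assumptions for this sketch, not details taken from the paper.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(source_text, target_text, alpha=1.0):
    """KL(source || target), in bits, between the two smoothed models."""
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_lm(source_text, vocab, alpha)
    q = unigram_lm(target_text, vocab, alpha)
    # Smoothing keeps every q[w] strictly positive, so the ratio is defined.
    return sum(p[w] * math.log2(p[w] / q[w]) for w in vocab)


# Made-up snippets: an anchor text and two candidate target pages.
anchor = "cheap flights to madrid"
related_page = "book cheap flights to madrid and other spanish cities"
unrelated_page = "win money now casino poker bonus"

print(kl_divergence(anchor, related_page))
print(kl_divergence(anchor, unrelated_page))
```

Under these assumptions, a lower value suggests that the link context and the target page use similar vocabulary, while a higher value suggests weaker coherence; in a classifier such values would serve as features alongside others rather than as a verdict on their own.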
