Web people search

Artiles Picón, Javier. Web people search. 2009. Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Lenguajes y Sistemas Informáticos

Files
Name Description MIME type Size
Documento.pdf PDF of the document application/pdf

Title Web people search
Author(s) Artiles Picón, Javier
Abstract In this thesis we address the problem of name ambiguity when searching for people on the Web. When we began this work in 2004, there were very few research papers on the topic, and no commercial web search engine provided this kind of facility. For this reason, our research methodology initially focused on the design and organisation (together with Prof. Sekine from New York University) of a competitive evaluation campaign for Web People Search systems. After the campaign had run for two years, we used the standard test suites it produced to perform our own empirical studies on the nature and challenges of the task.
The evaluation campaign, WePS, was organised in 2007 (as a SemEval 2007 task) and in 2009 (as a WWW 2009 workshop). WePS was crucial in laying the foundations of a proper scientific study of the Web People Search problem. Its main accomplishments were:
• Standardisation of the problem: a majority of researchers now approach the problem as a search results mining task (clustering and information extraction), as defined in WePS.
• Creation of standard benchmarks for the task: since the first WePS campaign in 2007, the number of publications related to Web People Search has grown substantially, and most of them use the WePS test suites as a de facto standard benchmark. As of summer 2009, more than 70 research papers already cited the WePS overviews; this suggests not only that WePS has become a standard reference for the task, but also that it has helped to arouse interest in this kind of research problem.
• Design of evaluation metrics for the task:
1. We have performed a careful formal analysis of several extrinsic clustering evaluation metrics based on formal constraints, and conclude that BCubed metrics are the most suitable for the task. We have also extended the original BCubed definition to allow for overlapping clusters, which is a practical requirement of the task. Our results are general enough to be employed in other clustering tasks.
2. We have introduced a new metric combination function, the Unanimous Improvement Ratio (UIR), which, unlike Van Rijsbergen's F, does not require an a priori weighting of metrics (in our case, BCubed Precision and Recall). In an extensive empirical study we have shown that UIR provides rich information for comparing the performance of systems, information that previously existing metric combination functions (most prominently F) could not provide. Using the results of the WePS-2 campaign, we have shown that F and UIR provide complementary information and together constitute a powerful analytical tool for comparing systems. Although we have tested UIR only in the context of our task, it could be useful in any task where several evaluation metrics are needed to capture the quality of a system, as happens in several Natural Language Processing problems.
Using the test suites produced in the two WePS evaluation campaigns, we have then performed a number of empirical studies in order to better understand both the nature of the task and the way to solve it:
• First, we have studied the potential effects of using (interactive) query refinements to perform the Web People Search task. We have found that, although in most cases there is an expression that can be used as a near-perfect refinement to retrieve all and only those documents referring to an individual, the nature of these ideal refinements is unpredictable and very unlikely to be hypothesized by the user. This confirms the need for search results clustering, and also suggests that looking for an optimal refinement may be a viable strategy for automatic systems (one that has not been used by any participant in the WePS campaigns).
• Second, we have studied the usefulness of linguistic (computationally intensive) features compared with word n-grams and other cheap features for solving our clustering problem. Notably, named entities, which are the most popular feature after bag-of-words approaches, do not seem to provide a direct competitive advantage for the task. We have reached this conclusion independently of any particular choice of Machine Learning and Text Clustering algorithms, by using a Maximal Pairwise Accuracy estimator introduced in this thesis.
• As a side effect of our empirical study, we have built a system that uses the confidence of a binary classifier (whether two pages are coreferent or not) as a similarity metric between document pairs to feed a Hierarchical Agglomerative Clustering algorithm. This system provides the best results for the task known to us (F0.5 = 0.83 vs. 0.82 for the best WePS-2 system), without using computationally intensive linguistic features.
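The abstract names BCubed Precision and Recall, extended to overlapping clusters, without giving formulas. The following is a minimal sketch, assuming the commonly cited formulation in which each pair of items sharing a cluster (or a category) is scored by how well the number of shared clusters matches the number of shared categories. The function names `extended_bcubed` and `f_measure`, the dictionary input format, and the toy data are illustrative assumptions, not taken from the thesis, which remains the authoritative source for the definitions.

```python
from statistics import mean


def extended_bcubed(clusters, categories):
    """Return (precision, recall) under the assumed extended BCubed definition.

    clusters:   dict mapping each item to the set of system cluster ids it is in.
    categories: dict mapping each item to the set of gold category ids it is in.
    Assumes both dicts have the same keys and every item belongs to at least
    one cluster and at least one category.
    """
    items = list(clusters)

    def averaged(num_side, den_side):
        per_item = []
        for e in items:
            pair_scores = []
            for other in items:  # includes the pair (e, e)
                shared_den = len(den_side[e] & den_side[other])
                if shared_den == 0:
                    continue  # pair not related on this side, so it is skipped
                shared_num = len(num_side[e] & num_side[other])
                pair_scores.append(min(shared_den, shared_num) / shared_den)
            per_item.append(mean(pair_scores))
        return mean(per_item)

    precision = averaged(categories, clusters)  # pairs that share a cluster
    recall = averaged(clusters, categories)     # pairs that share a category
    return precision, recall


def f_measure(precision, recall, alpha=0.5):
    """Van Rijsbergen's F with weight alpha on precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


# Toy example (hypothetical data): three pages about two people,
# all placed in a single system cluster.
system = {"a": {0}, "b": {0}, "c": {0}}
gold = {"a": {"p1"}, "b": {"p1"}, "c": {"p2"}}
p, r = extended_bcubed(system, gold)
print(p, r, f_measure(p, r))
```

When no item belongs to more than one cluster or category, every shared count is 0 or 1, and the sketch reduces to the original per-item BCubed averages.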
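The Unanimous Improvement Ratio mentioned in the abstract can be illustrated in the same spirit. This sketch assumes the usual formulation: over a set of test cases, count the cases where system A is at least as good as system B on every metric, subtract the cases where B is at least as good as A on every metric, and divide by the total number of cases. The function name `uir` and the score values are hypothetical; the thesis should be consulted for the exact definition.

```python
def uir(scores_a, scores_b):
    """Unanimous Improvement Ratio of system A over system B (assumed form).

    scores_a, scores_b: lists with one tuple of metric values per test case,
    for example (bcubed_precision, bcubed_recall), aligned by test case.
    """
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both systems need scores for the same non-empty test set")
    a_at_least_b = 0
    b_at_least_a = 0
    for case_a, case_b in zip(scores_a, scores_b):
        # A unanimously matches or improves B when no metric gets worse.
        if all(x >= y for x, y in zip(case_a, case_b)):
            a_at_least_b += 1
        if all(y >= x for x, y in zip(case_a, case_b)):
            b_at_least_a += 1
    # Exact ties are counted on both sides and therefore cancel out.
    return (a_at_least_b - b_at_least_a) / len(scores_a)


# Hypothetical per-test-case (precision, recall) scores for two systems.
system_a = [(0.8, 0.7), (0.6, 0.9), (0.5, 0.5)]
system_b = [(0.7, 0.6), (0.7, 0.8), (0.5, 0.5)]
print(uir(system_a, system_b))
```

No weighting between precision and recall appears anywhere in the computation, which is the property the abstract contrasts with F. The second test case, where each system wins on one metric, contributes to neither count.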
Subject(s) Computer Engineering
Keywords World Wide Web
information retrieval system
programming languages
computer systems
Publisher(s) Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Lenguajes y Sistemas Informáticos
Thesis advisor(s) Gonzalo Arroyo, Julio (Thesis advisor)
Amigó Cabrera, Enrique (Thesis advisor)
Date 2009-10-09
Format application/pdf
Identifier tesisuned:IngInf-Jartiles
Language eng
Publication version acceptedVersion
Access level and license http://creativecommons.org/licenses/by-nc-nd/4.0
Resource type Thesis
Access type Open access

Access statistics: 257 views, 373 downloads
Created: Wed, 17 Feb 2010, 13:59:05 CET