Big Data Clustering

Tizón Galisteo, Daniel

Publication:
Big Data Clustering

dc.contributor.author	Tizón Galisteo, Daniel
dc.contributor.director	Sarro Baro, Luis Manuel
dc.date.accessioned	2024-05-20T12:35:09Z
dc.date.available	2024-05-20T12:35:09Z
dc.date.issued	2017-07-07
dc.description.abstract	En este trabajo he realizado una investigación sobre algoritmos de clusterización que tienen órdenes de complejidad lineales o logarítmicos respecto al tiempo de ejecución, y que pueden ser paralelizables, y por tanto nos permitan trabajar con grandes cantidades de datos. Además, hay que tener en cuenta que puesto que utilizaré un cluster de Spark, los algoritmos que podremos utilizar estarán limitados por aquellos que se encuentran implementados en la librería MLlib de Apache Spark. También he llevado a cabo un estudio de distintos índices de validación interna y externa que podemos emplear para evaluar la calidad de los grupos o clusters creados por dichos algoritmos. Como caso de uso, he utilizado los datos astrométricos procedentes de millones de estrellas de nuestra galaxia proporcionados por la misión Gaia de la Agencia Espacial Europea para realizar una clusterización de dichas estrellas, con el objetivo de tratar de encontrar cúmulos estelares nuevos o recabar más información sobre los ya existentes. Para llevar a cabo el caso de estudio, dada la gran cantidad de datos a tratar, he utilizado la infraestructura facilitada por la DPAC (Data Processing and Analysis Consortium), consistente en un cluster de Apache Spark formado por 6 nodos con 16 cores y 64Gb de RAM cada uno.	es
dc.description.abstract	In this work I investigate clustering algorithms whose execution time has linear or logarithmic complexity and that can be parallelized, so that they can handle large volumes of data. Because I use an Apache Spark cluster, the choice is limited to the clustering algorithms implemented in Spark's machine learning library (MLlib). I also study several internal and external validation indices that can be used to evaluate the quality of the groups, or clusters, produced by these algorithms. As a use case, I apply clustering to astrometric data for millions of stars in our galaxy, provided by the Gaia mission of the European Space Agency (ESA), with the aim of finding new star clusters or gathering more information about known ones. Given the large amount of data to be processed, the case study was carried out on infrastructure provided by the Data Processing and Analysis Consortium (DPAC): an Apache Spark cluster of 6 nodes, each with 16 cores and 64 GB of RAM.	en
dc.description.version	final version
dc.identifier.uri	https://hdl.handle.net/20.500.14468/14561
dc.language.iso	es
dc.publisher	Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Inteligencia Artificial
dc.relation.center	E.T.S. de Ingeniería Informática
dc.relation.degree	Máster Universitario en I.A. Avanzada: Fundamentos, Métodos y Aplicaciones
dc.relation.department	Inteligencia Artificial
dc.rights	info:eu-repo/semantics/openAccess
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/deed.es
dc.title	Big Data Clustering	es
dc.type	master thesis	en
dspace.entity.type	Publication

Files

Original bundle

Name: Tizon_Galisteo_Daniel_TFM.pdf
Size: 2.53 MB
Format: Adobe Portable Document Format

Collections

Master's theses (Trabajos de fin de máster, TFM)
