
Efficient Evidence Accumulation Clustering for large datasets

Silva, D. ; Aidos, H. ; Fred, A. L. N.

Efficient Evidence Accumulation Clustering for large datasets, Proc INSTICC International Conf. on Pattern Recognition Applications and Methods - ICPRAM, Rome, Italy, Vol. 0, pp. 367 - 374, February, 2016.


Abstract
The unprecedented collection and storage of data in electronic format has given rise to an interest in automated analysis for the generation of knowledge and new insights. Cluster analysis is a good candidate since it makes as few assumptions about the data as possible. A vast body of work on clustering methods exists, yet typically no single method is able to respond to the specificities of all kinds of data. Evidence Accumulation Clustering (EAC) is a robust, state-of-the-art ensemble algorithm that has shown good results.
However, this robustness comes at a higher computational cost. Currently, its application is slow or restricted to small datasets. The objective of the present work is to scale EAC, making it applicable to big datasets with technology available on a typical workstation. Three approaches, each addressing a different part of EAC, are presented: a parallel GPU K-Means implementation; a novel strategy for building a sparse CSR matrix specialized to EAC; and Single-Link clustering based on Minimum Spanning Trees, using an external-memory sorting algorithm. Combining these approaches made it possible to apply EAC to much larger datasets than before.
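As a rough illustration of the three EAC stages named in the abstract, the sketch below runs a small K-Means ensemble, accumulates pairwise co-associations into CSR arrays, and extracts a partition with Single-Link computed Kruskal-style on the co-association graph. It is a minimal pure-Python sketch and not the paper's implementation: the K-Means here runs on the CPU rather than the GPU, the CSR construction is a generic one rather than the specialized strategy, the edge sort happens in memory rather than in external memory, and the final partition is cut at a fixed number of clusters. All function names and parameters (`kmeans_labels`, `coassoc_upper`, `to_csr`, `single_link`, `eac`, `k_values`, `n_clusters`) are hypothetical.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, iters=10):
    """Plain Lloyd's K-Means on the CPU; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def coassoc_upper(partitions):
    """Count how often each pair (i < j) shares a cluster; zero pairs are never stored."""
    counts = {}
    for labels in partitions:
        clusters = {}
        for i, lab in enumerate(labels):
            clusters.setdefault(lab, []).append(i)
        for members in clusters.values():
            for i, j in combinations(members, 2):
                counts[(i, j)] = counts.get((i, j), 0) + 1
    return counts


def to_csr(counts, n):
    """Pack the upper-triangular counts into CSR arrays (data, indices, indptr)."""
    rows = [[] for _ in range(n)]
    for (i, j), v in counts.items():
        rows[i].append((j, v))
    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, v in sorted(row):
            indices.append(j)
            data.append(v)
        indptr.append(len(indices))
    return data, indices, indptr


def single_link(csr, n, n_partitions, n_clusters):
    """Single-Link via Kruskal: merge strongest co-associations first,
    stopping once n_clusters components remain (in-memory sort, an assumption)."""
    data, indices, indptr = csr
    edges = sorted(
        (n_partitions - data[p], i, indices[p])  # dissimilarity = N - co-association
        for i in range(n)
        for p in range(indptr[i], indptr[i + 1])
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for _, i, j in edges:
        if components == n_clusters:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    return [find(i) for i in range(n)]


def eac(points, n_partitions, k_values, n_clusters, seed=0):
    """Ensemble -> sparse co-association matrix -> Single-Link partition."""
    rng = random.Random(seed)
    partitions = [kmeans_labels(points, rng.choice(k_values), rng) for _ in range(n_partitions)]
    n = len(points)
    return single_link(to_csr(coassoc_upper(partitions), n), n, n_partitions, n_clusters)
```

Storing only the pairs that co-occur at least once is what keeps the matrix sparse. Stopping Kruskal's algorithm at a chosen number of components yields the same partition as cutting the Single-Link dendrogram at that number of clusters.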