Project: Evidence Accumulation in Unsupervised and Semi-Supervised Learning: a Cluster Ensemble Approach

Acronym: EvaClue
Main Objective:
The main objectives of this project are twofold:
1. To provide original, state-of-the-art contributions to the area of semi-supervised learning, in particular exploring cluster ensemble
2. To address the practical, real-world problem of automatically structuring text data, focusing on three types of objects: scientific papers, email, and web pages.
The key idea of semi-supervised learning, specifically constrained clustering, is to exploit both a priori information and unlabeled data to organize data into sensible groups. A priori information can be provided as labeled data, or as constraints (usually pair-wise relations) that one wishes to enforce or merely encourage. We focus on pair-wise associations, which are to be defined from available a priori information, or obtained by means of user queries or data mining techniques on the application domain, exploring semi-structured data from multiple sources. Specifically, we will address the problem of active acquisition of constraints by identifying key objects at cluster boundaries and by exploring boosting approaches. A particular breakthrough in semi-supervised learning is expected from its extension to the cluster ensemble framework.
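As an illustration of the pair-wise constraints mentioned above, the following minimal sketch represents must-link and cannot-link relations as index pairs and lists those that a given partition fails to satisfy. The function name `violated_constraints` and the toy data are hypothetical, chosen only for this example; they are not taken from the project.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the pair-wise constraints that a partition does not satisfy.

    labels      -- labels[i] is the cluster assigned to object i
    must_link   -- pairs (i, j) that should share a cluster
    cannot_link -- pairs (i, j) that should be in different clusters
    """
    violated = []
    for i, j in must_link:
        if labels[i] != labels[j]:
            violated.append(("must-link", i, j))
    for i, j in cannot_link:
        if labels[i] == labels[j]:
            violated.append(("cannot-link", i, j))
    return violated


# Toy example (illustrative data only): five objects in three clusters.
labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 4)]
cannot_link = [(1, 2), (2, 3)]
result = violated_constraints(labels, must_link, cannot_link)
print(result)
```

A constrained clustering method could treat such violations as hard failures (constraints to enforce) or as penalty terms (constraints to encourage), matching the two options described in the text.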
Cluster ensemble methods are a recent and very promising research area in clustering. Introduced in 2001, they have received increasing attention from the scientific community, from both theoretical and application perspectives. Our goal is to make novel contributions to this topic, in particular addressing the following issues: (a) processing of large data sets, focusing on efficient and scalable algorithms; (b) dimensionality reduction, through feature selection and extraction techniques; (c) use of a priori information, in a semi-supervised perspective. The last issue will be addressed by incorporating and/or extending existing single clustering methods into the ensemble framework, but most importantly by designing new methods directly driven by the ensemble perspective and the evidence accumulation paradigm. Building on previous work on multi-criteria evidence accumulation and dissimilarity-based clustering, we will extend that work to take the above issues into account, and further to support local feature selection (identification of relevant features at the cluster level). Cluster validation methods, exploring stability-based and information-theoretic concepts, will play an important role in the above-mentioned research topics.
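The evidence accumulation idea can be sketched as follows: each partition in the ensemble votes on whether two objects belong together, the votes are collected in a co-association matrix, and a consensus partition is extracted from that matrix. This is a minimal sketch under stated assumptions, not the project's implementation: the function names, the toy ensemble, and the fixed 0.5 threshold are all illustrative, and the consensus step here is a simple threshold cut (connected components), standing in for the hierarchical clustering usually applied to the co-association matrix.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_partition(coassoc, threshold=0.5):
    """Link pairs whose co-association exceeds the threshold and return
    the connected components as cluster labels (assumed consensus rule)."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy ensemble (illustrative data only): three partitions of six objects.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 2],
]
coassoc = co_association(ensemble)
consensus = consensus_partition(coassoc)
print(consensus)
```

The double loop over object pairs is quadratic in the number of objects, which is one reason scalability to large data sets, issue (a) above, matters for this family of methods.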
Reference: PTDC/EIACCO/103230/2008
Funding: FCT/PTDC
Start Date: 01-01-2010
End Date: 01-03-2013
Team: Ana Luisa Nobre Fred, Mario Alexandre Teles de Figueiredo, Andre Ribeiro Lourenco, Artur Jorge Ferreira, Joaquim Belo Lopes Filipe
Groups: Pattern and Image Analysis – Lx
Partners: INSTICC - Institute for Systems and Technologies of Information, Control and Communication, Instituto Superior de Engenharia do Porto (ISEP/IPP)
Local Coordinator: Ana Luisa Nobre Fred
Associated Publications