Unsupervised Organisation of Scientific Documents

Lourenço, A. ; Medina, LM ; Fred, A. L. N. ; Filipe, JBF

Unsupervised Organisation of Scientific Documents, Proc. International Conf. on Knowledge Discovery and Information Retrieval - KDIR, Paris, France, Vol. ., pp. 557-568, October, 2011.

Unsupervised organisation of documents, and in particular research papers,
into meaningful groups is a difficult problem. The typical
vector-space-model representation (bag-of-words paradigm) suffers from
intrinsically high dimensionality, highly redundant features, and a
lack of semantic information. In this work we propose a document
representation relying on a statistical feature reduction step and an
enrichment phase that introduces higher-abstraction terms,
designated as metaterms, derived from the text using the papers'
topics and keywords as prior knowledge. The proposed representation, combined with a clustering
ensemble approach, leads to a novel document organisation strategy. We
evaluate the proposed approach on conference papers as the application domain,
with topic information extracted from conference topics or areas.
Performance evaluation on data sets from the NIPS and INSTICC conferences shows
that the proposed approach leads to interesting and encouraging results.
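
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the document-frequency filter standing in for the statistical feature reduction step, the hypothetical `metaterms` dictionary mapping each metaterm to trigger words, the hand-written base partitions, and the co-association (evidence accumulation) scheme used to combine them. The function names are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def bag_of_words(docs):
    """Tokenise on whitespace and count terms per document."""
    return [Counter(doc.lower().split()) for doc in docs]


def prune_by_document_frequency(bows, min_df, max_df):
    """Assumed reduction step: keep terms with min_df <= document frequency <= max_df."""
    # Iterating a Counter yields each distinct term once, so this counts documents.
    df = Counter(term for bow in bows for term in bow)
    keep = {t for t, n in df.items() if min_df <= n <= max_df}
    return [Counter({t: c for t, c in bow.items() if t in keep}) for bow in bows]


def add_metaterms(pruned, raw, metaterms):
    """Add one 'META:<name>' feature per metaterm, counting its trigger words in the raw text."""
    enriched = []
    for kept, full in zip(pruned, raw):
        features = Counter(kept)
        for name, triggers in metaterms.items():
            hits = sum(full[w] for w in triggers)  # a missing term counts as 0
            if hits:
                features["META:" + name] = hits
        enriched.append(features)
    return enriched


def co_association(partitions):
    """Fraction of base partitions in which each pair of documents shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus(co, threshold=0.5):
    """Merge documents whose co-association exceeds the threshold (connected components)."""
    n = len(co)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)
    roots = {}
    # Relabel components 0, 1, 2, ... in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy data, invented for illustration only.
docs = [
    "kernel svm margin classification",
    "svm kernel regression margin",
    "neuron spike cortex model",
    "cortex neuron spike recording",
]
metaterms = {  # hypothetical metaterm -> trigger words
    "learning": {"svm", "kernel", "classification", "regression"},
    "neuroscience": {"neuron", "spike", "cortex"},
}

raw = bag_of_words(docs)
pruned = prune_by_document_frequency(raw, min_df=2, max_df=3)
enriched = add_metaterms(pruned, raw, metaterms)

# Hand-written base partitions standing in for the output of several clustering runs.
partitions = [[0, 0, 1, 1], [0, 0, 1, 2], [0, 1, 2, 2]]
labels = consensus(co_association(partitions))
```

In this toy run the rare terms are pruned from the vocabulary, yet they still contribute to the metaterm counts, which is one way an enrichment phase could recover information that feature reduction discards. In practice the base partitions would come from clustering the enriched vectors with different algorithms or parameters.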