Feature Transformation and Reduction for Text Classification
Figueiredo, M. A. T.
Feature Transformation and Reduction for Text Classification, Proc. International Workshop on Pattern Recognition in Information Systems, Funchal, Portugal, pp. 72-81, June 2010.
Text classification is an important tool for many applications, in supervised,
semi-supervised, and unsupervised scenarios. In order to be processed
by machine learning methods, a text (document) is usually represented as a
bag-of-words (BoW). A BoW is a large vector of features (usually stored as
floating-point values), each of which represents the relative frequency of
occurrence of a given word/term in the document. Typically, the number of
features is large and many of them may be non-informative for classification
tasks, hence the need for feature transformation, reduction, and selection. In this paper, we propose
two efficient algorithms for feature transformation and reduction for BoW-like
representations. The proposed algorithms rely on simple statistical analysis of
the input pattern, exploiting the BoW and its binary version. The algorithms are
evaluated with support vector machine (SVM) and AdaBoost classifiers on standard
benchmark datasets. The experimental results show the adequacy of the
reduced/transformed binary features for text classification problems, as well as
an improvement in the test set error rate when the proposed methods are used.
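
To make the BoW representation and its binary version concrete, the following is a minimal Python sketch. It is an illustration only, not the algorithms proposed in the paper: the function names `bow_vectors` and `drop_uninformative`, the whitespace tokenizer, the toy documents, and the document-frequency filter are all assumptions introduced here to show what a "simple statistical analysis" over binary features could look like.

```python
from collections import Counter


def bow_vectors(docs):
    """Return (vocabulary, relative-frequency BoW rows, binary BoW rows).

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    freq_rows, binary_rows = [], []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        # Relative frequency of each vocabulary term in this document.
        freq_rows.append([counts[t] / total for t in vocab])
        # Binary version: 1 if the term occurs at all, else 0.
        binary_rows.append([1 if counts[t] > 0 else 0 for t in vocab])
    return vocab, freq_rows, binary_rows


def drop_uninformative(vocab, binary_rows):
    """Illustrative filter (an assumption, not the paper's method):
    drop terms that occur in every document or in none, since they
    cannot help separate one document from another."""
    n = len(binary_rows)
    keep = [j for j in range(len(vocab))
            if 0 < sum(row[j] for row in binary_rows) < n]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in binary_rows]


docs = ["the cat sat", "the dog sat down", "the bird flew"]
vocab, freq, binary = bow_vectors(docs)
kept_vocab, reduced = drop_uninformative(vocab, binary)
```

In this toy corpus the term "the" appears in all three documents, so the filter removes it and the reduced binary vectors have six features instead of seven. A real reduction method would use class labels or a more refined statistic; this sketch only shows the data layout such a method operates on.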