Dimensionality Reduction and Feature Selection in Information Retrieval

Dimensionality Reduction

To reduce the size of the feature vector, the following two methods are used: feature selection and feature extraction.


What is feature selection?

In information retrieval, where enormous amounts of data must be processed, selecting relevant features plays an important role. So much so that it sometimes matters more than the choice of classification technique.
Feature selection is the process of selecting relevant features for constructing the classification model. It can be seen as choosing a subset of all the features to represent the data.

Why feature selection?

There are a few important reasons, which show the significance of feature selection:
  • removal of irrelevant features (noise)
  • mitigating the curse of dimensionality (e.g., sparseness)
  • removal of redundant features
  • simpler, less complex training of the model
All of this is done while not losing much of the information.

Steps in feature selection

A feature selection technique tries to find the best possible subset of features, so it can be framed as an optimization problem.
Step 1: search the space of possible subsets of features
Step 2: select the subset that gives the desired (near-optimal) performance
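The two steps above can be sketched as a tiny exhaustive search. This is only a toy illustration with a made-up `score` function; on real data the number of subsets grows as 2^n, so greedy or heuristic search is used instead of enumerating everything.

```python
# Toy exhaustive subset search (illustrative only): enumerate every
# non-empty subset of 4 features and keep the best-scoring one.
from itertools import combinations

def score(subset):
    # Stand-in for a real evaluation measure (wrapper or filter).
    # Here we simply prefer subsets containing features 0 and 2,
    # with a small penalty per extra feature.
    return len({0, 2} & set(subset)) - 0.1 * len(subset)

features = range(4)
best = max(
    (c for r in range(1, 5) for c in combinations(features, r)),
    key=score,
)
print(best)  # → (0, 2)
```

In practice `score` would be one of the evaluation measures described next, and the enumeration would be replaced by a cheaper search strategy.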

Evaluation Measures 

  1. Wrapper Methods
  2. Filter Methods

Wrapper Methods

  • This evaluation depends on the classification model used
  • predictive accuracy, typically estimated by cross-validation, is used for evaluation
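A common wrapper strategy is greedy forward selection: repeatedly add the feature that most improves the model's estimated accuracy. The sketch below is a minimal illustration; `cv_score` is a hypothetical stand-in for cross-validated accuracy of the actual classifier on the chosen columns.

```python
# Minimal wrapper-style forward selection (illustrative names throughout).
def cv_score(subset):
    # Pretend cross-validated accuracy gains for this toy example; in
    # practice this would train and evaluate the real classification model.
    useful = {1: 0.70, 3: 0.15, 0: 0.05}
    return sum(useful.get(f, 0.0) for f in subset)

def forward_select(n_features, min_gain=0.01):
    selected, current = [], 0.0
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda f: cv_score(selected + [f]))
        gain = cv_score(selected + [best]) - current
        if gain < min_gain:  # stop when no candidate helps enough
            break
        selected.append(best)
        current += gain
    return selected

print(forward_select(4))  # → [1, 3, 0]
```

Because the model is retrained for every candidate subset, wrapper methods are accurate but expensive compared to filters.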

Filter Methods

  • This evaluation is independent of the classification model used
  • information-theoretic measures (e.g., mutual information) are used to evaluate how informative a subset of features is

Feature Extraction

This goes beyond merely finding a subset of features: a function is applied to the whole feature set to combine the original features into a smaller number of new ones.
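Principal component analysis (PCA) is a standard example of such a combining function: each new feature is a linear combination of all original features. A minimal sketch using NumPy's SVD on synthetic data (the data here is contrived so that three features carry only one dimension of information):

```python
# Feature extraction via PCA: reduce 3 correlated features to 1 derived
# feature that is a linear combination of all 3.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = 2 * X[:, 0]          # make the data effectively one-dimensional
X[:, 2] = -X[:, 0]

Xc = X - X.mean(axis=0)        # center the data before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:1].T              # project onto the first principal component

print(Z.shape)                 # (100, 1): 3 features combined into 1
```

Unlike feature selection, the derived column `Z` is not one of the original features, so interpretability is traded for a more compact representation.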

 
