Dimensionality Reduction and Feature Selection in Information Retrieval

Dimensionality Reduction

To reduce the size of the feature vector, the following two methods are used: feature selection and feature extraction.


What is feature selection?

In information retrieval, where enormous amounts of data must be processed, selecting relevant features plays an important role. So much so that it sometimes matters more than the choice of classification technique.
Feature selection is the process of selecting relevant features for constructing the classification model. It can be seen as choosing a subset of all the features to represent the data.

Why feature selection?

There are a few important reasons, which show the significance of feature selection:
  • removal of irrelevant features (noise)
  • mitigating the curse of dimensionality (e.g., sparseness)
  • removal of redundant features
  • simpler, less complex training of the model
All of this is done while not losing much of the information.

Steps in feature selection

A feature selection technique tries to find the best possible subset of features, so it can be framed as an optimization problem.
Step 1: search the space of possible subsets of features
Step 2: select the subset that gives the desired (near-optimal) performance
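The two steps above can be sketched as a tiny exhaustive search. This is only a toy illustration with a made-up `score` function; on real data the number of subsets grows as 2^n, so greedy or heuristic search is used instead of enumerating everything.

```python
# Toy exhaustive subset search (illustrative only): enumerate every
# non-empty subset of 4 features and keep the best-scoring one.
from itertools import combinations

def score(subset):
    # Stand-in for a real evaluation measure (wrapper or filter).
    # Here we simply prefer subsets containing features 0 and 2,
    # with a small penalty per extra feature.
    return len({0, 2} & set(subset)) - 0.1 * len(subset)

features = range(4)
best = max(
    (c for r in range(1, 5) for c in combinations(features, r)),
    key=score,
)
print(best)  # → (0, 2)
```

In practice `score` would be one of the evaluation measures described next, and the enumeration would be replaced by a cheaper search strategy.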

Evaluation Measures 

  1. Wrapper Methods
  2. Filter Methods

Wrapper Methods

  • This evaluation depends on the classification model used
  • predictive accuracy, typically estimated by cross-validation, is used for evaluation
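A common wrapper strategy is greedy forward selection: repeatedly add the feature that most improves the model's estimated accuracy. The sketch below is a minimal illustration; `cv_score` is a hypothetical stand-in for cross-validated accuracy of the actual classifier on the chosen columns.

```python
# Minimal wrapper-style forward selection (illustrative names throughout).
def cv_score(subset):
    # Pretend cross-validated accuracy gains for this toy example; in
    # practice this would train and evaluate the real classification model.
    useful = {1: 0.70, 3: 0.15, 0: 0.05}
    return sum(useful.get(f, 0.0) for f in subset)

def forward_select(n_features, min_gain=0.01):
    selected, current = [], 0.0
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda f: cv_score(selected + [f]))
        gain = cv_score(selected + [best]) - current
        if gain < min_gain:  # stop when no candidate helps enough
            break
        selected.append(best)
        current += gain
    return selected

print(forward_select(4))  # → [1, 3, 0]
```

Because the model is retrained for every candidate subset, wrapper methods are accurate but expensive compared to filters.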

Filter Methods

  • This evaluation is independent of the classification model used
  • information-theoretic measures (e.g., mutual information) are used to evaluate how informative a subset of features is

Feature Extraction

This goes beyond merely finding a subset of features: a function is applied to the whole feature set to combine the original features into a smaller number of new ones.
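Principal component analysis (PCA) is a standard example of such a combining function: each new feature is a linear combination of all original features. A minimal sketch using NumPy's SVD on synthetic data (the data here is contrived so that three features carry only one dimension of information):

```python
# Feature extraction via PCA: reduce 3 correlated features to 1 derived
# feature that is a linear combination of all 3.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = 2 * X[:, 0]          # make the data effectively one-dimensional
X[:, 2] = -X[:, 0]

Xc = X - X.mean(axis=0)        # center the data before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:1].T              # project onto the first principal component

print(Z.shape)                 # (100, 1): 3 features combined into 1
```

Unlike feature selection, the derived column `Z` is not one of the original features, so interpretability is traded for a more compact representation.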

 
