Preprocessing Affects Sentiment Learning!


Introduction

Preprocessing is a crucial step that converts unstructured text into a structured form. Nowadays a huge amount of data flows through social websites and other channels. From this data we can extract useful information and train information retrieval models for almost any requirement. But the useful data is mixed with data that carries no useful information, and training on all of it takes a long time. Preprocessing techniques are needed to extract the data that is actually useful for the model being trained.


Fig - Basic Sentiment Learning Process


Problem

Preprocessing depends on the programmer's skill. If the programmer drops something useful while extracting the data, the trained model can suffer badly. If the programmer keeps redundant and useless data, it can exhaust memory and hurt the model as well. This step therefore needs proper care and attention.


Fig - Preprocessing Steps Important!

Enhancement in Sentiment Learning

During preprocessing, we usually eliminate the following from our corpus:
  1. Stop words
  2. Punctuation
  3. Hashtags
If we do not remove these items while preprocessing data for a sentiment learning model, the accuracy of the system can actually improve. Surprising!
Yes, it is correct! These items are useful data for a sentiment model. We generally remove them, and that hurts the learning process a lot. Nowadays social channels like Twitter and Facebook support hashtags and smileys. These channels hold a large amount of data, enough to train a sentiment learner. Hashtags and smileys are an important part of this data because they automatically label posts or tweets with specific classes.
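To make the labelling idea concrete, here is a minimal sketch of how hashtags and smileys could act as ready-made class labels. The marker lists and the function name `label_from_markers` are assumptions made up for this illustration; they are not taken from the referenced paper, and a real system would need much larger lists.

```python
# Hypothetical marker lists, chosen only for illustration.
POSITIVE_MARKERS = {":)", ":-)", ":D", "#happy", "#love"}
NEGATIVE_MARKERS = {":(", ":-(", "#sad", "#angry"}


def label_from_markers(post):
    """Return 'positive', 'negative', or None based on markers in the post."""
    tokens = set()
    for token in post.split():
        # Hashtags are treated as case-insensitive; smileys such as ":D" are not.
        tokens.add(token.lower() if token.startswith("#") else token)

    has_positive = bool(tokens & POSITIVE_MARKERS)
    has_negative = bool(tokens & NEGATIVE_MARKERS)
    if has_positive == has_negative:
        # Either no marker at all or conflicting markers: leave unlabelled.
        return None
    return "positive" if has_positive else "negative"


posts = [
    "Great match today :) #Happy",
    "Missed my train :(",
    "Just a normal post",
]
labelled = [(post, label_from_markers(post)) for post in posts]
```

The sketch splits on whitespace only, so a marker glued to other characters (for example `#happy!`) is missed. Posts with no marker or with conflicting markers stay unlabelled rather than being guessed.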

Smileys are combinations of punctuation characters, and we usually remove punctuation from our corpus or dataset. In this case, however, smileys also sort posts accurately into classes. These channels thus give us labelled data directly and help improve our system.
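The conflict between punctuation removal and smileys can be seen in a small sketch. The smiley pattern below is a deliberately tiny assumption that covers only a few common forms, and both function names are hypothetical.

```python
import re
import string

# Hypothetical, deliberately small patterns; real data needs a fuller smiley list.
SMILEY = r"[:;=]-?[()DPp]"
HASHTAG = r"#\w+"
WORD = r"\w+"
TOKEN_RE = re.compile("|".join([SMILEY, HASHTAG, WORD]))


def strip_all_punctuation(text):
    """Conventional step: delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def tokenize_keeping_markers(text):
    """Keep smileys and hashtags as whole tokens; drop other punctuation."""
    return TOKEN_RE.findall(text)


text = "Loved the movie :) #awesome!!!"
plain_tokens = strip_all_punctuation(text).split()
marker_tokens = tokenize_keeping_markers(text)
```

With the conventional step, the smiley disappears and the hashtag becomes an ordinary word. The second tokenizer keeps `:)` and `#awesome` as tokens and still drops the trailing exclamation marks.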

Hence the preprocessing steps depend on the problem we want to solve. Even a single preprocessing step can change the results drastically, for example from 90% accuracy to 10%.
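One way to treat preprocessing as a per-problem choice is to put each step behind a switch, so the same corpus can be prepared differently for different tasks. This is only a sketch: the stop-word list, the regular expressions, and the `preprocess` function with its parameters are assumptions for illustration.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "the", "is", "was", "not", "no"}

# Small assumed patterns: a few smiley shapes, hashtags, and plain words.
MARKER_OR_WORD = re.compile(r"[:;=]-?[()DPp]|#\w+|\w+")
WORD_ONLY = re.compile(r"\w+")
HASHTAG = re.compile(r"#\w+")


def preprocess(text, remove_stop_words=True, keep_markers=False):
    """Tokenize text, with each cleaning step controlled by a flag."""
    if keep_markers:
        tokens = MARKER_OR_WORD.findall(text)
    else:
        # Conventional cleaning: drop hashtags entirely, then keep words only.
        tokens = WORD_ONLY.findall(HASHTAG.sub(" ", text))

    # Lowercase words and hashtags, but leave smileys such as ":D" untouched.
    tokens = [t.lower() if t[0].isalnum() or t[0] == "#" else t for t in tokens]

    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


text = "The movie was not good :("
conventional = preprocess(text)
for_sentiment = preprocess(text, remove_stop_words=False, keep_markers=True)
```

With the default settings the sentence is reduced to `movie` and `good`, which reads as positive. Keeping the stop word `not` and the smiley `:(` preserves the negative signal.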

Conclusion

The approach discussed above, keeping smileys and punctuation instead of removing them, improves the accuracy of sentiment learning considerably. After preprocessing we should think about the models we will apply to classify the events. These models are also important, but they give accurate results only when the preprocessing is not faulty. If the preprocessing is faulty, even our best multimodal classification system will not help.


REFERENCES

  1. D. Davidov, O. Tsur, and A. Rappoport. Enhanced Sentiment Learning Using Twitter Hashtags and Smileys.
