HIDDEN IN PLAIN SIGHT: Email Classification Using Image-Based Content

Consider this scenario: you are in a store buying something when you suddenly get a notification. You open it. It's your virtual assistant (maybe Siri or Cortana) telling you about an offer on the product you are about to buy. Surprised? Yes, it's possible to make your assistant that helpful.

Nowadays, we get a lot of offers via email. Many of them result from our activity on online platforms: shopping transactions, movie ticket bookings, hotel and restaurant bookings, or any other kind of online purchase. Hence, we are all interested in getting promotional offers related to these activities.

The Problem?

Today, about half of the world's population is familiar with the internet and has access to email. Subscribing to promotional commercial emails is therefore an easy way to receive offers. But the sheer number of these emails makes it very hard to find the offer that is relevant to the user's need.

The authors of the paper [1] present a way to use OCR (Optical Character Recognition) techniques and information retrieval concepts to highlight offers that may be expiring soon, without the user having to read all such emails.

"A" solution.

A key challenge here is that many of these emails have image-rich content. Training a machine learning model directly on such content is very expensive. Hence, the authors propose an approach that uses OCR and information retrieval concepts to tackle the problem.

The following key points are observed:

(1) The offers are mainly present in the image section of an email. Offers are presented visually in order to have more influence on the intended user.

(2) Large-scale organizational email campaigns (run by big companies) follow conventions such as the alt-text attribute and Microdata markup in HTML.

Many small companies fail to make use of these conventions.

Hence, the proposed solution uses:

(1) Template induction techniques, which help reduce the cost of analyzing the OCR output extracted from the images.

(2) The structural template of the emails, which is leveraged to improve the model's decisions.

Template Induction

These commercial promotional emails are machine generated. For example, after a sale, a retail store uses a generic template to generate the bill and sends it over email. This association of emails with templates provides a pattern that makes information extraction easier. Template induction techniques may include general similarity-based measures, or user-based clustering [2] followed by deduction based on a regular expression over the subject. Another technique is to cluster HTML bodies based on the structure of the document.
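To make the idea concrete, here is a minimal sketch of template induction, not the authors' implementation. It assumes emails arrive as dictionaries with hypothetical `id`, `sender`, `subject` and `html` fields, and it groups them by a digit-masked subject pattern plus the sequence of HTML tags in the body. The paper's actual clustering is likely more sophisticated.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects the sequence of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def subject_pattern(subject):
    """Mask the variable parts (runs of digits) of a subject line."""
    return re.sub(r"\d+", "<NUM>", subject.strip().lower())


def structure_signature(html):
    """Reduce an HTML body to its tag sequence, a crude structural fingerprint."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return "/".join(parser.tags)


def induce_templates(emails):
    """Group emails that share sender, subject pattern and HTML structure."""
    templates = defaultdict(list)
    for email in emails:
        key = (
            email["sender"],
            subject_pattern(email["subject"]),
            structure_signature(email["html"]),
        )
        templates[key].append(email["id"])
    return dict(templates)


# Hypothetical sample data: two order confirmations and one promotion.
emails = [
    {"id": 1, "sender": "shop.example", "subject": "Order 1234 confirmed",
     "html": "<html><body><p>Hi</p><img src='a.png'></body></html>"},
    {"id": 2, "sender": "shop.example", "subject": "Order 987 confirmed",
     "html": "<html><body><p>Hello</p><img src='b.png'></body></html>"},
    {"id": 3, "sender": "shop.example", "subject": "Flat 20% off today",
     "html": "<html><body><img src='c.png'></body></html>"},
]
groups = induce_templates(emails)
```

Once emails are grouped this way, an expensive step such as OCR only needs to be analyzed per template rather than per email, which is where the cost saving comes from.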

Vertical Classification

Training phase- For training, the authors manually created a set of heuristics. These heuristics are based on a small number of high-volume sender domains. With the help of these heuristics (regular expressions and XPath queries), a machine learning model is trained on different email templates with their manual labels. To handle the possibility of an email belonging to multiple categories, a binary probabilistic model is used for each category classifier.
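The labeling step can be sketched as follows. This is only an illustration under stated assumptions: the category names, regular expressions and XPath queries in `HEURISTICS` are hypothetical, and the body is assumed to be well-formed XML so that the standard library parser can read it. Each category gets its own independent yes/no decision, so one email can carry several labels. The classifier that would later be trained on these labels is not shown.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical heuristics: one subject regex and one XPath query per category.
HEURISTICS = {
    "promotion": {
        "subject": re.compile(r"\d+\s*% off|\bsale\b|\bcoupon\b", re.IGNORECASE),
        "xpath": ".//img[@class='banner']",
    },
    "receipt": {
        "subject": re.compile(r"order\s+#?\d+|\breceipt\b|\binvoice\b", re.IGNORECASE),
        "xpath": ".//table[@class='order-summary']",
    },
}


def label_template(subject, body_xml):
    """Return an independent True/False label for every category."""
    root = ET.fromstring(body_xml)
    labels = {}
    for category, rule in HEURISTICS.items():
        subject_hit = rule["subject"].search(subject) is not None
        markup_hit = root.find(rule["xpath"]) is not None
        labels[category] = subject_hit or markup_hit
    return labels


both = label_template(
    "Your receipt for order 4412, plus 10% off next time",
    "<html><body><table class='order-summary'><tr><td>Total</td></tr></table></body></html>",
)
neither = label_template(
    "Weekly newsletter",
    "<html><body><p>Hello</p></body></html>",
)
```

In the first call both categories fire, which is the multi-category situation that motivates one binary model per category instead of a single multi-class model.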

Features- The HTML tags and their counts are treated as features. The textual content (the email's own text along with the text extracted from images by OCR) makes up the rest of the feature space. Standard preprocessing steps are applied to the text (stopword removal, punctuation handling, tokenization, stemming, etc.).
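A minimal sketch of this feature extraction, using only the Python standard library, could look like the following. The stopword list is a tiny hypothetical one, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer, so treat both as placeholders and not as what the paper used.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "on", "and", "of", "to", "is", "in", "for"}


class FeatureParser(HTMLParser):
    """Counts opening tags and collects the visible text of an email body."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop punctuation, tokenize, remove stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def extract_features(html):
    """Return (tag counts, word counts) for one email body."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    words = Counter(preprocess(" ".join(parser.text_parts)))
    return parser.tag_counts, words


tags, words = extract_features(
    "<html><body><p>Big savings on jackets!</p>"
    "<p>Offer ends Friday.</p><img src='x.png'></body></html>"
)
```

Here `tags` holds the structural features (for example, how many `p` and `img` tags appear) and `words` holds the preprocessed text features.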

Evaluation- The experimental setup in the paper shows that the proposed method helps extract ~9% more templates than the text-based approach. Finally, the model trained with the OCR-based features performs slightly better than the conventional approach.

OCR

The use of OCR on images to obtain textual data is explained in the following flow diagram. The OCR output is converted into the same form as the email body text to build the feature space. Thus, two bag-of-words models are created: one containing the words extracted by OCR and another containing the words from the email body. This combined input is given to the classifier model.
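The two bags of words can be combined as in the sketch below. It assumes the OCR step has already run and its output is available as plain strings (no particular OCR engine is implied). The `body:` and `ocr:` prefixes used to keep the two vocabularies apart are an assumption made for illustration, not a detail taken from the paper.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its tokens."""
    return Counter(re.findall(r"[a-z0-9%]+", text.lower()))


def combine_features(body_text, ocr_texts):
    """Build one feature dict from the body bag and the OCR bag.

    Prefixes keep the two vocabularies separate, so the same word can
    carry a different weight depending on where it appeared.
    """
    body_bag = bag_of_words(body_text)
    ocr_bag = Counter()
    for text in ocr_texts:  # one string per image in the email
        ocr_bag.update(bag_of_words(text))

    features = {f"body:{word}": count for word, count in body_bag.items()}
    features.update({f"ocr:{word}": count for word, count in ocr_bag.items()})
    return features


# Hypothetical email: almost no body text, the offer lives in the images.
features = combine_features(
    "View this offer in your browser",
    ["50% OFF all shoes", "Offer ends June 30"],
)
```

In this example the discount and the expiry date appear only under the `ocr:` prefix, which is exactly the information a text-only model would miss.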

Conclusion

Considering that nearly 269 billion emails are received each day, an improvement of nearly 9% can be very cost-effective. The proposed technique shows how the images present in emails, together with the email template structure, can be used to achieve better email-based offer personalization.

References

[1] Potti, Navneet, James B. Wendt, Qi Zhao, Sandeep Tata, and Marc Najork. "Hidden in Plain Sight." (2018).

[2] Wendt, James B., Michael Bendersky, Lluis Garcia-Pueyo, Vanja Josifovski, Balint Miklos, Ivo Krka, Amitabh Saikia, Jie Yang, Marc-Allen Cartright, and Sujith Ravi. "Hierarchical label propagation and discovery for machine generated email." In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 317-326. ACM, 2016.

IIITD IR MELANAGE
