Text Summarization

Much crucial information exists in the form of unstructured data, so identifying and extracting the relevant parts of it is an important task. Every day a huge amount of data is generated in the form of emails, webpages, articles, social media posts and so on. We cannot manually comb through all of it; at best we can read the title or maybe a summary.

There is a great need for automatic summarization tools that generate summaries capturing the essence of a text document. These tools are helpful because their summaries are not colored by an individual reader's preferences, and because, being automatic, they can process a large number of documents for screening or other purposes. The summaries they generate must be accurate, must resemble manually written summaries, and must each read as a standalone document.

Summarization is everywhere in daily life: headlines and headings for text documents, reviews of songs and movies, presentations made from a document, and so on. All of these are summarization tasks, and producing such summaries is the goal of automatic summarization tools.

Some of the earlier techniques for text summarization were based on location: text at the beginning of the document or at the beginning of a paragraph was treated as more relevant. The presence or absence of certain words, such as predetermined cue words, tag words, or words that occur in the headings, was also used to rank sentences.
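
As a rough illustration of these early heuristics, the sketch below scores each sentence by its position and by the cue words and heading words it contains. The cue-word list, the weights (1.0 and 0.5), the naive sentence splitter, and the function names are all assumptions made for this example; they are not taken from any particular system.

```python
import re

# Hypothetical cue words, chosen only for illustration.
CUE_WORDS = {"important", "significant", "conclusion", "summary"}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def score_sentences(text, title=""):
    """Return (score, sentence) pairs from position, cue-word and heading features."""
    title_words = set(re.findall(r"[a-z']+", title.lower()))
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    scored = []
    for p_idx, paragraph in enumerate(paragraphs):
        for s_idx, sentence in enumerate(split_sentences(paragraph)):
            words = set(re.findall(r"[a-z']+", sentence.lower()))
            score = 0.0
            if s_idx == 0:
                score += 1.0  # first sentence of its paragraph
            if p_idx == 0:
                score += 1.0  # located in the first paragraph of the document
            score += 0.5 * len(words & CUE_WORDS)    # cue words present
            score += 0.5 * len(words & title_words)  # words shared with the heading
            scored.append((score, sentence))
    return scored
```

Sorting the returned pairs by score and keeping the top few sentences would give a simple position-based summary. In practice the weights would have to be tuned for the kind of document being summarized.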

Later, more complex methods were used for document summarization. They can be broadly divided into two categories:
  • Extraction: In extraction, no new sentence is generated. The summary is built by identifying important and relevant sentences or sections and using them verbatim.
  • Abstraction: In abstraction, new sentences are generated using advanced NLP techniques that interpret and examine the original sentences. The goal is to generate shorter sentences that summarize that section of the document.
Most techniques in use today are extractive, and extractive techniques have generally given better results than abstractive ones. Even humans usually extract the important sections first and then perhaps use abstraction to write a summary. Similarly, most abstractive techniques are not purely abstractive; they also involve some degree of extraction.
These methods generally work in three steps:
  • Construct an intermediate representation
  • Rank sentences according to this representation
  • Generate a summary.
Many techniques, such as topic words, Bayesian word models, and frequency-based approaches, are used to carry out these steps.
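
The three steps above can be sketched with a simple frequency-based approach. This is a minimal example under stated assumptions, not a definitive implementation: the stop-word list is a small illustrative one, sentences are split naively on punctuation, and a sentence's score is assumed to be the average document frequency of its words.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it", "on", "for"}


def tokenize(sentence):
    """Lowercase the sentence and keep the words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, num_sentences=2):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Step 1: intermediate representation = word frequencies over the document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    # Step 2: rank sentences by the average frequency of their words.
    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)

    # Step 3: generate the summary from the top sentences, in original order.
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the selected sentences are copied verbatim, this is an extractive summarizer. Averaging rather than summing the word frequencies is a design choice that keeps long sentences from winning on length alone.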

Recently, deep learning has also produced impressive results on tasks such as headline generation. These deep models are purely abstractive and data driven. Compared with existing models, which require some preprocessing or specific sub-models for different kinds of data, deep models are entirely data driven, which is a big advantage.





