Document Summarization Approaches

Introduction


Document summarization is an important part of Information Retrieval (IR): most readers today would rather
get a summary and an idea of the content than read a large document that may turn out to be irrelevant.
A document summarization algorithm returns only the most crucial information present in the document.
Document summarization can be classified into two types: extractive and abstractive.
Abstractive summarization is like how humans summarize a document. It can contain new words, sentences and
phrases, whereas extractive summarization just picks important phrases or sentences from a particular
document and doesn’t contain new words or phrases. Abstractive summarization is much harder to implement
than extractive. In this blog we will talk about the extractive part.


Methods for extractive summarization


1) Rule-based approach - Sentences or phrases are extracted according to hand-coded rules. For example, sentences
that contain a title word, the first sentence of a paragraph, or sentences that contain the most words from a given
list of words. These sentences are generally considered important. The drawback is that a separate set of rules has
to be defined for each domain.
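
The rules above can be sketched in a few lines of Python. This is only a minimal illustration, assuming two rules
(shares a word with the title, or contains enough words from a keyword list); the function name rule_based_pick and
the min_hits parameter are made up for this example.

```python
import re

def rule_based_pick(sentences, title, keywords, min_hits=1):
    """Keep sentences that share a word with the title or contain enough keywords."""
    title_words = set(re.findall(r"\w+", title.lower()))
    keyword_set = {k.lower() for k in keywords}
    picked = []
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        has_title_word = bool(words & title_words)   # rule 1: title word present
        keyword_hits = len(words & keyword_set)      # rule 2: count of keyword matches
        if has_title_word or keyword_hits >= min_hits:
            picked.append(sentence)
    return picked

sentences = [
    "Summarization saves time.",
    "The weather is nice.",
    "Extractive methods pick sentences.",
]
picked = rule_based_pick(sentences, "Document Summarization", ["extractive"])
```

A real rule set would also filter stop words from the title, otherwise a title word like "the" matches almost every
sentence.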


2) Statistical approach - It uses statistics to decide whether a sentence is important enough to be in the summary.
For example, it can use frequency as a feature (tf-idf) or the sentence's location as a feature, and weight sentences accordingly.


3) Hybrid model used in this paper
The model used in this paper for extractive summarization uses features like location, tf-idf, aggregate similarity and
semantic (meaning-related) features.


a) Location feature: P. Baxendale gave a formula for scoring a sentence based on its location.

Where i refers to the position of the sentence in the document and N is the number of sentences in the document.
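
To make the idea of a position-based score concrete, here is a small sketch. Note that the linear decay used below is
an assumption chosen only for illustration; it is not Baxendale's formula. The only things taken from the text are the
symbols i (1-based position) and N (number of sentences).

```python
def location_scores(num_sentences):
    """Illustrative stand-in: earlier sentences get higher scores (linear decay).

    ASSUMPTION: score(i) = (N - i + 1) / N, with i the 1-based position.
    This is NOT the formula from the paper.
    """
    n = num_sentences
    return [(n - i + 1) / n for i in range(1, n + 1)]

scores = location_scores(4)  # [1.0, 0.75, 0.5, 0.25]
```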


b) Frequency feature (tf-idf): term frequency and inverse document frequency are used as features.
Where tfi is the term frequency of the ith word, ND is the number of documents, and idf is the inverse document frequency of the ith word.
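
A minimal sketch of a tf-idf sentence score, assuming the common textbook form tf * log(ND / df), where df is the
number of documents that contain the word. The paper's exact weighting may differ, and the helper names below are made
up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tfidf_sentence_scores(sentences, documents):
    """Score each sentence as the sum of tf * log(ND / df) over its words."""
    nd = len(documents)
    # df[w] = number of documents that contain word w
    df = Counter(w for doc in documents for w in set(tokenize(doc)))
    scores = []
    for sentence in sentences:
        tf = Counter(tokenize(sentence))  # term frequency inside the sentence
        score = sum(
            count * math.log(nd / df[w])
            for w, count in tf.items()
            if df[w] > 0  # skip words never seen in the collection
        )
        scores.append(score)
    return scores

documents = ["the cat sat", "the dog barked", "the cat purred"]
scores = tfidf_sentence_scores(["the cat", "the dog"], documents)
```

Here "the" appears in every document, so it contributes log(3/3) = 0, while the rarer "dog" contributes more than "cat".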


c) Aggregate similarity feature: It is used to find the sentence that is similar to most of the other sentences in the
document. A sentence's score is calculated from the sentences it is similar to.
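
One way to sketch this feature is to sum the similarity of a sentence to every other sentence. The example below
assumes bag-of-words vectors and cosine similarity; the text does not say which similarity measure the paper uses, so
treat that choice as an assumption.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def aggregate_similarity(sentences):
    """For each sentence, sum its similarity to all other sentences."""
    vectors = [bag_of_words(s) for s in sentences]
    return [
        sum(cosine(v, other) for j, other in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]

scores = aggregate_similarity(
    ["cats chase mice", "cats chase birds", "stocks fell today"]
)
```

The first two sentences share two of their three words, so each scores about 0.67, while the unrelated third sentence
scores 0.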


d) Centroid score: Each word is given a score according to its statistical importance to a cluster of documents.
The total score of a sentence is then the sum of the scores of its words.
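
A rough sketch of the centroid idea, assuming a word's importance to the cluster is its average term frequency across
the cluster multiplied by log(ND / df). This weighting is an assumption for illustration, and the function names are
hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def centroid_word_scores(cluster):
    """Score each word by (average tf over the cluster) * log(ND / df)."""
    nd = len(cluster)
    total_tf = Counter()
    df = Counter()
    for doc in cluster:
        tokens = tokenize(doc)
        total_tf.update(tokens)   # occurrences across the whole cluster
        df.update(set(tokens))    # number of documents containing the word
    return {w: (total_tf[w] / nd) * math.log(nd / df[w]) for w in total_tf}

def centroid_score(sentence, word_scores):
    """Sentence score = sum of the centroid scores of its words."""
    return sum(word_scores.get(w, 0.0) for w in tokenize(sentence))

word_scores = centroid_word_scores(["rain rain flood", "rain storm", "sun"])
score = centroid_score("rain flood", word_scores)
```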


e) Semantic or sentiment score: For each entity we define three types of sentiment: neutral, positive and negative.
The total score of a sentence is then the sum of the absolute values of the sentiment scores of its words, which
measures the sentiment strength of the sentence.
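
The sum of absolute sentiment values can be sketched as below. The tiny lexicon is a made-up toy for this example
(positive words above zero, negative below zero, everything else neutral at 0); a real system would use a proper
sentiment lexicon.

```python
import re

# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}

def sentiment_strength(sentence, lexicon=TOY_LEXICON):
    """Sum of |sentiment score| over the words; unknown words count as neutral (0)."""
    words = re.findall(r"\w+", sentence.lower())
    return sum(abs(lexicon.get(w, 0.0)) for w in words)

strength = sentiment_strength("Great food but terrible service")  # 2.0 + 2.0 = 4.0
```

Taking the absolute value means a strongly negative word adds as much to the score as a strongly positive one, so
the score measures how opinionated a sentence is, not which way it leans.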


Then we calculate the total score of each sentence by summing up these individual scores, and select the sentences
with the highest total scores.
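
The final step can be sketched as follows, assuming every feature has already produced one score per sentence.
The function name select_summary and the parameter k (number of sentences to keep) are made up for this example,
and the plain unweighted sum is taken directly from the description above.

```python
def select_summary(sentences, feature_scores, k=2):
    """Sum the per-feature scores of each sentence and keep the k best.

    feature_scores: list of score lists, one per feature, each aligned with sentences.
    """
    totals = [sum(scores) for scores in zip(*feature_scores)]
    ranked = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)
    chosen = sorted(ranked[:k])  # put the selected sentences back in document order
    return [sentences[i] for i in chosen]

sentences = ["A", "B", "C"]
feature_scores = [
    [1.0, 0.5, 0.2],  # e.g. location scores
    [0.1, 0.1, 0.8],  # e.g. tf-idf scores
]
summary = select_summary(sentences, feature_scores, k=2)  # ["A", "C"]
```

Returning the chosen sentences in document order is a design choice made here for readability of the summary. Since
the features live on different scales, it may also be worth normalizing each one before summing.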

