Text Summarization: Evaluating Summaries with the ROUGE Tool

Remember when you had to read long chapters of your English literature books when all you really wanted was the gist? In today's world we hardly have time to read anything lengthy. Our email and texting styles have evolved, and they clearly reflect the need for brevity. That need is what makes text summarization such an interesting problem.
Many algorithms and techniques have been developed for generating automatic summaries for texts.
But the BIG question is: how do we pick the best technique among the plethora available?
This is where evaluation metrics come in: how do we compare the performance of these algorithms, and how do we pick the best summary out of all the automatically generated ones?
One such tool is ROUGE. It automatically compares generated summaries against human-written, curated reference summaries. Let's see what ROUGE is all about.

What is ROUGE? 

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It provides measures for evaluating a generated summary (or gist) against one or more reference summaries, based on precision- and recall-style overlap statistics.
ROUGE offers several kinds of measures:

  • ROUGE-N
  • ROUGE-L
  • ROUGE-W
  • ROUGE-S
But before jumping into the various kinds of ROUGE measures, let's briefly look at recall and precision.

ROUGE - Recall

Recall measures how much of the reference (human-generated) summary is captured by the generated summary.
Let's take an example to understand how recall is calculated in ROUGE.

Human Generated Summary - Gold Standard 

you know nothing john snow

Summary generated by your algorithm

you know nothing little john snow
So here
Recall = (number of overlapping words / total words in reference summary) = 5 / 5 = 1

ROUGE - Precision 

Precision measures how much of the generated summary is actually relevant, i.e. how much of it appears in the reference summary.
For the above example,

Precision = (number of overlapping words / total words in generated summary) = 5 / 6
Precision or recall alone can be misleading: a very long summary inflates recall, while a very short one inflates precision. That is why ROUGE also reports the precision- and recall-based F-measure.
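The word-level recall, precision and F-measure above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the official ROUGE implementation (which also handles stemming, stopwords and multiple references):

```python
# Minimal sketch of ROUGE-1 recall, precision and F-measure.
# Overlap is "clipped": a word matches at most as many times as it
# appears in each summary.
from collections import Counter

def rouge_1(reference, candidate):
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum(min(cand_counts[w], c) for w, c in ref_counts.items())
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(cand_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

r, p, f = rouge_1("you know nothing john snow",
                  "you know nothing little john snow")
# recall = 1.0, precision = 5/6
```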

ROUGE-N 

It calculates the n-gram overlap between the generated summary and the ground-truth summary.
For example, ROUGE-2 measures recall and precision over bigrams. For the previous example:

Bigrams in Gold Standard

          you know      
          know nothing
          nothing john 
          john snow     

Bigrams in generated summary

          you know      
          know nothing
          nothing little 
          little john      
          john snow     

In this case, with three common bigrams ("you know", "know nothing" and "john snow"):
ROUGE-2 Recall = 3 / 4
ROUGE-2 Precision = 3 / 5
A general formula for the ROUGE-N recall is:

    ROUGE-N = ( Σ_{S ∈ reference summaries} Σ_{gram_n ∈ S} Count_match(gram_n) )
              / ( Σ_{S ∈ reference summaries} Σ_{gram_n ∈ S} Count(gram_n) )

where n is the length of the n-gram gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the generated summary and the gold standard summaries.
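For the single-reference case, the ROUGE-N computation can be sketched as follows (the full formula above additionally sums over multiple reference summaries):

```python
# Sketch of ROUGE-N recall and precision via clipped n-gram overlap,
# reproducing the ROUGE-2 counts from the example above.
from collections import Counter

def ngrams(text, n):
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(reference, candidate, n):
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    # Count_match: an n-gram matches at most as many times as it
    # occurs in each summary.
    overlap = sum(min(cand[g], c) for g, c in ref.items())
    return overlap / sum(ref.values()), overlap / sum(cand.values())

recall, precision = rouge_n("you know nothing john snow",
                            "you know nothing little john snow", n=2)
# recall = 3/4, precision = 3/5
```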

ROUGE-L

ROUGE-L measures the overlap between summaries based on their longest common subsequence (LCS). It uses an LCS-based F-measure to evaluate two summaries X of length m and Y of length n, where X is a gold standard summary and Y is a generated summary:

    R_lcs = LCS(X, Y) / m
    P_lcs = LCS(X, Y) / n
    F_lcs = ((1 + β²) · R_lcs · P_lcs) / (R_lcs + β² · P_lcs)

So ROUGE-L will be 1 when X = Y, and 0 when X and Y have no common subsequence.
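The LCS-based measure can be sketched in Python like this (β = 1 is assumed here for simplicity; the original ROUGE package makes it configurable):

```python
# Sketch of ROUGE-L using a classic O(m*n) dynamic-programming LCS.
def lcs_length(x, y):
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l(reference, candidate):
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    recall, precision = lcs / len(x), lcs / len(y)
    f = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return recall, precision, f

r_l, p_l, f_l = rouge_l("you know nothing john snow",
                        "you know nothing little john snow")
# LCS = 5, so recall = 1.0 and precision = 5/6
```

A nice property of LCS is that it needs no fixed n-gram length: it rewards in-order word matches even when other words are interleaved between them.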

ROUGE-W

Because ROUGE-L only considers subsequences, it does not check whether the matched words are consecutive. To improve on plain LCS, we can use a weighted scheme that prefers consecutive matches over non-consecutive ones: simply keep track of the length of the consecutive match found so far. This way, consecutive common subsequences are awarded a higher score than ones that are common but scattered.

ROUGE-S 

This is a co-occurrence statistic based on skip-bigrams. A skip-bigram is any pair of words in sentence order, allowing arbitrary gaps between them. For the sentence
you know nothing john snow
the skip-bigrams are:

  •  you know
  • you nothing
  • you john
  • you snow
  • know nothing
  • know john
  • know snow
  • nothing john
  • nothing snow
  • john snow
ROUGE-S can be calculated as:

    R_skip2 = SKIP2(X, Y) / C(m, 2)
    P_skip2 = SKIP2(X, Y) / C(n, 2)
    F_skip2 = ((1 + β²) · R_skip2 · P_skip2) / (R_skip2 + β² · P_skip2)

where SKIP2(X, Y) is the number of skip-bigram matches between the two summaries, and C(m, 2) and C(n, 2) are the numbers of skip-bigrams in summaries of m and n words respectively.
We can also limit the skip distance between the two words of a skip-bigram. This reduces spurious matches between words that are far apart in the sentence.
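Here is a small sketch of ROUGE-S, with an optional maximum gap between the two words of each skip-bigram (the parameter name max_gap is my own; the original package calls this the skip distance):

```python
# Sketch of ROUGE-S: enumerate skip-bigrams (ordered word pairs,
# optionally limited to a maximum gap) and compute recall/precision
# over the C(m, 2) and C(n, 2) pair counts.
from collections import Counter
from itertools import combinations

def skip_bigrams(text, max_gap=None):
    words = text.split()
    pairs = Counter()
    for i, j in combinations(range(len(words)), 2):
        if max_gap is None or j - i <= max_gap:
            pairs[(words[i], words[j])] += 1
    return pairs

def rouge_s(reference, candidate, max_gap=None):
    ref = skip_bigrams(reference, max_gap)
    cand = skip_bigrams(candidate, max_gap)
    overlap = sum(min(cand[p], c) for p, c in ref.items())
    return overlap / sum(ref.values()), overlap / sum(cand.values())

r_s, p_s = rouge_s("you know nothing john snow",
                   "you know nothing little john snow")
# all 10 reference skip-bigrams appear among the 15 candidate pairs:
# recall = 1.0, precision = 10/15
```

Setting max_gap=1 restricts the pairs to adjacent words, which recovers ordinary bigrams, so the limit interpolates between ROUGE-2 and unrestricted ROUGE-S.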

References


  • Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, ACL 2004. aclweb.org
  • http://rxnlp.com/how-rouge-works-for-evaluation-of-summarization-tasks/
  • http://kavita-ganesan.com/rouge-howto/



