Remember the days when you had to read long chapters of your English literature books when all you really wanted was the gist of the chapter? In today's world we hardly have time to read anything lengthy. Our email and texting styles have evolved, and they clearly reflect the need for brevity. This is what makes text summarization such an interesting problem.
Many algorithms and techniques have been developed for automatically generating summaries of texts.
But the BIG question is: how do we pick the best summarization technique among the plethora of techniques available?
This is where evaluation metrics come in: how do we compare the performance of these algorithms, and how do we pick the best summary out of all the automatically generated ones?
One such tool is ROUGE. It automatically compares generated summaries against human-written, curated summaries. Let's see what ROUGE is all about.
What is ROUGE?
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It provides measures that can be used to evaluate a generated summary, or gist, against one or more gold standard summaries, combining precision, recall, and overlap in the measures it produces. ROUGE offers several kinds of measures:
- ROUGE-N
- ROUGE-L
- ROUGE-W
- ROUGE-S
But before jumping into the various kinds of ROUGE measures, let's briefly look at recall and precision.
ROUGE - Recall
Recall measures how much of the original (human-generated) summary is captured by the generated summary. Let's take an example to understand how recall is calculated in ROUGE.
Human Generated Summary - Gold Standard
you know nothing john snow
Summary generated by your algorithm
you know nothing little john snow
So here,
Recall = (no. of common words / no. of words in ground truth) = 5 / 5 = 1
ROUGE - Precision
Precision measures how much of the generated summary is needed or relevant. For the above example,
Precision = (no. of common words / no. of words in generated summary) = 5 / 6
But precision and recall alone are insufficient for evaluating summaries. Thus we use the precision- and recall-based F-measure.
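To make these calculations concrete, here is a minimal Python sketch (an illustration of the arithmetic above, not the official ROUGE package) that computes word-overlap recall, precision, and the balanced F-measure:

```python
from collections import Counter

def rouge_1(gold, generated):
    # Count word overlap, clipped so each word matches at most as often
    # as it appears in both summaries.
    gold_counts = Counter(gold.split())
    gen_counts = Counter(generated.split())
    overlap = sum((gold_counts & gen_counts).values())
    recall = overlap / sum(gold_counts.values())
    precision = overlap / sum(gen_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

gold = "you know nothing john snow"
generated = "you know nothing little john snow"
print(rouge_1(gold, generated))  # (1.0, 0.833..., 0.909...)
```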
ROUGE-N
It calculates the n-gram recall between the generated summary and the ground truth summary. For example, ROUGE-2 will calculate recall for bigrams in the summary. For the previous example:
Bigrams in Gold Standard
you know
know nothing
nothing john
john snow
Bigrams in generated summary
you know
know nothing
nothing little
little john
john snow
In this case,
ROUGE-2 Recall = 3 / 4
ROUGE-2 Precision = 3 / 5
A general formula for the ROUGE-N measure is:
$$\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} \text{Count}_{match}(gram_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} \text{Count}(gram_n)}$$
where n is the length of the n-gram, gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the generated summary and the gold standard summaries.
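A minimal sketch of this measure in Python for a single reference summary (the helper names are my own, not part of any library):

```python
from collections import Counter

def ngram_counts(text, n):
    # Multiset of n-grams in a whitespace-tokenized text.
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(gold, generated, n):
    gold_ngrams = ngram_counts(gold, n)
    gen_ngrams = ngram_counts(generated, n)
    # Count_match: clipped n-gram overlap between the two summaries.
    match = sum((gold_ngrams & gen_ngrams).values())
    return match / sum(gold_ngrams.values()), match / sum(gen_ngrams.values())

print(rouge_n("you know nothing john snow",
              "you know nothing little john snow", n=2))  # (0.75, 0.6)
```

The printed values match the ROUGE-2 recall of 3/4 and precision of 3/5 computed above.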
ROUGE-L
ROUGE-L measures the overlap between summaries based on the longest common subsequence (LCS) of their words. It uses an LCS-based F-measure to evaluate a gold standard summary X of length m against a generated summary Y of length n:

$$R_{lcs} = \frac{LCS(X, Y)}{m} \qquad P_{lcs} = \frac{LCS(X, Y)}{n} \qquad F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

So ROUGE-L will be 1 when X = Y, and 0 when there is no common subsequence in X and Y.
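A minimal dynamic-programming sketch of ROUGE-L over whole summaries (setting β = 1 is my own choice here for illustration):

```python
def lcs_length(x, y):
    # Classic dynamic-programming LCS over token lists.
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(gold, generated, beta=1.0):
    x, y = gold.split(), generated.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(x), lcs / len(y)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)

print(rouge_l("you know nothing john snow",
              "you know nothing little john snow"))  # ≈ 0.909
```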
ROUGE-W
As ROUGE-L takes only subsequences into consideration, it does not check whether the matched subsequences are consecutive. To improve on the plain LCS method, we can use a weighted scheme that prefers consecutive subsequences over non-consecutive ones: one simply stores the length of the consecutive run of matches found so far. This way, consecutive common subsequences are awarded a higher score than ones which are common but not consecutive.
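Here is a sketch of the weighted LCS dynamic program from the ROUGE paper, using the weighting function f(k) = k^alpha (the function names and the choice alpha = 2 are mine, for illustration):

```python
def rouge_w(gold, generated, alpha=2):
    f = lambda k: k ** alpha            # weighting function, rewards long runs
    f_inv = lambda v: v ** (1 / alpha)  # inverse of f
    x, y = gold.split(), generated.split()
    # c holds weighted-LCS scores; w holds the length of the consecutive
    # run of matches ending at each cell.
    c = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    w = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    wlcs = c[-1][-1]
    if wlcs == 0:
        return 0.0
    recall = f_inv(wlcs / f(len(x)))
    precision = f_inv(wlcs / f(len(y)))
    return 2 * recall * precision / (recall + precision)

print(rouge_w("you know nothing john snow",
              "you know nothing little john snow"))  # ≈ 0.66
```

Note how the score drops below the plain ROUGE-L value: the insertion of "little" breaks the consecutive run, and the weighting penalizes that.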
ROUGE-S
This is a co-occurrence statistic based on skip-bigrams. A skip-bigram is any pair of words in sentence order, allowing arbitrary gaps between the two words. For the sentence "you know nothing john snow", the skip-bigrams are:
- you know
- you nothing
- you john
- you snow
- know nothing
- know john
- know snow
- nothing john
- nothing snow
- john snow
ROUGE-S can be calculated as:

$$R_{skip2} = \frac{SKIP2(X, Y)}{C(m, 2)} \qquad P_{skip2} = \frac{SKIP2(X, Y)}{C(n, 2)} \qquad F_{skip2} = \frac{(1 + \beta^2)\, R_{skip2}\, P_{skip2}}{R_{skip2} + \beta^2 P_{skip2}}$$

where SKIP2(X, Y) is the number of skip-bigram matches between X and Y, and C(m, 2) counts the skip-bigrams in a summary of length m.
We can also limit the skip distance allowed between the two words of a skip-bigram. This reduces spurious matches between words that are far apart in the sentence.
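A minimal sketch of skip-bigram counting with an optional maximum skip distance (a caveat: when max_skip is set, this sketch divides by the number of within-distance pairs rather than C(m, 2), a simplification of the paper's adjusted counts; the names are my own):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, max_skip=None):
    # Multiset of in-order word pairs; max_skip (if given) caps the number
    # of words allowed between the two members of a pair.
    return Counter((tokens[i], tokens[j])
                   for i, j in combinations(range(len(tokens)), 2)
                   if max_skip is None or j - i - 1 <= max_skip)

def rouge_s(gold, generated, max_skip=None, beta=1.0):
    gold_pairs = skip_bigrams(gold.split(), max_skip)
    gen_pairs = skip_bigrams(generated.split(), max_skip)
    matches = sum((gold_pairs & gen_pairs).values())  # SKIP2(X, Y)
    if matches == 0:
        return 0.0
    recall = matches / sum(gold_pairs.values())
    precision = matches / sum(gen_pairs.values())
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)

print(rouge_s("you know nothing john snow",
              "you know nothing little john snow"))  # 0.8
```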
References
- ROUGE: A Package for Automatic Evaluation of Summaries, Chin-Yew Lin, Text Summarization Branches Out, 2004 - aclweb.org
- http://rxnlp.com/how-rouge-works-for-evaluation-of-summarization-tasks/
- http://kavita-ganesan.com/rouge-howto/