Introduction
Technology has reached everywhere, giving people around the world ample opportunities to connect with each other. With the advent of the Internet and the availability of suitable platforms for user-generated content, posting queries and getting answers has become an extremely easy task.
We know that queries are generally classified into two categories: factoid queries and non-factoid queries. A huge amount of research has been done on answering factoid queries. However, searching for an appropriate answer to a non-factoid question within the pool of given answers is a difficult task. So, this blog post discusses a method of summarizing answers to a non-factoid question.
Major hurdles on the way 😟😟😟
We know that the answer to a non-factoid question is a passage that explains the answer in brief, but the shortness of the available text, its sparsity, and the diversity of the available answers make summarization an exciting challenge. The major issues on the path to answering a non-factoid question are as follows:
- PROBLEM-1: The shortness of answers to a non-factoid query is one of the major problems that document summarization algorithms face when producing a summary of the given answers.
- PROBLEM-2: The sparsity of syntactic and context information hampers the summarization process.
- PROBLEM-3: While summarizing the answers to a non-factoid query, it is extremely important to retain the information related to the query. However, the widely spread distribution of answers in non-factoid CQA makes it difficult to produce a summary with a high recall value.
A plausible solution to the above-mentioned problems 😊😊😊
| PROBLEMS | SOLUTIONS |
| --- | --- |
| Shortness of answers | A document-expansion strategy uses concepts such as entity linking and sentence ranking to gather and filter related information from Wikipedia. |
| Sparsity of syntactic and context information in sentence representation | A text CNN models sentences given the input question, and the question is matched against the sentences of the related answers. The aim is to learn an optimal sentence vector for each candidate sentence using stochastic gradient descent (SGD) and backpropagation. |
| Diversity of answers | To summarize the sparse semantic units efficiently, a loss function introduced within the sparse coding strategy has proved to be an effective approach. |
Methodology:
[Figure: an overview of the proposed framework]
Document Expansion Phase:
- This phase assumes that every non-factoid query provides a question, represented by q, and a set of candidate answers posted in response, represented by D.
- Another set, S, denotes the sentences of the answer thread, S = {s1, s2, . . . , s|S|}, where |S| is the total number of sentences across the candidate answers.
- We aim to extract sentences from S to construct a summary.
- To deal with shortness, we increase the length of the candidate answers by extracting a set of relevant sentences, denoted S', from an external knowledge base (i.e., Wikipedia). After document expansion we have T sentences in total, where T combines the sentences present in the answers with those extracted from the external source (i.e., T = S ∪ S').
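The expansion step above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it stands in for entity linking and sentence ranking with a simple Jaccard-overlap retrieval from a toy knowledge base, and names such as `expand_answers` are invented for the example.

```python
# Illustrative document expansion: retrieve the knowledge-base sentences
# most similar (by word overlap) to any sentence in the answer thread S.

def tokenize(text):
    return set(text.lower().split())

def expand_answers(answer_sentences, knowledge_base, top_k=2):
    """Return S': the top_k knowledge-base sentences closest to S."""
    answer_tokens = [tokenize(s) for s in answer_sentences]
    scored = []
    for kb_sentence in knowledge_base:
        kb_tokens = tokenize(kb_sentence)
        score = max(
            len(kb_tokens & a) / len(kb_tokens | a) if kb_tokens | a else 0.0
            for a in answer_tokens
        )
        scored.append((score, kb_sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_k]]

S = ["green tea contains antioxidants",
     "drinking tea may improve focus"]
KB = ["antioxidants protect cells from damage",
      "tea is grown in many countries",
      "football is a popular sport"]

S_prime = expand_answers(S, KB)
T = S + S_prime   # expanded pool: T = S ∪ S'
```

In the paper the retrieval source is Wikipedia and the ranking is far more sophisticated, but the shape of the output is the same: a pool T that is long enough for a summarizer to work with.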
Sentence Representation Phase:
In this phase, a convolutional neural network model is used: each sentence obtained from both sources (i.e., S and S', combined as T) is represented as a vector of dimension m. After sentence vectorization we therefore obtain two sets of vectors, X and X', corresponding to S and S', respectively.
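The forward pass of such a text CNN can be sketched in numpy. This is a minimal, assumed illustration in the spirit of Kim (2014) — random weights, illustrative sizes — showing how a variable-length sentence becomes a fixed m-dimensional vector via convolution and max-over-time pooling:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, num_filters, window = 8, 4, 3   # here m = num_filters = 4

def sentence_vector(word_embeddings, filters):
    """word_embeddings: (n_words, embed_dim); filters: (num_filters, window*embed_dim).
    Returns an m-dimensional sentence vector, m = num_filters."""
    n = word_embeddings.shape[0]
    feature_maps = []
    for start in range(n - window + 1):
        # flatten a window of consecutive word embeddings into one patch
        patch = word_embeddings[start:start + window].reshape(-1)
        feature_maps.append(np.tanh(filters @ patch))  # one value per filter
    # max-over-time pooling: keep the strongest response of each filter
    return np.max(np.stack(feature_maps), axis=0)

words = rng.normal(size=(6, embed_dim))                  # a 6-word sentence
filters = rng.normal(size=(num_filters, window * embed_dim))
x = sentence_vector(words, filters)
print(x.shape)   # (4,)
```

Running every sentence in T through the same encoder yields the vector sets X and X'; in the paper the filter weights are learned with SGD and backpropagation rather than drawn at random.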
Answer Summarization Phase:
This phase selects the subset of sentences from the given pool S that will form the summary. The strategy proposed to re-design the semantic space of an aspect is a sparse coding-based algorithm, explained below.
[Figure: overview of the sparse coding framework]
- The diagram depicts an unsupervised strategy for producing a summary of a non-factoid query. The boxes indicate sentences, which may come from either the question or the answers; the grey boxes are the sentences chosen for the summary. The aim of the model is to find a linear combination of basis vectors that minimizes the reconstruction error function.
- Each candidate sentence vector xi ∈ X is taken as a candidate basis vector, and all the xi's together construct the semantic space of the aspect, covering both X and X'. To harness the characteristics of the summarization problem setting more effectively, the preliminary error formulation is refined.
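The core idea can be illustrated with a deliberately simplified stand-in (the paper's refined error formulation adds further terms not reproduced here): every sentence vector x_j in X ∪ X' is reconstructed as a sparse combination of the candidate basis vectors in X, and a candidate's saliency is its aggregate contribution across all reconstructions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))      # 5 candidate sentence vectors (the basis)
X_all = rng.normal(size=(9, 8))  # all sentence vectors, X ∪ X'

def sparse_codes(X_all, X, lam=0.05, lr=0.01, steps=500):
    """Minimize sum_j ||x_j - A[j] @ X||^2 + lam * |A|_1 by (sub)gradient descent."""
    A = np.zeros((X_all.shape[0], X.shape[0]))
    for _ in range(steps):
        residual = A @ X - X_all                     # reconstruction error
        grad = 2 * residual @ X.T + lam * np.sign(A) # L1 term encourages sparsity
        A -= lr * grad
    return A

A = sparse_codes(X_all, X)
saliency = np.abs(A).sum(axis=0)   # aggregate contribution of each candidate
```

The L1 penalty drives most coefficients toward zero, so only a few candidates end up carrying the reconstruction; those are exactly the sentences whose saliency scores feed the selection step that follows.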
[Figure: the error-formulation function]
where aj is the saliency score for the sentence vector xj. A maximal marginal relevance (MMR) algorithm is then applied to the saliency scores of the candidate sentences. MMR incrementally computes the saliency-ranked list and generates the set of sentences that best suits the question, presenting them as the summary.
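MMR itself is a short greedy loop. The sketch below is an assumed, generic MMR — the paper may weight the terms differently — balancing a candidate's saliency against its redundancy with sentences already selected, with `lambda_` trading off the two:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def mmr_select(vectors, saliency, k=2, lambda_=0.7):
    """Greedily pick k sentence indices by relevance minus redundancy."""
    selected, remaining = [], list(range(len(vectors)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                             default=0.0)
            return lambda_ * saliency[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

vecs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
sal = np.array([0.9, 0.8, 0.5])
print(mmr_select(vecs, sal))   # [0, 2]: sentence 1 is skipped as redundant with 0
```

Note how sentence 1, despite its high saliency, loses to sentence 2 because it nearly duplicates the already-chosen sentence 0; this is how MMR keeps the summary diverse.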
Experiments and Results:
👉 Dataset: A benchmark dataset provided by Tomasoni and Huang was used. It consists of 100 non-factoid questions with 361 answers, comprising 2,793 answer sentences (59,321 words) and 275 manually generated summaries.
👉 Evaluation metric: The SPQAS method based on the sparse coding strategy was evaluated. The CNN-based sentence representation was also compared against word2vec as an alternative representation (SPQAS + word2vec). Porter stemming was used for pre-processing, and the length limit of the CQA answer summary was set to 250 words. Summaries were scored with the ROUGE evaluation metrics for document summarization.
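For intuition, ROUGE-N measures n-gram overlap between a system summary and the reference summaries. Here is a bare-bones ROUGE-1 recall for illustration — not the official ROUGE toolkit used in the evaluation:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams (with multiplicity) found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

print(rouge1_recall("the cat sat on the mat",
                    "the cat lay on the mat"))   # 5 of 6 reference words match
```

The real evaluation also reports precision- and F-measure-based ROUGE variants, but recall is the one PROBLEM-3 is most concerned with.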
👉 Results: The proposed model performed better than the state-of-the-art methods.
REFERENCES
- M. Tomasoni and M. Huang. Metadata-aware measures for answer summarization in community question answering. In ACL, 2010.
- Y. Zhao, S. Liang, Z. Ren, J. Ma, E. Yilmaz, and M. de Rijke. Explainable user clustering in short text streams. In SIGIR, 2016.
- Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.