SUMMARIZATION OF A NON-FACTOID QUERY



Introduction
Technology has reached every corner of the world and given people ample opportunities to connect with each other. With the advent of the Internet and the availability of platforms for user-generated content, posting questions and getting answers has become an extremely easy task.
Queries are generally classified into two categories: factoid queries and non-factoid queries. A huge amount of research has been done on providing answers to factoid queries.
Searching for an appropriate answer to a non-factoid question among a pool of candidate answers, however, remains a difficult task.



So, this blog post aims to discuss a method of summarizing answers to non-factoid questions.




Major hurdles on the way😟😟😟
We know that the answer to a non-factoid question is a passage that aims to explain rather than state a single fact. The shortness of the available text, the sparsity associated with it, and the diversity of the candidate answers make its summarization an exciting challenge.
The major issues in summarizing answers to a non-factoid question are as follows:

  • PROBLEM-1: The shortness of answers to a non-factoid query is one of the major problems that document summarization algorithms face when producing a summary of the given answers.

  • PROBLEM-2: The sparsity of syntactic and context information hampers the summarization process.

  • PROBLEM-3: While producing a summary for a non-factoid query, it is extremely important to capture the information related to the query. However, the widely spread distribution of answers in non-factoid CQA makes it difficult to produce a summary with a high recall value.


A plausible solution to the above-mentioned problems😊😊😊

  • PROBLEM: Shortness of answers. SOLUTION: A document-expansion strategy that uses entity linking and sentence ranking to gather and filter related information from Wikipedia.

  • PROBLEM: Sparsity of syntactic and context information in sentence representation. SOLUTION: A text CNN is used to model sentences given the input question; the question is then matched against the sentences of the candidate answers. The aim is to obtain an optimal sentence vector for each candidate sentence using stochastic gradient descent (SGD) and back-propagation.

  • PROBLEM: Diversity of answers. SOLUTION: A loss function introduced within the sparse-coding strategy summarizes the sparse semantic units efficiently, which has proved to be an effective approach.

Methodology:

An overview of the proposed framework



Document Expansion Phase:
  •        This phase assumes that for every non-factoid query we are given a question, represented by q, and a set of candidate answers in response, represented by D.
  •        The set S = {s1, s2, . . . , s|S|} denotes the sentences of the answers in the thread, where |S| is the total number of sentences across the candidate answers.
  •        We aim to extract sentences from S to construct a summary.
  •        To deal with shortness, we increase the length of the candidate answers by extracting a set of relevant sentences, denoted by S′, from a knowledge base (i.e. Wikipedia). In total we have T sentences after document expansion, where T combines the sentences present in the answers with those extracted from the external source (i.e. T = S ∪ S′).
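The expansion step above can be sketched as follows. This is a simplified illustration, not the paper's method: the entity-linking and Wikipedia-retrieval machinery is replaced by a word-overlap scorer, and `wiki_sentences` and the cut-off `k` are illustrative assumptions.

```python
def overlap_score(question, sentence):
    """Rank a candidate external sentence by word overlap with the question."""
    q_words = set(question.lower().split())
    s_words = set(sentence.lower().split())
    return len(q_words & s_words) / max(len(s_words), 1)

def expand_document(question, thread_sentences, wiki_sentences, k=2):
    """Keep the top-k external sentences and merge: T = S union S'."""
    ranked = sorted(wiki_sentences,
                    key=lambda s: overlap_score(question, s),
                    reverse=True)
    s_prime = ranked[:k]
    return thread_sentences + s_prime  # the expanded pool T

question = "why do cats purr"
S = ["Cats purr when they are content.", "My cat purrs loudly."]
wiki = ["Purring is a vocalisation made by cats.",
        "The Eiffel Tower is in Paris.",
        "Cats also purr when injured, possibly to self-soothe."]
T = expand_document(question, S, wiki)
print(len(T))  # 4 sentences after expansion
```

Irrelevant external material (the Eiffel Tower sentence) is filtered out by the ranking, which is the point of the sentence-ranking step.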


Sentence Representation Phase:

In this phase, a convolutional neural network (CNN) model is used: each sentence obtained from both sources (i.e. S and S′, combined as T) is represented as a vector of dimension m. Therefore, after sentence vectorization, we obtain two sets of vectors, X and X′, corresponding to S and S′, respectively.
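A minimal sketch of how a text CNN turns a sentence into an m-dimensional vector: convolution filters slide over windows of word embeddings, followed by ReLU and max-over-time pooling. The toy random embeddings, the filter width, and the tiny dimensions are assumptions for illustration; the real model is trained with SGD and back-propagation against the question.

```python
import random
random.seed(0)

EMB_DIM, M = 8, 4   # word-embedding size and sentence-vector dimension m
WIDTH = 2           # convolution filter width (a toy assumption)

def embed(word):
    """Toy deterministic word embedding; a trained model would learn these."""
    rng = random.Random(word)
    return [rng.uniform(-1, 1) for _ in range(EMB_DIM)]

# M convolution filters, each spanning WIDTH consecutive word vectors
FILTERS = [[random.uniform(-1, 1) for _ in range(WIDTH * EMB_DIM)]
           for _ in range(M)]

def sentence_vector(sentence):
    """Convolve over word windows, apply ReLU, then max-over-time pooling."""
    vecs = [embed(w) for w in sentence.lower().split()]
    out = []
    for f in FILTERS:
        best = 0.0
        for i in range(max(len(vecs) - WIDTH + 1, 1)):
            window = sum(vecs[i:i + WIDTH], [])              # concatenate window
            window += [0.0] * (WIDTH * EMB_DIM - len(window))  # pad short windows
            act = max(0.0, sum(w * x for w, x in zip(f, window)))  # ReLU
            best = max(best, act)                             # max pooling
        out.append(best)
    return out  # an m-dimensional sentence vector

v = sentence_vector("cats purr when they are content")
print(len(v))  # m = 4
```

Max-over-time pooling is what makes the output dimension m independent of sentence length, which matters here because CQA sentences vary widely in length.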

Answer Summarization Phase:

This phase aims to select the subset of sentences from the given pool S that will form the summary. The strategy proposed to re-design the semantic space of an aspect is a sparse coding-based algorithm. The diagram is explained below.


Overview of coding framework
  • This diagram shows an unsupervised strategy for summarizing the answers to a non-factoid query. The boxes indicate sentences, which may come from the question or from the answers. The grey boxes indicate the sentences chosen to be part of the summary. The aim of the model is to find a linear combination of basis vectors that minimizes the reconstruction error function.
  • Each candidate sentence vector xi ∈ X is taken as a candidate basis vector, and all of them together construct the semantic space of the aspect, including X and X′. To harness the characteristics of the summarization problem setting more effectively, the preliminary error formulation is refined.
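A toy illustration of learning saliency scores by minimizing a reconstruction error with plain gradient descent. This is a sketch under simplifying assumptions, not the paper's refined formulation: the vectors are 2-dimensional stand-ins for the CNN outputs, and the target vector, learning rate, and sparsity weight λ are arbitrary.

```python
# Candidate sentence vectors (toy 2-D stand-ins for the CNN sentence vectors).
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

def reconstruction(a):
    """Linear combination sum_j a_j * x_j of the candidate basis vectors."""
    return [sum(a[j] * X[j][d] for j in range(len(X))) for d in range(2)]

def loss(a, target, lam=0.01):
    """Squared reconstruction error plus an L1 term that encourages sparsity."""
    r = reconstruction(a)
    err = sum((r[d] - target[d]) ** 2 for d in range(2))
    return err + lam * sum(abs(v) for v in a)

a = [0.0, 0.0, 0.0]     # saliency scores, one per candidate sentence
target = [1.0, 1.0]     # toy aspect vector to reconstruct
lr, lam = 0.1, 0.01
for _ in range(200):    # plain gradient descent on the objective
    r = reconstruction(a)
    for j in range(len(a)):
        grad = sum(2 * (r[d] - target[d]) * X[j][d] for d in range(2))
        grad += lam * (1 if a[j] > 0 else -1 if a[j] < 0 else 0)  # L1 subgradient
        a[j] -= lr * grad
print([round(v, 2) for v in a])
```

After training, the scores `a` reconstruct the target well while the L1 penalty keeps them sparse, which is how a few salient sentences come to dominate the semantic space.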

Error-Formulation Function

In its basic form, the objective is the standard sparse-coding reconstruction error (the paper refines this preliminary formulation):

    min over a:  Σi || xi − Σj aj · xj ||²  +  λ · Σj |aj|

where aj is the saliency score for the sentence vector xj and λ controls sparsity. A maximal marginal relevance (MMR) algorithm is then applied to the saliency scores of the candidate sentences. MMR incrementally computes the saliency-ranked list and generates the set of sentences that best fits the question, presenting it as the summary.
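The MMR step can be sketched as follows. The trade-off parameter and the cosine redundancy measure are standard MMR choices, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(vectors, saliency, k=2, trade_off=0.7):
    """Greedily pick sentences that are salient but not redundant."""
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cosine(vectors[i], vectors[j])
                              for j in selected), default=0.0)
            return trade_off * saliency[i] - (1 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

vectors = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]]
saliency = [0.9, 0.85, 0.6]
print(mmr_select(vectors, saliency))  # [0, 2]
```

Note that sentence 1 is skipped despite its high saliency because it is nearly identical to sentence 0; this is exactly the redundancy control that addresses the diversity challenge.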

Experiments and Results:

👉 Dataset: A benchmark dataset provided by Tomasoni and Huang was used. It consists of 100 non-factoid questions with 361 answers, comprising 2,793 answer sentences and 59,321 words, along with 275 manually generated summaries.

👉 Evaluation Metric: The SPQAS method with the sparse coding-based strategy was evaluated. The CNN-based sentence representation was also compared against word2vec to explore an alternative representation (SPQAS + word2vec). Porter stemming was used for pre-processing. The length limit of a CQA answer summary was set to 250 words. Summaries were scored with the ROUGE metrics standard for document summarization.
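ROUGE-N essentially measures n-gram recall against the human-written summaries. A bare-bones ROUGE-1 recall looks like this; the real ROUGE toolkit adds stemming, stop-word options, and multi-reference handling, all omitted here.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams recovered by the candidate summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)  # clipped unigram matches
    return overlap / max(sum(ref.values()), 1)

ref = "cats purr when content or injured"
cand = "cats purr when they are content"
print(round(rouge1_recall(cand, ref), 2))  # 4 of 6 reference words -> 0.67
```

Recall-oriented metrics like this are a natural fit here, since PROBLEM-3 above is precisely about producing summaries with high recall over widely spread answer content.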

👉 Results: The proposed model performed better than the state-of-the-art methods.


REFERENCES
  • M. Tomasoni and M. Huang. Metadata-aware measures for answer summarization in community question answering. In ACL, 2010.
  • Y. Zhao, S. Liang, Z. Ren, J. Ma, E. Yilmaz, and M. de Rijke. Explainable user clustering in short text streams. In SIGIR, 2016.
  • Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.


