People today have countless questions that need answering, yet there is no single trustworthy source that can fully satisfy them. Many question answering systems exist, but they draw on a variety of resources to answer a question. As a result, the information retrieved for a particular query may contain different kinds of answers from different sources, which makes it hard to converge on one exact answer.
Wikipedia can instead be used as the single source for a question answering system, one that answers factual questions in any domain with high accuracy; restricting the search to one source also cuts search time, since far less evidence has to be examined. Unlike other knowledge bases (KBs), which often lack the content needed to give appropriate answers, Wikipedia is a content-rich source that keeps evolving and stays up to date. Being so content-rich, however, Wikipedia needs an intelligent system to exploit its full potential.
To answer a query, relevant documents are first fetched, and then the text of those documents is searched to produce an exact answer. This two-step process is known as machine reading at scale (MRS).
Method
The primary focus is on text content, using Wikipedia documents as the single source for an MRS system. DrQA is such a system for question answering over Wikipedia. It consists of two main parts:
- Document Retriever - From the massive number of documents in Wikipedia, a handful of documents relevant to the query are selected using the classic information retrieval techniques of bigrams and TF-IDF (Term Frequency-Inverse Document Frequency). The Document Retriever is built on an inverted index lookup, and a vector space model is then used for scoring: questions and articles are compared as TF-IDF-weighted word vectors over a bigram vocabulary. A minimal sketch of this scoring step appears just after this list.
- Document Reader - A multi-layer recurrent neural network (RNN) is run over the retrieved documents to find the window of text that answers the query. This is done with paragraph encoding and question encoding. In paragraph encoding, each paragraph token is represented by a feature vector fed to the RNN; this feature vector combines word embeddings trained on web-crawl data, an exact-match feature marking tokens that also appear in the query, token features such as part-of-speech (POS) tags that capture the token's context, and an aligned question embedding that encodes a similarity score between paragraph words and query tokens. In question encoding, the query tokens are likewise passed through an RNN, and the importance of each word is computed. If the encoded query tokens are t_1, t_2, ..., t_n, then a weight vector w is learned and the importance S_i of token t_i is computed as:
S_i = exp(w · t_i) / ∑_j exp(w · t_j)
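Before turning to how the reader predicts an answer, here is a minimal sketch of the Document Retriever's scoring step described in the list above. It is only an illustration: scikit-learn's TfidfVectorizer with unigram and bigram features stands in for DrQA's hashed-bigram TF-IDF, and the documents and question are toy examples rather than the full Wikipedia dump.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Toy stand-ins for Wikipedia articles; the real system indexes the whole dump.
docs = [
    "Paris is the capital and most populous city of France.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Mount Everest is Earth's highest mountain above sea level.",
]
question = "What is the capital of France?"

# Unigram + bigram TF-IDF weighting, analogous to DrQA's bigram scheme.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_vectors = vectorizer.fit_transform(docs)      # sparse document-term matrix
query_vector = vectorizer.transform([question])

# Score every document against the question and keep the top k.
scores = (doc_vectors @ query_vector.T).toarray().ravel()
top_k = np.argsort(scores)[::-1][:2]
for rank, idx in enumerate(top_k, start=1):
    print(rank, round(float(scores[idx]), 3), docs[idx])

In the real system the document-term matrix is built once over all of Wikipedia, so retrieval for a new question reduces to a single sparse matrix-vector product.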
Once the model is trained, prediction is made with two classifiers. One classifier scores, for every paragraph token, the probability that it is the start of the answer, and the other the probability that it is the end, by comparing each paragraph token's encoding with the question encoding. The start and end tokens are then chosen so that the answer span is no longer than fifteen words and the product of the start and end probabilities is maximal.
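The question-weighting formula and the span-selection rule can be illustrated with a small NumPy sketch. Everything here is a toy stand-in: the token encodings, weight vector, and start/end probabilities are random rather than the outputs of a trained RNN, and the 15-token limit mirrors the constraint described above.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# --- Question encoding: importance weights S_i over question tokens ---
n_tokens, dim = 6, 8
t = rng.normal(size=(n_tokens, dim))   # hypothetical encoded question tokens t_1..t_n
w = rng.normal(size=dim)               # hypothetical learned weight vector
S = softmax(t @ w)                     # S_i = exp(w . t_i) / sum_j exp(w . t_j)
question_vec = S @ t                   # weighted sum -> single question vector

# --- Span prediction: pick the best (start, end) pair in a paragraph ---
m = 20
p_start = softmax(rng.normal(size=m))  # probability that each token starts the answer
p_end = softmax(rng.normal(size=m))    # probability that each token ends the answer

MAX_SPAN = 15                          # spans longer than this are discarded
best_score, best_span = -1.0, (0, 0)
for i in range(m):
    for j in range(i, min(i + MAX_SPAN, m)):
        score = p_start[i] * p_end[j]  # maximise the product of start and end probabilities
        if score > best_score:
            best_score, best_span = score, (i, j)

print("answer span (token indices):", best_span)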
This question answering system depends entirely on data from Wikipedia, which serves as the sole knowledge source for answering queries.
After applying this technique with Wikipedia as the knowledge source, the same method is evaluated on several different QA datasets, and it is observed that the Wikipedia-based setup generates the most accurate answers for the queries.
Conclusion
Most question answering systems gather knowledge from a collection of resources, which introduces noise into the answers; in the DrQA system, Wikipedia is the only source of knowledge, so more accurate and precise answers are retrieved. The system works in two steps, document retrieval and document reading, with the reader modeled as a recurrent neural network, an approach that has proved very successful for question answering.