Improving search using context - continuous bag of words

Traditional IR and NLP techniques use a large vocabulary to generate an index, and fetch query results by looking up this index alone. No attention is given to the similarity between words, or to how that similarity could be exploited. The context of a given query is also ignored, since naïve retrieval techniques simply look up the query terms, with minor modifications such as stemming, lemmatization and spelling correction. Consider the search query 'tata steel chess 2018', where the user intended to find documents on the games played at the Tata Steel chess tournament held in Wijk aan Zee in 2018, but the results are unsatisfactory because the term 'Wijk aan Zee' does not appear in the query and cannot be recovered by simply looking up the index.
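To make this limitation concrete, here is a minimal sketch of such a term-lookup retrieval step over a toy inverted index; the documents and query are hypothetical illustrations. A relevant document that never mentions the literal query terms is simply never retrieved.

```python
# A toy inverted index that only matches documents containing the literal
# query terms. Documents and query are invented for illustration.
from collections import defaultdict

docs = {
    1: "tata steel chess tournament 2018 wijk aan zee results",
    2: "wijk aan zee grandmaster round reports and game analysis",  # no query term appears here
    3: "tata steel company annual report 2018",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def lookup(query):
    """Return documents containing at least one query term (pure term match)."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return sorted(hits)

# Document 2 is about the tournament, but a pure term lookup has no way of
# knowing that 'wijk aan zee' and 'tata steel chess' refer to the same event.
print(lookup("tata steel chess 2018"))  # -> [1, 3]
```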

The general idea of converting tokens to vectors has gained notice in the past few years, with recent models convincingly outperforming the prevalent n-gram model. The n-gram model estimates the joint probability of a sequence of tokens by assuming each token depends only on the previous n - 1 tokens, i.e. by making a Markov assumption. This assumption simplifies the task of computing the joint probability, but performance depends on how well the assumption holds. The model also suffers from a horizon effect of sorts: because words are assumed to depend on only the previous n - 1 tokens, results vary with the choice of window size n. Instead of manually trying out various values of n, which would vary across sets of tokens, we pose this as a learning problem: given a set of tokens constituting a query, we try to determine the 'center' or most important token of the query, using the remaining n - 1 tokens as a bag of words from which to draw the context.
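As an illustration of how such training instances can be derived, the sketch below slides a window over a token sequence and treats the surrounding tokens as an unordered bag of context words for each center word; the window size and the example sentence are arbitrary choices for illustration, not part of the original formulation.

```python
# Derive CBOW-style (context, center) training pairs from a token sequence:
# for each position, the surrounding window is the bag of context words and
# the word at that position is the center target.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

tokens = "tata steel chess tournament 2018".split()
for context, center in cbow_pairs(tokens):
    print(context, "->", center)
# e.g. ['tata', 'steel', 'tournament', '2018'] -> chess
```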


The problem is formulated as determining two matrices U and V (also written W and W') such that the cross-entropy between the predicted center word and the actual center word, given the context of the remaining n - 1 words, is minimized. The $x_i$'s are one-hot encoded context vectors; we average them and multiply the result with U to obtain the input to our neural network, and train the network using standard backpropagation. The loss is the cross-entropy $H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)$ between the softmaxed output of the network and the one-hot encoding of the actual center word.
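The forward pass described above can be sketched in a few lines of numpy. The dimensions, initialization and vocabulary are illustrative, and the backpropagation update is omitted for brevity.

```python
# Minimal CBOW forward pass: U projects the averaged one-hot context vectors
# into the embedding space, W_out (the V / W' matrix) projects back to
# vocabulary scores, and the loss is the cross-entropy against the one-hot
# center word.
import numpy as np

vocab = ["tata", "steel", "chess", "tournament", "2018"]
V_size, d = len(vocab), 8                   # vocabulary size |V|, embedding dimension
U = np.random.randn(V_size, d) * 0.01       # input (context) embeddings, |V| x d
W_out = np.random.randn(d, V_size) * 0.01   # output embeddings, d x |V|

def one_hot(i):
    v = np.zeros(V_size)
    v[i] = 1.0
    return v

def forward(context_ids, center_id):
    x = np.mean([one_hot(i) for i in context_ids], axis=0)  # averaged one-hot contexts
    h = x @ U                                               # projection / hidden layer
    scores = h @ W_out                                      # scores over the vocabulary
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()                                    # softmax
    loss = -np.log(y_hat[center_id])                        # H(y_hat, y) with one-hot y
    return y_hat, loss

context = [vocab.index(w) for w in ["tata", "steel", "tournament", "2018"]]
y_hat, loss = forward(context, vocab.index("chess"))
print(loss)
```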

The paper introduced at ICLR '13 showcases this concept and how it works, with performance exceeding traditional methods such as the n-gram model described above by a substantial margin. Skip-grams were introduced in the same paper, and can be roughly seen as the complement of the CBOW model: the context words are predicted from a given center word. This is understandably less straightforward than CBOW and harder to train, but is known to provide state-of-the-art results in conjunction with CBOW.
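To make the contrast concrete, the sketch below shows how skip-gram training pairs could be generated from the same token stream: each (center, context-word) pair flips the direction of prediction used in CBOW. The window size is again an arbitrary illustrative choice.

```python
# Skip-gram pair generation: the center word is the input and each word in
# its window becomes a separate prediction target.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))  # (input, target) per context word
    return pairs

tokens = "tata steel chess tournament 2018".split()
print(skipgram_pairs(tokens)[:4])
# [('tata', 'steel'), ('tata', 'chess'), ('steel', 'tata'), ('steel', 'chess')]
```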

CBOW and skip-grams constitute the general set of embedding generation models known as word2vec, which have, as mentioned before, seen a dramatic rise in performance and usage in the past half-decade. These data-driven approaches improve over traditional hand-crafted/statistical methods, and are extensively used in modern-day search engines like Google to generate predictions using the context of a given search. Similar approaches have also been deployed in doc2vec and sent2vec.
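In practice, both variants are available in off-the-shelf libraries; a sketch using gensim's Word2Vec is shown below (parameter names follow gensim 4.x, and the toy corpus is invented for illustration).

```python
# Train both word2vec variants on a toy corpus with gensim (4.x parameter names).
from gensim.models import Word2Vec

corpus = [
    "tata steel chess tournament held in wijk aan zee".split(),
    "wijk aan zee hosts the tata steel chess tournament every january".split(),
    "the 2018 edition featured several top grandmasters".split(),
]

cbow = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0)  # sg=0: CBOW
skip = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram

# Inspect nearest neighbours of 'chess' in the learned space
# (the corpus is tiny, so the neighbours will be noisy).
print(cbow.wv.most_similar("chess", topn=3))
```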
