Google currently serves more than 4 billion text queries per day. At that scale, making the algorithms that process these queries as computationally efficient as possible is of utmost interest to the internet community. Moreover, users expect the results of a query to be ordered by relevance to their information need. This makes ranking documents as efficiently as possible hugely important.
In a world of machine learning, it comes as no surprise that employing learning techniques to rank documents is a hot research area. This concept is popularly known as learning to rank (LtR) [1].
More formally, we wish to learn to rank a document-query pair (di, qi) by returning a relevance score si ∈ ℝ, using a training set of similarly structured pairs with ground-truth labels. Such a training set can be created with the help of human assessors, analysis of clickthrough rates, etc.
However, such ranking algorithms are computationally expensive, and can prove prohibitively slow if applied to all of the candidate documents.
The solution to this problem is simple: a regular web surfer is seldom interested in retrieving a complete list of relevant documents. In other words, ranking documents properly is more important than achieving a high recall value. Hence, in an initial processing phase, a faster yet much simpler algorithm can be used to filter the collection down to a shortlist of documents, which gives us the opportunity to apply the more costly ranking algorithm on this reduced set.
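This two-phase pipeline can be sketched as follows. The function names, the toy corpus, and both scoring functions are invented for illustration; only the overall filter-then-rank structure mirrors the text.

```python
def two_phase_rank(docs, query, cheap_score, expensive_score, k=100):
    # Phase 1: a cheap scorer filters the collection down to the top-k candidates.
    candidates = sorted(docs, key=lambda d: cheap_score(d, query), reverse=True)[:k]
    # Phase 2: the costly ranker runs only on this reduced set.
    return sorted(candidates, key=lambda d: expensive_score(d, query), reverse=True)

# Toy example: term overlap as the cheap score, a made-up length-based
# scorer standing in for the expensive ranker.
docs = ["cat", "dog", "catalog", "category"]
cheap = lambda d, q: sum(w in d for w in q.split())
costly = lambda d, q: -len(d) if q in d else -100  # prefer shorter matching docs

print(two_phase_rank(docs, "cat", cheap, costly, k=3))
# ['cat', 'catalog', 'category'] — "dog" never reaches the expensive ranker
```

In a real engine, phase 1 would be something like BM25 over an inverted index, and phase 2 the learned ranking model.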
One such interesting LtR approach, named QuickScorer, is presented by Lucchese et al. in their paper [2], published at the SIGIR conference in 2015. They use the BM25 algorithm in the initial phase to narrow down the set of documents, and then apply an additive ensemble of regression trees, which computes a weighted scalar sum of the predictions made by the trees in the ensemble. They also show an interesting set of bit-level manipulations that further speed up the algorithm.
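A minimal sketch of what "additive ensemble of regression trees" means: the relevance score is a weighted sum of the per-tree predictions. The nested-dict tree representation and all names here are illustrative, not the paper's actual data structures.

```python
def predict(tree, x):
    """Walk one regression tree down to a leaf and return its value."""
    while "leaf" not in tree:
        # Internal node: go left if the tested feature is <= the threshold.
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

def ensemble_score(trees, weights, x):
    """Weighted scalar sum of the individual tree predictions."""
    return sum(w * predict(t, x) for t, w in zip(trees, weights))

# Tiny two-tree ensemble over a single feature x[0].
t1 = {"feature": 0, "threshold": 0.5, "left": {"leaf": 1.0}, "right": {"leaf": 2.0}}
t2 = {"feature": 0, "threshold": 0.3, "left": {"leaf": 0.5}, "right": {"leaf": 1.5}}

print(ensemble_score([t1, t2], [0.7, 0.3], [0.4]))  # 0.7*1.0 + 0.3*1.5 ≈ 1.15
```

The cost of this naïve version is one root-to-leaf traversal per tree per document, which is exactly what QuickScorer's bitvector trick reorganizes.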
They do so by encoding the ensemble of regression trees as vectors of bits (called bitvectors). Each tree h is associated with a bitvector vh that encodes the current state of its leaves, marking the leaves that can still be the tree's exit leaf.
An interleaved traversal of the trees of the whole ensemble is performed at once, contrary to the more common approach of traversing the ensemble tree by tree.
In memory, the nodes of a tree can be represented as separate parallel arrays, each holding one component of a node (a struct-of-arrays layout).
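A hedged sketch of such a struct-of-arrays layout, with invented field names and values: the i-th entry of each array holds one attribute of node i, keeping same-typed data contiguous, which is more cache-friendly than an array of node objects.

```python
# Parallel arrays for the internal nodes of one (hypothetical) tree.
feature_ids = [0, 1, 1]                   # which feature each node tests
thresholds  = [0.5, 0.3, 0.7]             # split threshold of each node
bitvectors  = [0b1100, 0b1110, 0b1011]    # leaves struck off when the node is false

def node(i):
    # Assemble a logical "node" view from the parallel arrays.
    return feature_ids[i], thresholds[i], bitvectors[i]

print(node(1))  # (1, 0.3, 14)
```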
The algorithm cycles through each feature fi ∈ ℝ, striking off leaves as candidate exit nodes of their respective trees by simply performing a logical AND between vh and the bitvector associated with each false node (an internal node whose test evaluates to false).
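This bitvector update can be sketched for a single toy tree with 4 leaves. The data below is invented: `false_nodes` lists, per feature, the (threshold, mask) pairs of the tree's internal nodes testing that feature, sorted by threshold as in the paper; a node is "false" when the feature value exceeds its threshold, and its mask clears the leaves that then become unreachable. (The paper takes the leftmost surviving bit as the exit leaf; this toy encoding uses the lowest set bit for the same purpose.)

```python
NUM_LEAVES = 4
leaf_values = [0.1, 0.4, 0.7, 0.9]

# Per-feature (threshold, mask) pairs for this tree's internal nodes:
#   root tests f0 at 0.5; if false, leaves 0 and 1 are pruned (mask 0b1100)
#   two nodes test f1, at 0.3 (prunes leaf 0) and 0.7 (prunes leaf 2)
false_nodes = {
    0: [(0.5, 0b1100)],
    1: [(0.3, 0b1110), (0.7, 0b1011)],
}

def exit_leaf(features):
    v = (1 << NUM_LEAVES) - 1          # all leaves start as candidates: 0b1111
    for f, nodes in false_nodes.items():
        for threshold, mask in nodes:
            if features[f] > threshold:  # node evaluates to false
                v &= mask                # strike off its unreachable leaves
            else:
                break  # thresholds are sorted, so the remaining nodes are true
    i = (v & -v).bit_length() - 1        # lowest surviving candidate leaf
    return leaf_values[i]

print(exit_leaf([0.6, 0.5]))  # root false, f1-node at 0.7 true -> leaf 2 -> 0.7
```

The key point is that no tree is ever traversed node by node: the exit leaf falls out of a handful of AND operations, driven entirely by comparisons grouped per feature.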
QuickScorer was found to be about 6.5x faster than the then state-of-the-art algorithm VPRED, which used a vectorized implementation of an ensemble of regression trees but did not apply the clever bit-level manipulations introduced in QuickScorer.
This shows how simple yet clever optimizations can turn a prohibitively slow algorithm into a production-ready one.
[2] C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, N. Tonellotto, and R. Venturini. "QuickScorer: A fast algorithm to rank documents with additive ensembles of regression trees". In Proceedings of the 38th ACM SIGIR Conference, pages 73–82, 2015.