IIITD IR MELANAGE

Query Expansion Using WordNet

Whenever you need to find out about something, which resource do you turn to most often? Friends, professors or newspapers? I guess the answer is Google for most of us. So, are you able to find your information with just one query? Or do you have to modify the query to meet your requirement? This reformulation or expansion of the query is called query expansion (QE) in IR systems. You must have wondered why the results sometimes match your requirement exactly, while at other times, however hard you try, you just have to settle for whatever results the search engine has to offer. It also happens that some of our friends are very good at searching. When we are not able to find something, we often ask that one friend whom we call a pro at googling.

Figure 1. Query Length used by users.

Source: https://www.researchgate.net/figure/Words-Per-Query-in-the-Log_fig1_221194620

There are four major reasons for the inefficiency of search engines.

First, the query terms might be related to multiple topics, none of which satisfy the user's information need. Second, as per the statistics, queries are only about 2.4 words long on average [1]. This length is insufficient to model the exact information need. Third, we as users are often not sure what we are looking for until we find something which we think is relevant. Fourth, not all of us are sure which keywords will meet our information need.

IR systems create an index of the terms present in the corpus after removal of stop words, stemming and lemmatization. This is called the pre-processing of the data. For retrieving the documents, the query words are pre-processed in the same way and matched with the index words. If a match is found, the documents meeting a certain threshold or retrieval criterion are returned to the user.

This results in an issue known as the vocabulary problem. The terms present in the documents might be similar in meaning to the query terms without being an exact match, and hence some relevant documents might not be retrieved.
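To make the indexing and matching steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the stop-word list is tiny and hand-picked, `crude_stem` is a rough stand-in for a real stemmer, and the names `preprocess`, `build_index` and `search` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and", "for"}

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Map each pre-processed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def search(index, query, min_matches=1):
    """Return ids of documents that match at least min_matches query terms."""
    counts = defaultdict(int)
    for term in set(preprocess(query)):
        for doc_id in index.get(term, ()):
            counts[doc_id] += 1
    return sorted(d for d, c in counts.items() if c >= min_matches)

# Made-up mini corpus.
docs = {
    1: "buying a used car",
    2: "automobile prices in the city",
    3: "cars and trains",
}
index = build_index(docs)
car_results = search(index, "car")
```

With this toy corpus, searching for "car" finds documents 1 and 3 but misses document 2, which only says "automobile". That miss is the vocabulary problem in miniature.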

Hence, we need a technique which overcomes the above-mentioned issues along with the problems of synonymy (different words having the same meaning) and polysemy (the same word having different meanings). There are multiple QE techniques which work on these issues, using the concepts of the words in the query, co-occurrence of words, probability-based models or WordNet-based models.

In this blog, I am going to talk about QE using WordNet. WordNet is a thesaurus created manually by researchers at Princeton University. It is a network of synsets. A synset is a set of words which mean the same in a context. Using WordNet, it is possible to find all possible expressions that can be used in place of a given word in the same context [2], [3].
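The idea of a synset can be sketched with a hand-made toy thesaurus. The data below is invented for illustration and is not real WordNet content, and `synsets_of` and `substitutes` are hypothetical helper names. In practice you would query WordNet itself, for example through NLTK's `wordnet.synsets`, which needs the WordNet corpus downloaded.

```python
# Hand-made toy synsets (NOT real WordNet data): each synset is a set of
# words that mean the same thing in one context.
TOY_SYNSETS = [
    {"car", "auto", "automobile"},
    {"car", "railcar", "railway car"},
    {"good", "goodness"},
]

def synsets_of(word):
    """Return every toy synset that contains the word."""
    return [s for s in TOY_SYNSETS if word in s]

def substitutes(word):
    """For each synset of the word, list the other words in that synset."""
    return [sorted(s - {word}) for s in synsets_of(word)]

alternatives = substitutes("car")
```

Here "car" belongs to two synsets, so `alternatives` holds two separate lists of replacements, one per context.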

A concept of a word is represented by three elements: the word, the POS and the POS_num. We denote it as a triple <word, POS, POS_num>. For example, the term good has four concepts when it is used as a noun, as shown in Fig. 2.

Figure 2. Concepts of noun good [4]
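One way to hold such a triple in code is a small named tuple. This is only a sketch: the class name `Concept` is hypothetical, and I am assuming that POS_num is the number of the concept within its part of speech, counted from 1.

```python
from typing import NamedTuple

class Concept(NamedTuple):
    word: str
    pos: str      # part of speech, e.g. "n" for noun
    pos_num: int  # assumed: number of the concept within that part of speech

# The noun "good" has four concepts, so we get four triples.
good_noun_concepts = [Concept("good", "n", n) for n in range(1, 5)]
```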

Using WordNet we can find the concepts of a word, or the words for a given concept. But what we require in QE are other words which mean the same in a given context, which cannot be found using WordNet alone. Hence, we need to resort to our all-time favourite technique called Word Sense Disambiguation (WSD). WSD computes a weight for the candidate words in a specific context. The words having the maximum score are chosen for disambiguation.
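To give a feel for how scoring in context can work, here is a minimal sketch of a simplified Lesk-style overlap score. Please note that this is a swapped-in, well-known baseline and not necessarily the weighting used in [4]. The glosses in `TOY_SENSES` are invented, and `score_sense` and `disambiguate` are hypothetical names.

```python
# Invented glosses for two senses of "bank" (not real WordNet entries).
TOY_SENSES = {
    "bank": {
        1: "financial institution that accepts deposits and lends money",
        2: "sloping land beside a river or lake",
    }
}

def score_sense(gloss, context_words):
    """Score a sense by how many context words appear in its gloss."""
    return len(set(gloss.split()) & context_words)

def disambiguate(word, query):
    """Return the best-scoring sense number and the score of every sense."""
    context = set(query.lower().split()) - {word}
    scores = {n: score_sense(g, context) for n, g in TOY_SENSES[word].items()}
    best = max(scores, key=scores.get)
    return best, scores

best, scores = disambiguate("bank", "bank river fishing")
```

For the query "bank river fishing", only the second gloss shares a word ("river") with the context, so sense 2 gets the maximum score and is picked.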

The synsets are represented in the form of an array, with the terms as its elements. It can be viewed as a two-dimensional array where each row represents a concept and the elements in a row come from the same synset. The QE technique selects terms from each row to form a sentence. Such a sentence may not always be meaningful. Hence, we need to remove such sentences from the list of possible expanded queries. The filtration step ensures that the new expanded query and the original query have the same dimensionality, so that it is easier to compare them and find out which one is better. The expanded and the original queries are both used to fetch results for a particular information need.
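The two-dimensional array and the filtration step can be sketched roughly as below. I am assuming that exactly one term is picked per row, which keeps every candidate the same length as the original query. The rows are made up, and the filter is a hypothetical allow-list (`KNOWN_PHRASES`) standing in for whatever meaningfulness check a real system would use.

```python
from itertools import product

# Each row holds the terms of one synset, one row per query concept
# (made-up data for a two-word query like "car price").
rows = [
    ["car", "auto", "automobile"],
    ["price", "cost"],
]

def candidate_queries(rows):
    """Pick one term from every row, in all possible combinations."""
    return [list(combo) for combo in product(*rows)]

# Hypothetical allow-list standing in for a real meaningfulness check.
KNOWN_PHRASES = {("car", "price"), ("car", "cost"), ("automobile", "price")}

def is_meaningful(candidate):
    """Keep a candidate only if it appears in the allow-list."""
    return tuple(candidate) in KNOWN_PHRASES

candidates = candidate_queries(rows)
expanded = [c for c in candidates if is_meaningful(c)]
```

Out of the six combinations, three survive the filter, and each one still has two terms, just like the original query.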

The researchers used average precision at 10 over 5 queries to analyse their results.

They found an average improvement of 7% when using query expansion compared with not using it at all. This technique has a lot of scope for research and improvement [4].
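For readers who have not met the metric, here is a minimal sketch of precision at 10 averaged over a set of queries. This is my reading of "average precision at 10", so treat it as an assumption. The function names are hypothetical, and the rankings and relevance judgements are made up and have nothing to do with the numbers reported in [4].

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k returned documents that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def mean_precision_at_k(runs, k=10):
    """Average precision@k over (ranking, relevant set) pairs, one per query."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Made-up rankings and relevance judgements for two queries.
runs = [
    (list(range(1, 11)), {1, 2, 3}),        # 3 of the top 10 are relevant
    (list(range(1, 11)), {1, 2, 3, 4, 5}),  # 5 of the top 10 are relevant
]
mean_p10 = mean_precision_at_k(runs)
```

To compare the original and the expanded query, you would compute this value once for each and look at the difference.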

References:

[1] C. Peters and P. Sheridan, “Multilingual information access,” in Lectures on Information Retrieval. Springer, 2000, pp. 51–80.

[2] S. Liu, F. Liu, C. Yu, and W. Meng, “An effective approach to document retrieval via utilizing wordnet and recognizing phrases,” in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR’04. New York, NY, USA: ACM, 2004, pp. 266–272. [Online]. Available: http://doi.acm.org/10.1145/1008992.1009039

[3] D. Pal, M. Mitra, and K. Datta, “Improving query expansion using wordnet,” Journal of the Association for Information Science and Technology, vol. 65, no. 12, pp. 2469–2478, 2014.

[4] J. Zhang, B. Deng, and X. Li, “Concept based query expansion using wordnet,” in Proceedings of the 2009 international e-conference on advanced science and technology. IEEE Computer Society, 2009, pp. 52–55.
