Improving Cross-language Information Retrieval Systems

Introduction

Cross-language Information Retrieval (CLIR) is a sub-field of Information Retrieval that deals with returning results in a language different from that of the query. For example, a user may query a system in English and expect results in Spanish or French. The need for such systems arises from information that is globally available yet linguistically impenetrable: a large amount of information exists, but many users are monoglots and cannot properly access it. A cross-language IR system can therefore be defined as a system in which, when a query is entered, documents in the requested target language are retrieved and subsequently ranked.

How it works

CLIR systems work in a similar fashion to traditional IR systems. When a query is entered, it is processed and matched against the document index, and the matching documents are fetched and ranked by how well they fit the query. The only addition in CLIR systems is that translation must happen at some point, because the results are in a different language from the query. The main options are:
  1. Translate the query into the language of the document.
  2. Translate the retrieved documents into the language of the query, or
  3. Translate both the query and the documents into a common intermediate representation and then return the results.
The first option is used most often because it is computationally much cheaper than the other two. Blind relevance feedback has been used to return better matched and ranked documents. The overall goal is to improve the search results, because many words and search requests are lost in translation. A common scenario is that end users can interpret information in the target language but cannot express their information need in that language. In such a system, the documents are in the language the user is looking for, and the user can query the system in English.

An Improvement to Current CLIR Systems

Current systems already use translators, synonym generators, and query suggestions. Incorporating the Naïve Bayes algorithm and Particle Swarm Optimization (PSO) to better cluster and rank the documents using n-gram matching could make CLIR systems much easier to use and much more useful.

Initially, the user enters the query in English, at which point pre-query expansion takes place. The expanded query is then translated into the other language using a bilingual dictionary. Finally, the translated query is matched against the documents and the ranked documents are returned. For this final step, the Naïve Bayes algorithm and Particle Swarm Optimization are used.
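
The first two steps above (query expansion, then dictionary-based translation) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the tiny English-to-Spanish dictionary, the expansion table, and the function names expand_query, translate_query, and prepare_query are all hypothetical and do not come from any particular CLIR system.

```python
# Hypothetical toy English-to-Spanish dictionary (assumption for illustration only).
# A real system would load a full bilingual lexicon.
BILINGUAL_DICT = {
    "river": ["río"],
    "stream": ["arroyo"],
    "water": ["agua"],
    "quality": ["calidad", "cualidad"],
}

# Hypothetical expansion table mapping a term to related terms (assumption).
EXPANSIONS = {
    "river": ["stream"],
}


def expand_query(terms):
    """Append related terms to the query, keeping order and avoiding duplicates."""
    expanded = list(terms)
    for term in terms:
        for related in EXPANSIONS.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def translate_query(terms):
    """Replace each term with all of its dictionary translations.

    Terms missing from the dictionary (for example proper nouns) are kept as-is.
    """
    translated = []
    for term in terms:
        for candidate in BILINGUAL_DICT.get(term, [term]):
            if candidate not in translated:
                translated.append(candidate)
    return translated


def prepare_query(query):
    """Lowercase and split the query, expand it, then translate it."""
    terms = query.lower().split()
    return translate_query(expand_query(terms))


print(prepare_query("river water"))
```

Keeping every candidate translation for an ambiguous term (as with "quality" here) is one simple design choice; it trades precision for recall, which is one reason a later ranking step matters.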


With the Naïve Bayes algorithm, various n-gram clusters are formed for the returned documents, based on how many words of the query each document matches. PSO is then used to perform in-cluster ranking. It calculates the term count of each document; the document with the highest term count is placed at the top of that cluster's ranking and then removed from the cluster. This is repeated until no document is left unranked.
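
To make the cluster-then-rank idea concrete, here is a simplified sketch. It does not implement Naïve Bayes or Particle Swarm Optimization; it only mimics the behaviour described above, under the assumption that a document's cluster is the length of the longest query n-gram it contains and that in-cluster order is decided purely by term count. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def longest_ngram_match(query_terms, doc_terms):
    """Return the length of the longest contiguous query n-gram found in the document."""
    for n in range(len(query_terms), 0, -1):
        doc_ngrams = {
            tuple(doc_terms[i:i + n]) for i in range(len(doc_terms) - n + 1)
        }
        for i in range(len(query_terms) - n + 1):
            if tuple(query_terms[i:i + n]) in doc_ngrams:
                return n
    return 0


def term_count(query_terms, doc_terms):
    """Count how many document tokens are query terms (repeats included)."""
    query_set = set(query_terms)
    return sum(1 for token in doc_terms if token in query_set)


def cluster_and_rank(query, docs):
    """Group documents by longest matched n-gram, then rank inside each group."""
    query_terms = query.lower().split()
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        doc_terms = text.lower().split()
        n = longest_ngram_match(query_terms, doc_terms)
        if n > 0:  # documents with no matching term are dropped
            clusters[n].append((doc_id, term_count(query_terms, doc_terms)))

    ranked = []
    for n in sorted(clusters, reverse=True):  # longer n-gram matches first
        members = clusters[n]
        while members:  # repeatedly pull out the highest term count
            best = max(members, key=lambda member: member[1])
            members.remove(best)
            ranked.append(best[0])
    return ranked


docs = {
    "a": "calidad del agua en el río",
    "b": "agua agua agua calidad",
    "c": "calidad agua y agua calidad agua",
    "d": "nada relevante",
}
print(cluster_and_rank("calidad agua", docs))
```

In this toy run, document "c" is the only one containing the full bigram, so it forms the top cluster; "b" and "a" share the unigram cluster and are ordered by term count.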

With this approach, the system is reported to return more relevant documents than systems that do not use n-gram matching with Naïve Bayes. In this way it improves the ranking of the documents and provides better results for language-based IR systems.

