Improving IR Systems using Concept Search Method


The aim of an IR (Information Retrieval) system is to maximize the number of relevant documents returned for every query. An IR system can take one of two approaches: Keyword Matching or Concept Matching.

Keyword matching is imprecise, so it often returns a proportion of irrelevant documents. The imprecision is due to:
  • Polysemy – A word can have more than one meaning, e.g. "java" can refer to a programming language, an island, or coffee.
  • Synonymy – Two or more words can have the same meaning, e.g. "IR" and "Information Retrieval" mean the same thing.
  • Spelling errors, which lead to term mismatches.
To overcome this, we can use Concept Search, which matches the meaning of the keyword rather than only the keyword itself. The query and the documents are matched conceptually, which helps in retrieving more relevant and important results.

Concept Search can increase recall without decreasing precision, and it can find relevant documents that share the query's conceptual meaning even when they contain none of the query terms.
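The recall and precision claim can be made concrete with a little arithmetic. The sketch below is only an illustration: the document ids and both result sets are made up, and `precision_recall` is a helper name introduced here, not part of any library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: documents 1-4 are relevant to the query.
relevant = {1, 2, 3, 4}

# Assumed keyword results: 3 and 4 are missed because they use synonyms,
# and 7 is returned although it uses the keyword in another sense.
keyword_results = {1, 2, 7}

# Assumed concept results: all relevant documents plus one false hit.
concept_results = {1, 2, 3, 4, 8}

print("keyword:", precision_recall(keyword_results, relevant))
print("concept:", precision_recall(concept_results, relevant))
```

With these assumed result sets, keyword matching reaches a precision of 2/3 and a recall of 0.5, while concept matching reaches 0.8 and 1.0. Real numbers depend entirely on the collection and the system.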

One can combine concept search with keyword search to refine the results further: for a keyword that has more than one meaning, apply a conceptual search and return only those documents in which the keyword is used in the intended sense.

To offer users the different possible meanings of a keyword and to identify the concepts in a document collection, a concept-based IR system needs a repository of word meanings. One such repository is WordNet, a lexical database of English.
One can refer to the WordNet site for more details.
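As a rough sketch of how a sense repository can refine a keyword search, the example below uses a tiny hand-written dictionary as a stand-in for a resource such as WordNet. The `SENSES` inventory, the sample documents, and the function names are all hypothetical, and the word-overlap test is a deliberately simple assumption rather than how a production system would disambiguate.

```python
# Hypothetical sense inventory: keyword -> {sense label: typical context words}.
SENSES = {
    "java": {
        "programming language": {"code", "class", "compiler", "jvm"},
        "island": {"indonesia", "jakarta", "island", "volcano"},
        "coffee": {"coffee", "cup", "brew", "roast"},
    }
}

# Made-up document collection.
DOCS = {
    "d1": "java class files are produced by the compiler and run on the jvm",
    "d2": "jakarta is the largest city on the island of java in indonesia",
    "d3": "a fresh cup of java needs a good roast and a careful brew",
    "d4": "python code is often compared with other languages",
}


def senses_of(keyword):
    """List the meanings the repository knows for a keyword."""
    return sorted(SENSES.get(keyword, {}))


def search_by_sense(keyword, sense, docs=DOCS):
    """Return documents that contain the keyword AND context words of the chosen sense."""
    context = SENSES[keyword][sense]
    results = []
    for doc_id, text in docs.items():
        tokens = set(text.split())
        if keyword in tokens and tokens & context:
            results.append(doc_id)
    return results


print(senses_of("java"))
print(search_by_sense("java", "island"))
```

The keyword filter keeps `d4` out even though it mentions "code", and the sense filter separates the three documents that all contain "java".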

Implementation of Concept Search:

Concept Search can be implemented in the following ways:

  • Synonym Searching – A thesaurus-matching technique supplies words with the same or a similar meaning to the original word, so that they can be searched for as well.
  • Semantic Indexing – Every term in the inverted index and every searchable document has a position vector in a conceptual space. If two of these vectors are close to each other, they share a conceptual relationship, whether or not they share the same term; the greater the distance between the two vectors, the weaker the conceptual relationship. During a concept search, the query is mapped into the concept space in the same way as the searchable documents. The system then locates the documents whose vectors are closest to the query's vector and returns them. The document closest to the query's vector has the highest conceptual score.
Word2Vec, a technique developed at Google, learns vector representations of words. It works by training a machine learning model to predict the words surrounding a given word in a sentence. Words with similar meanings end up with similar vector representations.
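The ranking step of semantic indexing can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional concept space, the document vectors, and the query vector are invented by hand (a real system would learn them, for example with Word2Vec), and cosine similarity is used as one common way to measure closeness, which the text above does not prescribe.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical concept space with axes (medicine, computing, cooking).
DOC_VECTORS = {
    "cardiology_notes": [0.9, 0.1, 0.0],
    "python_tutorial": [0.0, 1.0, 0.1],
    "hospital_menu": [0.4, 0.0, 0.8],
}


def rank(query_vector, doc_vectors=DOC_VECTORS):
    """Return (doc_id, score) pairs, closest document first."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    scored.sort(reverse=True)
    return [(doc_id, round(score, 3)) for score, doc_id in scored]


# Assumed vector for a query such as "heart doctor"; it shares no terms
# with the document names, only a position in the concept space.
query = [1.0, 0.0, 0.1]
for doc_id, score in rank(query):
    print(doc_id, score)
```

The document listed first is the one whose vector points in nearly the same direction as the query, which corresponds to the highest conceptual score described above.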

Benefits of Concept Search over Keyword Search:

  • Natural language leads to term mismatches: a single word can have many meanings (e.g. java), and different words can mean the same thing (e.g. hot and warm). Concept search helps to overcome this because it doesn't depend on exact word matching.
  • It depends on concepts rather than on single words, so it can be more accurate in finding relevant documents.
  • Concept search can handle long queries, allowing users to describe what they need as concepts or ideas.
  • Concept search can provide an accurate synonym list.
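The first benefit, getting past mismatches such as "hot" versus "warm", can be illustrated with thesaurus-based query expansion, the simplest of the implementation approaches described earlier. The sketch assumes a tiny hand-made thesaurus and document set; `THESAURUS`, `expand`, and `search` are illustrative names, not an existing API.

```python
# Hypothetical thesaurus: term -> set of synonyms.
THESAURUS = {
    "hot": {"warm", "heated"},
    "car": {"automobile"},
}

# Made-up document collection.
DOCS = {
    "d1": "serve the soup while it is still warm",
    "d2": "the automobile industry is changing",
    "d3": "hot drinks are sold here",
}


def expand(query, thesaurus=THESAURUS):
    """Add the known synonyms of every query term to the term set."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= thesaurus.get(term, set())
    return terms


def search(query, docs=DOCS, use_synonyms=True):
    """Return ids of documents that share at least one term with the query."""
    terms = expand(query) if use_synonyms else set(query.lower().split())
    return sorted(
        doc_id for doc_id, text in docs.items() if terms & set(text.lower().split())
    )


print(search("hot", use_synonyms=False))  # ['d3']
print(search("hot"))                      # ['d1', 'd3']
```

Plain keyword matching finds only the document containing "hot", while the expanded query also retrieves the one that says "warm".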
