Information Retrieval Approaches

Information Retrieval Introduction

Human beings learn a lot from nature. In nature, most metals occur as ores rather than in a pure, usable form, so we have to extract them. Similarly, useful information is usually buried in a larger body of material: we have to extract it through analysis and pass it on to the concerned user in a concentrated form.

In the same way, this subject is about extracting useful information from a collection of files and documents. It teaches us how to read an available set of data as a whole, separate the useful data from the rest, and then present the useful part to the user.

No knowledge is wasted; a given piece of information is simply useful to some people and not to others. For example, some users might find it important to know how the hardware wires of a system are to be connected, whereas software engineers might find the same information completely useless.

Purpose

The purpose of this subject is the approach to extraction. How information is extracted depends on a few concerns:

What the user wants.
How well the system’s understanding maps to the user need.
The capability of our system to understand user need.

Many approaches to extracting information are available, and they rely on different measures for scoring and ranking, such as TF-IDF, cosine score, title indexes, champion lists and authority scoring.

Benefit

The use of a retrieval system is that it maps the input data to the user need. The better the mapping, the better the outcome. Hence, the benefit is the quality of the response the system gives the user when consulted.

For better information retrieval, the following steps should be followed:

Understanding the user need correctly.
Converting the user needs to query form.
Querying the retrieval system.
Post-filtering the extracted information.
Ranking the output of the system.
Presenting to the user.

Approaches in detail:

Understanding the user need correctly: This is the first step in building a successful retrieval system. It matters because it tells the system what the user wants. Until this step is performed successfully, the system is taking shots in the dark.

To avoid this kind of random guessing, it is of utmost importance to understand the user need correctly.

Converting the user needs to query form: Once we know what the user needs, the need has to be converted into an intermediate form. This intermediate form is a representation of the user need in our system’s language. It is known as a query.

User needs can be divided into 3 categories:

Informational - The user is seeking some kind of information. Ex: an introduction to a topic, in-depth queries etc.
Navigational - The user is looking for the website or homepage of a specific organization or entity. Ex: searching for the official webpage of the World Health Organization.
Transactional - The user wants to perform a transaction. Ex: a purchase on an e-commerce website.

A concern in today’s retrieval systems is what kind of output should be presented for queries in each of the above categories. The general belief is that, for a navigational query, the system should return the homepage of the intended organization. For an informational query, any relevant webpage is useful.

After the user’s input is accepted, it is preprocessed to convert it into a query. Preprocessing helps determine which category the query belongs to, which in turn helps provide a better search result to the user.
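The post does not spell out the preprocessing steps, so here is only a minimal sketch of what they could look like, assuming simple lowercasing, tokenization and optional stop-word removal. The `STOP_WORDS` list and the `preprocess` function are hypothetical names made up for illustration.

```python
import re

# Hypothetical stop-word list for illustration; real systems use a larger one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "for", "is", "on"}


def preprocess(raw_query, remove_stop_words=True):
    """Lowercase the query, split it into word tokens, optionally drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", raw_query.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(preprocess("Flight to London"))
print(preprocess("Flight to London", remove_stop_words=False))
```

Stop-word removal is kept as a switch because phrase-style queries sometimes need their stop words.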

Querying the retrieval system: There are various ways of querying a retrieval system.

On basis of keywords - While preprocessing the query, we can extract its important keywords. These keywords are then searched for in the documents.
On basis of phrase queries - Extracting keywords is of little benefit in this category. Instead, we remove stop words and query the system with the remaining words as a single unit. Sometimes, though, stop words are useful as well. Ex: ‘flight london’ is ambiguous even to the system, but ‘flight to london’ clearly expresses the intent.
On basis of intended meaning - Sometimes a query can’t be searched for directly in the system, so there is a need for query expansion. Query expansion means searching for the query as well as synonyms of its terms. Ex: if the query is ‘dine in restaurants’, then rather than searching only for this query, we also search for ‘dine in cafe’.
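Keyword querying and query expansion can be sketched roughly as follows. This is only an illustration: the toy `INDEX`, the `SYNONYMS` table and both function names are assumptions of mine, and a real system would take its synonyms from a thesaurus or a similar resource.

```python
# Toy inverted index: term -> set of doc-ids (hypothetical data).
INDEX = {
    "dine": {1, 2, 3},
    "restaurants": {1},
    "cafe": {2},
    "flight": {4},
}

# Hypothetical synonym table used for query expansion.
SYNONYMS = {"restaurants": ["cafe"]}


def expand(terms):
    """Return the query terms plus any known synonyms, preserving order."""
    expanded = []
    for term in terms:
        for candidate in [term] + SYNONYMS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def keyword_search(terms):
    """Return the doc-ids that contain at least one of the terms."""
    hits = set()
    for term in terms:
        hits |= INDEX.get(term, set())
    return hits


print(keyword_search(["restaurants"]))          # without expansion
print(keyword_search(expand(["restaurants"])))  # with expansion
```

With expansion, the document that only mentions ‘cafe’ is found as well.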

In this way, the system can cover the different ways in which a query might match the documents.

Post filtering the extracted information: Even after a search over the input data set has found relevant information, the result is still not tailored to the user’s needs. Searches can be of the following types:

Single keyword search: In this type of searching/querying, the dataset is matched against every single keyword present in the query. All documents that contain any of the keywords are returned.
Multi-keyword search/ Phrase queries: In this type of searching/querying, the dataset is matched against the query as a whole. This can be done in several ways:
- Bigram search: Searching the dataset for every pair of consecutive keywords in the query.
- Extended bigram search: Similar to bigram search. The difference is that we ignore all stop words in the query and then form pairs.
- Proximity search: We need a modified dataset for this kind of search. Along with the words and the doc-id, the position of each word in the document should also be stored. We can then search for ‘word1 distance word2’, where distance is the distance between the positions of the two words in the document.
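Here is a small sketch of proximity search over a positional index. It assumes whitespace tokenization and treats ‘distance’ as the maximum allowed gap between the two word positions, in either order; the post does not fix these details, so both are my assumptions, as are the function names.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def proximity_search(index, word1, word2, distance):
    """Return doc-ids where word1 and word2 occur within `distance` positions."""
    postings1 = index.get(word1, {})
    postings2 = index.get(word2, {})
    hits = set()
    # Only documents containing both words can match.
    for doc_id in postings1.keys() & postings2.keys():
        if any(abs(p1 - p2) <= distance
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            hits.add(doc_id)
    return hits


DOCS = {
    1: "cheap flight to london",
    2: "london hotels and a flight guide",
}
index = build_positional_index(DOCS)
print(proximity_search(index, "flight", "london", 2))
print(proximity_search(index, "flight", "london", 4))
```

Comparing every pair of positions is fine for a toy example; with sorted position lists a merge-style scan would avoid the nested loop.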

After the search is performed, we need to perform post filtering such as:

The intersection of the output doc-ids, to find the most relevant documents in the dataset.
In the case of bigram search, a threshold, such as keeping only those documents that contain at least 4 pairs from the query.
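Both post-filtering ideas, along with the bigram and extended bigram pairs they rely on, can be sketched as below. The function names, the stop-word list and the sample documents are hypothetical, and the documents are scanned directly here instead of going through a bigram index, purely to keep the example short.

```python
# Hypothetical stop-word list for illustration.
STOP_WORDS = {"to", "the", "of", "in", "a"}


def bigrams(tokens):
    """Return the pairs of consecutive tokens."""
    return list(zip(tokens, tokens[1:]))


def extended_bigrams(tokens):
    """Drop stop words first, then form pairs."""
    return bigrams([t for t in tokens if t not in STOP_WORDS])


def intersect(posting_lists):
    """Return the doc-ids present in every posting list."""
    if not posting_lists:
        return set()
    result = set(posting_lists[0])
    for postings in posting_lists[1:]:
        result &= set(postings)
    return result


def filter_by_bigram_threshold(query_tokens, docs, threshold):
    """Keep the docs that contain at least `threshold` of the query bigrams."""
    query_pairs = set(bigrams(query_tokens))
    kept = []
    for doc_id, text in docs.items():
        doc_pairs = set(bigrams(text.lower().split()))
        if len(query_pairs & doc_pairs) >= threshold:
            kept.append(doc_id)
    return kept


DOCS = {1: "cheap flight to london today", 2: "flight to paris"}
QUERY = ["flight", "to", "london"]
print(bigrams(QUERY))
print(extended_bigrams(QUERY))
print(intersect([{1, 2, 3}, {2, 3}, {3, 4}]))
print(filter_by_bigram_threshold(QUERY, DOCS, threshold=2))
```

The threshold is a tuning knob: a higher value returns fewer but closer matches.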

This covers the post-filtering step, which determines the final set of results returned by the system.

Ranking the output of the system: This step also falls into the category of post filtering. In this step, the doc-ids retrieved from the system are ranked by some measure. Some common ones are:

Ranking on basis of authority: In this method, documents with a high PageRank value are ranked above the others.
Ranking on basis of title: Some documents contain the query keywords in their title as well. These documents are likely to be more relevant, so they are ranked higher than the others.
Cosine score: The cosine score measures the similarity between a document and the query, computed as the cosine of the angle between their term vectors. The closer the document is to the query, the higher its score.
TF-IDF score: This is a combination of two scores, the TF score and the IDF score. The TF (term frequency) score is the count of a term in a document. It represents how relevant a document is for that keyword. The IDF (inverse document frequency) score is based on the inverse of the number of documents in which a term is present, and is commonly computed as log(N / df), where N is the total number of documents and df is the number of documents containing the term. It represents the importance of a term: the more documents a term appears in, the lower its IDF score.
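To make the last two measures concrete, here is a minimal sketch that weights terms by TF-IDF and ranks documents by cosine score against the query. It assumes raw term counts for TF, idf = log(N / df), and whitespace tokenization; other weighting variants exist, and the function names and sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return ({doc_id: {term: tf * idf}}, idf) with idf = log(N / df)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return doc-ids sorted by cosine score against the query, best first."""
    vectors, idf = tf_idf_vectors(docs)
    query_tf = Counter(query.lower().split())
    query_vec = {t: tf * idf.get(t, 0.0) for t, tf in query_tf.items()}
    scores = {d: cosine(query_vec, vec) for d, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


DOCS = {
    1: "flight to london",
    2: "hotel in london",
    3: "flight flight deals",
}
print(rank("flight deals", DOCS))
```

Note that a term present in every document gets an IDF of zero under this formula, so it contributes nothing to the score.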

There are more ranking methods, but these are the most commonly used ones.

Conclusion

So, these are the basic steps of an information retrieval system: understanding the user need, searching, post-filtering, ranking the output and so on. One prerequisite is that, if we have a data set, we need to turn it into an inverted index first.
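Just to give a feel for what an inverted index is, here is a bare-bones sketch assuming whitespace tokenization and no optimization at all. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of doc-ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


DOCS = {1: "flight to london", 2: "hotel in london"}
print(build_inverted_index(DOCS)["london"])
```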

Inverted index construction and optimization will soon be discussed in my next blog.

If you enjoyed the blog, I would be grateful to hear from you. Do write to ojasvi17033@iiitd.ac.in.

IIITD IR MELANAGE
