Information Retrieval Introduction
Human beings learn a lot from nature. In nature, most metals are found in the form of ores rather than in their pure, valuable form, so we have to extract them. Similarly, useful information is usually buried in a larger body of data: we have to extract it by analysis and pass it on to the concerned user in a concentrated form.
In the same way, this subject is about extracting useful information from a collection of files and documents. It teaches us how to read an available set of data as a whole, separate the useful data from the rest, and then present the useful part to the user.
No knowledge is a waste; a given piece of information is simply useful to a different set of people. For example, some users might find it important to know how the hardware wires of a system are connected, whereas software engineers might find the same information completely useless.
Purpose
The sole purpose of this subject is the approach: the approach to extracting information. This extraction depends on some concerns:
-
What the user wants.
-
The mapping between the system’s understanding and the user need.
-
The capability of our system to understand the user need.
Many approaches to extracting information are available, and different measures are used to rank the results. Some examples are TF-IDF, cosine score, title indexes, champion lists and authority scoring.
Benefit
The use of a retrieval system is that it maps the input data to the user need. The better the mapping, the better the outcome. Hence, the benefit is the quality of the system’s response to the user when consulted.
For better information retrieval, the following approach is to be followed:
-
Understanding the user need correctly.
-
Converting the user needs to query form.
-
Querying the retrieval system.
-
Post-filtering the extracted information.
-
Ranking the output of the system.
-
Presenting to the user.
Approaches in detail:
-
Understanding the user need correctly: This is the first step in building a successful retrieval system. It is important because it tells the system what the user wants. Until this step is performed successfully, the system would be taking shots in the dark. To avoid this kind of random guessing, it is of utmost importance to understand the user need correctly.
-
Converting the user needs to query form: After successfully learning what the user needs, the need has to be converted to an intermediate form. This intermediate form is a representation of the user need in our system’s language, and is known as a query.
User needs can be divided into 3 categories:
-
Informational - These queries seek some kind of information. Ex: an introduction to a topic, in-depth queries etc.
-
Navigational - These queries ask for the website or homepage of a specific organization or entity. Ex: searching for the official webpage of the World Health Organization.
-
Transactional - These queries are generally used to perform a transaction, e.g. a purchase on an e-commerce website.
A certain concern in today’s retrieval systems is what kind of output should be presented for queries belonging to the above categories. The general belief is that, for a navigational query, the system should return the intended homepage of the required organization. For an informational query, any kind of relevant webpage is useful.
After the user’s input is accepted, it is preprocessed to convert it to a query. Preprocessing also helps in identifying the category of the query, which in turn helps in providing a better search result to the user.
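The category detection described above can be sketched roughly as follows. This is only a minimal rule-based illustration: the cue-word sets and the function name classify_query are assumptions made up for this example, and a real system would more likely learn such signals from query logs.

```python
# Hypothetical cue words, chosen only for illustration.
NAVIGATIONAL_CUES = {"homepage", "website", "webpage", "official", "login"}
TRANSACTIONAL_CUES = {"buy", "order", "purchase", "download", "price"}


def classify_query(query: str) -> str:
    """Guess the category of a query from simple cue words.

    Returns 'transactional', 'navigational' or 'informational'.
    Anything without a cue word falls back to 'informational'.
    """
    tokens = set(query.lower().split())
    if tokens & TRANSACTIONAL_CUES:
        return "transactional"
    if tokens & NAVIGATIONAL_CUES:
        return "navigational"
    return "informational"


if __name__ == "__main__":
    for q in ("official webpage of world health organization",
              "buy running shoes",
              "introduction to information retrieval"):
        print(q, "->", classify_query(q))
```

Transactional cues are checked first here so that a query like "buy from official website" is treated as a transaction; that ordering is a design choice of this sketch, not a rule.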
-
Querying the retrieval system: There are various ways of querying a retrieval system.
-
On the basis of keywords - While preprocessing the query, we might extract the important keywords from it. These keywords are then searched for in the documents.
-
On the basis of phrase queries - Extracting keywords is not of much benefit in this category. Instead, the stop words are removed and the remaining words are sent to the system as a single unit. But sometimes stop words are useful as well. Ex: ‘flight london’ may not make much sense even to the system, but ‘flight to london’ surely does.
-
On the basis of intended meaning - Sometimes a query can’t be searched directly in the system, so there is a need for query expansion. Query expansion means searching for the query as well as for synonyms of its terms. Ex: if the query is ‘dine in restaurants’, then rather than searching for only this query, we also search for ‘dine in cafe’.
This way, the system can find all the documents that successfully match the query.
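Keyword extraction and query expansion can be sketched as below. The stop-word list and the synonym table are tiny, hypothetical stand-ins (a real system would use a proper stop list and a thesaurus), and the function names are assumptions for this example only.

```python
# Hypothetical, deliberately tiny resources for illustration.
STOP_WORDS = {"a", "an", "the", "in", "to", "of", "for", "on"}
SYNONYMS = {"restaurants": ["cafe"]}


def extract_keywords(query: str) -> list[str]:
    """Lowercase the query and drop stop words, keeping word order."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]


def expand_query(query: str) -> list[str]:
    """Return the original query plus one variant per known synonym."""
    tokens = query.lower().split()
    variants = [" ".join(tokens)]
    for i, token in enumerate(tokens):
        for synonym in SYNONYMS.get(token, []):
            variants.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return variants


if __name__ == "__main__":
    print(extract_keywords("flight to london"))  # ['flight', 'london']
    print(expand_query("dine in restaurants"))   # ['dine in restaurants', 'dine in cafe']
```

Note that extract_keywords turns ‘flight to london’ into ‘flight london’, which is exactly the case where dropping stop words can lose meaning.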
-
Post-filtering the extracted information: Even after a successful search is performed on the input data set to find relevant information, the result is still not tailored to the user’s needs. Searches can be of the following types:
-
Single keyword search: In this type of searching/querying, the dataset is matched against every single keyword present in the query. All documents which contain any of the keywords are returned.
-
Multi-keyword search / Phrase queries: In this type of searching/querying, the dataset is matched against the query as a whole. This can be done in many ways:
-
Bigram search: Searching the dataset for every pair of consecutive keywords in the query.
-
Extended bigram search: Similar to bigram search. The difference is that we ignore all stop words in the query and then form the pairs.
-
Proximity search: We need a modified dataset for this kind of search. Along with the words and the doc-id, the position of each word in the document should also be stored. We can then search with queries like ‘word1 distance word2’, where distance represents the distance between the positions of the two words in the document.
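The three phrase-query variants above can be sketched as follows. This is a minimal illustration under some assumptions: documents are plain strings split on whitespace, the stop-word list is a hypothetical small set, and ‘word1 distance word2’ is interpreted as "the two words occur at most distance positions apart, in either order". All function names are made up for this example.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "in", "to", "of", "and"}  # hypothetical stop list


def bigrams(tokens):
    """Every pair of consecutive tokens."""
    return list(zip(tokens, tokens[1:]))


def extended_bigrams(tokens):
    """Drop stop words first, then pair up what is left."""
    return bigrams([t for t in tokens if t not in STOP_WORDS])


def build_positional_index(docs):
    """Map term -> doc-id -> list of positions of the term in that document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def proximity_search(index, word1, word2, distance):
    """Doc-ids where word1 and word2 occur at most `distance` positions apart."""
    postings1 = index.get(word1, {})
    postings2 = index.get(word2, {})
    hits = set()
    for doc_id in postings1.keys() & postings2.keys():
        if any(abs(p1 - p2) <= distance
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            hits.add(doc_id)
    return hits


if __name__ == "__main__":
    docs = {1: "cheap flight to london", 2: "london weather and flight delays"}
    index = build_positional_index(docs)
    print(extended_bigrams("flight to london".split()))    # [('flight', 'london')]
    print(proximity_search(index, "flight", "london", 2))  # {1}
```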
After the search is performed, we need to perform post-filtering such as:
-
Taking the intersection of the output doc-ids, to find the most relevant documents in the dataset.
-
In the case of bigram search, setting a threshold, e.g. keeping only those documents that contain at least 4 pairs from the query.
So, this explains the post-filtering step used to find the final return value of the system.
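A rough sketch of the single keyword search and the two post-filters above is given here. It assumes a simple in-memory index that maps each term to a set of doc-ids. The helper names and the sample documents are hypothetical, and the threshold is passed as a parameter rather than fixed at 4.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of doc-ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def any_keyword(index, keywords):
    """Single keyword search: documents containing at least one keyword."""
    result = set()
    for keyword in keywords:
        result |= index.get(keyword, set())
    return result


def all_keywords(index, keywords):
    """Post-filter: intersect the doc-id sets of all keywords."""
    postings = [index.get(keyword, set()) for keyword in keywords]
    return set.intersection(*postings) if postings else set()


def bigram_threshold(docs, query_bigrams, min_pairs):
    """Post-filter: keep documents containing at least `min_pairs` query bigrams."""
    kept = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_bigrams = set(zip(tokens, tokens[1:]))
        matches = sum(1 for pair in query_bigrams if pair in doc_bigrams)
        if matches >= min_pairs:
            kept.append(doc_id)
    return kept


if __name__ == "__main__":
    docs = {1: "cheap flight to london",
            2: "london weather today",
            3: "flight delays in paris"}
    index = build_inverted_index(docs)
    print(any_keyword(index, ["flight", "london"]))   # {1, 2, 3}
    print(all_keywords(index, ["flight", "london"]))  # {1}
```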
-
Ranking the output of the system: This step also falls into the category of post-filtering. In this step, the doc-ids retrieved from the system are ranked on some measures. Some are:
-
Ranking on the basis of authority: In this method, documents which have a high PageRank value are ranked above the others.
-
Ranking on the basis of title: Some documents contain the query keywords in their title as well. These documents are ranked higher than others, as a match in the title suggests higher relevance.
-
Cosine score: The cosine score measures the similarity between a document and the query, both represented as term vectors. The closer the document is to the query, the higher its score.
-
TF-IDF score: This is a combination of two scores, the TF score and the IDF score. The TF score (term frequency) is the count of a term in a document; it represents how relevant a document is for that keyword. The IDF score (inverse document frequency) is inversely related to the number of documents in which a term is present, and is commonly computed as log(N/df), where N is the total number of documents and df is the number of documents containing the term; it represents the importance of a term. The more documents a term is present in, the lower its IDF score.
There exist more ranking methods, but these are among the most used ones.
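TF-IDF weighting and cosine scoring can be combined into a small ranking sketch like the one below. It assumes raw term counts for TF and log(N/df) for IDF, which is one common variant among several, and it tokenizes by whitespace only. The function names and sample documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return ({doc_id: {term: weight}}, {term: idf}) with weight = tf * idf."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
               for doc_id, tokens in tokenized.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Doc-ids sorted by cosine score against the query, best first."""
    vectors, idf = tf_idf_vectors(docs)
    query_tf = Counter(query.lower().split())
    # Query terms unseen in the collection get weight 0.
    query_vec = {term: tf * idf.get(term, 0.0) for term, tf in query_tf.items()}
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    docs = {1: "cheap flight to london",
            2: "london weather today",
            3: "flight delays in paris"}
    print(rank(docs, "cheap flight"))  # [1, 3, 2]
```

In this toy collection ‘flight’ appears in two documents and ‘cheap’ in only one, so ‘cheap’ gets the higher IDF and document 1, which contains both terms, ranks first.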
Conclusion
So, the above are the basic steps of an information retrieval system: understanding the user need, searching methods, post-filtering, ranking of outputs etc. One more thing: if we have a data set, we first need to turn it into an inverted index.
Inverted index construction and optimization will be discussed in my next blog.
If you enjoyed the blog, I would be grateful to hear from you. Do write to me at ojasvi17033@iiitd.ac.in.