Introduction:
We use web search engines every day to retrieve documents or information related to our queries. But do they always give us the relevant ones? Sometimes we get irrelevant documents too. The main actor behind all this retrieval of information is the web crawler.
A crawler is a program which searches through the World Wide Web in an automated and systematic manner and gives the user the documents appropriate to his query.
Basically there are two types of crawlers:
1. Universal crawler (general crawler)
2. Focused or topical crawler
Universal crawler:
This is the crawler used by a normal search engine. It indexes all the terms of the documents it fetches and follows each and every hyperlink, so it sometimes retrieves documents not matching the query, i.e., irrelevant ones too.
Focused crawler:
This is the crawler used where the user wants information on a pre-defined topic or domain, for example news websites, or where someone wants to keep track of a technology related to his domain, say speech processing. Here the user wants results relevant to the topic entered as the query, so a universal crawler does not help.
In this blog we will concentrate on the focused crawler and its working.
Motivation and initial background:
A general web crawler visits all web pages (relevant or irrelevant) and creates copies of them for fast retrieval at a later time, but at the cost of high computation power and long retrieval time. In contrast, a focused crawler tries to download only the relevant set of documents. Fig. 1 [1] illustrates this.
Fig. 1
Basis of its working:
It works on the basis of:
1. Hyperlink relevancy
2. Document content relevancy
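Document content relevancy can be estimated in many ways. Below is a minimal sketch in Python, assuming a simple bag-of-words cosine similarity between the topic terms and the page text; the tokenizer and the example topic are illustrative, not a specific crawler's method.

import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into alphanumeric tokens (illustrative tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_relevance(topic, page_text):
    # Bag-of-words cosine similarity between the topic and the page text.
    t, p = Counter(tokenize(topic)), Counter(tokenize(page_text))
    dot = sum(t[w] * p[w] for w in t)
    norm = (math.sqrt(sum(v * v for v in t.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0

# Example: score a page snippet against the topic "speech processing".
print(cosine_relevance("speech processing",
                       "A tutorial on speech signal processing basics"))

A score near 1 means the page vocabulary closely matches the topic; a score near 0 means little overlap.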
General Working:
1. On receiving a query, the focused crawler takes the seed URLs from the URL frontier (a priority queue) and starts downloading the documents.
2. The parser and extractor extract the text of each document along with its hyperlinks, and the relevance of the text and of each hyperlink is calculated based on the topic (query).
3. The relevant URLs are added to the URL frontier in priority order and the irrelevant ones are discarded. The documents found relevant so far are added to the database.
4. Steps 1 to 3 are repeated (a code sketch of this loop follows).
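A minimal sketch of this loop in Python is given below, assuming a heap-based URL frontier. The seed URLs, the relevance threshold, and the crude keyword scorer and link extractor are all assumptions for illustration; a real crawler would also respect robots.txt, handle errors, and limit crawl depth.

import heapq
import re
import urllib.request

def fetch(url):
    # Download a page; return its HTML, or "" on failure.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="ignore")
    except Exception:
        return ""

def extract_links(html):
    # Crude href extractor for illustration; a real HTML parser is better.
    return re.findall(r'href=["\'](https?://[^"\']+)["\']', html)

def relevance(text, topic):
    # Placeholder scorer: normalized topic-word frequency
    # (the cosine-similarity sketch above could be used instead).
    words = set(topic.lower().split())
    return sum(text.lower().count(w) for w in words) / (len(text) + 1)

def focused_crawl(seed_urls, topic, threshold=1e-4, max_pages=50):
    frontier = [(-1.0, url) for url in seed_urls]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen, database = set(seed_urls), []
    while frontier and len(database) < max_pages:
        _, url = heapq.heappop(frontier)           # most promising URL first
        html = fetch(url)
        score = relevance(html, topic)
        if score < threshold:
            continue                               # discard irrelevant pages
        database.append(url)                       # keep the relevant document
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))  # inherit parent's score
    return database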
The main point is how the relevancy of links is calculated. There are different approaches for it:
1. Structure-based focused crawler:
The content of the downloaded page is checked, and the relevancy of a link is calculated by determining the score of the parent page containing that link.
2. Priority-based focused crawler:
The scheme discussed above under General Working is actually the priority-based one: the URLs which resemble the topic are stored in the URL frontier.
3. Learning-based focused crawler:
This approach trains a classifier on a set of training data covering the relevancy of the anchor text, the surrounding text, and the parent page; testing is then done on a real query. A page downloaded from a link is tested and then accepted or rejected (see the classifier sketch after this list).
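For the learning-based approach, a minimal sketch using scikit-learn is given below. The toy training texts, the labels, and the choice of TF-IDF features with logistic regression are assumptions for illustration; in practice the features would come from the anchor text, the surrounding text, and the parent page's relevance.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: anchor text plus surrounding text, labeled
# relevant (1) or irrelevant (0) to the topic "speech processing".
texts = [
    "speech recognition tutorial with hidden Markov models",
    "signal processing for spoken language systems",
    "celebrity gossip and entertainment news",
    "cheap flight deals and hotel bookings",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

classifier = LogisticRegression()
classifier.fit(X, labels)

# Score an unseen link's anchor/context text: the probability of
# class 1 can serve as the link's priority in the URL frontier.
candidate = vectorizer.transform(["deep learning for speech processing"])
print(classifier.predict_proba(candidate)[0][1])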
The quality of the seed URLs should be high; they are generated after several runs of general crawling of the web plus the application of relevancy measures to the URLs.
Conclusion:
In short, the focused crawler saves time and gives the relevant documents related to the topic.
References: