Introduction:
We use web search engines every day to retrieve documents or information related to our queries. But do they always give us the relevant ones? Sometimes we get irrelevant documents too. The main actor behind all this retrieval of information is the web crawler.
A crawler is a program which searches through the World Wide Web in an automated and systematic manner and gives the user the documents appropriate to his query.
Basically there are two types of crawlers:
1. Universal crawler (general crawler)
2. Focused or topical crawler
Universal crawler:
This is the crawler used by a normal search engine. It indexes all the terms of the documents it fetches and follows each and every hyperlink, so it sometimes retrieves documents not matching the query, i.e., irrelevant ones too.
Focused crawler:
This is the crawler used where the user wants information on a pre-defined topic or domain, for example news websites, or where someone wants to keep track of a technology related to his domain, say speech processing. Here the user wants results relevant to the topic entered as the query, so a universal crawler does not help.
In this blog we will concentrate on the focused crawler and its working.
Motivation and initial background:
A general web crawler visits all web pages (relevant or irrelevant) and creates copies of them for fast retrieval at a later time, but at the cost of high computation power and long retrieval time. In contrast, a focused crawler tries to download only the relevant set of documents. Fig. 1 [1] illustrates this.
Fig. 1
Basis of its working:
It works on the basis of:
1. Hyperlink relevancy
2. Document content relevancy
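Document content relevancy can be estimated in many ways. Below is a minimal sketch in Python, assuming a simple bag-of-words cosine similarity between the topic terms and the page text; the tokenizer and the example topic are illustrative, not a specific crawler's method.

import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into alphanumeric tokens (illustrative tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_relevance(topic, page_text):
    # Bag-of-words cosine similarity between the topic and the page text.
    t, p = Counter(tokenize(topic)), Counter(tokenize(page_text))
    dot = sum(t[w] * p[w] for w in t)
    norm = (math.sqrt(sum(v * v for v in t.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0

# Example: score a page snippet against the topic "speech processing".
print(cosine_relevance("speech processing",
                       "A tutorial on speech signal processing basics"))

A score near 1 means the page vocabulary closely matches the topic; a score near 0 means little overlap.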
General Working:
1. On receiving a query, the focused crawler takes the seed URLs from the URL frontier (a priority queue) and starts downloading the documents.
2. The parser and extractor extract the text of each document along with its hyperlinks, and the relevance of the text and of each hyperlink is calculated based on the topic (query).
3. The relevant URLs are added to the URL frontier in priority order and the irrelevant ones are discarded. The documents found relevant so far are added to the database.
4. Steps 1 to 3 are repeated (a code sketch of this loop follows).
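A minimal sketch of this loop in Python is given below, assuming a heap-based URL frontier. The seed URLs, the relevance threshold, and the crude keyword scorer and link extractor are all assumptions for illustration; a real crawler would also respect robots.txt, handle errors, and limit crawl depth.

import heapq
import re
import urllib.request

def fetch(url):
    # Download a page; return its HTML, or "" on failure.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="ignore")
    except Exception:
        return ""

def extract_links(html):
    # Crude href extractor for illustration; a real HTML parser is better.
    return re.findall(r'href=["\'](https?://[^"\']+)["\']', html)

def relevance(text, topic):
    # Placeholder scorer: normalized topic-word frequency
    # (the cosine-similarity sketch above could be used instead).
    words = set(topic.lower().split())
    return sum(text.lower().count(w) for w in words) / (len(text) + 1)

def focused_crawl(seed_urls, topic, threshold=1e-4, max_pages=50):
    frontier = [(-1.0, url) for url in seed_urls]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen, database = set(seed_urls), []
    while frontier and len(database) < max_pages:
        _, url = heapq.heappop(frontier)           # most promising URL first
        html = fetch(url)
        score = relevance(html, topic)
        if score < threshold:
            continue                               # discard irrelevant pages
        database.append(url)                       # keep the relevant document
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))  # inherit parent's score
    return database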
The main point is how the relevancy of links is calculated. There are different approaches for it:
1. Structure-based focused crawler:
The content of the downloaded page is checked, and the relevancy of a link is calculated by determining the score of the parent page containing that link.
2. Priority-based focused crawler:
The scheme discussed above under General Working is actually the priority-based one: the URLs which resemble the topic are stored in the URL frontier.
3. Learning-based focused crawler:
This approach trains a classifier on a set of training data covering the relevancy of the anchor text, the surrounding text, and the parent page; testing is then done on a real query. A page downloaded from a link is tested and then accepted or rejected (see the classifier sketch after this list).
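For the learning-based approach, a minimal sketch using scikit-learn is given below. The toy training texts, the labels, and the choice of TF-IDF features with logistic regression are assumptions for illustration; in practice the features would come from the anchor text, the surrounding text, and the parent page's relevance.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: anchor text plus surrounding text, labeled
# relevant (1) or irrelevant (0) to the topic "speech processing".
texts = [
    "speech recognition tutorial with hidden Markov models",
    "signal processing for spoken language systems",
    "celebrity gossip and entertainment news",
    "cheap flight deals and hotel bookings",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

classifier = LogisticRegression()
classifier.fit(X, labels)

# Score an unseen link's anchor/context text: the probability of
# class 1 can serve as the link's priority in the URL frontier.
candidate = vectorizer.transform(["deep learning for speech processing"])
print(classifier.predict_proba(candidate)[0][1])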
The quality of the seed URLs should be high; they are generated after several runs of general crawling of the web plus the application of relevancy measures to the URLs.
Conclusion:
In short, the focused crawler saves time and gives the relevant documents related to the topic.
References: