Querying the web: The path to the modern web search engine
Introduction
The World Wide Web today holds enormous amounts of data and webpages. Trying to traverse them without a good search engine like Google seems completely impossible. But what is it that modern search engines do that helps them find webpages and answer our queries so well that we rarely need to go to the second page of results? This blog will deal with the evolution of modern search engines and the algorithms that led to their formation.

History
Before the mid-1990s, the WWW was mostly directory based. This means one either had to manually traverse the links present on a webpage to find the page one wanted, or use basic search engines that performed a keyword-based search on this directory. At the same time, the internet began to expand rapidly: the number of webpages kept growing, and there was a need for a good search engine that could return good results.

Building a Search Engine
Basic Search
A basic search engine could use similarity measures like TF-IDF and return the results with maximum similarity. Unfortunately, this method would not work well. Consider the case where a user searches for the term “Facebook”. As we can clearly see, the top result should be the website www.facebook.com. Now suppose there was a page that had as its content the word “Facebook” repeated a billion times. The Facebook website itself might not contain the word “Facebook” nearly as often, so this fake page would be ranked higher by the search engine. It would be extremely easy to game such a system, so a better way to rank pages was needed.

Many search engines appeared that tried to fix such problems and provide good search results. Unfortunately, most of them had flaws. For example, early versions of AltaVista gave seemingly random results for a query like “University”; these results suggested that AltaVista used the length of the URL as a quality measure.
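The keyword-stuffing problem above can be sketched in a few lines of code. This is a minimal TF-IDF scorer, not any particular engine's implementation; the corpus, page names, and query below are all invented for illustration:

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each document against the query with a plain TF-IDF sum.
    docs maps a page name to its list of terms."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    scores = {}
    for name, terms in docs.items():
        tf = Counter(terms)
        scores[name] = sum(
            (tf[t] / len(terms)) * math.log(n / df[t])
            for t in query_terms if t in tf
        )
    return scores

# Hypothetical corpus: a spam page repeating "facebook" out-scores the
# real site, because nothing but text frequency is considered.
docs = {
    "facebook.com":         ["facebook", "connect", "with", "friends"],
    "keyword-stuffed-page": ["facebook"] * 1000,
    "some-news-site":       ["news", "headlines", "today"],
}
scores = tfidf_scores(["facebook"], docs)
assert scores["keyword-stuffed-page"] > scores["facebook.com"]
```

The stuffed page wins because its term frequency for “facebook” is maximal, exactly the weakness described above.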
With all these measures, search engines were just classifying the text on the web, but the web in itself contains much more than text. The web contains hyperlinks!
Using Hyper Information
Massimo Marchiori [2] tried to use hyperlinks to rank documents on the web. He called this the hyper information. He formulated the information of every page A as:

INFORMATION(A) = TEXTINFO(A) + HYPERINFO(A)

where TEXTINFO(A) is the information provided by the text in the document, and HYPERINFO(A) is the information provided by the hyperlinks.
He defined HYPERINFO(A) as the normalised sum of TEXTINFO present in the documents A points to.
So consider a simple case where A points to a document B. Here, INFORMATION(A) = TEXTINFO(A) + F·INFORMATION(B), where F is a fading factor, 0 < F < 1.

If B points to no document, then INFORMATION(A) = TEXTINFO(A) + F·TEXTINFO(B).

Now, if document A points to document B, which points to C, and so on, then INFORMATION(A) = TEXTINFO(A) + F·TEXTINFO(B) + F²·TEXTINFO(C) + …
The value added to INFORMATION(A) decreases exponentially as one follows links farther away from A.
The basis of this heuristic is that a document that contains relevant information and points to documents with more relevant information will get a good INFORMATION score. However, one can see that a mere collection of hyperlinks may get a very good INFORMATION score even though the document itself contains no text-based information.
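The recursion above can be sketched directly in code. This is a simplified reading of the hyper information idea, not Marchiori's actual algorithm; the link graph, TEXTINFO values, and fading factor are invented, and a visited set guards against cycles, which the chain formula above does not need to consider:

```python
def information(page, textinfo, links, f=0.5, visited=frozenset()):
    """Sketch of INFORMATION(A) = TEXTINFO(A) + F * HYPERINFO(A),
    where HYPERINFO is the normalised sum of INFORMATION over the
    pages A points to, and 0 < F < 1 is the fading factor."""
    if page in visited:  # guard against cycles in the link graph
        return 0.0
    out = links.get(page, [])
    hyper = 0.0
    if out:
        hyper = sum(
            information(p, textinfo, links, f, visited | {page})
            for p in out
        ) / len(out)
    return textinfo.get(page, 0.0) + f * hyper

# Chain A -> B -> C with TEXTINFO = 1 everywhere and F = 0.5:
# INFORMATION(A) = 1 + 0.5 + 0.25 = 1.75, decaying exponentially.
ti = {"A": 1.0, "B": 1.0, "C": 1.0}
ln = {"A": ["B"], "B": ["C"]}
print(information("A", ti, ln))  # 1.75

# The weakness noted above: a page "F" with no text of its own that
# merely links to A, B, and C still earns a positive score.
farm = {"F": ["A", "B", "C"], **ln}
print(information("F", ti, farm))  # > 0, despite TEXTINFO("F") = 0
```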
Page Rank
Shortly after came PageRank, which gave rise to Google. PageRank gave webpages a score based on their value in the network. This is similar to how we see a person's value: a person who is recommended by a bunch of good people must be good. This is the principle of PageRank. Here, webpages were ranked on the basis of backlinks, i.e. the relevance of a webpage was derived from the relevance of the webpages that pointed to it, each contribution normalised by the number of links on the pointing page. This way, if we search for “Facebook”, we are most likely to see Facebook's own website get the highest priority, as it is referred to by other webpages with high relevance. The PageRank score of a webpage u is defined as:
R(u) = c · Σ_{v ∈ B_u} R(v) / N_v

where R(u) is the PageRank of u, c is a normalisation factor with 0 < c < 1, B_u is the set of pages that link to u (its backlinks), and N_v is the number of outgoing links on page v.
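This recurrence can be computed by simple power iteration. The sketch below adds a small uniform term (1 − c)/n, a common way to keep rank from draining out of weakly linked pages; the link graph and the value of c are invented for illustration, and real implementations handle dangling pages more carefully:

```python
def pagerank(links, c=0.85, iterations=50):
    """Iteratively apply R(u) = (1-c)/n + c * sum over backlinks v
    of R(v)/N_v, where N_v is the number of outgoing links of v.
    `links` maps each page to the list of pages it links to."""
    pages = set(links) | {p for out in links.values() for p in out}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # the (1-c)/n term keeps some rank flowing to every page
        new = {p: (1.0 - c) / n for p in pages}
        for v, out in links.items():
            for u in out:
                new[u] += c * rank[v] / len(out)
        rank = new
    return rank

# Hypothetical web: several pages link to facebook.com, so it ends up
# with the highest rank even if its own text never repeats "facebook".
links = {
    "blog":           ["facebook.com", "some-news-site"],
    "some-news-site": ["facebook.com"],
    "forum":          ["facebook.com", "blog"],
    "facebook.com":   ["blog", "some-news-site"],
}
ranks = pagerank(links)
assert max(ranks, key=ranks.get) == "facebook.com"
```

Note how this fixes the TF-IDF weakness from earlier: a keyword-stuffed page with no backlinks would receive almost no rank, however much text it repeats.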