Adversarial IR
The pervasive use of search engines for finding information is strong evidence that Information Retrieval algorithms work. However, the Web is a nasty place, and the prospect of earning more money with minimal investment is compelling. Some Web site administrators therefore take the ‘unethical’ route of sabotaging the IR algorithms of search engines for economic gain. They do so by means of ‘Spamdexing’, elucidated by my course mate Himanshu Punetha in his blog [2]. This has made the relationship between a Web site administrator and a Search Engine administrator “adversarial”. Fittingly, Adversarial Information Retrieval is a budding research area that aims at detecting and preventing such attempts to gain undeserved rankings for Web sites.
Researchers aim to develop automatic web spam classifiers, and to do so they model the corpus of web pages as a Web Graph. Each vertex represents a web page and each edge represents a link between two web pages. ‘Link Farming’ is a common form of spamdexing. Luca Becchetti et al. from ‘Yahoo! Research’ suggest a method to detect link farms based on the simple observation that nodes in a link farm tend to have higher in-degree values (the extra in-links are there to amplify their PageRank) [1]. Using ‘degree’ features alone, they were able to detect 73-74% of the spam in their dataset, with 2-3% false positives. Adding other features, such as the distribution of PageRank over the direct in-neighbors of a page and the TrustRank score of a page, improves the detection rate to 80-81% and reduces false positives to 1-3%.
Image Source: [1]
A criticism of the method described above is that it labels a complete host as ‘spam’ and does not account for hosts that contain a mix of spam and legitimate content. Much research is still left to be done in this field.
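To make the in-degree observation a little more concrete, here is a minimal sketch in Python. It is not the classifier from [1]: it only counts incoming links in a toy Web Graph and flags pages at or above a hand-picked threshold. The function names, the edge list, and the threshold value are all my own assumptions for illustration.

```python
from collections import Counter

def in_degrees(edges):
    """Count incoming links per page from (source, target) pairs."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 0  # pages with no in-links still get an entry
        counts[target] += 1
    return counts

def flag_high_in_degree(edges, threshold):
    """Return pages whose in-degree is at or above the threshold (an assumed cut-off)."""
    degrees = in_degrees(edges)
    return sorted(page for page, degree in degrees.items() if degree >= threshold)

# Toy Web Graph (hypothetical): f1..f3 form a small farm that all link to "target".
edges = [
    ("f1", "target"), ("f2", "target"), ("f3", "target"),
    ("f1", "f2"), ("f2", "f3"), ("f3", "f1"),
    ("home", "about"),
]
suspects = flag_high_in_degree(edges, threshold=3)
print(suspects)  # ['target']
```

A single threshold like this would misfire on popular legitimate pages, which is presumably why degree is treated as one feature among several rather than a rule on its own.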
Temporal IR
Information Theory states that there are five aspects for determining a document’s credibility: relevance, accuracy, objectivity, coverage, and timeliness (i.e., currency).
Why currency?
Consider a scenario where you search for the score of an ongoing cricket match between India and Pakistan by typing “India Pakistan match score”, and the search engine returns the final scores from a match played in 1999. Clearly, you are never using that search engine again.
So, Temporal IR or T-IR aims at combining the document relevance with its temporal (or currency) relevance. Queries that demand temporal search results can be broadly classified into three categories:
- Implicit Temporal Expression: A query expression that requires the focus time to be implicitly inferred from the content of the web pages. For example: “Christmas Day”, “Earth Day”, “Elections”, etc.
- Explicit Temporal Expression: A query expression that explicitly mentions the focus time, such as “December 2009”, “September 11”, etc.
- Relative Temporal Expression: A query expression whose focus time is relative to the current time, such as “tomorrow”, “45 minutes later”, etc.
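The three categories above can be roughly sketched as a rule-based classifier. This is only an illustration under strong assumptions: the keyword lists, the regular expressions, and the `classify_query` name are hypothetical, and a real system would need far better coverage (for instance, the month name "may" is also an ordinary word).

```python
import re

MONTHS = (r"(january|february|march|april|may|june|july|august|"
          r"september|october|november|december)")
# Explicit: a month name (optionally followed by a number) or a 4-digit year.
EXPLICIT = re.compile(rf"\b{MONTHS}\b(\s+\d{{1,4}})?|\b\d{{4}}\b")
# Relative: words that only make sense with respect to "now".
RELATIVE = re.compile(r"\b(today|tomorrow|yesterday|later|ago|next|last)\b")
# Implicit: a tiny, assumed list of named events whose date must be inferred.
IMPLICIT_EVENTS = {"christmas day", "earth day", "elections"}

def classify_query(query):
    """Return 'relative', 'explicit', 'implicit', or None for non-temporal queries."""
    q = query.lower()
    if RELATIVE.search(q):
        return "relative"
    if EXPLICIT.search(q):
        return "explicit"
    if any(event in q for event in IMPLICIT_EVENTS):
        return "implicit"
    return None

for example in ["Christmas Day", "December 2009", "45 minutes later"]:
    print(example, "->", classify_query(example))
```

Relative expressions are checked first so that a query such as "next December" is treated as relative to the current time rather than as a fixed date. That ordering is a design choice of this sketch, not something taken from the literature.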
Many models have been proposed to take temporal attributes into account when ranking search results. Many temporal taggers have also been developed that recognize and normalize time expressions, i.e. convert expressions like “next Saturday 4 pm” to “2018-03-03T16:00”. These taggers can be used to build indexes for retrieving search results.
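As a rough idea of what the normalization step involves, the sketch below handles exactly one pattern, "next <weekday> <hour> am/pm", using only the Python standard library. It is not how any particular temporal tagger works. The `normalize` function, the reference date of Monday 26 February 2018, and the reading of "next Saturday" as the first Saturday strictly after the reference date are all assumptions of mine.

```python
import re
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
PATTERN = re.compile(r"next\s+(\w+)\s+(\d{1,2})\s*(am|pm)", re.IGNORECASE)

def normalize(expression, reference):
    """Convert 'next <weekday> <hour> am/pm' to an ISO-like string, or return None."""
    match = PATTERN.fullmatch(expression.strip())
    if match is None:
        return None
    day_name, hour_text, meridiem = (group.lower() for group in match.groups())
    hour = int(hour_text)
    if day_name not in WEEKDAYS or not 1 <= hour <= 12:
        return None
    # 12-hour to 24-hour clock: 12 am -> 0, 12 pm -> 12, 4 pm -> 16.
    hour = hour % 12 + (12 if meridiem == "pm" else 0)
    # Days until the first matching weekday strictly after the reference date.
    days_ahead = (WEEKDAYS.index(day_name) - reference.weekday() - 1) % 7 + 1
    target = (reference + timedelta(days=days_ahead)).replace(
        hour=hour, minute=0, second=0, microsecond=0)
    return target.strftime("%Y-%m-%dT%H:%M")

reference = datetime(2018, 2, 26, 10, 30)  # assumed "current time" (a Monday)
print(normalize("next Saturday 4 pm", reference))  # 2018-03-03T16:00
```

The same expression maps to a different timestamp for every reference date, which is why a tagger needs to know when the query or document was written before it can normalize anything.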
References:
1) Becchetti, Luca, et al. "Link-based characterization and detection of web spam." AIRWeb. 2006.
2) http://iiitd-ir-melanage.blogspot.in/2018/02/spamdexing-means-to-search-engine.html
3) https://en.wikipedia.org/wiki/Temporal_information_retrieval