Extracting Local Events from Webpages

The Need

In the era of social media advertising, the popularity of an event increasingly depends on how much money the organizers put into promotion, and the cost of that advertising is recouped through high ticket prices. However, many events such as poetry readings, farmers markets, wine tastings, charity dinners, garage/library sales and local band concerts are not meant to draw huge crowds and hence lack the resources for paid promotion on big platforms.
Another trend on big social media platforms is a bias towards the bigger cities: events in small cities are often lost among them, so a person living in a smaller city is more likely to miss updates about such events. A preliminary analysis of the well-known ClueWeb12 dataset found that most events were concentrated in Los Angeles, New York, Chicago and similar large cities.
What makes the task both difficult and essential is that local event organizers typically lack the technical expertise to publish their data in a structured format that standard context-specific information extractors can easily pick up.

The Approach 

Defining an Event

An event is defined as something that happens at a particular, distinct location, has a definite start and stop time, and carries a title and description. In short, it must answer three fundamental questions: What, When and Where.
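
To make this concrete, the definition maps naturally onto a small record type. The sketch below is a hypothetical Python representation; the class and field names are illustrative only, not taken from any particular system:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Event:
        """Minimal event record answering What, When and Where."""
        title: str          # What: short title
        description: str    # What: free-text description
        start: datetime     # When: start time
        end: datetime       # When: stop time
        venue: str          # Where: a distinct location

    # Example: a small local event of the kind discussed above
    reading = Event(
        title="Poetry Reading",
        description="Open-mic poetry night at the public library.",
        start=datetime(2024, 6, 12, 19, 0),
        end=datetime(2024, 6, 12, 21, 0),
        venue="Main Street Public Library",
    )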

Extracting Structured Information

Once a webpage is found, the next task is to scrape it and extract useful information that might point to an event. Let F be the set of fields that discriminate an event (date, time, venue, etc.), let R denote the region enclosed by the fields in F, and let D denote the document itself, as usual.
We generate a scoring function that is a weighted composition of scores over F, D and R.
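
One plausible form, assuming a simple linear combination with tunable weights w_alpha, w_beta and w_gamma, is

    phi(D, R, F) = w_alpha * Alpha(D) + w_beta * Beta(R) + w_gamma * Gamma(F)

where Alpha, Beta and Gamma are the document, region and field scores described below.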


As these are the major constituents of any event scoring scheme, the final score depends on the weights assigned to each.

Alpha - Document Score

Alpha is the ratio of the probability that a document contains an event to the probability that it does not. To compute it, priors can be learnt from bag-of-words models and the final probability obtained with Naive Bayes. Since this score requires a threshold, one can be set appropriately given enough training data.
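
A minimal sketch of this step, assuming scikit-learn's bag-of-words vectorizer and multinomial Naive Bayes; the training data here is a stand-in, as the real system would be trained on labelled webpages:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Stand-in labelled pages: 1 = contains an event, 0 = does not.
    train_texts = [
        "wine tasting saturday 7pm at the cellar downtown",
        "quarterly earnings report and shareholder letter",
    ]
    train_labels = [1, 0]

    vectorizer = CountVectorizer()                      # bag-of-words features
    X_train = vectorizer.fit_transform(train_texts)
    clf = MultinomialNB().fit(X_train, train_labels)    # learns priors and likelihoods

    def alpha_score(document_text):
        """Ratio P(event | D) / P(no event | D), used as the document score."""
        X = vectorizer.transform([document_text])
        p_no_event, p_event = clf.predict_proba(X)[0]
        return p_event / max(p_no_event, 1e-9)          # guard against division by zero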

Beta - Region Score

This score measures how many of the fields present inside R have the potential to form an event. Somewhat surprisingly, its threshold can be set reliably using manual annotations; since this component is neither complicated nor highly variable, it is fixed to a constant value.
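
One simple realization, under the assumption that Beta is just the fraction of expected fields detected inside R, scaled by a constant fixed from the manual annotations:

    # F: the discriminant fields an event region is expected to contain.
    EXPECTED_FIELDS = {"date", "time", "venue", "title"}

    BETA_WEIGHT = 1.0  # constant, tuned once from manually annotated pages

    def beta_score(detected_fields):
        """Fraction of expected fields actually found inside the region R."""
        present = EXPECTED_FIELDS & set(detected_fields)
        return BETA_WEIGHT * len(present) / len(EXPECTED_FIELDS)

    # Example: a region where only a date and a venue were detected
    print(beta_score({"date", "venue"}))  # 0.5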

Gamma - Field Score

Starting from the three fundamental questions {What, When, Where}, we measure how well each field overlaps with, or matches, the context of these questions. Overlaps and matchings are computed using a variety of regexes. Using the prior training data, a multi-class classifier then assigns each document field to one of these information types.
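
As an illustration of the regex-based matching, the sketch below tags a candidate field as a When or Where answer; the patterns are simplified placeholders, not the ones used in the original work:

    import re

    WHEN_PATTERNS = [
        re.compile(r"\b\d{1,2}[:.]\d{2}\s*(am|pm)?\b", re.I),  # times such as 7:30 pm
        re.compile(r"\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*\s+\d{1,2}\b", re.I),
    ]
    WHERE_PATTERNS = [
        re.compile(r"\b\d+\s+\w+\s+(street|st|avenue|ave|road|rd)\b", re.I),  # simple addresses
    ]

    def gamma_score(field_text):
        """Return (question, score) for the best-matching fundamental question."""
        scores = {
            "When": sum(bool(p.search(field_text)) for p in WHEN_PATTERNS),
            "Where": sum(bool(p.search(field_text)) for p in WHERE_PATTERNS),
        }
        question = max(scores, key=scores.get)
        return question, scores[question]

    print(gamma_score("June 12, 7:30 pm"))        # ('When', 2)
    print(gamma_score("42 Main Street, Dayton"))  # ('Where', 1)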

Combining the above

Once the phi score is calculated, we greedily choose events with the highest scores and the lowest overlaps, i.e. we accept a candidate only if its fields do not overlap with events already in our output set. The algorithm may vary in minor details, but it always favours fewer overlaps over many.
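
A minimal sketch of this greedy step, assuming each candidate carries its phi score and the set of page regions (fields) it draws on, and forbidding any overlap with already accepted events:

    def select_events(candidates):
        """Greedily keep the highest-scoring candidates whose fields do not
        overlap with events already accepted into the output set."""
        chosen, used_fields = [], set()
        for cand in sorted(candidates, key=lambda c: c["phi"], reverse=True):
            if used_fields.isdisjoint(cand["fields"]):
                chosen.append(cand)
                used_fields |= cand["fields"]
        return chosen

    # Example: the two top-scoring candidates share a field, so only one survives.
    candidates = [
        {"phi": 0.9, "fields": {"r1", "r2"}},
        {"phi": 0.8, "fields": {"r2", "r3"}},
        {"phi": 0.4, "fields": {"r4"}},
    ]
    print(select_events(candidates))  # keeps the 0.9 and 0.4 candidates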

Geographical Bias

Algorithms of this kind were tested on Schema.org-style datasets and compared against legacy detectors. In the resulting map, blue points mark locations predicted by the legacy algorithms and yellow points mark those found by the style of algorithm described above. The yellow points clearly capture local data better, addressing the heavy bias towards big cities.

