How to Practically Retrieve Information

A guide to building a scraper, or how to retrieve information practically, for educational purposes. This post shows how to build a tool that extracts results from the University of Delhi's website for students of B.Tech. Computer Science (batch of 2017).


Motivation

I found it hard to remember my overall percentage up to the current semester, which was required in many places while filling out forms. Usually I had to calculate it each time from the mark sheet, which I had to download from DU's non-mobile-friendly website. Downloading the mark sheet is itself a cumbersome task, as it requires filling in a lot of redundant fields. So I thought of making a web tool that automatically extracts the required information from DU's website and presents it to me in a convenient form.

University of Delhi Website


NOTE: It's important to understand that extracting such information in large quantities can sometimes be illegal. People often get their IPs blocked by sites for sending too many requests.

Procedure

Prerequisites

  • Basic Knowledge of Python
  • Basic idea of HTML elements
  • A cloud / VPS instance, if you want to publish the tool as a public website
Python is a popular language for automation as well as backend services. It is a developer-friendly language whose core provides a rich standard library, and more advanced tasks can be handled efficiently with third-party modules.
Download Python 2.7.x on your system (versions 2.7.9 and later bundle the pip package manager). Use pip to install third-party modules.
We will need the mechanize and BeautifulSoup libraries to get the job done.
Mechanize acts as a programmatic, headless browser: it handles interactions with a webpage without needing a screen, performing them all internally.
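Both libraries can be installed with pip; the package names below are as published on PyPI:

    pip install mechanize beautifulsoup4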
  1. Importing Libraries
    import re
    from mechanize import Browser
    from bs4 import BeautifulSoup
  2. Initialising Variables
    br = Browser()
    br.set_handle_robots(False)
    br.set_handle_redirect(True)
    br.addheaders = [('User-agent', 'Firefox')]

    """
       url of the page from where you intend to extract information
    """
    url = "http://abc.xyz/123/456"

    # Parameters
    p_type = 'Semester'
    p_exam_flag = 'UG_SEMESTER_4Y'
    p_stream = 'SC'
    p_year = 'IV'
    p_sem = 'VIII'
  3. Open a URL in the virtual browser
    br.open(url)
  4. Filling up Form Information (optional depending on application of scraper)
    Opening the URL loads the form page shown above. Analyse the form's HTML to find the ids and sections that identify each field, so you can fill them in automatically:
    br.select_form('form1')
    br.form['ddlcollege'] = [p_colg]
    # ... set the remaining form fields the same way
    br.submit()

    Do this for every form element to fill your chosen value into the corresponding field; a sketch follows.
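    Here is a minimal sketch of filling the whole form in one go. Only ddlcollege is a field id confirmed above; the other ids are hypothetical placeholders, so inspect the actual page source for the real ones:
    br.select_form('form1')
    # ddlcollege is the only confirmed field id; ddlstream, ddlyear and
    # ddlsem are hypothetical -- replace them with the real ids
    for field, value in [('ddlcollege', p_colg),
                         ('ddlstream', p_stream),
                         ('ddlyear', p_year),
                         ('ddlsem', p_sem)]:
        br.form[field] = [value]  # select controls expect a list of values
    br.submit()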
  5. Extracting Useful Information
    Now it's time to retrieve the HTML response from the page; it's just the plain HTML text of the displayed webpage. Identify the tags and ids of interest via Chrome Developer Tools, or simply by analysing the page source of the respective HTML page. Then extract the inner text of those fields and use that information for computation. The BeautifulSoup library helps in parsing and manipulating this HTML response.
    # Read the complete HTML response from the page
    htmltext = br.response().read()

    # Initialise BeautifulSoup with the HTML text,
    # to see the library in action
    soup = BeautifulSoup(htmltext, "html.parser")

    # Find the results element in the HTML body by its id
    marks_raw = soup.find_all(id="gvrslt")

    # Find all the table rows inside
    # the root element extracted above
    marks_raw_list = marks_raw[0].find_all("tr")

    # Now iterate over all rows of the table
    for r in marks_raw_list:
        # extract the useful text between the data (<td>) cells;
        # rw holds the list of cell values for this row
        rw = re.findall('<td align="center">(.+?)</td>', str(r))
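
    With the rows in hand, you can compute the overall percentage from the motivation above. A minimal sketch, assuming the last two cells of each row hold the maximum marks and the marks obtained; verify this against the actual table layout:
    total_max, total_obtained = 0, 0
    for r in marks_raw_list:
        rw = re.findall('<td align="center">(.+?)</td>', str(r))
        # assumption: rw[-2] is maximum marks, rw[-1] is marks obtained
        if len(rw) >= 2 and rw[-2].isdigit() and rw[-1].isdigit():
            total_max += int(rw[-2])
            total_obtained += int(rw[-1])
    if total_max:
        print("Overall percentage: %.2f%%" % (100.0 * total_obtained / total_max))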

    Live action of the data being downloaded:

  6. Extending the Data Extracted
    Now you can save this information in a database, either just for yourself or for your entire batch. The database can then be used for various analyses of the data; a sketch follows.
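    A minimal sketch of saving rows with Python's built-in sqlite3 module; the table schema and the extracted_rows list are assumptions, so adapt them to the columns you actually scraped:
    import sqlite3

    conn = sqlite3.connect('results.db')
    conn.execute("CREATE TABLE IF NOT EXISTS results "
                 "(roll_no TEXT, name TEXT, percentage TEXT)")
    # extracted_rows is a hypothetical list of (roll_no, name, percentage)
    # tuples built while iterating over the table rows above
    for row in extracted_rows:
        conn.execute("INSERT INTO results VALUES (?, ?, ?)", row)
    conn.commit()
    conn.close()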
    You can also use other libraries to publish this information in a more presentable format than console output.
    You can use a simple web framework and host the tool on a virtual private server to make it publicly available.
    The framework I used is Flask. The following screenshots show how it looks; a minimal sketch of such an app follows them.
    List of colleges

    List of students along with their percentage
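
    A minimal Flask sketch of serving the scraped results; get_results is a hypothetical helper that would query the database populated above:
    from flask import Flask, jsonify

    app = Flask(__name__)

    def get_results(college):
        # hypothetical helper: query the results saved by the scraper
        return []

    @app.route('/college/<name>')
    def college_results(name):
        # return the stored results for one college as JSON
        return jsonify(results=get_results(name))

    if __name__ == '__main__':
        app.run()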
    This idea can be extended further and adapted in various ways, for many kinds of data retrieval. We can also develop a web crawler using this method, by extracting the URLs present in the 'a' (anchor) tags on a site and traversing along them, as sketched below.
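    A minimal sketch of the crawling step, reusing the browser and parser from earlier; start_url is a placeholder for whatever page you begin with:
    br.open(start_url)
    soup = BeautifulSoup(br.response().read(), "html.parser")
    # collect the href of every anchor ('a') tag on the page
    links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
    for link in links:
        print(link)  # a real crawler would queue each URL and repeat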

    To see the project in action and view the code, refer to the following links:
    Demo | Github
