A guide to building a scraper, or how to retrieve information programmatically for educational purposes. This post shows how to build a tool that extracts results from the University of Delhi's website for students of B. Tech. Computer Science (batch of 2017).
NOTE: It's important to understand that extracting such information in large quantities can sometimes be illegal. People often get their IPs blocked by sites for hitting them too many times.
Motivation
I found it hard to remember my overall percentage up to the current semester, as it was required in many places while filling out forms. Usually I had to calculate it every time from the mark sheet, which I had to download from DU's non-mobile-friendly website. Downloading the mark sheet is itself a cumbersome task, as it requires filling in a lot of redundant fields. So I thought of making a web tool to extract the required information from DU's website automatically and present it to me in a convenient form.
[Screenshot: University of Delhi website]
Procedure
Pre-requisites
- Basic knowledge of Python
- Basic idea of HTML elements
- A cloud server / VPS, if you want to host a public website for it
Install Python 2.7.x on your system. A tool named pip comes bundled with it; use pip to install third-party modules.
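With pip on your PATH, the two libraries used below can be installed like this (package names as published on PyPI):
pip install mechanize
pip install beautifulsoup4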
We will need the mechanize and BeautifulSoup libraries to get the job done.
Mechanize handles interactions with a webpage without actually needing a screen; it performs all those interactions internally, like a headless browser.
- Importing Libraries
import re
from mechanize import Browser
from bs4 import BeautifulSoup
- Initialising Variables
br = Browser()
br.set_handle_robots(False)
br.set_handle_redirect(True)
br.addheaders = [('User-agent', 'Firefox')]
"""
url of the page from where you intend to extract information
"""
url = "http://abc.xyz/123/456"
# Parameters
# p_colg is referenced when filling the form below; its value is the
# college code expected by the form's dropdown (placeholder here)
p_colg = 'COLLEGE_CODE'
p_type = 'Semester'
p_exam_flag = 'UG_SEMESTER_4Y'
p_stream = 'SC'
p_year = 'IV'
p_sem = 'VIII'
- Open a URL in the virtual browser
br.open(url)
- Filling up Form Information (optional, depending on the scraper's application)
The URL opened above renders the results form. Analyse the form's HTML to find the ids and sections identifying each field, so the fields can be filled automatically:
br.select_form('form1')
br.form['ddlcollege'] = [p_colg, ]
# ... (assignments for the remaining form fields)
br.submit()
Do this for every form element to fill your chosen value in the corresponding field (a hedged sketch of the full sequence follows).
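For illustration, a minimal sketch of what the full sequence might look like, assuming the remaining controls follow the same 'ddl…' naming pattern as ddlcollege. Only ddlcollege appears in the original, so read the real control names off the form's HTML:
br.select_form('form1')
br.form['ddlcollege'] = [p_colg, ]
# The control names below are assumptions for illustration;
# verify each against the form's HTML
br.form['ddlexamtype'] = [p_type, ]
br.form['ddlflag'] = [p_exam_flag, ]
br.form['ddlstream'] = [p_stream, ]
br.form['ddlyear'] = [p_year, ]
br.form['ddlsem'] = [p_sem, ]
br.submit()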
- Extracting Useful Information
Now it's time to retrieve the HTML response from the page; it is just the plain HTML text of the displayed webpage. Identify the tags and ids of interest in the response via Chrome Developer Tools, or simply by analysing the page source. Then extract the inner text of those fields and use it for computation. The BeautifulSoup library helps in manipulating this HTML response.
# Receive the complete HTML response from the page
htmltext = br.response().read()
# Initialise BeautifulSoup with the HTML text
soup = BeautifulSoup(htmltext, "html.parser")
# Find the results element in the HTML body by its id
marks_raw = soup.find_all(id="gvrslt")
# Find all the table rows inside the root element extracted above
marks_raw_list = marks_raw[0].find_all("tr")
# Now iterate over all rows of the table
for r in marks_raw_list:
    # extract the useful information between the <td> data tags
    rw = re.findall('<td align="center">(.+?)</td>', str(r))
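Since the motivation was the overall percentage, here is a minimal sketch of that computation, extending the loop above. Which columns hold the marks obtained and the maximum marks is an assumption; verify the indices against the actual table:
total_obtained = 0
total_maximum = 0
for r in marks_raw_list:
    rw = re.findall('<td align="center">(.+?)</td>', str(r))
    # Assumed layout: rw[0] = marks obtained, rw[1] = maximum marks;
    # non-numeric rows (headers, separators) are skipped
    if len(rw) >= 2 and rw[0].isdigit() and rw[1].isdigit():
        total_obtained += int(rw[0])
        total_maximum += int(rw[1])
if total_maximum:
    print("Overall percentage: %.2f%%" % (100.0 * total_obtained / total_maximum))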
[Animation: live action of the data getting downloaded]
- Extending the Data Extracted
Now you can save this information in a database, either just for yourself or for your entire batch. That database can then be used for various analyses of the data (a sketch is below).
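A minimal sketch of persisting the rows with Python's built-in sqlite3 module; the table name, columns, and the all_rows list are hypothetical placeholders for whatever your scraper collects:
import sqlite3

conn = sqlite3.connect('results.db')
conn.execute('CREATE TABLE IF NOT EXISTS results '
             '(roll_no TEXT, name TEXT, percentage REAL)')
# all_rows: hypothetical list of (roll_no, name, percentage) rows
# collected by the scraper above
for rw in all_rows:
    conn.execute('INSERT INTO results VALUES (?, ?, ?)', tuple(rw[:3]))
conn.commit()
conn.close()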
You can also use other libraries to publish this information in a more presentable format than a raw console printout.
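As a quick standard-library-only example, the same hypothetical all_rows list can be printed as an aligned console table:
# Print a header, then one aligned line per extracted row
print("%-12s %-35s %10s" % ("Roll No", "Name", "Percentage"))
for rw in all_rows:
    print("%-12s %-35s %10s" % tuple(rw[:3]))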
You can use a simple web framework and host it on a virtual private server to let the world see your work.
The framework I used is called Flask; the screenshots below show how it looks.
[Screenshots: list of colleges | list of students along with their percentage]
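A minimal sketch of a Flask view serving the stored results; the /results route and the results.db file are assumptions carried over from the database sketch above:
from flask import Flask
import sqlite3

app = Flask(__name__)

@app.route('/results')
def results():
    # Read the rows saved earlier and render them as a plain table
    conn = sqlite3.connect('results.db')
    rows = conn.execute('SELECT roll_no, name, percentage FROM results').fetchall()
    conn.close()
    lines = ['%-12s %-35s %8.2f%%' % row for row in rows]
    return '<pre>' + '\n'.join(lines) + '</pre>'

if __name__ == '__main__':
    app.run()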
To see the project in action and view the code, refer to the following links:
Demo | Github