How to Practically Retrieve Information

A guide to building a scraper, or how to retrieve information practically, for educational purposes. This post shows how to build a tool that extracts results from the University of Delhi's website for students of B.Tech. Computer Science (batch of 2017).


Motivation

I found it hard to remember my overall percentage up to the current semester, which was required in many places while filling out forms. Usually I had to calculate it each time from the mark sheet, which I had to download from DU's non-mobile-friendly website. Downloading the mark sheet is itself a cumbersome task, as it requires filling in a lot of redundant fields. So I thought of making a web tool that automatically extracts the required information from DU's website and presents it to me in a convenient form.

University of Delhi Website


NOTE: It's important to understand that extracting such information in large quantities can sometimes be illegal. People often get their IPs blocked by sites for sending too many requests.

Procedure

Prerequisites

  • Basic Knowledge of Python
  • Basic idea of HTML elements
  • A cloud / VPS instance, if you want to publish the tool as a public website
Python is a popular language for automation as well as backend services. It is a developer-friendly language whose core provides a rich standard library, and more advanced tasks can be handled efficiently with third-party modules.
Download Python 2.7.x on your system (versions 2.7.9 and later bundle the pip package manager). Use pip to install third-party modules.
We will need the mechanize and BeautifulSoup libraries to get the job done.
Mechanize acts as a programmatic, headless browser: it handles interactions with a webpage without needing a screen, performing them all internally.
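Both libraries can be installed with pip; the package names below are as published on PyPI:

    pip install mechanize beautifulsoup4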
  1. Importing Libraries
    import re
    from mechanize import Browser
    from bs4 import BeautifulSoup
  2. Initialising Variables
    br = Browser()
    br.set_handle_robots(False)
    br.set_handle_redirect(True)
    br.addheaders = [('User-agent', 'Firefox')]

    """
       url of the page from where you intend to extract information
    """
    url = "http://abc.xyz/123/456"

    # Parameters
    p_type = 'Semester'
    p_exam_flag = 'UG_SEMESTER_4Y'
    p_stream = 'SC'
    p_year = 'IV'
    p_sem = 'VIII'
  3. Open a URL in the virtual browser
    br.open(url)
  4. Filling up Form Information (optional depending on application of scraper)
    Opening the URL loads the form page shown above. Analyse the form's HTML to find the ids and sections that identify each field, so you can fill them in automatically:
    br.select_form('form1')
    br.form['ddlcollege'] = [p_colg]
    # ... set the remaining form fields the same way
    br.submit()

    Do this for every form element to fill your chosen value into the corresponding field; a sketch follows.
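    Here is a minimal sketch of filling the whole form in one go. Only ddlcollege is a field id confirmed above; the other ids are hypothetical placeholders, so inspect the actual page source for the real ones:
    br.select_form('form1')
    # ddlcollege is the only confirmed field id; ddlstream, ddlyear and
    # ddlsem are hypothetical -- replace them with the real ids
    for field, value in [('ddlcollege', p_colg),
                         ('ddlstream', p_stream),
                         ('ddlyear', p_year),
                         ('ddlsem', p_sem)]:
        br.form[field] = [value]  # select controls expect a list of values
    br.submit()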
  5. Extracting Useful Information
    Now it's time to retrieve the HTML response from the page; it's just the plain HTML text of the displayed webpage. Identify the tags and ids of interest via Chrome Developer Tools, or simply by analysing the page source of the respective HTML page. Then extract the inner text of those fields and use that information for computation. The BeautifulSoup library helps in parsing and manipulating this HTML response.
    # Read the complete HTML response from the page
    htmltext = br.response().read()

    # Initialise BeautifulSoup with the HTML text,
    # to see the library in action
    soup = BeautifulSoup(htmltext, "html.parser")

    # Find the results element in the HTML body by its id
    marks_raw = soup.find_all(id="gvrslt")

    # Find all the table rows inside
    # the root element extracted above
    marks_raw_list = marks_raw[0].find_all("tr")

    # Now iterate over all rows of the table
    for r in marks_raw_list:
        # extract the useful text between the data (<td>) cells;
        # rw holds the list of cell values for this row
        rw = re.findall('<td align="center">(.+?)</td>', str(r))
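
    With the rows in hand, you can compute the overall percentage from the motivation above. A minimal sketch, assuming the last two cells of each row hold the maximum marks and the marks obtained; verify this against the actual table layout:
    total_max, total_obtained = 0, 0
    for r in marks_raw_list:
        rw = re.findall('<td align="center">(.+?)</td>', str(r))
        # assumption: rw[-2] is maximum marks, rw[-1] is marks obtained
        if len(rw) >= 2 and rw[-2].isdigit() and rw[-1].isdigit():
            total_max += int(rw[-2])
            total_obtained += int(rw[-1])
    if total_max:
        print("Overall percentage: %.2f%%" % (100.0 * total_obtained / total_max))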

    Live action of the data being downloaded:

  6. Extending the Data Extracted
    Now you can save this information in a database, either just for yourself or for your entire batch. The database can then be used for various analyses of the data; a sketch follows.
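    A minimal sketch of saving rows with Python's built-in sqlite3 module; the table schema and the extracted_rows list are assumptions, so adapt them to the columns you actually scraped:
    import sqlite3

    conn = sqlite3.connect('results.db')
    conn.execute("CREATE TABLE IF NOT EXISTS results "
                 "(roll_no TEXT, name TEXT, percentage TEXT)")
    # extracted_rows is a hypothetical list of (roll_no, name, percentage)
    # tuples built while iterating over the table rows above
    for row in extracted_rows:
        conn.execute("INSERT INTO results VALUES (?, ?, ?)", row)
    conn.commit()
    conn.close()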
    You can also use other libraries to publish this information in a more presentable format than console output.
    You can use a simple web framework and host the tool on a virtual private server to make it publicly available.
    The framework I used is Flask. The following screenshots show how it looks; a minimal sketch of such an app follows them.
    List of colleges

    List of students along with their percentage
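
    A minimal Flask sketch of serving the scraped results; get_results is a hypothetical helper that would query the database populated above:
    from flask import Flask, jsonify

    app = Flask(__name__)

    def get_results(college):
        # hypothetical helper: query the results saved by the scraper
        return []

    @app.route('/college/<name>')
    def college_results(name):
        # return the stored results for one college as JSON
        return jsonify(results=get_results(name))

    if __name__ == '__main__':
        app.run()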
    This idea can be extended further and adapted in various ways, for many kinds of data retrieval. We can also develop a web crawler using this method, by extracting the URLs present in the 'a' (anchor) tags on a site and traversing along them, as sketched below.
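    A minimal sketch of the crawling step, reusing the browser and parser from earlier; start_url is a placeholder for whatever page you begin with:
    br.open(start_url)
    soup = BeautifulSoup(br.response().read(), "html.parser")
    # collect the href of every anchor ('a') tag on the page
    links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
    for link in links:
        print(link)  # a real crawler would queue each URL and repeat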

    To see the project in action and view the code, refer to the following links:
    Demo | Github
