Web Scraping A-Z Indian University Data into MS Excel using Python Selenium library

Bhaveshkumar Rathod
4 min readNov 21, 2020

What is Web Scraping?

Web scraping is the process of extracting data from a website. The collected data is then converted into a format that is more useful to the user, such as a spreadsheet or an API.

Approach:

  1. Request a response from the webpage.
  2. Parse and extract with the help of the Selenium module.
  3. Download and export the data with pandas into Excel.

Selenium Overview

Selenium is a powerful browser automation tool. It supports browsers such as Firefox, Chrome, Internet Explorer, Edge and Safari. WebDriver is the heart of Selenium in Python. Typical use cases for Selenium in web scraping include automating logins, submitting form elements, and much more.

Installation

Assuming that Python is already installed on the system, we can install the library below using pip or conda.

pip install selenium

OR

conda install selenium

We will be using the Google Chrome driver (ChromeDriver). We can download it from this site, choosing the version that matches the installed Chrome browser: https://chromedriver.chromium.org/downloads

The Data Source:

We need a webpage to fetch the Indian university data. So we will be using the “uniRank” https://www.4icu.org/reviews/index2.htm website here.

The webpage will look something like this:

After scrolling down the page and clicking on a university name (which is itself a link), we are redirected to a new page that contains that university's information.

We will scrape the university name, country rank and world rank from the web page.

Implementation

1. Import packages

We need the selenium webdriver, time and pandas Python packages.

from selenium import webdriver
import time
import pandas as pd

2. Declare Variables

We define a few variables up front to make later use easier. In chromedriver_location, add the path of the downloaded ChromeDriver.

indias_clg_list = []
links_list = []
country_rank_list = []
world_rank_list = []
chromedriver_location = r'C:\chromedriver'  # raw string so backslashes are kept literally
search_query = 'https://www.4icu.org/reviews/index2.htm'
driver = webdriver.Chrome(chromedriver_location)

Here, the "indias_clg_list" list stores the names of the universities, "links_list" stores the link to each university's page, and "country_rank_list" and "world_rank_list" store each university's country rank and world rank respectively. The "search_query" indicates which website you want to scrape.

After running the above cell, one Chrome browser instance will be created.

3. Hit the required URL to get the necessary information

We need to find the specific web element tags that hold the correct information. You can do this by right-clicking on the page and choosing Inspect. Clicking the arrow in the top-left corner of the inspector, or pressing Ctrl+Shift+C, lets us inspect a particular element and read off the necessary HTML tag. A well-built HTML site contains a unique identifier for almost all the tags associated with the information, and we will leverage this property to scrape the page.

for i in range(2, 34):
    driver.get("https://www.4icu.org/reviews/index" + str(i) + ".htm")
    time.sleep(1.5)
    table = driver.find_elements_by_xpath("//table[@class='table table-hover text-left']/tbody/tr")

    for j in table:
        country = j.find_elements_by_tag_name("td")

        if country[1].text == 'in':
            indias_clg_list.append(country[0].text)
            time.sleep(1)
            link = country[0].find_element_by_tag_name("a").get_attribute("href")
            links_list.append(link)

In the above code, a single loop is enough to scrape all the universities from A to Z, because each starting letter has its own page. In the URL, "/index2.htm" is the page for universities starting with the letter "A".
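The list of page addresses that the loop visits can be previewed without a browser. This standalone sketch builds the same URLs that `driver.get` receives:

```python
# Build the list of alphabetical index pages visited by the loop above.
# Pages index2.htm through index33.htm cover the A-Z listing.
page_urls = [f"https://www.4icu.org/reviews/index{i}.htm" for i in range(2, 34)]

print(len(page_urls))   # 32 pages in total
print(page_urls[0])     # https://www.4icu.org/reviews/index2.htm
```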

table = driver.find_elements_by_xpath("//table[@class='table table-hover text-left']/tbody/tr")

Here, we fetch all the universities listed on the page and store references to their table rows in the "table" variable.

for j in table:
    country = j.find_elements_by_tag_name("td")

    if country[1].text == 'in':
        indias_clg_list.append(country[0].text)
        time.sleep(1)
        link = country[0].find_element_by_tag_name("a").get_attribute("href")
        links_list.append(link)

There are two columns on the page that contain the university name (a link) and the country name. We store the name of each Indian university in "indias_clg_list" and its link in "links_list".

links_list[:5]

Output: These are the links of the first five universities.

['https://www.4icu.org/reviews/17817.htm',
'https://www.4icu.org/reviews/1978.htm',
'https://www.4icu.org/reviews/2077.htm',
'https://www.4icu.org/reviews/17885.htm',
'https://www.4icu.org/reviews/17856.htm']
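The filtering step above keeps only rows whose country cell reads "in". Its logic can be checked on plain data, independent of Selenium; the sample rows below are made up for illustration only:

```python
# Each tuple mimics the (name cell text, country cell text) pair
# read from the <td> tags of one table row.
sample_rows = [
    ("Acharya Nagarjuna University", "in"),
    ("Aalto University", "fi"),
    ("Aligarh Muslim University", "in"),
]

# Keep only the Indian universities, as the scraping loop does.
indias_clg_list = [name for name, country in sample_rows if country == "in"]
print(indias_clg_list)
```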

4. Get university info from the link_list:

We aim to fetch each university's "world rank" and "country rank". We iterate over the links_list elements, visit each page, and extract the required information using the find_elements_by_xpath method of the Selenium WebDriver. Once the iteration is over, we quit the driver to close the browser.

for link in links_list:
    driver.get(link)
    time.sleep(1)

    rows = driver.find_elements_by_xpath("//table[@class='text-right']/tbody/tr")

    columns = rows[0].find_elements_by_tag_name("td")
    country_rank_list.append(columns[1].text)
    columns = rows[1].find_elements_by_tag_name("td")
    world_rank_list.append(columns[1].text)

driver.quit()

5. Save the data in a CSV file

We will add proper column names to the data frame and use its to_csv method to save it as a CSV file.

data = pd.DataFrame({'University_Name': indias_clg_list,
                     'Country_Rank': country_rank_list,
                     'World_Rank': world_rank_list})
data.to_csv('Indian_University_Data.csv', index=False)
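With small placeholder lists, the same DataFrame construction and export can be verified end to end; the sample values here are illustrative only. Calling to_csv with no path makes it return the CSV text instead of writing a file, which is handy for checking the output:

```python
import pandas as pd

# Placeholder data standing in for the scraped lists.
indias_clg_list = ["Aligarh Muslim University", "Anna University"]
country_rank_list = ["25", "9"]
world_rank_list = ["1342", "818"]

data = pd.DataFrame({
    "University_Name": indias_clg_list,
    "Country_Rank": country_rank_list,
    "World_Rank": world_rank_list,
})

# With no path argument, to_csv returns the CSV content as a string.
csv_text = data.to_csv(index=False)
print(csv_text)
```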

Output

The data will be saved as "Indian_University_Data.csv", and the resulting dataset contains 925 records with 3 columns.

To download the generated dataset click on the following link and get it: https://rb.gy/jlclwh

You can get the source code from the following Github link:

