
How I created my first Web Crawler!

Updated: Oct 16, 2022


What is a web crawler?

A web crawler is a bot that crawls the internet to index and download the contents of websites for scraping. Web crawlers are also called web spiders or crawling bots. A web crawler needs to be given a list of initial websites to start from. It indexes those pages and then follows the links it finds in them to discover new pages.
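To make the "start from a seed list and follow links" idea concrete, here is a minimal sketch of that loop. It is not how scrapy works internally. The FAKE_WEB dictionary, the example.com URLs, and the crawl function are all made up for illustration so that the code runs without touching the network.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)  # pages waiting to be visited
    seen = set(seeds)        # pages already queued, so none is visited twice
    visited = []             # pages in the order they were "indexed"
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


visited = crawl(["https://example.com/"], lambda url: FAKE_WEB.get(url, []))
print(visited)
```

The seen set is what stops the crawler from looping forever when two pages link to each other, and max_pages is a simple safety limit.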


The Library Analogy

To give an analogy, consider all the websites on the internet as books in a library. A web crawler is the librarian whose job is to enter each book’s information in a catalog so that the books are easy to find when required. To organize the books, the librarian stores the title, description, and category of each book in the catalog. A web crawler does the same thing for web pages. Its goal would be accomplished only when it has indexed every page on the internet, which is impossible to achieve!
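As a rough illustration of the catalog idea, the sketch below builds one "catalog entry" for a page using only Python's standard library. CatalogEntry, catalog_page, and the sample HTML are hypothetical names and data made up for this example. The category field from the analogy is left out, since pages have no standard place to declare one.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class CatalogEntry:
    url: str
    title: str
    description: str


class _MetaParser(HTMLParser):
    """Collects the <title> text and the description <meta> tag of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def catalog_page(url, html):
    """Return the catalog entry for one already-downloaded page."""
    parser = _MetaParser()
    parser.feed(html)
    return CatalogEntry(url=url, title=parser.title.strip(), description=parser.description)


sample_html = (
    "<html><head><title> My Blog </title>"
    '<meta name="description" content="Posts about code">'
    "</head><body></body></html>"
)
entry = catalog_page("https://example.com/", sample_html)
print(entry)
```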


Creating a Web Crawler

In this blog, I will be coding in Python. There are a couple of web crawling and web scraping frameworks available for Python. I will be using Scrapy.

Installing scrapy:

$ pip install scrapy


1. Create a python application using scrapy

To create a Scrapy project, run the following command. Here the name of my application is my_first_web_crawler.

$ scrapy startproject my_first_web_crawler

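This will generate Scrapy’s boilerplate code and folder structure: a scrapy.cfg file at the top level and a my_first_web_crawler package containing items.py, middlewares.py, pipelines.py, settings.py, and a spiders folder.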


2. Creating a Web Crawler

The folder named spiders contains the files which Scrapy uses to crawl the websites. I will create a file named spider1.py in this directory and write the spider code in it.

You can find the code here: https://github.com/gouravdhar/my-first-web-crawler/blob/main/test_spider.py

In the code, I have provided the URLs of the web pages that I will be crawling. These pages contain links to my blogs. Since the URLs are given as a list, you can provide any number of them. The URLs I will be crawling are:

https://gourav-dhar.com 
https://gourav-dhar.com/profile

The code crawls the web pages at the URLs provided and downloads them.

To execute the code, run the following command:

$ scrapy crawl <your-spider-name>

My spider name is blogs (defined on line 7 of the linked code), so I run scrapy crawl blogs.

And tada! The pages behind those URLs have been downloaded into the project folder.

But that’s not enough. I actually want to download the pages that the main page links to. For this, I have to scrape all the links present on the main page and crawl through them. I will be using the Scrapy shell to write the code that scrapes this information from the website.

Note: The Scrapy shell is an interactive shell where you can try out and debug scraping code very quickly.

To start the Scrapy shell, just run:

$ scrapy shell 'https://gourav-dhar.com'

That is, scrapy shell followed by the URL.

Once the shell has opened, type response to confirm that the request returned a 200 status.