
Welcome to the web_scraping wiki!

The web_scraping project uses Python's famous Scrapy framework to crawl and scrape web pages.
* Scrape - extract the data
* Crawl - navigate between various data sources

Prerequisites:

  • Python 3 installed
  • Scrapy installed
  • A little knowledge of XPath selectors (the handful of patterns used here are summarised below)
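If XPath is new to you, these are essentially the only patterns this walkthrough relies on; the class and id names here are placeholders, not from any real page:

# XPath patterns used throughout this walkthrough (names are placeholders)
# //*[@class='foo']           -> every element anywhere with class attribute 'foo'
# .//*[@class="bar"]/@href    -> the href attribute of matching descendants
#                                (the leading dot makes the search relative)
# //*[@id='baz']/text()       -> the text content of the element with id 'baz'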

Why Scrapy?

Scrapy is a Python framework built on top of Twisted, an asynchronous networking framework, which gives it its speed and asynchronous execution. That's the reason why Scrapy is blazing fast 🥇
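Because requests run concurrently on Twisted's event loop rather than one after another, you can tune how aggressive a crawl is in the project's settings.py. A minimal sketch using standard Scrapy setting names (the values are illustrative):

# settings.py -- standard Scrapy settings; values here are illustrative
CONCURRENT_REQUESTS = 16   # how many requests Scrapy keeps in flight at once (the default)
DOWNLOAD_DELAY = 0.5       # polite pause (seconds) between requests to the same site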

Play time with scrapy

Scrapy provides some tools that will help you experiment with common scraping tasks.

  • Scrapy View

scrapy view https://github.com/JyothishArumugam/web_scraping/wiki/Home/_edit

This command launches your browser with the URL you give it, showing the page exactly as Scrapy sees it.

  • Scrapy Shell
    It's Scrapy's way of interactive programming, much like an IPython notebook. Experiment and have fun.

scrapy shell https://github.com/JyothishArumugam/web_scraping/wiki/Home/_edit
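Once the shell opens, it hands you a response object for the fetched page, and you can try selectors on it interactively. A short session sketch (the title query is just an illustration):

>>> response.status                                    # HTTP status of the fetched page
>>> response.xpath('//title/text()').extract_first()   # try a selector interactively
>>> fetch('https://www.bikewale.com/honda-bikes/')     # load a different page into the same shell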

Starting a Scrapy Project

Starting a Scrapy project is as simple as a single command:


scrapy startproject project_name
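The command generates a project skeleton roughly like this (the exact set of files can vary slightly between Scrapy versions):

project_name/
    scrapy.cfg            # deploy configuration
    project_name/
        __init__.py
        items.py          # item definitions (see Bikes2Item below)
        pipelines.py      # post-processing of scraped items
        settings.py       # project settings
        spiders/          # all your spiders live here
            __init__.py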

Breakdown of our sample project - bikes

Bikes is a simple project that collects data for studying the Indian motorcycle market. In this project we will be collecting the bike specs per manufacturer.
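Each scraped bike becomes one item. The spider in the next section imports a Bikes2Item from the project's items.py; a minimal sketch of that file, assuming only the two fields the spider actually fills:

# bikes2/items.py -- a minimal sketch with just the fields the spider uses
import scrapy

class Bikes2Item(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()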

Scrapy Spiders

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).
All the Scrapy spiders are organised in the spiders folder. Create a spider from the default template:


scrapy genspider hondaspy bikewale.com/honda-bikes
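genspider drops a bare template into the spiders folder, roughly like this (the exact boilerplate depends on your Scrapy version); the parse method is then filled in by hand:

import scrapy

class HondaspySpider(scrapy.Spider):
    name = 'hondaspy'
    allowed_domains = ['bikewale.com/honda-bikes']
    start_urls = ['http://bikewale.com/honda-bikes/']

    def parse(self, response):
        pass   # filled in below

Note that in the finished spider, allowed_domains is trimmed to just the bare domain.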

Breakdown of the "hondaspy.py" spider

# -*- coding: utf-8 -*-
import scrapy
from bikes2.items import Bikes2Item


class HondaspySpider(scrapy.Spider):
    name = 'hondaspy'
    allowed_domains = ['bikewale.com']
    start_urls = ['https://www.bikewale.com/honda-bikes/']

    def parse(self, response):
        # Each bike listing on the index page sits in a 'bikeDescWrapper' block
        all_bikes = response.xpath("//*[@class='bikeDescWrapper']")
        for bike in all_bikes:
            # Pull the relative link to the bike's detail page
            next_url = bike.xpath('.//*[@class="modelurl"]/@href').extract_first()
            # Turn the relative href into an absolute URL
            absolute_url = response.urljoin(next_url)
            # Request the detail page; parse2 handles its response
            yield scrapy.Request(url=absolute_url, callback=self.parse2)

    def parse2(self, sec_response):
        # Extract the price and name from the bike's detail page
        item = Bikes2Item()
        item['price'] = sec_response.xpath("//*[@id='new-bike-price']/text()").extract_first()
        item['name'] = sec_response.xpath('//*[@class="breadcrumb-link__label"]/text()').extract()[-1]
        yield item
  • Class HondaspySpider
    This is the class generated by Scrapy's default spider template, with the following attributes:

      • name >> the name of the spider
      • allowed_domains >> the domains the spider is allowed to crawl; restricting this keeps the crawl from wandering off into ad pages
      • start_urls >> the home page of our crawl
  • parse method
    This is the method that parses the response once the start_urls are hit.
    The XPath selector picks out the "bikeDescWrapper" class and loops through all the bikes listed on the page.
    The "yield" issues a request for every URL found in a bike description, and the callback forwards each response to the "parse2" method, which parses those pages in a different manner to extract the data (see the urljoin sketch after this list).

Run the spider

Navigate to the spiders folder and run the following command:

scrapy runspider hondaspy.py -o honda.csv

This will generate the honda.csv file after scraping the Honda bikes' prices and names.
Supported output formats are .csv, .json, and .xml.
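Alternatively, from anywhere inside the project you can run the spider by its name attribute instead of its file:

scrapy crawl hondaspy -o honda.json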

Happy Coding