Scrapy is an open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
Source: Scrapy.org

Note: This post assumes prior hands-on experience with Scrapy.

  1. Let's start by installing Scrapy.
    pip install Scrapy
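
     To verify the installation, check the version Scrapy reports:
    scrapy version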

  2. Generate spider project
    scrapy startproject githubspider
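
     This creates a project skeleton roughly like the following (exact files can vary slightly between Scrapy versions):

    githubspider/
        scrapy.cfg            # deploy configuration file
        githubspider/         # the project's Python module
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/          # your spiders will live here
                __init__.py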

  3. Generate the spider (run this from inside the project directory)
    cd githubspider
    scrapy genspider githublogin github.com
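
     This drops a boilerplate githublogin.py into the spiders folder; with the default template it looks roughly like this (details vary between Scrapy versions):

import scrapy


class GithubloginSpider(scrapy.Spider):
    name = 'githublogin'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/']

    def parse(self, response):
        pass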

  4. Open the githublogin.py file in the spiders folder in your favourite code editor, and replace its contents with the code below.

# -*- coding: utf-8 -*-
import scrapy


class GithubloginSpider(scrapy.Spider):
    name = 'githublogin'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        """Log in to GitHub by submitting the login form."""
        # Pull the hidden CSRF fields out of the login form. Their name
        # attributes on the page are 'utf8' and 'authenticity_token', so
        # the formdata keys below must match those names for the POST to work.
        utf8 = response.css('#login > form > input[type="hidden"]:nth-child(1)::attr(value)').extract_first()
        authenticity_token = response.css('#login > form > input[type="hidden"]:nth-child(2)::attr(value)').extract_first()

        # from_response() pre-populates fields it finds in the form;
        # formdata overrides them and adds our credentials.
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'utf8': utf8,
                'authenticity_token': authenticity_token,
                'login': 'githubuser',
                'password': 'githubpass',
                'commit': 'Sign in'
            },
            callback=self.scrape_homepage)

    def scrape_homepage(self, response):
        # Logged in: response is the GitHub home feed. From here, request
        # the profile page; the session cookies are carried along.
        yield scrapy.Request('https://github.com/dipakdotyadav', callback=self.scrape_profilepage)

    def scrape_profilepage(self, response):
        # Dump the raw HTML of the profile page as the scraped item.
        yield {'response': response.text}
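
Hard-coding credentials works for a quick demo, but a safer pattern is to read them from environment variables so they never land in version control. A minimal sketch, assuming hypothetical variable names GITHUB_USER and GITHUB_PASS:

import os

# Hypothetical variable names; export them in your shell before the crawl:
#   export GITHUB_USER=yourname
#   export GITHUB_PASS=yourpass
GITHUB_USER = os.environ['GITHUB_USER']
GITHUB_PASS = os.environ['GITHUB_PASS']

# ...then in parse(), use these instead of the hard-coded strings:
#     'login': GITHUB_USER,
#     'password': GITHUB_PASS,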

  5. Replace 'githubuser' and 'githubpass' in the formdata with your GitHub username and password.

  6. Run the spider with the command below
    scrapy crawl githublogin -o githublogin.json
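
     The -o flag writes every yielded item into githublogin.json as a JSON array. A quick sanity check of the output, assuming the crawl produced at least one item:

import json

with open('githublogin.json') as f:
    items = json.load(f)

# scrape_profilepage yields {'response': <profile page HTML>}
print(len(items), 'item(s) scraped')
print(items[0]['response'][:200])  # first 200 characters of the HTML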

Enjoy!!

Please share this post if you enjoyed reading it, and subscribe to get updates on the latest blogs.