Scrapy Framework

Crawlers:

Crawlers are programs or bots that visit sites on the internet and read the pages published by the site owners, usually for indexing purposes. Most search engines use crawlers to discover sites on the internet. These spiders or programs can visit an entire site or specific pages and index them. We can also use crawlers to extract information from sites, process that information, and gain a deeper understanding of the data.

Scrapy:

Scrapy is a free and open-source web crawling framework that is helpful for extracting data from web pages, consuming APIs, or acting as a general-purpose web crawler.

Scrapy Architecture:

[Figure: Scrapy architecture diagram]

  1. Spider is a user-defined class where we write the crawler for the site. This class has a start_requests() function in which we give the URLs of the site we want to crawl. These requests are passed to the engine. (A minimal spider sketch follows this list.)
  2. The engine gets the requests yielded by the spider's start_requests() function, schedules them in the scheduler, and asks for the next request to crawl from the initial requests. The engine is the heart of Scrapy; it maintains the data flow between all the components.
  3. The scheduler returns the next request to the engine, dequeuing the requests it has queued up. By default, its scheduling mode is LIFO, but the scheduling mode can be changed.
  4. The engine sends the request to the downloader, passing it through the downloader middleware (process_request()). We can alter the request attributes for the URL in process_request(), or we can add the request arguments to the URL in the spider itself.
  5. The downloader downloads the response from the internet and sends it back to the engine, passing it through the downloader middleware (process_response()). In process_response() we can adapt the downloaded response to our use case; for example, if the required response is not available, we can return a new request instead of passing the response to the engine.
  6. The engine receives the response from the downloader and passes it to the spider through the spider middleware.
  7. The spider processes the response and extracts the necessary information from it. The response can be a JSON response, an HTML response, a script, or an empty page, depending on the site. Extracted items can be yielded from the callback either as dicts or as Items.
  8. The engine sends the processed items to the item pipelines (if we yielded them as items in the spider), then sends the processed requests to the scheduler and asks for the next requests to crawl. In the item pipelines we can process the extracted information before storing the data in a database or by any other means.
  9. The process repeats (from step 1) until there are no more requests from the scheduler.
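
As a minimal sketch of the spider side of this flow (the URL and XPath below are placeholders), start_requests() yields the initial requests to the engine and the callback receives the downloaded responses:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Initial request handed to the engine; the URL is a placeholder.
        yield scrapy.Request("https://example.com/listing", callback=self.parse)

    def parse(self, response):
        # Called with the response once the downloader has fetched it.
        for title in response.xpath("//h2/text()").getall():
            yield {"title": title}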

Middlewares:

Downloader Middlewares:

Downloader middlewares are specific hooks that sit between the engine and the downloader and process the requests and responses that pass between them.

Downloader middleware is useful to do these things:

  1. Process a request just before it is sent to the downloader. For example, we can add a proxy to the request (see the sketch after this list).
  2. Change the received response before passing it to the spider.
  3. Send a new request instead of passing the received response to the spider.
  4. Pass a response to a spider without fetching a web page, and drop some requests.
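
A minimal sketch of a downloader middleware, assuming a hypothetical ProxyDownloaderMiddleware class and proxy address; process_request() runs before the request reaches the downloader and process_response() runs on the way back:

class ProxyDownloaderMiddleware:
    def process_request(self, request, spider):
        # Attach a proxy before the request is sent to the downloader.
        request.meta["proxy"] = "http://127.0.0.1:8080"  # placeholder proxy
        return None  # None means: continue processing this request

    def process_response(self, request, response, spider):
        # If the page came back empty, retry the request instead of
        # passing the empty response to the spider.
        if not response.body:
            return request.replace(dont_filter=True)
        return response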

Spider Middlewares:

Spider middlewares are specific hooks that sit between the engine and the spider. In them we can process the spider's input (responses) and output (items and requests).
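
A minimal sketch of a spider middleware, assuming a hypothetical class name and a hypothetical "title" field; process_spider_output() sees everything the spider's callbacks yield:

class DropEmptyItemsSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # Filter what the spider yields: drop dict items without a "title",
        # pass everything else (items and new requests) through unchanged.
        for element in result:
            if isinstance(element, dict) and not element.get("title"):
                continue
            yield element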

Scrapy has some built-in middlewares which are enabled by default. Requests coming from the engine are passed through the middlewares in the order of their priority values. It is best not to change the default priority values of the built-in middlewares, and we can disable any of them in the settings by setting their order to None.
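
As a sketch, middleware ordering is controlled in settings.py. The custom class paths below are hypothetical; the UserAgentMiddleware path is a real built-in, disabled here by setting its order to None:

DOWNLOADER_MIDDLEWARES = {
    # Enable the custom downloader middleware with priority 543.
    "myproject.middlewares.ProxyDownloaderMiddleware": 543,
    # Disable a built-in downloader middleware.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.DropEmptyItemsSpiderMiddleware": 543,
}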

Signals:

Signals are used in Scrapy to notify us that certain events have occurred. We can connect to these signals to perform tasks at those points which the Scrapy framework does not provide out of the box.
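
A minimal sketch of hooking into a signal from an extension; the StatsLogger class name is hypothetical, while signals.item_scraped and crawler.signals.connect() are part of Scrapy:

from scrapy import signals

class StatsLogger:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Call log_item every time an item is scraped.
        crawler.signals.connect(ext.log_item, signal=signals.item_scraped)
        return ext

    def log_item(self, item, response, spider):
        spider.logger.info("Scraped an item from %s", response.url)

The extension would then be enabled through the EXTENSIONS setting, for example {"myproject.extensions.StatsLogger": 500}.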

Types of Signals:
Deferred Signal Handlers
Built-in signals reference

Deferred Signal Handlers:

Some signals allow us to run asynchronous code by returning Deferred objects from their handlers. If a signal handler returns a Deferred, Scrapy waits for that Deferred to fire.
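
A minimal sketch of a deferred handler for the spider_closed signal; the AsyncCleanup class name and the two-second delay are placeholders for real asynchronous cleanup work:

from twisted.internet import defer, reactor
from scrapy import signals

class AsyncCleanup:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        # Returning a Deferred makes Scrapy wait for it to fire
        # before finishing the shutdown.
        d = defer.Deferred()
        reactor.callLater(2.0, d.callback, None)  # stand-in for async cleanup
        return d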

Built-in signals reference:

  • engine_started
  • engine_stopped
  • item_scraped
  • item_dropped
  • item_error
  • spider_closed
  • spider_opened
  • spider_idle
  • spider_error
  • request_scheduled
  • request_dropped
  • request_reached_downloader
  • request_left_downloader
  • bytes_received
  • response_received
  • response_downloaded

Crawler API:

CrawlerProcess:

We can also trigger the spider from a Python script instead of triggering it with the scrapy crawl command. Scrapy provides an API for this. Scrapy runs on the Twisted reactor, and to run a spider inside a Twisted reactor we import a utility: from scrapy.crawler import CrawlerProcess. This class starts the Twisted reactor for us.
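
A minimal sketch of running a spider from a script with CrawlerProcess; the QuotesSpider below uses quotes.toscrape.com, a public practice site, purely for illustration:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"text": text}

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(QuotesSpider)
process.start()  # starts the Twisted reactor; blocks until the crawl finishes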

 

CrawlerRunner:

CrawlerRunner is helpful when we want to run additional code inside the reactor while the crawler is running. With CrawlerProcess we cannot run additional functionality inside the same reactor; CrawlerRunner solves this problem.
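
A minimal sketch with CrawlerRunner, reusing the QuotesSpider sketched above; the reactor is started and stopped by our own code, so other work can be scheduled on it while the crawl runs:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

d = runner.crawl(QuotesSpider)        # the spider sketched above
d.addBoth(lambda _: reactor.stop())   # stop the reactor when the crawl ends

# Additional work scheduled on the same reactor while the crawl is running.
reactor.callLater(5, lambda: print("still inside the reactor"))

reactor.run()  # blocks until reactor.stop() is called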

 

HOW TO WRITE A SPIDER FOR A SITE:

Spider – Spiders are classes where we crawl the sites and extract data from them using XPath or CSS selectors. It is usually better to take the data with XPath.
Step 1:
Command: scrapy startproject project_name
Here is what the project directory looks like:

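In current Scrapy versions, the default layout generated by scrapy startproject looks roughly like this (project_name is whatever name we passed to the command):

project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py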

Step 2:
Create a spider file inside the spiders folder.

Step 3:
Give the spider a name. This is mandatory: the spider name lets us call the appropriate spider, and we can write any number of spiders for a site.

Step 4:
Define the requests we want to crawl in the start_requests function, yield them as Scrapy requests, and add a callback function to scrape the listing page.

Step 5:
In the callback function, crawl the items available on the page using XPath. If a next-page link is available, extract the next_page_url and set the same function as its callback; this loop continues until we scrape the last page. We can also extract the item_urls available on each page, make a Scrapy request for each, and scrape the items on their detail pages.

Step 6:
Each item URL can be scraped in another callback function, where we extract whatever data we need using XPath and yield it as a dict. (A complete spider sketch covering these steps is shown below.)
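
Putting steps 3 to 6 together, a sketch of such a spider could look like this; the site URL, XPath expressions, and field names are placeholders:

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"  # step 3: the mandatory spider name

    def start_requests(self):
        # Step 4: yield the initial request with a callback for the listing page.
        yield scrapy.Request("https://example.com/products", callback=self.parse_listing)

    def parse_listing(self, response):
        # Step 5: follow each item URL to its detail page.
        for item_url in response.xpath("//a[@class='item']/@href").getall():
            yield response.follow(item_url, callback=self.parse_item)

        # Step 5: follow the next page with the same callback until the last page.
        next_page_url = response.xpath("//a[@rel='next']/@href").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listing)

    def parse_item(self, response):
        # Step 6: extract the fields we need and yield them as a dict.
        yield {
            "name": response.xpath("//h1/text()").get(),
            "price": response.xpath("//span[@class='price']/text()").get(),
        }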

Step 7:
Running the crawler.
scrapy crawl spider_name

If we want to save the scraped items as JSON lines, we can run the crawler as
scrapy crawl spider_name -o crawled_data.jl