Python Web Scraping Cookbook

How to build robust ETL pipelines with AWS SQS

Scraping a large quantity of sites and data can be a complicated and slow process. But it is one that can take great advantage of parallel processing, either locally with multiple processor threads, or by distributing scraping requests to remote scrapers using a message queue system. There may also be a need for multiple steps in a process, similar to an Extract, Transform, and Load (ETL) pipeline. These pipelines can also be easily built using a message queuing architecture in conjunction with the scraping.

Using a message queuing architecture gives our pipeline two advantages:

  • Robustness
  • Scalability

The processing becomes robust because, if the processing of an individual message fails, the message can be re-queued and processed again. So if a scraper fails, we can restart it without losing the request to scrape the page, or the message queue system will deliver the request to another scraper.
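A minimal sketch of this pattern using boto3 (the scrape-requests queue name and the scrape_page() helper are assumptions for illustration): the worker deletes a message only after processing it successfully, so a crash mid-scrape simply lets the message reappear after the visibility timeout and be picked up again.

    import boto3
    import requests

    sqs = boto3.client('sqs')
    # assumed queue name for illustration
    queue_url = sqs.get_queue_url(QueueName='scrape-requests')['QueueUrl']

    def scrape_page(url):
        """Hypothetical scrape step: fetch the page, raising on HTTP errors."""
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text

    while True:
        # long-poll for up to 20 seconds so an idle worker doesn't busy-wait
        received = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for message in received.get('Messages', []):
            try:
                scrape_page(message['Body'])
            except Exception:
                # skip the delete: once the visibility timeout expires,
                # SQS re-delivers the message to this or another scraper
                continue
            # deleting only after success is what makes the pipeline robust
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=message['ReceiptHandle'])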

It provides scalability, as multiple scrapers on the same or different systems can listen on the queue. Multiple messages can then be processed at the same time on different cores or, more importantly, on different systems. In a cloud-based scraper, you can scale up the number of scraper instances on demand to handle greater load.
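Scaling out is then just a matter of running more copies of a worker loop like the one above; the producer side only needs to put one message per page on the queue. A sketch, again assuming a queue named scrape-requests and illustrative URLs:

    import boto3

    sqs = boto3.client('sqs')
    queue_url = sqs.get_queue_url(QueueName='scrape-requests')['QueueUrl']

    # each URL becomes one message; any number of scraper processes, on
    # any number of machines, can consume from this same queue in parallel
    for url in ['http://example.com/page1', 'http://example.com/page2']:
        sqs.send_message(QueueUrl=queue_url, MessageBody=url)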

Common message queuing systems include Kafka, RabbitMQ, and Amazon SQS. Our example will utilize Amazon SQS, although Kafka and RabbitMQ are both excellent choices (we will see RabbitMQ in use later in the book). We use SQS to stay with the model of using AWS cloud-based services, as we did earlier in the chapter with S3.
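The sketches above assumed a scrape-requests queue already exists. Creating it takes only a few lines of boto3, assuming AWS credentials are configured as in the earlier S3 recipe; the VisibilityTimeout value here is an illustrative choice:

    import boto3

    sqs = boto3.client('sqs')

    # create_queue returns the existing queue's URL if a queue with this
    # name and the same attributes already exists
    queue = sqs.create_queue(
        QueueName='scrape-requests',
        Attributes={
            # seconds a received message stays hidden before SQS makes it
            # visible again for another consumer (illustrative value)
            'VisibilityTimeout': '120',
        })
    print(queue['QueueUrl'])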