From aba8404020567a46e599e76758e5ebdc4079c0e5 Mon Sep 17 00:00:00 2001
From: Renne Rocha
Date: Fri, 14 Feb 2025 09:21:22 -0300
Subject: [PATCH] Dynamic rules for following links declaratively with Scrapy

---
 ...llowing-links-declaratively-with-scrapy.md | 154 ++++++++++++++++++
 1 file changed, 154 insertions(+)
 create mode 100644 content/posts/20250214-dynamic-rules-for-following-links-declaratively-with-scrapy.md

diff --git a/content/posts/20250214-dynamic-rules-for-following-links-declaratively-with-scrapy.md b/content/posts/20250214-dynamic-rules-for-following-links-declaratively-with-scrapy.md
new file mode 100644
index 0000000..d91a1c1
--- /dev/null
+++ b/content/posts/20250214-dynamic-rules-for-following-links-declaratively-with-scrapy.md
@@ -0,0 +1,154 @@
---
title: "Dynamic rules for following links declaratively with Scrapy"
date: 2025-02-14
tags: ["web scraping", "scrapy"]
slug: dynamic-rules-for-following-links-declaratively-with-scrapy
---

When using `CrawlSpider`, we define a fixed set of rules that declares how links extracted from the website should be followed and processed.

But sometimes we don't want the rules to be static. We need a certain level of dynamism, where the rules vary according to parameters provided as input to our spiders.

Consider that we are scraping product URLs from an e-commerce website with the following patterns for category URLs:

- `https://store.example.com/` - main page of our store
- `https://store.example.com/electronics` - list of products in the `electronics` category
- `https://store.example.com/food` - list of products in the `food` category

We can see that category pages follow the pattern `https://store.example.com/<category>`. Using `CrawlSpider` as [explained in my last post]({{< ref "20250212-following-links-declaratively-with-scrapy" >}}), this set of `rules` can be defined as:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    rules = (
        Rule(
            LinkExtractor(allow=r"\/electronics$"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(allow=r"\/electronics/ID\d+$"),
            callback="parse_product",
        ),
        Rule(
            LinkExtractor(allow=r"\/food$"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(allow=r"\/food/ID\d+$"),
            callback="parse_product",
        ),
    )

    def parse_category(self, response):
        ...  # Code to parse a category

    def parse_product(self, response):
        ...  # Code to parse a product
```

(Note that the callbacks are given as strings: `rules` is defined in the class body, where `self` does not exist yet.)

There are a few potential problems with this approach:

- We need to create a new pair of rules for each category we want to extract data from;
- We need to change the code whenever we want to start processing a new category;
- If processing a particular category takes too long, we might want to run the spider in parallel so that each process extracts data from just one category.

What if we could send the name of the category that we want to process as an argument to our spider? We can do that with `-a argument=value` when calling `scrapy crawl`, such as:

```bash
scrapy crawl store -a category=food
```

If we run the spider with this argument, the spider instance now has a `self.category` attribute with the value `food`, which we can use to limit which links get extracted.
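As an aside, Scrapy turns every `-a` argument into an attribute on the spider instance, which is what the spider above relies on. If we prefer the argument to be explicit and validated up front, a minimal sketch could look like the one below (the `__init__` override and its error message are illustrations, not part of the original spider):

```python
from scrapy.spiders import CrawlSpider


class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    def __init__(self, category=None, *args, **kwargs):
        # Fail fast when the spider is started without `-a category=...`
        if category is None:
            raise ValueError("usage: scrapy crawl store -a category=<name>")
        # Set the attribute before CrawlSpider.__init__ runs, since that is
        # where the rules are compiled (relevant once we build them dynamically).
        self.category = category
        super().__init__(*args, **kwargs)
```

Relying on the default attribute injection works just as well; the override only gives a clearer error when the argument is missing.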
A first attempt is to filter inside the callbacks, discarding any response whose URL does not belong to the desired category:

```python
    def parse_category(self, response):
        if self.category not in response.url:
            return

        ...  # Code to parse a category
```

The problem with this approach is that we still send a real request to every extracted link, even for the categories we don't want, only to discard the response afterwards.

A better solution would be to build the `rules` collection dynamically:

```python
    rules = (
        Rule(
            LinkExtractor(allow=rf"\/{self.category}$"),
            callback=self.parse_category,
        ),
        Rule(
            LinkExtractor(allow=rf"\/{self.category}/ID\d+$"),
            callback=self.parse_product,
        ),
    )
```

Unfortunately, this will not work: `rules` is a class attribute, evaluated at class-definition time, so there is no spider instance (and therefore no `self.category`) to read our input from.

Investigating Scrapy's code, we find that the defined rules are [processed in a call](https://github.com/scrapy/scrapy/blob/f041f26a6ff636b764d2bf584ddbc9b9e4334d1b/scrapy/spiders/crawl.py#L97) to the [`_compile_rules`](https://github.com/scrapy/scrapy/blob/f041f26a6ff636b764d2bf584ddbc9b9e4334d1b/scrapy/spiders/crawl.py#L181) method.

Inside this method, each rule in `self.rules` is evaluated and appended to the `self._rules` attribute, which is what the spider uses to [decide which links to follow](https://github.com/scrapy/scrapy/blob/f041f26a6ff636b764d2bf584ddbc9b9e4334d1b/scrapy/spiders/crawl.py#L127).

Knowing this, we can override `_compile_rules` in our own spider: it takes the value passed as an argument and builds rules that only extract links from the desired category.

Our spider can look like this:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    def _compile_rules(self):
        # Build the rules at runtime, using the category passed with `-a category=...`
        self.rules = (
            Rule(
                LinkExtractor(allow=rf"\/{self.category}$"),
                callback=self.parse_category,
            ),
            Rule(
                LinkExtractor(allow=rf"\/{self.category}/ID\d+$"),
                callback=self.parse_product,
            ),
        )

        # After setting our rules, just reuse the existing _compile_rules() method
        super()._compile_rules()

    def parse_category(self, response):
        ...  # Code to parse a category

    def parse_product(self, response):
        ...  # Code to parse a product
```

We can now run a separate spider job for each category and extract product data from each one individually:

```bash
# Extracts data only from the 'electronics' category
scrapy crawl store -a category=electronics
```

```bash
# Extracts data only from the 'food' category
scrapy crawl store -a category=food
```

If new categories appear, we just pass them as a new value for the argument:

```bash
# Extracts data only from the 'cars' category
scrapy crawl store -a category=cars
```
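Since each run is now an independent job, nothing stops us from launching several of them in parallel, one process per category, as mentioned earlier. A minimal sketch, assuming a small helper script and an example category list (neither is part of the original post):

```python
# Hypothetical helper script: launch one `scrapy crawl` process per category
# so each category is crawled in parallel by its own process.
import subprocess

categories = ["electronics", "food", "cars"]  # example values only

processes = [
    subprocess.Popen(["scrapy", "crawl", "store", "-a", f"category={category}"])
    for category in categories
]

# Wait for every crawl to finish before exiting.
for process in processes:
    process.wait()
```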