class: center, middle # Gathering data from the web using Python ### PyCon US 2024 - 16 / 05 / 2024 --- # Agenda - Web scraping fundamentals 🧑‍🏫 - Scrapy basic concepts 🧑‍🏫 - Exercise 1: Scraping a basic HTML page 👩‍💻 🧑‍💻 - Exercise 2: Scraping JavaScript-generated content (external API) 👩‍💻 🧑‍💻 - Exercise 3: Scraping JavaScript-generated content (data into HTML) 👩‍💻 🧑‍💻 - Exercise 4: Scraping a page with multiple requests for an item 👩‍💻 🧑‍💻 - Headless browsers 🧑‍🏫 - Beyond the spiders 🧑‍🏫 - Q&A 🧑‍🏫 👩‍💻 🧑‍💻 --- # Renne Rocha
Maintainer of
Querido Diário
https://queridodiario.ok.org.br/
Maintainer of
Spidermon
https://spidermon.readthedocs.io/
Co-founder of
Laboratório Hacker de Campinas
https://lhc.net.br
@rennerocha@chaos.social
@rennerocha
(other social networks)
--- # Why gather data from the web? - We need data to make decisions - Gather structured data from unstructured sources - Quantity of data available can be overwhelming and time-consuming to navigate through manually --- # Common Use Cases - Machine learning training data - Government data - Price intelligence - Brand monitoring - Consumer sentiment - Competitors' product data - Real estate data - Any application that benefits from data gathered from the web --- # Common tools in Python ecosystem ## Get content - **requests** (https://pypi.org/project/requests/) - **httpx** (https://pypi.org/project/httpx/) ## Parse content - **Beautiful Soup** (https://pypi.org/project/beautifulsoup4/) - **parsel** (https://pypi.org/project/parsel/) --- # Common tools in Python ecosystem ## Headless Browser - **Selenium** (https://www.selenium.dev/) - **Playwright** (https://playwright.dev/python/) ## Complete framework - **Scrapy** (https://scrapy.org/) --- # PyCon US 2024 - Tutorial Titles ``` python # code/pyconus2024-tutorials-requests.py import requests from parsel import Selector response = requests.get('https://us.pycon.org/2024/schedule/tutorials/') sel = Selector(text=response.text) for tutorial in sel.css('.calendar a::text').getall(): print(tutorial) ``` --- # PyCon US 2024 - Tutorial Titles ``` python # code/pyconus2024-tutorials-requests.py import requests from parsel import Selector *response = requests.get('https://us.pycon.org/2024/schedule/tutorials/') sel = Selector(text=response.text) for tutorial in sel.css('.calendar a::text').getall(): print(tutorial) ``` --- # PyCon US 2024 - Tutorial Titles ``` python # code/pyconus2024-tutorials-requests.py import requests from parsel import Selector response = requests.get('https://us.pycon.org/2024/schedule/tutorials/') sel = Selector(text=response.text) *for tutorial in sel.css('.calendar a::text').getall(): * print(tutorial) ``` ---
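# PyCon US 2024 - Tutorial Titles (Beautiful Soup)

The previous slides use requests + parsel. For comparison, a short sketch of the same scrape using Beautiful Soup (listed under "Parse content"), assuming the same `.calendar a` markup on the schedule page:

```python
# Hypothetical companion to code/pyconus2024-tutorials-requests.py,
# parsing with Beautiful Soup instead of parsel.
import requests
from bs4 import BeautifulSoup

response = requests.get('https://us.pycon.org/2024/schedule/tutorials/')
soup = BeautifulSoup(response.text, 'html.parser')

# select() accepts CSS selectors, much like parsel's .css()
for link in soup.select('.calendar a'):
    print(link.get_text(strip=True))
```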
--- # What if? - You have **thousands of URLs** for the same (or different) domain? - You need to **export data** in some specific format and schema? - You need to **manage the rate** to avoid degrading your target server? - You need to **monitor the execution** of your web crawlers? - You need to **run** the same web crawler **multiple times**? --- # Why **Scrapy**?
https://scrapy.org/
Application framework for crawling web sites
Batteries included (HTML parsing, asynchronous requests, data pipelines, sessions, data exporting, etc.)
Extensible (middlewares, downloaders, extensions)
Open Source
--- # Tutorial material
https://bit.ly/pyconus2024-tutorial
--- # Installing (Linux) ```bash $ git clone https://github.com/rennerocha/pyconus2024-tutorial pyconus2024-tutorial $ cd pyconus2024-tutorial $ python -m venv .venv $ source .venv/bin/activate $ cd code $ pip install -r requirements.txt (...) Many lines installing a lot of things $ scrapy version Scrapy 2.11.1 ``` For other platforms: https://docs.scrapy.org/en/latest/intro/install.html ---
--- # Scrapy Architecture  --- # Spiders `scrapy.Spider` - Define how a certain site will be scraped - How to perform the crawl (i.e. follow links) - How to extract structured data from the pages (i.e. scraping items) - Usually one for each domain --- # Spiders ```python # code/pyconus2024.py import scrapy class PyConUS2024Spider(scrapy.Spider): name = "pyconus" start_urls = [ 'https://us.pycon.org/2024/schedule/tutorials/', ] def parse(self, response): for tutorial in response.css('.calendar a::text').getall(): yield {"title": tutorial} ``` --- # Spiders ```python # code/pyconus2024.py import scrapy *class PyConUS2024Spider(scrapy.Spider): name = "pyconus" start_urls = [ 'https://us.pycon.org/2024/schedule/tutorials/', ] def parse(self, response): for tutorial in response.css('.calendar a::text').getall(): yield {"title": tutorial} ``` --- # Spiders ```python # code/pyconus2024.py import scrapy class PyConUS2024Spider(scrapy.Spider): * name = "pyconus" start_urls = [ 'https://us.pycon.org/2024/schedule/tutorials/', ] def parse(self, response): for tutorial in response.css('.calendar a::text').getall(): yield {"title": tutorial} ``` --- # Spiders ```python # code/pyconus2024.py import scrapy class PyConUS2024Spider(scrapy.Spider): name = "pyconus" * start_urls = [ * 'https://us.pycon.org/2024/schedule/tutorials/', * ] def parse(self, response): for tutorial in response.css('.calendar a::text').getall(): yield {"title": tutorial} ``` --- # Spiders ```python # code/pyconus2024.py import scrapy class PyConUS2024Spider(scrapy.Spider): name = "pyconus" * def start_requests(self): * start_urls = [ * 'https://us.pycon.org/2024/schedule/tutorials/', * ] * for url in start_urls: * yield scrapy.Request(url) def parse(self, response): for tutorial in response.css('.calendar a::text').getall(): yield {"title": tutorial} ``` --- # Spiders ```python # code/pyconus2024.py import scrapy class PyConUS2024Spider(scrapy.Spider): name = "pyconus" start_urls = [ 'https://us.pycon.org/2024/schedule/tutorials/', ] * def parse(self, response): * for tutorial in response.css('.calendar a::text').getall(): * yield {"title": tutorial} ``` ---
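# Item pipelines

The architecture and the "batteries included" slide mention a data pipeline: every item yielded by a spider can pass through item pipelines before being exported. A minimal sketch (the class name and the rule are invented for illustration; it would be enabled through the `ITEM_PIPELINES` setting):

```python
# Hypothetical pipeline: drop items without a title, normalize the rest.
from scrapy.exceptions import DropItem


class RequireTitlePipeline:
    def process_item(self, item, spider):
        # Called once for every item yielded by the spider
        if not item.get("title"):
            raise DropItem("Missing title")
        item["title"] = item["title"].strip()
        return item
```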
--- class: center, middle # Parsing Data --- # CSS Selectors ### https://www.w3.org/TR/CSS2/selector.html # XPath ### https://www.w3.org/TR/xpath/all/ --- # Parsing Data ``` # code/pyconus2024-css.py import scrapy class PyConUS2024Spider(scrapy.Spider): name = "pyconus" start_urls = [ 'https://us.pycon.org/2024/schedule/tutorials/', ] def parse(self, response): for tutorial in response.css('.presentation'): yield { 'speaker': tutorial.css('.speaker::text').get().strip(), 'url': response.urljoin( tutorial.css('.title a::attr(href)').get() ), 'title': tutorial.css('.title a::text').get() } ``` --- # Parsing Data ``` # code/pyconus2024-css.py import scrapy class PyConUS2024Spider(scrapy.Spider): name = "pyconus" start_urls = [ 'https://us.pycon.org/2024/schedule/tutorials/', ] def parse(self, response): * for tutorial in response.css('.presentation'): yield { * 'speaker': tutorial.css('.speaker::text').get().strip(), 'url': response.urljoin( * tutorial.css('.title a::attr(href)').get() ), * 'title': tutorial.css('.title a::text').get() } ``` ### CSS Selectors --- # Parsing Data ``` # code/pyconus2024-xpath.py import scrapy class PyConUS2024Spider(scrapy.Spider): name = "pyconus" start_urls = [ 'https://us.pycon.org/2024/schedule/tutorials/', ] def parse(self, response): * for tutorial in response.xpath('//div[@class="presentation"]'): yield { * 'speaker': tutorial.xpath('./div[@class="speaker"]/text()').get().strip(), 'url': response.urljoin( * tutorial.xpath('.//a/@href').get() ), * 'title': tutorial.xpath('.//a/text()').get() } ``` ### XPath --- # Parsing Data ``` # code/pyconus2024-xpath-and-css.py import scrapy class PyConUS2024Spider(scrapy.Spider): name = "pyconus" start_urls = [ 'https://us.pycon.org/2024/schedule/tutorials/', ] def parse(self, response): * for tutorial in response.css('.presentation'): yield { * 'speaker': tutorial.css('.speaker::text').get().strip(), 'url': response.urljoin( * tutorial.xpath('.//a/@href').get() ), * 'title': tutorial.xpath('.//a/text()').get() } ``` ### XPath and CSS Selector --- # CSS Selectors Examples ``` response.css("h1") ``` ``` response.css("ul#offers") ``` ``` response.css(".product") ``` ``` response.css("ul#offers .product a::attr(href)") ``` ``` response.css("ul#offers .product *::text") ``` ``` response.css("ul#offers .product p::text") ``` --- # XPath Examples ``` response.xpath("//h1") ``` ``` response.xpath("//h1[2]") ``` ``` response.xpath("//ul[@id='offers']") ``` ``` response.xpath("//li/a/@href") ``` ``` response.xpath("//li//text()") ``` ``` response.xpath("//li[@class='ad']/following-sibling::li") ``` --- # Exporting Results ``` $ scrapy runspider pyconus2024-css.py ``` ``` $ scrapy runspider pyconus2024-css.py -o results.csv ``` ``` $ scrapy runspider pyconus2024-css.py -o results.json ``` ``` $ scrapy runspider pyconus2024-css.py -o results.jl ``` ``` $ scrapy runspider pyconus2024-css.py -o results.xml ``` ### You can export in your own custom format if you like... https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports --- class: center, middle  --- class: center, middle We will use http://toscrape.com/, a sandbox containing fictional websites with a simplified version of real-world challenges we find during web scraping tasks. --- # Exercise 1 **Target:** https://quotes.toscrape.com/ On this page, you will find a collection of quotes along with their respective authors.
Each quote is accompanied by a link that directs you to a dedicated page providing additional details about the author, the quote itself, and a list of associated tags. Your task is to extract all of this information and export it into a JSON lines file. ---
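# Exporting Results from the spider

Besides the `-o` command-line option shown earlier, the export can also be configured in the spider itself through the `FEEDS` setting. A sketch producing the JSON lines file the exercise asks for (the filename is arbitrary):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    # Roughly equivalent to: scrapy runspider exercise-1.py -o quotes.jl
    custom_settings = {
        "FEEDS": {
            "quotes.jl": {"format": "jsonlines", "overwrite": True},
        },
    }

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {"quote": quote.css(".text::text").get()}
```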
---
--- # Exercise 1 **Target:** https://quotes.toscrape.com/ On this page, you will find a collection of quotes along with their respective authors. Each quote is accompanied by a link that directs you to a dedicated page providing additional details about the author, the quote itself, and a list of associated tags. Your task is to extract all of this information and export it into a JSON lines file. **TIP**: your parse method can be used to yield items or schedule new requests for later processing. ``` # if callback is not provided, the default is self.parse scrapy.Request("https://someurl.com", callback=self.parse_someurl) ``` --- # Exercise 1 ``` # code/exercise-1.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: yield { "quote": quote.css(".text::text").get(), "author": quote.css(".author::text").get(), "author_url": response.urljoin( quote.css("span a::attr(href)").get() ), "tags": quote.css(".tag *::text").getall(), } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- # Exercise 1 ``` # code/exercise-1.py import scrapy *class QuotesSpider(scrapy.Spider): * name = "quotes" * allowed_domains = ["quotes.toscrape.com"] * start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: yield { "quote": quote.css(".text::text").get(), "author": quote.css(".author::text").get(), "author_url": response.urljoin( quote.css("span a::attr(href)").get() ), "tags": quote.css(".tag *::text").getall(), } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- # Exercise 1 ``` # code/exercise-1.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com"] * def parse(self, response): * quotes = response.css(".quote") * for quote in quotes: * yield { * "quote": quote.css(".text::text").get(), * "author": quote.css(".author::text").get(), * "author_url": response.urljoin( * quote.css("span a::attr(href)").get() * ), * "tags": quote.css(".tag *::text").getall(), * } * * yield scrapy.Request( * response.urljoin(response.css(".next a::attr(href)").get()) * ) ``` --- # Exercise 1 ``` # code/exercise-1.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com"] def parse(self, response): * quotes = response.css(".quote") for quote in quotes: yield { "quote": quote.css(".text::text").get(), "author": quote.css(".author::text").get(), "author_url": response.urljoin( quote.css("span a::attr(href)").get() ), "tags": quote.css(".tag *::text").getall(), } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- # Exercise 1 ``` # code/exercise-1.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") * for quote in quotes: * yield { * "quote": quote.css(".text::text").get(), * "author": quote.css(".author::text").get(), * "author_url": response.urljoin( * quote.css("span a::attr(href)").get() * ), * "tags": quote.css(".tag *::text").getall(), * } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) 
) ``` --- # Exercise 1 ``` # code/exercise-1.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: yield { "quote": quote.css(".text::text").get(), "author": quote.css(".author::text").get(), * "author_url": response.urljoin( * quote.css("span a::attr(href)").get() * ), "tags": quote.css(".tag *::text").getall(), } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- # Exercise 1 ``` # code/exercise-1.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: yield { "quote": quote.css(".text::text").get(), "author": quote.css(".author::text").get(), "author_url": response.urljoin( quote.css("span a::attr(href)").get() ), * "tags": quote.css(".tag *::text").getall(), } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- # Exercise 1 ``` # code/exercise-1.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: yield { "quote": quote.css(".text::text").get(), "author": quote.css(".author::text").get(), "author_url": response.urljoin( quote.css("span a::attr(href)").get() ), "tags": quote.css(".tag *::text").getall(), } * yield scrapy.Request( * response.urljoin(response.css(".next a::attr(href)").get()) * ) ``` --- # Exercise 2 **Target:** https://quotes.toscrape.com/scroll There has been another modification to the layout. Our quotes page now features an infinite scroll functionality, meaning that new content is dynamically loaded as you reach the bottom of the page. **TIP**: To understand this behavior, open your browser and access our target page. Press **F12** to open the developer tools and select the "_Network_" tab. Observe what occurs in the network requests when you navigate to the end of the page. --- # Exercise 2
--- # Exercise 2
--- # Exercise 2
--- ```python # code/exercise-2.py import scrapy class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): yield scrapy.Request(self.api_url.format(page=1)) def parse(self, response): data = response.json() current_page = data.get("page") for quote in data.get("quotes"): yield { "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } if data.get("has_next"): next_page = current_page + 1 yield scrapy.Request( self.api_url.format(page=next_page), ) ``` --- ```python # code/exercise-2.py import scrapy class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] * api_url = "https://quotes.toscrape.com/api/quotes?page={page}" * def start_requests(self): * yield scrapy.Request(self.api_url.format(page=1)) def parse(self, response): data = response.json() current_page = data.get("page") for quote in data.get("quotes"): yield { "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } if data.get("has_next"): next_page = current_page + 1 yield scrapy.Request( self.api_url.format(page=next_page), ) ``` --- ```python # code/exercise-2.py import scrapy class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): yield scrapy.Request(self.api_url.format(page=1)) def parse(self, response): * data = response.json() current_page = data.get("page") for quote in data.get("quotes"): yield { "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } if data.get("has_next"): next_page = current_page + 1 yield scrapy.Request( self.api_url.format(page=next_page), ) ``` --- ```python # code/exercise-2.py import scrapy class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): yield scrapy.Request(self.api_url.format(page=1)) def parse(self, response): data = response.json() * current_page = data.get("page") for quote in data.get("quotes"): yield { "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } * if data.get("has_next"): * next_page = current_page + 1 * yield scrapy.Request( * self.api_url.format(page=next_page), * ) ``` --- ```python # code/exercise-2.py import scrapy class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): yield scrapy.Request(self.api_url.format(page=1)) def parse(self, response): data = response.json() current_page = data.get("page") * for quote in data.get("quotes"): * yield { * "quote": quote.get("text"), * "author": quote.get("author").get("name"), * "author_url": response.urljoin( * quote.get("author").get("goodreads_link") * ), * "tags": quote.get("tags"), * } if data.get("has_next"): next_page = current_page + 1 yield 
scrapy.Request( self.api_url.format(page=next_page), ) ``` --- # Exercise 3 **Target:** https://quotes.toscrape.com/js/ The spider you created in the first exercise has ceased to function. Although no errors are evident in the logs, the spider is not returning any data. **TIP**: To troubleshoot, open your browser and navigate to our target page. Press **Ctrl+U** (_View Page Source_) to inspect the HTML content of the page. ---
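# Exercise 3

Viewing the page source shows that the quotes are not in the rendered HTML: they are embedded in a `<script>` block as a JavaScript variable. A quick standalone sketch to confirm that before writing the spider (the regex mirrors the one used in the solution):

```python
# Check whether the data is embedded in the page source as "var data = [...]"
import re

import requests

response = requests.get("https://quotes.toscrape.com/js/")
match = re.search(r"var data = (\[.*?\]);", response.text, re.DOTALL)

print(match is not None)  # True: the quotes are there, just not as HTML elements
if match:
    print(match.group(1)[:80], "...")  # beginning of the embedded array
```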
--- ```python # code/exercise-3.py import json import scrapy class QuotesJSSpider(scrapy.Spider): name = "quotes_js" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com/js/"] def parse(self, response): raw_quotes = response.xpath( "//script" ).re_first(r"var data = ((?s:\[.*?\]));") quotes = json.loads(raw_quotes) for quote in quotes: yield { "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- ```python # code/exercise-3.py import json import scrapy class QuotesJSSpider(scrapy.Spider): name = "quotes_js" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com/js/"] def parse(self, response): * raw_quotes = response.xpath( * "//script" * ).re_first(r"var data = ((?s:\[.*?\]));") quotes = json.loads(raw_quotes) for quote in quotes: yield { "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- ```python # code/exercise-3.py import json import scrapy class QuotesJSSpider(scrapy.Spider): name = "quotes_js" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com/js/"] def parse(self, response): raw_quotes = response.xpath( "//script" ).re_first(r"var data = ((?s:\[.*?\]));") * quotes = json.loads(raw_quotes) for quote in quotes: yield { "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- ```python # code/exercise-3.py import json import scrapy class QuotesJSSpider(scrapy.Spider): name = "quotes_js" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com/js/"] def parse(self, response): raw_quotes = response.xpath( "//script" ).re_first(r"var data = ((?s:\[.*?\]));") quotes = json.loads(raw_quotes) * for quote in quotes: * yield { * "quote": quote.get("text"), * "author": quote.get("author").get("name"), * "author_url": response.urljoin( * quote.get("author").get("goodreads_link") * ), * "tags": quote.get("tags"), * } yield scrapy.Request( response.urljoin(response.css(".next a::attr(href)").get()) ) ``` --- ```python # code/exercise-3.py import json import scrapy class QuotesJSSpider(scrapy.Spider): name = "quotes_js" allowed_domains = ["quotes.toscrape.com"] start_urls = ["https://quotes.toscrape.com/js/"] def parse(self, response): raw_quotes = response.xpath( "//script" ).re_first(r"var data = ((?s:\[.*?\]));") quotes = json.loads(raw_quotes) for quote in quotes: yield { "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } * yield scrapy.Request( * response.urljoin(response.css(".next a::attr(href)").get()) * ) ``` --- # One item, multiple requests - Sometimes the information that we need to build one complete item is spread in multiple pages - We need to send multiple requests in order to gather all data, but we can only yield an item once - `Request.meta` and `Request.cb_kwargs` can be used to add arbitrary 
metadata to the request --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): num_pages = 10 for page_num in range(1, num_pages + 1): yield scrapy.Request( self.api_url.format(page=page_num), meta={ "page": page_num, } ) ``` --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): num_pages = 10 for page_num in range(1, num_pages + 1): yield scrapy.Request( self.api_url.format(page=page_num), * meta={ * "page": page_num, * } ) ``` --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): num_pages = 10 for page_num in range(1, num_pages + 1): yield scrapy.Request( self.api_url.format(page=page_num), * meta={ * "page": page_num, * } ) ``` - The `Request.meta` attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions. - https://docs.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): def start_requests(self): # (...) def parse(self, response): current_page = response.meta["page"] data = response.json() for quote in data.get("quotes"): yield { "page_num": current_page, "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } ``` --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): def start_requests(self): # (...) def parse(self, response): * current_page = response.meta["page"] data = response.json() for quote in data.get("quotes"): yield { "page_num": current_page, "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } ``` --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): num_pages = 10 for page_num in range(1, num_pages + 1): yield scrapy.Request( self.api_url.format(page=page_num), cb_kwargs={ "current_page": page_num, } ) ``` --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): name = "quotes_scroll" allowed_domains = ["quotes.toscrape.com"] api_url = "https://quotes.toscrape.com/api/quotes?page={page}" def start_requests(self): num_pages = 10 for page_num in range(1, num_pages + 1): yield scrapy.Request( self.api_url.format(page=page_num), * cb_kwargs={ * "current_page": page_num, * } ) ``` --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): def start_requests(self): # (...)
def parse(self, response, current_page): data = response.json() for quote in data.get("quotes"): yield { "page_num": current_page, "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } ``` --- # One item, multiple requests ```python class QuotesScrollSpider(scrapy.Spider): def start_requests(self): # (...) * def parse(self, response, current_page): data = response.json() for quote in data.get("quotes"): yield { * "page_num": current_page, "quote": quote.get("text"), "author": quote.get("author").get("name"), "author_url": response.urljoin( quote.get("author").get("goodreads_link") ), "tags": quote.get("tags"), } ``` --- # Exercise 4 **Target:** https://quotes.toscrape.com/ This exercise should improve the data gathered in **Exercise 1**. For each quote author, we have a link that opens extra details about them. One of the details is the author's birth date. For each quote, add the author's birth date to the item. **TIP**: Given that the data for a quote is spread across more than one page, we will need to pass the partial item to a subsequent request. You can use `Request.meta` or `Request.cb_kwargs` to achieve this. You will also need an extra callback method to parse this new page. --- # Exercise 4 ```python def parse(self, response): quotes = response.css(".quote") for quote in quotes: author_url = response.urljoin( quote.css("span a::attr(href)").get() ) quote_item = { # Build the item as we did in Exercise 1 } yield scrapy.Request( author_url, meta={"quote": quote_item}, callback=self.parse_about_page, ) def parse_about_page(self, response): quote = response.meta["quote"] # Get the extra data and attach to the quote before yielding it # (...) ``` --- # Exercise 4 ``` # code/exercise-4.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: author_url = response.urljoin( quote.css("span a::attr(href)").get() ) incomplete_quote = { "quote": quote.css(".text::text").get(), "author": quote.css(".author::text").get(), "author_url": author_url, "tags": quote.css(".tag *::text").getall(), } yield scrapy.Request( author_url, meta={"quote": incomplete_quote}, callback=self.parse_about_page, ) # (...) ``` --- # Exercise 4 ``` # code/exercise-4.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: # (...) yield scrapy.Request( author_url, * meta={"quote": incomplete_quote}, * callback=self.parse_about_page, ) def parse_about_page(self, response): quote = response.meta["quote"] quote["author_born_date"] = response.css( ".author-born-date::text" ).get() yield quote ``` --- # Exercise 4 ``` # code/exercise-4.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: # (...)
yield scrapy.Request( author_url, meta={"quote": incomplete_quote}, callback=self.parse_about_page, ) * def parse_about_page(self, response): * quote = response.meta["quote"] * quote["author_born_date"] = response.css( * ".author-born-date::text" * ).get() * yield quote ``` --- # Exercise 4 ``` # code/exercise-4.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: # (...) yield scrapy.Request( author_url, meta={"quote": incomplete_quote}, callback=self.parse_about_page, ) * def parse_about_page(self, response): * quote = response.meta["quote"] * quote["author_born_date"] = response.css( * ".author-born-date::text" * ).get() * yield quote ``` --- # Exercise 4 ``` # code/exercise-4.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: # (...) yield scrapy.Request( author_url, meta={"quote": incomplete_quote}, callback=self.parse_about_page, dont_filter=True, ) # (...) ``` --- # Exercise 4 ```python # code/exercise-4.py import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ["https://quotes.toscrape.com"] def parse(self, response): quotes = response.css(".quote") for quote in quotes: # (...) yield scrapy.Request( author_url, meta={"quote": incomplete_quote}, callback=self.parse_about_page, * dont_filter=True, ) # (...) ``` `dont_filter=True` disables Scrapy's duplicate request filter for this request: several quotes can point to the same author page, and without it the repeated requests to that URL would be dropped, losing the quotes attached to them. --- # Headless browsers - Primarily for accessing websites that heavily rely on JavaScript-rendered content using frameworks like React, Vue, and Angular - Since they drive a real browser (even if it doesn't render the UI), web crawlers using headless browsers are typically slower and harder to scale - Existing solutions are often designed for automated testing rather than web scraping --- # Headless browsers ## Most known - **Selenium** (https://www.selenium.dev/) - **Playwright** (https://playwright.dev/) ## Scrapy Integration - **scrapy-playwright** (https://pypi.org/project/scrapy-playwright/) --- # Headless browsers ```python import scrapy class QuotesPlaywrightSpider(scrapy.Spider): name = "quotes-playwright" custom_settings = { "DOWNLOAD_HANDLERS": { "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", }, "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor", } def start_requests(self): yield scrapy.Request( url="http://quotes.toscrape.com/js/", meta={ "playwright": True, }, ) async def parse(self, response): with open("playwright-enabled.html", "w") as content: content.write(response.text) ``` --- # Headless browsers ```python import scrapy class QuotesPlaywrightSpider(scrapy.Spider): name = "quotes-playwright" * custom_settings = { * "DOWNLOAD_HANDLERS": { * "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", * "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", * }, * "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor", * } def start_requests(self): yield scrapy.Request( url="http://quotes.toscrape.com/js/", * meta={ * "playwright": True, * }, ) async def parse(self, response): with open("playwright-enabled.html", "w") as content: content.write(response.text) ``` --- # Headless browsers ```python import scrapy class QuotesPlaywrightSpider(scrapy.Spider): name = "quotes-playwright" custom_settings = {
"DOWNLOAD_HANDLERS": { "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", }, "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor", } def start_requests(self): yield scrapy.Request( url="http://quotes.toscrape.com/js/", * meta={ * "playwright": False, * }, ) async def parse(self, response): with open("playwright-disabled.html", "w") as content: content.write(response.text) ``` --- class: center, middle # Beyond the spiders --- # Monitoring - We need to ensure that we are extracting the data we need, so monitoring the execution of your spiders is crucial - Spidermon is a Scrapy extension that helps us to **monitor** our spiders and take **actions** based on the results of the execution of them - https://spidermon.readthedocs.io/ --- # Proxies - Avoid IP bans and anti-bot services - Large scale scraping - Access region-specific content - Datacenter vs residential vs mobile proxies - Easily integrated with Scrapy using extensions --- # What else you should worry? - Be **polite**, don't scrape to fast that interfire in the target website operation - Follow the **terms of service** of the website - Be careful when scraping personal data - Is it legal? --- class: center, middle # Obrigado! Thanks! --- class: center, middle # Questions?