feat: implement robots.txt compliance and fix type checking by AbdullahY36 · Pull Request #221 · D4Vinci/Scrapling

AbdullahY36 · 2026-04-01T22:34:55Z

Proposed change

This PR implements robots.txt compliance support for the Spider framework, allowing crawlers to automatically respect directives specified in target websites' robots.txt files. Additionally, it resolves type checking errors detected by mypy and pyright.

Key Features

1. Robots.txt Compliance System

- New RobotsTxtManager class for fetching, parsing, and caching robots.txt files
- Automatic per-domain robots.txt parsing using the protego library
- Support for standard robots.txt directives:
  - User-agent specific rules
  - Allow/Disallow directives with wildcards and anchors
  - Crawl-delay per user-agent
  - Sitemap declarations
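
To illustrate the directives listed above, here is a minimal sketch that parses a sample robots.txt and queries it. It uses the standard library's `urllib.robotparser` as a stand-in for protego, purely so the snippet has no third-party dependency. The standard-library parser only does prefix matching on paths and applies rules in file order, so it does not cover the wildcard and anchor support mentioned above. The sample file and the `mybot` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content covering the directive types listed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
Crawl-delay: 2

Sitemap: https://example.com/sitemap.xml
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt text with the stdlib parser (stand-in for protego)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)

# Allow/Disallow check for a given user agent and URL.
allowed = parser.can_fetch("mybot", "https://example.com/index.html")
blocked = parser.can_fetch("mybot", "https://example.com/private/data")

# Crawl-delay for the user agent (falls back to the "*" group here).
delay = parser.crawl_delay("mybot")

# Sitemap declarations (available in Python 3.8+).
sitemaps = parser.site_maps()
```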

2. Spider Configuration

- New robots_txt_obey configuration flag to enable/disable robots.txt compliance
- Per-domain crawl-delay enforcement from robots.txt Crawl-delay directives
- Tracking of disallowed request attempts in crawl statistics
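
As a rough illustration of the opt-in flag, the sketch below uses a stand-in base class rather than Scrapling's real `Spider`. Only the `robots_txt_obey` attribute name and its `False` default come from this PR; `SpiderSketch`, `PoliteSpider`, and `should_check_robots` are hypothetical names.

```python
class SpiderSketch:
    """Stand-in for the real Spider base class (hypothetical)."""

    name: str = "base"
    # Default mirrors the PR: compliance is disabled unless a spider opts in.
    robots_txt_obey: bool = False


class PoliteSpider(SpiderSketch):
    """A spider that opts in to robots.txt compliance."""

    name = "polite"
    robots_txt_obey = True


def should_check_robots(spider: SpiderSketch) -> bool:
    """Return True when the engine should consult robots.txt for this spider."""
    return bool(getattr(spider, "robots_txt_obey", False))
```

Keeping the flag as a class attribute means existing spiders keep their current behavior without any code change.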

3. Type Safety

- Resolved mypy and pyright type checking errors
- All public code now passes strict type checking

Implementation Details

- Protego Integration: Uses protego library for robust robots.txt parsing
- Caching: RobotsTxtManager caches parsed robots.txt files per domain and session to minimize fetches
- Async Support: Full async/await support throughout the robots.txt system
- Per-Domain Delays: Extracts and enforces Crawl-delay directives from robots.txt

Changes Made

- Added protego dependency for robots.txt parsing
- Created RobotsTxtManager class (scrapling/spiders/robotstxt.py)
- Added robots_txt_obey flag to Spider configuration
- Implemented per-domain delay enforcement in CrawlerEngine
- Added disallowed request tracking to CrawlStats
- Updated MockSpider test fixture with robots.txt parameter
- Comprehensive test suite for RobotsTxtManager
- Fixed type annotations in CrawlerEngine

Type of change

New feature (which adds functionality to an existing integration)
Code quality improvements to existing code or addition of tests

Additional information

- Robots.txt compliance is optional (controlled by robots_txt_obey flag)
- Default behavior is backward compatible (compliance disabled by default)
- All changes include full type hints and pass mypy/pyright

Checklist

I have read CONTRIBUTING.md.
This pull request is all my own work -- I have not plagiarized.
I know that pull requests will not be merged if they fail the automated tests.
All new Python files are placed inside an existing directory.
All filenames are in all lowercase characters with no spaces or dashes.
All functions and variable names follow Python naming conventions.
All function parameters and return values are annotated with Python type hints.
All functions have doc-strings.


* Add RobotsTxtManager class with per-domain, per-session caching
* Implement can_fetch() to check URL against robots.txt rules
* Implement get_crawl_delay() to extract Crawl-delay directive
* Implement get_request_rate() to extract Request-rate directive
* Implement get_sitemaps() to list sitemap URLs from robots.txt
* Add double-checked locking to prevent race conditions on first domain fetch
* Add clear_cache() with selective invalidation by domain/sid
* Handle 404/5xx responses by allowing all URLs (graceful fallback)
* Handle fetch errors by allowing all URLs (network-safe)
* Support arbitrary text encodings via response.encoding parameter
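
The caching and locking behavior described above could look roughly like the following sketch. It is not the PR's `RobotsTxtManager`: the class name, the injected `fetch` callback, and the use of `urllib.robotparser` instead of protego are all assumptions made to keep the example self-contained. It shows a cache keyed by (domain, sid), double-checked locking around the first fetch, and the allow-all fallback when the fetch fails or returns nothing.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical fetch callback: returns the robots.txt body, or None on 404/5xx.
Fetcher = Callable[[str], Awaitable[Optional[str]]]
CacheKey = Tuple[str, str]  # (domain, session id)


class RobotsCacheSketch:
    """Illustrative stand-in for a per-domain, per-session robots.txt cache."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._cache: Dict[CacheKey, RobotFileParser] = {}
        self._locks: Dict[CacheKey, asyncio.Lock] = {}

    async def _get_parser(self, url: str, sid: str) -> RobotFileParser:
        parts = urlsplit(url)
        key = (parts.netloc, sid)
        cached = self._cache.get(key)
        if cached is not None:  # first check, without the lock
            return cached
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            cached = self._cache.get(key)
            if cached is not None:  # second check, under the lock
                return cached
            try:
                body = await self._fetch(f"{parts.scheme}://{parts.netloc}/robots.txt")
            except Exception:
                body = None  # network error: fall back to allowing everything
            parser = RobotFileParser()
            # An empty rule set makes the stdlib parser allow all URLs.
            parser.parse((body or "").splitlines())
            self._cache[key] = parser
            return parser

    async def can_fetch(self, url: str, user_agent: str = "*", sid: str = "default") -> bool:
        parser = await self._get_parser(url, sid)
        return parser.can_fetch(user_agent, url)

    def clear_cache(self) -> None:
        """Drop every cached parser (the PR also supports selective clearing)."""
        self._cache.clear()
        self._locks.clear()
```

The second check under the lock is what keeps concurrent first requests to the same domain from each issuing their own robots.txt fetch.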


* Initialize RobotsTxtManager when spider.robots_txt_obey=True
* Add _get_domain_delay() method for lazy per-domain delay resolution:
  * Caches resolved delays per domain to avoid repeated computation
  * Uses double-checked locking (asyncio.Lock) to prevent concurrent fetch races
  * Takes max(spider.download_delay, robots.txt delay) per domain
  * Auto-creates per-domain CapacityLimiter(1) when robots.txt enforces any delay
* Update _rate_limiter() to check existing domain limiters first:
  * Respects robots.txt-created limiters even when spider.concurrent_requests_per_domain=0
  * Prevents silent ignoring of robots.txt concurrency restrictions
* Update _process_request() to:
  * Check can_fetch() before processing each request
  * Resolve and apply per-domain delay before acquiring limiter
  * Track robots_disallowed_count in stats
* Clear per-domain state (_domain_limiters, _domain_delays, _domain_delay_locks) at start of each crawl()
* Add asyncio import for per-domain locking
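
The per-domain delay rule described above, max(spider.download_delay, robots.txt delay) resolved lazily and cached, might be sketched as follows. This is a simplified stand-in, not the engine code from the PR: `DomainDelaySketch` and the injected `robots_delay` lookup are hypothetical, and the per-domain limiter part is left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

# Hypothetical lookup: returns the Crawl-delay for a domain, or None if unset.
DelayLookup = Callable[[str], Awaitable[Optional[float]]]


class DomainDelaySketch:
    """Resolves and caches the effective delay per domain."""

    def __init__(self, download_delay: float, robots_delay: DelayLookup) -> None:
        self._download_delay = download_delay
        self._robots_delay = robots_delay
        self._delays: Dict[str, float] = {}
        self._locks: Dict[str, asyncio.Lock] = {}

    async def get_domain_delay(self, domain: str) -> float:
        if domain in self._delays:  # fast path: already resolved
            return self._delays[domain]
        lock = self._locks.setdefault(domain, asyncio.Lock())
        async with lock:
            if domain not in self._delays:  # re-check under the lock
                crawl_delay = await self._robots_delay(domain)
                # The stricter (larger) of the two delays wins.
                self._delays[domain] = max(self._download_delay, crawl_delay or 0.0)
            return self._delays[domain]

    def reset(self) -> None:
        """Clear per-domain state, as the engine does at the start of a crawl."""
        self._delays.clear()
        self._locks.clear()
```

Taking the maximum means a spider's own `download_delay` is never shortened by a smaller Crawl-delay, and a site's Crawl-delay is never undercut by a faster spider setting.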


Test coverage:

* can_fetch() with allow/disallow rules, overrides, wildcards, edge cases
* get_crawl_delay() with float parsing, None fallback, fetch errors
* get_request_rate() with tuple parsing, None fallback
* get_sitemaps() with multi-sitemap, empty list, error fallback
* Caching: single fetch per domain, shared cache across methods, separate cache per domain+sid
* URL construction: scheme preservation, port handling, path-independence
* Encoding: non-UTF-8 decoding, bytes handling
* Cache management: clear_all(), clear by domain, clear by sid, clear by both
* Concurrency: double-checked locking, per-domain fetching, consistent results under load
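
The URL construction cases in the list above (scheme preservation, port handling, path-independence) can be illustrated with a small helper. The function name `robots_txt_url` is hypothetical; the PR does not show how the manager builds this URL, so treat this as one plausible approach using `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(url: str) -> str:
    """Build the robots.txt URL for the site that serves `url`.

    The scheme and netloc (host plus optional port) are kept; the path,
    query string, and fragment of the original URL are discarded.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Two pages on the same host and port map to the same robots.txt URL regardless of their paths, which is what makes per-domain caching possible.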

D4Vinci and others added 10 commits April 1, 2026 17:36

fix(Proxy Rotation): Fix an MRO issue

f614651

The stub shadows the real implementation, and proxy rotation always hits NotImplementedError. Possible fix for D4Vinci#215 Co-Authored-By: Yuval Dinodia <102706514+yetval@users.noreply.github.com>

build: pump version up

966e17a

feat: add protego as dependency for robots.txt parsing

05f5c65

* Add protego>=0.2.1 to project dependencies
* Required by RobotsTxtManager for parsing and caching robots.txt

feat: add robots_txt_obey configuration flag to Spider

7cf1bca

* Add robots_txt_obey: bool = False class attribute
* Allows per-spider opt-in to robots.txt compliance

feat: track robots.txt disallowed request count in stats

94b7ec4

* Add robots_disallowed_count: int = 0 to CrawlStats dataclass
* Include in to_dict() export for telemetry and logging
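
A minimal sketch of how such a counter could sit in a stats dataclass and be exported. `CrawlStatsSketch`, its `requests_count` field, and the `record_request` helper are hypothetical stand-ins; only the `robots_disallowed_count: int = 0` field and the `to_dict()` method name come from the commit message.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class CrawlStatsSketch:
    """Stand-in for CrawlStats with the new counter added."""

    requests_count: int = 0  # hypothetical existing field
    robots_disallowed_count: int = 0  # field named in the commit message

    def to_dict(self) -> Dict[str, Any]:
        """Export all counters, including the robots.txt one."""
        return asdict(self)


def record_request(stats: CrawlStatsSketch, allowed: bool) -> None:
    """Count a request, or count it as disallowed when robots.txt blocks it."""
    if allowed:
        stats.requests_count += 1
    else:
        stats.robots_disallowed_count += 1
```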

test: add robots_txt_obey parameter to MockSpider

bb99bd1

* Add robots_txt_obey: bool = False parameter to MockSpider.__init__
* Allow engine tests to opt-in to robots.txt compliance testing

fix(types): resolve mypy and pyright type checking errors

47f31fd

AbdullahY36 closed this Apr 1, 2026

AbdullahY36 wants to merge 10 commits into D4Vinci:main from AbdullahY36:main

2 participants
