feat: implement robots.txt compliance and fix type checking #221

Closed
AbdullahY36 wants to merge 10 commits into D4Vinci:main from AbdullahY36:main

@AbdullahY36

Contributor

Proposed change

This PR implements robots.txt compliance support for the Spider framework, allowing crawlers to automatically respect directives specified in target websites' robots.txt files. Additionally, it resolves type checking errors detected by mypy and pyright.

Key Features

1. Robots.txt Compliance System

  • New RobotsTxtManager class for fetching, parsing, and caching robots.txt files
  • Automatic per-domain robots.txt parsing using the protego library
  • Support for standard robots.txt directives:
    • User-agent specific rules
    • Allow/Disallow directives with wildcards and anchors
    • Crawl-delay per user-agent
    • Sitemap declarations
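
For reviewers who want to see the listed directives in action without installing anything, here is a minimal sketch. It uses the standard library's `urllib.robotparser` as a stand-in, not protego and not the PR's `RobotsTxtManager`. The stdlib parser does not handle wildcards or anchors, so only plain prefix rules are shown. The robots.txt body and the `my-bot` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# No rule matches this path, so the fetch is permitted.
allowed = parser.can_fetch("my-bot", "https://example.com/index.html")
# The path starts with the disallowed prefix /private/.
blocked = parser.can_fetch("my-bot", "https://example.com/private/data")
# Crawl-delay for the matching user-agent group (the "*" group here).
delay = parser.crawl_delay("my-bot")
# Sitemap URLs declared anywhere in the file (Python 3.8+).
sitemaps = parser.site_maps()
```

protego exposes the same kinds of queries and additionally supports the wildcard and anchor patterns mentioned above.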

2. Spider Configuration

  • New robots_txt_obey configuration flag to enable/disable robots.txt compliance
  • Per-domain crawl-delay enforcement from robots.txt Crawl-delay directives
  • Tracking of disallowed request attempts in crawl statistics
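
The interaction between the flag and the statistics counter can be sketched roughly as below. This is not the engine code from this PR: `StatsSketch`, `should_fetch`, and the `requests_count` field are hypothetical names chosen for the example, and `urllib.robotparser` again stands in for the real parser. Only `robots_txt_obey` and `robots_disallowed_count` come from the PR itself.

```python
from dataclasses import asdict, dataclass
from typing import Dict
from urllib.robotparser import RobotFileParser


@dataclass
class StatsSketch:
    """Hypothetical stand-in for CrawlStats; only the counters needed here."""

    requests_count: int = 0  # illustrative field, not taken from the PR
    robots_disallowed_count: int = 0

    def to_dict(self) -> Dict[str, int]:
        return asdict(self)


def should_fetch(
    url: str,
    *,
    obey: bool,
    parser: RobotFileParser,
    stats: StatsSketch,
    user_agent: str = "*",
) -> bool:
    """Return True if the request may proceed, updating the counters."""
    if obey and not parser.can_fetch(user_agent, url):
        stats.robots_disallowed_count += 1
        return False
    stats.requests_count += 1
    return True


parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /admin/"])
stats = StatsSketch()

first = should_fetch("https://example.com/admin/panel", obey=True, parser=parser, stats=stats)
second = should_fetch("https://example.com/docs", obey=True, parser=parser, stats=stats)
```

With `obey=False` the check is skipped entirely, which matches the backward-compatible default described in this PR.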

3. Type Safety

  • Resolved mypy and pyright type checking errors
  • All public code now passes strict type checking

Implementation Details

  • Protego Integration: Uses protego library for robust robots.txt parsing
  • Caching: RobotsTxtManager caches parsed robots.txt files per domain and session to minimize fetches
  • Async Support: Full async/await support throughout the robots.txt system
  • Per-Domain Delays: Extracts and enforces Crawl-delay directives from robots.txt
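
The caching behaviour is easiest to review with a small model. The sketch below shows one way to combine a per-(domain, session) cache with double-checked locking in asyncio. It is an illustration under assumptions, not the `RobotsTxtManager` source: `RobotsCacheSketch`, `fake_fetch`, and `fetch_count` are hypothetical names, and the fetch is simulated with a short sleep.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Key = Tuple[str, str]  # (domain, session id)


class RobotsCacheSketch:
    """Hypothetical cache: one fetch per (domain, sid), even under concurrency."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]) -> None:
        self._fetch = fetch
        self._cache: Dict[Key, str] = {}
        self._locks: Dict[Key, asyncio.Lock] = {}
        self.fetch_count = 0

    async def get(self, domain: str, sid: str = "default") -> str:
        key = (domain, sid)
        # First check: fast path without taking the lock.
        if key in self._cache:
            return self._cache[key]
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            # Second check: another task may have filled the cache
            # while this one was waiting for the lock.
            if key in self._cache:
                return self._cache[key]
            self.fetch_count += 1
            body = await self._fetch(domain)
            self._cache[key] = body
            return body


async def _demo() -> int:
    async def fake_fetch(domain: str) -> str:
        await asyncio.sleep(0.01)  # simulate network latency
        return "User-agent: *\nDisallow: /private/\n"

    cache = RobotsCacheSketch(fake_fetch)
    # Twenty concurrent lookups for the same domain and session.
    await asyncio.gather(*(cache.get("example.com") for _ in range(20)))
    return cache.fetch_count


fetch_count = asyncio.run(_demo())
```

The second check inside the lock is what keeps the twenty concurrent callers from each issuing their own request.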

Changes Made

  1. Added protego dependency for robots.txt parsing
  2. Created RobotsTxtManager class (scrapling/spiders/robotstxt.py)
  3. Added robots_txt_obey flag to Spider configuration
  4. Implemented per-domain delay enforcement in CrawlerEngine
  5. Added disallowed request tracking to CrawlStats
  6. Updated MockSpider test fixture with robots.txt parameter
  7. Added a comprehensive test suite for RobotsTxtManager
  8. Fixed type annotations in CrawlerEngine

Type of change

  • New feature (which adds functionality to an existing integration)
  • Code quality improvements to existing code or addition of tests

Additional information

  • Robots.txt compliance is optional (controlled by robots_txt_obey flag)
  • Default behavior is backward compatible (compliance disabled by default)
  • All changes include full type hints and pass mypy/pyright

Checklist

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doc-strings.

D4Vinci and others added 10 commits April 1, 2026 17:36
The stub shadows the real implementation, and proxy rotation always hits NotImplementedError.

Possible fix for D4Vinci#215

Co-Authored-By: Yuval Dinodia <102706514+yetval@users.noreply.github.com>
* Add protego>=0.2.1 to project dependencies
* Required by RobotsTxtManager for parsing and caching robots.txt
* Add RobotsTxtManager class with per-domain, per-session caching
* Implement can_fetch() to check URL against robots.txt rules
* Implement get_crawl_delay() to extract Crawl-delay directive
* Implement get_request_rate() to extract Request-rate directive
* Implement get_sitemaps() to list sitemap URLs from robots.txt
* Add double-checked locking to prevent race conditions on first domain fetch
* Add clear_cache() with selective invalidation by domain/sid
* Handle 404/5xx responses by allowing all URLs (graceful fallback)
* Handle fetch errors by allowing all URLs (network-safe)
* Support arbitrary text encodings via response.encoding parameter
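
A rough sketch of the fallback and decoding behaviour described in this commit: a 404 or 5xx response is treated as an empty robots.txt, which permits every URL. The helper names `robots_text` and `build_parser` are hypothetical, `urllib.robotparser` stands in for protego, and how other status codes are handled is not stated in the PR text, so this sketch simply decodes whatever body came back.

```python
from urllib.robotparser import RobotFileParser


def robots_text(status: int, body: bytes, encoding: str = "utf-8") -> str:
    """Return the robots.txt text to parse, or "" to allow everything."""
    if status == 404 or status >= 500:
        return ""  # graceful fallback: an empty rule set allows all URLs
    # Other statuses are not covered by the PR text; decode the body as-is.
    return body.decode(encoding, errors="replace")


def build_parser(text: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Missing robots.txt: nothing is disallowed.
missing = build_parser(robots_text(404, b""))

# A Latin-1 encoded file with a non-ASCII comment still parses correctly.
raw = "# r\xe8gles\nUser-agent: *\nDisallow: /private/\n".encode("latin-1")
decoded = robots_text(200, raw, encoding="latin-1")
latin1 = build_parser(decoded)
```
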
* Add robots_txt_obey: bool = False class attribute
* Allows per-spider opt-in to robots.txt compliance
* Add robots_disallowed_count: int = 0 to CrawlStats dataclass
* Include in to_dict() export for telemetry and logging
* Initialize RobotsTxtManager when spider.robots_txt_obey=True
* Add _get_domain_delay() method for lazy per-domain delay resolution:
  * Caches resolved delays per domain to avoid repeated computation
  * Uses double-checked locking (asyncio.Lock) to prevent concurrent fetch races
  * Takes max(spider.download_delay, robots.txt delay) per domain
  * Auto-creates per-domain CapacityLimiter(1) when robots.txt enforces any delay
* Update _rate_limiter() to check existing domain limiters first:
  * Respects robots.txt-created limiters even when spider.concurrent_requests_per_domain=0
  * Prevents silent ignoring of robots.txt concurrency restrictions
* Update _process_request() to:
  * Check can_fetch() before processing each request
  * Resolve and apply per-domain delay before acquiring limiter
  * Track robots_disallowed_count in stats
* Clear per-domain state (_domain_limiters, _domain_delays, _domain_delay_locks) at start of each crawl()
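
The delay rule in this commit boils down to a small pure function. The following is a sketch under assumptions, not the `_get_domain_delay()` implementation: `resolve_domain_delay` is a hypothetical name, and the boolean it returns models the decision to create a single-slot per-domain limiter when robots.txt imposes a delay.

```python
from typing import Optional, Tuple


def resolve_domain_delay(
    download_delay: float, robots_delay: Optional[float]
) -> Tuple[float, bool]:
    """Return (effective delay in seconds, whether a serial limiter is needed)."""
    if robots_delay is None or robots_delay <= 0:
        # robots.txt sets no delay: keep the spider's own setting.
        return download_delay, False
    # The stricter (larger) of the two delays wins.
    return max(download_delay, robots_delay), True


no_robots = resolve_domain_delay(1.0, None)
robots_stricter = resolve_domain_delay(0.5, 2.0)
spider_stricter = resolve_domain_delay(3.0, 1.0)
```

Taking the maximum means a spider can always be slower than robots.txt requires, but never faster.
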
* Add asyncio import for per-domain locking
* Add robots_txt_obey: bool = False parameter to MockSpider.__init__
* Allow engine tests to opt in to robots.txt compliance testing
Test coverage:
* can_fetch() with allow/disallow rules, overrides, wildcards, edge cases
* get_crawl_delay() with float parsing, None fallback, fetch errors
* get_request_rate() with tuple parsing, None fallback
* get_sitemaps() with multi-sitemap, empty list, error fallback
* Caching: single fetch per domain, shared cache across methods, separate cache per domain+sid
* URL construction: scheme preservation, port handling, path-independence
* Encoding: non-UTF-8 decoding, bytes handling
* Cache management: clear_all(), clear by domain, clear by sid, clear by both
* Concurrency: double-checked locking, per-domain fetching, consistent results under load
@AbdullahY36 AbdullahY36 closed this Apr 1, 2026