feat: implement robots.txt compliance and fix type checking #221
Closed
AbdullahY36 wants to merge 10 commits into
The stub shadows the real implementation, and proxy rotation always hits NotImplementedError. Possible fix for D4Vinci#215
Co-Authored-By: Yuval Dinodia <102706514+yetval@users.noreply.github.com>
* Add protego>=0.2.1 to project dependencies
* Required by RobotsTxtManager for parsing and caching robots.txt
* Add RobotsTxtManager class with per-domain, per-session caching
* Implement can_fetch() to check URL against robots.txt rules
* Implement get_crawl_delay() to extract Crawl-delay directive
* Implement get_request_rate() to extract Request-rate directive
* Implement get_sitemaps() to list sitemap URLs from robots.txt
* Add double-checked locking to prevent race conditions on first domain fetch
* Add clear_cache() with selective invalidation by domain/sid
* Handle 404/5xx responses by allowing all URLs (graceful fallback)
* Handle fetch errors by allowing all URLs (network-safe)
* Support arbitrary text encodings via response.encoding parameter
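The caching and locking behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `RobotsCacheSketch` and the injected `fetch` callable are hypothetical names, and the standard library's `urllib.robotparser` stands in for protego so the example has no third-party dependencies.

```python
import asyncio
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCacheSketch:
    """Hypothetical per-(domain, sid) robots.txt cache; not the PR's RobotsTxtManager."""

    def __init__(self, fetch):
        # Assumed contract: `fetch` is an async callable taking a URL and
        # returning the robots.txt text, or None for a 404/5xx response.
        self._fetch = fetch
        self._parsers = {}
        self._locks = {}

    async def _get_parser(self, url, sid):
        parts = urlsplit(url)
        key = (parts.netloc, sid)
        parser = self._parsers.get(key)
        if parser is not None:  # first check, without taking the lock
            return parser
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            parser = self._parsers.get(key)
            if parser is not None:  # second check, under the lock
                return parser
            try:
                text = await self._fetch(f"{parts.scheme}://{parts.netloc}/robots.txt")
            except Exception:
                text = None  # fetch error: fall back to allowing everything
            parser = RobotFileParser()
            parser.parse((text or "").splitlines())  # empty rules allow all URLs
            self._parsers[key] = parser
            return parser

    async def can_fetch(self, url, user_agent="*", sid=None):
        parser = await self._get_parser(url, sid)
        return parser.can_fetch(user_agent, url)
```

With this shape, several coroutines asking about the same domain at once wait on one lock, and only the first of them performs the fetch; the rest reuse the cached parser.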
* Add robots_txt_obey: bool = False class attribute
* Allows per-spider opt-in to robots.txt compliance
* Add robots_disallowed_count: int = 0 to CrawlStats dataclass
* Include in to_dict() export for telemetry and logging
* Initialize RobotsTxtManager when spider.robots_txt_obey=True
* Add _get_domain_delay() method for lazy per-domain delay resolution:
  * Caches resolved delays per domain to avoid repeated computation
  * Uses double-checked locking (asyncio.Lock) to prevent concurrent fetch races
  * Takes max(spider.download_delay, robots.txt delay) per domain
  * Auto-creates per-domain CapacityLimiter(1) when robots.txt enforces any delay
* Update _rate_limiter() to check existing domain limiters first:
  * Respects robots.txt-created limiters even when spider.concurrent_requests_per_domain=0
  * Prevents silent ignoring of robots.txt concurrency restrictions
* Update _process_request() to:
  * Check can_fetch() before processing each request
  * Resolve and apply per-domain delay before acquiring limiter
  * Track robots_disallowed_count in stats
* Clear per-domain state (_domain_limiters, _domain_delays, _domain_delay_locks) at start of each crawl()
* Add asyncio import for per-domain locking
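The delay rule described above (take the larger of the spider's own delay and whatever robots.txt asks for) might look like the sketch below. The function name `effective_delay` is hypothetical, and treating Request-rate as `seconds / requests` is an assumption; the commit message does not say how the PR combines Crawl-delay and Request-rate.

```python
from typing import Optional, Tuple


def effective_delay(
    download_delay: float,
    crawl_delay: Optional[float],
    request_rate: Optional[Tuple[int, int]],
) -> float:
    """Return the per-domain delay in seconds (illustrative only)."""
    candidates = [download_delay]
    if crawl_delay is not None:
        candidates.append(crawl_delay)
    if request_rate is not None:
        requests, seconds = request_rate
        if requests > 0:
            # Assumption: N requests per S seconds means one request every S/N seconds.
            candidates.append(seconds / requests)
    # The strictest (largest) delay wins, so robots.txt can only slow the crawl down.
    return max(candidates)
```

Using `max` means a spider configured with a longer `download_delay` than the site requests keeps its own, more conservative setting.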
* Add robots_txt_obey: bool = False parameter to MockSpider.__init__
* Allow engine tests to opt-in to robots.txt compliance testing
Test coverage:
* can_fetch() with allow/disallow rules, overrides, wildcards, edge cases
* get_crawl_delay() with float parsing, None fallback, fetch errors
* get_request_rate() with tuple parsing, None fallback
* get_sitemaps() with multi-sitemap, empty list, error fallback
* Caching: single fetch per domain, shared cache across methods, separate cache per domain+sid
* URL construction: scheme preservation, port handling, path-independence
* Encoding: non-UTF-8 decoding, bytes handling
* Cache management: clear_all(), clear by domain, clear by sid, clear by both
* Concurrency: double-checked locking, per-domain fetching, consistent results under load
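For readers unfamiliar with the directives these tests exercise, here is a small illustration of what each one yields. It uses the standard library's `urllib.robotparser` rather than protego or the PR's `RobotsTxtManager`, so behaviour differs in places; the robots.txt content and URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
Request-rate: 3/10
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Allow/disallow check for a given user agent and URL.
admin_allowed = parser.can_fetch("mybot", "https://example.com/admin/users")
blog_allowed = parser.can_fetch("mybot", "https://example.com/blog")

# Crawl-delay: the stdlib parser only accepts whole numbers here,
# unlike the float parsing the PR's tests mention.
delay = parser.crawl_delay("mybot")

# Request-rate comes back as a named tuple of (requests, seconds).
rate = parser.request_rate("mybot")

# Sitemap URLs listed in the file (Python 3.8+).
sitemaps = parser.site_maps()
```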
Proposed change
This PR implements robots.txt compliance support for the Spider framework, allowing crawlers to automatically respect directives specified in target websites' robots.txt files. Additionally, it resolves type checking errors detected by mypy and pyright.
Key Features
1. Robots.txt Compliance System
* RobotsTxtManager class for fetching, parsing, and caching robots.txt files
* protego library
2. Spider Configuration
* robots_txt_obey configuration flag to enable/disable robots.txt compliance
3. Type Safety
Implementation Details
* protego library for robust robots.txt parsing
Changes Made
* protego dependency for robots.txt parsing
* RobotsTxtManager class (scrapling/spiders/robotstxt.py)
* robots_txt_obey flag to Spider configuration
Type of change
Additional information
robots_txt_obey flag)
Checklist