
v0.4.5 #235

Merged: D4Vinci merged 11 commits into main from dev on Apr 7, 2026

D4Vinci commented on Apr 7, 2026

A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉

🚀 New stuff and quality-of-life changes

  • Spider Development Mode: Iterating on a spider's parse() logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:

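    from scrapling.spiders import Spider

    class MySpider(Spider):
        name = "my_spider"
        start_urls = ["https://example.com"]
        # Cache responses to disk on the first run, replay them afterwards
        development_mode = True

        async def parse(self, response):
            yield {"title": response.css("title::text").get("")}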

    The cache lives in .scrapling_cache/{spider.name}/ by default and can be redirected anywhere with development_cache_dir. Two new stat counters, cache_hits and cache_misses, let you see how the cache performed. Cache replay bypasses download_delay, rate limiting, and the blocked-request retry path so iteration is as fast as the disk allows. Don't ship a spider with development_mode = True -- it's a development tool, not a production cache. See the docs for the full story.

  • Safer redirects by default: follow_redirects now defaults to "safe" across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass follow_redirects="all" to get the old behavior, or False to disable redirects entirely.
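The cache-and-replay behavior of development mode, described in the first item above, can be illustrated with a small standalone sketch. This is not Scrapling's implementation: `ReplayCache`, `fake_fetch`, and the one-JSON-file-per-URL layout are hypothetical names and choices made for illustration. Only the `.scrapling_cache/{spider.name}/` directory shape and the `cache_hits` / `cache_misses` counter names come from the notes above.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ReplayCache:
    """Toy disk cache (hypothetical): fetch once, replay from disk afterwards."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # Counter names mirror the stats mentioned in the release notes.
        self.stats = {"cache_hits": 0, "cache_misses": 0}

    def _path_for(self, url):
        # Assumption: one file per URL, named by a hash of the URL.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, url, fetch):
        path = self._path_for(url)
        if path.exists():
            # Replay: no network call is made.
            self.stats["cache_hits"] += 1
            return json.loads(path.read_text(encoding="utf-8"))["body"]
        # First run: fetch for real, then persist the response body.
        self.stats["cache_misses"] += 1
        body = fetch(url)
        path.write_text(json.dumps({"url": url, "body": body}), encoding="utf-8")
        return body


def fake_fetch(url):
    """Stand-in for a real network request; counts how often it is called."""
    fake_fetch.calls += 1
    return f"<title>{url}</title>"


fake_fetch.calls = 0

with tempfile.TemporaryDirectory() as tmp:
    cache = ReplayCache(Path(tmp) / ".scrapling_cache" / "my_spider")
    first = cache.get("https://example.com", fake_fetch)   # miss, hits the "network"
    second = cache.get("https://example.com", fake_fetch)  # hit, served from disk
    final_stats = dict(cache.stats)
```

In this sketch the second call never reaches `fake_fetch`, which is the property that makes repeated runs safe while selectors are still being worked out.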

🐛 Bug Fixes

  • Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with crawldir enabled used to race against the checkpoint write -- the cancel scope would tear down the task before the pickle finished, leaving paused=False and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling cancel_scope.cancel(), so a force-stop always preserves the latest pending state. By @voidborne-d in #230.

🙏 Special thanks to the community for all the continuous testing and feedback


D4Vinci and others added 11 commits April 5, 2026 18:41
curl_cffi v0.15.0 introduced CurlFollow.SAFE, which follows redirects but rejects those targeting internal/private IPs (loopback, private networks, link-local). This is now the default for all HTTP fetchers, the MCP server, and the shell curl converter.

Added FollowRedirects type alias supporting all curl_cffi redirect
modes: bool, "safe", "all", "obeycode", "firstonly".
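As a rough illustration of what the "safe" mode means, the sketch below classifies a redirect target with the standard `ipaddress` module. It is not how curl_cffi or Scrapling implement the check: `is_internal_address` and `should_follow` are hypothetical helpers, the `FollowRedirects` alias here is only a guess at the shape of the alias named above, and hostnames (which would need DNS resolution) are treated as non-internal purely to keep the example short.

```python
import ipaddress
from typing import Literal, Union

# Assumed shape of the alias, built from the modes listed in the commit message.
FollowRedirects = Union[bool, Literal["safe", "all", "obeycode", "firstonly"]]


def is_internal_address(host: str) -> bool:
    """Return True for loopback, private-network, or link-local IP literals."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real check would resolve the hostname first;
        # this sketch skips that step.
        return False
    return ip.is_loopback or ip.is_private or ip.is_link_local


def should_follow(target_host: str, mode: FollowRedirects = "safe") -> bool:
    """Decide whether a redirect to target_host would be followed (simplified)."""
    if mode is False:
        return False  # redirects disabled entirely
    if mode == "safe":
        return not is_internal_address(target_host)
    # True, "all", "obeycode", "firstonly": treated alike here; their
    # differences concern HTTP method and hop handling, not target filtering.
    return True


# The cloud metadata address is link-local, so "safe" rejects it.
blocked = should_follow("169.254.169.254")
allowed_with_all = should_follow("169.254.169.254", mode="all")
```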
…event data loss

On force-stop (second Ctrl+C), cancel_scope.cancel() was called BEFORE
_save_checkpoint(). Since cancel_scope.cancel() causes all subsequent
awaits within the scope to raise Cancelled, the checkpoint write was
silently aborted:

1. _save_checkpoint() uses anyio.open_file + rename — both are await
   checkpoints that get cancelled immediately
2. self.paused never gets set to True (code after the aborted save)
3. The finally block sees 'not self.paused' and calls cleanup() which
   DELETES the previous checkpoint file

Result: a user who ran a long crawl, pressed Ctrl+C twice to force-stop,
loses their entire checkpoint irrecoverably. The old checkpoint (from
periodic saves or a previous graceful pause) is deleted, and the new
one was never written.

Fix: move the cancel_scope.cancel() call AFTER the checkpoint save.
The save completes normally, self.paused is set to True, and only then
does the scope get cancelled to abort in-flight tasks. The finally
block correctly sees paused=True and skips cleanup.
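The ordering problem can be reproduced in miniature. The sketch below uses the standard library's `asyncio` in place of anyio's cancel scopes, so it is an analogy and not the engine's code: `ToyEngine` and `force_stop` are hypothetical, and a string attribute stands in for the checkpoint file on disk. Cancelling the current task plays the role of `cancel_scope.cancel()`: the next await raises.

```python
import asyncio


class ToyEngine:
    """Hypothetical stand-in for the crawl engine's checkpoint state."""

    def __init__(self):
        self.paused = False
        self.checkpoint = "old"  # pretend this is the previous checkpoint file

    async def save_checkpoint(self):
        await asyncio.sleep(0)   # an await point, like async file I/O
        self.checkpoint = "new"

    def cleanup(self):
        self.checkpoint = None   # deletes the checkpoint


async def force_stop(engine, cancel_first):
    task = asyncio.current_task()
    try:
        if cancel_first:
            task.cancel()                   # buggy order: cancel, then save
            await engine.save_checkpoint()  # raises at its first await
        else:
            await engine.save_checkpoint()  # fixed order: save completes first
        engine.paused = True
        if not cancel_first:
            task.cancel()                   # only now abort in-flight work
        await asyncio.sleep(0)              # cancellation is delivered here
    except asyncio.CancelledError:
        pass
    finally:
        if not engine.paused:
            engine.cleanup()                # runs only when the save was aborted


buggy, fixed = ToyEngine(), ToyEngine()
asyncio.run(force_stop(buggy, cancel_first=True))
asyncio.run(force_stop(fixed, cancel_first=False))
print(buggy.checkpoint, buggy.paused)  # checkpoint lost, paused never set
print(fixed.checkpoint, fixed.paused)  # new checkpoint kept, paused set
```

With the buggy order the save is interrupted before it writes anything, `paused` stays false, and the cleanup path discards the old checkpoint as well. With the fixed order the new checkpoint survives the cancellation.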

Adds 6 regression tests covering:
- Force-stop checkpoint preservation (core regression)
- Graceful pause still works
- Force-stop checkpoint is loadable
- Normal completion cleanup still works
- Force-stop without checkpoint system
- Existing checkpoint not deleted on force-stop
D4Vinci merged commit cb449af into main on Apr 7, 2026
8 checks passed