v0.4.5 by D4Vinci · Pull Request #235 · D4Vinci/Scrapling

D4Vinci · 2026-04-07T04:20:54Z

A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉

Note

Follow us on X for daily tips and tricks

🚀 New Stuff and quality of life changes

Spider Development Mode: Iterating on a spider's parse() logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:
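```
from scrapling.spiders import Spider


class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    # Cache every response to disk on the first run, replay from disk afterwards
    development_mode = True

    async def parse(self, response):
        # Falls back to an empty string when the page has no <title>
        yield {"title": response.css("title::text").get("")}
```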
The cache lives in .scrapling_cache/{spider.name}/ by default and can be redirected anywhere with development_cache_dir. Two new stat counters, cache_hits and cache_misses, show how the cache performed. Cache replay bypasses download_delay, rate limiting, and the blocked-request retry path, so iteration is as fast as the disk allows. Don't ship a spider with development_mode = True: it's a development tool, not a production cache. See the docs for the full story.
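To make the cache-then-replay idea concrete, here is a minimal standalone sketch. It is not Scrapling's implementation: the `DiskResponseCache` class, its `fetch` method, the JSON file layout, and `fake_download` are all hypothetical names made up for illustration. Only the `cache_hits` / `cache_misses` counter names and the `.scrapling_cache/my_spider` directory shape come from the notes above.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class DiskResponseCache:
    """Hypothetical illustration of a replay cache; not Scrapling's code."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.stats = {"cache_hits": 0, "cache_misses": 0}

    def _path_for(self, url):
        # One file per URL, named by a hash so any URL maps to a safe filename
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def fetch(self, url, download):
        path = self._path_for(url)
        if path.exists():
            # Replay from disk: no network call is made
            self.stats["cache_hits"] += 1
            return json.loads(path.read_text(encoding="utf-8"))["body"]
        # First run: download once, then persist for later runs
        self.stats["cache_misses"] += 1
        body = download(url)
        path.write_text(json.dumps({"url": url, "body": body}), encoding="utf-8")
        return body


calls = []


def fake_download(url):
    # Stand-in for a real network request; records how often it is called
    calls.append(url)
    return "<title>Example</title>"


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskResponseCache(Path(tmp) / ".scrapling_cache" / "my_spider")
    first = cache.fetch("https://example.com", fake_download)
    second = cache.fetch("https://example.com", fake_download)
    stats = dict(cache.stats)
```

The second `fetch` never reaches `fake_download`, which is the property that makes repeated runs free of network traffic.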
Safer redirects by default: follow_redirects now defaults to "safe" across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass follow_redirects="all" to get the old behavior, or False to disable redirects entirely.
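As a rough picture of what "safe" filters out, the sketch below classifies a redirect target with Python's standard `ipaddress` module. This is an assumption-laden illustration, not how curl_cffi or Scrapling actually perform the check: `is_internal_address` and `should_follow` are hypothetical helpers, and a real check would also need to resolve hostnames before deciding.

```python
import ipaddress


def is_internal_address(host: str) -> bool:
    """Hypothetical helper: True for loopback, private, or link-local IPs."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; a real implementation would resolve DNS first
        return False
    return ip.is_loopback or ip.is_private or ip.is_link_local


def should_follow(mode, target_host: str) -> bool:
    """Hypothetical decision function mirroring the documented modes."""
    if mode is False:
        return False  # redirects disabled entirely
    if mode == "safe":
        return not is_internal_address(target_host)
    return True  # "all" (the old behavior) follows everything


checks = {
    host: is_internal_address(host)
    for host in ["127.0.0.1", "10.0.0.5", "169.254.169.254", "::1", "93.184.216.34"]
}
```

Under these assumptions, a redirect to `169.254.169.254` (a link-local address often used for cloud metadata endpoints) is rejected in "safe" mode but followed in "all" mode.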

🐛 Bug Fixes

Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with crawldir enabled used to race against the checkpoint write. The cancel scope would tear down the task before the pickle finished, leaving paused=False and triggering the cleanup path that deletes the previous checkpoint. As a result, force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling cancel_scope.cancel(), so a force-stop always preserves the latest pending state. By @voidborne-d in #230.
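The ordering problem can be reproduced in miniature. The sketch below uses plain `asyncio` task cancellation as a stand-in for the engine's anyio cancel scope, so it is only an analogy: `save_checkpoint`, `force_stop`, `run`, and the `state` dictionary are hypothetical names, and the "checkpoint" is just a flag.

```python
import asyncio


async def save_checkpoint(state: dict) -> None:
    # Stands in for async file I/O: an await point where cancellation can land
    await asyncio.sleep(0)
    state["saved"] = True


async def force_stop(state: dict, cancel_first: bool) -> None:
    task = asyncio.current_task()
    try:
        if cancel_first:
            task.cancel()                  # old order: cancel, then try to save
            await save_checkpoint(state)   # aborted at its first await
        else:
            await save_checkpoint(state)   # new order: save completes first
            task.cancel()
        state["paused"] = True
        await asyncio.sleep(0)             # cancellation is delivered here
    finally:
        if not state["paused"]:
            # Cleanup path: the previous checkpoint is deleted
            state["previous_checkpoint_exists"] = False


def run(cancel_first: bool) -> dict:
    state = {"saved": False, "paused": False, "previous_checkpoint_exists": True}

    async def main() -> None:
        try:
            await asyncio.create_task(force_stop(state, cancel_first))
        except asyncio.CancelledError:
            pass  # the inner task was cancelled on purpose

    asyncio.run(main())
    return state


buggy = run(cancel_first=True)
fixed = run(cancel_first=False)
```

With the cancel issued first, nothing is saved and the cleanup branch runs; with the save issued first, the state is persisted and cleanup is skipped.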

🙏 Special thanks to the community for all the continuous testing and feedback

curl_cffi v0.15.0 introduced CurlFollow.SAFE, which follows redirects but rejects those targeting internal/private IPs (loopback, private networks, link-local). This is now the default for all HTTP fetchers, the MCP server, and the shell curl converter. Added FollowRedirects type alias supporting all curl_cffi redirect modes: bool, "safe", "all", "obeycode", "firstonly".
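For readers wondering what such an alias might look like, here is one possible shape using `typing.Literal`. The actual definition in Scrapling may differ; `RedirectMode` and `is_valid_follow_redirects` are hypothetical names added for illustration, and only the `FollowRedirects` name and the five accepted values come from the commit message above.

```python
from typing import Literal, Union, get_args

# Assumed shape of the alias; the real definition may be written differently
RedirectMode = Literal["safe", "all", "obeycode", "firstonly"]
FollowRedirects = Union[bool, RedirectMode]


def is_valid_follow_redirects(value: object) -> bool:
    """Hypothetical runtime check matching the alias above."""
    if isinstance(value, bool):
        return True
    return isinstance(value, str) and value in get_args(RedirectMode)
```

A type alias alone is only enforced by static checkers, which is why the sketch pairs it with a small runtime validator.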

…event data loss

On force-stop (second Ctrl+C), cancel_scope.cancel() was called BEFORE _save_checkpoint(). Since cancel_scope.cancel() causes all subsequent awaits within the scope to raise Cancelled, the checkpoint write was silently aborted:

1. _save_checkpoint() uses anyio.open_file + rename; both are await checkpoints that get cancelled immediately
2. self.paused never gets set to True (code after the aborted save)
3. The finally block sees 'not self.paused' and calls cleanup(), which DELETES the previous checkpoint file

Result: a user who ran a long crawl, pressed Ctrl+C twice to force-stop, loses their entire checkpoint irrecoverably. The old checkpoint (from periodic saves or a previous graceful pause) is deleted, and the new one was never written.

Fix: move the cancel_scope.cancel() call AFTER the checkpoint save. The save completes normally, self.paused is set to True, and only then does the scope get cancelled to abort in-flight tasks. The finally block correctly sees paused=True and skips cleanup.

Adds 6 regression tests covering:

- Force-stop checkpoint preservation (core regression)
- Graceful pause still works
- Force-stop checkpoint is loadable
- Normal completion cleanup still works
- Force-stop without checkpoint system
- Existing checkpoint not deleted on force-stop


D4Vinci and others added 11 commits April 5, 2026 18:41

docs: update pages with the new changes

f756e51

build: pump version up

d0a19a6

docs(agent): update skill with the latest changes

a61113f

test: align force-stop regression stubs with dev branch

9950dde

fix(spider): save checkpoint before cancel_scope.cancel() on force-stop to prevent data loss (#230)

fad9efd

feat(spiders): add a development mode

d1baf1f

Merge branch 'main' into dev

fc6a034

docs: adding the new development mode

88d0459

docs(agent): update skill with the latest changes

664e419

D4Vinci merged commit cb449af into main Apr 7, 2026
8 checks passed

D4Vinci temporarily deployed to PyPI April 7, 2026 04:21 — with GitHub Actions Inactive

D4Vinci merged 11 commits into main from dev
