curl_cffi v0.15.0 introduced CurlFollow.SAFE, which follows redirects but rejects those targeting internal/private IPs (loopback, private networks, link-local). This is now the default for all HTTP fetchers, the MCP server, and the shell curl converter. Added FollowRedirects type alias supporting all curl_cffi redirect modes: bool, "safe", "all", "obeycode", "firstonly".
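To make the address classes named above concrete, here is a minimal sketch using Python's standard `ipaddress` module. It is not curl_cffi's implementation: `is_internal_address` is a hypothetical helper, and the `FollowRedirects` alias is only one plausible shape inferred from the modes listed in the commit message.

```python
import ipaddress
from typing import Literal, Union

# Assumed shape of the alias, based only on the modes listed above;
# the real definition in the codebase may differ.
FollowRedirects = Union[bool, Literal["safe", "all", "obeycode", "firstonly"]]


def is_internal_address(host: str) -> bool:
    """Return True for loopback, private-network, or link-local IP literals.

    Hypothetical helper for illustration. It only accepts IP literals;
    a hostname raises ValueError and would need DNS resolution first.
    """
    ip = ipaddress.ip_address(host)
    return ip.is_loopback or ip.is_private or ip.is_link_local
```

A "safe" redirect policy in this sense would refuse to follow a `Location` whose resolved address makes `is_internal_address` return `True`, for example a cloud metadata endpoint at `169.254.169.254`.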
…event data loss

On force-stop (second Ctrl+C), cancel_scope.cancel() was called BEFORE _save_checkpoint(). Since cancel_scope.cancel() causes all subsequent awaits within the scope to raise Cancelled, the checkpoint write was silently aborted:

1. _save_checkpoint() uses anyio.open_file + rename; both are await checkpoints that get cancelled immediately
2. self.paused never gets set to True (code after the aborted save)
3. The finally block sees 'not self.paused' and calls cleanup() which DELETES the previous checkpoint file

Result: a user who ran a long crawl, pressed Ctrl+C twice to force-stop, loses their entire checkpoint irrecoverably. The old checkpoint (from periodic saves or a previous graceful pause) is deleted, and the new one was never written.

Fix: move the cancel_scope.cancel() call AFTER the checkpoint save. The save completes normally, self.paused is set to True, and only then does the scope get cancelled to abort in-flight tasks. The finally block correctly sees paused=True and skips cleanup.

Adds 6 regression tests covering:
- Force-stop checkpoint preservation (core regression)
- Graceful pause still works
- Force-stop checkpoint is loadable
- Normal completion cleanup still works
- Force-stop without checkpoint system
- Existing checkpoint not deleted on force-stop
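The ordering problem described above can be reproduced in a few lines. This sketch uses the standard library's `asyncio` task cancellation as a stand-in for anyio's cancel scope (anyio keeps cancelling every await in the scope, while asyncio delivers one cancellation, which is enough to show the effect). The `Engine` class, its methods, and the JSON checkpoint format are all hypothetical, not the project's actual code.

```python
import asyncio
import json
import tempfile
from pathlib import Path


class Engine:
    """Hypothetical stand-in for a crawl engine with a checkpoint file."""

    def __init__(self, checkpoint: Path) -> None:
        self.checkpoint = checkpoint
        self.paused = False

    async def _save_checkpoint(self, pending: list) -> None:
        await asyncio.sleep(0)  # await point: a pending cancellation fires here
        tmp = self.checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(pending))
        tmp.replace(self.checkpoint)  # rename over the previous checkpoint

    async def force_stop(self, pending: list, cancel_first: bool) -> None:
        task = asyncio.current_task()
        try:
            if cancel_first:  # buggy order: cancel, then try to save
                task.cancel()
                await self._save_checkpoint(pending)  # aborted at its first await
                self.paused = True  # never reached
            else:  # fixed order: save, mark paused, then cancel
                await self._save_checkpoint(pending)
                self.paused = True
                task.cancel()
                await asyncio.sleep(0)  # in-flight work is aborted here
        except asyncio.CancelledError:
            pass  # swallowed only to keep the demo short
        finally:
            if not self.paused:
                # Cleanup path: deletes the previous checkpoint.
                self.checkpoint.unlink(missing_ok=True)


def run_demo(cancel_first: bool) -> bool:
    """Return True if a checkpoint file survives the force-stop."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "checkpoint.json"
        path.write_text(json.dumps(["old-url"]))  # earlier periodic save
        engine = Engine(path)
        asyncio.run(engine.force_stop(["new-url"], cancel_first))
        return path.exists()
```

Calling `run_demo(cancel_first=True)` exercises the buggy order, where the save is aborted and the cleanup path removes the old file; `run_demo(cancel_first=False)` exercises the fixed order, where the new checkpoint is written before anything is cancelled.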
…op to prevent data loss (#230)
A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉
Note
Follow us on X for daily tips and tricks
🚀 New Stuff and quality of life changes
Spider Development Mode: Iterating on a spider's `parse()` logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute, `development_mode = True`.

The cache lives in `.scrapling_cache/{spider.name}/` by default and can be redirected anywhere with `development_cache_dir`. Two new stat counters, `cache_hits` and `cache_misses`, let you see how the cache performed. Cache replay bypasses `download_delay`, rate limiting, and the blocked-request retry path, so iteration is as fast as the disk allows. Don't ship a spider with `development_mode = True`: it's a development tool, not a production cache. See the docs for the full story.

Safer redirects by default: `follow_redirects` now defaults to `"safe"` across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass `follow_redirects="all"` to get the old behavior, or `False` to disable redirects entirely.

🐛 Bug Fixes

Force-stopping a crawl (second Ctrl+C) with `crawldir` enabled used to race against the checkpoint write: the cancel scope would tear down the task before the pickle finished, leaving `paused=False` and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling `cancel_scope.cancel()`, so a force-stop always preserves the latest pending state. By @voidborne-d in #230.

🙏 Special thanks to the community for all the continuous testing and feedback
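
As a rough illustration of the cache-and-replay idea behind the development mode described in these notes, the sketch below stores each response body on disk on the first request and serves it from disk afterwards, counting hits and misses. It is not Scrapling's implementation: `ReplayCache`, its on-disk layout (one file per SHA-256 of the URL), and the `fetch` callback are assumptions made for the example; only the counter names and the default directory pattern come from the notes.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Tuple


class ReplayCache:
    """Hypothetical disk cache: fetch once, replay from disk afterwards."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.cache_hits = 0
        self.cache_misses = 0

    def _path_for(self, url: str) -> Path:
        # Assumed layout: one file per URL, named by the URL's SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.body"

    def get(self, url: str, fetch: Callable[[str], bytes]) -> bytes:
        path = self._path_for(url)
        if path.exists():
            self.cache_hits += 1
            return path.read_bytes()  # replay: no network request
        self.cache_misses += 1
        body = fetch(url)  # first run: perform the real request
        path.write_bytes(body)
        return body


def demo() -> Tuple[int, int, int]:
    """Request the same URL twice; return (hits, misses, real fetches)."""
    calls = []

    def fake_fetch(url: str) -> bytes:
        calls.append(url)
        return b"<html>ok</html>"

    with tempfile.TemporaryDirectory() as tmp:
        cache = ReplayCache(Path(tmp) / ".scrapling_cache" / "demo_spider")
        cache.get("https://example.com/", fake_fetch)
        cache.get("https://example.com/", fake_fetch)
        return cache.cache_hits, cache.cache_misses, len(calls)
```

In `demo()`, the second request for the same URL is answered from disk, so the fake fetcher runs only once. A real cache would also need to key on method, headers, and request body, which this sketch leaves out.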