…core

The adaptive scraping feature relies on SQLiteStorageSystem to persist and relocate elements, but the existing tests only verified object creation, not the actual save/retrieve workflow. This adds:

- _get_base_url: None, empty, valid URL, and case-normalization paths
- _get_hash: determinism, uniqueness, strip/lowercase, length suffix
- save/retrieve round-trip: basic, overwrite (upsert), nonexistent key
- URL-based isolation between different websites
- element_to_dict: with/without text, attributes, whitespace filtering
- _get_element_path: nested and root element paths
- Thread safety: 20 concurrent saves with result verification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Several critical code paths in custom_types.py lacked test coverage:

- TextHandler.re(check_match=True): returns bool, not TextHandlers
- TextHandler.re(replace_entities=False): entity preservation path
- TextHandler.re() with capture groups: flatten behavior
- TextHandler.re_first() default value when no match
- TextHandler.clean(remove_entities=True): entity replacement path
- TextHandler.json() valid and invalid input
- TextHandlers.re(): list-level regex with result flattening
- TextHandlers.extract()/get_all(): identity return

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When _checkpoint_system_enabled is False, the method uses a bare `raise` with no active exception, which causes RuntimeError at runtime. The method's docstring says it returns False when restoration is not possible, so `return False` is the correct behavior. The caller in crawl() currently guards with `if self._checkpoint_system_enabled`, but the method's own contract should be self-consistent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
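To illustrate why the bare `raise` is a bug, here is a minimal standalone sketch. The function names `restore_buggy` and `restore_fixed` are hypothetical stand-ins, not the project's actual `_restore_from_checkpoint`; they only reproduce the control flow described above.

```python
def restore_buggy(enabled: bool) -> bool:
    """Hypothetical stand-in: mimics the old control flow."""
    if not enabled:
        # A bare `raise` outside an `except` block has nothing to re-raise,
        # so Python raises RuntimeError instead.
        raise
    return True


def restore_fixed(enabled: bool) -> bool:
    """Hypothetical stand-in: honours the documented 'returns False' contract."""
    if not enabled:
        return False
    return True


# Capture what the buggy variant does when the checkpoint system is disabled.
buggy_error = None
try:
    restore_buggy(False)
except RuntimeError as exc:
    buggy_error = exc
```

With the fix, a caller that forgets the `_checkpoint_system_enabled` guard gets `False` back rather than an unrelated-looking RuntimeError.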
Both _get_page_content and _get_async_page_content use a while-True loop that retries page.content() on PlaywrightError with no upper bound. If the page is in a permanently broken state (crashed tab, closed context), this loops forever and hangs the process. Replace with a bounded for-loop (default 10 retries = 5s), returning an empty string if all attempts fail. This preserves the existing retry behavior for the transient Windows issue (playwright#16108) while preventing hangs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
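The bounded retry can be sketched roughly as follows. This is not the project's code: `PageError` and `FlakyPage` are hypothetical stand-ins so the example runs without Playwright, and the 0.5 s delay is an assumption inferred from "10 retries = 5s".

```python
import time


class PageError(Exception):
    """Hypothetical stand-in for Playwright's error type."""


def get_page_content(page, max_retries: int = 10, delay: float = 0.5) -> str:
    """Retry page.content() a bounded number of times, then give up."""
    for _ in range(max_retries):
        try:
            return page.content()
        except PageError:
            # Transient failure: wait briefly and try again.
            time.sleep(delay)
    # All attempts failed: return an empty string instead of looping forever.
    return ""


class FlakyPage:
    """Hypothetical page that fails a fixed number of times before succeeding."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def content(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise PageError("transient failure")
        return "<html></html>"
```

A page that recovers after a couple of failures still yields its content, while a permanently broken page costs at most `max_retries * delay` seconds before the empty string comes back.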
- test_selectors_filter.py: covers chained filter(), empty result, all-pass predicate, and calling filter() on empty Selectors
- test_ancestor_navigation.py: covers iterancestors() order, depth, text-node safety, root-element edge case, and find_ancestor() with no match
- test_find_similar_advanced.py: covers similarity_threshold levels, match_text=True behavior, ignore_attributes combinations, and text-node safety
SessionManager.fetch() pops `method` from `_session_kwargs`, which mutates the original request dict. When the engine retries a blocked request via request.copy(), the copy no longer has `method`, so it defaults to GET.

Steps to reproduce:

1. Yield Request(url, method="POST", data=...)
2. Target returns a response that triggers is_blocked()
3. Engine retries via request.copy() → second fetch uses GET

Fix: copy the kwargs dict before popping, so the original request stays intact.
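The mutation can be reproduced in isolation. The sketch below uses a hypothetical, stripped-down `Request` class and two hypothetical fetch functions; only the `_session_kwargs` attribute name and the pop-then-retry sequence come from the description above.

```python
class Request:
    """Hypothetical minimal request object, not the library's real class."""

    def __init__(self, url, **session_kwargs):
        self.url = url
        self._session_kwargs = session_kwargs

    def copy(self):
        # A retry copies whatever kwargs the request holds right now.
        return Request(self.url, **self._session_kwargs)


def fetch_buggy(request) -> str:
    kwargs = request._session_kwargs          # same dict object as the request's
    return kwargs.pop("method", "GET")        # removes "method" from the request


def fetch_fixed(request) -> str:
    kwargs = dict(request._session_kwargs)    # shallow copy; the original is untouched
    return kwargs.pop("method", "GET")


# Buggy path: the first fetch consumes "method", so the retry falls back to GET.
buggy = Request("https://example.com", method="POST", data={"a": 1})
buggy_first = fetch_buggy(buggy)
buggy_retry = fetch_buggy(buggy.copy())

# Fixed path: the retry still sees the original method.
fixed = Request("https://example.com", method="POST", data={"a": 1})
fixed_first = fetch_fixed(fixed)
fixed_retry = fetch_fixed(fixed.copy())
```

A shallow copy is enough here because only the top-level key is popped; nested values such as `data` are never modified.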
… raise error on max retries Ref.: #197 (comment)
Shortened the code by 210 lines. Also, removed docstrings because they are not needed for CLI commands (more maintenance burden).

- `_common_http_options`: shared decorator for 10 Click options used by get/post/put/delete (was repeated 4x)
- `_common_browser_options`: shared decorator for 11 Click options used by fetch/stealthy_fetch (was repeated 2x)
- `_data_options`: shared decorator for `--data`/`--json` options used by post/put
- `__http_command()`: shared implementation body for all HTTP commands (was 4 separate `from scrapling.fetchers import Fetcher` + `__Request_and_Save` blocks)
- `__build_browser_kwargs()`: shared kwargs builder for fetch/stealthy_fetch (was duplicated)
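The shared-decorator idea can be sketched without Click. In the example below, `option` is a hypothetical stand-in for `click.option` that merely records option names, and the three option names are made up for illustration; the real decorators bundle the project's actual Click options.

```python
def option(name):
    """Hypothetical stand-in for click.option: records the option name on the function."""
    def decorator(func):
        func.options = [name] + getattr(func, "options", [])
        return func
    return decorator


def common_http_options(func):
    """Apply one shared set of options instead of repeating them on every command."""
    shared = [option("--headers"), option("--timeout"), option("--proxy")]
    # Apply bottom-up, the same order stacked decorators would be applied in.
    for decorate in reversed(shared):
        func = decorate(func)
    return func


@common_http_options
def get():
    return "get"


@common_http_options
def post():
    return "post"
```

Each command now carries the same option list from a single definition, so adding or renaming an option is a one-line change.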
- `get()` now delegates to `bulk_get([url])[0]` (was a separate sync implementation)
- `fetch()` now delegates to `bulk_fetch([url])[0]` (eliminated duplicate fetcher call)
- `stealthy_fetch()` now delegates to `bulk_stealthy_fetch([url])[0]` (same)
- Replaced 6x repeated `_content_translator(Convertor._extract_content(...), page)` with a single `_translate_response()` helper
- Removed unused imports (`Fetcher`, `DynamicFetcher`, `StealthyFetcher`, `Generator`)
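The delegation pattern in the first three bullets amounts to the following sketch. The function bodies are placeholders (no network access, no real fetcher); only the `get` and `bulk_get` names and the `bulk_get([url])[0]` shape come from the description above.

```python
from typing import List


def bulk_get(urls: List[str]) -> List[str]:
    """Placeholder bulk implementation: one result per URL, in order."""
    return [f"content of {url}" for url in urls]


def get(url: str) -> str:
    """Single-URL variant: wrap the URL in a list and unwrap the only result."""
    return bulk_get([url])[0]
```

Keeping the single-item function as a thin wrapper means fixes to the bulk path apply to both entry points.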
Now you can open a browser, keep using it for other requests as you want, and close it when you want.
Before I forget lol
A new update with many important changes 🎉
New Stuff and quality of life changes
- Added `--ai-targeted` to the Web Scraping commands to make content targeted to AI and safe against common Prompt Injection attacks like the MCP server.
- Added `executable_path` to allow setting a custom browser path (Solves #202)

Solved bugs

- Replaced `raise` with `return False` in `_restore_from_checkpoint` by @haosenwang1018 in #196
- Replaced `get_all` with `getall` in `Texthandler` to match the Selector class.

Coverage/tests improvement

- `_normalize_credentials` edge case coverage tests by @Bortlesboat in #192
- `TextHandler` regex paths and `TextHandlers.re()` by @haosenwang1018 in #194
- `filter`, `iterancestors`, and `find_similar` by @awanawana in #200

Agent Skill improvement

- `--ai-targeted` commandline option when scraping through commandline commands.

Docs improvement
🙏 Special thanks to the community for all the continuous testing and feedback
Big shoutout to our Platinum Sponsors