v0.4.3 #217

Merged: D4Vinci merged 59 commits into main from dev on Mar 30, 2026

@D4Vinci (Owner) commented Mar 30, 2026

A new update with many important changes 🎉

New Stuff and quality of life changes

  • Added a new MCP tool that opens a persistent normal or stealthy browser for use with the rest of the tools, plus another new tool to close it. (Examples)
  • Added a new MCP tool that lists all existing browser sessions. It is intended for use with the new tools above.
  • Added a new option to browser sessions that automatically collects all background requests made during a request (Solves #159) [Examples].
  • Added a new sanitizer that protects the MCP server from common prompt injection attacks by removing hidden/invisible content.
  • Added a new command-line option, --ai-targeted, to the web scraping commands. It tailors the content for AI use and makes it safe against common prompt injection attacks, as the MCP server does.
  • Added a new option to browser sessions called executable_path to allow setting a custom browser path (Solves #202).
  • Refactored the MCP server code to be easier to maintain and unified all tools to be async.
  • Refactored the CLI commands code to be easier to maintain and 210 lines shorter.

Solved bugs

  • Preserved the HTTP method across retries in spider sessions by @karesansui-u in #201
  • Added a maximum retry limit when getting page content to prevent an infinite loop by @haosenwang1018 & @D4Vinci in #197
  • Replaced a bare raise with return False in _restore_from_checkpoint by @haosenwang1018 in #196
  • Replaced get_all with getall in TextHandler to match the Selector class.

Coverage/tests improvement

  • Added _normalize_credentials edge case coverage tests by @Bortlesboat in #192
  • Added save/retrieve round-trip and core storage coverage tests by @haosenwang1018 in #193
  • Added coverage for TextHandler regex paths and TextHandlers.re() by @haosenwang1018 in #194
  • Added edge case tests for filter, iterancestors, and find_similar by @awanawana in #200

Agent Skill improvement

  • Fixed broken markdown links in skill references by @yetval in #204
  • Improved the skill structure so that it passes Clawhub validation more reliably.
  • Forced the skill to use the --ai-targeted command-line option when scraping through command-line commands.

Docs improvement

  • Added Korean README translation by @greatsk55 in #187
  • Fixed CJK/Latin spacing in the Chinese and Japanese READMEs.
  • Fixed broken links from the old website design.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

Bortlesboat and others added 30 commits March 15, 2026 00:14
…core

The adaptive scraping feature relies on SQLiteStorageSystem to persist
and relocate elements, but the existing tests only verified object
creation — not the actual save/retrieve workflow. This adds:

- _get_base_url: None, empty, valid URL, and case-normalization paths
- _get_hash: determinism, uniqueness, strip/lowercase, length suffix
- save/retrieve round-trip: basic, overwrite (upsert), nonexistent key
- URL-based isolation between different websites
- element_to_dict: with/without text, attributes, whitespace filtering
- _get_element_path: nested and root element paths
- Thread safety: 20 concurrent saves with result verification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Several critical code paths in custom_types.py lacked test coverage:

- TextHandler.re(check_match=True): returns bool, not TextHandlers
- TextHandler.re(replace_entities=False): entity preservation path
- TextHandler.re() with capture groups: flatten behavior
- TextHandler.re_first() default value when no match
- TextHandler.clean(remove_entities=True): entity replacement path
- TextHandler.json() valid and invalid input
- TextHandlers.re(): list-level regex with result flattening
- TextHandlers.extract()/get_all(): identity return

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When _checkpoint_system_enabled is False, the method uses a bare
`raise` with no active exception, which causes RuntimeError at
runtime. The method's docstring says it returns False when restoration
is not possible, so return False is the correct behavior.

The caller in crawl() currently guards with `if
self._checkpoint_system_enabled`, but the method's own contract
should be self-consistent.
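
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `restore_with_bare_raise` and `restore_with_return` are hypothetical stand-ins for `_restore_from_checkpoint`, and the `enabled` parameter stands in for `_checkpoint_system_enabled`.

```python
def restore_with_bare_raise(enabled: bool) -> bool:
    """Hypothetical stand-in for the old behavior."""
    if not enabled:
        # No exception is being handled at this point, so a bare
        # `raise` has nothing to re-raise and Python raises RuntimeError.
        raise
    return True


def restore_with_return(enabled: bool) -> bool:
    """Hypothetical stand-in for the fixed behavior."""
    if not enabled:
        # Matches the documented contract: report failure with False.
        return False
    return True


def call_old(enabled: bool) -> str:
    """Call the old variant and describe what happened."""
    try:
        restore_with_bare_raise(enabled)
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"
    return "ok"
```

With `enabled=False`, `call_old` reports a `RuntimeError`, while `restore_with_return` simply returns `False`, which a caller can check without a `try` block.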

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both _get_page_content and _get_async_page_content use a while-True
loop that retries page.content() on PlaywrightError with no upper
bound. If the page is in a permanently broken state (crashed tab,
closed context), this loops forever and hangs the process.

Replace with a bounded for-loop (default 10 retries = 5s), returning
an empty string if all attempts fail. This preserves the existing
retry behavior for the transient Windows issue (playwright#16108)
while preventing hangs.
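
The bounded-retry idea can be sketched without a browser. Everything named here is an assumption for illustration: `get_content_with_retries`, `TransientError` (standing in for `PlaywrightError`), and the 0.5 s delay, which is only inferred from "10 retries = 5s".

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical stand-in for PlaywrightError."""


def get_content_with_retries(
    read_content: Callable[[], str],
    max_retries: int = 10,
    delay: float = 0.5,  # assumed: 10 retries * 0.5 s = 5 s
) -> str:
    """Try read_content() up to max_retries times, then give up."""
    for _ in range(max_retries):
        try:
            return read_content()
        except TransientError:
            # Wait briefly before the next attempt.
            time.sleep(delay)
    # Every attempt failed: return an empty string instead of looping forever.
    return ""
```

A callable that fails a few times and then succeeds still returns its content, while one that always fails yields `""` after `max_retries` attempts.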

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- test_selectors_filter.py: covers chained filter(), empty result,
  all-pass predicate, and calling filter() on empty Selectors
- test_ancestor_navigation.py: covers iterancestors() order, depth,
  text-node safety, root-element edge case, and find_ancestor() with
  no match
- test_find_similar_advanced.py: covers similarity_threshold levels,
  match_text=True behavior, ignore_attributes combinations, and
  text-node safety

SessionManager.fetch() pops `method` from `_session_kwargs`,
which mutates the original request dict. When the engine retries
a blocked request via request.copy(), the copy no longer has
`method`, so it defaults to GET.

Steps to reproduce:
1. Yield Request(url, method="POST", data=...)
2. Target returns a response that triggers is_blocked()
3. Engine retries via request.copy() → second fetch uses GET

Fix: copy the kwargs dict before popping, so the original
request stays intact.
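
The mutation bug and its fix can be shown with plain dictionaries. This is a simplified sketch under stated assumptions: `fetch_mutating`, `fetch_copying`, and `methods_used` are hypothetical names, and the retry is modeled by reusing the same kwargs dict rather than by `request.copy()`.

```python
from typing import Any, Callable, Dict, List


def fetch_mutating(session_kwargs: Dict[str, Any]) -> str:
    """Old behavior: pop() removes `method` from the caller's dict."""
    return session_kwargs.pop("method", "GET")


def fetch_copying(session_kwargs: Dict[str, Any]) -> str:
    """Fixed behavior: pop from a shallow copy, leaving the original intact."""
    kwargs = dict(session_kwargs)
    return kwargs.pop("method", "GET")


def methods_used(fetch: Callable[[Dict[str, Any]], str], attempts: int = 2) -> List[str]:
    """Simulate an initial fetch plus retries that share the same kwargs."""
    request_kwargs = {"method": "POST", "data": {"q": "x"}}
    return [fetch(request_kwargs) for _ in range(attempts)]
```

Here `methods_used(fetch_mutating)` gives `["POST", "GET"]`, reproducing the silent downgrade on retry, while `methods_used(fetch_copying)` gives `["POST", "POST"]`.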
D4Vinci added 28 commits March 23, 2026 16:40
Shortened the code by 210 lines. Also removed the docstrings, since CLI commands do not need them and they only add maintenance burden.

- `_common_http_options`: shared decorator for 10 Click options used by get/post/put/delete (was repeated 4x)
- `_common_browser_options`: shared decorator for 11 Click options used by fetch/stealthy_fetch (was repeated 2x)
- `_data_options`: shared decorator for `--data`/`--json` options used by post/put
- `__http_command()`: shared implementation body for all HTTP commands (was 4 separate `from scrapling.fetchers import Fetcher` + `__Request_and_Save` blocks)
- `__build_browser_kwargs()`: shared kwargs builder for fetch/stealthy_fetch (was duplicated)
- `get()` now delegates to `bulk_get([url])[0]` (was a separate sync implementation)
- `fetch()` now delegates to `bulk_fetch([url])[0]` (eliminated duplicate fetcher call)
- `stealthy_fetch()` now delegates to `bulk_stealthy_fetch([url])[0]` (same)
- Replaced 6x repeated `_content_translator(Convertor._extract_content(...), page)` with a single `_translate_response()` helper
- Removed unused imports (`Fetcher`, `DynamicFetcher`, `StealthyFetcher`, `Generator`)
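
The single-to-bulk delegation mentioned in the list above follows a simple pattern, sketched here with placeholder logic. The function bodies are assumptions for illustration only; the real `bulk_get` performs actual requests and returns richer objects than strings.

```python
from typing import List


def bulk_get(urls: List[str]) -> List[str]:
    """Hypothetical bulk helper: returns one result per URL, in order."""
    return [f"content of {url}" for url in urls]


def get(url: str) -> str:
    """Single-URL variant: wrap the URL in a list and take the only result."""
    return bulk_get([url])[0]
```

Keeping one real implementation means a fix or new option in the bulk path reaches the single-URL path automatically.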
You can now open a browser, keep using it for subsequent requests, and close it when you are done.
@D4Vinci D4Vinci merged commit e173f81 into main Mar 30, 2026
13 checks passed

4 participants