v0.4.3 by D4Vinci · Pull Request #217 · D4Vinci/Scrapling

D4Vinci · 2026-03-30T03:49:57Z

A new update with many important changes 🎉

New Stuff and quality of life changes

Added a new MCP tool to open a persistent normal/stealthy browser to keep using it with the rest of the tools, and another new tool to close it. (Examples)
Added a new MCP tool to list all existing browser sessions. Aimed to be used with the new tools.
Added a new option to browser sessions to automatically collect all background requests that happen during a request (Solves #159) [Examples].
Added a new sanitizer to protect the MCP server from common Prompt Injection attacks by removing hidden/invisible content.
Added a new commandline option called --ai-targeted to the Web Scraping commands to make content targeted to AI and safe against common Prompt Injection attacks like the MCP server.
Added a new option to browser sessions called executable_path to allow setting a custom browser path (Solves #202)
Refactored the MCP server code to be easily maintained and unified all tools to be async.
Refactored the CLI commands code to be easily maintained and shorter by 210 lines.

Solved bugs

A fix to preserve HTTP method across retries in spider session by @karesansui-u in #201
Added a max retry limit to getting page content to prevent infinite loop by @haosenwang1018 & @D4Vinci in #197
Replace bare raise with return False in _restore_from_checkpoint by @haosenwang1018 in #196
Replaced get_all with getall in Texthandler to match the Selector class.

Coverage/tests improvement

Added _normalize_credentials edge case coverage tests by @Bortlesboat in #192
Added save/retrieve round-trip and core storage coverage tests by @haosenwang1018 in #193
Added coverage for TextHandler regex paths and TextHandlers.re() by @haosenwang1018 in #194
Added edge case tests for filter, iterancestors, and find_similar by @awanawana in #200

Agent Skill improvement

Fixed broken markdown links in skill references by @yetval in #204
Improved the skill structure to be more acceptable by Clawhub validation.
Forced the skill to use the --ai-targeted commandline option when scraping through commandline commands.

Docs improvement

Added Korean README translation by @greatsk55 in #187
CJK Latin spacing fixes for the Chinese and Japanese READMEs.
Fixed broken links from the old website design.

🙏 Special thanks to the community for all the continuous testing and feedback

Big shoutout to our Platinum Sponsors

…core The adaptive scraping feature relies on SQLiteStorageSystem to persist and relocate elements, but the existing tests only verified object creation — not the actual save/retrieve workflow. This adds: - _get_base_url: None, empty, valid URL, and case-normalization paths - _get_hash: determinism, uniqueness, strip/lowercase, length suffix - save/retrieve round-trip: basic, overwrite (upsert), nonexistent key - URL-based isolation between different websites - element_to_dict: with/without text, attributes, whitespace filtering - _get_element_path: nested and root element paths - Thread safety: 20 concurrent saves with result verification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Several critical code paths in custom_types.py lacked test coverage: - TextHandler.re(check_match=True): returns bool, not TextHandlers - TextHandler.re(replace_entities=False): entity preservation path - TextHandler.re() with capture groups: flatten behavior - TextHandler.re_first() default value when no match - TextHandler.clean(remove_entities=True): entity replacement path - TextHandler.json() valid and invalid input - TextHandlers.re(): list-level regex with result flattening - TextHandlers.extract()/get_all(): identity return Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When _checkpoint_system_enabled is False, the method uses a bare `raise` with no active exception, which causes RuntimeError at runtime. The method's docstring says it returns False when restoration is not possible, so return False is the correct behavior. The caller in crawl() currently guards with `if self._checkpoint_system_enabled`, but the method's own contract should be self-consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both _get_page_content and _get_async_page_content use a while-True loop that retries page.content() on PlaywrightError with no upper bound. If the page is in a permanently broken state (crashed tab, closed context), this loops forever and hangs the process. Replace with a bounded for-loop (default 10 retries = 5s), returning an empty string if all attempts fail. This preserves the existing retry behavior for the transient Windows issue (playwright#16108) while preventing hangs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- test_selectors_filter.py: covers chained filter(), empty result, all-pass predicate, and calling filter() on empty Selectors - test_ancestor_navigation.py: covers iterancestors() order, depth, text-node safety, root-element edge case, and find_ancestor() with no match - test_find_similar_advanced.py: covers similarity_threshold levels, match_text=True behavior, ignore_attributes combinations, and text-node safety

SessionManager.fetch() pops `method` from `_session_kwargs`, which mutates the original request dict. When the engine retries a blocked request via request.copy(), the copy no longer has `method`, so it defaults to GET. Steps to reproduce: 1. Yield Request(url, method="POST", data=...) 2. Target returns a response that triggers is_blocked() 3. Engine retries via request.copy() → second fetch uses GET Fix: copy the kwargs dict before popping, so the original request stays intact.

…#196)

…()` (#194)

…tor` class

…ssion (#201)

…finite loop (#197)

… raise error on max retries Ref.: #197 (comment)

…#200)

Shortened the code by 210 lines. Also, removed docstrings because they are not needed for CLI commands (more maintenance burden). - `_common_http_options`: shared decorator for 10 Click options used by get/post/put/delete (was repeated 4x) - `_common_browser_options`: shared decorator for 11 Click options used by fetch/stealthy_fetch (was repeated 2x) - `_data_options`: shared decorator for `--data`/`--json` options used by post/put - `__http_command()`: shared implementation body for all HTTP commands (was 4 separate `from scrapling.fetchers import Fetcher` + `__Request_and_Save` blocks) - `__build_browser_kwargs()`: shared kwargs builder for fetch/stealthy_fetch (was duplicated)

- `get()` now delegates to `bulk_get([url])[0]` (was a separate sync implementation) - `fetch()` now delegates to `bulk_fetch([url])[0]` (eliminated duplicate fetcher call) - `stealthy_fetch()` now delegates to `bulk_stealthy_fetch([url])[0]` (same) - Replaced 6x repeated `_content_translator(Convertor._extract_content(...), page)` with a single `_translate_response()` helper - Removed unused imports (`Fetcher`, `DynamicFetcher`, `StealthyFetcher`, `Generator`)

Now you can open a browser, keep using it for other requests as you want, and close it when you want.

Solves #159

Before I forget lol

Solves #214 as well

Solves #202

Bortlesboat and others added 30 commits March 15, 2026 00:14

test(ai): add _normalize_credentials edge case coverage

54fcb7c

Merge branch 'dev' into test/normalize-credentials-coverage

656dbc7

test(ai): add _normalize_credentials edge case coverage (#192)

e63e57e

Merge branch 'dev' into test/custom-types-coverage

1e77048

Merge branch 'main' into dev

72a2c8d

Merge branch 'main' into dev

440da11

Merge branch 'dev' into test/storage-core-coverage

6699213

Merge branch 'dev' into test/custom-types-coverage

5eefbc6

Merge branch 'dev' into fix/checkpoint-restore-bare-raise

bcd45e4

fix: replace bare raise with return False in _restore_from_checkpoint (…

1fb7961

…#196)

fix: adjust test to the _restore_from_checkpoint fix

cc6c0db

Merge branch 'dev' into test/custom-types-coverage

5123590

test: add coverage for TextHandler regex paths and `TextHandlers.re…

980be18

…()` (#194)

fix(Texthandler): Replace get_all with getall to match the `Selec…

136c389

…tor` class

Merge branch 'dev' into fix/page-content-infinite-loop

27f3062

Merge branch 'dev' into fix/preserve-http-method-on-retry

4c07b29

fix(spider/fetcher): preserve HTTP method across retries in spider se…

d9932f2

…ssion (#201)

Merge branch 'dev' into fix/page-content-infinite-loop

8e4e59e

fix(fetchers): add max retry limit to _get_page_content to prevent in…

17ea982

…finite loop (#197)

fix(fetchers/content): increase the default max number of retries and…

1dc0b7a

… raise error on max retries Ref.: #197 (comment)

Merge branch 'dev' into test/edge-cases-filter-ancestors-find-similar

bf9e2da

test: add edge case tests for filter, iterancestors, and find_similar (…

7ca02eb

…#200)

Merge branch 'main' into dev

bc509e9

Merge branch 'main' into dev

d7c0807

D4Vinci added 28 commits March 23, 2026 16:40

Merge branch 'main' into dev

718db42

Merge branch 'main' into dev

3b4cae6

Merge branch 'dev' into test/storage-core-coverage

30ffe76

test: add save/retrieve round-trip and core storage coverage (#193)

3757cf6

test: fixes to storage tests

9fee49c

build: pump up version

422b471

docs: updating the agent skill for Clawhub validation

c2c02dc

docs(skill): adjustment

8e19f82

Merge branch 'main' into dev

d47599f

feat(mcp): Add three new tools to control browser sessions

c458ab6

Now you can open a browser, keep using it for other requests as you want, and close it when you want.

fix(mcp): remove unneeded code and fix type hint for mypy

0f6dccc

feat(browser sessions): Collect XHR requests done while loading the page

68f7c5c

Solves #159

fix: improve type hints for the static checkers

5c450a3

docs: update pages with the XHR feature

61cda58

build(docs): Pump up Zensical version to the latest

8e89a73

docs: Add docs for the new MCP tools

7f552be

docs: update the agent skill with the new features

bcd39d5

Before I forget lol

feat(mcp): Protect from Prompt Injection by removing hidden content

375951b

Solves #214 as well

feat(cli): Add an option to make content safe/targets AI

4efbffa

docs: update pages and skill with the new commandline option

786093f

docs(mcp): add section about prompt injection in docs and skill

3ac7f76

fix(agent): update skill zip file with the latest changes

6598664

feat(browsers): Add a new option to set browser path

8a4c5ff

Solves #202

docs: add the browser path to docs

1f1e475

docs: style adjustment

a403156

fix(agent): update skill zip file with the latest changes

b354be0

D4Vinci merged commit e173f81 into main Mar 30, 2026
13 checks passed

D4Vinci temporarily deployed to PyPI March 30, 2026 03:50 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v0.4.3#217

v0.4.3#217
D4Vinci merged 59 commits into
mainfrom
dev

D4Vinci commented Mar 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

D4Vinci commented Mar 30, 2026

New Stuff and quality of life changes

Solved bugs

Coverage/tests improvement

Agent Skill improvement

Docs improvement

Big shoutout to our Platinum Sponsors

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants