feat: expose Download objects on PlaywrightCrawlingContext #3596
Changes from 1 commit
5078c96
bc67698
b2fd820
365d68d
bb5a1c1
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -34,7 +34,7 @@ import type { BatchAddRequestsResult } from '@crawlee/types'; | |||||||||||||||||||||||
| import { type CheerioRoot, type Dictionary, expandShadowRoots, sleep } from '@crawlee/utils'; | ||||||||||||||||||||||||
| import * as cheerio from 'cheerio'; | ||||||||||||||||||||||||
| import ow from 'ow'; | ||||||||||||||||||||||||
| import type { Page, Response, Route } from 'playwright'; | ||||||||||||||||||||||||
| import type { Download, Page, Response, Route } from 'playwright'; | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| import { LruCache } from '@apify/datastructures'; | ||||||||||||||||||||||||
| import log_ from '@apify/log'; | ||||||||||||||||||||||||
|
|
@@ -1062,12 +1062,40 @@ export interface PlaywrightContextUtils { | |||||||||||||||||||||||
| * @param [options] | ||||||||||||||||||||||||
| */ | ||||||||||||||||||||||||
| handleCloudflareChallenge(options?: HandleCloudflareChallengeOptions): Promise<void>; | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| /** | ||||||||||||||||||||||||
| * A list of {@link https://playwright.dev/docs/api/class-download | Download} objects | ||||||||||||||||||||||||
| * triggered during the current page navigation. | ||||||||||||||||||||||||
| * | ||||||||||||||||||||||||
| * Playwright intercepts downloads before they complete, so the objects are available | ||||||||||||||||||||||||
| * as soon as the browser starts the download — including inside `errorHandler` when | ||||||||||||||||||||||||
| * `page.goto` throws `"Download is starting"`. | ||||||||||||||||||||||||
| * | ||||||||||||||||||||||||
| * > **Note:** Playwright saves download data to a temporary file on disk. For very large | ||||||||||||||||||||||||
| * > files this may be a concern; prefer re-enqueueing the URL to a streaming downloader | ||||||||||||||||||||||||
| * > when file size is unpredictable. | ||||||||||||||||||||||||
|
Contributor
🤔 does PlaywrightCrawler store auto-downloaded files by default now?
Contributor
Author
crawlee/packages/browser-pool/src/playwright/playwright-plugin.ts Lines 122 to 126 in 1d2528b
crawlee/packages/browser-pool/src/playwright/playwright-controller.ts Lines 51 to 56 in 1d2528b
https://playwright.dev/docs/api/class-browser#browser-new-context |
||||||||||||||||||||||||
| * | ||||||||||||||||||||||||
| * **Example usage** | ||||||||||||||||||||||||
| * ```ts | ||||||||||||||||||||||||
| * errorHandler: async ({ downloads, request }, error) => { | ||||||||||||||||||||||||
| * if (error.message.includes('Download is starting')) { | ||||||||||||||||||||||||
| * for (const download of downloads) { | ||||||||||||||||||||||||
| * const stream = await download.createReadStream(); | ||||||||||||||||||||||||
| * // stream to storage... | ||||||||||||||||||||||||
| * } | ||||||||||||||||||||||||
| * } | ||||||||||||||||||||||||
| * }, | ||||||||||||||||||||||||
|
Contributor
This won't happen outside of WCC |
||||||||||||||||||||||||
| * ``` | ||||||||||||||||||||||||
| */ | ||||||||||||||||||||||||
| downloads: Download[]; | ||||||||||||||||||||||||
|
Member
Maybe it's just me, but all the other Any chance we can make this a
async requestHandler({ page, listDownloads }) {
    await listDownloads(); // empty array
    await page.click('download');
    await listDownloads(); // 1 download
}
Contributor
Author
Agreed on the consistency point. I went with a synchronous return (rather than Promise<Download[]>) because the operation is purely a state read: no I/O, no waiting. Happy to flip it to async if you'd prefer to keep the "everything is awaited" muscle memory consistent; it's a one-line change.
While at it, I also moved the listener wiring and the getDownloads registration into _enhanceCrawlingContextWithPageInfo (one closure for the array, the listener and the getter). Side benefit: this also fixes a latent race where download events fired during navigation would .push() into a not-yet-initialized context.downloads field.
I also added an integration test mirroring the flow above: before the click → empty, after the click → 1 download.
Member
I have a different point - if we ever decide to make some The move to
Contributor
Author
Fair call, flipping it 😄 |
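
For readers following the thread, the "one closure for the array, listener and getter" idea with an async getter can be sketched roughly as below. This is only an illustration, not the code in this PR: `trackDownloads` and `DownloadLike` are hypothetical names, and a plain Node.js `EventEmitter` stands in for a Playwright `Page` (which also emits a `download` event). The getter name `listDownloads` is taken from the reviewer's suggestion.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical stand-in for Playwright's Download; only one method is modelled.
interface DownloadLike {
    suggestedFilename(): string;
}

// Hypothetical helper: a single closure owns the array, the listener and the getter,
// so the listener can never push into a field that has not been initialized yet.
function trackDownloads(page: EventEmitter): () => Promise<DownloadLike[]> {
    const downloads: DownloadLike[] = [];

    // Register the listener up front, before any navigation is started.
    page.on('download', (download: DownloadLike) => {
        downloads.push(download);
    });

    // Async by signature even though it only reads state.
    // Returns a copy so callers cannot mutate the internal array.
    return async () => [...downloads];
}
```

Calling `trackDownloads(page)` before navigation and exposing the returned function on the crawling context would give the "empty before click, one entry after click" behaviour described in the thread, under the assumptions stated above.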
||||||||||||||||||||||||
| } | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| export function registerUtilsToContext( | ||||||||||||||||||||||||
| context: PlaywrightCrawlingContext, | ||||||||||||||||||||||||
| crawlerOptions: PlaywrightCrawlerOptions, | ||||||||||||||||||||||||
| ): void { | ||||||||||||||||||||||||
| context.downloads = []; | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
| context.injectFile = async (filePath: string, options?: InjectFileOptions) => | ||||||||||||||||||||||||
| injectFile(context.page, filePath, options); | ||||||||||||||||||||||||
| context.injectJQuery = async () => { | ||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
I'd refrain from mentioning internals of Apify's Actors here - Crawlee is a standalone project.