Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion packages/playwright-crawler/src/internals/playwright-crawler.ts
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ import type {
RouterRoutes,
} from '@crawlee/browser';
import { BrowserCrawler, Configuration, Router } from '@crawlee/browser';
import type { BrowserPoolOptions, PlaywrightController, PlaywrightPlugin } from '@crawlee/browser-pool';
import type { BrowserPoolOptions, CommonPage, PlaywrightController, PlaywrightPlugin } from '@crawlee/browser-pool';
import type { Dictionary } from '@crawlee/types';
import ow from 'ow';
import type { LaunchOptions, Page, Response } from 'playwright';
Expand Down Expand Up @@ -238,6 +238,15 @@ export class PlaywrightCrawler extends BrowserCrawler<
super({ ...browserCrawlerOptions, launchContext, browserPoolOptions }, config);
}

protected override _enhanceCrawlingContextWithPageInfo(
crawlingContext: PlaywrightCrawlingContext,
page: CommonPage,
createNewSession?: boolean,
): void {
super._enhanceCrawlingContextWithPageInfo(crawlingContext, page, createNewSession);
(page as Page).on('download', (download) => crawlingContext.downloads.push(download));
}

protected override async _runRequestHandler(context: PlaywrightCrawlingContext) {
registerUtilsToContext(context, this.options);
await super._runRequestHandler(context);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ import type { BatchAddRequestsResult } from '@crawlee/types';
import { type CheerioRoot, type Dictionary, expandShadowRoots, sleep } from '@crawlee/utils';
import * as cheerio from 'cheerio';
import ow from 'ow';
import type { Page, Response, Route } from 'playwright';
import type { Download, Page, Response, Route } from 'playwright';

import { LruCache } from '@apify/datastructures';
import log_ from '@apify/log';
Expand Down Expand Up @@ -1062,12 +1062,40 @@ export interface PlaywrightContextUtils {
* @param [options]
*/
handleCloudflareChallenge(options?: HandleCloudflareChallengeOptions): Promise<void>;

/**
* A list of {@link https://playwright.dev/docs/api/class-download | Download} objects
* triggered during the current page navigation.
*
* Playwright intercepts downloads before they complete, so the objects are available
* as soon as the browser starts the download — including inside `errorHandler` when
* `page.goto` throws `"Download is starting"`.
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd refrain from mentioning internals of Apify's Actors here - Crawlee is a standalone project.

* > **Note:** Playwright saves download data to a temporary file on disk. For very large
* > files this may be a concern; prefer re-enqueueing the URL to a streaming downloader
* > when file size is unpredictable.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 does PlaywrightCrawler store auto-downloaded files by default now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double triple-checked. Playwright itself saves downloads to a temp dir on disk (that's why download.path() and download.createReadStream() exist in the first place.
This is Playwright's own behavior, set byacceptDownloads wjich defaults to true in Playwright, and we don't override it.

const browserContext = await this.library
.launchPersistentContext(userDataDir, launchOptions)
.catch((error) => {
return this._throwOnFailedLaunch(launchContext, error);
});

if (this.launchContext.useIncognitoPages) {
// Each page requires to have all the context options applied
contextOptions = {
...this.launchContext.launchOptions,
...contextOptions,
};

https://playwright.dev/docs/api/class-browser#browser-new-context

*
* **Example usage**
* ```ts
* errorHandler: async ({ downloads, request }, error) => {
* if (error.message.includes('Download is starting')) {
* for (const download of downloads) {
* const stream = await download.createReadStream();
* // stream to storage...
* }
* }
* },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't happen outside of WCC

* ```
*/
downloads: Download[];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's just me, but all the other utils in this interface are functions.

Any chance we can make this a (list|get)(File)?Downloads method to make the API consistent? Since this array can grow, it makes more sense to me too:

async requestHandler ({ page, listDownloads }) {
    await listDownloads() // empty array
    await page.click('download');
    await listDownloads() // 1 download
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the consistency point. I went with synchronous return (rather than Promise<Download[]>) because the operation is purely a state read so no I/O, no waiting. Happy to flip it to async if you'd prefer to keep the "everything is awaited" muscle memory consistent, it's a one-line change.

While at it, I also moved the listener wiring + getDownloads registration into _enhanceCrawlingContextWithPageInfo (one closure for the array, listener and getter). Side benefit: this also fixes a latent race where download events fired during navigation would .push() into a not-yet-initialized context.downloads field.

Also added an integration test mirroring the before click → empty / after click → 1 download.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to flip it to async if you'd prefer to keep the "everything is awaited" muscle memory consistent; it's a one-line change.

I have a different point - if we ever decide to make some async action here (write stats somewhere, upload something to KVS, etc.), we'd need to make a breaking API change. Having this async from the get-go makes this, imo, easier in the long run.

The move to _enhanceCrawlingContextWithPageInfo could be a little confusing (since the rest of the utils is somewhere else), but you have a point, we might miss some downloads due to a race condition. I'm personally fine with the move 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair call, fipping it 😄

}

export function registerUtilsToContext(
context: PlaywrightCrawlingContext,
crawlerOptions: PlaywrightCrawlerOptions,
): void {
context.downloads = [];

context.injectFile = async (filePath: string, options?: InjectFileOptions) =>
injectFile(context.page, filePath, options);
context.injectJQuery = async () => {
Expand Down