CLI: send screenshots from studio code to Telegram remote sessions #3272

Draft
gcsecsey wants to merge 1 commit into trunk from gcsecsey/screenshot-support

@gcsecsey gcsecsey commented Apr 28, 2026

Related issues

How AI was used in this PR

Claude wrote the bulk of the implementation and the tests, on top of a handoff prompt I drafted for the wpcom backend agent. I reviewed and tested every change end-to-end against a sandbox before opening the PR; the manual test plan is in the testing instructions below.

Proposed Changes

The Telegram remote-session bridge is currently text-only. When the agent finishes a visible task, the user gets a prose summary but no image. This PR lets the local agent deliver screenshots inline:

  • New share_screenshot tool. Captures a 16:9 above-the-fold view of a URL by default and emits a media.share JSON event. fullPage: true is opt-in for the rare case where the user wants the whole scroll length. take_screenshot stays unchanged as the model-internal reasoning tool.
  • Remote-session controller (turn-runner + poll-loop) collects media.share events from the spawned studio code --json child and posts each photo before the text reply.
  • respondMessage now picks transport based on payload. Photo present means multipart/form-data with a photo file part (matches the wpcom contract); text-only stays on the existing JSON path.
  • Spawned child gets STUDIO_REMOTE_SESSION=1 so the system prompt knows to keep replies short, deliver visible work via share_screenshot, follow up with a "Want me to publish this as a preview site?" line, and stop fabricating "gist stored / preview link saved" epilogues that aren't backed by any actual storage.
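
The transport choice described for respondMessage can be sketched roughly as follows. This is an illustration only, not the PR's code: the `RespondPayload` shape, the `prepareRespondRequest` name, the `caption` and `text` field names, and the `screenshot.png` filename are all assumptions; only the `photo` part name and the multipart-versus-JSON split come from the description above.

```typescript
// Hypothetical payload shape; the real field names in the PR may differ.
interface RespondPayload {
  text?: string;
  photo?: { dataBase64: string; mimeType: string; caption?: string };
}

interface PreparedRequest {
  headers: Record<string, string>;
  body: string | FormData;
}

// Pick the transport from the payload: multipart when a photo is present,
// JSON otherwise.
function prepareRespondRequest(payload: RespondPayload): PreparedRequest {
  if (payload.photo) {
    // Decode the base64 image into raw bytes for the file part.
    const bytes = Uint8Array.from(atob(payload.photo.dataBase64), (c) => c.charCodeAt(0));
    const form = new FormData();
    form.append('photo', new Blob([bytes], { type: payload.photo.mimeType }), 'screenshot.png');
    if (payload.photo.caption) {
      form.append('caption', payload.photo.caption); // assumed field name
    }
    // No Content-Type header here: fetch sets multipart/form-data with the
    // boundary itself when the body is a FormData instance.
    return { headers: {}, body: form };
  }
  return {
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: payload.text ?? '' }), // assumed field name
  };
}
```

The result could be passed to `fetch(url, { method: 'POST', ...prepared })` along with whatever auth headers the endpoint expects. Leaving `Content-Type` unset on the multipart path matters, because a hand-written header would omit the boundary.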

Testing Instructions

Prerequisites: be an Automattician (backend gates on is_automattician()), be logged in via studio auth login so the bearer falls through from ~/.studio/shared.json, and have a Telegram bot routing into your account.

  • Build the CLI: npm run cli:build
  • Start the bridge: node apps/cli/dist/cli/main.mjs code --remote-session
  • In a second terminal, tail the log: tail -F ~/.studio/remote-session.log
  • From Telegram, send: "send to my local agent: take a screenshot of and show me"
  • Verify in Telegram:
    • A 1280x720 above-the-fold screenshot arrives inline (not the full-page strip).
    • The caption is a short one-liner that does NOT mention "full page" or "viewport".
    • A follow-up text message asks about publishing a preview site.
    • No "Screenshot shared with the user" progress message before the photo.
    • No "gist stored" or "preview link saved" epilogue.
  • Verify the text-only regression by sending a non-visual request like "list my local sites": text reply arrives, no photo.
  • Optional: ask explicitly for "the full page" and confirm share_screenshot is called with fullPage: true and the long capture is delivered.
  • Optional backend-direct check: after the routing has set the auth key, POST a multipart photo with curl as documented in the wpcom PR; should return { "success": true, "photo_sent": true }.

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors?

…ions

Adds a `share_screenshot` tool that captures a 16:9 above-the-fold view of
a URL and emits a `media.share` JSON event the remote-session controller
forwards to Telegram via the existing `/local-agent-respond` endpoint as
multipart/form-data with a `photo` part. The agent uses this to deliver
visible results back to the user; `take_screenshot` stays internal for
visual reasoning.

Also threads `STUDIO_REMOTE_SESSION=1` to the spawned child so the system
prompt can favor short, visual replies and steer the agent away from
fabricating "gist stored / preview link saved" epilogues.
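
As a rough illustration of the environment-flag threading described in this commit message, the sketch below appends a prompt addendum only when `STUDIO_REMOTE_SESSION=1` is set. The function names (`childEnv`, `isRemoteSession`, `buildSystemPrompt`) and the addendum wording are hypothetical; the PR's actual prompt text and helpers are not shown on this page.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical addendum text; the real wording lives in the PR's system prompt.
const REMOTE_SESSION_ADDENDUM = [
  'You are replying through a Telegram remote session.',
  'Keep replies short and deliver visible work via share_screenshot.',
  'Do not claim that anything was stored or saved unless a tool actually did so.',
].join('\n');

// Environment handed to the spawned child: inherit everything, add the flag.
function childEnv(parentEnv: Env): Env {
  return { ...parentEnv, STUDIO_REMOTE_SESSION: '1' };
}

// The flag is only considered set when it is exactly "1".
function isRemoteSession(env: Env): boolean {
  return env.STUDIO_REMOTE_SESSION === '1';
}

// Append the addendum only for remote sessions; otherwise leave the prompt untouched.
function buildSystemPrompt(basePrompt: string, env: Env): string {
  return isRemoteSession(env) ? `${basePrompt}\n\n${REMOTE_SESSION_ADDENDUM}` : basePrompt;
}
```

In a Node process the parent would pass `childEnv(process.env)` as the `env` option when spawning the child, and the child would call `buildSystemPrompt(base, process.env)`.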
@gcsecsey gcsecsey changed the title from "apps/cli: send screenshots from studio code to Telegram remote sessions" to "CLI: send screenshots from studio code to Telegram remote sessions" Apr 28, 2026
@gcsecsey gcsecsey requested a review from Copilot April 29, 2026 10:06

Copilot AI left a comment

Pull request overview

This PR extends the existing Telegram remote-session bridge for studio code to support inline screenshot delivery by introducing a new user-facing screenshot tool and a new JSON event type that the remote-session controller forwards as Telegram photos.

Changes:

  • Add a new share_screenshot tool that emits a media.share JSON event (and also returns the image to the agent) for user-visible screenshot delivery.
  • Extend the remote-session controller to collect media.share events and post photos before the text reply.
  • Update Telegram response transport to use multipart/form-data when a photo is present (JSON for text-only), plus add remote-session-specific system prompt guidance toggled by STUDIO_REMOTE_SESSION=1.
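
A minimal sketch of the collect-then-post flow summarized above, assuming the child writes one JSON event per line. The `mediaType`, `mimeType`, and `dataBase64` fields appear in the review snippet further down; the `type` discriminator, the optional `caption`, and the function names `collectMediaShares` and `planOutgoing` are assumptions made for illustration.

```typescript
// Assumed event shape; only mediaType, mimeType and dataBase64 are visible in this PR page.
interface MediaShareEvent {
  type: 'media.share';
  mediaType: string;
  mimeType: string;
  dataBase64: string;
  caption?: string;
}

type Outgoing =
  | { kind: 'photo'; media: MediaShareEvent }
  | { kind: 'text'; text: string };

function isMediaShare(value: unknown): value is MediaShareEvent {
  if (typeof value !== 'object' || value === null) return false;
  const event = value as Record<string, unknown>;
  return (
    event.type === 'media.share' &&
    typeof event.mimeType === 'string' &&
    typeof event.dataBase64 === 'string'
  );
}

// Scan newline-delimited JSON output and keep only media.share events.
function collectMediaShares(stdout: string): MediaShareEvent[] {
  const shares: MediaShareEvent[] = [];
  for (const line of stdout.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // skip non-JSON noise rather than failing the whole turn
    }
    if (isMediaShare(parsed)) shares.push(parsed);
  }
  return shares;
}

// Photos go out first, in the order they were emitted; the text reply comes last.
function planOutgoing(shares: MediaShareEvent[], text?: string): Outgoing[] {
  const out: Outgoing[] = shares.map((media) => ({ kind: 'photo' as const, media }));
  if (text) out.push({ kind: 'text', text });
  return out;
}
```

Building the ordered list first and sending second keeps the ordering rule testable without touching the network.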

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/common/ai/tools.ts | Adds display name + URL detail extraction for the new share_screenshot tool. |
| tools/common/ai/json-events.ts | Introduces MediaShareEvent and extends JsonEvent union with media.share. |
| apps/cli/ai/tools.ts | Implements share_screenshot, refactors screenshot capture into captureScreenshotPng, and registers the new tool. |
| apps/cli/ai/system-prompt.ts | Adds Telegram remote-session guidance addendum (including share_screenshot usage expectations). |
| apps/cli/ai/agent.ts | Enables the remote-session system prompt addendum when STUDIO_REMOTE_SESSION=1. |
| apps/cli/remote-session/turn-runner.ts | Collects media.share events from the subprocess and returns them in TurnOutcome. |
| apps/cli/remote-session/poll-loop.ts | Posts collected media shares to Telegram before posting the text reply; avoids “no result” warning when media exists. |
| apps/cli/remote-session/telegram-client.ts | Updates respondMessage to support multipart photo uploads + caption; logs partial failures without throwing. |
| apps/cli/remote-session/tests/* | Adds/updates unit tests for media collection, ordering, and multipart photo transport behavior. |
| apps/cli/remote-session/tests/fixtures/mock-studio-code.mjs | Adds a media-share fixture scenario emitting media.share events. |
| apps/cli/ai/tests/system-prompt.test.ts | New tests verifying remote-session prompt addendum is included/excluded appropriately. |

Comment thread apps/cli/ai/tools.ts
Comment on lines +696 to +723
```ts
// Scroll through the page to trigger lazy-loaded images, then wait
// for all images to finish loading (with a timeout so we don't hang
// on images that never settle).
await page.evaluate( async () => {
	const delay = ( ms: number ) =>
		new Promise< void >( ( resolve ) => setTimeout( resolve, ms ) );
	const scrollHeight = document.body.scrollHeight;
	const viewportHeight = window.innerHeight;
	for ( let y = 0; y < scrollHeight; y += viewportHeight ) {
		window.scrollTo( 0, y );
		await delay( 100 );
	}
	window.scrollTo( 0, 0 );

	const timeout = new Promise< void >( ( resolve ) => setTimeout( resolve, 5000 ) );
	const allImages = Promise.all(
		Array.from( document.images )
			.filter( ( img ) => ! img.complete )
			.map(
				( img ) =>
					new Promise< void >( ( resolve ) => {
						img.addEventListener( 'load', () => resolve() );
						img.addEventListener( 'error', () => resolve() );
					} )
			)
	);
	await Promise.race( [ allImages, timeout ] );
} );
```

Copilot AI Apr 29, 2026

captureScreenshotPng() scrolls the entire document to the bottom to trigger lazy-loading even when options.fullPage is false. For the default share_screenshot above-the-fold use case this can add significant latency on long pages and undermines the goal of a quick viewport capture. Consider skipping the full-page scroll/wait logic when fullPage is false (or limiting it to the first viewport), and only doing the full scroll pass for full-page captures.

Suggested change
```diff
-// Scroll through the page to trigger lazy-loaded images, then wait
-// for all images to finish loading (with a timeout so we don't hang
-// on images that never settle).
-await page.evaluate( async () => {
-	const delay = ( ms: number ) =>
-		new Promise< void >( ( resolve ) => setTimeout( resolve, ms ) );
-	const scrollHeight = document.body.scrollHeight;
-	const viewportHeight = window.innerHeight;
-	for ( let y = 0; y < scrollHeight; y += viewportHeight ) {
-		window.scrollTo( 0, y );
-		await delay( 100 );
-	}
-	window.scrollTo( 0, 0 );
-	const timeout = new Promise< void >( ( resolve ) => setTimeout( resolve, 5000 ) );
-	const allImages = Promise.all(
-		Array.from( document.images )
-			.filter( ( img ) => ! img.complete )
-			.map(
-				( img ) =>
-					new Promise< void >( ( resolve ) => {
-						img.addEventListener( 'load', () => resolve() );
-						img.addEventListener( 'error', () => resolve() );
-					} )
-			)
-	);
-	await Promise.race( [ allImages, timeout ] );
-} );
+// For full-page screenshots, scroll through the entire page to trigger
+// lazy-loaded images. For viewport screenshots, avoid the expensive
+// full-document scroll and only wait on images intersecting the first
+// viewport so above-the-fold captures stay fast.
+await page.evaluate( async ( fullPage ) => {
+	const delay = ( ms: number ) =>
+		new Promise< void >( ( resolve ) => setTimeout( resolve, ms ) );
+	if ( fullPage ) {
+		const scrollHeight = document.body.scrollHeight;
+		const viewportHeight = window.innerHeight;
+		for ( let y = 0; y < scrollHeight; y += viewportHeight ) {
+			window.scrollTo( 0, y );
+			await delay( 100 );
+		}
+		window.scrollTo( 0, 0 );
+	}
+	const timeout = new Promise< void >( ( resolve ) => setTimeout( resolve, 5000 ) );
+	const pendingImages = Array.from( document.images ).filter( ( img ) => {
+		if ( img.complete ) {
+			return false;
+		}
+		if ( fullPage ) {
+			return true;
+		}
+		const rect = img.getBoundingClientRect();
+		return rect.bottom > 0 && rect.top < window.innerHeight;
+	} );
+	const allImages = Promise.all(
+		pendingImages.map(
+			( img ) =>
+				new Promise< void >( ( resolve ) => {
+					img.addEventListener( 'load', () => resolve(), { once: true } );
+					img.addEventListener( 'error', () => resolve(), { once: true } );
+				} )
+		)
+	);
+	await Promise.race( [ allImages, timeout ] );
+}, options.fullPage );
```

```ts
...logContext,
media_type: media.mediaType,
mime_type: media.mimeType,
bytes: media.dataBase64.length,
```

Copilot AI Apr 29, 2026

In the media.share debug log, bytes: media.dataBase64.length is reporting base64 character count, not decoded byte length. To avoid misleading telemetry, consider renaming this field (e.g., base64_chars) or computing the decoded byte length when you actually need “bytes”.

Suggested change
```diff
-bytes: media.dataBase64.length,
+base64_chars: media.dataBase64.length,
```
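
If a real byte count is wanted in the log instead of a renamed field, the decoded size can be derived from the base64 string without decoding it. The sketch below is one way to do that; `decodedByteLength` is a hypothetical helper name, and it assumes standard base64 with no embedded whitespace or line breaks.

```typescript
// Every 4 base64 characters encode 3 bytes; each trailing "=" stands for
// one byte that is not actually present in the decoded output.
function decodedByteLength(base64: string): number {
  const padding = base64.endsWith('==') ? 2 : base64.endsWith('=') ? 1 : 0;
  return Math.floor((base64.length * 3) / 4) - padding;
}
```

For example, `decodedByteLength('aGVsbG8=')` is 5 (the string "hello"), while the character count of the same string is 8, which is the roughly one-third overstatement the comment above points out.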
