CLI: send screenshots from studio code to Telegram remote sessions #3272
…ions Adds a `share_screenshot` tool that captures a 16:9 above-the-fold view of a URL and emits a `media.share` JSON event the remote-session controller forwards to Telegram via the existing `/local-agent-respond` endpoint as multipart/form-data with a `photo` part. The agent uses this to deliver visible results back to the user; `take_screenshot` stays internal for visual reasoning. Also threads `STUDIO_REMOTE_SESSION=1` to the spawned child so the system prompt can favor short, visual replies and steers the agent away from fabricating "gist stored / preview link saved" epilogues.
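As a rough illustration of the `STUDIO_REMOTE_SESSION=1` toggle, the sketch below shows one way an environment flag could gate a system-prompt addendum. The names `buildSystemPrompt`, `isRemoteSession`, and `REMOTE_SESSION_ADDENDUM`, as well as the addendum wording, are hypothetical and not taken from the PR.

```typescript
// Hypothetical sketch: none of these identifiers are confirmed by the PR.
type Env = Record< string, string | undefined >;

// Illustrative addendum text only; the real prompt wording is not shown in the PR page.
const REMOTE_SESSION_ADDENDUM = [
	'You are replying through a Telegram remote session.',
	'Keep replies short and deliver visible results with share_screenshot.',
	'Do not claim that anything was stored or saved unless a tool actually did it.',
].join( '\n' );

// The flag is treated as enabled only for the exact value "1".
function isRemoteSession( env: Env ): boolean {
	return env.STUDIO_REMOTE_SESSION === '1';
}

// Append the addendum to the base prompt when the flag is set; otherwise return it unchanged.
function buildSystemPrompt( basePrompt: string, env: Env ): string {
	return isRemoteSession( env )
		? `${ basePrompt }\n\n${ REMOTE_SESSION_ADDENDUM }`
		: basePrompt;
}
```

Passing the environment in as a parameter, rather than reading a global, keeps the function easy to unit test for both the included and excluded cases.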
Pull request overview
This PR extends the existing Telegram remote-session bridge for studio code to support inline screenshot delivery by introducing a new user-facing screenshot tool and a new JSON event type that the remote-session controller forwards as Telegram photos.
Changes:
- Add a new `share_screenshot` tool that emits a `media.share` JSON event (plus returns the image to the agent) for user-visible screenshot delivery.
- Extend the remote-session controller to collect `media.share` events and post photos before the text reply.
- Update Telegram response transport to use `multipart/form-data` when a photo is present (JSON for text-only), plus add remote-session-specific system prompt guidance toggled by `STUDIO_REMOTE_SESSION=1`.
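To make the collection step concrete, here is a minimal sketch of how `media.share` events might be pulled out of the child's JSON output. It assumes the output is newline-delimited JSON and that the event carries `mediaType`, `mimeType`, and `dataBase64` at the top level (these field names appear in the PR's logging code, but the exact event shape is an assumption). `collectMediaShares` and `isMediaShareEvent` are hypothetical helpers.

```typescript
// Assumed event shape; only the three field names are visible in the PR.
interface MediaShareEvent {
	type: 'media.share';
	mediaType: string;
	mimeType: string;
	dataBase64: string;
}

// Type guard: true only for objects that look like a media.share event.
function isMediaShareEvent( value: unknown ): value is MediaShareEvent {
	if ( typeof value !== 'object' || value === null ) {
		return false;
	}
	const candidate = value as Record< string, unknown >;
	return (
		candidate.type === 'media.share' &&
		typeof candidate.mediaType === 'string' &&
		typeof candidate.mimeType === 'string' &&
		typeof candidate.dataBase64 === 'string'
	);
}

// Hypothetical helper: scan newline-delimited JSON and keep media.share events in order.
function collectMediaShares( stdout: string ): MediaShareEvent[] {
	const shares: MediaShareEvent[] = [];
	for ( const line of stdout.split( '\n' ) ) {
		const trimmed = line.trim();
		if ( ! trimmed ) {
			continue;
		}
		let parsed: unknown;
		try {
			parsed = JSON.parse( trimmed );
		} catch {
			continue; // Skip lines that are not valid JSON.
		}
		if ( isMediaShareEvent( parsed ) ) {
			shares.push( parsed );
		}
	}
	return shares;
}
```

Preserving emission order matters here because the controller posts photos before the text reply, so the user sees screenshots in the sequence the agent produced them.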
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/common/ai/tools.ts | Adds display name + URL detail extraction for the new share_screenshot tool. |
| tools/common/ai/json-events.ts | Introduces MediaShareEvent and extends JsonEvent union with media.share. |
| apps/cli/ai/tools.ts | Implements share_screenshot, refactors screenshot capture into captureScreenshotPng, and registers the new tool. |
| apps/cli/ai/system-prompt.ts | Adds Telegram remote-session guidance addendum (including share_screenshot usage expectations). |
| apps/cli/ai/agent.ts | Enables the remote-session system prompt addendum when STUDIO_REMOTE_SESSION=1. |
| apps/cli/remote-session/turn-runner.ts | Collects media.share events from the subprocess and returns them in TurnOutcome. |
| apps/cli/remote-session/poll-loop.ts | Posts collected media shares to Telegram before posting the text reply; avoids “no result” warning when media exists. |
| apps/cli/remote-session/telegram-client.ts | Updates respondMessage to support multipart photo uploads + caption; logs partial failures without throwing. |
| apps/cli/remote-session/tests/* | Adds/updates unit tests for media collection, ordering, and multipart photo transport behavior. |
| apps/cli/remote-session/tests/fixtures/mock-studio-code.mjs | Adds a media-share fixture scenario emitting media.share events. |
| apps/cli/ai/tests/system-prompt.test.ts | New tests verifying remote-session prompt addendum is included/excluded appropriately. |
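The transport switch described for `telegram-client.ts` can be sketched as below: JSON for text-only replies, `multipart/form-data` with a `photo` part when an image is present. This is an illustration under assumptions, not the PR's code: `buildRespondRequest`, `RespondPayload`, the `text` and `caption` field names, and the `screenshot.png` filename are all hypothetical. It relies on the `FormData`, `Blob`, and `atob` globals available in browsers and recent Node.js versions.

```typescript
// Hypothetical payload and request shapes for illustration only.
interface RespondPayload {
	text?: string;
	photo?: { mimeType: string; dataBase64: string };
}

interface RespondRequest {
	headers: Record< string, string >;
	body: string | FormData;
}

// Decode a base64 string into a Blob with the given MIME type.
function base64ToBlob( dataBase64: string, mimeType: string ): Blob {
	const binary = atob( dataBase64 );
	const bytes = new Uint8Array( binary.length );
	for ( let i = 0; i < binary.length; i++ ) {
		bytes[ i ] = binary.charCodeAt( i );
	}
	return new Blob( [ bytes ], { type: mimeType } );
}

// Pick the transport from the payload: JSON when text-only, multipart when a photo is present.
function buildRespondRequest( payload: RespondPayload ): RespondRequest {
	if ( ! payload.photo ) {
		return {
			headers: { 'Content-Type': 'application/json' },
			body: JSON.stringify( { text: payload.text ?? '' } ),
		};
	}
	const form = new FormData();
	form.append( 'photo', base64ToBlob( payload.photo.dataBase64, payload.photo.mimeType ), 'screenshot.png' );
	if ( payload.text ) {
		form.append( 'caption', payload.text ); // Assumed field name.
	}
	// No explicit Content-Type: the HTTP client sets the multipart boundary itself.
	return { headers: {}, body: form };
}
```

Leaving `Content-Type` unset for the multipart case is deliberate: a hand-written header would lack the boundary parameter and the server could not parse the parts.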

    // Scroll through the page to trigger lazy-loaded images, then wait
    // for all images to finish loading (with a timeout so we don't hang
    // on images that never settle).
    await page.evaluate( async () => {
        const delay = ( ms: number ) =>
            new Promise< void >( ( resolve ) => setTimeout( resolve, ms ) );
        const scrollHeight = document.body.scrollHeight;
        const viewportHeight = window.innerHeight;
        for ( let y = 0; y < scrollHeight; y += viewportHeight ) {
            window.scrollTo( 0, y );
            await delay( 100 );
        }
        window.scrollTo( 0, 0 );
        const timeout = new Promise< void >( ( resolve ) => setTimeout( resolve, 5000 ) );
        const allImages = Promise.all(
            Array.from( document.images )
                .filter( ( img ) => ! img.complete )
                .map(
                    ( img ) =>
                        new Promise< void >( ( resolve ) => {
                            img.addEventListener( 'load', () => resolve() );
                            img.addEventListener( 'error', () => resolve() );
                        } )
                )
        );
        await Promise.race( [ allImages, timeout ] );
    } );

captureScreenshotPng() scrolls the entire document to the bottom to trigger lazy-loading even when options.fullPage is false. For the default share_screenshot above-the-fold use case this can add significant latency on long pages and undermines the goal of a quick viewport capture. Consider skipping the full-page scroll/wait logic when fullPage is false (or limiting it to the first viewport), and only doing the full scroll pass for full-page captures.
Suggested change:

    // For full-page screenshots, scroll through the entire page to trigger
    // lazy-loaded images. For viewport screenshots, avoid the expensive
    // full-document scroll and only wait on images intersecting the first
    // viewport so above-the-fold captures stay fast.
    await page.evaluate( async ( fullPage ) => {
        const delay = ( ms: number ) =>
            new Promise< void >( ( resolve ) => setTimeout( resolve, ms ) );
        if ( fullPage ) {
            const scrollHeight = document.body.scrollHeight;
            const viewportHeight = window.innerHeight;
            for ( let y = 0; y < scrollHeight; y += viewportHeight ) {
                window.scrollTo( 0, y );
                await delay( 100 );
            }
            window.scrollTo( 0, 0 );
        }
        const timeout = new Promise< void >( ( resolve ) => setTimeout( resolve, 5000 ) );
        const pendingImages = Array.from( document.images ).filter( ( img ) => {
            if ( img.complete ) {
                return false;
            }
            if ( fullPage ) {
                return true;
            }
            const rect = img.getBoundingClientRect();
            return rect.bottom > 0 && rect.top < window.innerHeight;
        } );
        const allImages = Promise.all(
            pendingImages.map(
                ( img ) =>
                    new Promise< void >( ( resolve ) => {
                        img.addEventListener( 'load', () => resolve(), { once: true } );
                        img.addEventListener( 'error', () => resolve(), { once: true } );
                    } )
            )
        );
        await Promise.race( [ allImages, timeout ] );
    }, options.fullPage );


    ...logContext,
    media_type: media.mediaType,
    mime_type: media.mimeType,
    bytes: media.dataBase64.length,

In the media.share debug log, bytes: media.dataBase64.length is reporting base64 character count, not decoded byte length. To avoid misleading telemetry, consider renaming this field (e.g., base64_chars) or computing the decoded byte length when you actually need “bytes”.
Suggested change:

    base64_chars: media.dataBase64.length,

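If a true byte count is wanted instead of a renamed field, the decoded length can be derived from the base64 string without decoding it. The sketch below assumes standard, padded base64 with no whitespace or line breaks; `decodedByteLength` is a hypothetical helper, not part of the PR.

```typescript
// Hypothetical helper: decoded size of a padded base64 string, computed arithmetically.
// Every 4 base64 characters encode 3 bytes; each trailing "=" removes one byte.
function decodedByteLength( dataBase64: string ): number {
	const charCount = dataBase64.length;
	if ( charCount === 0 ) {
		return 0;
	}
	let padding = 0;
	if ( dataBase64.endsWith( '==' ) ) {
		padding = 2;
	} else if ( dataBase64.endsWith( '=' ) ) {
		padding = 1;
	}
	return Math.floor( ( charCount * 3 ) / 4 ) - padding;
}
```

For example, the 8-character string `aGVsbG8=` encodes the 5 bytes of `hello`, so logging the character count would overstate the payload size by roughly a third.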
Related issues
studio code (PoC) #3196

How AI was used in this PR
Claude wrote the bulk of the implementation and the tests, on top of a handoff prompt I drafted for the wpcom backend agent. I reviewed and tested every change end-to-end against a sandbox before opening the PR; the manual test plan is in the testing instructions below.
Proposed Changes
The Telegram remote-session bridge is currently text-only. When the agent finishes a visible task, the user gets a prose summary but no image. This PR lets the local agent deliver screenshots inline:
- `share_screenshot` tool. Captures a 16:9 above-the-fold view of a URL by default and emits a `media.share` JSON event. `fullPage: true` is opt-in for the rare case where the user wants the whole scroll length. `take_screenshot` stays unchanged as the model-internal reasoning tool.
- The remote-session controller (`turn-runner` + `poll-loop`) collects `media.share` events from the spawned `studio code --json` child and posts each photo before the text reply.
- `respondMessage` now picks transport based on payload. Photo present means `multipart/form-data` with a `photo` file part (matches the wpcom contract); text-only stays on the existing JSON path.
- The spawned child gets `STUDIO_REMOTE_SESSION=1` so the system prompt knows to keep replies short, deliver visible work via `share_screenshot`, follow up with a "Want me to publish this as a preview site?" line, and stop fabricating "gist stored / preview link saved" epilogues that aren't backed by any actual storage.

Testing Instructions
Prerequisites: be an Automattician (backend gates on `is_automattician()`), be logged in via `studio auth login` so the bearer falls through from `~/.studio/shared.json`, and have a Telegram bot routing into your account.

- `npm run cli:build`
- `node apps/cli/dist/cli/main.mjs code --remote-session`
- `tail -F ~/.studio/remote-session.log`
- … `share_screenshot` is called with `fullPage: true` and the long capture is delivered.
- … `{ "success": true, "photo_sent": true }`.

Pre-merge Checklist