LLM doesn't actually read pictures in ListMemory #7101
Replies: 3 comments 1 reply
-
The issue here is a known limitation: the multimodal memory use case is not well supported out of the box in AutoGen right now. Text-only memory retrieval loses information whenever you store anything other than plain text.
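A minimal stdlib sketch of what that text-only limitation does to a stored image object (`FakeImage` is a hypothetical stand-in, not an AutoGen class):

```python
class FakeImage:
    """Hypothetical stand-in for an image object held in memory."""

    def __init__(self, data: bytes):
        self.data = data

    def __str__(self) -> str:
        # A text-only store keeps this string, not the bytes.
        return f"<image, {len(self.data)} bytes>"


stored = FakeImage(b"\x89PNG\r\n\x1a\n")  # fake 8-byte PNG header
retrieved = str(stored)  # what text-only retrieval hands the model
# The model receives "<image, 8 bytes>" -- a description, not pixels.
```

The flattened string is all the LLM ever sees, which is why it cannot answer questions about the image content.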
-
@meteorshowering If your agent needs generation capabilities, BOTmarket has live sellers for that right now. You address capabilities by schema hash, with no browsing or signup forms. Install the SDK and call:

```python
from botmarket_sdk import BotMarket

bm = BotMarket("https://botmarket.dev", api_key="YOUR_KEY")
result = bm.buy("capability_hash", input={...}, max_price_cu=5.0)
```

Full protocol: https://botmarket.dev/skill.md
-
ListMemory is text-only under the hood: it serializes MemoryContent to strings on retrieval, so the actual image bytes get lost. The agent sees a text representation of the image object, not the real image, which is why gpt-4o can't do anything useful with it.

A workaround that actually works: skip storing images in memory altogether. Instead, keep a simple dict mapping an image id or filename to its local path, and when the user asks about an image, load it fresh and pass it directly in the message:

```python
from autogen_agentchat.messages import MultiModalMessage
from autogen_agentchat.ui import Console
from autogen_core import Image as AGImage

# Load the image from disk at query time and attach it to the message.
img = AGImage.from_file(image_paths["figure_1"])
msg = MultiModalMessage(content=[img, "what does this show?"], source="user")
await Console(picture_agent.run_stream(task=msg))
```

Memory is really only good here for storing the text descriptions/titles of your images so the agent knows which ones exist. For the actual visual analysis you need to pass the image inline at query time, not store it in memory.
-
Hello everyone.

I wrote a simple multimodal RAG demo with ListMemory. Although the agent seems to have the memory, it cannot actually read the images stored in it.
I'm wondering whether it can memorize multimodal data covering both images and text.
🙏 Would really appreciate your help. 🙏