Episode notes
For AI agents to move from reasoning to action, they need more than text alone. But what “senses” actually matter?
In this episode of The Shift Podcast: Agentic Edition, members of the Microsoft Foundry team discuss how multimodal inputs such as text, vision, and speech shape how agents perceive and interact with the world. The conversation focuses on what is practical today rather than assuming fully autonomous systems.
The discussion covers:
· Why multimodal AI expands what agents can understand.
· How vision, voice, and text models are combined in applications.
· The role of tools and APIs in enabling agent action.
· Where modality adds value—and where it introduces complexity.
Rather than framing modalities as future cap ...