Beyond ChatGPT: Embodied language understanding

Illustration depicting embodied AI

An illustration of a scenario depicting many tasks of interest to researchers in Embodied AI. Here, we have multiple robots operating in a kitchen environment, with a human asking one of the robots if there is any cereal left, while the other one cleans the dishes. The robots must use their navigation, manipulation, and reasoning skills to answer and achieve tasks in the environment. Illustration courtesy of Winson Han.

Stefan Lee, assistant professor of computer science at Oregon State University, is making significant strides at the intersection of artificial intelligence and robotics. Having held visiting research positions at Indiana University, Virginia Tech, Georgia Tech, and Meta AI before joining Oregon State in 2019, Lee was drawn to the university’s leadership in robotics and AI, as well as the strong collaborative spirit of its faculty. His primary research interest lies in language grounding, which aims to associate words with their real-world meanings and representations.

Lee and his team apply advances in natural language processing to build increasingly intelligent embodied systems. Going beyond the capabilities of language-generation applications like ChatGPT, Lee’s approach, which combines natural language processing and computer vision in embodied contexts, opens up the potential for AI systems to interact more fluidly with humans in the physical world.

Internationally acclaimed research

Lee was recently honored at the International Conference on Learning Representations, alongside collaborators from Meta AI and Georgia Tech, with one of four Outstanding Paper Awards. Their paper, “Emergence of Maps in the Memories of Blind Navigation Agents,” was selected from among 4,900 papers submitted to the conference.

The research delves into how “blind” AI navigation agents, equipped solely with egomotion sensing, can learn to navigate unfamiliar environments and construct maplike representations that enable them to take shortcuts, follow walls, predict free space, and detect collisions.

“My focus is the development of agents that can perceive their environment and communicate about this understanding with humans in order to coordinate their actions to achieve mutual goals — in short, agents that can see, talk, and act,” Lee said. “Consequently, I work on problems in computer vision, natural language processing, and deep learning in general.”

The importance of language grounding

Lee is fundamentally interested in language grounding — associating words with sights, sounds, and actions, in order to anchor their meanings in day-to-day life and in communicable expressions.

Animated illustration of a robot following instructions to walk through a building

The instructions and path of a robot using computer vision to navigate.

Grounding is crucial for robots with diverse embodiments, such as legs, wheels, or different types of manipulators. While Lee acknowledges that large language models play a significant role in his research, he points out that these models lack the ability to ground words and concepts in the real world.

“ChatGPT can write you a poem about cats, and it can even identify one in a photo,” Lee said. “However, it doesn’t know that cats are furry, in that it lacks tactile sensors to identify what ‘furry’ even means or what its experiential implications are.”

As an example, Lee highlights the complexity of the challenges a robot faces when given the simple command to go to the kitchen and slice an apple.

“If it actually wants to follow that, it has to be able to ground references to ‘kitchen’ and ‘apple’ to the stimuli it collects from onboard sensors, like cameras,” Lee said. “The robot also has to understand what ‘go’ and ‘slice’ mean, for the particular embodiments it has. We have hands, so slicing looks like a particular motion for us. For a robot with a different set of manipulators, slicing may require very different motions, even if the outcome we want is the same.”

Lee added that the ability to draw conclusions from perceptual data will continue to be a focus for AI researchers.

The future of AI and language grounding

The current surge of interest in AI has been driven by recent advancements in the field’s ability to deal with sound, text, and imagery. Consumer AI applications have become profitable, spurring further excitement for the technology. To what degree large language models like ChatGPT will end up augmenting or replacing creative or intellectual work remains an open question. Looking beyond ChatGPT, Lee sees significant opportunities in language grounding as a means to expand interactions with embodied agents.

“One of the reasons I’m excited about language grounding is the issue of access,” he said. “Most of us are not programmers, and even fewer are mechanical engineers and roboticists. It would be great if you could talk to a robot to get it to perform actions, which would require the robot to be able to reason about grounding appropriately.”

As robots and embodied agents become increasingly integrated into our daily lives, our ability to communicate with and control them easily using natural language will be essential to ensure accessibility. This is particularly true for people with disabilities, who stand to benefit most from these technologies. By advancing research in language grounding, Lee and his colleagues are working to create a better future for human-AI cooperation.

If you’re interested in connecting with the AI and Robotics Program for hiring and collaborative projects, please contact



June 6, 2023