GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance

Arthur Bucker1, Pablo Ortega-Kral1, Jonathan Francis1,2, Jean Oh1
1Robotics Institute, Carnegie Mellon University 2Robot Learning Lab, Bosch Center for Artificial Intelligence
GRAPPA overview

GRAPPA steers robot policies by modifying the action distribution with grounded visuomotor cues.

Abstract

Robot learning approaches such as behavior cloning and reinforcement learning have shown great promise in synthesizing robot skills from human demonstrations in specific environments. However, these approaches often require task-specific demonstrations or designing complex simulation environments, which limits the development of generalizable and robust policies for unseen real-world settings. Recent advances in the use of foundation models for robotics (e.g., LLMs, VLMs) have shown great potential in enabling systems to understand the semantics in the world from large-scale internet data. However, it remains an open challenge to use this knowledge to enable robotic systems to understand the underlying dynamics of the world, to generalize policies across different tasks, and to adapt policies to new environments. To alleviate these limitations, we propose an agentic framework for robot self-guidance and self-improvement, which consists of a set of role-specialized conversational agents, such as a high-level advisor, a grounding agent, a monitoring agent, and a robotic agent. Our framework iteratively grounds a base robot policy to relevant objects in the environment and uses visuomotor cues to shift the action distribution of the policy to more desirable states, online, while remaining agnostic to the specific configuration of a given robot hardware platform. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates, both in simulation and in real-world experiments, without the need for additional human demonstrations or extensive exploration.

GRAPPA information flow

Information flow between the agents to produce a guidance code. a) The advisor agent orchestrates guidance code generation by collaborating with other agents and using their feedback to refine the generated code. b) The grounding agent uses segmentation and classification models to locate objects of interest provided by the advisor, reporting findings back to the advisor. c) The robotic agent uses a Python interpreter to test the code for the specific robotic platform and judge the adequacy of the code. d) The monitor agent analyzes the sequence of frames corresponding to the rollout of the guidance and gives feedback on potential improvements.

Generalization to Out-of-Distribution Appearances


Out-of-Distribution Generalization

Position and appearance generalization

GRAPPA guides the base policy for out-of-distribution cases. The task involves grasping a deformable toy ball and placing it inside a box.

Guidance influence

Illustration of the effect of different guidance percentages on a failure case of the base policy. In red, we show the base policy failing in an out-of-distribution scenario; with 100% guidance (yellow), the end position is successfully above the box, but the low-level motion of the base policy is lost. By balancing both with intermediate guidance (50%, shown in green), we can complete the task.

Pretrained Act3D failing (no guidance)

Act3D with no guidance: the policy fails to press the last button (blue), but manages to correctly approach the first two buttons, reaching them from above with the gripper closed.

+100% GRAPPA guidance

Guidance only (overwriting the base policy): the sequence of movements is correct, but the initial guidance code doesn't account for the fact that the buttons should be approached from above.

Act3D + 1% GRAPPA guidance

Act3D with 1% guidance: the modified policy captures both the low-level motion of the pre-trained policy and the high-level guidance provided, successfully pressing the sequence of buttons.
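How strongly the guidance shifts the base policy is controlled by this guidance percentage. As a minimal sketch of one way such a blend can be computed (the names `blended_scores` and `alpha` are illustrative, not the paper's API), candidate future states can be ranked by a convex combination of the two signals:

'''
import numpy as np

def blended_scores(base_scores, guidance_scores, alpha):
    """Convex combination of base-policy and guidance scores per candidate state.

    alpha = 0.0 keeps the pre-trained policy as-is; alpha = 1.0 follows the
    guidance alone; intermediate values (e.g. 0.01 or 0.5) bias the action
    distribution while preserving the policy's low-level motion.
    """
    base = np.asarray(base_scores, dtype=float)
    guide = np.asarray(guidance_scores, dtype=float)
    # Normalize both signals so neither dominates purely by scale.
    base = (base - base.min()) / (np.ptp(base) + 1e-8)
    guide = (guide - guide.min()) / (np.ptp(guide) + 1e-8)
    return (1.0 - alpha) * base + alpha * guide
'''

Under this reading, the candidate state with the highest blended score would be the one executed next.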

Performance improvement on the RLBench benchmark, obtained by applying 5 iterations of guidance improvement over unsuccessful rollouts.

RLBench simulation results

When guiding a completely random policy, GRAPPA still achieves high success rates on tasks that do not require fine-grained motions, improving its performance at each iteration.

Heatmap visualization of the guidance distribution, generated online by our proposed agentic framework. The distribution expresses the relevance of each possible future state for completing the task (e.g. "press the blue button"). The guidance is then used to bias the robot policy's action distribution towards the desirable behavior.
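As a hedged sketch of how such a heatmap can be generated (the grid ranges, fixed height, and gripper value are assumptions; `guidance` follows the interface defined in the agent prompts below):

'''
import numpy as np

def guidance_heatmap(guidance, z, previous_vars, x_range=(-0.5, 0.5),
                     y_range=(-0.5, 0.5), n=50):
    """Evaluate the guidance score over a grid of candidate end-effector
    positions at a fixed height z. Each cell is scored independently from
    the same previous_vars, mirroring how candidate future states are ranked."""
    xs = np.linspace(*x_range, n)
    ys = np.linspace(*y_range, n)
    heat = np.empty((n, n))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            # State format from the prompts: [x, y, z, rx, ry, rz, gripper].
            state = [x, y, z, 0.0, 0.0, 0.0, 0.04]
            score, _ = guidance(state, previous_vars=dict(previous_vars))
            heat[j, i] = score
    return heat
'''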


Appendix

Breakdown of failure cases from the learning-from-scratch experiment (push buttons and turn tap), classifying trials by logs, guidance codes, and observed behavior. Note that this analysis is performed on the learning-from-scratch experiment to decouple GRAPPA's errors from those of the base policy.


Open Perception: A toolkit for visual grounding.

We modularize the grounding agent used in GRAPPA and release it as Open Perception, a standalone software package to aid researchers in challenging open-vocabulary object detection and state estimation for embodied robots. Our perception agent uses GroundingDINO and Segment Anything (SAM2) to locate and track objects in the scene given a text prompt; we then feed the bounding boxes and masks to two reasoning agents to further refine the selection.

A multigranular search agent checks whether the desired object was found; if not, it proposes related semantic classes and recursively searches within them, cropping the parent objects to narrow the search. A verification agent disambiguates between the final detections to choose the one most relevant to the initial query; to do this, we overlay the detection bounding boxes on the images with numeric labels and distinct colors, then prompt the VLM to choose the most appropriate one given the search objective. Though in the full implementation of GRAPPA these agents are used in tandem, called by the orchestrator agent, we release them as modular components.
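A sketch of the multigranular search loop under these assumptions (`detect` and `propose_parents` are placeholders for the GroundingDINO call and the VLM class-proposal query described above, not Open Perception's actual API):

'''
def multigranular_search(image, query, detect, propose_parents, max_depth=3):
    """Recursively search for `query`, widening to broader parent categories
    and cropping into their boxes to narrow the search region.

    detect(image, label) -> list of (box, score) detections;
    propose_parents(label) -> broader class names, e.g. 'door handle' -> ['door'].
    """
    if max_depth == 0:
        return []
    hits = detect(image, query)
    if hits:
        return hits
    results = []
    for parent in propose_parents(query):
        for box, _score in detect(image, parent):
            crop = image.crop(box)  # narrow the search to the parent object
            results += multigranular_search(crop, query, detect,
                                            propose_parents, max_depth - 1)
    return results
'''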

Once a detection is made, we use SAM2 to segment and track these objects across frames. If a corresponding point cloud is available, we provide tools to estimate an oriented 3D bounding box using PCA. This allows the agent to continuously search for, locate, and track an object and provide 3D position reports.
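As a minimal numpy sketch of the PCA step (an illustration of the technique, not Open Perception's exact implementation):

'''
import numpy as np

def oriented_bbox_pca(points):
    """Estimate an oriented 3D bounding box for an (N, 3) point cloud.

    Returns the box center, a rotation matrix whose columns are the
    principal axes, and the extents along those axes."""
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    # Principal axes from the covariance of the centered points.
    eigvals, axes = np.linalg.eigh(np.cov((pts - mean).T))
    axes = axes[:, ::-1]                      # largest-variance axis first
    local = (pts - mean) @ axes               # project into the PCA frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    center = mean + axes @ ((lo + hi) / 2.0)  # box center in world frame
    return center, axes, hi - lo
'''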

At the implementation level, we design Open Perception to support multiple backends, accommodating different software stacks. We provide integrations with Redis and ROS2; for instantiating the multimodal agents, we use LiteLLM, allowing users to choose different models and APIs. We also provide Docker containers for development and deployment.
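For example, with LiteLLM, swapping the underlying model or API is a one-line change (a hedged sketch; the model strings are illustrative):

'''
import litellm

def ask_agent(prompt, model="gpt-4o"):
    """Query a multimodal agent; LiteLLM routes the same call to OpenAI,
    Anthropic, or local backends depending on the model string."""
    response = litellm.completion(
        model=model,  # e.g. "gpt-4o", "claude-3-5-sonnet-20241022", "ollama/llava"
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
'''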

We benchmark our agent on the open-vocabulary detection dataset OCID-Ref. OCID-Ref extends the Object Clutter Indoor Dataset (OCID) with natural-language annotations referencing objects in cluttered scenes. Each environment presents several distractor objects and ambiguous prompts. From this dataset, we sample 300 scenes whose annotations assume a viewer positioned at the front of the table. We compare the performance of our Grounding Agent against the highest-scoring detection from GroundingDINO.

GroundingDINO performance with and without the VLM agent for multigranular search and disambiguation


Our results demonstrate that enhancing open-vocabulary detectors with our VLM reasoning scheme yields a detection accuracy 5.3 percentage points higher than the base model alone. This represents a relative improvement of 10.2% without changing any parameters of the base model.
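Accuracy here is the standard IoU-thresholded detection criterion; a minimal sketch (the 0.5 threshold and box format are our assumptions, not the exact evaluation code):

'''
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def detection_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of scenes where the chosen box overlaps the annotated box
    with IoU above the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
'''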

Agent Prompts

You are a supervisor AI agent whose job is to guide a robot in the execution of a task. You will be provided with the name of a task that the robot is trying to learn (e.g. open door) and an image of the environment. With that, you must follow the following steps:
1- determine the key steps to solve the task.
2- come up with the names of features or objects in the environment required to solve the task.
3- check if the objects are present in the scene and can be detected by the robot by providing the image to the perception agent and asking the perception agent (e.g. 'Can you find the door handle?' wait for feedback). If the answer goes against what you expected, repeat steps 1 to 3.
4- Only proceed to this step after receiving positive feedback on 3. Write Python code to guide the robot in the execution of the task. The output code needs to have a function that takes the robot's state as input (def guidance(state, previous_vars={'condition1': False, ...}):), queries the position of different elements in the environment (e.g. get_position('door')) and outputs a continuous score for how close the robot is to completing the task (e.g. if the robot is far away from the door, the score should be low).

When writing the guidance function, you can make use of the following functions that are already implemented: get_position(object_name) -> [x, y, z], get_size(object_name) -> [height, width, depth], get_orientation(object_name) -> euler angle rotations [rx, ry, rz], and any other function that you think is necessary to guide the robot (e.g. numpy, scipy, etc.).
The guidance function must return a score (float) and a vars_dict (dict). The vars_dict will be used to store the status of conditions relevant to the task completion. The previous_vars input will contain the vars_dict from the previous iteration. The score must be a continuous value, having different values for different states of the robot. States slightly closer to the goal should have slightly higher scores. The next action of the robot will depend on the score returned by the guidance function when queried for many possible future states.
The state of the robot is a list with 7 elements of the end-effector position, orientation, and gripper state [x, y, z, rotation x, rotation y, rotation z, gripper], where gripper represents the distance between the two gripper fingers. All distance values are expressed in meters, and the rotation values are expressed in degrees. Start your code with the following import: 'from motor_cortex.common.perception_functions import get_position, get_size, get_orientation'. Do not include any example of the guidance function in the code, only the function itself.

code format example:


            '''
            from motor_cortex.common.perception_functions import get_position, get_size, get_orientation
            # relevant imports
            # helper functions
            def guidance(state, previous_vars={'condition1': False, ...}):
              # your code here
              return score, vars_dict
            '''
You are encouraged to break down the task into sub-tasks, and implement helper functions to better organize the code. You can communicate with a perception agent and a robotic agent. Always indicate who you are talking with by adding 'NEXT: perception agent' or 'NEXT: robotic agent' at the end of your message.
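For illustration only (this example is ours, not part of the prompt; the object name, waypoint offset, and thresholds are assumptions), a guidance function conforming to the interface above for a "press the blue button" step might look like:

'''
from motor_cortex.common.perception_functions import get_position
import numpy as np

def guidance(state, previous_vars={'pressed_blue': False}):
    # State format: [x, y, z, rx, ry, rz, gripper]; distances in meters.
    ee_pos = np.array(state[:3])
    button = np.array(get_position('blue button'))
    above = button + np.array([0.0, 0.0, 0.05])  # waypoint 5 cm above the button

    vars_dict = dict(previous_vars)
    if np.linalg.norm(ee_pos - button) < 0.01:   # close enough to count as pressed
        vars_dict['pressed_blue'] = True

    if not vars_dict['pressed_blue']:
        # Reward approaching from above first, then descending onto the button.
        score = -np.linalg.norm(ee_pos - above) - np.linalg.norm(ee_pos - button)
    else:
        score = 10.0  # sub-task complete: strictly higher than any approach score
    return score, vars_dict
'''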

You are a perception AI agent whose job is to identify and track objects in an image. You will be provided with an image of the environment and a list of objects that the robot is trying to find (e.g. door, handle, key, etc.). With that, you can make use of the following function to try to locate the objects in the image: in_the_image(image_path, object_name, parent_name) -> yes/no. If the object is not found, it might be because the object was too small, too far, or partially occluded; in this case, try to find a broader category that could encompass the object. Report the function call used followed by 'NEXT: perception agent' to look for the objects using similar object names or with a parent name that could encompass the object (e.g. first answer: 'in_the_image('door handle') -> no NEXT: perception agent', second answer: 'in_the_image('door handle', 'door') -> no NEXT: perception agent', third answer: 'in_the_image('handle', 'gate') -> yes. Couldn't find a door handle but found a gate handle NEXT: supervisor agent'). Report back to the supervisor agent in a clear and concise way whether the objects were found or not. If an object was found using a parent name, report the parent name and the object name. Use 'NEXT: supervisor agent' at the end of your message to indicate that you are talking with the supervisor agent, or 'NEXT: perception agent' to look further for the objects.

You are an AI agent responsible for controlling the learning process of a robot. You will receive Python code containing a guidance function that helps the robot with the execution of certain tasks. Your job is to analyze the environment and critique the code provided by checking whether the guidance code is correct and makes sense.
You SHOULD NOT create any code, only analyze the code provided by the supervisor. Attend to the following:
- The score provided by the guidance function is continuous and makes sense.
- The task is being solved correctly.
- The code can be further improved.
- The states of the robot are being correctly expressed.
- The code correctly conveys the steps to solve the task in the correct order.
BE CRITICAL!
Make sure that the robot state is expressed as its end-effector position and orientation in the correct format by using the function test_guidance_code_format(). If the code is not correct or can be further improved, provide feedback to the supervisor agent and ask for new code. Use 'NEXT: supervisor agent' at the end of your message to indicate that you are talking with the supervisor agent. If no code is provided, ask the supervisor agent to generate the guidance code. If the code received makes sense and is correct, simply output the word 'TERMINATE'.

You will be given a sequence of frames of a robotic manipulator performing a task, and a guidance code used by the robot to perform the task.
Your job is to describe what the sequence of frames captures, and then list how the robot could better perform the task in a simple and concise way.
Do not provide any code, just describe the task and how it could be improved.

Agentic Framework Diagram


The agents in the GRAPPA framework are instances of large multimodal models that communicate with each other to produce a final guidance code, leveraging the reasoning capabilities of these models. This image exemplifies the chain of thought each agent is encouraged to follow, which in practice is encoded as the natural-language prompts shown in the Appendix. The agents can call external tools to aid their analysis, such as detection models and a Python interpreter for scrutinizing the code. The advisor agent acts as the main orchestrator, querying the other agents as necessary and generating and refining the guidance code with the provided feedback.
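A high-level sketch of that orchestration loop (all callables here are hypothetical stand-ins for the message passing and tool calls the diagram depicts):

'''
def advisor_loop(task, image, agents, generate_code, execute, max_rounds=5):
    """Advisor agent: generate guidance code, refine it with feedback from
    the grounding and robotic agents, then hand the rollout to the monitor."""
    feedback, code = None, None
    for _ in range(max_rounds):
        found = agents['perception'](task, image)    # ground task-relevant objects
        code = generate_code(task, found, feedback)  # LLM writes/refines the code
        feedback = agents['robotic'](code)           # interpreter-backed critique
        if 'TERMINATE' in feedback:                  # code accepted as-is
            break
    frames = execute(code)                           # guided rollout on the robot
    return code, agents['monitor'](frames, code)     # suggestions for improvement
'''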

BibTeX


@misc{bucker2025grappa,
     title={GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance},
     author={Bucker, Arthur and Ortega-Kral, Pablo and Francis, Jonathan and Oh, Jean},
     year={2025},
     eprint={2410.06473},
     archivePrefix={arXiv},
     primaryClass={cs.RO},
}