University of Michigan researchers release AFUN for robot affordance understanding
The University of Michigan-led model pairs affordance masks with 3D motion curves, but its public repo is still inference-only.
By Ryan Merket · Published
Why it matters
AFUN targets a robotics bottleneck that end-to-end policies often hide: translating visual understanding into executable contact and motion. The partial release gives researchers a model to test, but how auditable its claims become depends on whether the missing training, dataset, and evaluation code ships.

Jun Gao (@JunGao33210520) and his PhD student Zhaoning Wang (@Zhaoning_Eric_W) are trying to sharpen the interface between robot perception and robot action with AFUN, a new affordance model that predicts not only where a robot should touch an object, but also how it should move after contact.
Gao, a University of Michigan assistant professor whose work sits at the intersection of 3D computer vision, computer graphics, and generative models, highlighted AFUN in a thread on X on June 22. The underlying arXiv preprint was submitted on June 1, with Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, and Gao listed as authors. Wang's own site identifies him as a University of Michigan PhD student advised by Gao, with prior internships at Hillbot and UC San Diego's SuLab.
The model's premise is straightforward: a robot that can identify a handle still has not solved the task of opening a door. It also needs to infer the physical trajectory that follows contact. AFUN tries to combine those two steps. From an RGB-D observation and a language task description, the model predicts a task-conditional functional segmentation mask, the "where," and a 3D post-contact motion curve, the "how."
That makes the work more specific than the broad wave of vision-language-action research now moving through robotics. AFUN is not presented as a general robot policy that maps camera input directly to motor commands. It is presented as an affordance layer: an interpretable middle step between visual understanding and manipulation planning. In practical terms, that means the model can mark the part of an object that matters for a task and output a motion curve a downstream robot system can use.
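To make that interface concrete, here is a minimal sketch of what a downstream system might receive and how it could consume it. The container and helper names (AffordancePrediction, mask_centroid, curve_to_steps) are hypothetical and do not come from the AFUN repository; the sketch only assumes the two outputs the authors describe, a mask and an ordered 3D curve.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class AffordancePrediction:
    """Hypothetical container; field names are not taken from the AFUN code."""
    mask: List[List[bool]]  # H x W functional mask: the "where"
    curve: List[Point3]     # ordered 3D waypoints after contact: the "how"


def mask_centroid(mask: List[List[bool]]) -> Optional[Tuple[float, float]]:
    """Return the (row, col) centroid of the mask, or None if the mask is empty."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not pixels:
        return None
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def curve_to_steps(curve: List[Point3]) -> List[Point3]:
    """Turn absolute waypoints into successive displacement vectors."""
    return [tuple(b[i] - a[i] for i in range(3)) for a, b in zip(curve, curve[1:])]


# Toy example: a 2 x 2 mask and a three-waypoint curve.
pred = AffordancePrediction(
    mask=[[False, True], [False, True]],
    curve=[(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.1, 0.2, 0.0)],
)
contact_pixel = mask_centroid(pred.mask)
steps = curve_to_steps(pred.curve)
```

A planner could treat the centroid as a candidate contact pixel and the displacement steps as a motion hint, but how AFUN's outputs are actually wired into a controller is not specified in the public materials.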
The bet is a common affordance schema
The research team's central claim is not only about model architecture. It is also about the data pipeline. The AFUN project page says the team built a standardized pipeline that converts robot data, human egocentric data, simulation data, and real-world scan data into a shared affordance schema with language phrases, functional masks, and object-centric 3D motion labels.
That is the part robotics labs and embodied AI startups should read closely. Robotics remains bottlenecked by data that is fragmented across embodiments, sensors, simulators, tasks, and annotation formats. A robot arm demonstration, a human first-person video, and a 3D scan do not naturally yield the same training target. AFUN's approach is to normalize those sources around the object and the functional interaction rather than around the body that performed the action.
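One way to picture "normalizing around the object" is as a change of coordinate frame. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the record type AffordanceSample, its field names, and the normalize helper are hypothetical, and the object pose is assumed to be available as a 3 x 3 rotation and a translation mapping object coordinates to world coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class AffordanceSample:
    """Hypothetical shared record; field names are not from the AFUN release."""
    source: str             # "robot", "human_ego", "simulation", or "scan"
    phrase: str             # language phrase describing the task
    mask: List[List[bool]]  # functional mask over the image
    motion: List[Point3]    # motion waypoints in the object's own frame


def world_to_object(point, rotation, translation) -> Point3:
    """Map a world-frame point into the object frame: R^T (p - t)."""
    d = [point[i] - translation[i] for i in range(3)]
    # Multiplying by R^T means dotting each column of R with d.
    return tuple(sum(rotation[i][j] * d[i] for i in range(3)) for j in range(3))


def normalize(source, phrase, mask, world_motion, rotation, translation):
    """Build a shared-schema record from a world-frame trajectory."""
    motion = [world_to_object(p, rotation, translation) for p in world_motion]
    return AffordanceSample(source, phrase, mask, motion)


# Toy example: the object sits at (1, 0, 0), rotated 90 degrees about z.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
sample = normalize("robot", "open the drawer", [[True]],
                   [(1.0, 2.0, 0.0)], R, (1.0, 0.0, 0.0))
```

Once trajectories are expressed relative to the object, a robot demonstration and a human video of the same drawer pull can in principle produce the same motion label, regardless of which body performed the action.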
The preprint says AFUN outperformed baselines across eight test sets from four affordance benchmarks, improving mean gIoU and cIoU by 23.9 and 26.3 points, respectively. It also reports a 12.7% to 61.3% hit-rate gain over the best baseline for contact-point prediction and the best performance across three 3D motion test sets. Those numbers are reported by the authors in an arXiv preprint, not by an independent benchmark operator.
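For readers unfamiliar with the two mask metrics, the sketch below uses the definitions common in language-guided segmentation work: gIoU as the mean of per-image IoU scores, and cIoU as the cumulative intersection divided by the cumulative union across the test set. Because AFUN's evaluation code has not been released, this is an assumption about the convention rather than the authors' implementation, and the function names are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[bool]]


def iou_counts(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Return (intersection, union) pixel counts for one image."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += bool(p) and bool(g)
            union += bool(p) or bool(g)
    return inter, union


def giou_ciou(preds: List[Mask], gts: List[Mask]) -> Tuple[float, float]:
    """gIoU: mean of per-image IoU. cIoU: summed intersection / summed union."""
    per_image = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_counts(pred, gt)
        # Assumption: an image with empty prediction and empty target scores 1.0.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Toy example: one half-right small mask, one perfect larger mask.
preds = [[[True, True], [False, False]], [[True, True], [True, True]]]
gts = [[[True, False], [False, False]], [[True, True], [True, True]]]
giou, ciou = giou_ciou(preds, gts)  # giou == 0.75, ciou == 5 / 6
```

The two numbers differ because cIoU weights every pixel equally and so favors large objects, while gIoU weights every image equally. That is one reason papers in this area tend to report both.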
The project page also says AFUN can be deployed for real-world robot manipulation without robot-specific finetuning. That is the most commercially resonant claim in the paper because embodiment transfer is one of the expensive problems in robot learning. A model that can generalize across object categories, language instructions, and robot embodiments would reduce the amount of task-specific engineering needed to move from demo to deployment. AFUN's evidence for that claim is still bounded by the authors' reported experiments.
The repo is open, but not fully open yet
The public AFUN GitHub repository is useful, but it is not a complete reproduction package yet. The README says the repo currently provides inference-only code: users can load a trained AFUN checkpoint and run it on their own images. The same README lists training code, dataset, evaluation and benchmark code, and robot demonstration code as TODO items.
That distinction matters. The release gives other researchers a path to test AFUN as a model. It does not yet give them the full machinery to audit the training process, rerun the benchmark, or reproduce the data pipeline that underpins the most important claim. For a robotics model whose pitch depends heavily on standardized, cross-source affordance data, the missing dataset and evaluation code are not housekeeping details. They determine how much of the result the field can stress-test.
The repo is licensed under Apache-2.0 and includes demo tooling. Its README says depth can come from a provided sensor depth map or be estimated from a single RGB image using Depth-Anything-3. It also says the project builds on Qwen3-VL, SAM3, Depth-Anything-3, lingbot-depth, and Sonata. In a typical demo flow, a user provides an RGB image and a text prompt such as "Open the refrigerator door," and AFUN outputs a predicted affordance mask plus a motion trajectory.
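Depth matters because a 2D mask only becomes a 3D target once each pixel has a distance attached. The sketch below shows the standard pinhole back-projection that many RGB-D pipelines use for this step. It is not taken from the AFUN code; the function name and the intrinsics parameters (fx, fy, cx, cy) are hypothetical, and a real system would read the intrinsics from the camera.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def backproject_mask(
    depth: List[List[float]],  # depth in meters, indexed [row][col]
    mask: List[List[bool]],    # same shape as depth
    fx: float, fy: float,      # focal lengths in pixels (assumed known)
    cx: float, cy: float,      # principal point in pixels (assumed known)
) -> List[Point3]:
    """Lift masked pixels with valid depth into camera-frame 3D points."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, on) in enumerate(zip(depth_row, mask_row)):
            if not on or z <= 0.0:
                continue  # skip unmasked pixels and missing depth readings
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy example: a 2 x 2 depth map with one missing reading.
depth = [[1.0, 2.0], [0.0, 2.0]]
mask = [[True, True], [True, False]]
points = backproject_mask(depth, mask, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

With a sensor depth map the z values are measured; with a monocular estimator they are predicted, so any error in the estimate carries directly into the 3D points.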
Why Gao's lab is pushing into robot functionality
The backstory here is a research trajectory rather than a company launch. Gao's lab has been built around 3D representations and generative world models, and Gao writes on his homepage that he is interested in controllable, consistent, interpretable, and scalable world models for realistic interactions with humans and autonomous agents. AFUN pushes that agenda from generating or understanding 3D scenes toward acting inside them.
Wang's background also fits the pivot. His homepage describes an interest in perception and understanding, embodied AI, and world models, with a belief that data-driven computer vision and graphics can bridge the virtual and physical worlds. AFUN is a concrete version of that thesis: use perception and 3D structure to make the function of an object legible enough for a robot to act.
The unresolved question is whether AFUN's affordance layer can survive the messiness of deployment outside curated experiments. Household and warehouse manipulation tasks are filled with occlusions, deformable objects, damaged handles, unfamiliar tools, bad lighting, and tasks where the right interaction depends on force as much as geometry. AFUN's current public materials emphasize RGB-D observations, language commands, segmentation masks, and 3D motion curves. They do not establish that these inputs and outputs alone are sufficient for force-sensitive or failure-prone manipulation.
Still, the release lands on a real fault line in robotics. End-to-end robot policies promise simplicity but can be hard to interpret, debug, and transfer. Classical pipelines are more transparent but often brittle. AFUN sits between those poles: it uses foundation-model components and large-scale data, but exposes an intermediate representation that a robot planner can inspect. If the remaining code and data ship, the field will be able to test whether that middle layer is a durable primitive or another promising preprint result that narrows under real-world contact.