Abstract
Diffusion-based policy learning frameworks excel at learning diverse tasks and achieving high success rates. However, in manufacturing settings, success rate alone is insufficient for real-world deployment: tasks must be executed efficiently, minimizing idle time while maintaining precision. Additionally, in assembly and disassembly settings, a single scene often contains multiple task goals that need to be completed (for example, picking up an engine while simultaneously securing a suspension), which requires the robot to reason over multiple objectives within the same observation space. In human-robot collaboration, enabling humans to specify task preferences is crucial for flexible and intuitive interaction.
In this paper, we address two key challenges: (1) improving task execution efficiency by structuring tasks into distinct sub-task modes via language, and (2) enabling human operators to select tasks using natural language commands. Additionally, we introduce an adaptive parameter selection framework that also adjusts which sensory modalities the policy relies on, depending on the sub-task mode. We evaluate our approach on the NIST Task Board, a representative benchmark of real-world tasks where multiple task goals exist within the same scene. Our method improves execution speed by 57% and shows a 19% improvement in task success rate.
Hardware Setup
System Architecture
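The abstract describes selecting execution parameters and sensory modalities according to the current sub-task mode. The snippet below is only a minimal sketch of that idea, not the implementation used in this work: the mode names (`reach`, `grasp`, `insert`), the `ModeConfig` fields, the sensor names, and all numeric values are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeConfig:
    """Hypothetical per-mode execution settings (illustrative names and values)."""
    speed_scale: float            # fraction of maximum end-effector speed
    action_horizon: int           # predicted actions executed per policy call
    modalities: tuple[str, ...]   # sensor streams the policy conditions on


# Assumed sub-task modes; the actual modes used by the method are not listed here.
MODE_CONFIGS = {
    "reach": ModeConfig(1.0, 16, ("scene_camera",)),
    "grasp": ModeConfig(0.5, 8, ("scene_camera", "wrist_camera")),
    "insert": ModeConfig(0.2, 4, ("wrist_camera", "force_torque")),
}


def select_config(mode: str) -> ModeConfig:
    """Return the settings for a sub-task mode.

    Unknown modes fall back to the slowest (most cautious) configuration.
    """
    cautious = min(MODE_CONFIGS.values(), key=lambda c: c.speed_scale)
    return MODE_CONFIGS.get(mode, cautious)


if __name__ == "__main__":
    for mode in ("reach", "insert", "unknown"):
        print(mode, select_config(mode))
```

In this sketch, free-space motion runs fast with a long action horizon and vision only, while contact-rich phases slow down, replan more often, and add force sensing. The fallback to the most cautious configuration is a design choice of the sketch, not something stated in the source.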

Experiment Setup

Results
Method | D-Sub | USB | BNC | Terminal | Avg. |
---|---|---|---|---|---|
DP-S | 0.80 | 0.75 | 0.65 | 0.80 | 0.76 |
DP-M | 1.00 | 0.95 | 0.70 | 0.85 | 0.88 |
DP-M-AM | 1.00 | 1.00 | 0.85 | 0.95 | 0.95 |
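
The short script below recomputes per-method means and per-task gains from the success rates listed in the table. It is a reader-side check, not part of the evaluation pipeline, and the helper names `mean_success` and `best_task_gain` are made up for this sketch. Recomputed means need not match the Avg. column exactly: the four DP-S entries average 0.75 against the reported 0.76, which may reflect rounding of the per-task entries.

```python
# Per-task success rates copied from the results table.
success = {
    "DP-S": {"D-Sub": 0.80, "USB": 0.75, "BNC": 0.65, "Terminal": 0.80},
    "DP-M": {"D-Sub": 1.00, "USB": 0.95, "BNC": 0.70, "Terminal": 0.85},
    "DP-M-AM": {"D-Sub": 1.00, "USB": 1.00, "BNC": 0.85, "Terminal": 0.95},
}


def mean_success(method: str) -> float:
    """Unweighted mean of the four per-task success rates for one method."""
    rates = success[method].values()
    return sum(rates) / len(rates)


def best_task_gain(baseline: str, improved: str) -> tuple[str, float]:
    """Task with the largest absolute gain of `improved` over `baseline`."""
    gains = {
        task: success[improved][task] - success[baseline][task]
        for task in success[baseline]
    }
    task = max(gains, key=gains.get)
    return task, gains[task]


if __name__ == "__main__":
    for method in success:
        print(f"{method}: mean success {mean_success(method):.3f}")
    task, gain = best_task_gain("DP-S", "DP-M-AM")
    print(f"Largest gain over DP-S: {task} (+{gain:.2f})")
```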

Comparison of the average end-to-end task completion time for DP-S and DP-M-AM.

Comparison of the average time taken for mode changes within sub-tasks for DP-S and DP-M-AM.