Version 1: Received: 14 August 2024 / Approved: 15 August 2024 / Online: 15 August 2024 (06:28:53 CEST)
Version 2: Received: 14 September 2024 / Approved: 15 September 2024 / Online: 16 September 2024 (10:03:54 CEST)
How to cite:
Lu, S.; Haerdtlein, C.; Schilp, J. Visual Imitation Learning from One-Shot Demonstration for Multi-Step Robot Pick-and-Place Tasks. Preprints 2024, 2024081123. https://doi.org/10.20944/preprints202408.1123.v2
APA Style
Lu, S., Haerdtlein, C., & Schilp, J. (2024). Visual Imitation Learning from One-Shot Demonstration for Multi-Step Robot Pick-and-Place Tasks. Preprints. https://doi.org/10.20944/preprints202408.1123.v2
Chicago/Turabian Style
Lu, S., Christian Haerdtlein, and Johannes Schilp. 2024. "Visual Imitation Learning from One-Shot Demonstration for Multi-Step Robot Pick-and-Place Tasks." Preprints. https://doi.org/10.20944/preprints202408.1123.v2
Abstract
Imitation learning, also known as programming by demonstration, has been shown to be a promising paradigm for intuitive robot programming by non-expert users. However, the classical kinesthetic approach with physical hand guidance suffers from poor generalizability across different robot types and is impractical for demonstrating tasks with long horizons. Visual imitation learning enables the recording of multi-step tasks as a single continuous video, allowing non-experts to demonstrate tasks naturally. Existing approaches typically require large amounts of data to train end-to-end deep learning models that map raw pixels to robot actions. This paper explores visual imitation learning from a one-shot demonstration, significantly reducing the data requirements and simplifying the programming process. To this end, a framework is proposed that maps hand trajectories to the robot end-effector, consisting of four essential components: hand detection, object detection, segmentation of the trajectories into elemental skills, and learning of the skills. Methods are developed for each component and evaluated on recorded videos to demonstrate the effectiveness of the proposed framework.
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.