Imitation learning, also known as programming by demonstration, is a promising paradigm for intuitive robot programming by non-expert users. However, the classical kinesthetic approach, in which the robot is physically guided by hand, generalizes poorly across robot types and is impractical for demonstrating long-horizon tasks. Visual imitation learning allows a multi-step task to be recorded as a single continuous video, so non-experts can demonstrate tasks naturally. Existing approaches typically require large amounts of data to train end-to-end deep learning models that map raw pixels to robot actions. This paper explores visual imitation learning from a one-shot demonstration, which significantly reduces the data requirement and simplifies the programming process. To this end, a framework is proposed that maps hand trajectories to the robot end-effector, consisting of four essential components: hand detection, object detection, segmentation of the trajectories into elemental skills, and skill learning. Methods are developed for each component and evaluated on recorded videos to demonstrate the effectiveness of the proposed framework.
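The segmentation component of such a pipeline can be illustrated with a minimal sketch. The code below is not the paper's method; it assumes a hypothetical pause-based heuristic, splitting a detected hand trajectory into elemental skills wherever the hand's frame-to-frame displacement drops to near zero. The example trajectory and the threshold value are illustrative placeholders; in the actual framework the positions would come from the hand detector running on the demonstration video.

```python
# Hypothetical 2D hand trajectory (per-frame positions). In a real system
# these points would be produced by a hand detector on each video frame.
trajectory = [
    (0.0, 0.0), (0.1, 0.0), (0.2, 0.0),  # first motion segment
    (0.2, 0.0), (0.2, 0.0),              # pause -> candidate skill boundary
    (0.2, 0.1), (0.2, 0.2),              # second motion segment
]

def segment_into_skills(traj, pause_thresh=1e-3):
    """Split a trajectory into elemental skills at pauses.

    A pause is detected when the displacement between consecutive frames
    falls below `pause_thresh` (an assumed heuristic, not the paper's).
    Returns a list of sub-trajectories, one per elemental skill.
    """
    skills, current = [], [traj[0]]
    for prev, curr in zip(traj, traj[1:]):
        disp = ((curr[0] - prev[0]) ** 2 + (curr[1] - prev[1]) ** 2) ** 0.5
        if disp < pause_thresh and len(current) > 1:
            # Hand stopped moving: close the current skill, start a new one.
            skills.append(current)
            current = [curr]
        else:
            current.append(curr)
    skills.append(current)
    return skills

skills = segment_into_skills(trajectory)
print(f"{len(skills)} elemental skills detected")
```

Each resulting sub-trajectory could then be passed to a skill-learning stage that maps the hand waypoints to end-effector targets.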